System ID  U0026-2108201815002200
Title (Chinese)  使用深度神經網路考量口語與非口語之韻律短語語音情緒辨識
Title (English)  Prosodic Phrase-Based Speech Emotion Recognition Using Deep Neural Network Considering Verbal and Non-verbal Speech Signals
University  National Cheng Kung University (成功大學)
Department (Chinese)  資訊工程學系
Department (English)  Institute of Computer Science and Information Engineering
Academic Year  106
Semester  2
Publication Year  107
Author (Chinese)  陳毅軒
Author (English)  Yi-Hsuan Chen
Student ID  P76054509
Degree  Master's
Language  English
Number of Pages  46
Committee  Advisor - 吳宗憲
Committee Member - 王新民
Committee Member - 王駿發
Committee Member - 戴顯權
Committee Member - 陳嘉平
Keywords (Chinese)  語音情緒辨識  韻律短語  非口語音段  卷積神經網路  長短期記憶模型  序列對序列模型
Keywords (English)  Speech emotion recognition  Prosodic phrase  Non-verbal segment  Convolutional neural network  Long short-term memory  Sequence-to-sequence model
Subject Classification
Abstract (Chinese)  Speech emotion recognition has become increasingly important with the spread of intelligent services such as chatbots, assistance for mental illness diagnosis, sales, care, and entertainment. In human-machine communication, emotion recognition and sentiment analysis can enhance the interaction between machines and people. When the human brain discriminates another person's emotions, it perceives both verbal and non-verbal vocal expressions to achieve clearer discrimination. To the best of our knowledge, no existing speech emotion recognition mechanism places particular emphasis on naturally occurring non-verbal emotional expressions such as laughter and crying.
This thesis focuses on the emotion expressions that arise naturally in human-to-human conversational speech and on how emotion changes within a dialogue. Following the way the human brain discriminates others' emotions, it analyzes the features of both the verbal and non-verbal segments in speech to support emotion recognition. For this purpose, the NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus (NNIME) is adopted. NNIME is a spontaneous emotion corpus recorded under assigned scenarios without scripted utterances, and it contains many of the non-verbal sound segments typical of natural emotional dialogue, such as laughter, crying, and breathy voice.
In this thesis, the 101 dialogue sessions of the NNIME database are re-segmented to obtain 4,766 single-speaker dialogue turns. Through a support vector machine and a prosodic phrase auto-tagger, each turn is converted into a segment sequence composed of non-verbal segments, prosodic phrases, and silence intervals. Each sequence is then fed into trained convolutional neural networks to extract the emotion features and sound features of every segment. The features of each segment are represented as vectors and passed to an attention-equipped, LSTM-based sequence-to-sequence model for segment-level emotion recognition. For each input speaker turn, the model outputs an emotion tag sequence whose length equals the number of segments.
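As a rough Python sketch of the segmentation stage described above, the code below splits a turn on silence and labels each remaining interval as verbal or non-verbal with an SVM. The feature set (MFCC statistics plus zero-crossing rate), the 30 dB silence threshold, and all helper names are illustrative assumptions, not the discriminator actually trained in the thesis; the further division of verbal intervals into prosodic phrases by the PPh auto-tagger is omitted here.

# Illustrative sketch only: energy-based silence splitting plus an SVM
# verbal/non-verbal discriminator over simple per-segment features.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def segment_features(y, sr):
    # Per-segment descriptor: MFCC mean/std plus mean zero-crossing rate.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    zcr = librosa.feature.zero_crossing_rate(y)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1), [zcr.mean()]])

def train_discriminator(segments, labels, sr):
    # labels: 0 = verbal, 1 = non-verbal (e.g. laughter, sobbing),
    # assumed to come from the re-annotated training set.
    X = np.stack([segment_features(seg, sr) for seg in segments])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, labels)
    return clf

def segment_turn(y, sr, clf, top_db=30):
    # Decompose one single-speaker turn into a labelled segment sequence;
    # silence intervals are the gaps between consecutive non-silent spans.
    sequence = []
    for start, end in librosa.effects.split(y, top_db=top_db):
        kind = clf.predict(segment_features(y[start:end], sr)[None, :])[0]
        sequence.append(("non-verbal" if kind == 1 else "verbal", start, end))
    return sequence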
The experimental results show that, under natural emotion expression, considering non-verbal features and sound-type features improves segment-level emotion recognition; we expect that stronger speech emotion recognition can make machines more human-like.
Abstract (English)  Speech emotion recognition is increasingly important for many applications, such as chatbots, mental illness diagnosis assistance, smart health care, sales advertising, smart entertainment, and other smart services. In human-machine communication, emotion recognition and sentiment analysis can enhance the interaction between people and devices. When people recognize others' emotions, the brain draws on both verbal and non-verbal vocal expression, which makes the emotional differences in a voice easier to discriminate. To the best of our knowledge, no existing emotion recognition system considers laughter, cries, or the other emotional interjections that naturally occur in everyday speech when we express emotion.
This thesis observes spontaneous emotion expression and emotion change within a single turn of daily dialogue. Considering how the human brain discriminates others' emotions, we extract features from both the verbal and non-verbal parts of speech for emotion recognition. For these purposes, the thesis chooses a spontaneous speech emotion corpus, NNIME (the NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus), which contains various emotional non-verbal sounds in speech, such as laughter, sobbing, and sighs.
In total, 4,766 single-speaker dialogue turns were produced by re-segmenting the audio data of the 101 NNIME sessions. To reconstruct each turn as a sequence of silence intervals, prosodic phrases, and non-verbal sounds, an SVM-based verbal/non-verbal discriminator is developed and a prosodic phrase (PPh) auto-tagger is applied. These segments are then used as training data for emotion/sound feature extraction based on convolutional neural networks (CNNs). Finally, every turn is represented as a sequence of emotion/sound feature vectors and fed into an attentive LSTM-based sequence-to-sequence model, which outputs an emotion tag sequence as the recognition result for the given turn.
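As a minimal sketch of the recognition stage, the PyTorch code below implements an attentive LSTM-based encoder-decoder that maps one turn's sequence of segment feature vectors to an equally long emotion tag sequence. The 128-dimensional features, hidden size, five emotion classes, and the particular attention scoring are assumptions chosen for illustration rather than the configuration reported in the thesis.

# Illustrative sketch only: attentive LSTM sequence-to-sequence emotion tagger.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveSeq2SeqTagger(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, n_emotions=5):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder_cell = nn.LSTMCell(feat_dim + hidden, hidden)
        self.attn = nn.Linear(hidden * 2, 1)          # attention scoring
        self.out = nn.Linear(hidden * 2, n_emotions)  # decoder state + context

    def forward(self, segments):                      # (B, T, feat_dim)
        enc_out, (h, c) = self.encoder(segments)      # (B, T, hidden)
        h, c = h.squeeze(0), c.squeeze(0)
        logits = []
        for t in range(segments.size(1)):
            # Attend over all encoder states given the current decoder state.
            scores = self.attn(torch.cat(
                [enc_out, h.unsqueeze(1).expand_as(enc_out)], dim=-1)).squeeze(-1)
            weights = F.softmax(scores, dim=-1)                    # (B, T)
            context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)
            h, c = self.decoder_cell(
                torch.cat([segments[:, t], context], dim=-1), (h, c))
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)             # (B, T, n_emotions)

# Usage: a batch of 2 turns, each with 6 segments of 128-dim feature vectors.
model = AttentiveSeq2SeqTagger()
turns = torch.randn(2, 6, 128)
emotion_tags = model(turns).argmax(dim=-1)            # (2, 6) emotion indices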
According to the experimental results, considering non-verbal segments and sound features improves emotion recognition of spontaneous speech, and the improved performance can in turn benefit human-machine interaction.
Table of Contents  Abstract (Chinese) I
Abstract III
Acknowledgements V
Contents VII
List of Tables X
List of Figures XII
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 1
1.3 Literature Review 2
1.3.1 Emotion Database 2
1.3.2 Speech Emotion Features 3
1.3.3 Speech Emotion Recognition Unit 5
1.3.4 Speech Emotion Recognition Model 5
1.4 Problems and Proposed Methods 6
1.5 Research Framework 8
Chapter 2 Corpus and Annotation 9
2.1 Original Annotation of NNIME 9
2.1.1 NNIME Database 9
2.1.2 Recording Setup 9
2.2 Classification of Emotion and Sound 11
2.3 Re-Annotation of NNIME Corpus 12
2.3.1 Boundary Validation Set 12
2.3.2 System Training Set 15
Chapter 3 Proposed Methods 20
3.1 Segmentation of Input Audio 21
3.1.1 Silence Detection 21
3.1.2 Discrimination of Verbal and Nonverbal Intervals 22
3.1.3 Prosodic Phrases 23
3.2 Feature Extraction 25
3.2.1 Convolutional Neural Network 25
3.2.2 Feature Vectors of Audio Segments 27
3.3 Emotion Recognition of Segmented Audio 28
3.3.1 Long Short-Term Memory 28
3.3.2 Attention Mechanism 30
3.3.3 Sequence-to-Sequence Emotion Recognition 31
Chapter 4 Experimental Results 33
4.1 Boundary Validation of Verbal and Nonverbal Intervals 34
4.2 Evaluation of Feature Extraction 35
4.2.1 Sound Features Extracted from Nonverbal Segments 35
4.2.2 Emotion Features Extracted from Verbal Segments 37
4.3 Evaluation of Emotion Recognition 38
4.3.1 Effect of Nonverbal Segments and Sound Features 39
4.3.2 Comparison with Traditional Emotion Recognition Methods 40
Chapter 5 Conclusion and Future Work 42
References 43

References
[1] S. Blanton, "The voice and the emotions," Quarterly Journal of Speech, vol. 1, no. 2, p. 154-172, 1915.
[2] B. W. Schuller, "Speech Emotion Recognition: Two Decades in a Nutshell, Benchmarks, and Ongoing Trends," Communications of the ACM, vol. 61, no. 5, p. 90-99, 2018.
[3] S. Lugović, I. Dunđer, and M. Horvat, "Techniques and applications of emotion recognition in speech," in 2016 39th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2016, p. 1278-1283.
[4] X. Zhang, Y. Sun, and D. Shufei, "Progress in speech emotion recognition," in TENCON 2015 - 2015 IEEE Region 10 Conference, 2015, p. 1-6.
[5] N. Campbell, "On the Use of NonVerbal Speech Sounds in Human Communication," in Verbal and Nonverbal Communication Behaviours, Berlin, Heidelberg, 2007, p. 117-128: Springer Berlin Heidelberg.
[6] A. Schirmer and T. C. Gunter, "Temporal signatures of processing voiceness and emotion in sound," Social Cognitive and Affective Neuroscience, vol. 12, no. 6, p. 902-909, 2017.
[7] H. C. Chou, W. C. Lin, L. C. Chang, C. C. Li, H. P. Ma, and C. C. Lee, "NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus," in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, p. 292-298.
[8] I. S. Engberg, A. V. Hansen, O. Andersen, and P. Dalsgaard, "Design, recording and verification of a Danish emotional speech database," in Fifth European Conference on Speech Communication and Technology, 1997.
[9] E. Douglas-Cowie, R. Cowie, and M. Schröder, "A new emotion database: considerations, sources and scope," in ISCA tutorial and research workshop (ITRW) on speech and emotion, 2000.
[10] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Ninth European Conference on Speech Communication and Technology, 2005.
[11] C. Busso et al., "IEMOCAP: Interactive emotional dyadic motion capture database," Language resources and evaluation, vol. 42, no. 4, p. 335, 2008.
[12] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, "Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions," in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, 2013, p. 1-8: IEEE.
[13] Y. Li, J. Tao, L. Chao, W. Bao, and Y. Liu, "CHEAVD: a Chinese natural emotional audio–visual database," Journal of Ambient Intelligence and Humanized Computing, vol. 8, no. 6, p. 913-924, 2017.
[14] E. Tzinis and A. Potamianos, "Segment-based speech emotion recognition using recurrent neural networks," in Affective Computing and Intelligent Interaction (ACII), 2017 Seventh International Conference on, 2017, p. 190-195: IEEE.
[15] K. S. Rao, S. G. Koolagudi, and R. R. Vempada, "Emotion recognition from speech using global and local prosodic features," International journal of speech technology, vol. 16, no. 2, p. 143-160, 2013.
[16] H. Cao, S. Benus, R. Gur, R. Verma, and A. Nenkova, "Prosodic cues for emotion: analysis with discrete characterization of intonation," Speech prosody 2014, 2014.
[17] N. Anand and P. Verma, "Convoluted feelings: convolutional and recurrent nets for detecting emotion from audio data," Technical Report, Stanford University, 2015.
[18] G. Trigeorgis et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, p. 5200-5204.
[19] L. Zhu, L. Chen, D. Zhao, J. Zhou, and W. Zhang, "Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN," Sensors (Basel, Switzerland), vol. 17, no. 7, p. 1694, 2017.
[20] S. Kim and M. L. Seltzer, "Towards Language-Universal End-to-End Speech Recognition," eprint arXiv:1711.02207, 2017.
[21] M. Neumann and N. T. Vu, "Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech," in Proc. INTERSPEECH, 2017.
[22] F. B. Pokorny, F. Graf, F. Pernkopf, and B. W. Schuller, "Detection of negative emotions in speech signals using bags-of-audio-words," in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), 2015, p. 879-884.
[23] C.-W. Huang and S. S. Narayanan, "Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition," in Multimedia and Expo (ICME), 2017 IEEE International Conference on, 2017, p. 583-588: IEEE.
[24] C.-y. Tseng, S.-h. Pin, and Y.-l. Lee, "Speech prosody: issues, approaches and implications," From Traditional Phonology to Mandarin Speech Processing, Foreign Language Teaching and Research Process, p. 417-438, 2004.
[25] P. Boersma and D. Weenink, "Praat: doing phonetics by computer [Computer program], Version 6.0.40."
[26] M. Domínguez Bajo, M. Farrús, and L. Wanner, "An automatic prosody tagger for spontaneous speech," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 2016, p. 377-387: COLING.
[27] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE -- the Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, p. 1459-1462.
[28] E. Bozkurt, E. Erzin, Ç. E. Erdem, and A. T. Erdem, "INTERSPEECH 2009 Emotion Recognition Challenge evaluation," in 2010 IEEE 18th Signal Processing and Communications Applications Conference, 2010, p. 216-219.
[29] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 1-27, 2011.
[30] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," The Journal of Physiology, vol. 160, no. 1, p. 106-154.2, 1962.
[31] K. Fukushima and S. Miyake, "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition," in Competition and Cooperation in Neural Nets, Berlin, Heidelberg, 1982, p. 267-285: Springer Berlin Heidelberg.
[32] S.-C. B. Lo, H.-P. Chan, J.-S. Lin, H. Li, M. T. Freedman, and S. K. Mun, "Artificial convolution neural network for medical image pattern recognition," Neural Networks, vol. 8, no. 7, p. 1201-1214, 1995.
[33] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, p. 2278-2324, 1998.
[34] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, p. 1735-1780, 1997.
[35] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to Sequence Learning with Neural Networks," arXiv preprint arXiv:1409.3215, 2014. Available: https://ui.adsabs.harvard.edu/#abs/2014arXiv1409.3215S
[36] D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," eprint arXiv:1409.0473, 2014.
[37] 陳垂康, "應用具語句關注之連續對話狀態追蹤與強化學習之面試訓練系統," Master's thesis, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, 2017.
[38] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," presented at the Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, Pennsylvania, 2002.

Full-Text Usage Rights
  • On-campus browsing and printing of the electronic full text is authorized, available from 2020-08-25.

