Prosodic Phrase-Based Speech Emotion Recognition Using Deep Neural Network Considering Verbal and Non-verbal Speech Signals
Institute of Computer Science and Information Engineering
Speech emotion recognition
Convolutional Neural Network
Long short-term memory
This thesis focuses on the emotion expressions that arise naturally in human-to-human conversational speech and on how emotion changes within a dialogue. Inspired by how the human brain discriminates others' emotions, we analyze features of both the verbal and non-verbal segments of speech to assist recognition. For this purpose, we adopt the NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus (NNIME), a spontaneous emotion corpus recorded under designated scenarios without scripted utterances, which contains many of the non-verbal sound segments that occur in natural emotional dialogue, such as laughter, sobbing, and breathy voice.
Speech emotion recognition is increasingly important for many applications, such as chatbots, mental-health diagnosis assistance, smart health care, sales advertising, smart entertainment, and other smart services. In human-machine communication, emotion recognition and sentiment analysis can enhance the interaction between people and devices. When people recognize others' emotions, their brains process vocal representation and emotionality independently, and thus perceive emotional differences in the voice more clearly. To the best of our knowledge, no existing emotion recognition system considers the laughter, cries, or other emotional interjections that naturally occur in daily speech when we express our emotions.
This thesis observes spontaneous emotion expression and emotion change within a single turn of daily dialogue. Considering how the human brain discriminates others' emotions, we extract features from both the verbal and non-verbal parts of speech for emotion recognition. For these purposes, the thesis uses a spontaneous speech emotion corpus, NNIME (the NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus), which contains various emotional non-verbal sounds in speech, such as laughter, sobbing, and sighs.
In total, 4766 single-speaker turns in dialogue were produced from the segments of the audio data of the 101 NNIME sessions. To reconstruct each turn as a sequence of silence intervals, prosodic phrases, and non-verbal sounds, an SVM-based verbal/non-verbal discriminator is developed and a Prosodic Phrase (PPh) auto-tagger is applied. These segments then serve as training data for emotion/sound feature extraction based on convolutional neural networks (CNNs). Finally, every turn is represented as a sequence of emotion/sound feature vectors and fed into an attentive LSTM-based sequence-to-sequence model, which outputs an emotion tag sequence as the recognition result for the given turn.
The experimental results show that improved emotion recognition of spontaneous speech can benefit human-machine interaction.
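The segmentation step described above can be illustrated with a toy sketch. This is a minimal sketch under stated assumptions: the energy threshold, frame length, and two-way silence/speech labeling are illustrative only, since the thesis itself uses an SVM-based verbal/non-verbal discriminator and a PPh auto-tagger rather than a simple energy gate.

```python
# Hypothetical sketch of the pipeline's first stage: grouping frames of a
# single turn into labeled intervals. All parameters here are assumptions
# for illustration, not the thesis's actual settings.
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str    # "silence" or "speech" (the thesis further splits speech
                 # into prosodic phrases and non-verbal sounds)
    start: float  # seconds
    end: float    # seconds


def segment_turn(frame_energies: List[float], frame_len: float = 0.01,
                 silence_thr: float = 0.1) -> List[Segment]:
    """Toy silence detector: label each frame by an energy threshold and
    merge consecutive frames with the same label into one interval."""
    segments: List[Segment] = []
    for i, energy in enumerate(frame_energies):
        kind = "silence" if energy < silence_thr else "speech"
        t0, t1 = i * frame_len, (i + 1) * frame_len
        if segments and segments[-1].kind == kind:
            segments[-1].end = t1   # extend the current interval
        else:
            segments.append(Segment(kind, t0, t1))
    return segments
```

In the full system, each resulting speech interval would then be passed to the verbal/non-verbal discriminator, encoded by a CNN into an emotion/sound feature vector, and the vector sequence consumed by the attentive sequence-to-sequence model.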
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Literature Review
1.3.1 Emotion Database
1.3.2 Speech Emotion Features
1.3.3 Speech Emotion Recognition Unit
1.3.4 Speech Emotion Recognition Model
1.4 Problems and Proposed Methods
1.5 Research Framework
Chapter 2 Corpus and Annotation
2.1 Original Annotation of NNIME
2.1.1 NNIME Database
2.1.2 Recording Setup
2.2 Classification of Emotion and Sound
2.3 Re-Annotation of NNIME Corpus
2.3.1 Boundary Validation Set
2.3.2 System Training Set
Chapter 3 Proposed Methods
3.1 Segmentation of Input Audio
3.1.1 Silence Detection
3.1.2 Discrimination of Verbal and Nonverbal Intervals
3.1.3 Prosodic Phrases
3.2 Feature Extraction
3.2.1 Convolutional Neural Network
3.2.2 Feature Vectors of Audio Segments
3.3 Emotion Recognition of Segmented Audio
3.3.1 Long Short-Term Memory
3.3.2 Attention Mechanism
3.3.3 Sequence-to-Sequence Emotion Recognition
Chapter 4 Experimental Results
4.1 Boundary Validation of Verbal and Nonverbal Intervals
4.2 Evaluation of Feature Extraction
4.2.1 Sound Features Extracted from Nonverbal Segments
4.2.2 Emotion Features Extracted from Verbal Segments
4.3 Evaluation of Emotion Recognition
4.3.1 Effect of Nonverbal Segments and Sound Features
4.3.2 Comparison with Traditional Emotion Recognition Methods
Chapter 5 Conclusion and Future Work
S. Blanton, "The voice and the emotions," Quarterly Journal of Speech, vol. 1, no. 2, pp. 154-172, 1915.
B. W. Schuller, "Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61, no. 5, pp. 90-99, May 2018.
S. Lugović, I. Dunđer, and M. Horvat, "Techniques and applications of emotion recognition in speech," in 2016 39th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2016, pp. 1278-1283.
X. Zhang, Y. Sun, and S. Duan, "Progress in speech emotion recognition," in TENCON 2015 - 2015 IEEE Region 10 Conference, 2015, pp. 1-6.
N. Campbell, "On the use of nonverbal speech sounds in human communication," in Verbal and Nonverbal Communication Behaviours, Berlin, Heidelberg: Springer, 2007, pp. 117-128.
A. Schirmer and T. C. Gunter, "Temporal signatures of processing voiceness and emotion in sound," Social Cognitive and Affective Neuroscience, vol. 12, no. 6, pp. 902-909, 2017.
H. C. Chou, W. C. Lin, L. C. Chang, C. C. Li, H. P. Ma, and C. C. Lee, "NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus," in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 292-298.
I. S. Engberg, A. V. Hansen, O. Andersen, and P. Dalsgaard, "Design, recording and verification of a Danish emotional speech database," in Fifth European Conference on Speech Communication and Technology, 1997.
E. Douglas-Cowie, R. Cowie, and M. Schröder, "A new emotion database: considerations, sources and scope," in ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000.
F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Ninth European Conference on Speech Communication and Technology, 2005.
C. Busso et al., "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, "Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions," in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013, pp. 1-8.
Y. Li, J. Tao, L. Chao, W. Bao, and Y. Liu, "CHEAVD: a Chinese natural emotional audio-visual database," Journal of Ambient Intelligence and Humanized Computing, vol. 8, no. 6, pp. 913-924, 2017.
E. Tzinis and A. Potamianos, "Segment-based speech emotion recognition using recurrent neural networks," in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 190-195.
K. S. Rao, S. G. Koolagudi, and R. R. Vempada, "Emotion recognition from speech using global and local prosodic features," International Journal of Speech Technology, vol. 16, no. 2, pp. 143-160, 2013.
H. Cao, S. Benus, R. Gur, R. Verma, and A. Nenkova, "Prosodic cues for emotion: analysis with discrete characterization of intonation," in Speech Prosody 2014, 2014.
N. Anand and P. Verma, "Convoluted feelings: convolutional and recurrent nets for detecting emotion from audio data," Technical Report, Stanford University, 2015.
G. Trigeorgis et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5200-5204.
L. Zhu, L. Chen, D. Zhao, J. Zhou, and W. Zhang, "Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN," Sensors, vol. 17, no. 7, p. 1694, 2017.
S. Kim and M. L. Seltzer, "Towards language-universal end-to-end speech recognition," arXiv preprint arXiv:1711.02207, 2017.
M. Neumann and N. T. Vu, "Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech," in Proc. Interspeech, 2017.
F. B. Pokorny, F. Graf, F. Pernkopf, and B. W. Schuller, "Detection of negative emotions in speech signals using bags-of-audio-words," in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), 2015, pp. 879-884.
C.-W. Huang and S. S. Narayanan, "Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition," in 2017 IEEE International Conference on Multimedia and Expo (ICME), 2017, pp. 583-588.
C.-y. Tseng, S.-h. Pin, and Y.-l. Lee, "Speech prosody: issues, approaches and implications," in From Traditional Phonology to Mandarin Speech Processing, Foreign Language Teaching and Research Press, 2004, pp. 417-438.
P. Boersma and D. Weenink, "Praat: doing phonetics by computer [Computer program], version 6.0.40."
M. Domínguez Bajo, M. Farrús, and L. Wanner, "An automatic prosody tagger for spontaneous speech," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 2016, pp. 377-387.
F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE -- the Munich versatile and fast open-source audio feature extractor," 2010, pp. 1459-1462.
E. Bozkurt, E. Erzin, Ç. E. Erdem, and A. T. Erdem, "INTERSPEECH 2009 Emotion Recognition Challenge evaluation," in 2010 IEEE 18th Signal Processing and Communications Applications Conference, 2010, pp. 216-219.
C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1-27, 2011.
D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," The Journal of Physiology, vol. 160, no. 1, pp. 106-154, 1962.
K. Fukushima and S. Miyake, "Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition," in Competition and Cooperation in Neural Nets, Berlin, Heidelberg: Springer, 1982, pp. 267-285.
S.-C. B. Lo, H.-P. Chan, J.-S. Lin, H. Li, M. T. Freedman, and S. K. Mun, "Artificial convolution neural network for medical image pattern recognition," Neural Networks, vol. 8, no. 7, pp. 1201-1214, 1995.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," arXiv preprint arXiv:1409.3215, 2014. Available: https://ui.adsabs.harvard.edu/#abs/2014arXiv1409.3215S
D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
陳垂康, "An interview training system using sentence-attentive continuous dialogue state tracking and reinforcement learning" (in Chinese), Master's thesis, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, 2017.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, 2002.