Record ID U0026-1908201323125300
Title (Chinese) 應用多重時間層級單元與階層關聯模型於語音情緒辨識
Title (English) Recognition of Emotions in Speech Using Multi-level Units and Hierarchical Correlation Models
University National Cheng Kung University (成功大學)
Department (Chinese) 資訊工程學系碩博士班
Department (English) Institute of Computer Science and Information Engineering
Academic Year 101 (2012–2013)
Semester 2
Year of Publication 102 (2013)
Author (Chinese) 鄭冠群
Author (English) Kuan-Chun Cheng
Student ID p76004211
Degree Master's
Language English
Pages 58
Committee Member 陳有圳
Keywords (Chinese) 語音情緒辨識、多重時間層級單元、階層關聯模型
Keywords (English) speech emotion recognition; multi-level units; hierarchical correlation models
Abstract (Chinese) In recent years, with the development and progress of computer technology, realizing emotional intelligence in computers has become a key to natural and effective human-computer interaction. Speech is the most widespread means of human communication, and much emotional information is conveyed through it; speech emotion recognition is therefore one of the most critical techniques for achieving affective intelligence. This thesis proposes a speech emotion recognition system that combines units at multiple temporal levels. First, the Canny edge detection algorithm is applied to automatically locate boundaries in the spectrum, and these boundaries determine segments of an utterance whose features behave consistently. For each utterance, we identify its basic units, sub-emotion units, and emotion units, which exhibit consistent behavior in spectral energy, prosodic features, and emotion profiles, respectively. Because different temporal levels provide different kinds of speech information, we further propose a hierarchical correlation model to combine information across temporal levels effectively. The model consists of a model for each single level together with correlation information between levels; the recognition result of a single-level model is obtained by averaging the emotion profiles of that level's units. Vector quantization is used to encode the emotions of all units, and the correlation information is computed from statistics of the emotion codes across temporal levels. Finally, the single-level recognition results and the inter-level correlation information are combined to yield the final decision. In the experiments, we used a German emotional speech corpus with seven emotion categories. Compared with recent approaches, the proposed method achieves an accuracy of 71.59% in speech emotion recognition; with speaker normalization added, it reaches 83.55% on six emotion classes.
Abstract (English) In recent years, with the development of affective computing, emotion recognition has become a critical topic in creating intelligent human-computer interfaces. Speech is one of the most efficient channels of human-human communication; to let machines communicate with humans more effectively, understanding the information speech carries, such as emotions and intentions, is essential. This thesis focuses on techniques for detecting emotions in speech.
This thesis proposes an approach to speech emotion recognition using multi-level temporal information. Multi-level Unit Chunking is first employed to segment emotional units at different temporal levels, and the Hierarchical Correlation Model then integrates the information from those units. For the Multi-level Unit Chunking, the Canny edge detection algorithm is employed to locate boundaries of change and yield the emotional units automatically. Three types of chunking units are determined for each utterance: basic units, sub-emotion units, and emotion units, within which consistent properties are shown in terms of spectral energy, prosodic features, and emotion profiles, respectively.
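The chunking step described above can be illustrated with a minimal one-dimensional sketch: a Canny-style detector (Gaussian smoothing followed by picking gradient peaks) applied to a frame-level energy contour. The function name, parameter values, and toy contour below are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def detect_boundaries(contour, sigma=2.0, threshold=0.5):
    """Canny-style boundary detection on a 1-D feature contour
    (e.g. frame-level spectral energy): Gaussian smoothing followed
    by picking local maxima of the gradient magnitude."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    # Edge-pad before smoothing to avoid spurious gradients at the ends
    padded = np.pad(np.asarray(contour, dtype=float), radius, mode="edge")
    smoothed = np.convolve(padded, kernel, mode="same")[radius:-radius]
    grad = np.abs(np.gradient(smoothed))
    thresh = threshold * grad.max()
    # Keep gradient peaks above the threshold as unit boundaries
    return [i for i in range(1, len(grad) - 1)
            if grad[i] >= thresh
            and grad[i] >= grad[i - 1] and grad[i] > grad[i + 1]]

# Toy energy contour: two steady plateaus with one abrupt change
contour = np.concatenate([np.full(50, 0.2), np.full(50, 0.9)])
print(detect_boundaries(contour))  # one boundary near frame 50
```

In the full system the same boundary set would be computed on different features (spectral energy, prosody, emotion profiles) to yield the three unit levels.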
After locating the units at each level, a Hierarchical Correlation Model is proposed to model the hierarchical utterance structure. For each unit, static features are extracted and converted to an emotion profile vector, which serves as its soft emotion label. A model for each single segmentation level is trained on the emotion profile vectors, each weighted by the duration of its corresponding unit. To measure the correlation between units, vector quantization is applied using the k-means clustering algorithm; each unit is quantized to its closest cluster. The correlation is then estimated statistically and fused with the results of the single temporal-level models, and the final decision for the utterance is made by choosing the emotion with the highest score.
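A minimal sketch of this scoring pipeline, under simplifying assumptions: three emotion classes instead of EMO-DB's seven, identity vectors standing in for pre-trained k-means centroids, and a hand-picked correlation score and fusion weight. All names and numeric values are illustrative, not the thesis's actual configuration.

```python
import numpy as np

# Toy 3-class setup; the thesis uses EMO-DB's seven emotion categories.
EMOTIONS = ["anger", "neutral", "sadness"]

def level_score(profiles, durations):
    """Single-level result: duration-weighted average of the level's
    unit emotion-profile vectors (one soft label per unit)."""
    p = np.asarray(profiles, dtype=float)
    w = np.asarray(durations, dtype=float)
    return (w[:, None] * p).sum(axis=0) / w.sum()

def quantize(profiles, centroids):
    """Vector quantization: assign each unit's profile to the closest
    centroid (centroids would come from k-means in the full system)."""
    p = np.asarray(profiles, dtype=float)
    d = np.linalg.norm(p[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def fuse(level_scores, correlation_score, alpha=0.7):
    """Combine the single-level scores with an inter-level correlation
    score; the utterance label is the highest-scoring emotion."""
    fused = (alpha * np.mean(level_scores, axis=0)
             + (1 - alpha) * np.asarray(correlation_score, dtype=float))
    return EMOTIONS[int(fused.argmax())], fused

# Two units: a long mostly-angry unit and a short mostly-neutral one.
profiles = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]]
durations = [0.8, 0.2]                           # seconds
s = level_score(profiles, durations)             # -> [0.60, 0.28, 0.12]
codes = quantize(profiles, np.eye(3))            # pure-emotion centroids, for illustration
level_scores = [s, np.array([0.5, 0.35, 0.15])]  # scores from two temporal levels
corr = np.array([0.4, 0.4, 0.2])                 # hypothetical correlation score
label, fused = fuse(level_scores, corr)
print(label, codes)
```

The quantized codes are what the correlation statistics would be counted over; here they simply show each unit snapping to its nearest cluster.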
The proposed approach was evaluated on the Berlin Emotional Speech Database (EMO-DB). The recognition results show that the proposed system achieved 71.69% accuracy, outperforming previous approaches; with speaker normalization, performance reaches 83.55% accuracy on six-class emotion recognition.
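The gain reported for speaker normalization can be illustrated with per-speaker z-normalization, a common scheme for removing speaker-specific offsets from acoustic features; whether the thesis uses exactly this scheme is an assumption, and the feature values below are toy numbers.

```python
import numpy as np

def speaker_znorm(features, speaker_ids):
    """Per-speaker z-normalization: subtract each speaker's mean and
    divide by their standard deviation, so that speaker-specific offsets
    (e.g. habitual pitch level) no longer mask emotional variation."""
    features = np.asarray(features, dtype=float)
    ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(ids):
        mask = ids == spk
        mu = features[mask].mean(axis=0)
        sd = features[mask].std(axis=0)
        sd[sd == 0] = 1.0  # guard against constant dimensions
        out[mask] = (features[mask] - mu) / sd
    return out

# Toy mean-F0 values: speaker B simply speaks higher than speaker A
feats = np.array([[100.0], [120.0], [200.0], [240.0]])
spk = ["A", "A", "B", "B"]
print(speaker_znorm(feats, spk).ravel())  # each speaker mapped to -1 and +1
```

After normalization both speakers occupy the same scale, so a classifier sees relative (emotion-driven) deviations rather than absolute speaker differences.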
Table of Contents
Abstract (Chinese) I
Abstract II
Acknowledgements IV
Chapter 1 Introduction 1
1.1 Background 1
1.2 Related Work 1
1.3 Main Contribution 6
1.4 Thesis Organization 7
Chapter 2 System Overview 8
2.1 System Architecture 8
Chapter 3 Feature Extraction 10
3.1 Acoustic Low Level Descriptors 10
3.2 Intonation 10
3.2.1 Intensity 11
3.2.2 Zero Crossing Rate 11
3.2.3 Voice Quality 11
3.2.4 Cepstrum 12
3.3 Functionals 12
Chapter 4 Multilevel Unit Chunking 14
4.1 Edge-detection Based Segmentation 16
4.1.1 Canny Algorithm 16
4.1.2 Basic Unit Chunking 19
4.1.3 Sub-emotion Unit Chunking 21
4.1.4 Emotion Unit Chunking 23
4.2 Support Vector Machine (SVM) 26
Chapter 5 Hierarchical Correlation Model 28
5.1 Multilevel Joint Probability Estimation 28
5.2 Single Temporal Level Model 30
5.3 Correlation Estimation 32
5.3.1 Cluster-based Vector Quantization 33
5.3.2 Hierarchical Graph Structure 33
Chapter 6 Experiments 36
6.1 Emotional Databases 36
6.2 Tools and Experimental Setup 39
6.3 Performance Measurement 40
6.4 Evaluation on Temporal Levels 40
6.4.1 Dynamic Modeling Approach 41
6.4.2 Static Modeling Approach 42
6.4.3 Proposed Unit 44
6.5 Evaluation on Multilevel Modeling 45
6.5.1 Single Temporal Level Model 45
6.5.2 SVM-based Fusion 47
6.5.3 Hierarchical Correlation Model 48
6.6 System Comparison 52
Chapter 7 Conclusion and Future Work 54
7.1 Conclusion 54
7.2 Future Work 54
Reference 55
References
[1] A. Mehrabian, "Communication without words," Psychology Today, vol. 2, no. 4, pp. 53–56, 1968
[2] A. Mehrabian and S.R. Ferris, "Inference of attitudes from nonverbal communication in two channels," Journal of Consulting Psychology, vol. 31, no. 3, 1967, pp. 248–252
[3] N. Ambady, R. Rosenthal, “Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis”. Psychological Bulletin, 1992, 111, 256-274
[4] D.J. France, R.G. Shiavi, S. Silverman, M. Silverman, and D.M. Wilkes, "Acoustical properties of speech as indicators of depression and suicide risk," IEEE Transactions on Biomedical Engineering, 2000, pp. 829–837
[5] J. Ma, H. Jin, T. Yang, J. P. Tsai, Ubiquitous Intelligence and Computing, Springer Lecture Note in Computer Science, Vol. LNCS4159, 2006
[6] Z. Zeng, M. Pantic, G.I. Roisman, and T.S. Huang, “A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions”, Proc. Ninth ACM Int'l Conf. Multimodal Interfaces (ICMI '07), 2007, pp. 126-133
[7] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 Emotion Challenge," in Proc. INTERSPEECH 2009, Brighton, UK, 2009, pp. 312–315
[8] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan, "The INTERSPEECH 2010 Paralinguistic Challenge," in Proc. INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 2794–2797
[9] R.E. Thayer, The Biopsychology of Mood and Arousal, New York: Oxford University Press, 1989
[10] P. Ekman, “Basic Emotions,” in Handbook of Cognition and Emotion, T. Dalgleish and M. Power, Eds. Chichester, UK: Wiley, 1999
[11] W. Parrott, Emotions in Social Psychology, Psychology Press, Philadelphia, 2001
[12] R. Plutchik, "The Nature of Emotions," American Scientist, vol. 89, no. 4, pp. 344–350, 2001
[13] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, “A Database of German Emotional Speech”, in Proc. INTERSPEECH 2005, Lisbon, Portugal, 2005, pp. 1517-1520
[14] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, December 2008
[15] S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech, Logos Verlag, Berlin, 2009.
[16] T-L. Pao, Y-T. Chen, J-H. Yeh and J-J. Lu, “Detecting Emotions in Mandarin Speech,” in Proc. ROCLING XVI, Sep. 2004, pp. 365-373.
[17] D. Neiberg, K. Elenius, and K. Laskowski, "Emotion recognition in spontaneous speech using GMMs," in Proc. INTERSPEECH 2006, Pittsburgh, Pennsylvania, 2006
[18] T. Nwe, S. Foo, and L. De Silva, "Speech emotion recognition using hidden markov models," Speech Communication, vol. 41, pp. 603-623, 2003
[19] J-C Lin, C-H Wu, W-L Wei, “Emotion Recognition of Conversational Affective Speech Using Temporal Course Modeling”. In Proc. INTERSPEECH, Lyon, France, 2013
[20] J-C Lin, C-H Wu, W-L Wei, “Error Weighted Semi-Coupled Hidden Markov Model for Audio-Visual Emotion Recognition.” IEEE Transactions on Multimedia 14(1): 142-156, 2012
[21] C-H Wu and W-B Liang, "Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels," IEEE Transactions on Affective Computing, vol. 2, no. 1, 2011, pp. 10–21
[22] W-L Wei, C-H Wu, J-C Lin, H Li, “Interaction Style Detection Based on Cross-Correlation Model in Spoken Conversation”. in Proc. of ICASSP, Vancouver, Canada, 2013
[23] M. El Ayadi, M.S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, pp. 572–587, 2011
[24] B. Schuller and G. Rigoll, "Timing Levels in Segment-Based Speech Emotion Recognition," in Proc. INTERSPEECH 2006, Pittsburgh, Pennsylvania, 2006
[25] B. Schuller and L. Devillers, “Incremental acoustic valence recognition: an inter-corpus perspective on features, matching, and performance in a gating paradigm,” in Proc. INTERSPEECH, Japan, 2010
[26] E. Mower and S. Narayanan, “A hierarchical static-dynamic framework for emotion classification,” in Proc. of ICASSP, Prague, Czech Rep., May 2011
[27] J. H. Jeon, R. Xia, and Y. Liu, “Sentence level emotion recognition based on decisions from subsentnce segments,” in Proc. of ICASSP, Prague, Czech Rep., May 2011
[28] Y. Li and Y. Zhao, "Recognizing emotions in speech using short-term and long-term features," in Proc. Eurospeech, Budapest, 1999
[29] D. Jiang and L. Cai, "Speech Emotion Classification with the Combination of Statistic Features and Temporal Features," in Proc. ICME, 2004
[30] A. Batliner et al., "Segmenting into Adequate Units for Automatic Recognition of Emotion-Related Episodes: A Speech-Based Approach," Advances in Human–Computer Interaction, vol. 2010
[31] D. Bitouk, R. Verma, A. Nenkova, “Class-level spectral features for emotion recognition”. Speech Communication, vol. 52, pp. 613-625, 2010
[32] W. Han and H.-F. Li, "Research on the speech emotion recognition method with prosodic segment level features," 2009
[33] E. Yumoto, W.J. Gould, and T. Baer, "Harmonics-to-noise ratio as an index of the degree of hoarseness," Journal of the Acoustical Society of America, vol. 71, pp. 1544–1550, 1982
[34] F. Eyben, M. Wöllmer, and B. Schuller, "Speech and Music Interpretation by Large-Space Extraction (openSMILE)," 2009, http://sourceforge.net/projects/openSMILE
[35] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge” Speech Communication, vol. 53, no. 9/10, pp. 1062–1087, 2011
[36] Scherer, K.R. (2001). Appraisal considered as a process of multilevel sequential checking. In K.R. Scherer, A. Schorr, & T. Johnstone (Eds.), Appraisal processes in emotion: theory, methods, research (pp. 92-120). New York: Oxford University Press
[37] L. Sanders and D. Poeppel, “Local and global auditory processing: behavioral and erp evidence,” Neuropsychologia, vol. 45, no. 6, pp. 1172–1186, 2007
[38] D. Navon, “Forest before trees: The precedence of global features in visual perception,” Cognitive psychology, vol. 9, no. 3, pp. 353–383, 1977
[39] R. Fernandez, A computational model for the automatic recognition of affect in speech, Ph.D. Thesis, Massachusetts Institute of Technology, February 2004
[40] J. Canny, "A Computational Approach to Edge Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–698, 1986
[41] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, 1995
[42] E. Mower, M. Matari´c, and S. Narayanan, “A framework for automatic human emotion classification using emotion profiles,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1057–1070, 2011
[43] M.P. Black, P. Georgiou, A. Katsamanis, B. Baucom, and S. Narayanan, "'You made me do it': Classification of blame in married couples' interaction by fusing automatically derived speech and language information," in Proc. INTERSPEECH, Florence, Italy, Aug. 2011
[44] J. Gibson, A. Katsamanis, M.P. Black, and S. Narayanan, "Automatic identification of salient acoustic instances in couples' behavioral interactions using diverse density support vector machines," in Proc. INTERSPEECH, Florence, Italy, Aug. 2011
[45] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movella “Recognizing facial expression: Machine learning and application to spontaneous behavior,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 568–573, 2005
[46] H. Pan, Z. P. Liang and T. S. Huang. "Estimation of the joint probability of multisensory signals", Pattern Recognit. Lett., vol. 22, pp.1432 -1437 2001.
[47] Mohammad T. Shami and Mohamed S. Kamel, “Segment-based approach to the recognition of emotions in speech,” in Proc. ICME, 2005
[48] S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, “The HTK Book.” Cambridge University, 1996
[49] C.-C. Chang. C.-J. Lin, “LIBSVM: a library for support vector machines Software”, 2001, available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[50] A. Jain, K. Nandakumar, and A. Ross, "Score Normalization in Multimodal Biometric Systems," Pattern Recognition, vol. 38, no. 12, pp. 2270–2285, 2005
  • The author agrees to authorize on-campus browsing/printing of the electronic full text, publicly available from 2014-08-23.
  • The author agrees to authorize off-campus browsing/printing of the electronic full text, publicly available from 2014-08-23.
