Real Mood Detection Using Denoising Autoencoder and LSTM
Institute of Computer Science and Information Engineering
Keywords: speech emotion recognition; long-term emotion tracking; long short-term memory
In a rapidly changing social environment, emotions are increasingly difficult for people to manage. Sometimes people are not even aware that they are experiencing negative emotions, and the long-term accumulation of such emotions can develop into mental illness. It is therefore essential to develop an emotion tracking system that helps users manage their emotions. In current studies, emotions are generally measured with extended subjective self-report methods.
Although it is commonly assumed that the emotion perceived by a listener is close to the emotion the speaker intends to convey, several studies have shown that a mismatch remains between the two. In addition, individuals with different personalities generally express emotions differently. Based on these observations, this thesis proposes an emotion conversion model that characterizes the relationship between the perceived emotion and the expressed emotion of a user with a specific personality; the conversion from perceived to expressed emotion is driven by the user's personality traits. This thesis further treats mood swings as the long-term accumulation of emotions. A database containing users' long-term speech data and mood annotations was collected and is used to model the temporal relationship between emotion and mood.
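To make the conversion model's interface concrete, the following minimal sketch shows one way the training pairs described above could be arranged: perceived-emotion profiles and personality trait scores on the input side, and the speaker's expressed emotion as the target. All array names, dimensions, and the random placeholder values are assumptions for illustration, not data from this thesis.

```python
import numpy as np

# Hypothetical layout of (input, target) pairs for the
# perceived-to-expressed emotion conversion model.
rng = np.random.default_rng(0)
N, E, P = 200, 4, 5                       # utterances, emotion classes, trait dimensions
perceived   = rng.random((N, E))          # listener-annotated emotion profiles
personality = rng.random((N, P))          # speaker personality trait scores
expressed   = rng.random((N, E))          # speaker-reported (expressed) emotion profiles

# The conversion model maps (perceived emotion, personality) -> expressed emotion.
X = np.concatenate([perceived, personality], axis=1)   # model input
y = expressed                                          # conversion target
print(X.shape, y.shape)                                # (200, 9) (200, 4)
```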
To reflect the user's real mood, an SVM-based emotion model is developed to generate probabilistic class labels (emotion profiles). Moreover, because expressed and perceived emotions differ, a Gaussian distribution is built to generate noisy data: for denoising autoencoder (DAE) training, the input is the expressed emotion value contaminated with the generated noise, and the target is the clean expressed emotion. Finally, to model the temporal fluctuation of emotions, a long short-term memory (LSTM)-based mood model is constructed for mood detection.
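A minimal sketch of this training setup is given below, assuming PyTorch and placeholder hyperparameters (layer sizes, noise scale, optimizer settings, number of epochs), none of which are specified here: the DAE is trained to reconstruct clean expressed-emotion profiles from Gaussian-corrupted inputs, and an LSTM then summarizes a sequence of denoised profiles into a binary mood decision.

```python
import torch
import torch.nn as nn

E = 4                                        # assumed number of emotion classes
g = torch.Generator().manual_seed(0)

# Denoising autoencoder: Gaussian-corrupted profile in, clean profile as target.
dae = nn.Sequential(nn.Linear(E, 8), nn.Tanh(), nn.Linear(8, E))
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

expressed = torch.rand(256, E, generator=g)  # placeholder expressed-emotion profiles
sigma = 0.1                                  # assumed noise scale
for _ in range(5):                           # a few toy epochs
    noisy = expressed + sigma * torch.randn(expressed.shape, generator=g)
    opt.zero_grad()
    loss = loss_fn(dae(noisy), expressed)    # denoise back to the clean target
    loss.backward()
    opt.step()

# LSTM-based mood model: a sequence of per-utterance emotion profiles
# is summarized into a positive/negative mood decision.
class MoodLSTM(nn.Module):
    def __init__(self, in_dim=E, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)
    def forward(self, x):                    # x: (batch, time, in_dim)
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])               # logits for positive / negative mood

mood_model = MoodLSTM()
seq = dae(torch.rand(1, 30, E)).detach()     # denoised profiles for 30 utterances
print(mood_model(seq).shape)                 # torch.Size([1, 2])
```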
In the mood detection experiments, the mood database was collected from 10 participants and contained 104 positive and 96 negative mood instances. Leave-one-speaker-out cross validation was employed for evaluation. Experimental results show that the proposed method achieved a detection accuracy of 64.5%, a 5% improvement over the HMM-based method. In the future, tracking the users' dialog content and blog posts could be incorporated to further improve performance.
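The evaluation protocol can be sketched as follows with scikit-learn's LeaveOneGroupOut, using speaker identity as the group so that every fold holds out all recordings of one speaker. The features, labels, and the simple SVM classifier here are illustrative assumptions standing in for the full LSTM-based pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Hypothetical data: per-sample features, binary mood labels, and speaker ids.
rng = np.random.default_rng(0)
n_samples, n_features, n_speakers = 200, 12, 10
X = rng.random((n_samples, n_features))             # placeholder mood features
y = rng.integers(0, 2, n_samples)                   # 0 = negative, 1 = positive mood
speakers = rng.integers(0, n_speakers, n_samples)   # speaker id per sample

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    clf = SVC(probability=True).fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over held-out speakers: {np.mean(accuracies):.3f}")
```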
Table of Contents VI
List of Tables IX
List of Figures X
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Background 2
1.3 Literature Review 3
1.3.1 Emotional Speech Databases 3
1.3.2 Emotion Perception and Emotion Expression 3
1.3.3 Emotion and Personality Trait 5
1.3.4 Long-Term Tracking 7
1.4 Problem and Goal 8
1.5 The Organization of this Thesis 9
Chapter 2 Emotional Database Design and Collection 10
2.1 Emotion with Personality Database (EP-DB) 10
2.1.1 Data Collection 11
2.1.2 Emotional Video Selection 14
2.1.3 Environment 15
2.1.4 Data Annotation 16
2.2 Long-Term Emotion Database (LT-DB) 17
2.2.1 Data Collection 18
2.2.2 Environment 19
2.2.3 Data Annotation 20
2.3 MHMC Emotion Database 21
Chapter 3 Proposed Method 22
3.1 Speech Preprocessing 23
3.2 Emotion Profile Prediction 25
3.3 Emotion Conversion with Personality 26
3.3.1 Training Data Construction 26
3.3.2 Denoising Autoencoder 29
3.4 Long-Term Tracking and Mood Detection 33
Chapter 4 Experimental Results and Discussion 38
4.1 Database Analysis 38
4.2 System Performance 40
4.2.1 Emotion Profile Prediction 40
4.2.2 Emotion Conversion 41
4.2.3 Mood Detection 45
4.3 Performance Comparison 47
Chapter 5 Conclusions and Future Work 49