Mood Disorder Detection from Speech Using LSTM-Based Emotion Profile Tracking and Mood Verification
Institute of Computer Science and Information Engineering
speech emotion recognition
long short-term memory
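This thesis performs short-term detection of mood disorders using the speech responses produced after watching emotion-eliciting videos. First, patients watch six emotional video clips, and their elicited speech signals are collected through post-viewing interviews. In the proposed method, because low-level descriptors (LLDs) and the deep scattering spectrum (DSS) can capture finer energy-distribution information in the frequency domain, DSS and LLDs are combined. A hierarchical spectral clustering method adapts the emotion corpus to the mood disorder corpus to reduce the discrepancy of the disorder data in the emotion space, and an autoencoder extracts bottleneck features. An emotion detection model based on long short-term memory (LSTM) predicts the emotion profiles, and a hidden Markov model (HMM) describes the continuous trajectory of emotional expression in the patient's speech. Finally, HMM-based verification is applied to strengthen the model's recognition of the disorders.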
Among mental health disorders, Unipolar Depression (UD) and Bipolar Disorder (BD) have become the most common mental illnesses. A large portion of BD patients are misdiagnosed as having UD on initial presentation. As speech is the most natural way to express emotion, this thesis focuses on tracking the emotion profile of speech to build a short-term mood disorder detector for diagnosis assistance.
This thesis proposes an approach to short-term detection of mood disorders based on elicited speech responses. First, emotional video clips are used to elicit the patients’ emotions. The patients’ speech responses are collected through interviews conducted by a clinician after each of six emotional video clips. Because the Deep Scattering Spectrum (DSS) can obtain more detailed energy distributions in the frequency domain than the Low-Level Descriptors (LLDs), this study combines LLDs and DSS as the speech features. A domain adaptation method combining the hierarchical spectral clustering (HSC) algorithm and a denoising autoencoder is proposed to adapt the emotion database to the mood disorder database, alleviating the data bias problem in the emotion space. The autoencoder is then used to extract bottleneck features for dimensionality reduction. A long short-term memory (LSTM) emotion detection model predicts the emotion profiles, and a hidden Markov model (HMM) is applied to characterize the trajectory of those profiles. Finally, HMM-based verification is used to improve the detection performance for mood disorders.
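The last two stages, modeling the emotion-profile trajectory with an HMM and then verifying the decision, can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes each speech segment has already been reduced to the index of its dominant emotion, uses two hypothetical toy models ("steady" and "volatile") with made-up parameters in place of trained mood-class HMMs, and stands in for verification with a simple log-likelihood margin test.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete sequence under an HMM (scaled forward algorithm)."""
    n_states = len(start)
    # Initialise with the first observation.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        # Rescale to avoid underflow and accumulate the log of the scale factor.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

# Hypothetical toy parameters: 2 hidden states, 3 emotion symbols (0, 1, 2).
EMIT = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
MODELS = {
    "steady": {"start": [0.5, 0.5], "trans": [[0.9, 0.1], [0.1, 0.9]], "emit": EMIT},
    "volatile": {"start": [0.5, 0.5], "trans": [[0.2, 0.8], [0.8, 0.2]], "emit": EMIT},
}

def detect_with_verification(obs, threshold=0.0):
    """Pick the best-scoring model and accept it only if it beats the runner-up
    by more than `threshold` in log-likelihood (an assumed verification rule)."""
    scores = {name: forward_log_likelihood(obs, **m) for name, m in MODELS.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = scores[best] - scores[runner_up]
    return best, margin, margin > threshold

# A trajectory that rarely changes emotion versus one that flips every segment.
print(detect_with_verification([0, 0, 0, 0, 2, 2, 2, 2]))
print(detect_with_verification([0, 2, 0, 2, 0, 2, 0, 2]))
```

In a real system the observations would be continuous emotion-profile vectors rather than symbol indices, and the threshold would be tuned on held-out data.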
This study collected elicited emotional speech data from 15 BD patients, 15 UD patients, and 15 healthy controls. A five-fold cross-validation scheme was employed for evaluation. Experimental results show that the proposed method achieved a detection accuracy of 73.33%, an improvement of 18% over the SVM-based method. In the future, the patient’s personality, response context, and facial images can be considered to obtain better performance.
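The evaluation protocol can be sketched as follows. The sketch assumes one decision per subject and a class-balanced (stratified) split, neither of which is stated in the abstract; the function name `stratified_folds` and the count of 33 correct decisions are illustrative, the latter chosen only because it is the count that reproduces the reported accuracy with 45 subjects.

```python
def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    per_class = {}
    for idx, label in enumerate(labels):
        per_class.setdefault(label, []).append(idx)
    for indices in per_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# 15 BD patients, 15 UD patients, 15 healthy controls (HC), as in the abstract.
labels = ["BD"] * 15 + ["UD"] * 15 + ["HC"] * 15
folds = stratified_folds(labels, 5)

# Each fold serves once as the test set; the other four are used for training.
for i, test_idx in enumerate(folds):
    train_idx = [j for f in folds if f is not test_idx for j in f]
    print(f"fold {i}: {len(train_idx)} train, {len(test_idx)} test")

# Assumed count: 33 correct subject-level decisions out of 45.
correct = 33
accuracy = 100 * correct / len(labels)
print(f"accuracy = {accuracy:.2f}%")
```

Under these assumptions each fold holds 9 subjects, 3 from each group, and 33 correct decisions out of 45 gives 73.33%.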
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Motivation and Goal
1.3 Literature Review
1.3.1 Research Mood Disorder Detection
1.3.2 Recognition Units
1.3.3 Features
1.3.4 Models
1.4 Problems and Proposed Ideas
1.5 Research Framework
Chapter 2 Database Design and Collection
2.1 Chi-Mei Mood Database
2.1.1 Introduction
2.1.2 Criterion for Eliciting Emotion Video Selection
2.1.3 Collection Environment
2.1.4 Assessment Criterion
2.1.5 Collection Flow
2.2 MHMC Emotion Database
Chapter 3 Proposed Methods
3.1 Speech Preprocessing
3.2 Feature extraction
3.3 Autoencoder for Database Adaptation and Bottleneck Feature Extraction
3.3.1 Hierarchical Spectral Clustering Algorithm
3.3.2 Autoencoder
3.4 Emotion Model Construction and Emotion Profile Prediction
3.4.1 Emotion Profile
3.4.2 Long Short-Term Memory
3.5 Mood Model Construction and Verification
3.5.1 HMM-Based Verification Construction
3.5.2 Mood Detection by Verification Model
Chapter 4 Experimental Results and Discussion
4.1 MHMC Emotion Performance Analysis
4.2 Chi-Mei Mood Classification Performance
4.3 Evaluation of System Performance
4.3.1 Evaluation on LSTM+HMM
4.3.2 Evaluation on HSC adaptation
4.3.3 Evaluation on Bottleneck feature
4.3.4 Effect of Window size and Segment size
4.4 Discussion
Chapter 5 Conclusion and Future Work