Short-Term Detection of Mood Disorder Using Latent Affective Structure Modeling of Speech
Institute of Computer Science and Information Engineering
Keywords: Speech emotion recognition; Latent affective structure model
Mood disorders, including unipolar depression (UD) and bipolar disorder (BD), are reported to be among the most common mental illnesses in recent years. In diagnostic evaluations of outpatients with mood disorders, a high percentage of BD patients are initially misdiagnosed as having UD, with significant negative consequences for their treatment. It is therefore crucial to distinguish accurately between BD and UD so that an accurate and early diagnosis can be made, leading to improvements in treatment and in the course of the illness. Given that speech is the most natural means of emotional expression, recognition of emotions in speech could be effectively applied to mood disorder detection. While current research has focused on long-term monitoring of mood disorders, short-term detection, which could enable early detection and intervention and thus reduce the severity of symptoms, is desirable.
This thesis proposes an approach to short-term detection of mood disorder based on elicited speech responses. First, emotional videos were used to elicit the patients' emotions. Speech responses were collected through clinician-led interviews after the patients watched each of six emotional video clips. A support vector machine (SVM)-based classifier was adopted to obtain an emotion profile for each speech response. To deal with the data bias problem, a hierarchical spectral clustering (HSC) algorithm was employed to adapt the eNTERFACE emotion database to the collected mood disorder database. The adapted eNTERFACE emotion data were then fed to a trained autoencoder to reconstruct the eNTERFACE emotion data for constructing the SVM-based emotion classifier. Finally, based on the emotion profiles generated by the SVM-based emotion classifier, a latent affective structure model (LASM) is proposed to characterize the structural relationships among the speech responses to the six emotional videos for mood disorder detection.
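The core idea of the pipeline, an SVM with probability outputs turning each speech response into an emotion profile, and a subject then being classified by the structural similarity of their six profiles to class templates, can be sketched as follows. This is a minimal illustration only: the toy features, the four-emotion setup, the random class templates, and the cosine-similarity matcher are assumptions for the sketch, not the thesis implementation.

```python
# Minimal sketch of the emotion-profile idea described above.
# All data, templates, and helper names here are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy acoustic features for four emotion classes (e.g. happy, angry, sad, neutral).
class_means = rng.normal(size=(4, 12)) * 3
X = rng.normal(size=(200, 12)) + np.repeat(class_means, 50, axis=0)
y = np.repeat(np.arange(4), 50)

# SVM with probability outputs: each prediction is a soft distribution
# over emotion classes (an "emotion profile"), not a hard label.
clf = SVC(probability=True, random_state=0).fit(X, y)

# Six elicited speech responses from one subject -> a 6 x 4 profile matrix.
responses = rng.normal(size=(6, 12))
profiles = clf.predict_proba(responses)

def similarity(profile_matrix, template):
    """Cosine similarity between flattened profile structures."""
    a, b = profile_matrix.ravel(), template.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Classify the subject by the most similar class template (UD / BD / control).
templates = {"UD": rng.random((6, 4)),
             "BD": rng.random((6, 4)),
             "control": rng.random((6, 4))}
label = max(templates, key=lambda k: similarity(profiles, templates[k]))
print(label)
```

In the thesis the structural relationship among the six profiles is modeled by the proposed LASM rather than a plain cosine match; the sketch only shows where the emotion profiles come from and how a profile-level similarity could drive the final decision.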
For system performance evaluation, speech responses were collected from 24 subjects, comprising 8 UD patients, 8 BD patients, and 8 healthy people (the control group), to construct the CHI-MEI mood database. Eight-fold cross validation was adopted for the following evaluations. The LASM-based approaches were evaluated using autoencoders with different numbers of neurons and layers. The experimental results show that the proposed LASM-based method achieved 67% detection accuracy, a 9% improvement over commonly used classifiers such as SVM and DNN. In future work, it would be helpful to improve system performance by integrating the proposed method with lexical and visual information. Furthermore, the individuality of each patient is an important factor to consider in mood disorder detection.
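With 24 subjects split evenly across three groups, an eight-fold protocol naturally holds out one subject per group in each fold. The following sketch shows one plausible way to form such folds with a stratified split; the exact fold assignment used in the thesis is not specified here, so this is an assumption for illustration.

```python
# Illustrative eight-fold split over the 24 subjects (8 UD, 8 BD, 8 controls).
# The concrete fold assignment is an assumption, not the thesis protocol.
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array(["UD"] * 8 + ["BD"] * 8 + ["control"] * 8)
subjects = np.arange(24).reshape(-1, 1)

skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)
folds = list(skf.split(subjects, labels))

# With 8 subjects per class and 8 folds, stratification puts exactly
# one subject from each diagnostic group into every test fold.
for train_idx, test_idx in folds:
    assert len(test_idx) == 3
    assert set(labels[test_idx]) == {"UD", "BD", "control"}
```

Keeping one subject per group in each test fold means every reported accuracy figure is averaged over held-out subjects the classifier never saw during training, which matters for a database this small.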
List of Tables IX
List of Figures X
Chapter 1 Introduction 1
1.1 Motivation 2
1.2 Background 5
1.3 Literature Review 7
1.3.1 Current Research 7
1.3.2 Recognition Units and Features 8
1.3.3 Classification 9
1.3.4 Speech Data Collection 12
1.4 Problem and Goal 13
1.5 Research Framework 14
Chapter 2 Mood and Emotional Databases 15
2.1 Mood Database Design and Collection 15
2.1.1 CHI-MEI Mood Database 15
2.1.2 Data Collection 16
2.1.3 Emotional Video Selection 20
2.2 eNTERFACE Emotion Database 22
Chapter 3 Proposed Method 23
3.1 Speech/Speaker Segmentation 24
3.2 Database Adaptation 25
3.2.1 Hierarchical Spectral Clustering 26
3.2.2 Autoencoder for Data Reconstruction 29
3.3 Latent Affective Structure Model 34
3.3.1 Emotion Profile Prediction 34
3.3.2 Latent Affective Structure Model Construction 35
3.3.3 Similarity Estimation 39
Chapter 4 Experimental Results and Discussion 41
4.1 Feature Extraction and Speech/Speaker Segmentation 41
4.2 Database Analysis 44
4.3 Performance Evaluation 46
4.3.1 Evaluation on HSC and AE 46
4.3.2 Evaluation on LASM 49
4.4 Performance Comparison 53
4.5 Discussion 56
Chapter 5 Conclusion and Future Work 57
Appendix A Clinical Trial Approval Certificate 64