System ID U0026-1908201519031900
Title (Chinese) 應用語音隱含式情感結構於情感性疾患之短期偵測
Title (English) Short-Term Detection of Mood Disorder Using Latent Affective Structure Modeling of Speech
University National Cheng Kung University
Department (Chinese) 資訊工程學系
Department (English) Institute of Computer Science and Information Engineering
Academic Year 103 (2014-2015)
Semester 2
Year of Publication 104 (2015)
Author (Chinese) 郭育婷
Author (English) Yu-Ting Kuo
Student ID P76021140
Degree Master's
Language English
Pages 64
Examination Committee Advisor - 吳宗憲
Committee member - 王新民
Committee member - 王駿發
Committee member - 禹良治
Committee member - 李祈均
Keywords (Chinese) 情感性疾患、語音情緒辨識、語料調適、自動編碼器、隱含式情感結構模型
Keywords (English) Mood disorder; Speech emotion recognition; Corpus adaptation; Autoencoder; Latent affective structure model
Subject Classification
Abstract (Chinese) Among the mood disorders, unipolar depression and bipolar disorder are considered the most common mental illnesses today. In the diagnosis of outpatients with mood disorders, a high proportion of bipolar patients are initially misdiagnosed as having unipolar depression; such misdiagnosis deprives them of appropriate treatment and allows the illness to worsen. To improve the treatment of mental illness and enable physicians to make an accurate diagnosis early, it is therefore critical to establish a diagnostic method that can reliably distinguish bipolar disorder from unipolar depression. Since speech is the most natural and emotionally rich medium of human communication, speech emotion recognition can be effectively applied to mood disorder detection. However, current research on mood disorder detection has focused on long-term monitoring, which cannot provide a rapid diagnosis and thus delays treatment; short-term detection, by contrast, enables early intervention and reduces the deterioration of the illness.
This thesis performs short-term detection of mood disorders using speech responses elicited by emotional videos. First, each patient watched six types of emotional video clips, and the elicited speech was collected through post-viewing interviews. A support vector machine (SVM) classifier was then used to obtain an emotion profile for each speech response. To deal with the data bias problem, the eNTERFACE emotion corpus was first adapted to the collected mood disorder corpus by hierarchical spectral clustering; the adapted eNTERFACE data were then reconstructed by an autoencoder, and the reconstructed data were used to build the SVM emotion classifier. Finally, the emotion profiles produced by the classifier were fed into a latent affective structure model, which characterizes the structural relationship among the speech responses to the six emotional videos and on which the mood disorder detection is based.
For performance evaluation, this thesis collected the CHI-MEI mood disorder corpus, comprising speech responses from 24 subjects: 8 with unipolar depression, 8 with bipolar disorder, and 8 healthy controls. Eight-fold cross validation was adopted, and the latent affective structure model was evaluated with autoencoders of varying numbers of neurons and layers. The experimental results show that the proposed method achieved 67% accuracy, 9% higher than the commonly used SVM and deep neural network classifiers. Future work could combine the proposed method with textual and visual information for better performance; in addition, the patient's personality is an important factor to consider in mood disorder detection.
Abstract (English) Mood disorders, including unipolar depression (UD) and bipolar disorder (BD), are reported to be the most common mental illnesses in recent years. In the diagnostic evaluation of outpatients with mood disorder, a high percentage of BD patients are initially misdiagnosed as having UD. This has significant negative consequences for the treatment of BD patients. It is therefore crucial to establish an accurate distinction between BD and UD so that an accurate and early diagnosis can be made, leading to improvements in treatment and the course of the illness. Given that speech is the most natural way to express emotion, recognition of emotions in speech can be effectively applied to mood disorder detection. Since current research focuses on long-term monitoring of mood disorders, short-term detection, which could enable early detection and intervention and thus reduce the severity of symptoms, is desirable.
This thesis proposes an approach to the short-term detection of mood disorder based on elicited speech responses. First, emotional videos were used to elicit the patients' emotions, and speech responses were collected through interviews conducted by a clinician after the patients watched each of six emotional video clips. A support vector machine (SVM)-based classifier was adopted to obtain an emotion profile for each speech response. To deal with the data bias problem, a hierarchical spectral clustering algorithm was employed to adapt the eNTERFACE emotion database to the collected mood disorder database. The adapted eNTERFACE data were then fed to a trained autoencoder to reconstruct the eNTERFACE emotion data, which were used to construct the SVM-based emotion classifier. Finally, based on the emotion profiles generated by the SVM-based emotion classifier, a latent affective structure model (LASM) is proposed to characterize the structural relationship among the speech responses to the six emotional videos for mood disorder detection.
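To make the adaptation-and-classification pipeline concrete, below is a minimal Python sketch under stated assumptions: the simplified cluster-shifting routine hsc_adapt stands in for the hierarchical spectral clustering step, scikit-learn's MLPRegressor approximates the autoencoder, and all data, feature dimensions, and hyperparameters are hypothetical rather than the thesis's actual settings.

```python
# Illustrative sketch only; hsc_adapt(), the feature dimensions, and all
# hyperparameters are hypothetical stand-ins for the thesis's actual setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVC

def hsc_adapt(source, target, levels=3, seed=0):
    """Shift source features toward the target corpus, cluster by cluster, at
    successively finer levels: a rough stand-in for hierarchical spectral
    clustering-based adaptation."""
    adapted = source.copy()
    for level in range(1, levels + 1):
        k = 2 ** level
        src = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(adapted)
        tgt = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(target)
        for c in range(k):
            # move each source cluster onto its nearest target centroid
            diffs = tgt.cluster_centers_ - src.cluster_centers_[c]
            nearest = np.argmin(np.linalg.norm(diffs, axis=1))
            adapted[src.labels_ == c] += diffs[nearest]
    return adapted

# Hypothetical data: labeled eNTERFACE features, unlabeled mood-corpus features.
rng = np.random.default_rng(0)
X_ent, y_ent = rng.normal(size=(600, 24)), rng.integers(0, 6, 600)  # six emotions
X_mood = rng.normal(loc=0.5, size=(200, 24))

# 1) Adapt eNTERFACE toward the mood corpus to reduce the data bias.
X_adapted = hsc_adapt(X_ent, X_mood)

# 2) Autoencoder-style reconstruction, approximated here by an MLP trained to
#    map adapted features back to the original eNTERFACE feature space.
ae = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
X_recon = ae.fit(X_adapted, X_ent).predict(X_adapted)

# 3) SVM emotion classifier; predict_proba() yields a six-dimensional emotion
#    profile per utterance, which the LASM consumes downstream.
clf = SVC(probability=True, random_state=0).fit(X_recon, y_ent)
emotion_profiles = clf.predict_proba(X_mood)
print(emotion_profiles.shape)  # (200, 6)
```

With real data, X_ent and X_mood would be acoustic feature vectors extracted from the eNTERFACE and CHI-MEI recordings; the random arrays here only keep the sketch self-contained and runnable.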
For system performance evaluation, speech responses were collected from 24 subjects, including 8 UD patients, 8 BD patients, and 8 healthy people (control group), to construct the CHI-MEI mood database. Eight-fold cross validation was adopted for the evaluations, and the LASM-based approaches were evaluated using autoencoders with different numbers of neurons and layers. The experimental results show that the proposed LASM-based method achieved 67% accuracy, a 9% improvement over commonly used classifiers such as SVM and DNN. In future work, it would be helpful to improve system performance by integrating the proposed method with lexical and visual information. Furthermore, the individuality of the patient is an important factor to consider in mood disorder detection.
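As a rough illustration of the evaluation protocol, the following sketch runs eight-fold cross validation over 24 hypothetical subjects (8 per group) and labels each test subject by the nearest class-prototype structure. The nearest-prototype rule is only a simplified stand-in for the LASM similarity estimation, and the synthetic profiles are not the CHI-MEI data.

```python
# Illustrative sketch; the nearest-prototype rule stands in for LASM similarity
# estimation, and all profiles below are synthetic, not the CHI-MEI corpus.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
# Hypothetical input: one 6-videos-by-6-emotions profile matrix per subject.
profiles = rng.dirichlet(np.ones(6), size=(24, 6))
labels = np.repeat([0, 1, 2], 8)  # 0 = UD, 1 = BD, 2 = control (8 each)

correct = 0
skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(profiles.reshape(24, -1), labels):
    # one prototype structure per class, averaged over the training subjects
    protos = {c: profiles[train_idx][labels[train_idx] == c].mean(axis=0)
              for c in (0, 1, 2)}
    for i in test_idx:
        dists = {c: np.linalg.norm(profiles[i] - p) for c, p in protos.items()}
        correct += int(min(dists, key=dists.get) == labels[i])

print(f"8-fold cross-validation accuracy: {correct / len(labels):.2%}")
```

On synthetic profiles the accuracy hovers near the 33% chance level; the 67% reported above is what the actual LASM obtains on the CHI-MEI corpus.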
Table of Contents
Abstract (Chinese) I
Abstract III
Acknowledgements V
Contents VI
List of Tables IX
List of Figures X
Chapter 1 Introduction 1
1.1 Motivation 2
1.2 Background 5
1.3 Literature Review 7
1.3.1 Current Research 7
1.3.2 Recognition Units and Features 8
1.3.3 Classification 9
1.3.4 Speech Data Collection 12
1.4 Problem and Goal 13
1.5 Research Framework 14
Chapter 2 Mood and Emotional Databases 15
2.1 Mood Database Design and Collection 15
2.1.1 CHI-MEI Mood Database 15
2.1.2 Data Collection 16
2.1.3 Emotional Video Selection 20
2.2 eNTERFACE Emotion Database 22
Chapter 3 Proposed Method 23
3.1 Speech/Speaker Segmentation 24
3.2 Database Adaptation 25
3.2.1 Hierarchical Spectral Clustering 26
3.2.2 Autoencoder for Data Reconstruction 29
3.3 Latent Affective Space Model 34
3.3.1 Emotion Profile Prediction 34
3.3.2 Latent Affective Space Model Construction 35
3.3.3 Similarity Estimation 39
Chapter 4 Experimental Results and Discussion 41
4.1 Feature Extraction and Speech/Speaker Segmentation 41
4.2 Database Analysis 44
4.3 Performance Evaluation 46
4.3.1 Evaluation on HSC and AE 46
4.3.2 Evaluation on LASM 49
4.4 Performance Comparison 53
4.5 Discussion 56
Chapter 5 Conclusion and Future Work 57
References 58
Appendix A Clinical Trial Approval Certificate 64
Full-Text Use Authorization
  • The author has agreed to make the electronic full text available for on-campus browsing/printing from 2017-08-27.
  • The author has agreed to make the electronic full text available for off-campus browsing/printing from 2017-08-27.

