System ID  U0026-3108201714154800
Title (Chinese)  應用長短期記憶模型於融合非同步多模態情緒表達之心情偵測
Title (English)  Mood Detection using LSTM-based Fusion of Emotion Expressions from Asynchronous Multimodal Inputs
University  National Cheng Kung University
Department (Chinese)  資訊工程學系
Department (English)  Institute of Computer Science and Information Engineering
Academic year  105 (2016-2017)
Semester  2
Year of publication  106 (2017)
Author (Chinese)  曾苑蓉
Author (English)  Yuan-Rong Zeng
Student ID  P76044172
Degree  Master's
Language  English
Number of pages  50
Committee  Committee member: 王駿發
Advisor: 吳宗憲
Committee member: 王家慶
Committee member: 楊家輝
Committee member: 戴顯權
Keywords (Chinese)  情緒辨識  心情預測  卷積神經網路  長短期記憶模型  降噪自動編碼器
Keywords (English)  Emotion recognition  Mood prediction  Convolutional neural network  Long short-term memory  Denoising autoencoder
Subject classification
Chinese Abstract  Mental illness has become more and more common in today's society. Beyond treatment, preventing mental illness is an important issue. However, people often fail to notice that they have developed negative emotions, which then affect their mood and, over time, their mental health. This thesis builds a system that predicts a user's mood state by analyzing multimodal data of the user's emotion expressions, with the aim of helping users monitor their own mood states.
This thesis designs an application for smart devices that collects the speech, text, and emoticon data users enter through different media. By observing the user's records from these media and the trajectory of the resulting emotion profiles, the system predicts the user's mood state. First, for the speech input, acoustic features are used to build an audio emotion codebook; word2vec then produces a vector for each audio emotion codeword, and a convolutional neural network (CNN) generates the speech emotion profile. For the text input, an emotion dictionary projects the words into an emotion space, an autoencoder extracts bottleneck features of the resulting distribution, and these features are concatenated with word vectors trained by word2vec to form the textual emotion codeword vectors, from which a long short-term memory (LSTM) model generates the text emotion profile. In addition, to obtain the user's real emotion expression, a denoising autoencoder converts the detected speech and text emotion profiles into the real expressed emotions. Finally, to integrate the emotion expressions from the multimodal inputs, the asynchronous multimodal emotions (from speech, text, and emoticons) are fused, and an LSTM model is used to build the mood detection model that predicts the mood state.
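A minimal sketch, not taken from the thesis, of how the speech branch described above could be wired up: frame-level acoustic features are quantized into an audio emotion codebook, each utterance becomes a codeword sequence, and a small CNN maps that sequence to an emotion profile. All sizes, layer choices, and the trainable embedding standing in for the word2vec codeword vectors are assumptions.

from sklearn.cluster import KMeans
from tensorflow.keras import layers, models

N_CODEWORDS, MAX_LEN, N_EMOTIONS = 256, 200, 4   # illustrative sizes (assumed)

def build_audio_codebook(frame_features):
    # Quantize frame-level acoustic features into N_CODEWORDS audio emotion codewords.
    return KMeans(n_clusters=N_CODEWORDS, random_state=0).fit(frame_features)

def to_codeword_sequence(codebook, utterance_frames):
    # Replace each frame by the index of its nearest codeword.
    return codebook.predict(utterance_frames)

def build_speech_emotion_cnn():
    # CNN over a (padded) codeword sequence; the Embedding layer stands in
    # for the word2vec codeword vectors described in the abstract.
    inp = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(N_CODEWORDS, 64)(inp)
    x = layers.Conv1D(128, 5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(N_EMOTIONS, activation="softmax")(x)   # speech emotion profile
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model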
In total, this thesis collected 226 days of long-term mood data, comprising 168 positive-mood and 58 negative-mood records, and used 5-fold cross-validation to evaluate system performance. Experimental results show that the proposed method, which fuses the multimodal emotion expressions and takes emotion decay into account, achieves a mood detection accuracy of 79.53%, a 1.17% improvement over fusing the multimodal emotion expressions without considering emotion decay. Moreover, the proposed fusion of multimodal emotion expressions is a feature-level method; compared with decision-level fusion, it improves mood detection accuracy by 1.75%. Because long-term data are difficult to collect, we hope to gather more multimodal data in the future to strengthen the model and verify the feasibility of the proposed method.
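One plausible reading of the emotion-decay weighting mentioned above is sketched here: each emotion profile recorded during the day is attenuated according to how long ago it was entered, so that recent expressions dominate the fused daily sequence. The exponential form and the half-life are assumptions made for illustration, not details taken from the thesis.

import numpy as np

def apply_emotion_decay(profiles, timestamps_hours, half_life_hours=6.0):
    # profiles: (T, n_emotions) emotion profiles in chronological order.
    # timestamps_hours: (T,) recording times in hours within the day.
    # Returns the profiles scaled so that older entries contribute less.
    profiles = np.asarray(profiles, dtype=float)
    elapsed = max(timestamps_hours) - np.asarray(timestamps_hours, dtype=float)
    weights = 0.5 ** (elapsed / half_life_hours)   # assumed exponential decay
    return profiles * weights[:, None]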
English Abstract  Nowadays, more and more people suffer from mental health problems. In addition to treating these problems, preventing people from developing mental illness is an important issue. However, people often fail to notice that they are feeling down, which eventually affects their mood and, in turn, their mental health. This thesis analyzes and detects the user's emotion expressions in order to monitor the user's mood state.
This thesis designs mood tracking software for smart devices to collect the user's input data from different modalities, consisting of speech, text, and emoticons. First, for the speech input, acoustic features are used to construct an audio emotion codebook. Word2vec is then used to construct an audio codeword vector for each audio emotion codeword, and a convolutional neural network (CNN) is applied to generate the emotion profile of each speech input. Second, for the text input, an emotion dictionary is adopted to project the lexical words into the emotion space, and an autoencoder extracts bottleneck features from this projection. The bottleneck features are concatenated with the word vectors trained by word2vec to construct the textual emotion codeword vectors, and a long short-term memory (LSTM) network is then used to generate the emotion profile for each text input. Furthermore, the emotion profiles from both speech and text are converted into the real expressed emotions using a denoising autoencoder. Finally, the asynchronous emotion expressions from the different modalities, including speech, text, and emoticons, are fused to determine the conclusive mood state using an LSTM.
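A minimal sketch, under assumed dimensions, of the last two stages just described: a denoising autoencoder that maps recognized emotion profiles toward the user's actually expressed emotions, and an LSTM that turns a day's padded sequence of fused per-input emotion vectors into a binary mood decision. Layer sizes, the Gaussian corruption, and the fused feature layout are assumptions, not the thesis's exact configuration.

from tensorflow.keras import layers, models

N_EMOTIONS = 4                 # assumed number of emotion classes
FUSED_DIM = N_EMOTIONS + 2     # assumed per-input vector: profile plus extra cues
MAX_EVENTS = 50                # assumed maximum number of inputs per day

def build_denoising_autoencoder():
    # Trained with recognized emotion profiles as (corrupted) input and the
    # self-annotated expressed emotions as the reconstruction target.
    inp = layers.Input(shape=(N_EMOTIONS,))
    noisy = layers.GaussianNoise(0.1)(inp)              # corruption step
    bottleneck = layers.Dense(8, activation="relu")(noisy)
    out = layers.Dense(N_EMOTIONS, activation="softmax")(bottleneck)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

def build_mood_lstm():
    # LSTM over the day's zero-padded sequence of fused emotion vectors;
    # outputs the probability of a positive mood.
    inp = layers.Input(shape=(MAX_EVENTS, FUSED_DIM))
    x = layers.Masking(mask_value=0.0)(inp)
    x = layers.LSTM(32)(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model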
For evaluation, this thesis collected a total of 226 daily mood records, of which 168 are positive moods and 58 are negative moods. To evaluate the performance of the proposed method, 5-fold cross-validation was employed. Experimental results show that the proposed method, which considers emotion decay, achieved a detection accuracy of 79.53%, a 1.17% improvement over the fusion method in which emotion decay is not considered. In addition, the proposed feature-level fusion method improves mood detection accuracy by 1.75% compared with the decision-level fusion method. In the future, more data should be collected to make the model more robust and to further verify the feasibility of the proposed method.
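The evaluation protocol above can be reproduced, under the same assumptions as the sketches before it, roughly as follows; stratification is added here only to cope with the 168/58 class imbalance and is not stated in the abstract.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(build_model, X, y, n_splits=5):
    # X: (n_days, MAX_EVENTS, FUSED_DIM) fused daily sequences (numpy array).
    # y: (n_days,) binary mood labels, 1 = positive, 0 = negative.
    accuracies = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()                          # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=30, batch_size=8, verbose=0)
        _, accuracy = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        accuracies.append(accuracy)
    return float(np.mean(accuracies))                  # mean detection accuracy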
Table of Contents  Contents
Chinese Abstract I
Abstract III
Acknowledgements V
Contents VI
List of Tables IX
List of Figures X
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Literature Review 2
1.2.1 Mood/Emotion database 2
1.2.2 Mood/Emotion tracking system 3
1.2.3 Mood tracking technique 4
1.2.4 Emotion recognition technique 4
1.3 Problem and goal 6
1.4 Research framework 7
Chapter 2 Database design and collection 9
2.1 MHMC long-term mood database 9
2.1.1 Collection flow 9
2.1.2 Collection environment 10
2.1.3 Annotations 12
2.2 MHMC speech emotion database 13
2.2.1 Collection flow 13
2.2.2 Annotations 14
2.3 NLPCC-MHMC text emotion database 14
2.3.1 Collection flow 14
2.3.2 Annotations 14
Chapter 3 Proposed methods 16
3.1 Feature extraction 17
3.1.1 Speech feature extraction 17
3.1.2 Text feature extraction 21
3.2 Emotion profile generation 25
3.2.1 Speech codeword-vector-based emotion recognition model 25
3.2.2 Text emotion recognition model 28
3.3 Construction and conversion of emotion profile 30
3.3.1 Construction of the training data 31
3.3.2 Denoising autoencoder 31
3.4 Fusion of the multimodal emotion expressions 34
3.5 Long-term tracking and mood detection model construction 36
Chapter 4 Experimental Results and Discussion 38
4.1 Evaluation of emotion recognition 38
4.1.1 Performance of speech emotion recognition 38
4.1.2 Performance of text emotion recognition 39
4.2 Evaluation of mood detection 41
Chapter 5 Conclusion and Future Work 46
Reference 48

Full-Text Availability
  • The author has agreed to authorize on-campus browsing and printing of the electronic full text, available from 2019-08-31.

