System ID U0026-2804201414032600
Title (Chinese) 應用資料融合策略於語音視覺情緒辨識之研究
Title (English) A Study on Data Fusion Strategy for Audio-Visual Emotion Recognition
University National Cheng Kung University
Department (Chinese) 資訊工程學系
Department (English) Institute of Computer Science and Information Engineering
Academic Year 102
Semester 2
Year of Publication 103
Author (Chinese) 林仁俊
Author (English) Jen-Chun Lin
Student ID p78961265
Degree Doctoral
Language English
Number of Pages 94 pages
Oral Defense Committee Advisor: 吳宗憲
Convener: 廖弘源
Committee Member: 王新民
Committee Member: 賴尚宏
Committee Member: 李宗南
Committee Member: 柳金章
Committee Member: 曾新穆
Committee Member: 戴顯權
Committee Member: 連震杰
Keywords (Chinese) 情緒辨識; 資料融合策略; 隱藏式馬可夫模型; 時間歷程
Keywords (English) Emotion recognition; data fusion strategy; hidden Markov model; temporal course
Subject Classification
Abstract (Chinese) In recent years, research on automatic audio-visual emotion recognition has received growing attention. To improve recognition accuracy, the data fusion strategy, that is, how to effectively combine the audio and visual signals, has become a central research issue. In audio-visual emotion recognition, data fusion strategies fall into three major categories: feature-level fusion, decision-level fusion, and model-level fusion. Each strategy has its own characteristics, advantages, and disadvantages. Based on an analysis of these characteristics, this dissertation first proposes a hybrid fusion method that effectively combines the advantages of different fusion strategies to improve the overall accuracy of emotion recognition.
The dissertation first combines a model-level fusion method (the semi-coupled hidden Markov model) with a decision-level fusion method (a Bayesian classifier weighting technique) to improve audio-visual emotion recognition; the resulting method is named the error weighted semi-coupled hidden Markov model. A state-based bimodal alignment strategy is proposed and applied within the semi-coupled hidden Markov model to capture the temporal relationship between the audio and visual streams. The Bayesian classifier weighting technique is then applied to assess the contribution of each semi-coupled hidden Markov model classifier (each built on a different audio-visual feature pair) to the final recognition decision. Experiments and statistical tests verify that the proposed hybrid fusion technique effectively combines the advantages of model-level and decision-level fusion and improves the overall audio-visual emotion recognition rate.
In face-to-face communication, a complete emotional expression typically involves a complex temporal course. Beyond data fusion strategies, this dissertation therefore further investigates how an emotional expression evolves over time. Psychological studies indicate that a complete emotional expression can be characterized by three successive temporal phases: an initial phase in which the emotion is aroused, a phase in which the emotion reaches its peak, and a phase in which the emotion gradually relaxes back to a neutral state. In natural conversation, however, a complete emotional expression is often spread across multiple utterances; that is, a single utterance may contain one or more of these temporal phases. To model this complex temporal structure, the dissertation further proposes a two-level hierarchical alignment-based semi-coupled hidden Markov model built on a temporal course modeling scheme, which models how an emotional expression evolves within an utterance while also capturing the temporal relationship between the audio and visual streams, in order to achieve better audio-visual emotion recognition results.
Finally, experiments and statistical tests on the proposed two-level hierarchical alignment-based semi-coupled hidden Markov model show that the proposed method considerably improves the accuracy of audio-visual emotion recognition.
Abstract (English) Automatic audio-visual emotion recognition has received increasing attention in recent years. To increase recognition accuracy, the data fusion strategy, that is, how to effectively integrate the audio and visual cues, has become a major research issue. The fusion operations reported for audio-visual emotion recognition can be classified into three major categories: feature-level fusion, decision-level fusion, and model-level fusion. These data fusion strategies have different characteristics and distinct advantages and disadvantages. Based on an analysis of the characteristics of current data fusion strategies, this dissertation first presents a hybrid fusion method that effectively integrates the advantages of fusion strategies with different characteristics to increase recognition performance.
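As a rough illustration of the first two fusion categories (a hedged sketch only, not the dissertation's implementation; the feature dimensions, classifier choice, and random data are hypothetical), feature-level fusion concatenates the modalities before a single classifier, whereas decision-level fusion combines the outputs of per-modality classifiers. Model-level fusion, in contrast, couples the modalities inside the model itself (e.g., coupled or semi-coupled HMMs) and is not shown here.

```python
# Hedged sketch contrasting feature-level and decision-level fusion on
# synthetic per-utterance features; all sizes and data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_audio, d_visual, n_emotions = 200, 12, 20, 4
X_audio = rng.normal(size=(n, d_audio))    # stand-in for prosodic features
X_visual = rng.normal(size=(n, d_visual))  # stand-in for facial features
y = rng.integers(0, n_emotions, size=n)    # stand-in emotion labels

# Feature-level fusion: concatenate the modalities, train one classifier.
feat_clf = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_visual]), y)
print(feat_clf.predict(np.hstack([X_audio, X_visual]))[:10])  # feature-level decisions

# Decision-level fusion: one classifier per modality, combine their posteriors.
a_clf = LogisticRegression(max_iter=1000).fit(X_audio, y)
v_clf = LogisticRegression(max_iter=1000).fit(X_visual, y)
fused = 0.5 * a_clf.predict_proba(X_audio) + 0.5 * v_clf.predict_proba(X_visual)
print(fused.argmax(axis=1)[:10])                              # decision-level decisions
```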
This dissertation presents a hybrid fusion method, named the Error Weighted Semi-Coupled Hidden Markov Model (EWSC-HMM), that integrates the advantages of a model-level fusion method, the Semi-Coupled Hidden Markov Model (SC-HMM), and a decision-level fusion method, Error Weighted Classifier Combination (EWC), to obtain optimal emotion recognition results based on audio-visual bimodal fusion. A state-based bimodal alignment strategy in the SC-HMM is proposed to align the temporal relationship between the audio and visual streams. The Bayesian classifier weighting scheme (EWC) is then adopted to explore the contributions of the SC-HMM-based classifiers for different audio-visual feature pairs and to make the final emotion recognition decision. For performance evaluation, two databases are considered: the posed MHMC database and the spontaneous SEMAINE database. Experimental results show that the proposed method not only outperforms other fusion-based bimodal emotion recognition methods on posed expressions but also provides acceptable results on spontaneous expressions.
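The decision-level half of this scheme can be sketched as follows (a minimal sketch under stated assumptions, not the dissertation's EWSC-HMM derivation: the per-class scores stand in for SC-HMM log-likelihoods of one utterance, and the weights are derived from held-out confusion matrices in the spirit of error-weighted classifier combination).

```python
# Hedged sketch of error-weighted, decision-level combination of several
# classifiers (one per hypothetical audio-visual feature pair).
import numpy as np

def ewc_combine(per_classifier_scores, confusion_matrices):
    """per_classifier_scores: list of (n_classes,) arrays of per-class scores
    (e.g., log-likelihoods of one utterance under each class model).
    confusion_matrices: matching list of held-out confusion matrices, where
    entry [i, j] counts utterances of true class i labelled as class j.
    Returns the index of the fused emotion class."""
    fused = np.zeros_like(per_classifier_scores[0], dtype=float)
    for scores, cm in zip(per_classifier_scores, confusion_matrices):
        # Reliability of a vote for class j: P(true = j | predicted = j).
        reliability = np.diag(cm) / (cm.sum(axis=0) + 1e-9)
        # Softmax-normalize the scores, then weight them by reliability.
        p = np.exp(scores - scores.max())
        fused += reliability * (p / p.sum())
    return int(np.argmax(fused))

# Two hypothetical feature-pair classifiers over four emotion classes.
scores = [np.array([-10.2, -9.8, -12.1, -11.0]), np.array([-8.7, -9.9, -9.1, -10.5])]
cms = [np.array([[30, 3, 2, 1], [4, 28, 2, 2], [1, 2, 31, 2], [2, 1, 3, 30]]),
       np.array([[25, 5, 4, 2], [6, 24, 3, 3], [2, 4, 27, 3], [3, 2, 4, 27]])]
print(ewc_combine(scores, cms))
```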
A complete emotional expression typically unfolds over a complex temporal course in face-to-face natural conversation. This dissertation therefore further explores the temporal evolution of an emotional expression for audio-visual emotion recognition. Psychological research has shown that, when the manner and intensity of expression are considered, a complete emotional expression can be characterized by three sequential temporal phases: onset (application), apex (peak), and offset (relaxation). In natural conversation, however, a complete emotional expression often spans more than one utterance, and each utterance may contain several temporal phases of the expression. Accordingly, this dissertation further presents a novel data fusion method built on a temporal course modeling scheme, the Two-Level Hierarchical Alignment-Based Semi-Coupled Hidden Markov Model (2H-SC-HMM), to handle the complex temporal structure of an emotional expression and to model the temporal relationship between the audio and visual streams, thereby increasing the performance of audio-visual emotion recognition on conversational utterances.
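A minimal sketch of the temporal-course idea alone (assuming hypothetical per-frame phase log-likelihoods and a simple left-to-right transition scheme; the dissertation's 2H-SC-HMM additionally performs model- and state-level alignment of the audio and visual streams, which is omitted here):

```python
# Hedged sketch: Viterbi decoding of the temporal phases (onset, apex, offset)
# within one utterance, allowing the utterance to start mid-expression.
import numpy as np

PHASES = ["onset", "apex", "offset"]

def decode_phases(frame_loglik, stay=np.log(0.9), advance=np.log(0.1)):
    """frame_loglik: (T, 3) array of log P(frame_t | phase); a phase may only
    be kept or advanced left-to-right. Returns one phase label per frame."""
    T, P = frame_loglik.shape
    delta = np.full((T, P), -np.inf)
    back = np.zeros((T, P), dtype=int)
    delta[0] = frame_loglik[0]                  # an utterance may begin in any phase
    for t in range(1, T):
        for p in range(P):
            candidates = [(delta[t - 1, p] + stay, p)]
            if p > 0:                           # advancing from the previous phase
                candidates.append((delta[t - 1, p - 1] + advance, p - 1))
            best_score, back[t, p] = max(candidates)
            delta[t, p] = best_score + frame_loglik[t, p]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [PHASES[p] for p in reversed(path)]

rng = np.random.default_rng(1)
print(decode_phases(rng.normal(size=(20, 3))))  # hypothetical 20-frame utterance
```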
Finally, the experimental results demonstrate that the proposed 2H-SC-HMM substantially improves the performance of audio-visual emotion recognition.
Table of Contents TABLE OF CONTENTS VI
LIST OF FIGURES VIII
LIST OF TABLES X
CHAPTER 1. INTRODUCTION 1
1.1. Motivation 1
1.2. Application Areas 2
1.3. Literature Review 4
1.4. Problem of Current Data Fusion Strategy 10
1.5. The Approach of this Dissertation 11
1.6. Contributions 12
1.7. The Organization of this Dissertation 13
CHAPTER 2. DATA COLLECTION 14
2.1. MHMC Emotion Database 14
2.2. SEMAINE Emotion Database 18
CHAPTER 3. FEATURE EXTRACTION 22
3.1. Facial Feature Extraction 22
3.2. Prosodic Feature Extraction 26
CHAPTER 4. ERROR WEIGHTED SEMI-COUPLED HIDDEN MARKOV MODEL 29
4.1. Model Derivation of Error Weighted Semi-Coupled Hidden Markov Model 31
4.2. State-based Bimodal Alignment Strategy 37
4.3. Empirical Weight Calculation 39
4.4. Summary 40
CHAPTER 5. TWO-LEVEL HIERARCHICAL ALIGNMENT-BASED SEMI-COUPLED HIDDEN MARKOV MODEL 41
5.1. Temporal Phase Definition 43
5.2. Model Derivation of Two-Level Hierarchical Alignment-Based Semi-Coupled Hidden Markov Model 46
5.3. Model- and State-level Alignment Mechanism 53
5.4. Summary 57
CHAPTER 6. EXPERIMENTS AND RESULTS 59
6.1. Performance Comparison for the MHMC Database 60
6.1.1 Performance Comparison based on Unimodal Features 60
6.1.2 Performance Comparison between Unimodal and Bimodal Features 64
6.1.3 Performance Comparison for Small Training Data Conditions 67
6.1.4 Performance Comparison for Noisy Conditions 69
6.2. Performance Comparison for the SEMAINE Database 72
6.2.1 Performance Comparison based on Unimodal Features 73
6.2.2 Performance Comparison between Unimodal and Bimodal Features 76
CHAPTER 7. CONCLUSIONS AND FUTURE WORK 81
REFERENCES 84
Full-Text Use Permissions
  • On-campus browsing/printing of the electronic full text is authorized, publicly available from 2019-05-01.
  • Off-campus browsing/printing of the electronic full text is authorized, publicly available from 2019-05-01.

