A Study on Data Fusion Strategy for Audio-Visual Emotion Recognition
Institute of Computer Science and Information Engineering
Keywords: data fusion strategy, hidden Markov model
Recent years have seen increasing attention given to the research topic of automatic audio-visual emotion recognition. To increase recognition accuracy, the data fusion strategy, that is, how to effectively integrate the audio and visual cues, has become a major research issue. The fusion operations reported for audio-visual emotion recognition can be classified into three major categories: feature-level fusion, decision-level fusion, and model-level fusion. Each data fusion strategy has distinct characteristics, advantages, and disadvantages. Based on an analysis of the characteristics of current data fusion strategies, this dissertation first presents a hybrid fusion method that effectively integrates the advantages of fusion strategies with different characteristics to increase recognition performance.
This dissertation presents a hybrid fusion method, named the Error Weighted Semi-Coupled Hidden Markov Model (EWSC-HMM), which effectively integrates the advantages of a model-level fusion method, the Semi-Coupled Hidden Markov Model (SC-HMM), and a decision-level fusion method, Error Weighted Classifier Combination (EWC), to obtain the optimal emotion recognition result based on audio-visual bimodal fusion. A state-based bimodal alignment strategy in the SC-HMM is proposed to align the temporal relationship between the audio and visual streams. The Bayesian classifier weighting scheme of EWC is then adopted to explore the contributions of the SC-HMM-based classifiers for different audio-visual feature pairs and to make the final emotion recognition decision. For performance evaluation, two databases are considered: the posed MHMC database and the spontaneous SEMAINE database. Experimental results show that the proposed method not only outperforms other fusion-based bimodal emotion recognition methods for posed expressions but also provides acceptable results for spontaneous expressions.
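The error-weighted combination idea described above can be sketched in code. The following is a minimal illustrative sketch, not the dissertation's exact formulation: it assumes, as in error-weighted classifier combination schemes, that each classifier's reliability is estimated from a confusion matrix collected on held-out data, and the function name and combination details are hypothetical.

```python
import numpy as np

def ewc_combine(likelihoods, confusions):
    """Illustrative error-weighted combination of several classifiers.

    likelihoods: list of (n_classes,) score vectors, one per classifier.
    confusions:  list of (n_classes, n_classes) confusion matrices,
                 rows = true class, columns = predicted class.
    Returns the index of the combined winning class.
    """
    n_classes = likelihoods[0].shape[0]
    combined = np.zeros(n_classes)
    for scores, conf in zip(likelihoods, confusions):
        # Column-normalize to estimate P(true class | predicted class).
        reliability = conf / conf.sum(axis=0, keepdims=True)
        predicted = int(np.argmax(scores))
        # Weight each candidate class by how often this classifier's
        # current prediction actually corresponds to that true class.
        combined += reliability[:, predicted] * scores
    return int(np.argmax(combined))
```

A classifier that often confuses two emotions thus contributes less to the final decision than one whose predictions are historically reliable, which is the intuition behind weighting SC-HMM-based classifiers by their empirical error behavior.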
A complete emotional expression in face-to-face natural conversation typically follows a complex temporal course. This dissertation therefore further focuses on exploring the temporal evolution of an emotional expression for audio-visual emotion recognition. Previous psychological research showed that, when considering the manner and intensity of expression, a complete emotional expression can be characterized by three sequential temporal phases: onset (application), apex (release), and offset (relaxation). In natural conversation, however, a complete emotional expression may span more than one utterance, and each utterance may contain several temporal phases of the emotional expression. Accordingly, this dissertation further presents a novel data fusion method with a temporal course modeling scheme, named the Two-Level Hierarchical Alignment-Based Semi-Coupled Hidden Markov Model (2H-SC-HMM), to effectively handle the complex temporal structure of an emotional expression and to model the temporal relationship between the audio and visual streams, thereby increasing the performance of audio-visual emotion recognition on conversational utterances.
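The three temporal phases mentioned above can be illustrated with a toy segmentation of a per-frame expression-intensity curve. The threshold heuristic below is purely illustrative, assuming an intensity curve with a single peak; it is not the model-based alignment performed by the 2H-SC-HMM.

```python
def label_temporal_phases(intensity, apex_ratio=0.8):
    """Toy labeling of a per-frame intensity curve into onset/apex/offset.

    Frames whose intensity reaches apex_ratio * peak are labeled 'apex';
    frames before the peak are 'onset' and frames after it are 'offset'.
    This heuristic is for illustration only.
    """
    peak = max(intensity)
    threshold = apex_ratio * peak
    peak_idx = intensity.index(peak)
    labels = []
    for i, value in enumerate(intensity):
        if value >= threshold:
            labels.append("apex")
        elif i < peak_idx:
            labels.append("onset")
        else:
            labels.append("offset")
    return labels
```

In a real conversational utterance the curve is noisier and the phase boundaries are latent, which is why the dissertation treats them with a hierarchical hidden Markov structure rather than a fixed threshold.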
Finally, the experimental results demonstrate that the proposed 2H-SC-HMM substantially improves the performance of audio-visual emotion recognition.
TABLE OF CONTENTS VI
LIST OF FIGURES VIII
LIST OF TABLES X
CHAPTER 1. INTRODUCTION 1
1.1. Motivation 1
1.2. Application Areas 2
1.3. Literature Review 4
1.4. Problem of Current Data Fusion Strategy 10
1.5. The Approach of this Dissertation 11
1.6. Contributions 12
1.7. The Organization of this Dissertation 13
CHAPTER 2. DATA COLLECTION 14
2.1. MHMC Emotion Database 14
2.2. SEMAINE Emotion Database 18
CHAPTER 3. FEATURE EXTRACTION 22
3.1. Facial Feature Extraction 22
3.2. Prosodic Feature Extraction 26
CHAPTER 4. ERROR WEIGHTED SEMI-COUPLED HIDDEN MARKOV MODEL 29
4.1. Model Derivation of Error Weighted Semi-Coupled Hidden Markov Model 31
4.2. State-based Bimodal Alignment Strategy 37
4.3. Empirical Weight Calculation 39
4.4. Summary 40
CHAPTER 5. TWO-LEVEL HIERARCHICAL ALIGNMENT-BASED SEMI-COUPLED HIDDEN MARKOV MODEL 41
5.1. Temporal Phase Definition 43
5.2. Model Derivation of Two-Level Hierarchical Alignment-Based Semi-Coupled Hidden Markov Model 46
5.3. Model- and State-level Alignment Mechanism 53
5.4. Summary 57
CHAPTER 6. EXPERIMENTS AND RESULTS 59
6.1. Performance Comparison for the MHMC Database 60
6.1.1 Performance Comparison based on Unimodal Features 60
6.1.2 Performance Comparison between Unimodal and Bimodal Features 64
6.1.3 Performance Comparison for Small Training Data Conditions 67
6.1.4 Performance Comparison for Noisy Conditions 69
6.2. Performance Comparison for the SEMAINE Database 72
6.2.1 Performance Comparison based on Unimodal Features 73
6.2.2 Performance Comparison between Unimodal and Bimodal Features 76
CHAPTER 7. CONCLUSIONS AND FUTURE WORK 81