Attentively-Coupled Long Short-Term Memory for Audio-Visual Emotion Recognition
Institute of Computer Science and Information Engineering
Audio-visual emotion recognition
Segment-based attention mechanism
Long Short-Term Memory
With the continuous evolution of human-computer interaction products, many smart devices, such as smart speakers, home robots, and self-driving cars, now support our daily needs. Adding emotion recognition of the user to the interaction makes these products more humane and increases the flexibility of the interaction. Research on emotion recognition has grown accordingly. However, among existing audio-visual emotion recognition systems, few have focused on segment-based recognition of emotion expression, in contrast to utterance-based recognition. Segment-based emotion expression reveals the finer-grained fluctuations of emotion within an utterance.
This thesis uses the segment as the recognition unit to capture the speaker's facial expressions and audio signals, analyzes the distinct characteristics of the facial and audio features, and accounts for the temporal dependencies among the segmented signals. During segmentation, a salient segment that strongly influences the emotion expressed by the whole utterance is first identified, and this segment is given higher attention in the overall recognition to improve the recognition accuracy of each segment.
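The idea of weighting a salient segment more heavily can be illustrated with a simple softmax over per-segment salience scores. This is a generic stand-in, not the thesis's actual weight calculation (which is detailed in Section 3.2.3); the function and score values here are hypothetical.

```python
# Illustrative sketch only: turn per-segment salience scores into
# attention weights with a softmax, so the segment judged most
# influential for the whole utterance receives the largest weight.
import math

def segment_weights(scores):
    """Softmax over segment salience scores (numerically stabilized)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three segments of one utterance; the middle one is most salient.
w = segment_weights([0.2, 1.5, 0.4])
print(w)  # weights sum to 1; w[1] is the largest
```

The weights can then scale each segment's contribution when the segment-level predictions are aggregated into an utterance-level decision.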
Different from single-modal emotion recognition, a multi-modal emotion recognition architecture considers data from different modalities. This thesis focuses on improving the fusion mechanism to boost the performance of segment-based emotion recognition through an attentively-coupled long short-term memory model. With the attention mechanism, at each fusion operation the coupling unit simultaneously considers the mutual influence of the two modalities' signal characteristics when updating its cell, and incorporates the attention assigned to each sequential segment for emotion recognition. The long short-term memory controls the flow of information to learn the long- and short-term dependencies of the signal. The model produces an emotion prediction sequence for each segment and is expected to recognize emotion from both the facial and the audio expressions of the speaker.
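The coupling described above can be sketched as follows. This is a minimal NumPy toy, not the thesis's implementation: it assumes each modality keeps its own LSTM cell and that a per-segment attention weight `alpha` blends the partner modality's cell state into each update. All names, the weight initialization, and the exact blending rule are illustrative assumptions.

```python
# Hypothetical sketch of one attentively-coupled LSTM step for two
# modalities (audio "a", visual "v"); alpha is the attention weight of
# the current segment and controls the cross-modal memory coupling.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gates(x, h, W):
    # W maps the concatenated [input; hidden] vector to the four
    # gate pre-activations (input, forget, output, candidate).
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    return sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)

def coupled_step(xa, xv, ha, hv, ca, cv, Wa, Wv, alpha):
    """One fusion step: each cell update blends its own memory with
    the partner modality's memory, weighted by the segment attention."""
    ia, fa, oa, ga = lstm_gates(xa, ha, Wa)
    iv, fv, ov, gv = lstm_gates(xv, hv, Wv)
    ca_new = fa * (alpha * ca + (1 - alpha) * cv) + ia * ga
    cv_new = fv * (alpha * cv + (1 - alpha) * ca) + iv * gv
    return np.tanh(ca_new) * oa, np.tanh(cv_new) * ov, ca_new, cv_new

rng = np.random.default_rng(0)
d = 4                                   # toy hidden size
Wa = 0.1 * rng.normal(size=(4 * d, 2 * d))
Wv = 0.1 * rng.normal(size=(4 * d, 2 * d))
ha = hv = ca = cv = np.zeros(d)
for alpha in (0.9, 0.5, 0.7):           # per-segment attention weights
    xa, xv = rng.normal(size=d), rng.normal(size=d)
    ha, hv, ca, cv = coupled_step(xa, xv, ha, hv, ca, cv, Wa, Wv, alpha)
print(ha.shape, hv.shape)  # (4,) (4,)
```

In a trained system the final hidden states (or their per-segment sequence) would feed a classifier that emits the emotion prediction for each segment.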
In the experiments, the proposed audio-visual emotion recognition system achieved an accuracy of 70.1%, outperforming existing traditional audio-visual emotion recognition systems. The results showed that the proposed attentively-coupled long short-term memory model performed well both in multi-modal emotion recognition and in emotion recognition using segment-based attention.
List of Tables VIII
List of Figures X
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 3
1.3 Literature Review 4
1.3.1 Speech Emotion Recognition Systems 4
1.3.2 Facial Expression Recognition Systems 6
1.3.3 Fusion Methods of Multi-modal Signals 7
1.4 Problems 9
1.5 Description of Proposed Method 10
Chapter 2 Database Design and Collection 11
2.1 Original Annotation of BAUM-1 11
2.1.1 BAUM-1 Database 11
2.1.2 Recording Setup 12
2.2 Classification of Emotion and Sound 14
2.2.1 Classification of Emotion 14
2.2.2 Classification of Sound 15
2.3 Segmentation and Re-Annotation of BAUM-1 15
2.3.1 Segmentation 16
2.3.2 Re-Annotation 18
2.3.3 Statistical Information of Re-Annotation Corpus 19
Chapter 3 Proposed Methods 22
3.1 Pre-processing 23
3.1.1 Pre-Processing of Audio Data 24
3.1.2 Pre-Processing of Visual Data 24
3.2 Feature Extraction 26
3.2.1 Feature Extraction of Audio Data 26
3.2.2 Feature Extraction of Visual Data 30
3.2.3 Calculation of Segment Weights 33
3.3 Emotion Recognition Model 35
3.3.1 Coupled LSTM model 36
3.3.2 Attentively-Coupled LSTM model 39
Chapter 4 Experimental Results and Discussion 41
4.1 Evaluation of Audio Feature Extraction Model 41
4.1.1 Audio Emotion Feature Extraction Model 42
4.1.2 Audio Sound Type Feature Extraction Model 44
4.2 Evaluation of Visual Feature Extraction Model 47
4.3 Evaluation of NTN Model 49
4.4 Evaluation of Audio-Visual Emotion Recognition Model 50
4.5 Comparison with Other Methods 52
4.5.1 Evaluation of Modalities 52
4.5.2 Evaluation of Different Methods 54
Chapter 5 Conclusion and Future Work 56