Deep Neural Network Based Emotion Recognition System for Humanoid Robot
Department of Electrical Engineering
Convolutional Neural Network
Long Short-Term Memory
Recognizing human emotions is crucial for robots during human-robot interaction. This thesis therefore proposes an emotion recognition system for a humanoid robot. The robot is equipped with a camera that captures images of the user's face, and the goal is for the robot to respond appropriately to the user's emotion as recognized by the proposed system. The emotion recognition system, based on a deep neural network, learns the six basic emotions: happiness, anger, disgust, fear, sadness, and surprise. The system is built in four steps: first, a convolutional neural network (CNN) is trained on a large number of static images to extract visual features; second, a long short-term memory (LSTM) recurrent neural network learns the relationship between the evolution of facial expressions in image sequences and the six basic emotions; third, the CNN and the LSTM are integrated into a single model that combines the advantages of both; finally, the performance of the system is further improved by transfer learning, which transfers knowledge from related but different problems. The performance of the proposed system is verified by leave-one-out cross-validation and compared with other models. The system is then applied to human-robot interaction to demonstrate its practicality.
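To make the four-step pipeline concrete, the following is a minimal sketch, not the thesis's actual architecture, of how a per-frame CNN feature extractor can be combined with an LSTM and a classifier over the six basic emotions. It is written in PyTorch; the input size (48x48 grayscale face crops), layer widths, sequence length, and hyperparameters are all illustrative assumptions, and the transfer-learning step would correspond to initializing the CNN weights from a network pre-trained on a static-image expression dataset.

```python
# Illustrative CNN+LSTM sketch for sequence-based emotion recognition.
# All sizes are assumptions for demonstration, not the thesis's settings.
import torch
import torch.nn as nn


class CnnLstmEmotionNet(nn.Module):
    def __init__(self, num_emotions: int = 6, feature_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        # Per-frame CNN feature extractor (assumed 48x48 grayscale face crops).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 48x48 -> 24x24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 24x24 -> 12x12
            nn.AdaptiveAvgPool2d(1),                          # global average pooling
            nn.Flatten(),
            nn.Linear(64, feature_dim), nn.ReLU(),
        )
        # LSTM over the sequence of per-frame features.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        # Linear classifier over the last hidden state; softmax is applied by the loss.
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, 48, 48)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # (batch, time, feature_dim)
        _, (h_n, _) = self.lstm(feats)
        return self.classifier(h_n[-1])                        # (batch, num_emotions) logits


if __name__ == "__main__":
    model = CnnLstmEmotionNet()
    dummy_clip = torch.randn(2, 16, 1, 48, 48)  # two 16-frame face sequences
    print(model(dummy_clip).shape)              # torch.Size([2, 6])
```

In this sketch the network outputs unnormalized logits and would be trained with a cross-entropy loss, which applies the softmax internally; during inference, the predicted emotion is the index of the largest logit.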
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Motivation
1.2 Related Work
1.3 Thesis Organization
Chapter 2 Deep Neural Network and Transfer Learning
2.1 Introduction
2.2 Convolutional Neural Network
2.2.1 Introduction to Convolutional Neural Network
2.2.2 Convolution Layer
2.2.3 Pooling Layer
2.2.4 Activation Layer
2.2.5 Fully Connected and Global Average Pooling Layer
2.2.6 Softmax
2.2.7 Residual Block
2.3 Long Short-Term Memory Networks (LSTMs)
2.3.1 Introduction to Traditional Recurrent Neural Network
2.3.2 Long Short-Term Memory
2.4 Transfer Learning
2.4.1 Introduction to Transfer Learning
2.4.2 Categories of Transfer Learning
2.4.3 Inductive Transfer Learning
2.4.4 Layer Transferring and Layer Sharing
Chapter 3 The Proposed Models by Combining CNN and LSTMs
3.1 Introduction
3.2 CNN Model
3.3 The Proposed Models by Combining CNN and LSTMs
3.3.1 The LSTM Network
3.3.2 CNN Feature Extractor
3.3.3 Combination of CNN and LSTM
3.4 Transferring Parameters of the CNN
3.5 Enhanced Model
Chapter 4 Simulations and Experimental Results
4.1 Introduction
4.2 Simulations
4.2.1 Databases
4.2.2 Data Preprocessing
4.2.3 Simulation Platform
4.2.4 Leave-One-Out Cross-Validation
4.3 Experimental Setup
4.3.1 Robot Harley
4.3.2 Camera
4.3.3 Computer
4.4 Experiment I
4.5 Experiment II
4.5.1 Experimental Environment
4.5.2 Scenario
4.6 Summary
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work