Video-Based Action Recognition with Multiple Feature Fusion
Institute of Computer Science and Information Engineering
Keywords: 3D convolutional network, support vector machine
Recent research has confirmed the importance of skeleton joint points in action recognition. However, current methods use the joint points only to select motion-related features in the neural network; they ignore the movement information of the joint points themselves over time, which limits the learning of complex actions. To address these problems, this paper proposes a novel algorithm that simultaneously uses motion-related features and skeleton-related features to discriminate complex actions. We validate the proposed action recognition model using both ground-truth joint points and joint points estimated by a deep learning method on the JHMDB, SUB-JHMDB, and PENN-ACTION data sets. The experimental results show that the proposed method achieves a higher recognition rate than the state-of-the-art methods.
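The fusion idea described above, combining motion-related descriptors from a 3D convolutional network with skeleton-related features before a kernelized classifier, can be sketched as follows. This is a minimal illustration only: the feature dimensions, the linear kernel, and the fusion weight `alpha` are assumptions for the sketch, not the exact pipeline of this thesis.

```python
import numpy as np

# Hypothetical per-video features from two modalities; the dimensions
# below are illustrative assumptions, not the thesis's descriptor sizes.
rng = np.random.default_rng(0)
motion = rng.normal(size=(6, 128))    # e.g. pooled 3D-CNN descriptors
skeleton = rng.normal(size=(6, 64))   # e.g. joint kinetic/relational features

def linear_kernel(x):
    """Gram matrix of pairwise inner products between videos."""
    return x @ x.T

# Weighted-sum kernel fusion: combine the per-modality kernels into a
# single Gram matrix, which an SVM can then consume in precomputed-
# kernel mode instead of raw feature vectors.
alpha = 0.5
k_fused = alpha * linear_kernel(motion) + (1 - alpha) * linear_kernel(skeleton)
print(k_fused.shape)  # (6, 6)
```

Because each per-modality kernel is symmetric positive semidefinite, any non-negative weighted sum is again a valid kernel, so the fused Gram matrix can be passed directly to a standard kernel SVM.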
Oral presentation document
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iv
List of Tables vi
List of Figures vii
Chapter 1. Introduction 1
Chapter 2. Related Work 3
2.1 Deep learning based method 3
2.2 Skeleton information based method 4
Chapter 3. Joint Kinetic and Relational Feature 6
3.1 Kinetic Feature 6
3.2 Correlation Relational Feature 7
3.3 Distance Relational Feature 7
3.4 Geometric Relational Feature 8
3.5 Bag of Visual Words 9
Chapter 4. P3D with Ratio Scaling 10
4.1 CNN 10
4.2 C3D 11
4.3 P3D 11
4.4 Feature Pooling 13
4.5 Feature Fusion 16
Chapter 5. Experiment 18
5.1 Dataset 18
5.1.1 JHMDB 18
5.1.2 SUB-JHMDB 18
5.1.3 PENN-ACTION 18
5.2 Experimental details 18
5.3 Experimental results 19
5.4 Experiments on estimated joints 20
5.5 Comparison of different layers of P3D 22
5.6 Comparison of kernel fusion methods 22
5.7 Comparison with the state-of-the-art methods 25
5.7.1 Experimental results on the SUB-JHMDB data set 25
5.7.2 Experimental results on the PENN-ACTION data set 27
5.7.3 Experimental results on the JHMDB data set 29
Chapter 6. Conclusions & Future Work 31