System ID: U0026-3008201821483000
Thesis Title (Chinese): 基於視覺融合特徵之運動動作辨識
Thesis Title (English): Video Based Action Recognition with Multiple Feature Fusion
University: National Cheng Kung University (成功大學)
Department (Chinese): 資訊工程學系
Department (English): Institute of Computer Science and Information Engineering
Academic Year: 106 (ROC calendar)
Semester: 2
Year of Publication: 107 (ROC calendar, 2018)
Author (Chinese): 鄭凱杰
Author (English): Kai-Jie Zheng
Email: zhengkaijie139@gmail.com
Student ID: P76053058
Degree: Master's
Language: English
Pages: 35
Oral Defense Committee: Advisor - 胡敏君
Committee Member - 朱威達
Committee Member - 邱維辰
Committee Member - 蘇文鈺
Keywords (Chinese): 行為辨識 (action recognition), 3D卷積網路 (3D convolutional network), 支持向量機 (support vector machine)
Keywords (English): action recognition, 3D convolution network, support vector machine
Abstract (Chinese): Recent studies have confirmed the importance of skeleton joints in action recognition. However, existing methods use the skeleton joints only to select motion-related features within the neural network. This approach ignores the motion information carried by the joints themselves as they change over time, which greatly limits the learning of complex actions. To address this problem, this thesis proposes a novel algorithm that discriminates actions using both the motion-related features selected from the neural network and skeleton-related features. We validate the proposed action recognition model on three public datasets, JHMDB, SUB-JHMDB, and PENN-ACTION, using both ground-truth skeletons and skeletons estimated by a deep-learning-based method. Experimental results show that the proposed method achieves a better recognition rate than state-of-the-art algorithms.
Abstract (English): Recent studies have confirmed the importance of skeleton joints in action recognition. However, current methods use the joint points only to select the motion-related features in the neural network. These methods ignore the movement information of the joint points themselves over time, which limits the learning of complex actions. To address these problems, this paper proposes a novel algorithm that simultaneously uses the motion-related features and the skeleton-related features to discriminate complex actions. We validate our proposed action recognition model using ground-truth joint points and joint points estimated by a deep learning method on the JHMDB, SUB-JHMDB, and PENN-ACTION datasets. The experimental results show that our proposed method achieves a better recognition rate than the state-of-the-art methods.
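The fusion described in the abstract can be illustrated with a short sketch. The following Python code is a minimal, hypothetical illustration, not the thesis implementation: it samples a 3D-CNN feature map at the per-frame joint positions, appends first-order joint displacements as a simple kinetic cue, and trains a linear SVM on the concatenated descriptor. All array shapes, function names, and the toy data are assumptions made for illustration.

    # Minimal sketch of joint-guided feature pooling plus kinetic features,
    # fused and classified with an SVM (hypothetical, for illustration only).
    import numpy as np
    from sklearn.svm import SVC

    def pool_features_at_joints(feature_map, joints):
        """Sample a CNN feature map of shape (C, T, H, W) at joint positions.
        joints: (T, J, 2) array of (x, y) coordinates normalized to [0, 1].
        Returns a flat descriptor of length T*J*C (nearest-neighbour sampling)."""
        C, T, H, W = feature_map.shape
        desc = []
        for t in range(T):
            for x, y in joints[t]:
                col = min(int(x * W), W - 1)
                row = min(int(y * H), H - 1)
                desc.append(feature_map[:, t, row, col])
        return np.concatenate(desc)

    def kinetic_features(joints):
        """First-order joint displacements over time as a simple kinetic cue."""
        return np.diff(joints, axis=0).reshape(-1)   # (T-1, J, 2), flattened

    # Toy data standing in for P3D feature maps and annotated skeletons.
    rng = np.random.default_rng(0)
    n_clips, C, T, H, W, J = 40, 64, 8, 7, 7, 15
    y = rng.integers(0, 3, size=n_clips)             # 3 hypothetical action classes
    X = []
    for _ in range(n_clips):
        fmap = rng.standard_normal((C, T, H, W))     # stand-in CNN feature map
        joints = rng.random((T, J, 2))               # stand-in skeleton sequence
        X.append(np.concatenate([pool_features_at_joints(fmap, joints),
                                 kinetic_features(joints)]))
    X = np.stack(X)

    clf = SVC(kernel="linear").fit(X, y)             # SVM on the fused descriptor
    print("training accuracy:", clf.score(X, y))

A kernel-level fusion variant (compare Section 5.6 of the table of contents) would instead compute a separate kernel per feature type and combine the kernels before training the SVM.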
Table of Contents:
Cover
Oral presentation document
Chinese version
English version
Abstract (Chinese) i
Abstract (English) ii
Acknowledgments iii
Table of Contents iv
List of Tables vi
List of Figures vii
Chapter 1. Introduction 1
Chapter 2. Related Work 3
2.1 Deep learning based method 3
2.2 Skeleton information based method 4
Chapter 3. JOINT KINETIC AND RELATIONAL FEATURE 6
3.1 Kinetic Feature 6
3.2 Correlation Relational Feature 7
3.3 Distance Relational Feature 7
3.4 Geometric Relational Feature 8
3.5 Bag of Visual Words 9
Chapter 4. P3D WITH RATIO SCALING 10
4.1 CNN 10
4.2 C3D 11
4.3 P3D 11
4.4 Feature pooling 13
4.5 Feature Fusion 16
Chapter 5. EXPERIMENT 18
5.1 Dataset 18
5.1.1 JHMDB 18
5.1.2 SUB-JHMDB 18
5.1.3 PENN-ACTION 18
5.2 Experimental details 18
5.3 Experiment Result 19
5.4 Experiments on estimated joints 20
5.5 Comparison of different layers of P3D 22
5.6 Comparison of kernel fusion methods 22
5.7 Comparison with the state-of-the-art methods 25
5.7.1 Experimental results on the SUB-JHMDB data set 25
5.7.2 Experimental results on the PENN-ACTION data set 27
5.7.3 Experimental results on the JHMDB data set 29
Chapter 6. Conclusions & Future Work 31
References 32
References:
[1] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.
[2] C. Cao, Y. Zhang, C. Zhang, and H. Lu. Action recognition with joints-pooled 3d deep convolutional descriptors. IJCAI’16 Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 3324–3330, 2016.
[3] C. Cao, Y. Zhang, C. Zhang, and H. Lu. Body joint guided 3d deep convolutional descriptors for action recognition. CoRR, abs/1704.07160, 2017.
[4] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. CoRR, abs/1705.07750, 2017.
[5] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 3218–3226, 2015.
[6] W. Du, Y. Wang, and Y. Qiao. RPAN: an end-to-end recurrent pose-attention network for action recognition in videos. International Conference on Computer Vision, pages 3745–3754, 2017.
[7] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1110–1118, 2015.
[8] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016.
[9] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3165–3174, 2017.
[10] U. Iqbal, M. Garbade, and J. Gall. Pose for action - action for pose. In IEEE International Conference on Automatic Face & Gesture Recognition, pages 438–445, 2017.
[11] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 3192–3199, 2013.
[12] G. Johansson. Visual perception of biological motion and a model for its analysis. In Perception & Psychophysics, volume 14, pages 201–211, 1973.
[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[14] Z. Lan, Y. Zhu, A. G. Hauptmann, and S. D. Newsam. Deep local video feature for action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, Honolulu, HI, USA, July 21-26, 2017, pages 1219–1225, 2017.
[15] I. Laptev. On space-time interest points. In International Journal of Computer Vision, volume 64, pages 107–123, 2005.
[16] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010.
[17] I. Lillo, J. C. Niebles, and A. Soto. A hierarchical pose-based approach to complex action understanding using dictionaries of actionlets and motion poselets. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1981–1990, 2016.
[18] B. X. Nie, C. Xiong, and S.-C. Zhu. Joint action recognition and pose estimation from video. IEEE Conference on Computer Vision and Pattern Recognition, pages 1293–1301, 2015.
[19] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5534–5542, 2017.
[20] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1234–1241, 2012.
[21] X. Tian and J. Fan. Joints kinetic and relational features for action recognition. In Signal Processing, volume 142, pages 412–422, 2018.
[22] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[23] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. CoRR, abs/1711.11248, 2017.
[24] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[25] H. Wang and C. Schmid. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 3551–3558, 2013.
[26] J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu. Cross-view action modeling, learning, and recognition. 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2649–2656, 2014.
[27] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4305–4314, 2015.
[28] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 20–36, 2016.
[29] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 1385–1392, 2011.
[30] W. Zhang, M. Zhu, and K. G. Derpanis. From actemes to action: A strongly-supervised representation for detailed action understanding. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 2248–2255, 2013.
[31] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2923–2932, 2017.
Full-Text Access Authorization
  • On-campus browsing/printing of the electronic full text is authorized, publicly available from 2019-12-01.
  • Off-campus browsing/printing of the electronic full text is authorized, publicly available from 2019-12-01.

