Quadro-W Learning for Behavior Prediction in Evolved Environment: Case Study of Intelligent Butler
Institute of Computer Science and Information Engineering
In recent years, advances in embedded hardware (sensors, microprocessors, etc.), the maturity of software technology, the spread of the internet, and falling prices have led to embedded systems being widely deployed in many scenarios, including crop growth monitoring, commodity defect detection, transportation system management, and vital-sign monitoring. However, to gather enough information for good analysis, the application environment is usually filled with sensors. This causes several problems: the hardware devices alter the original environment; the initial system setup is time-consuming; and the large number of devices makes the system expensive and difficult to maintain. The hardware required to achieve the goal should therefore be reduced without affecting the results. In addition to the hardware used to collect environmental information, each application scenario has its own software model for analyzing that information. Such a model is usually configured for a particular situation in the environment and cannot evolve as the environment evolves. A model that cannot learn on its own reduces the flexibility and shortens the life cycle of the system.
In this study, we propose a Quadro-W Learning (QW-Learning) method to predict human behavior. Quadro-W refers to the human (who), object (what), place (where), and time (when). We obtain the Quadro-W information solely from data collected by a camera, without extra sensors. We then build a behavior prediction model from the Quadro-W information; this model can not only make predictions based on the initial environment, but also evolve along with the environment, increasing the system's flexibility and life cycle.
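The idea above can be illustrated with a minimal sketch: tabular Q-learning (the basis suggested by QW-Learning's name) where each state is a Quadro-W tuple and each action is a predicted behavior. The state encoding, action names, reward scheme, and hyperparameters here are assumptions for illustration, not the thesis' actual design.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

class QWLearner:
    """Tabular Q-learning over Quadro-W states (who, what, where, when)."""

    def __init__(self, actions):
        self.actions = actions
        self.q = defaultdict(float)  # (state, action) -> estimated value

    def predict(self, state):
        """Predict the resident's next behavior for a Quadro-W tuple."""
        if random.random() < EPSILON:           # occasionally explore
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update; the reward would come from
        whether the predicted behavior matched what actually happened."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] += ALPHA * (
            reward + GAMMA * best_next - self.q[(state, action)])

# Hypothetical usage: names and tuples are illustrative only.
learner = QWLearner(actions=["brew_coffee", "turn_on_tv", "do_nothing"])
s = ("alice", "mug", "kitchen", "morning")
a = learner.predict(s)
learner.update(s, a,
               reward=1.0 if a == "brew_coffee" else -1.0,
               next_state=("alice", "remote", "living_room", "morning"))
```

Because the Q-table keeps updating from observed outcomes, the predictor can drift with the environment rather than staying fixed at its initial configuration, which is the evolvability property the abstract emphasizes.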
LIST OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1. Introduction
1.1 Introduction & Motivation
1.2 Thesis Overview
Chapter 2. Background & Related Work
2.1 Background
2.1.1 Action Recognition
2.1.2 Temporal Action Detection
2.2 Related Work
2.2.1 Residual Block
2.2.2 Q-Learning
Chapter 3. Method
3.1 Problem Description
3.2 System Architecture
3.3 Data Pre-processing
3.4 Quadro-W Model
3.4.1 Human Detection & Recognition
3.4.2 Object Detection & Recognition
3.4.3 Place Recognition
3.4.4 Sound Split & Recognition
3.5 Quadro-W Information Merge
3.6 Evolved Behavior Prediction
Chapter 4. Experiment
4.1 Experiment Environment Setup
4.2 Implementation
4.3 Experiment Result
Chapter 5. Conclusion & Future Work