Developing A Portable Smart Label Recognition System of Environmental Control for Quadriplegic Patients
Department of BioMedical Engineering
scene text recognition
lightweight convolutional neural network
daily activity aid
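Quadriplegic patients cannot take care of themselves in daily life because they have lost motor function. With the development of advanced technology and the emergence of smart appliances and eye-movement controllers, patients have a new opportunity to regain autonomy through technology. This study proposes a vision of an assistive device based on augmented reality and an eye-tracking system, which lets patients control the appliances in front of them with eye movements. The study uses the Textboxes++ and CRNN scene text recognition algorithms, with a Mobilenet-v2-like network as the backbone, to build a portable label recognizer, and validates the algorithms for the environments in which severely paralyzed users are likely to use it. The Textboxes++ and CRNN architectures combined with the Mobilenet-v2-like backbone are called TBPP-lite and CRNN-lite, respectively. The Synthtext dataset was used as pre-training data to train the baseline TBPP-lite, and the ICDAR2015 dataset and custom data were used for transfer learning. Limit tests of the model were run on images captured with two different mobile devices, in three environments of different complexity and at seven different label sizes relative to the image. Accuracy reached 97%-100% under certain conditions, and the model ran in real time on a desktop computer. In the future, if label sizes and environments are designed according to these limit test results, recognition will be sufficiently accurate. Combined with a subsequent eye-control system or brain-computer interface and with smart appliances, the system has the potential to greatly improve the quality of life of quadriplegic patients.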
Because of impaired motor function, quadriplegic patients are unable to take care of themselves in daily life. With the advance of technology and the emergence of smart homes and eye-tracking controllers, patients may be able to regain autonomy. This research proposes a vision of assistive devices based on augmented reality and eye-tracking systems that allow patients to control electrical appliances by gazing at them. The proposal is to build a portable label recognizer by combining the Textboxes++ and CRNN scene text recognition algorithms and simplifying both with a Mobilenet-v2-like backbone network. Textboxes++ and CRNN combined with this Mobilenet-like backbone are called TBPP-lite and CRNN-lite, respectively. This study used the Synthtext dataset as pre-training data to train the baseline TBPP-lite, and used the ICDAR2015 dataset and customized data for transfer learning. CRNN-lite was trained on Synthtext. The model's limit test was performed on images taken with two different mobile devices, in three environments of different complexity and at seven different label sizes relative to the image. Accuracy reached 97%-100% in some combinations, and the running speed was close to real time. In the future, if label sizes and environments are designed according to the limit test results of this research, recognition can be sufficiently accurate. Combined with the eye-control system under development, or with a brain-computer interface, and with smart appliances, the system has the potential to greatly improve the quality of life of quadriplegic patients.
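As a rough illustration of why a Mobilenet-like backbone makes the recognizer lighter, the sketch below compares the weight count of a standard convolution with that of the depthwise separable convolution used in MobileNet-style networks. The kernel size and channel counts are illustrative assumptions, not values taken from TBPP-lite or CRNN-lite, and the function names are hypothetical.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise convolution followed by a 1 x 1
    pointwise convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer shape (an assumption, not from the thesis).
k, c_in, c_out = 3, 256, 256

std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = std / sep

print(f"standard conv weights:            {std}")
print(f"depthwise separable conv weights: {sep}")
print(f"reduction factor:                 {ratio:.2f}x")
```

For this layer shape the separable form needs roughly one ninth of the weights. The saving for a whole network depends on its actual layer configuration.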
Chapter 1 Introduction
1.1 Background
1.2 Artificial Neural Network
1.3 Evolution of Convolutional Neural Network Architecture
1.4 Computer Vision
1.5 The main goal of this study
Chapter 2 Methods and Material
2.1 Introduction
2.2 Principles of the system
2.3 Overall framework
2.4 Label detection
2.5 Text Recognition
2.6 Training strategy
2.7 Limitation test
Chapter 3 Results
3.1 Introduction
3.2 Results of label detection
3.3 Results of text recognition
3.4 Results of the combined model
Chapter 4 Discussion
4.1 Comparison with original TBPP
4.2 The effect of Mobilenet and model redundancy
4.3 Transfer learning for customization
4.4 The effects of relative box size on detection
4.5 Complexity
4.6 Labeling ground truth for detection
4.7 Causes of recognition failure
4.8 Factors to be considered for practical application
Chapter 5 Conclusions
A. ADD Gate, Multiply Gate and ResNet
B. Detailed numerical values of the results