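System ID: U0026-1008202018565400
Thesis title (Chinese): 基於唇部特徵點座標差之中文唇語識別系統
Thesis title (English): Chinese Lipreading System Based on Coordinate Differences of Lip Feature Points
University: National Cheng Kung University (成功大學)
Department (Chinese): 工程科學系
Department (English): Department of Engineering Science
Academic year: 108 (ROC calendar)
Semester: 2
Year of publication: 109 (ROC calendar)
Graduate student (Chinese): 游士龍
Graduate student (English): Shi-Lung Yu
Student ID: N96071180
Degree: Master's
Language: Chinese
Pages: 41
Oral defense committee: Advisor - 王宗一
Keywords (Chinese): 唇語識別; 深度學習; 特徵點座標差; 神經網路
Keywords (English): Lip reading; Deep learning; Coordinate differences of feature points; Neural network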
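Abstract (translated from Chinese): For people with hearing impairments, sign language is an essential life skill, and they can communicate with one another entirely in sign language. Hearing people who do not know sign language typically resort to handwriting to communicate with them. Alternatively, a hearing-impaired person may read lips to understand a speaker directly. Lip-reading, however, is hard to learn: unlike sign language, where one gesture usually corresponds to one meaning, similar lip shapes can carry different meanings, so interpreting lip shapes and recovering their true meaning is a very challenging learning problem. With advances in technology, almost everyone carries a mobile phone. A phone that could film a speaker, read the lips, and display the content of the conversation would be a very convenient communication tool for hearing-impaired people and for those who do not know sign language. Although there has been much research on Chinese and English lip-reading, accuracy on non-specific vocabulary remains low, especially for Chinese. This study builds a Chinese lip-reading dataset drawn from practical use, proposes a method based on coordinate differences of feature points to train a neural network, and implements a Chinese lip-reading system to verify its feasibility in practical applications and to compare it with previous methods.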
Abstract (English): Sign language is an essential tool for people who are hearing impaired, and they can use it to communicate with one another. People who do not know sign language can communicate with hearing-impaired people through handwriting; alternatively, a hearing-impaired person may read a speaker's lips to follow what is being said. Lip-reading, however, is harder to master than sign language. In sign language, one gesture generally represents one meaning, whereas in lip-reading similar lip shapes may correspond to different characters, which makes it a challenging problem to study. Today most people carry mobile phones. A phone that could film a person talking, read that person's lips, and display the content of the conversation would be a very convenient tool for people who are hearing impaired and for people who do not know sign language. Although there has been much research on lip-reading for Chinese and English, accuracy on non-specific vocabulary is still not high enough, especially for Chinese. Using real-life vocabulary, this study employs coordinate differences of lip feature points and different combinations of neural networks to build a real-world lip-reading application for mobile phones.
In this study, vocabulary items from sentences frequently used in daily life are recorded with an ordinary mobile phone camera. The faces of people reading the sentences are filmed at 30 frames per second, and each video is split into clips, one per vocabulary item in the sentence. To reflect the varying lighting conditions of real scenes, the brightness of the clips is adjusted, which also increases the number of clips available for training. For every frame of a clip, the facial feature points are located and the coordinates of those belonging to the lip area are recorded. For each vocabulary item, the coordinate differences of the feature points between successive frames of the clip are calculated to form a sequence of vectors. The training set comprises these vectors, the corresponding vocabulary items, and the original clips. The training model uses a CNN and ResNet to extract lip features and an LSTM and GRU to extract the temporal features of the clips. ResNet and GRU are applied to the original clips, and the LSTM is applied to the sequences of coordinate differences. The last stage of the training model is a fully connected layer. With different combinations of training models, the lip-reading system built in this study reaches accuracies of up to 76% when predicting vocabulary items and up to 62% when predicting whole sentences, which confirms its feasibility in practical applications.
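The two preprocessing steps described in the abstract, brightness adjustment and frame-to-frame coordinate differences, can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that the lip landmarks of every frame have already been extracted as (x, y) pairs, that differences are taken between consecutive frames, and that brightness is changed by an additive offset on 8-bit grayscale pixels. The function names `adjust_brightness` and `coordinate_differences` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # lip landmarks of one video frame (assumed already extracted)


def adjust_brightness(pixels: List[List[int]], offset: int) -> List[List[int]]:
    """Shift every 8-bit grayscale pixel by `offset`, clamped to 0..255.

    Assumption: an additive offset; the thesis may use a different adjustment.
    """
    return [[max(0, min(255, p + offset)) for p in row] for row in pixels]


def coordinate_differences(frames: List[Frame]) -> List[List[float]]:
    """Return one flattened (dx, dy, dx, dy, ...) vector per consecutive frame pair.

    Assumption: differences are taken between frame t and frame t + 1.
    """
    if any(len(frame) != len(frames[0]) for frame in frames):
        raise ValueError("every frame must contain the same number of landmarks")
    sequence = []
    for prev, curr in zip(frames, frames[1:]):
        vector = []
        for (x0, y0), (x1, y1) in zip(prev, curr):
            vector.extend((x1 - x0, y1 - y0))
        sequence.append(vector)
    return sequence


if __name__ == "__main__":
    # Three toy frames with two lip landmarks each.
    clip = [[(0, 0), (2, 2)], [(1, 0), (2, 3)], [(1, 1), (4, 3)]]
    print(coordinate_differences(clip))
    print(adjust_brightness([[0, 100, 250]], 10))
```

A clip of N frames yields N - 1 difference vectors, so the resulting sequence could be fed to a recurrent layer such as the LSTM mentioned in the abstract. Because the vectors encode motion rather than absolute position, they are unaffected by where the face sits in the image, although they still depend on the face's scale.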
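Table of contents: Abstract
Extended Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Methods
1.4 Research Contributions
Chapter 2 Literature Review
2.1 Automatic Speech Recognition
2.2 Lip Reading
Chapter 3 System Design and Architecture
3.1 System Workflow
3.2 Self-Collected Data and Processing
3.3 Face Tracking and Lip Contour Image Generation
3.4 Data Augmentation by Brightness and Contrast
3.5 Forming Vectors from Feature Point Differences
3.6 Network Structure and Feature Extraction
3.6.1 Convolutional Neural Network
3.6.2 Long Short-Term Memory Network
3.6.3 Gated Recurrent Unit
3.6.4 Residual Network
3.7 Loss Function and Target Lip Shape Determination
Chapter 4 Experimental Design and Results
4.1 Data and Experimental Setup
4.2 Evaluation Tools
4.3 Experimental Results and Analysis
4.4 System Demonstration
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References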
  • The author has authorized on-campus browsing and printing of the electronic full text, publicly available from 2020-08-20.
