System ID U0026-0812200911332789
Title (Chinese) 以最大熵準則結合語音及語言特徵於語音辨識之研究
Title (English) Integration of Acoustic and Linguistic Features for Maximum Entropy Speech Recognition
University National Cheng Kung University
Department (Chinese) Department of Computer Science and Information Engineering (Master's and Doctoral Program)
Department (English) Institute of Computer Science and Information Engineering
Academic Year 93 (2004-2005)
Semester 2
Year of Publication 94 (2005)
Author (Chinese) 錢鐸樟
Author (English) To-Chang Chien
Student ID p7692153
Degree Master's
Language Chinese
Pages 92
Committee Advisor - 簡仁宗 (Jen-Tzung Chien)
Committee member - 王駿發
Committee member - 吳宗憲
Committee member - 廖弘源
Committee member - 余孝先
Keywords (Chinese) language model, acoustic model, maximum mutual information, maximum entropy, discriminative
Keywords (English) discriminative training, maximum entropy, speech recognition, maximum mutual information
Subject Classification
Abstract (Chinese) In a traditional speech recognition system, the acoustic and linguistic information sources are usually assumed to be mutually independent, and the parameters of their respective models are trained separately. During recognition, the probabilities of the acoustic model and the language model are combined to form the final decision rule. However, since the candidate word strings produced during recognition and the input speech signal influence each other, the acoustic and language models should take this relationship into account. We therefore propose an integrated maximum entropy (ME) model as the main framework of the speech recognizer, and show how acoustic and linguistic features can be trained jointly under a consistent model architecture. Within this framework, the covariation between acoustic and linguistic features is properly characterized inside the integrated model. On the topic of discriminative maximum entropy models, we establish, through theoretical analysis, the relationship between the integrated model and discriminative training criteria. Moreover, a speech recognition system built on the maximum entropy framework can effectively incorporate additional information sources, such as semantic topics and long-distance association patterns, into a unified acoustic and language model. We implement the proposed acoustic and language models in a spontaneous broadcast news transcription system and compare them against a conventional maximum likelihood system in which the acoustic and language models are trained independently.
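The conventional decision rule that the abstract contrasts against — independently trained acoustic and language model scores combined at decode time — can be sketched as below. The hypothesis strings, score values, and language-model weight are invented for illustration and are not taken from the thesis:

```python
import math

# Plug-in MAP decision rule: choose the word string W maximizing
# log P(X|W) + lm_weight * log P(W), where the acoustic score P(X|W)
# and language model score P(W) come from independently trained models.
def plug_in_map_decode(hypotheses, acoustic_logprob, lm_logprob, lm_weight=1.0):
    return max(hypotheses,
               key=lambda w: acoustic_logprob[w] + lm_weight * lm_logprob[w])

hyps = ["today news", "to day news"]
am = {"today news": math.log(0.2), "to day news": math.log(0.3)}
lm = {"today news": math.log(0.05), "to day news": math.log(0.001)}
best = plug_in_map_decode(hyps, am, lm)  # "today news": LM evidence outweighs AM
```

Because the two scores are estimated separately and merely added, any dependence between the acoustic signal and the candidate word string is ignored, which is the assumption the integrated ME model relaxes.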

Abstract (English) In a traditional speech recognition system, the acoustic and linguistic information sources are assumed to be independent. The parameters of the acoustic hidden Markov model (HMM) and the linguistic n-gram model are estimated individually and then combined to build a plug-in maximum a posteriori (MAP) classification rule. However, the acoustic model and the language model are correlated in essence, so the independence assumption should be relaxed to improve recognition performance. In this study, we propose an integrated approach based on the maximum entropy (ME) principle, in which acoustic and linguistic features are optimally combined in a unified framework. Using this approach, the associations between acoustic and linguistic features are explored and merged in the integrated models. On the issue of discriminative training, we also establish the relationship between ME and discriminative maximum mutual information (MMI) models. In addition, the ME integrated model is general, so semantic topics and long-distance association patterns can be further incorporated. In the experiments, we apply the proposed ME model to broadcast news transcription using the MATBN database. Preliminary results show an improvement over a conventional speech recognition system based on the plug-in MAP classification rule.
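The ME models described in the abstract are trained with the GIS iterative algorithm (covered in Chapter 3 of the outline below). As a minimal, self-contained sketch of the general technique, here is a conditional ME model fitted with GIS; the toy events, feature functions, and iteration count are invented for illustration and are not from the thesis:

```python
import math

# Toy events (context, label): stand-ins for (linguistic/acoustic
# context, word) pairs.
data = [("a", "x"), ("a", "x"), ("a", "y"),
        ("b", "y"), ("b", "y"), ("b", "x")]
labels = ["x", "y"]

# Binary feature functions f_i(context, label).
features = [
    lambda c, y: 1.0 if (c, y) == ("a", "x") else 0.0,
    lambda c, y: 1.0 if (c, y) == ("b", "y") else 0.0,
]

def model_probs(context, lam):
    """Conditional ME model: p(y|c) proportional to exp(sum_i lam_i*f_i(c,y))."""
    scores = [math.exp(sum(l * f(context, y) for l, f in zip(lam, features)))
              for y in labels]
    z = sum(scores)
    return [s / z for s in scores]

def gis(data, n_iter=200):
    """Generalized Iterative Scaling for the conditional ME model."""
    C = 1.0  # max total feature count per event (here at most one feature fires)
    n = len(data)
    lam = [0.0] * len(features)
    # Empirical feature expectations over the training events.
    emp = [sum(f(c, y) for c, y in data) / n for f in features]
    for _ in range(n_iter):
        # Model feature expectations under the current weights.
        mod = [0.0] * len(features)
        for c, _ in data:
            p = model_probs(c, lam)
            for i, f in enumerate(features):
                mod[i] += sum(pj * f(c, yj) for pj, yj in zip(p, labels)) / n
        # GIS update: lam_i += (1/C) * log(emp_i / mod_i).
        lam = [l + (1.0 / C) * math.log(e / m)
               for l, e, m in zip(lam, emp, mod)]
    return lam

lam = gis(data)
p_x_given_a = model_probs("a", lam)[0]  # converges toward 2/3, matching the data
```

At convergence the model's feature expectations match the empirical ones, which is exactly the constraint the ME principle imposes; the integrated model in the thesis applies the same machinery to joint acoustic-linguistic features.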

Table of Contents Chinese Abstract 5
Abstract 6
Acknowledgments 8
Table of Contents 9
List of Figures 12
List of Tables 13
Chapter 1: Introduction 14
1.1 Research Background 14
1.2 Research Motivation and Methods 15
1.3 Thesis Outline 17
Chapter 2: Speech Recognition Systems 18
2.1 Overview of Speech Recognition 18
2.2 Hidden Markov Models 20
2.3 Language Models 28
2.4 Discriminative Training 33
2.4.1 Minimum Classification Error (MCE) 34
2.4.2 Maximum Mutual Information (MMI) 40
2.4.3 Other Research on Discriminative Training 43
Chapter 3: Maximum Entropy Language Models 45
3.1 Maximum Entropy Models and the GIS Iterative Algorithm 45
3.2 Language Modeling Based on the Maximum Entropy Principle 49
3.3 Latent Maximum Entropy (LME) Models and the EM-IS Iterative Algorithm 51
Chapter 4: Integrated Maximum Entropy Acoustic and Language Models 54
4.1 Discriminative Maximum Entropy Acoustic Models 54
4.2 Discriminative Maximum Entropy Language Models 56
4.3 Integrated Maximum Entropy Models Combining Acoustic and Linguistic Information 60
4.4 Relationship between the Integrated Maximum Entropy Model and MMIE Discriminative Training 65
Chapter 5: Experiments 68
5.1 Experimental Setup 68
5.2 Experimental Results 70
5.3 System Demonstration 75
Chapter 6: Conclusions and Future Work 79
References 80
Appendix: Interspeech 2005 Paper 88
Author Biography 92
References [1] J. Bellegarda, “Exploiting latent semantic information in statistical language modeling,” Proceedings of the IEEE, vol. 88, no. 8, pp. 1279-1296, August 2000.
[2] J. Bellegarda, “Large vocabulary speech recognition with multispan statistical language models,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 76-84, January 2000.
[3] J. Bellegarda, “A multispan language modeling framework for large vocabulary speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 5, pp. 456-467, September 1998.
[4] J. R. Bellegarda, “Statistical language model adaptation: review and perspectives,” Speech Communication, vol. 42, pp. 93-108, 2004.
[5] L. Bahl, P. Brown, P. de Souza and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, vol. 11, pp. 49-52, April 1986.
[6] M. Berry, S. Dumais and G. O’Brien, “Using linear algebra for intelligent information retrieval,” SIAM Review, vol. 37, no. 4, pp. 573-595, 1995.
[7] A. Berger, S. D. Pietra and V. D. Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol. 22, no. 1, pp. 39-71, 1996.
[8] J. Bellegarda and K. Silverman, “Natural language spoken interface control using data-driven semantic inference,” IEEE Transactions on Speech and Audio Processing, vol. 11, pp. 267-277, 2003.
[9] C.-H. Chueh, T.-C. Chien, and J.-T. Chien, “Discriminative maximum entropy language model for speech recognition,” submitted to Proc. of Interspeech, 2005.
[10] C.-H. Chueh, J.-T. Chien, and H.-M. Wang, “A maximum entropy approach for integrating semantic information in statistical language models,” in Proc. International Symposium on Chinese Spoken Language Processing (ISCSLP2004), pp. 309-312, Hong Kong, December 2004.
[11] S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech and Language, vol. 13, pp. 359-394, 1999.
[12] C. Chelba and F. Jelinek, “Structured language modeling,” Computer Speech and Language, vol. 14, no. 4, pp. 283-332, October 2000.
[13] P. C. Chang and B.-H. Juang, “Discriminative training of dynamic programming based speech recognizers,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, pp. 135-143, April 1993.
[14] W. Chou, C.-H. Lee and B.-H. Juang, “Segmental GPD training of an hidden Markov model based speech recognizer,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, vol. 1, pp. 473-476, 1992.
[15] Z. Chen, K.-F. Lee, M.-J Li, “Discriminative training on language model,” in Proc. International Conference on Spoken Language Processing, pp. 16-20, 2000.
[16] P. R. Clarkson and A. J. Robinson, “Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, pp.799-802, 1997.
[17] S. F. Chen and R. Rosenfeld, “A survey of smoothing techniques for ME models,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, January 2000.
[18] A. Dempster, N. Laird and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1-38, 1977.
[19] J. Darroch and D. Ratcliff. “Generalized iterative scaling for log-linear models,” The Annals of Mathematical Statistics, vol. 43, pp. 1470-1480, 1972.
[20] M. Federico, “Efficient language model adaptation through MDI estimation,” in Proc. of EUROSPEECH, pp. 1583-1586, 1999.
[21] M. Federico, “Language model adaptation through topic decomposition and MDI estimation,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, 2002.
[22] J.-L. Gauvain, and C.-H. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observation of Markov chain,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 291-298, 1994.
[23] X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Pearson Education 2001-04-25.
[24] R. Iyer and M. Ostendorf, “Relevance weighting for combining multi-domain data for n-gram language modeling,” Computer Speech and Language, vol. 13, pp. 267-282, 1999.
[25] E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, vol. 106, pp. 620-630, 1957.
[26] B.-H. Juang, W. Hou and C.-H. Lee, “Minimum classification error rate methods for speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 3, pp. 257-265, May 1997.
[27] B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, vol. 40, pp. 3043-3054, December 1992.
[28] D. Klakow, “Selecting articles from the language model training corpus,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, pp. 1695-1698, 2000.
[29] H.-K. J. Kuo, E. Fosle-Lussier, H. Jiang and C.-H. Lee, “Discriminative training of language models for speech recognition,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, vol. 1, pp. I325-328, 2002.
[30] K. W. Ma, G. Zavaliagkos and M. Meteer, “Bi-modal sentence structure for language modeling,” Speech Communication, vol. 31, pp. 51-67, 2000.
[31] R. Kuhn and R. De Mori, “A cache-based natural language model for speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 6, pp. 570-583, 1990.
[32] S. Katagiri, B.-H. Juang, and C.-H. Lee, “Pattern recognition using a family of design algorithms based upon the generalized probabilistic descent method,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2345–2373, Nov. 1998.
[33] S. Katagiri, C.-H. Lee, and B.-H. Juang, “New discriminative algorithm based on the generalized probabilistic descent method,” in Proc. of IEEE Workshop on Neural Network for Signal Processing, Princeton, pp.299–309, September 1991.
[34] S. Khudanpur and J. Wu, “Maximum entropy techniques for exploiting syntactic, semantic and collocational dependencies in language modeling,” Computer Speech and Language, pp. 355-372, 2000.
[35] S. Khudanpur and J. Wu, “A maximum entropy language model integrating N-grams and topic dependencies for conversational speech recognition,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, 1999.
[36] Q. Li, “Discovering relations among discriminative training objectives,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, Montreal, May 2004.
[37] C.-H. Lee and B.-H. Juang, “A survey on automatic speech recognition with an illustrative example on continuous speech recognition of Mandarin,” Computational Linguistics and Chinese Language Processing, vol. 1, no.1, pp. 1-36, August 1996.
[38] Q. Li, B.-H. Juang, “A new algorithm for fast discriminative training,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, vol. 1, pp. 97-100, 2002.
[39] Q. Li, B.-H. Juang, “Fast discriminative training for sequential observations with application to speaker identification,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, vol. 2, pp. 397-400, 2003.
[40] R. Lau, R. Rosenfeld, and S. Roukos, “Trigger-based language models: A maximum entropy approach,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, vol. 2, pp. 45-48, 1993.
[41] W. Macherey and H. Ney, “A comparative study on maximum entropy and discriminative training for acoustic modeling in automatic speech recognition,” in Proc. of EUROSPEECH, vol. 1, pp. 493-496, September 2003.
[42] Y. Normandin, R. Cardin and R. De Mori, “High-performance connected digit recognition using maximum mutual information estimation,” IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 299-311, 1994.
[43] S. D. Pietra, V. D. Pietra and J. Lafferty, “Inducing features of random fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, pp. 380-393, April 1997.
[44] S. Della Pietra, V. Della Pietra, R. L. Mercer and S. Roukos, “Adaptive language modeling using minimum discriminant estimation,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, vol. 1, pp. 633-636, March 1992.
[45] C. Paciorek and R. Rosenfeld, “Minimum classification error training in exponential language models,” in Proc. of NIST/DARPA Speech Transcription Workshop, 2002.
[46] D. Povey and P. C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, 2002.
[47] R. Rosenfeld, “A maximum entropy approach to adaptive statistical language model,” Computer Speech and Language, vol. 10, pp. 187-228, 1996.
[48] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman, May 1999.
[49] R. Rosenfeld, S. F. Chen and X. Zhu, “Whole-sentence exponential language models: a vehicle for linguistic-statistical integration,” Computer Speech and Language, vol. 15, pp. 55-73, 2001.
[50] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[51] P. S. Rao, M. D. Monkowski, and S. Roukos, “Language model adaptation via minimum discrimination information,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, Detroit, Michigan, USA, pp. 161-164, 1995.
[52] R. Schluter, W. Macherey, “Comparison of discriminative training criteria,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, vol. 1, pp. 493-496, 1998.
[53] I. H. Witten and T. C. Bell, “The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression,” IEEE Transactions on Information Theory, vol. 37, pp. 1085-1094, 1991.
[54] J. Wu and S. Khudanpur, “Building a topic-dependent maximum entropy model for very large corpora,” in Proc. of International Conference on Acoustic, Speech and Signal Processing, pp. I777-780, 2002.
[55] S. Wang, D. Schuurmans, F. Peng, Y. Zhao, “Learning Mixture Models with the Latent Maximum Entropy Principle,” in Proceedings of ICML, Washington DC, 2003.
[56] S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, The HTK Book (Version 2.0), ECRL, 1995.
[57] G. D. Zhou and K. T. Lua, “Interpolation of n-gram and mutual-information based trigger pair language models for Mandarin speech recognition,” Computer Speech and Language, vol. 13, pp. 125-141, 1999.
Full-Text Availability
  • On-campus browsing/printing of the electronic full text authorized, publicly available from 2005-07-21.
  • Off-campus browsing/printing of the electronic full text authorized, publicly available from 2005-07-21.

