

   The electronic full text has not yet been authorized for public access; for the print copy, please consult the library catalog.
(Note: if no record can be found, or the holding status shows "closed stacks, not open to the public," the thesis is not in the stacks and cannot be accessed.)
System ID  U0026-1308201415480500
Title (Chinese)  基於少量未標記語料自動切割與韻律詞階層參數平滑之個人自發性中文語音合成
Title (English)  Automatic Segmentation of Small-sized Unlabeled Data and Prosodic Word-level Smoothing for Personalized Spontaneous Mandarin Speech Synthesis
University  National Cheng Kung University
Department (Chinese)  資訊工程學系
Department (English)  Institute of Computer Science and Information Engineering
Academic Year  102 (2013–2014)
Semester  2
Publication Year  103 (2014)
Author (Chinese)  謝明格
Author (English)  Ming-Ge Shie
Student ID  p76014240
Degree  Master's
Thesis Language  English
Pages  61
Committee  Advisor: 吳宗憲
Committee member: 王駿發
Committee member: 楊家輝
Committee member: 王新民
Committee member: 陳嘉平
Keywords (Chinese)  自動語音切割, 韻律詞結構參數平滑化, 自發性語音合成, 個人化語音合成
Keywords (English)  Auto segmentation, Prosodic word-level smoothing, Spontaneous speech synthesis, Personalized speech synthesis
Subject Classification
Abstract (Chinese)  Speech is the most direct way for people to communicate with one another, and recent electronic products commonly offer speech interfaces that let users interact with machines directly by voice. Synthesizing spontaneous speech that resembles everyday conversation is therefore an important research topic.
Spontaneous speech is a highly personalized form of speech, so the goal of this thesis is to build a spontaneous speech synthesis system with personalized characteristics. In this line of research, collecting the speech corpus is usually the most costly and labor-intensive step. To make the system easier to apply in daily life, corpus labeling and voice-model training must be carried out fully automatically while requiring the user to record as little speech as possible, so that a personalized spontaneous speech synthesis system can be built at reduced complexity.
This thesis first proposes an automatic segmentation method for spontaneous speech, so that users do not need to segment and label the corpus themselves. A voice model with the user's timbre is then obtained through model adaptation. At synthesis time, parameter smoothing and speaking-rate adjustment are applied at the prosodic-word level, and finally a pre-trained post-filter brings the speech parameters closer to the target speaker's voice, so that the synthesized speech exhibits spontaneous, personalized characteristics.
In the experiments, speech from different speakers was used to evaluate the system. Both subjective and objective results show that the proposed system synthesizes speech that is more spontaneous and closer to the speaker's own timbre.
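The automatic segmentation step summarized above is detailed in Chapter 3 of the thesis, where boundaries are chosen from candidate segmentation points by dynamic programming (Section 3.2.3). The following is only a hedged sketch of that general technique, not the thesis's actual algorithm: `candidates`, `cost`, and `n_segments` are hypothetical names, and the per-segment alignment cost is assumed to be given.

```python
# Illustrative sketch (not the thesis's implementation): pick segment
# boundaries from a set of candidate points by dynamic programming,
# assuming each span [a, b) has a precomputed alignment cost.

def dp_segment(candidates, cost, n_segments):
    """Return the boundary list (a subset of `candidates`, which must start
    at 0 and end at the utterance length) that splits the utterance into
    `n_segments` pieces with minimum total cost.

    cost(a, b) -> float: cost of treating frames [a, b) as one segment.
    """
    INF = float("inf")
    n = len(candidates)
    # best[k][j]: min cost of reaching candidates[j] using k segments
    best = [[INF] * n for _ in range(n_segments + 1)]
    back = [[-1] * n for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(1, n):
            for i in range(j):
                if best[k - 1][i] == INF:
                    continue
                c = best[k - 1][i] + cost(candidates[i], candidates[j])
                if c < best[k][j]:
                    best[k][j] = c
                    back[k][j] = i
    # trace back the chosen boundaries from the final candidate
    bounds, j = [candidates[-1]], n - 1
    for k in range(n_segments, 0, -1):
        j = back[k][j]
        bounds.append(candidates[j])
    return bounds[::-1]
```

For example, with candidate points [0, 3, 5, 10], two segments, and a cost that prefers segments of length 5, the sketch selects the boundaries [0, 5, 10].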
Abstract (English)  Speech is an intuitive way for people to communicate. Recently, people can control electronic devices with speech commands and receive synthesized voice responses, so synthesizing natural-sounding speech has become an important research topic.
Spontaneous speech is highly speaker-dependent, and speaking styles differ considerably from person to person. This thesis aims to construct a personalized spontaneous speech synthesis system. The main difficulty is that collecting a speech corpus is time-consuming and labor-intensive. To make the system practical for daily use, the user should need to record only a few utterances, and the system should automatically segment and label the corpus and construct the voice models for the target speaker.
This thesis proposes an automatic segmentation method to label the corpus and build a voice model whose synthesized speech is perceived as similar to the target speaker. In the synthesis phase, prosodic word-level parameter smoothing and speaking-rate adjustment of the prosodic words are proposed for spontaneous speech synthesis. A pre-trained post-filter is adopted to make the synthesized speech sound even closer to the target speaker.
According to objective and subjective tests, the proposed method improves the spontaneity and personalization of the synthesized speech for a target speaker compared to the MLLR-based model adaptation method.
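The prosodic word-level smoothing mentioned above is developed in Section 3.4.1, with the smoothing ratio predicted per context by decision-tree clustering (Section 3.4.2). As a loose illustration only, under the assumption that smoothing means blending parameter frames near a boundary toward a local moving average, a minimal sketch follows; `smooth_boundary` and its arguments are hypothetical names, and the constant `ratio` merely stands in for the thesis's context-dependent smoothing ratio.

```python
# Illustrative only: smooth a parameter track (e.g., log-F0 or one spectral
# coefficient per frame) across a prosodic-word boundary by cross-fading
# each nearby frame with its local moving average.

def smooth_boundary(track, boundary, width=3, ratio=0.5):
    """Blend frames within `width` of `boundary` toward their local mean.

    ratio = 0.0 leaves the track unchanged; ratio = 1.0 replaces each
    affected frame with the mean of its (2 * width + 1)-frame window.
    """
    out = list(track)
    for t in range(max(0, boundary - width),
                   min(len(track), boundary + width + 1)):
        lo, hi = max(0, t - width), min(len(track), t + width + 1)
        local_mean = sum(track[lo:hi]) / (hi - lo)
        out[t] = (1.0 - ratio) * track[t] + ratio * local_mean
    return out
```

On a step-shaped track such as [0, 0, 0, 10, 10, 10] with a boundary at frame 3, the blended frames ramp up gradually instead of jumping, which is the qualitative effect the smoothing aims for; the thesis additionally adjusts speaking rate and applies a post-filter afterwards.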
Table of Contents  Chinese Abstract I
Abstract III
Acknowledgements V
Table of Contents VI
List of tables IX
List of figures X
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Problem 3
1.4 Proposed Ideas 4
1.5 Organization 5
Chapter 2 Related Work 7
2.1 Model-based synthesis and model adaptation 7
2.1.1 Model-based synthesis 7
2.1.2 Model adaptation 8
2.2 Spontaneous speech 11
2.3 Speech segmentation 12
2.4 Spontaneous speech generation 12
2.5 Articulatory features 15
Chapter 3 Proposed method 18
3.1 Mandarin model definition 19
3.2 Automatic segmentation 20
3.2.1 Segmentation 21
3.2.2 Candidate Segmentation Point Expansion 22
3.2.3 Segmentation Based on Dynamic Programming 23
3.3 Model construction 25
3.3.1 Model Parameter modification for smooth speech generation 25
3.4 Spontaneous speech generation 28
3.4.1 Prosodic Word-Level Smoothing 29
3.4.2 Decision tree-based clustering for parameter smoothing ratio 32
3.5 Synthesis phase 38
3.5.1 Voicing detection for smoothed segment 38
3.5.2 Post-filter 40
Chapter 4 Experimental results 45
4.1 Corpus 45
4.1.1 Corpus preparation 45
4.1.2 Evaluation Metrics 47
4.2 Objective test 49
4.3 Subjective test 50
Chapter 5 Conclusion and future works 54
5.1 Conclusion 54
5.2 Future works 54
References 55
Appendix 60
Appendix 1 60
References  [1] 行政院研究發展考核委員會 (Research, Development and Evaluation Commission, Executive Yuan), 102年個人/家戶數位機會調查報告 (Survey Report on Individual/Household Digital Opportunities, 2013), Taiwan, 2013.
[2] Tomoki Koriyama, Takashi Nose and Takao Kobayashi, “On the Use of Extended Context for HMM-Based Spontaneous Conversational Speech Synthesis,” In: INTERSPEECH, 2011, p. 2657-2660.
[3] Yu Maeno, Takashi Nose, Takao Kobayashi, Tomoki Koriyama, Yusuke Ijima, Hideharu Nakajima, Hideyuki Mizuno and Osamu Yoshioka, “Prosodic variation enhancement using unsupervised context labeling for HMM-based expressive speech synthesis,” Speech Communication, 2014, 57: 144-154.
[4] Chung-Hsien Wu, Chi-Chun Hsia, Chung-Han Lee and Mai-Chun Lin, “Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis,” Audio, Speech, and Language Processing, IEEE Transactions on, 2010, 18.6: 1394-1405.
[5] Carlos Monzo, Ignasi Iriondo and Joan Claudi Socoró, “Voice Quality Modelling for Expressive Speech Synthesis,” The Scientific World Journal, 2014, 2014.
[6] Chung-Hsien Wu, Chung-Han Lee and Chung-Hau Liang, “Idiolect extraction and generation for personalized speaking style modeling,” Audio, Speech, and Language Processing, IEEE Transactions on, 2009, 17.1: 127-137.
[7] Yi-Chin Huang, Chung-Hsien Wu and Yu-Ting Chao, “Personalized spectral and prosody conversion using frame-based codeword distribution and adaptive CRF,” Audio, Speech, and Language Processing, IEEE Transactions on, 2013, 21.1: 51-62.
[8] Asaf Rendel, Alexander Sorin, Ron Hoory and Andrew Breen, “Towards automatic phonetic segmentation for TTS,” In: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012. p. 4533-4536.
[9] Chiu-yu Tseng, Zhao-yu Su and Lin-shan Lee, “Mandarin spontaneous narrative planning-prosodic evidence from National Taiwan University lecture corpus,” In: INTERSPEECH, 2009, p. 2943-2946.
[10] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi and Tadashi Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” In: Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on. IEEE, 2000. p. 1315-1318.
[11] Keiichi Tokuda, Takashi Masuko, N. Miyazaki and Takao Kobayashi, “Multi-space probability distribution HMM,” IEICE Trans. Inf. & Syst., 2002, vol. E85-D, no. 3, pp. 455-464.
[12] Junichi Yamagishi, “Average-voice-based speech synthesis,” Tokyo Institute of Technology, 2006.
[13] Junichi Yamagishi, Takao Kobayashi, Yuji Nakano, Katsumi Ogata and Juri Isogai, “Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm,” Audio, Speech, and Language Processing, IEEE Transactions on, 2009, 17.1: 66-83.
[14] Yuya Akita and Tatsuya Kawahara, “Statistical transformation of language and pronunciation models for spontaneous speech recognition,” Audio, Speech, and Language Processing, IEEE Transactions on, 2010, 18.6: 1539-1549.
[15] Chung-Han Lee, Chung-Hsien Wu and Jun-Cheng Guo, “Pronunciation variation generation for spontaneous speech synthesis using state-based voice transformation,” In: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010. p. 4826-4829.
[16] Kishore Prahallad, Alan W. Black and Ravishankhar Mosur, “Sub-phonetic modeling for capturing pronunciation variations for conversational speech synthesis,” In: Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on. IEEE, 2006. p. I-I.
[17] Chiu-yu Tseng and Yeh-lin Lee, “Speech rate and prosody units: Evidence of interaction from Mandarin Chinese,” In: Speech Prosody 2004, International Conference. 2004.
[18] Chiu-yu Tseng, Zhao-yu Su and Lin-shan Lee, “Mandarin spontaneous narrative planning-prosodic evidence from National Taiwan University lecture corpus,” In: INTERSPEECH, 2009, p. 2943-2946.
[19] Shu-Chuan Tseng and Yi-Fen Liu, “Annotation of Mandarin conversational dialogue corpus,” Academia Sinica, CKIP Tech. Rep.-01, 2002.
[20] Shu-Chuan Tseng, “Syllable contractions in a Mandarin conversational dialogue corpus,” International journal of corpus linguistics, 2005, 10.1: 63-83.
[21] Chen-Hsiu Kuo, “The Production of Syllable Contraction in Taiwan Mandarin,” 2011. PhD Thesis.
[22] Geoffrey Zweig and Patrick Nguyen, “A segmental CRF approach to large vocabulary continuous speech recognition,” In: Automatic Speech Recognition & Understanding, 2009. ASRU 2009. IEEE Workshop on. IEEE, 2009. p. 152-157.
[23] Mauro Cettolo, Michele Vescovi and Romeo Rizzi, “Evaluation of BIC-based algorithms for audio segmentation,” Computer Speech & Language, 2005, 19.2: 147-170.
[24] David Rybach, Christian Gollan, Ralf Schluter and Hermann Ney, “Audio segmentation for speech recognition using segment features,” In: Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on. IEEE, 2009. p. 4197-4200.
[25] Alan W. Black and John Kominek, “Optimizing segment label boundaries for statistical speech synthesis,” In: ICASSP, 2009, p. 3785-3788.
[26] Christina L. Bennett and Alan W. Black, “Prediction of Pronunciation Variations for Speech Synthesis: A Data-Driven Approach,” In: ICASSP, 2005, p. 297-300.
[27] Steffen Werner, Matthias Eichner, Matthias Wolff and Ruediger Hoffmann, “Toward spontaneous speech synthesis-utilizing language model information in TTS,” Speech and Audio Processing, IEEE Transactions on, 2004, 12.4: 436-445.
[28] Chung-Hsien Wu, Yi-Chin Huang, Chung-Han Lee and Jun-Cheng Guo, “Synthesis of Spontaneous Speech With Syllable Contraction Using State-Based Context-Dependent Voice Transformation,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22.3, 2014: 585-595.
[29] Kishore Prahallad, Alan W. Black and Ravishankhar Mosur, “Sub-phonetic modeling for capturing pronunciation variations for conversational speech synthesis,” In: Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on. IEEE, 2006. p. I-I.
[30] Tomoki Koriyama, Takashi Nose and Takao Kobayashi, “Conversational spontaneous speech synthesis using average voice model,” In: INTERSPEECH, 2010, p. 853-856.
[31] Ellen Eide, “Distinctive features for use in an automatic speech recognition system,” In: INTERSPEECH, 2001, p. 1613-1616.
[32] Mari Ostendorf, “Moving beyond the ‘beads-on-a-string’ model of speech,” In: Proc. IEEE ASRU Workshop. Piscataway, NJ: IEEE, 1999. p. 79-84.
[33] Jinyu Li, Yu Tsao and Chin-Hui Lee, “A Study on Knowledge Source Integration for Candidate Rescoring in Automatic Speech Recognition,” In: ICASSP, 2005, p. 837-840.
[34] Nikko Strom, The NICO artificial neural network toolkit, 1996. http://nico.nikkostrom.com (accessed 2010-09-23).
[35] Min Chu and Yao Qian, “Locating boundaries for prosodic constituents in unrestricted Mandarin texts,” Computational Linguistics and Chinese Language Processing, 2001, 6.1: 61-82.
[36] Chao Huang, Yu Shi, Jianlai Zhou, Min Chu, Terry Wang and Eric Chang, “Segmental tonal modeling for phone set design in Mandarin LVCSR,” In: Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP'04). IEEE International Conference on. IEEE, 2004. p. I-901-4 vol. 1.
[37] 國立臺灣師範大學國音教材編輯委員會 (National Taiwan Normal University) 編纂, 國音學 (Mandarin Phonetics), 台北縣: 正中出版.
[38] G. David Forney Jr., “The Viterbi algorithm,” Proceedings of the IEEE, 1973, 61.3: 268-278.
[39] Peter Grünwald, “A tutorial introduction to the minimum description length principle,” arXiv preprint math/0406077, 2004.
[40] 中華民國教育部國語推行委員會 (Mandarin Promotion Council, Ministry of Education, R.O.C.), 國語注音符號手冊 (Manual of Mandarin Phonetic Symbols), 中華民國教育部, November 2000.
[41] Tomoki Toda and Keiichi Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Transactions on Information and Systems, 2007, 90.5: 816-824.
[42] Shinnosuke Takamichi, Tomoki Toda, Graham Neubig, Sakriani Sakti and Satoshi Nakamura, “A Postfilter to Modify the Modulation Spectrum in HMM-Based Speech Synthesis,” In: Acoustics, Speech, and Signal Processing, 2014. Proceedings. IEEE International Conference on. IEEE, 2014.
[43] Vivek Tyagi, Iain Mccowan, Hemant Misra and Herve Bourlard, “Mel-cepstrum modulation spectrum (MCMS) features for robust ASR,” In: Automatic Speech Recognition and Understanding, 2003. ASRU'03. 2003 IEEE Workshop on. IEEE, 2003. p. 399-404.
Full-text Availability
  • On-campus browsing/printing of the electronic full text is authorized, to be made public from 2024-12-31.


  • If you have any questions, please contact the library.
    Phone: (06)2757575#65773
    E-mail: etds@email.ncku.edu.tw