Mask-based Speech Enhancement Considering Speech Quality and Acoustic Confidence for Noisy Speech Recognition
Institute of Computer Science and Information Engineering
automatic speech recognition
deep neural network
convolutional neural network
In recent years, the number of network-connected devices has risen rapidly, and many of them interact with people through automatic speech recognition (ASR). Voice operation has gradually gained public acceptance, but everyday environments contain considerable background noise, which makes recognition difficult. Improving the recognition of noisy speech through speech enhancement is therefore important. Moreover, although enhancement models trained with mean square error (MSE) alone as the loss function can reduce noise effectively, a gap remains between the enhanced speech and the recognition results obtained from it.
Therefore, the main contribution of this thesis is a mask for speech enhancement that takes speech quality and acoustic confidence into account, with the aim of reducing the word error rate (WER) of noisy speech recognition. First, we extract speaker, phone, and noise features and feed them, together with the noisy speech, into the mask generation model so that the masked (enhanced) speech has better quality. In addition, we train the mask generation model with a loss function that combines the phone confidence obtained from the Kaldi ASR system, a phone judgment model trained on clean speech, the MSE, and losses based on STOI and PESQ. Compared with the baseline model, the proposed model improves the quality of the enhanced speech and reduces the WER.
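The masking and loss combination described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `apply_mask` and `combined_loss` are hypothetical, the mask is assumed to be a real-valued gain in [0, 1] applied to a magnitude spectrogram, and the phone judgment loss and the STOI/PESQ-based loss are passed in as precomputed scalars because the abstract does not specify how they are computed. The multiplicative combination follows the configuration reported in the experiments.

```python
import numpy as np


def apply_mask(noisy_mag, mask):
    """Apply a time-frequency mask to a noisy magnitude spectrogram.

    Assumption: the mask is a real-valued gain per time-frequency bin,
    clipped to [0, 1] before it is applied element-wise.
    """
    noisy_mag = np.asarray(noisy_mag, dtype=np.float64)
    mask = np.clip(np.asarray(mask, dtype=np.float64), 0.0, 1.0)
    return mask * noisy_mag


def combined_loss(enhanced_mag, clean_mag, phone_loss, quality_loss):
    """Multiply the MSE by two auxiliary loss terms.

    phone_loss and quality_loss are hypothetical placeholders for the
    phone judgment loss and the STOI/PESQ-based loss, respectively.
    """
    enhanced_mag = np.asarray(enhanced_mag, dtype=np.float64)
    clean_mag = np.asarray(clean_mag, dtype=np.float64)
    mse = float(np.mean((enhanced_mag - clean_mag) ** 2))
    return mse * phone_loss * quality_loss


if __name__ == "__main__":
    noisy = np.array([[2.0, 4.0], [1.0, 3.0]])
    clean = np.array([[1.0, 4.0], [0.0, 1.0]])
    mask = np.array([[0.5, 1.5], [0.0, 1.0]])  # 1.5 is clipped to 1.0
    enhanced = apply_mask(noisy, mask)
    print(enhanced)
    print(combined_loss(enhanced, clean, phone_loss=0.5, quality_loss=0.4))
```

One consequence of multiplying the terms, rather than summing them, is that the gradient of each term is scaled by the current values of the others, so no explicit weighting coefficients are needed; whether this matches the thesis's exact formulation is an assumption here.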
In the experiments, we used TIMIT as the speech data and NOISEX-92 as the noise data, and mixed speech and noise at signal-to-noise ratios (SNRs) of -10, -5, 0, 5, and 10 dB. With a loss formed by multiplying the MSE, the phone judgment loss, and the STOI and PESQ losses, the proposed model improved STOI by 2.14% and PESQ by 7.22% over the baseline model. It also achieved the lowest WER, 21.59%, compared with 33.72% for the baseline model and 29.08% for noisy speech without enhancement, an absolute reduction of 12.13 and 7.49 percentage points, respectively. These experiments show that the proposed method substantially improves the recognition of noisy speech.
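The two quantities that frame the experiments, the mixing SNR and the WER, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's data pipeline: `mix_at_snr` and `word_error_rate` are hypothetical helper names, the noise is tiled or truncated to the speech length, signal power is taken as the mean squared amplitude over the whole utterance, and WER is computed as word-level edit distance divided by the number of reference words.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Tile or truncate the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise segment is silent; SNR is undefined")
    # Choose gain g so that 10*log10(speech_power / (g**2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)  # stand-in for a speech utterance
    noise = rng.standard_normal(4000)    # stand-in for a noise recording
    for snr_db in (-10, -5, 0, 5, 10):
        noisy = mix_at_snr(speech, noise, snr_db)
        measured = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
        print(snr_db, round(measured, 3))
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Scaling the noise rather than the speech keeps the speech level fixed across SNR conditions; the opposite convention is equally valid, and the abstract does not say which one was used.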
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Literature Review
1.3.1 Methods of Speech Enhancement
1.3.2 Methods with Phones as the Feature
1.3.3 Methods with Speaker Identity as the Feature
1.3.4 Methods with Noise Information as the Feature
1.3.5 Judgment on the Speech Quality to Guide Training
1.3.6 Deep Neural Network
1.3.7 Convolutional Neural Network
1.3.8 Kaldi Speech Recognition System
1.4 Problems
1.5 Proposed Method
Chapter 2 System Framework
2.1 Phone Feature Extraction Model
2.1.1 CNN-based Phone Feature Extraction Model
2.2 Speaker Feature Extraction Model
2.2.1 Time Delay Neural Network
2.2.2 TDNN-based Speaker Feature Extraction Model
2.3 Noise Estimation Model
2.4 Judgment Model
2.4.1 Objective Judgment Model of the Speech Quality
2.4.2 Phone-based Judgment Model
2.4.3 Phone Confidence from Kaldi
2.5 Mask Generation Model
Chapter 3 Experimental Results and Discussion
3.1 Evaluation Metrics
3.1.1 STOI
3.1.2 PESQ
3.1.3 Word Error Rate
3.2 Dataset
3.3 Experimental Results and Discussion
3.3.1 Evaluation of Phone Feature Extraction Model
3.3.2 Evaluation of Noise Estimation Model
3.3.3 Evaluation of Judgment Model
3.3.4 Evaluation of Mask Generation Model
Chapter 4 Conclusion and Future Work