On training targets for deep learning approaches to clean speech magnitude spectrum estimation
File version
Version of Record (VoR)
Author(s)
Nicolson, Aaron
Paliwal, Kuldip K
Year published
2021
Abstract
Estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three categories: computational auditory scene analysis (CASA), MS, and minimum mean square error (MMSE) estimator training targets. The choice of training target can have a significant impact on speech enhancement/separation and robust ASR performance. Motivated by this, this work identifies the training target that produces enhanced/separated speech of the highest quality and intelligibility, as well as the training target best suited to an ASR front-end. Three different deep neural network (DNN) types and two datasets, which include real-world nonstationary and coloured noise sources at multiple signal-to-noise ratio (SNR) levels, were used for evaluation. Ten objective measures were employed, including the word error rate of the Deep Speech ASR system. It is found that training targets that estimate the a priori SNR for MMSE estimators produce the highest objective quality scores. Moreover, it is established that the gain of MMSE estimators and the ideal amplitude mask produce the highest objective intelligibility scores and are the most suitable for an ASR front-end.
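As context for the abstract, the two training targets it highlights take the following standard forms in the speech enhancement literature. This is a minimal sketch: the symbols used here (clean speech and noise spectral variances \lambda_s and \lambda_d, clean and noisy magnitude spectra |S| and |Y|) are assumed for illustration and may differ from the paper's own notation.

% A priori SNR (standard definition, per time frame l and frequency bin k):
% ratio of clean-speech spectral variance to noise spectral variance.
% Ideal amplitude mask (IAM): ratio of the clean to the noisy magnitude spectrum.
\[
  \xi[l,k] = \frac{\lambda_s[l,k]}{\lambda_d[l,k]}, \qquad
  \mathrm{IAM}[l,k] = \frac{|S[l,k]|}{|Y[l,k]|}.
\]

Here \xi[l,k] is the a priori SNR that parameterises the gain function of an MMSE estimator, while the IAM is applied directly as a gain to the noisy magnitude spectrum |Y[l,k]| to produce an estimate of the clean magnitude spectrum |S[l,k]|.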
Journal Title
Journal of the Acoustical Society of America
Volume
149
Issue
5
Copyright Statement
© 2021 Acoustical Society of America. The attached file is reproduced here in accordance with the copyright policy of the publisher. Please refer to the journal's website for access to the definitive, published version.
Subject
Speech pathology
Science & Technology
Life Sciences & Biomedicine
Acoustics
Audiology & Speech-Language Pathology