Perceptually-Motivated Speech Parameters for Efficient Coding and Noise-Robust Cepstral-Based ASR Features
Author(s)
Primary Supervisor
Paliwal, Kuldip
Other Supervisors
So, Stephen
Year published
2018-03
Abstract
With speech being a natural form of human communication in everyday life, speech processing technologies such as Automatic Speech Recognition (ASR) are gradually being incorporated into modern devices and applications. Efficient ASR implementations are key for technologies that simplify service accessibility for clients, such as automatic translation and dictation software. The ASR task is commonly organised as a client/server model. Various modes have been proposed for this approach, including Network Speech Recognition (NSR) and Distributed Speech Recognition (DSR). In NSR, feature extraction and recognition are both performed at the server-side. The advantage of NSR is that no changes are required to existing mobile telephony equipment and networks. Conversely, in DSR the ASR features are extracted from the speech signal at the client-side. Unlike the Linear Predictive Coding (LPC) parameters used in NSR, which are known to be more sensitive to the effects of undesirable noise, DSR features can exploit human auditory effects which significantly aid speech recognition. The downside of DSR, however, is the requirement of a dedicated channel for the transmission process.

This thesis presents novel signal processing algorithms that extract perceptually-motivated LPC parameters at the client-side, which provide good-quality speech coding and enable noise-robust ASR features at the server-side. These algorithms are demonstrated through three studies.

The first study introduces a proposed method for estimating perceptually-motivated LPC parameters: the Smoothed and Thresholded Power Spectrum Linear Prediction (STPS-LP) analysis. The algorithm is based on the property of simultaneous masking found in the human auditory system to estimate noise-robust LPC parameters. The proposed method is evaluated and compared to three other LP analysis methods: the conventional Autocorrelation Method (AM-LP), the Spectral Envelope Estimation Vocoder method (SEEVOC-LP), and the LP Spectrum Modification method (LPSM-LP). Comparisons of robustness and speech quality demonstrated that the proposed STPS-LP method outperforms the three other schemes.

The second study investigates the quantisation performance of various LPC parameters using different quantiser schemes. The proposed STPS-LP method produced less quantisation distortion than the aforementioned methods.

The third study builds on the coefficients proposed in the first study, proposing a conversion algorithm for obtaining a set of noise-robust ASR features to be used at the server-side. These cepstral-based features are called the STPS-LP Cepstral Coefficients (STPS-LPCCs). Recognition performance using the STPS-LPCCs is improved in comparison to that using the AM-LP, SEEVOC-LP, and LPSM-LP cepstral coefficients, under clean and noisy conditions in the baseline, matched, and mismatched models.

The proposed approach provides two main advantages over the conventional NSR and DSR schemes: (i) the perceptually-motivated LPC parameters are used for speech coding, and (ii) no dedicated communications channel is required, since the existing LPC bitstream in speech coders is used for transmitting the features. These benefits make the proposed method readily applicable to current mobile telephony networks and should improve the user experience when interacting with ASR services.
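As background to the abstract, the conventional baseline it names (autocorrelation-method LP analysis followed by conversion of LPC parameters to cepstral coefficients) can be sketched as below. This is a minimal illustration of the standard textbook Levinson-Durbin and LPC-to-cepstrum recursions only; it is not the STPS-LP analysis or the thesis's conversion algorithm, whose details are not given on this page. The function names and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p are assumptions made for this sketch.

```python
import numpy as np


def autocorrelation(frame, order):
    """Biased autocorrelation lags r[0..order] of a (windowed) frame."""
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])


def levinson_durbin(r, order):
    """Solve the LP normal equations for lags r[0..order].

    Returns (a, err), where a[0] = 1 and A(z) = 1 + sum_k a[k] z^-k
    (assumed convention), and err is the final prediction error power.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model 1/A(z).

    The gain-dependent term c[0] is omitted.
    """
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = 0.0
        for k in range(max(1, n - p), n):
            acc += k * c[k] * a[n - k]
        a_n = a[n] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]
```

For a speech frame, one would typically apply a window, call `autocorrelation(frame, order)`, pass the lags to `levinson_durbin`, and convert the resulting coefficients with `lpc_to_cepstrum`. The order and number of cepstral coefficients are left to the caller, since the page does not state the values used in the thesis.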
Thesis Type
Thesis (PhD Doctorate)
Degree Program
Doctor of Philosophy (PhD)
School
School of Eng & Built Env
Copyright Statement
The author owns the copyright in this thesis, unless stated otherwise.
Subject
Perceptually-motivated speech parameters
Efficient coding
Cepstral-based ASR features
Speech processing technologies
Smoothed and thresholded power spectrum linear prediction (STPS-LP)