A Text to speech synthesis system for the Mongolian language
Abstract
The first Text-to-Speech (TTS) system for the Mongolian language has been implemented using the general speech synthesis architecture of Festival. The conversion from input text to acoustic waveform is performed in a number of steps, each handled by a functional component. The TTS is based on diphone concatenative synthesis using the TD-PSOLA technique. Hand-written letter-to-sound rules are applied in sequence, mapping strings of letters to strings of phones. Prosodic phrasing is provided by a CART tree that makes decisions based on distance from punctuation and on whether the current word is a function or content word. Intonation is provided by a CART tree predicting ToBI accents and an F0 contour generated from a model trained on natural speech. The duration model is also trained from data using a CART tree. The quality of the synthesised speech is assessed in terms of acceptability and intelligibility. The synthetic speech produced by the current version of the system is intelligible, but utterances sometimes suffer from a lack of naturalness and fluency.
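As a rough illustration of how hand-written letter-to-sound rules can be applied in sequence, the sketch below scans a word left to right and applies the first rule in an ordered list that matches at the current position. It is a simplified, context-free approximation written for this summary: the rule table, the phone symbols, and the function name `letters_to_phones` are hypothetical placeholders, not the rule set or code used in the system described above.

```python
# Toy ordered letter-to-sound rules: (letter string, phones).
# Placeholder values for illustration only; NOT the paper's Mongolian rule set.
RULES = [
    ("sh", ["S"]),   # multi-letter rules come first so they win over single letters
    ("aa", ["a:"]),
    ("s", ["s"]),
    ("h", ["h"]),
    ("a", ["a"]),
    ("r", ["r"]),
]


def letters_to_phones(word, rules=RULES):
    """Scan left to right; at each position apply the first rule that matches."""
    phones = []
    i = 0
    while i < len(word):
        for letters, out in rules:
            if word.startswith(letters, i):
                phones.extend(out)
                i += len(letters)
                break
        else:
            # No rule in the table covers the letters at this position.
            raise ValueError(f"no rule matches at position {i} in {word!r}")
    return phones
```

Rule order matters in this scheme: because the first matching rule is taken, longer letter strings have to be listed before the single letters they contain.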
Conference Title
Griffith School of Engineering Research Conference (GSERC)