Synthetically Generated Speech for Training a Pronunciation Evaluation System

School of Science | Master's thesis
Machine Learning, Data Science and Artificial Intelligence
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
Computer-Aided Pronunciation Training (CAPT) systems are designed to help users acquire speaking skills in a non-native (L2) language. CAPT systems generally employ speech recognition techniques to score how well an utterance is pronounced. The score helps learners evaluate themselves and supports them in improving their pronunciation. Scores from such systems correlate well with human-annotated scores when the utterances are long and the speakers are adults. However, in the Say It Again Kid (SIAK) project, a CAPT game built for children, utterances are short, and consequently the correlation between the system’s scores and those of human annotators is weak. The main reason for the poor performance is the lack of children’s speech data for training. The thesis shows how to mitigate the lack of transcribed data by generating it with a modern text-to-speech (TTS) system. Such systems have been shown to reach a human level of naturalness. In this work, a TTS system is trained to generate Finnish speech in children’s accents, using a large quantity of adult speech and a small set of children’s speech. Finnish-accented English is generated with the same system by mapping English words to their nearest Finnish phonetic representation and feeding them to the TTS system. The thesis thus proposes a simple way of producing accented speech. The generated data is added to the training data of the phonetic recognition model employed in SIAK. The thesis shows that this technique improves the recognition accuracy of the model: the Phoneme Error Rate (PER) fell from 0.27 to 0.13 on the Finnish children’s test set. Unfortunately, this improvement in recognition results does not translate into an improvement in SIAK scoring. This is due to a mismatch between the data used for training and testing the recognition system and the target game words: even though the generated speech resembles the target game words, the two belong to different distributions.
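For readers unfamiliar with the Phoneme Error Rate quoted in the abstract, the sketch below shows how such a figure is commonly computed: the total edit distance (substitutions, insertions, deletions) between reference and recognized phoneme sequences, divided by the total number of reference phonemes. This is a generic illustration, not the thesis's evaluation code; the function names and the example phoneme sequences are assumptions made for this example.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Total edit errors divided by total reference phonemes."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    return total_errors / total_ref


# Hypothetical example data: two short Finnish words as phoneme lists.
refs = [["k", "i", "s", "s", "a"], ["t", "a", "l", "o"]]
hyps = [["k", "i", "s", "a"], ["t", "a", "l", "o"]]
per = phoneme_error_rate(refs, hyps)  # one deletion over nine reference phonemes
```

Under this definition, a PER of 0.13 corresponds to roughly 13 errors per 100 reference phonemes. Because insertions count as errors, the value can exceed 1.0 for very poor hypotheses.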
Kurimo, Mikko
Thesis advisor
Karhila, Reima
pronunciation training, text-to-speech, synthetic speech data, children's speech, phonetic distance