Synthetically Generated Speech for Training a Pronunciation Evaluation System

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Karhila, Reima
dc.contributor.author: Padaru Shrikantha, Sujith
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.supervisor: Kurimo, Mikko
dc.date.accessioned: 2020-06-21T17:01:24Z
dc.date.available: 2020-06-21T17:01:24Z
dc.date.issued: 2020-05-19
dc.description.abstract: Computer-Aided Pronunciation Training (CAPT) systems are designed to help users acquire speaking skills in a non-native language (L2). Generally, CAPT systems employ speech recognition techniques to score the quality of an utterance. The score helps learners evaluate themselves and supports them in improving their pronunciation. Scores from such systems correlate well with human-annotated scores when the uttered sequences are long and the speakers are adults. However, in the Say It Again Kid (SIAK) project, a CAPT game built for children, utterances are short, and consequently the correlation between the system’s scores and those of human annotators is weak. The main reason for the poor performance is the lack of children’s speech data for training. This thesis shows how to mitigate the lack of transcribed data by generating it with a modern text-to-speech (TTS) system. Such systems have been shown to reach a human level of naturalness. In this work, a TTS system is trained to generate Finnish speech in children’s accents. The system uses a large quantity of adult speech and a small set of children’s speech to generate speech with children’s accents. Finnish-accented English is generated from the same system by mapping English words to their nearest Finnish phonetic representation and feeding them into the TTS system. Thus, the thesis proposes a simple way of producing accented speech. The generated data is added to the training data of the phonetic recognition model employed in SIAK. The thesis shows that this technique improves the recognition accuracy of the model: the Phoneme Error Rate (PER) fell from 0.27 to 0.13 on the Finnish children’s test set. Unfortunately, this improvement in recognition results does not translate into an improvement in SIAK scoring. This is due to a mismatch between the data used for training and testing the recognition system and the target game words: even though the generated speech resembles the target game words, the two belong to different distributions. [en]
dc.format.extent: 51+0
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/44935
dc.identifier.urn: URN:NBN:fi:aalto-202006213892
dc.language.iso: en [en]
dc.programme: Master’s Programme in Computer, Communication and Information Sciences [fi]
dc.programme.major: Machine Learning, Data Science and Artificial Intelligence [fi]
dc.programme.mcode: SCI3044 [fi]
dc.subject.keyword: pronunciation training [en]
dc.subject.keyword: text-to-speech [en]
dc.subject.keyword: synthetic speech data [en]
dc.subject.keyword: children's speech [en]
dc.subject.keyword: phonetic distance [en]
dc.title: Synthetically Generated Speech for Training a Pronunciation Evaluation System [en]
dc.type: G2 Pro gradu, diplomityö [fi]
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Diplomityö [fi]
local.aalto.electroniconly: yes
local.aalto.openaccess: yes
Files
Original bundle
Name: master_Padaru_Shrikantha_Sujith_2020.pdf
Size: 5.91 MB
Format: Adobe Portable Document Format
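
Note on the Phoneme Error Rate (PER) quoted in the abstract: PER is commonly defined as the edit distance (substitutions, deletions and insertions) between the recognized and the reference phoneme sequences, divided by the total number of reference phonemes. The sketch below illustrates that general definition only. It is not taken from the thesis, which may compute PER with a different toolkit or normalization, and the function names and example phoneme strings are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the first i-1 reference
    # phonemes and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1      # reference phoneme missing in hyp
            insertion = curr[j - 1] + 1  # extra phoneme in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Total edit distance divided by total reference length."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_ref


if __name__ == "__main__":
    # Illustrative data only: one phoneme is dropped from a five-phoneme word.
    reference = [list("kissa")]
    hypothesis = [list("kisa")]
    print(phoneme_error_rate(reference, hypothesis))
```

Under this definition, a drop from 0.27 to 0.13 means that roughly 13 edits per 100 reference phonemes remain, compared with 27 before the synthetic data was added.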