Augmentation, Oversampling and Curriculum Learning for Small Imbalanced Speech Data

School of Electrical Engineering | Master's thesis
Date
2023-12-11
Major/Subject
Signal Processing and Data Science
Mcode
ELEC3049
Degree programme
CCIS - Master’s Programme in Computer, Communication and Information Sciences (TS2013)
Language
en
Pages
81+10
Abstract
Automatic Speech Recognition (ASR) systems have seen remarkable breakthroughs in recent years, which has in turn fostered the development of ASR-supported Automatic Speaking Assessment (ASA) systems. However, their advancement is hindered by two main challenges: data scarcity and data imbalance, especially in languages such as Finnish and Finland Swedish. This thesis explores methods that alleviate these two challenges when training ASR and ASA systems for second language (L2) speakers. Such systems can be found in applications such as language learning apps and language proficiency tests. Training such ASR systems requires transcribed L2 speech data, which is scarce in most languages. Additionally, training ASA systems requires proficiency scores, which are very expensive to obtain. Thus, it is important to maximise the utilisation of existing datasets. This study works with an L2 Finnish dataset and an L2 Finland Swedish dataset, both of which are small (approx. 14 hours or less) and imbalanced. In particular, intermediate proficiency levels are well represented in the datasets, while the beginner and advanced levels have very few samples. To address these two problems, four methods were explored: 1) audio augmentation, 2) augmentation using Text-To-Speech (TTS) synthesisers, 3) oversampling with augmentation, and 4) class-wise curriculum learning. For improving ASR performance on L2 speech, audio augmentation is shown to be an effective method, while augmentation with a TTS synthesiser has a positive impact mainly on speech of lower proficiency. For ASA training, audio augmentation alone does not yield significant improvement, while its combination with oversampling leads to the best results. Lastly, class-wise curriculum learning is shown to be less effective than the other methods in our experiments.
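As a rough illustration of the third method, oversampling with augmentation, the sketch below balances a toy dataset by adding perturbed copies of minority-class samples until every class matches the largest one. It is not the thesis's implementation: the function names, the plain-list waveform representation, and the particular perturbation (random gain plus Gaussian noise) are all assumptions chosen only to keep the example self-contained.

```python
import random
from collections import defaultdict


def augment(waveform, rng, max_gain_db=6.0, noise_std=0.005):
    """Return a perturbed copy of a waveform (hypothetical augmentation:
    random gain in decibels plus additive Gaussian noise)."""
    gain = 10 ** (rng.uniform(-max_gain_db, max_gain_db) / 20)
    return [gain * x + rng.gauss(0.0, noise_std) for x in waveform]


def oversample_with_augmentation(samples, seed=0):
    """Balance classes by appending augmented copies of minority samples.

    samples: list of (waveform, label) pairs, waveform being a list of floats.
    Returns a new list in which every label occurs as often as the most
    frequent label; the original samples are kept unchanged at the front.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for waveform, label in samples:
        by_label[label].append(waveform)

    # Size of the largest class; every other class is topped up to this.
    target = max(len(items) for items in by_label.values())

    balanced = list(samples)
    for label, items in by_label.items():
        for _ in range(target - len(items)):
            # Draw a real sample of this class and perturb it, so the
            # added item is not an exact duplicate.
            balanced.append((augment(rng.choice(items), rng), label))
    return balanced


if __name__ == "__main__":
    # Toy data mimicking the imbalance described in the abstract.
    data = (
        [([0.1, -0.2, 0.05], "intermediate")] * 5
        + [([0.3, 0.0, -0.1], "beginner")]
        + [([0.05, 0.1, 0.2], "advanced")] * 2
    )
    balanced = oversample_with_augmentation(data)
    counts = defaultdict(int)
    for _, label in balanced:
        counts[label] += 1
    print(dict(counts))
```

Perturbing each added copy, rather than duplicating samples verbatim, is the point of combining the two techniques: the minority classes gain weight during training without the model seeing identical audio repeatedly.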
Supervisor
Kurimo, Mikko
Thesis advisor
Voskoboinik, Ekaterina
Al-Ghezi, Ragheb
Keywords
automatic speech recognition, automated speaking assessment, augmentation, Wav2Vec2.0, TTS