Use of Self-Supervised Learning in Automated Speaking Scoring for Low Resource Languages
dc.contributor | Aalto-yliopisto | fi |
dc.contributor | Aalto University | en |
dc.contributor.author | Al-Ghezi, Ragheb | |
dc.contributor.department | Informaatio- ja tietoliikennetekniikan laitos | fi |
dc.contributor.department | Department of Information and Communications Engineering | en |
dc.contributor.lab | Speech Recognition Research Group | en |
dc.contributor.school | Sähkötekniikan korkeakoulu | fi |
dc.contributor.school | School of Electrical Engineering | en |
dc.contributor.supervisor | Kurimo, Mikko, Prof., Aalto University, Department of Information and Communications Engineering, Finland | |
dc.date.accessioned | 2024-05-31T09:00:42Z | |
dc.date.available | 2024-05-31T09:00:42Z | |
dc.date.defence | 2024-06-14 | |
dc.date.issued | 2024 | |
dc.description.abstract | Developing automatic systems for assessing speaking proficiency has become increasingly important in second language learning, as it facilitates self-regulated learning and serves as a valuable tool for language proficiency assessment and teacher training programs. However, such systems have primarily been designed for languages with many learners, benefiting from abundant human-transcribed and speech-scored training data. In contrast, languages with fewer learners, such as Finnish and Swedish, face significant challenges due to the limited availability of training data. Nevertheless, recent advancements in AI, particularly in self-supervised machine learning, offer the possibility of developing automatic speech recognition systems even with constrained training data, making it feasible to create automatic speaking assessment systems for under-resourced languages. This dissertation investigates the potential of a self-supervised speech model, specifically Wav2vec2, to develop automatic speech recognition (ASR) and automated scoring models for second language (L2) young Swedish and Finnish, L2 child Swedish and Finnish, and native Swedish children with speech sound disorders (SSD). Results show that finetuning the monolingual Swedish Wav2vec2 model for ASR achieved a 7% relative improvement in word error rate (WER) using only 5.6 hrs of training data compared to a traditional ASR pipeline without using an external language model or customized pronunciation dictionaries. In addition, Wav2vec2 models were also shown to adapt to holistic speaking proficiency tasks when finetuned directly to predict proficiency levels or incorporated in a multitasking system, capable of decoding spoken utterances and predicting ratings concurrently.
Furthermore, deep latent representations (embeddings) extracted from ASR-finetuned Wav2vec2 were shown to predict holistic proficiency of L2 Finnish and Swedish, yielding a 20% improvement in F1 score relative to the pre-trained embeddings and manually-crafted features. The dissertation also presents an experimental evaluation of analytical models assessing components of spontaneous speaking proficiency, such as pronunciation, fluency, and lexicogrammatical proficiency, yielding human-machine agreement comparable to that of human-human inter-rater agreement. In short, finetuned ASR models facilitated the design and implementation of automated read-aloud and spontaneous speaking rating models for the aforementioned low resource tasks. | en |
dc.format.extent | 75 + app. 89 | |
dc.identifier.isbn | 978-952-64-1863-6 (electronic) | |
dc.identifier.isbn | 978-952-64-1862-9 (printed) | |
dc.identifier.issn | 1799-4942 (electronic) | |
dc.identifier.issn | 1799-4934 (printed) | |
dc.identifier.issn | 1799-4934 (ISSN-L) | |
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/128400 | |
dc.identifier.urn | URN:ISBN:978-952-64-1863-6 | |
dc.language.iso | en | en |
dc.opn | Strik, Helmer, Assoc. Prof., Radboud University, The Netherlands | |
dc.publisher | Aalto University | en |
dc.publisher | Aalto-yliopisto | fi |
dc.relation.haspart | [Publication 1]: Ragheb Al-Ghezi, Yaroslav Getman, Aku Rouhe, Raili Hilden, Mikko Kurimo. Self-supervised end-to-end ASR for low resource L2 Swedish. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association (ISCA), pp. 1086-1090, Oct 2020. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-2021120110502. | |
dc.relation.haspart | [Publication 2]: Ragheb Al-Ghezi, Yaroslav Getman, Ekaterina Voskoboinik, Mittul Singh, Mikko Kurimo. Automatic Rating of Spontaneous Speech for Low-Resource Languages. IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023, pp. 339-345, Jan 2023. DOI: 10.1109/SLT54892.2023.10022381 | |
dc.relation.haspart | [Publication 3]: Yaroslav Getman, Ragheb Al-Ghezi, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo, Giampiero Salvi, Torbjørn Svendsen, Sofia Strömbergsson. wav2vec2-based Speech Rating System for Children with Speech Sound Disorder. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association (ISCA), pp. 3618-3622, Sept 2022. | |
dc.relation.haspart | [Publication 4]: Yaroslav Getman, Nhan Phan, Ragheb Al-Ghezi, Ekaterina Voskoboinik, Mittul Singh, Tamás Grósz, Mikko Kurimo, Giampiero Salvi, Torbjørn Svendsen, Sofia Strömbergsson, Anna Smolander, and Sari Ylinen. Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children. IEEE Access Journal, 11, 86025-86037, Aug 2023. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202308305294. DOI: 10.1109/ACCESS.2023.3304274 | |
dc.relation.haspart | [Publication 5]: Ragheb Al-Ghezi, Katja Voskoboinik, Yaroslav Getman, Anna von Zansen, Heini Kallio, Mikko Kurimo, Ari Huhta, Raili Hildén. Automatic Speaking Assessment of Spontaneous L2 Finnish and Swedish. Language Assessment Quarterly Journal, 20:4-5, 421-444, Oct 2023. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202401171460. DOI: 10.1080/15434303.2023.2292265 | |
dc.relation.haspart | [Publication 6]: Yaroslav Getman, Ragheb Al-Ghezi, Tamás Grósz, Mikko Kurimo. Multi-task wav2vec2 Serving as a Pronunciation Training System for Children. In 9th Workshop on Speech and Language Technology in Education (SLaTE) (ISCA International Workshop on Speech and Language Technology in Education). International Speech Communication Association (ISCA), Aug 2023. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202312117213. | |
dc.relation.haspart | [Publication 7]: Ragheb Al-Ghezi, Mikko Kurimo. Graph-based Syntactic Word Embeddings. In Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), pages 72–78, Barcelona, Spain (Online). Association for Computational Linguistics, Dec 2020. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202102091962. | |
dc.relation.ispartofseries | Aalto University publication series DOCTORAL THESES | en |
dc.relation.ispartofseries | 120/2024 | |
dc.rev | Zechner, Klaus, Dr., ETS, USA | |
dc.rev | Knill, Kate, Dr., Cambridge University, UK | |
dc.subject.keyword | speech recognition | en |
dc.subject.keyword | self-supervised learning | en |
dc.subject.keyword | automatic speaking assessment | en |
dc.subject.other | Electrical engineering | en |
dc.title | Use of Self-Supervised Learning in Automated Speaking Scoring for Low Resource Languages | en |
dc.type | G5 Artikkeliväitöskirja | fi |
dc.type.dcmitype | text | en |
dc.type.ontasot | Doctoral dissertation (article-based) | en |
dc.type.ontasot | Väitöskirja (artikkeli) | fi |
local.aalto.acrisexportstatus | checked 2024-06-18_1340 | |
local.aalto.archive | yes | |
local.aalto.formfolder | 2024_05_31_klo_11_35 | |
local.aalto.infra | Science-IT | |
Files
Original bundle
- Name: isbn9789526418636.pdf
- Size: 1.71 MB
- Format: Adobe Portable Document Format