Use of Self-Supervised Learning in Automated Speaking Scoring for Low Resource Languages

dc.contributorAalto-yliopistofi
dc.contributorAalto Universityen
dc.contributor.authorAl-Ghezi, Ragheb
dc.contributor.departmentInformaatio- ja tietoliikennetekniikan laitosfi
dc.contributor.departmentDepartment of Information and Communications Engineeringen
dc.contributor.labSpeech Recognition Research Groupen
dc.contributor.schoolSähkötekniikan korkeakoulufi
dc.contributor.schoolSchool of Electrical Engineeringen
dc.contributor.supervisorKurimo, Mikko, Prof., Aalto University, Department of Information and Communications Engineering, Finland
dc.date.accessioned2024-05-31T09:00:42Z
dc.date.available2024-05-31T09:00:42Z
dc.date.defence2024-06-14
dc.date.issued2024
dc.description.abstractDeveloping automatic systems for assessing speaking proficiency has become increasingly important in second language learning, as it facilitates self-regulated learning and serves as a valuable tool for language proficiency assessment and teacher training programs. However, such systems have primarily been designed for languages with many learners, benefiting from abundant human-transcribed and speech-scored training data. In contrast, languages with fewer learners, such as Finnish and Swedish, face significant challenges due to the limited availability of training data. Nevertheless, recent advancements in AI, particularly in self-supervised machine learning, offer the possibility of developing automatic speech recognition systems even with constrained training data, making it feasible to create automatic speaking assessment systems for under-resourced languages. This dissertation investigates the potential of a self-supervised speech model, specifically Wav2vec2, to develop automatic speech recognition (ASR) and automated scoring models for young second language (L2) learners of Swedish and Finnish, child L2 learners of Swedish and Finnish, and native Swedish-speaking children with speech sound disorders (SSD). The results show that fine-tuning the monolingual Swedish Wav2vec2 model for ASR achieved a 7% relative improvement in word error rate (WER) over a traditional ASR pipeline, using only 5.6 hours of training data and without an external language model or customized pronunciation dictionaries. In addition, Wav2vec2 models were shown to adapt to holistic speaking proficiency tasks when fine-tuned directly to predict proficiency levels or incorporated in a multitask system capable of decoding spoken utterances and predicting ratings concurrently.
Furthermore, deep latent representations (embeddings) extracted from ASR-fine-tuned Wav2vec2 were shown to predict the holistic proficiency of L2 Finnish and Swedish speakers, yielding a 20% improvement in F1 score relative to the pre-trained embeddings and manually crafted features. The dissertation also presents an experimental evaluation of analytical models assessing components of spontaneous speaking proficiency, such as pronunciation, fluency, and lexicogrammatical proficiency, yielding human-machine agreement comparable to human-human inter-rater agreement. In short, fine-tuned ASR models facilitated the design and implementation of automated read-aloud and spontaneous speaking rating models for the aforementioned low-resource tasks.en
dc.format.extent75 + app. 89
dc.identifier.isbn978-952-64-1863-6 (electronic)
dc.identifier.isbn978-952-64-1862-9 (printed)
dc.identifier.issn1799-4942 (electronic)
dc.identifier.issn1799-4934 (printed)
dc.identifier.issn1799-4934 (ISSN-L)
dc.identifier.urihttps://aaltodoc.aalto.fi/handle/123456789/128400
dc.identifier.urnURN:ISBN:978-952-64-1863-6
dc.language.isoenen
dc.opnStrik, Helmer, Assoc. Prof., Radboud University, The Netherlands
dc.publisherAalto Universityen
dc.publisherAalto-yliopistofi
dc.relation.haspart[Publication 1]: Ragheb Al-Ghezi, Yaroslav Getman, Aku Rouhe, Raili Hildén, Mikko Kurimo. Self-supervised end-to-end ASR for low resource L2 Swedish. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association (ISCA), pp. 1086-1090, Oct 2020. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-2021120110502.
dc.relation.haspart[Publication 2]: Ragheb Al-Ghezi, Yaroslav Getman, Ekaterina Voskoboinik, Mittul Singh, Mikko Kurimo. Automatic Rating of Spontaneous Speech for Low-Resource Languages. IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023, pp. 339-345, Jan 2023. DOI: 10.1109/SLT54892.2023.10022381
dc.relation.haspart[Publication 3]: Yaroslav Getman, Ragheb Al-Ghezi, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo, Giampiero Salvi, Torbjørn Svendsen, Sofia Strömbergsson. wav2vec2-based Speech Rating System for Children with Speech Sound Disorder. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association (ISCA), pp. 3618-3622, Sept 2022.
dc.relation.haspart[Publication 4]: Yaroslav Getman, Nhan Phan, Ragheb Al-Ghezi, Ekaterina Voskoboinik, Mittul Singh, Tamás Grósz, Mikko Kurimo, Giampiero Salvi, Torbjørn Svendsen, Sofia Strömbergsson, Anna Smolander, and Sari Ylinen. Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children. IEEE Access Journal, 11, 86025-86037, Aug 2023. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202308305294. DOI: 10.1109/ACCESS.2023.3304274
dc.relation.haspart[Publication 5]: Ragheb Al-Ghezi, Katja Voskoboinik, Yaroslav Getman, Anna von Zansen, Heini Kallio, Mikko Kurimo, Ari Huhta, Raili Hildén. Automatic Speaking Assessment of Spontaneous L2 Finnish and Swedish. Language Assessment Quarterly Journal, 20:4-5, 421-444, Oct 2023. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202401171460. DOI: 10.1080/15434303.2023.2292265
dc.relation.haspart[Publication 6]: Yaroslav Getman, Ragheb Al-Ghezi, Tamás Grósz, Mikko Kurimo. Multi-task wav2vec2 Serving as a Pronunciation Training System for Children. In 9th Workshop on Speech and Language Technology in Education (SLaTE) (ISCA International Workshop on Speech and Language Technology in Education). International Speech Communication Association (ISCA), Aug 2023. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202312117213.
dc.relation.haspart[Publication 7]: Ragheb Al-Ghezi, Mikko Kurimo. Graph-based Syntactic Word Embeddings. In Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), pages 72–78, Barcelona, Spain (Online). Association for Computational Linguistics, Dec 2020. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202102091962.
dc.relation.ispartofseriesAalto University publication series DOCTORAL THESESen
dc.relation.ispartofseries120/2024
dc.revZechner, Klaus, Dr., ETS, USA
dc.revKnill, Kate, Dr., Cambridge University, UK
dc.subject.keywordspeech recognitionen
dc.subject.keywordself-supervised learningen
dc.subject.keywordautomatic speaking assessmenten
dc.subject.otherElectrical engineeringen
dc.titleUse of Self-Supervised Learning in Automated Speaking Scoring for Low Resource Languagesen
dc.typeG5 Artikkeliväitöskirjafi
dc.type.dcmitypetexten
dc.type.ontasotDoctoral dissertation (article-based)en
dc.type.ontasotVäitöskirja (artikkeli)fi
local.aalto.acrisexportstatuschecked 2024-06-18_1340
local.aalto.archiveyes
local.aalto.formfolder2024_05_31_klo_11_35
local.aalto.infraScience-IT

Files

Original bundle

Name:
isbn9789526418636.pdf
Size:
1.71 MB
Format:
Adobe Portable Document Format