End-to-End Disfluency Detection in Automatic Speech Recognition for Second Language Learners

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Getman, Yaroslav
dc.contributor.advisor: Grósz, Tamás
dc.contributor.author: Mateiu, Tudor
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.supervisor: Kurimo, Mikko
dc.date.accessioned: 2022-12-18T18:10:39Z
dc.date.available: 2022-12-18T18:10:39Z
dc.date.issued: 2022-12-12
dc.description.abstract: Speech from second language (L2) learners poses a major challenge for Automatic Speech Recognition (ASR) models. Moreover, depending on the speaker's proficiency level, L2 students' speech contains many grammatical errors, mispronunciations and disfluencies. Disfluency detection has conventionally been carried out as an additional step after the ASR pipeline. This is inconvenient because it requires preparing data beyond that used for ASR, fine-tuning a supplementary model, and incorporating that model into the downstream task. Conventional ASR systems consist of separate components: an acoustic model, a language model and a lexicon. End-to-end ASR simplifies this pipeline by mapping acoustic feature sequences directly to word sequences, without the need for additional modules. Because end-to-end systems streamline the ASR process, this thesis investigates incorporating disfluency detection into the same low-resource end-to-end ASR task, eliminating the need for a separate component and ultimately reducing computation. The disfluency detection models in this work are developed for L2 speakers learning Finnish and perform well without deviating substantially from an end-to-end L2 Finnish ASR baseline. The best model's ASR performance is promising, reaching a word error rate of 30.41% and a character error rate of 13.17%. For disfluency detection, the model obtains a recall of 0.5655 and a precision of 0.6017. The results are encouraging, as the models can successfully extrapolate different disfluency types from low-resource L2 Finnish speech. [en]
dc.format.extent: 60+2
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/118382
dc.identifier.urn: URN:NBN:fi:aalto-202212187124
dc.language.iso: en [en]
dc.programme: Master’s Programme in Computer, Communication and Information Sciences [fi]
dc.programme.major: Machine Learning, Data Science and Artificial Intelligence [fi]
dc.programme.mcode: SCI3044 [fi]
dc.subject.keyword: end-to-end [en]
dc.subject.keyword: disfluency detection [en]
dc.subject.keyword: ASR [en]
dc.subject.keyword: Wav2Vec2.0 [en]
dc.title: End-to-End Disfluency Detection in Automatic Speech Recognition for Second Language Learners [en]
dc.type: G2 Pro gradu, diplomityö [fi]
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Diplomityö [fi]
local.aalto.electroniconly: yes
local.aalto.openaccess: yes
Files
Original bundle
Name: master_Mateiu_Tudor_2022.pdf
Size: 1.85 MB
Format: Adobe Portable Document Format