End-to-End Disfluency Detection in Automatic Speech Recognition for Second Language Learners
dc.contributor | Aalto-yliopisto | fi |
dc.contributor | Aalto University | en |
dc.contributor.advisor | Getman, Yaroslav | |
dc.contributor.advisor | Grósz, Tamás | |
dc.contributor.author | Mateiu, Tudor | |
dc.contributor.school | Perustieteiden korkeakoulu | fi |
dc.contributor.supervisor | Kurimo, Mikko | |
dc.date.accessioned | 2022-12-18T18:10:39Z | |
dc.date.available | 2022-12-18T18:10:39Z | |
dc.date.issued | 2022-12-12 | |
dc.description.abstract | Speech from second language (L2) learners is a major challenge for Automatic Speech Recognition (ASR) models: it contains many grammatical errors, mispronunciations and disfluencies, depending on the speaker's proficiency level. Disfluency detection has conventionally been carried out as an added step after an ASR pipeline, which is inconvenient because it requires preparing data in addition to the ASR data, finetuning a supplemental model and incorporating it into the downstream task. Conventional ASR systems consist of separate components: an acoustic model, a language model and a lexicon. End-to-end ASR simplifies this pipeline by mapping acoustic feature sequences directly to word sequences, without additional modules. Because end-to-end systems streamline the ASR process, this thesis investigates incorporating disfluency detection into the same low-resource end-to-end ASR task, eliminating the need for a separate component and ultimately reducing computation. The disfluency detection models in this work are developed for L2 speakers learning Finnish and obtain good performance without substantially deviating from an end-to-end L2 Finnish ASR baseline. The best model's ASR performance is promising, reaching a word error rate of 30.41 % and a character error rate of 13.17 %. For disfluency detection, the model obtains a recall of 0.5655 and a precision of 0.6017. The results are encouraging, as the models can successfully detect different disfluency types in low-resource L2 Finnish speech. | en |
dc.format.extent | 60+2 | |
dc.format.mimetype | application/pdf | en |
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/118382 | |
dc.identifier.urn | URN:NBN:fi:aalto-202212187124 | |
dc.language.iso | en | en |
dc.programme | Master’s Programme in Computer, Communication and Information Sciences | fi |
dc.programme.major | Machine Learning, Data Science and Artificial Intelligence | fi |
dc.programme.mcode | SCI3044 | fi |
dc.subject.keyword | end-to-end | en |
dc.subject.keyword | disfluency detection | en |
dc.subject.keyword | ASR | en |
dc.subject.keyword | Wav2Vec2.0 | en |
dc.title | End-to-End Disfluency Detection in Automatic Speech Recognition for Second Language Learners | en |
dc.type | G2 Pro gradu, diplomityö | fi |
dc.type.ontasot | Master's thesis | en |
dc.type.ontasot | Diplomityö | fi |
local.aalto.electroniconly | yes | |
local.aalto.openaccess | yes | |
Files
Original bundle
- Name: master_Mateiu_Tudor_2022.pdf
- Size: 1.85 MB
- Format: Adobe Portable Document Format