Browsing by Author "Kurimo, Mikko, Prof., Aalto University, Department of Information and Communications Engineering, Finland"
Item: Attention-based End-to-End Models in Language Technology (Aalto University, 2024)
Rouhe, Aku; Grósz, Tamás, Dr., Aalto University, Speech Recognition, Finland; Informaatio- ja tietoliikennetekniikan laitos; Department of Information and Communications Engineering; Speech Recognition Research Group; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Information and Communications Engineering, Finland

Speech recognition specifically, and language technology more generally, have started to find everyday use. Challenging language tasks have become feasible through continued growth in data resources and compute capacity, and through neural network methods that are able to take advantage of this growth. As applications continue to integrate more deeply into our lives, it is important to understand and follow the many directions that these fields may take. At the turn of the 2020s, end-to-end models received a lot of attention. End-to-end models hold the promise of simpler solutions, which nonetheless may scale better with data and compute. On the other hand, end-to-end models defy decomposing tasks into easier subproblems. This decomposition allows modular designs, which permit a wider variety of data sources to be used. It remains unclear whether the end-to-end models are truly an improvement over previous technologies. It is not straightforward to compare end-to-end and decomposed solutions fairly, because of their many differences. This thesis proposes a principled approach for comparisons of such heterogeneous solutions and applies it to speech recognition. In their default configuration, the end-to-end models forgo many useful data sources, and rely solely on expensive end-to-end labeled data. This thesis explores methods for leveraging additional data sources in speech recognition, canonical morpheme segmentation, and spoken language translation.
Additional data sources are especially useful in low-data and under-resourced tasks. These difficult tasks often need the structure imposed by decomposed solutions. This thesis investigates end-to-end models in an under-resourced speech recognition task and a low-data canonical morpheme segmentation task. The tasks explored in this thesis are connected through a shared architecture: attention-based encoder-decoder models. Though these attention-based models are most often outperformed by hidden Markov model speech recognition systems, they showcase remarkable flexibility. They succeed in speech recognition with anywhere from just tens of hours up to thousands of hours of data. They learn to exploit auxiliary speaker and segmentation-marker inputs. They perform spoken language translation in one step. They even yield the author first place in a public benchmark competition.

Item: Use of Self-Supervised Learning in Automated Speaking Scoring for Low Resource Languages (Aalto University, 2024)
Al-Ghezi, Ragheb; Informaatio- ja tietoliikennetekniikan laitos; Department of Information and Communications Engineering; Speech Recognition Research Group; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Information and Communications Engineering, Finland

Developing automatic systems for assessing speaking proficiency has become increasingly important in second language learning, as it facilitates self-regulated learning and serves as a valuable tool for language proficiency assessment and teacher training programs. However, such systems have primarily been designed for languages with many learners, benefiting from abundant human-transcribed and speech-scored training data. In contrast, languages with fewer learners, such as Finnish and Swedish, face significant challenges due to the limited availability of training data.
Nevertheless, recent advancements in AI, particularly in self-supervised machine learning, offer the possibility of developing automatic speech recognition systems even with constrained training data, making it feasible to create automatic speaking assessment systems for under-resourced languages. This dissertation investigates the potential of a self-supervised speech model, specifically Wav2vec2, to develop automatic speech recognition (ASR) and automated scoring models for young second language (L2) speakers of Swedish and Finnish, child L2 speakers of Swedish and Finnish, and native Swedish children with speech sound disorders (SSD). The results show that finetuning the monolingual Swedish Wav2vec2 model for ASR achieved a 7% relative improvement in word error rate (WER) using only 5.6 hours of training data compared to a traditional ASR pipeline without using an external language model or customized pronunciation dictionaries. In addition, Wav2vec2 models were shown to adapt to holistic speaking proficiency tasks when finetuned directly to predict proficiency levels or incorporated in a multitask system capable of decoding spoken utterances and predicting ratings concurrently. Furthermore, deep latent representations (embeddings) extracted from ASR-finetuned Wav2vec2 were shown to predict holistic proficiency of L2 Finnish and Swedish, yielding a 20% improvement in F1 score relative to the pre-trained embeddings and manually crafted features. The dissertation also presents an experimental evaluation of analytical models assessing components of spontaneous speaking proficiency, such as pronunciation, fluency, and lexicogrammatical proficiency, yielding human-machine agreement comparable to human-human inter-rater agreement. In short, finetuned ASR models facilitated the design and implementation of automated read-aloud and spontaneous speaking rating models for the aforementioned low-resource tasks.