Browsing by Author "Al-Ghezi, Ragheb"
Now showing 1 - 16 of 16
Item Augmentation, Oversampling and Curriculum Learning for Small Imbalanced Speech Data (2023-12-11) Lun, Tin; Voskoboinik, Ekaterina; Al-Ghezi, Ragheb; Sähkötekniikan korkeakoulu; Kurimo, Mikko

Automatic Speech Recognition (ASR) systems have seen remarkable breakthroughs in recent years, which has in turn fostered the development of ASR-supported Automatic Speaking Assessment (ASA) systems. However, their advancement is hindered by two main challenges: data scarcity and data imbalance, especially in languages such as Finnish and Finland Swedish. This thesis explores methods that alleviate these two challenges when training ASR and ASA systems for second language (L2) speakers. Such systems can be found in applications like language learning apps and language proficiency tests. Training these ASR systems requires transcribed L2 speech data, which is scarce in most languages. Additionally, proficiency scores are required to train ASA systems, but they are very expensive to obtain. It is therefore important to maximise the utilisation of existing datasets. This study works with an L2 Finnish dataset and an L2 Finland Swedish dataset, both of which are small (approx. 14 hours or less) and imbalanced: intermediate proficiency levels are well represented, while beginner and advanced levels have only very few samples. To address these two problems, four methods were explored: 1) audio augmentation, 2) augmentation using Text-To-Speech (TTS) synthesisers, 3) oversampling with augmentation, and 4) class-wise curriculum learning. For ASR on L2 speech, audio augmentation is shown to be an effective method, while augmentation with a TTS synthesiser has a positive impact mainly for speech of lower proficiency. For ASA training, audio augmentation alone does not yield significant improvement, while its combination with oversampling leads to the best results.
Lastly, class-wise curriculum learning is shown to be less effective than the other methods in our experiments.

Item Automated Assessment of Task Completion in Spontaneous Speech for Finnish and Finland Swedish Language Learners (2023-05-16) Voskoboinik, Ekaterina; Getman, Yaroslav; Al-Ghezi, Ragheb; Kurimo, Mikko; Grosz, Tamas; Department of Information and Communications Engineering; Speech Recognition

This study investigates the feasibility of automated content scoring for spontaneous spoken responses from Finnish and Finland Swedish learners. Our experiments reveal that pretrained Transformer-based models outperform the tf-idf baseline in automatic task completion grading. Furthermore, we demonstrate that pre-fine-tuning these models to differentiate between responses to distinct prompts enhances subsequent task completion fine-tuning. We observe that task completion classifiers learn faster and produce predictions with stronger correlations to human grading when accounting for task differences. Additionally, we find that employing similarity learning, as opposed to conventional classification fine-tuning, further improves the results. It is especially helpful to learn not only the similarities between responses within one score bin, but also the exact differences between the average human scores the responses received. Lastly, we demonstrate that models applied to both manual and ASR transcripts yield comparable correlations to human grading.

Item Automatic Assessment of Fluency in L2 Finnish and Finland Swedish (2022-05-15) Packalén, Aaro; Al-Ghezi, Ragheb; Sähkötekniikan korkeakoulu; Turunen, Markus

Item Automatic Assessment of Spoken Lexico-Grammatical Proficiency in L2 Finnish and Swedish (2022-07-29) Akiki, Clara; Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Perustieteiden korkeakoulu; Kurimo, Mikko

Item Automatic Speaking Assessment of Spontaneous L2 Finnish and Swedish (Taylor & Francis, 2023) Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Getman, Yaroslav; Von Zansen, Anna; Kallio, Heini; Kurimo, Mikko; Huhta, Ari; Hildén, Raili; Department of Information and Communications Engineering; Dept Signal Process and Acoust; Speech Recognition; University of Jyväskylä; Helsinki University Central Hospital

The development of automated systems for evaluating spontaneous speech is desirable for L2 learning, as they can serve as facilitating tools for self-regulated learning, language proficiency assessment, and teacher training programs. However, languages with fewer learners face challenges due to the scarcity of training data. Recent advancements in machine learning have made it possible to develop systems with a limited amount of target-domain data.
To this end, we propose automatic speaking assessment systems for spontaneous L2 speech in Finnish and Finland Swedish, comprising six machine learning models each, and report their performance in terms of statistical evaluation criteria.

Item Developing an AI-assisted Low-resource Spoken Language Learning App for Children (IEEE, 2023) Getman, Yaroslav; Phan, Nhan; Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Singh, Mittul; Grosz, Tamas; Kurimo, Mikko; Salvi, Giampiero; Svendsen, Torbjorn; Strombergsson, Sofia; Smolander, Anna; Ylinen, Sari; Department of Information and Communications Engineering; Speech Recognition; Norwegian University of Science and Technology; Tampere University; Karolinska Institutet

Computer-assisted Language Learning (CALL) is a rapidly developing area accelerated by advancements in the field of AI. A well-designed and reliable CALL system allows students to practice language skills, such as pronunciation, at any time outside the classroom. Furthermore, gamification via mobile applications has shown encouraging effects on learning outcomes and motivates young users to practice more and perceive language learning as a positive experience. In this work, we adapt the latest speech recognition technology to be part of an online pronunciation training system for small children. As part of our gamified mobile application, our models assess the pronunciation quality of young Swedish children who have been diagnosed with Speech Sound Disorder and are participating in speech therapy. Additionally, the models provide feedback to young non-native children learning to pronounce Swedish and Finnish words. Our experiments revealed that these new models fit into an online game, as they function as speech recognizers and pronunciation evaluators simultaneously.
To make our systems more trustworthy and explainable, we investigated whether the combination of modern input attribution algorithms and time-aligned transcripts can explain the decisions made by the models, give us insights into how the models work, and provide a tool for developing more reliable solutions.

Item End-to-End Low-Resource Automatic Speech Recognition for Second Language Learners (2021-10-19) Getman, Yaroslav; Al-Ghezi, Ragheb; Sähkötekniikan korkeakoulu; Kurimo, Mikko

Compared to native speech, second language (L2) learners' speech is more difficult for automatic speech recognition (ASR) systems to recognize, since it is much more likely to contain lexical and grammatical errors, as well as disfluencies and mispronunciations. Furthermore, L2 ASR is challenging because it is low-resource, meaning that the amount of training data is very limited. Unlike conventionally used Hidden Markov Model-based ASR systems, end-to-end ASR systems eliminate the need for separate components by directly mapping acoustic features to text. However, these systems require large amounts of labelled training data, which makes it difficult to apply them to L2 ASR. Recent advancements in self-supervised acoustic learning leverage widely available untranscribed speech data to learn powerful acoustic representations that can be incorporated into end-to-end systems. This work explores and deploys mono- and multilingual self-supervised acoustic models for low-resource L2 ASR. In this thesis, ASR systems are developed for L2 speakers of Finland Swedish, Finnish, and German. Depending on the target language, the self-supervised end-to-end models provide a relative word error rate improvement of 31.3-45.1% compared to the conventional ASR systems. The results obtained in this thesis show the high performance and promising potential of self-supervised end-to-end acoustic models for low-resource L2 ASR.
In addition, this work is an important step in the development of automatic speaking assessment tools for L2 speakers, in which an accurate ASR system is a crucial component.

Item Graph-based Syntactic Word Embeddings (2020-12-30) Al-Ghezi, Ragheb; Kurimo, Mikko; Dept Signal Process and Acoust; Speech Recognition

We propose a simple and efficient framework for learning syntactic embeddings based on information derived from constituency parse trees. Using biased random walk methods, our embeddings not only encode syntactic information about words but also capture contextual information. We also propose a method to train the embeddings on multiple constituency parse trees to ensure the encoding of a global syntactic representation. Quantitative evaluation of the embeddings shows competitive performance on a POS tagging task when compared to other types of embeddings, and qualitative evaluation reveals interesting facts about the syntactic typology learned by these embeddings.

Item Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model (International Speech Communication Association, 2023-08-20) Grosz, Tamas; Getman, Yaroslav; Al-Ghezi, Ragheb; Rouhe, Aku; Kurimo, Mikko; Department of Information and Communications Engineering; Speech Recognition

Self-supervised speech models, such as wav2vec2, have become extremely popular in the past few years. Their main appeal is that, after pre-training on a large amount of audio, they require only a small amount of supervised fine-tuning data to achieve outstanding results. Despite their immense success, very little is understood about the pre-trained models and how fine-tuning changes them. In this work, we take the first steps towards a better understanding of wav2vec2 systems using model interpretation tools such as visualization and latent embedding clustering.
Through our analysis, we gain new insights into the abilities of the pre-trained networks and the effect that fine-tuning has on them. We demonstrate that the clusters learned by the pre-trained model are just as important a factor as the supervised training data distribution in determining the accuracy of the fine-tuned system, which could aid in selecting the most suitable pre-trained model for the supervised data.

Item Multi-task wav2vec2 Serving as a Pronunciation Training System for Children (2023-08-18) Getman, Yaroslav; Al-Ghezi, Ragheb; Grosz, Tamas; Kurimo, Mikko; Department of Information and Communications Engineering; Dept Signal Process and Acoust; Speech Recognition

Item Multilingual TTS Accent Impressions for Accented ASR (2023) Karakasidis, Georgios; Robinson, Nathaniel; Getman, Yaroslav; Ogayo, Atieno; Al-Ghezi, Ragheb; Ayasi, Ananya; Watanabe, Shinji; Mortensen, David R.; Kurimo, Mikko; Department of Information and Communications Engineering; Ekštein, Kamil; Pártl, František; Konopík, Miloslav; Speech Recognition; Carnegie Mellon University

Automatic Speech Recognition (ASR) for high-resource languages like English is often considered a solved problem. However, most high-resource ASR systems favor socioeconomically advantaged dialects. In the case of English, this leaves behind many L2 speakers and speakers of low-resource accents (a majority of English speakers). One way to mitigate this is to fine-tune a pre-trained English ASR model for a desired low-resource accent. However, collecting transcribed accented audio is costly and time-consuming. In this work, we present a method to produce synthetic L2-English speech via pre-trained text-to-speech (TTS) in an L1 language (the target accent). Such speech can be produced at a much larger scale and lower cost than authentic speech collection.
We present initial experiments applying this augmentation method. Our results suggest that the success of TTS augmentation relies on access to more than one hour of authentic training data and a diversity of target-domain prompts for speech synthesis.

Item New data, benchmark and baseline for L2 speaking assessment for low-resource languages (ISCA - International Speech Communication Association, 2023) Kurimo, Mikko; Getman, Yaroslav; Voskoboinik, Ekaterina; Al-Ghezi, Ragheb; Kallio, Heini; Kuronen, Mikko; von Zansen, Anna; Hilden, Raili; Kronholm, Sirkku; Huhta, Ari; Lindén, Krister; Department of Information and Communications Engineering; Speech Recognition; University of Jyväskylä; University of Helsinki

The development of large multilingual speech models makes it possible to construct high-quality speech technology even for low-resource languages. In this paper, we present the speech data of L2 learners of Finnish and Finland Swedish that we have recently collected for training and evaluating automatic speech recognition (ASR) and speaking assessment (ASA) systems. It includes over 4000 recordings by over 300 students per language in short read-aloud and free-form tasks. The recordings have been manually transcribed and assessed for pronunciation, fluency, range, accuracy, task achievement, and a holistic proficiency level. We also present an ASR and ASA benchmarking setup we have constructed using this data, and include results from our baseline systems, built by fine-tuning a self-supervised multilingual model for the target language.
In addition to benchmarking, our baseline system can be used by L2 students and teachers for online self-training and evaluation of oral proficiency.

Item A Pronunciation Scoring System Embedded into Children’s Foreign Language Learning Games with Experimental Verification of Learning Benefits (2023-08-18) Karhila, Reima; Ylinen, Sari; Smolander, Anna-Riikka; Rouhe, Aku; Al-Ghezi, Ragheb; Getman, Yaroslav; Grosz, Tamas; Uther, Maria; Kurimo, Mikko; Department of Information and Communications Engineering; Dept Signal Process and Acoust; Speech Recognition; Tampere University; Birmingham City University; Silo AI Oy

Over the years, language technology has become a valuable asset for foreign language learners. In this work, we introduce pronunciation feedback scoring systems for 6-12-year-old children. The scoring systems were embedded in second-language (L2) English learning games designed to prompt children to repeat words. Speech and phone recognition models were used to validate utterances and extract phoneme-wise statistics, which were used to compute feedback scores of 0-5 stars. The scoring systems were trained to mimic the preferences of a single expert who evaluated all the training data. Our automatic scoring system reached a correlation of 0.59 with the human annotation. The system was also tested in a learning experiment, where EEG measurements indicated that children who played our learning game, with our scoring engine providing pronunciation feedback, improved their perception of speech sounds.
We release the game code and the speech data used to train the scoring system.

Item Self-supervised end-to-end ASR for low resource L2 Swedish (2021) Al-Ghezi, Ragheb; Getman, Yaroslav; Rouhe, Aku; Hildén, Raili; Kurimo, Mikko; Dept Signal Process and Acoust; Speech Recognition; University of Helsinki

Unlike traditional (hybrid) Automatic Speech Recognition (ASR), end-to-end ASR systems simplify the training procedure by directly mapping acoustic features to sequences of graphemes or characters, thereby eliminating the need for specialized acoustic, language, or pronunciation models. However, one drawback of end-to-end ASR systems is that they require more training data than conventional ASR systems to achieve a similar word error rate (WER). This makes it difficult to develop ASR systems for tasks where transcribed target data is limited, such as ASR for second language (L2) speakers of Swedish. Nonetheless, recent advancements in self-supervised acoustic learning, manifested in wav2vec models [1, 2, 3], leverage the available untranscribed speech data to provide compact acoustic representations that can achieve low WER when incorporated into end-to-end systems. To this end, we experiment with several monolingual and cross-lingual self-supervised acoustic models to develop an end-to-end ASR system for L2 Swedish. Even though our test set is very small, it indicates that these systems are competitive in performance with a traditional ASR pipeline.
Our best model appears to reduce the WER by 7% relative to our traditional ASR baseline trained on the same target data.

Item Use of Self-Supervised Learning in Automated Speaking Scoring for Low Resource Languages (Aalto University, 2024) Al-Ghezi, Ragheb; Informaatio- ja tietoliikennetekniikan laitos; Department of Information and Communications Engineering; Speech Recognition Research Group; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Information and Communications Engineering, Finland

Developing automatic systems for assessing speaking proficiency has become increasingly important in second language learning, as it facilitates self-regulated learning and serves as a valuable tool for language proficiency assessment and teacher training programs. However, such systems have primarily been designed for languages with many learners, which benefit from abundant human-transcribed and speech-scored training data. In contrast, languages with fewer learners, such as Finnish and Swedish, face significant challenges due to the limited availability of training data. Nevertheless, recent advancements in AI, particularly in self-supervised machine learning, offer the possibility of developing automatic speech recognition systems even with constrained training data, making it feasible to create automatic speaking assessment systems for under-resourced languages. This dissertation investigates the potential of a self-supervised speech model, specifically Wav2vec2, to develop automatic speech recognition (ASR) and automated scoring models for L2 young Swedish and Finnish, L2 child Swedish and Finnish, and native Swedish children with speech sound disorders (SSD).
Results show that fine-tuning the monolingual Swedish Wav2vec2 model for ASR achieved a 7% relative improvement in word error rate (WER) using only 5.6 hours of training data, compared to a traditional ASR pipeline, without using an external language model or customized pronunciation dictionaries. In addition, Wav2vec2 models were shown to adapt to holistic speaking proficiency tasks when fine-tuned directly to predict proficiency levels, or when incorporated into a multitasking system capable of decoding spoken utterances and predicting ratings concurrently. Furthermore, deep latent representations (embeddings) extracted from ASR-fine-tuned Wav2vec2 were shown to predict the holistic proficiency of L2 Finnish and Swedish, yielding a 20% improvement in F1 score relative to the pre-trained embeddings and manually crafted features. The dissertation also presents an experimental evaluation of analytical models assessing components of spontaneous speaking proficiency, such as pronunciation, fluency, and lexico-grammatical proficiency, yielding human-machine agreement comparable to human-human inter-rater agreement. In short, fine-tuned ASR models facilitated the design and implementation of automated read-aloud and spontaneous speaking rating models for the aforementioned low-resource tasks.

Item wav2vec2-based Speech Rating System for Children with Speech Sound Disorder (International Speech Communication Association, 2022) Getman, Yaroslav; Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Grósz, Tamás; Kurimo, Mikko; Salvi, Giampiero; Svendsen, Torbjørn; Strömbergsson, Sofia; Dept Signal Process and Acoust; Speech Recognition; Norwegian University of Science and Technology; Karolinska Institutet

Speaking is a fundamental way of communication, developed at a young age. Unfortunately, some children with speech sound disorder struggle to acquire this skill, hindering their ability to communicate efficiently.
Speech therapies, which could aid these children in speech acquisition, greatly rely on speech practice trials and accurate feedback about their pronunciations. To enable home therapy and lessen the burden on speech-language pathologists, we need a highly accurate and automatic way of assessing the quality of speech uttered by young children. Our work focuses on exploring the applicability of state-of-the-art self-supervised, deep acoustic models, mainly wav2vec2, for this task. The empirical results highlight that these self-supervised models are superior to traditional approaches and close the gap between machine and human performance.
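Several entries above quote relative WER figures ("a relative improvement of the word error rate by 31.3-45.1%", "reduce the WER by 7% relative"). As a minimal sketch of how such numbers are conventionally derived, the snippet below computes WER as word-level edit distance over reference length and then the relative reduction between two systems; the function names and example figures are illustrative, not taken from the publications themselves.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement, as in '7% relative to the baseline'."""
    return (baseline_wer - new_wer) / baseline_wer

# Illustrative numbers only: a baseline at 30% WER improved to 27.9% WER
# is a 7% relative reduction, even though the absolute drop is 2.1 points.
print(round(relative_wer_reduction(0.30, 0.279), 2))  # 0.07
```

The distinction matters when reading the abstracts: a "7% relative" reduction from a 30% baseline is an absolute drop of only about two percentage points.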