
Browsing by Author "Singh, Mittul"

Now showing 1 - 13 of 13
  • Automatic Rating of Spontaneous Speech for Low-Resource Languages
    (2023) Al-Ghezi, Ragheb; Getman, Yaroslav; Voskoboinik, Ekaterina; Singh, Mittul; Kurimo, Mikko
    A4 Article in conference proceedings
    Automatic spontaneous speaking assessment systems bring numerous advantages to second language (L2) learning and assessment, such as promoting self-learning and reducing language teachers' workload. Conventionally, these systems are developed for languages with a large number of learners due to the abundance of training data, yet languages with fewer learners, such as Finnish and Swedish, remain at a disadvantage due to the scarcity of the required training data. Nevertheless, recent advancements in self-supervised deep learning make it possible to develop automatic speech recognition systems with a reasonable amount of training data. In turn, this advancement makes it feasible to develop systems for automatically assessing the spoken proficiency of learners of under-resourced languages: L2 Finnish and Finland Swedish. Our work evaluates the overall performance of the L2 ASR systems as well as the rating systems compared to human reference ratings for both languages.
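    A minimal sketch of how such a rating pipeline could be wired together, assuming a fine-tuned ASR checkpoint and a handful of rated recordings; the model name, file paths, and features below are placeholders, not the authors' system:

```python
# Hypothetical ASR-plus-rater pipeline (placeholder model name, file paths,
# and features; not the authors' system).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from transformers import pipeline

# 1) Transcribe L2 speech with a fine-tuned self-supervised ASR model.
asr = pipeline("automatic-speech-recognition", model="some-org/wav2vec2-l2-finnish")

def features(audio_path):
    """Turn one recording into simple proficiency cues from the ASR output."""
    words = asr(audio_path)["text"].split()
    return np.array([len(words),                              # fluency proxy: word count
                     len(set(words)) / max(len(words), 1)])   # lexical diversity

# 2) Map the features to human proficiency ratings with a regressor.
train_paths = ["learner_01.wav", "learner_02.wav"]            # placeholder recordings
train_scores = [2.0, 4.0]                                     # placeholder human ratings
X = np.stack([features(p) for p in train_paths])
rater = GradientBoostingRegressor().fit(X, train_scores)
print(rater.predict(features("new_learner.wav")[None]))       # predicted rating
```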
  • Data augmentation using prosody and false starts to recognize non-native children's speech
    (2020) Kathania, Hemant; Singh, Mittul; Grósz, Tamás; Kurimo, Mikko
    A4 Article in conference proceedings
    This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children's speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, the speech, being spontaneous, has false starts transcribed as partial words, which in the test transcriptions leads to unseen partial words. To cope with these two challenges, we investigate a data augmentation-based approach. Firstly, we apply prosody-based data augmentation to supplement the audio data. Secondly, we simulate false starts by introducing partial-word noise into the language modeling corpora, creating new words. Acoustic models trained on prosody-based augmented data outperform the models using the baseline recipe or the SpecAugment-based augmentation. The partial-word noise also helps to improve the baseline language model. Our ASR system, a combination of these schemes, placed third in the evaluation period and achieves a word error rate of 18.71%. After the evaluation period, we observed that increasing the amount of prosody-based augmented data leads to better performance. Furthermore, removing low-confidence-score words from the hypotheses can lead to further gains. These two improvements lower the ASR error rate to 17.99%.
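    A minimal sketch of the partial-word noise idea for language-model training text; the truncation rate and marker format are assumptions, not the paper's exact recipe:

```python
# Minimal sketch of partial-word noise for LM training text (truncation rate
# and marker format are assumptions, not the paper's exact recipe).
import random

def add_partial_word_noise(sentence, p=0.05, marker="-"):
    """Randomly prepend a truncated copy of a word to imitate a false start,
    e.g. 'longer' -> 'lon- longer'."""
    out = []
    for word in sentence.split():
        if len(word) > 3 and random.random() < p:
            cut = random.randint(1, len(word) - 2)   # keep a non-empty prefix
            out.append(word[:cut] + marker)          # simulated false-start token
        out.append(word)
    return " ".join(out)

random.seed(0)
print(add_partial_word_noise("children often restart longer words while speaking", p=0.3))
```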
  • Developing an AI-assisted Low-resource Spoken Language Learning App for Children
    (2023) Getman, Yaroslav; Phan, Nhan; Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Singh, Mittul; Grosz, Tamas; Kurimo, Mikko; Salvi, Giampiero; Svendsen, Torbjorn; Strombergsson, Sofia; Smolander, Anna; Ylinen, Sari
    A1 Original article in a scientific journal
    Computer-assisted Language Learning (CALL) is a rapidly developing area accelerated by advancements in the field of AI. A well-designed and reliable CALL system allows students to practice language skills, like pronunciation, any time outside of the classroom. Furthermore, gamification via mobile applications has shown encouraging results on learning outcomes and motivates young users to practice more and to perceive language learning as a positive experience. In this work, we adapt the latest speech recognition technology to be a part of an online pronunciation training system for small children. As part of our gamified mobile application, our models will assess the pronunciation quality of young Swedish children diagnosed with Speech Sound Disorder and participating in speech therapy. Additionally, the models provide feedback to young non-native children learning to pronounce Swedish and Finnish words. Our experiments revealed that these new models fit into an online game, as they function as speech recognizers and pronunciation evaluators simultaneously. To make our systems more trustworthy and explainable, we investigated whether the combination of modern input attribution algorithms and time-aligned transcripts can explain the decisions made by the models, give us insights into how the models work, and provide a tool to develop more reliable solutions.
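    A generic gradient-based saliency sketch over the input waveform, illustrating the kind of input attribution mentioned above; it is not the attribution algorithm used in the paper, and the checkpoint and waveform are placeholders:

```python
# Generic gradient-based saliency over the input waveform (an illustration of
# input attribution for a CTC ASR model, not the paper's algorithm).
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

speech = torch.randn(16000)                          # placeholder 1-second waveform
inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt")
x = inputs.input_values.requires_grad_(True)

logits = model(x).logits                             # (1, frames, vocab)
score = logits.max(dim=-1).values.sum()              # confidence of the greedy path
score.backward()
saliency = x.grad.abs().squeeze()                    # per-sample importance over time
print(saliency.shape, int(saliency.argmax()))        # most influential audio sample
```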
  • Effect of Speech Modification on Wav2Vec2 Models for Children Speech Recognition
    (2024) Sinha, Abhijit; Singh, Mittul; Kadiri, Sudarsana Reddy; Kurimo, Mikko; Kathania, Hemant Kumar
    A4 Article in conference proceedings
    Speech modification methods normalize children's speech towards adults' speech, enabling off-the-shelf generic automatic speech recognition (ASR) for this low-resource scenario. On the other hand, ASR models like Wav2Vec2 have shown remarkable robustness towards various speakers, thus streamlining their deployment. This paper examines the benefit of speech modification methods when using Wav2Vec2 models on children's speech. We experimented with prototypical speech modification methods and found that while models trained on large datasets exhibit similar performance across unmodified and modified children's speech, models trained on smaller datasets exhibit notably enhanced performance with modified speech. However, analyzing age effects on the PF-Star and CMU Kids evaluation sets, we observe that all Wav2Vec2 variants still underperform for children under 10 years of age. In this scenario, speech modification methods and their combinations help improve performance for small and large Wav2Vec2 models, but there remains plenty of room for improvement.
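    An illustrative sketch of modifying child speech before decoding it with an off-the-shelf Wav2Vec2 model; the pitch and tempo parameters, file name, and checkpoint are assumptions, not the paper's settings:

```python
# Illustrative sketch: modify child speech, then decode with an off-the-shelf
# Wav2Vec2 model (parameters, file name, and checkpoint are assumptions).
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

speech, sr = librosa.load("child_utterance.wav", sr=16000)   # placeholder recording
# Lower the pitch and slow the tempo to move child speech towards adult speech.
modified = librosa.effects.pitch_shift(speech, sr=sr, n_steps=-3)
modified = librosa.effects.time_stretch(modified, rate=0.9)

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()
inputs = processor(modified, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])   # greedy transcript
```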
  • Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models
    (2020) Singh, Mittul; Smit, Peter; Virpioja, Sami; Kurimo, Mikko
    A4 Article in conference proceedings
    Character-based Neural Network Language Models (NNLM) have the advantage of a smaller vocabulary and thus faster training times in comparison to NNLMs based on multi-character units. However, in low-resource scenarios, both character and multi-character NNLMs suffer from data sparsity. In such scenarios, cross-lingual transfer has improved multi-character NNLM performance by allowing information transfer from a source to the target language. In the same vein, we propose to use cross-lingual transfer for character NNLMs applied to low-resource Automatic Speech Recognition (ASR). However, applying cross-lingual transfer to character NNLMs is not as straightforward as for multi-character NNLMs. We observe that the relatedness of the source language plays an important role in cross-lingual pretraining of character NNLMs. We evaluate this aspect on ASR tasks for two target languages: Finnish (with English and Estonian as sources) and Swedish (with Danish, Norwegian, and English as sources). Prior work has observed no difference between using a related or an unrelated language for multi-character NNLMs. We, however, show that for character-based NNLMs, only pretraining with a related language improves the ASR performance, and using an unrelated language may deteriorate it. We also observe that the benefits are larger when there is much less target data than source data.
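    A toy sketch of the pretrain-then-fine-tune recipe for a character-level language model with a shared character vocabulary; the data, model size, and training schedule below are illustrative only, not the paper's setup:

```python
# Toy pretrain-then-fine-tune recipe for a character-level LM with a shared
# character vocabulary (data and hyperparameters are illustrative only).
import torch
import torch.nn as nn

source_text = "tere tulemast eesti keelde "      # related source language (Estonian)
target_text = "tervetuloa suomen kieleen "       # low-resource target (Finnish)
vocab = sorted(set(source_text + target_text))   # shared character vocabulary
stoi = {c: i for i, c in enumerate(vocab)}

def batch(text):
    ids = torch.tensor([stoi[c] for c in text])
    return ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)   # inputs, next-char targets

class CharLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)
    def forward(self, x):
        hidden, _ = self.rnn(self.emb(x))
        return self.out(hidden)

model = CharLM(len(vocab))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

def train(text, steps):
    x, y = batch(text)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x).transpose(1, 2), y)
        loss.backward()
        opt.step()
    return loss.item()

print("source pretraining loss:", train(source_text, 50))   # pretrain on the related language
print("target fine-tuning loss:", train(target_text, 50))   # fine-tune on the target language
```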
  • End-to-end speech summarization
    (2023-09-10) Oinonen, Tommi
    School of Electrical Engineering | Bachelor's thesis
  • An Equal Data Setting for Attention-Based Encoder-Decoder and HMM/DNN Models: A Case Study in Finnish ASR
    (2021) Rouhe, Aku; Van Camp, Astrid; Singh, Mittul; Van Hamme, Hugo; Kurimo, Mikko
    A4 Article in conference proceedings
    Standard end-to-end training of attention-based ASR models only uses transcribed speech. If they are compared to HMM/DNN systems, which additionally leverage a large corpus of text-only data and expert-crafted lexica, the differences in modeling cannot be disentangled from differences in data. We propose an experimental setup where only transcribed speech is used to train both model types. To highlight the difference that text-only data can make, we use Finnish, where an expert-crafted lexicon is not needed. With 1500 hours of equal data, we find that both ASR paradigms perform similarly, but adding text data quickly improves the HMM/DNN system. On a smaller 160-hour subset, we find that HMM/DNN models outperform AED models.
  • First-pass decoding with n-gram approximation of RNNLM: The problem of rare words
    (2018-09) Singh, Mittul; Smit, Peter; Virpioja, Sami; Kurimo, Mikko
    A4 Article in conference proceedings
  • Handling noisy labels for robustly learning from self-training data for low-resource sequence labeling
    (2019-01-01) Paul, Debjit; Singh, Mittul; Hedderich, Michael A.; Klakow, Dietrich
    A4 Article in conference proceedings
    In this paper, we address the problem of effectively self-training neural networks in a low-resource setting. Self-training is frequently used to automatically increase the amount of training data. However, in a low-resource scenario, it is less effective due to unreliable annotations created by self-labeling the unlabeled data. We propose to combine self-training with noise handling on the self-labeled data. Directly estimating noise on the combined clean training set and self-labeled data can lead to corruption of the clean data and hence performs worse. Thus, we propose the Clean and Noisy Label Neural Network, which trains on clean and noisy self-labeled data simultaneously by explicitly modelling clean and noisy labels separately. In our experiments on Chunking and NER, this approach performs more robustly than the baselines. Complementary to this explicit approach, noise can also be handled implicitly with the help of an auxiliary learning task. When added to such a complementary approach, our method is more beneficial than other baseline methods, and together they provide the best performance overall.
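    A simplified sketch of training jointly on clean and self-labeled data while modelling label noise with a learned confusion layer; this is one common way to treat noisy labels separately and not the paper's exact architecture:

```python
# Simplified sketch: train jointly on clean and self-labeled data, routing the
# noisy branch through a learned label-confusion layer (not the exact CNLNN).
import torch
import torch.nn as nn

num_classes, dim = 5, 32
encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, num_classes))
noise_layer = nn.Linear(num_classes, num_classes, bias=False)   # models clean->noisy label flips
nn.init.eye_(noise_layer.weight)                                # start from "no noise"

opt = torch.optim.Adam(list(encoder.parameters()) + list(noise_layer.parameters()), lr=1e-3)
ce = nn.CrossEntropyLoss()

def step(x_clean, y_clean, x_noisy, y_noisy):
    clean_logits = encoder(x_clean)          # clean branch: ordinary supervision
    # noisy branch: pass the model's class probabilities through the noise layer
    noisy_logits = noise_layer(torch.softmax(encoder(x_noisy), dim=-1))
    loss = ce(clean_logits, y_clean) + ce(noisy_logits, y_noisy)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy batches of random features and labels, just to exercise the training step
x_c, y_c = torch.randn(8, dim), torch.randint(0, num_classes, (8,))
x_n, y_n = torch.randn(8, dim), torch.randint(0, num_classes, (8,))
print(step(x_c, y_c, x_n, y_n))
```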
  • MatsuLM - Python implementation of a neural network language modeling toolkit
    (2020-08-19) Nyberg, Riko
    School of Science | Master's thesis
    Language models (LMs) give the probability of how likely a sequence of words is to appear in a particular order in a sentence, and they are an essential part of automatic speech recognition (ASR) and natural language processing (NLP) systems. These systems have improved at a considerable pace over the past decade. Similarly, language models have advanced significantly since the invention of recurrent neural network language models (RNNLMs) in 2010. These RNNLMs are generally called neural network language models (NNLMs), and they have become the state-of-the-art language models because of their superior performance compared to N-gram models. Currently, there are very few open-source toolkits for NNLMs, and the existing ones have either become outdated and unsupported or suffer from functionality issues. This thesis introduces a new NNLM toolkit, called MatsuLM, that uses the latest machine learning frameworks and industry standards and is therefore faster and easier to use and set up than the existing NNLM tools. MatsuLM includes all the essential components needed to create and monitor NNLM development effortlessly, and it is built to be as lightweight and straightforward as possible to decrease development effort in the future. MatsuLM’s performance is compared against two existing NNLM toolkits (TheanoLM and awd-lstm-lm). In the experiments conducted during this thesis, both existing toolkits were slower than the newly presented MatsuLM in training language models. Consequently, MatsuLM is currently the fastest and most up-to-date NNLM toolkit compared to TheanoLM and awd-lstm-lm.
  • Service registration chatbot: collecting and comparing dialogues from AMT workers and service’s users
    (2020-11) Molteni, Luca; Singh, Mittul; Leinonen, Juho; Leino, Katri; Kurimo, Mikko; Della Valle, Emanuele
    A4 Article in conference proceedings
    Crowdsourcing is the go-to solution for data collection and annotation in the context of NLP tasks. Nevertheless, crowdsourced data is noisy by nature; the source is often unknown, and additional validation work is performed to guarantee the dataset’s quality. In this article, we compare two crowdsourcing sources on a dialogue paraphrasing task revolving around a chatbot service. We observe that workers hired on crowdsourcing platforms produce lexically poorer and less diverse rewrites than service users engaged voluntarily. Notably, on dialogue clarity and optimality, the two paraphrase sources’ human-perceived quality does not differ significantly. Furthermore, for the chatbot service, the combined crowdsourced data is enough to train a transformer-based Natural Language Generation (NLG) system. To enable similar services, we also release tools for collecting data and training the dialogue-act-driven, transformer-based NLG module.
  • Speech Recognition for Conversational Finnish
    (2021-01-26) Moisio, Anssi
    School of Electrical Engineering | Master's thesis
    Spontaneous conversational Finnish is a challenging type of speech to recognise due to frequent disfluencies in sentence structure and the use of various informal word forms. This thesis work was an effort to improve the speech recognition accuracy for conversational Finnish. The purpose was to evaluate recent acoustic and language modelling methods on conversational Finnish. The main experiments include evaluating the effect of different speaker embedding approaches and comparing Transformer-XL and recurrent neural language models, using word and subword vocabularies. Combining the best acoustic and language models built during this thesis work improved the word error rate by 4.9 absolute percentage points compared to the previous best result.
  • Subword RNNLM approximations for out-of-vocabulary keyword search
    (2019-01-01) Singh, Mittul; Virpioja, Sami; Smit, Peter; Kurimo, Mikko
    A4 Article in conference proceedings
    In spoken Keyword Search, the query may contain out-of-vocabulary (OOV) words not observed when training the speech recognition system. Using subword language models (LMs) in the first-pass recognition makes it possible to recognize the OOV words, but even the subword n-gram LMs suffer from data sparsity. Recurrent Neural Network (RNN) LMs alleviate the sparsity problems but are not suitable for first-pass recognition as such. One way to solve this is to approximate the RNNLMs by back-off n-gram models. In this paper, we propose to interpolate the conventional n-gram models and the RNNLM approximation for better OOV recognition. Furthermore, we develop a new RNNLM approximation method suitable for subword units: it produces variable-order n-grams to include long-span approximations and also considers n-grams that were not originally observed in the training corpus. To evaluate these models on OOVs, we set up Arabic and Finnish Keyword Search tasks concentrating only on OOV words. On these tasks, interpolating the baseline RNNLM approximation and a conventional LM outperforms the conventional LM in terms of the Maximum Term Weighted Value for single-character subwords. Moreover, replacing the baseline approximation with the proposed method achieves the best performance on both multi- and single-character subwords.
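    A minimal sketch of linearly interpolating next-subword probabilities from a conventional n-gram LM and an RNNLM approximation; the distributions and interpolation weight are illustrative, not taken from the paper:

```python
# Minimal sketch of interpolating next-subword probabilities from a conventional
# n-gram LM and an RNNLM approximation (distributions and weight are illustrative).
def interpolate(p_ngram, p_rnn_approx, lam=0.5):
    """Linear interpolation over the union of both models' subword candidates."""
    candidates = set(p_ngram) | set(p_rnn_approx)
    return {w: lam * p_ngram.get(w, 0.0) + (1 - lam) * p_rnn_approx.get(w, 0.0)
            for w in candidates}

# toy next-subword distributions for some history; subword units let the
# interpolated LM still score words never seen as whole units in training
p_ngram = {"to": 0.6, "doc": 0.3, "yl": 0.1}
p_rnn_approx = {"doc": 0.8, "yli": 0.15, "to": 0.05}
print(interpolate(p_ngram, p_rnn_approx, lam=0.4))
```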