Browsing by Author "Kurimo, Mikko"
Now showing 1 - 20 of 198
Item: The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general purpose audio tagging (2018-11)
Xu, Zhicun; Smit, Peter; Kurimo, Mikko; Dept Signal Process and Acoust; Centre of Excellence in Computational Inference, COIN; Speech Recognition
In this paper, we present a neural network system for DCASE 2018 task 2, general-purpose audio tagging. We fine-tuned the Google AudioSet feature generation model with different settings for the given 41 classes, on top of a fully connected layer with 100 units, and used the fine-tuned models to generate 128-dimensional features for each 0.96 s audio segment. We tried different neural network structures, including LSTM and multi-level attention models; in our experiments, the multi-level attention model showed its superiority over the others. Truncating the silent parts, repeating and splitting the audio into fixed-length segments, pitch-shifting augmentation, and mixup techniques were all used in our experiments. The proposed system achieved a MAP@3 score of 0.936, which outperforms the baseline result of 0.704 and placed in the top 8% of the public leaderboard.

Item: Aalto system for the 2017 Arabic multi-genre broadcast challenge (2018)
Smit, Peter; Gangireddy, Siva; Enarvi, Seppo; Virpioja, Sami; Kurimo, Mikko; Dept Signal Process and Acoust; Centre of Excellence in Computational Inference, COIN; Speech Recognition
We describe the speech recognition systems we created for MGB-3, the 3rd Multi Genre Broadcast challenge, which this year consisted of building a system for transcribing Egyptian Dialect Arabic speech, using a large audio corpus of primarily Modern Standard Arabic speech and only a small amount (5 hours) of Egyptian adaptation data. Our system, a combination of different acoustic models, language models, and lexical units, achieved a Multi-Reference Word Error Rate of 29.25%, the lowest in the competition. On the old MGB-2 task, which was run again to indicate progress, we also achieved the lowest error rate: 13.2%. The result combines state-of-the-art speech recognition methods such as simple dialect adaptation for a Time-Delay Neural Network (TDNN) acoustic model (-27% errors compared to the baseline), Recurrent Neural Network Language Model (RNNLM) rescoring (an additional -5%), and system combination with Minimum Bayes Risk (MBR) decoding (yet another -10%). We also explored the use of morph and character language models, which were particularly beneficial in providing a rich pool of systems for the MBR decoding.

Item: Acoustic model and language model adaptation for a mobile dictation service (Aalto University, 2010)
Mansikkaniemi, André; Kurimo, Mikko; Elektroniikan, tietoliikenteen ja automaation tiedekunta; Sams, Mikko
Automatic speech recognition is a machine-driven method by which speech is converted into text. MobiDic is a mobile dictation service that uses a server-based automatic speech recognition system to convert speech recorded on a mobile phone into readable and editable text documents. This work examined the ability of the Helsinki University of Technology speech recognition system to convert law-related speech, recorded on a mobile phone with the MobiDic client application, into correct text. There were differences between the test and training data in both acoustics and language. The acoustic background models were trained on speech recorded with a computer microphone, and the language models were trained on text from various newspapers and news services. Because of the special character of the test data, the focus of the work was on improving the system's recognition performance through adaptation of acoustic models and language models. Acoustic model adaptation gives the best and most reliable improvements in recognition performance: using the global cMLLR method and only 2 minutes of adaptation data, the number of misrecognized words can be reduced by 15-22%, and the regression-class-based cMLLR method yields further improvements when larger amounts of adaptation data (> 10 min) are available. Language model adaptation gave no significant improvement; the main problem was the large difference between the language adaptation data and the language used in the law-related recordings.
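The global cMLLR adaptation in the entry above amounts to a single affine transform applied to every feature vector. Below is a minimal sketch of applying such a transform, assuming the matrix A and bias b have already been estimated from adaptation data against the acoustic model (the estimation step, which is the expensive part, is not shown; names are illustrative):

```python
import numpy as np

def apply_cmllr(features, A, b):
    """Apply a global cMLLR feature-space transform x' = A x + b.

    features: (T, D) array of acoustic frames (e.g. MFCCs)
    A: (D, D) transform matrix, b: (D,) bias, both estimated
    beforehand from a few minutes of adaptation data.
    """
    return features @ A.T + b

# Hypothetical usage: an identity transform leaves the frames unchanged.
T, D = 100, 13
frames = np.random.randn(T, D)
adapted = apply_cmllr(frames, np.eye(D), np.zeros(D))
```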
Item: Acoustic Model Compression with MAP adaptation (Linköping University Electronic Press, 2017)
Leino, Katri; Kurimo, Mikko; Dept Signal Process and Acoust; Tiedemann, Jörg; Centre of Excellence in Computational Inference, COIN; Speech Recognition
Speaker adaptation is an important step in optimizing and personalizing the performance of automatic speech recognition (ASR) for individual users. While many applications target rapid adaptation through various global transformations, slower adaptation that achieves a higher level of personalization would be useful for many active ASR users, especially those whose speech is not recognized well. This paper studies the outcome of combining maximum a posteriori (MAP) adaptation with compression of Gaussian mixture models. An important result that has not received much previous attention is that MAP adaptation can be utilized to radically decrease the size of the models as they get tuned to a particular speaker. This is particularly relevant for small personal devices, which should provide accurate recognition in real time despite limited memory, computation, and power. With our method we are able to decrease the model complexity with MAP adaptation while increasing the accuracy.

Item: Adaptation of Neural Network Language Models for Speech Recognition (2020-01-20)
Neralla, Vasumathi; Leinonen, Juho; Sähkötekniikan korkeakoulu; Kurimo, Mikko

Item: Adaptiivisten vektorikvantisointimenetelmien ja kätkettyjen Markov-mallien kombinaatioita puheentunnistuksessa (1992)
Kurimo, Mikko; Torkkola, Kari; Tietotekniikan osasto; Teknillinen korkeakoulu; Helsinki University of Technology; Ruusunen, Jukka

Item: An adaptive method to achieve speaker independence in a speech recognition system (1999)
Siivola, Vesa; Kurimo, Mikko; Sähkö- ja tietoliikennetekniikan osasto; Teknillinen korkeakoulu; Helsinki University of Technology; Oja, Erkki
This Master's thesis explores ways to improve the accuracy of a speech recognizer when it has not been trained on the user's speech. Several approaches are examined, from the choice of the recognizer's basic model to techniques that reduce the effect of background noise and adapt the speech model to the user's speaking style. Both noise compensation and model adaptation take place while the device is in use, and no enrollment session is required. These properties are important when building a public service, such as an automatic flight booking system, where the user cannot be burdened with an enrollment session. The recognizer used in this work is based on hidden Markov models. Instead of a single phoneme, the transition from one phoneme to another is tried as the recognizer's basic unit, and the latter is found to work better with a comparable number of parameters. Cepstral mean normalization is used to compensate for factors that alter the extracted features consistently; this method works reasonably well. The algorithm used to adapt the speech model is derived from maximum a posteriori adaptation and self-organizing maps. Adaptation works well with the single-phoneme model, but with phoneme transitions the results hardly improve, for reasons discussed in the thesis.
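The cepstral mean normalization used in the thesis above removes any effect that shifts the extracted features consistently, such as a fixed recording channel. A minimal sketch of the per-utterance form (variance normalization is a common further step not used here):

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra: (T, D) array of cepstral frames. A constant convolutive
    channel effect becomes an additive constant in the cepstral
    domain, so subtracting the long-term mean largely cancels it.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```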
Item: Advances in subword-based HMM-DNN speech recognition across languages (Academic Press Inc., 2021-03)
Smit, Peter; Virpioja, Sami; Kurimo, Mikko; Dept Signal Process and Acoust; Centre of Excellence in Computational Inference, COIN; Speech Recognition
We describe a novel way to implement subword language models in speech recognition systems based on weighted finite-state transducers, hidden Markov models, and deep neural networks. The acoustic models are built on graphemes so that no pronunciation dictionaries are needed, and they can be used together with any type of subword language model, including character models. The advantages of short subword units are good lexical coverage, reduced data sparsity, and avoidance of vocabulary mismatches in adaptation. Moreover, constructing neural network language models (NNLMs) is more practical, because the input and output layers are small. We also propose methods for combining the benefits of different types of language model units by reconstructing and combining the recognition lattices. We present an extensive evaluation of various subword units on speech datasets in four languages: Finnish, Swedish, Arabic, and English. The results show that the benefits of short subwords are even more consistent with NNLMs than with traditional n-gram language models. Combination across different acoustic models and language models with various units improves the results further. For all four datasets we obtain the best results published so far. Our approach performs well even for English, where phoneme-based acoustic models and word-based language models typically dominate: the phoneme-based baseline performance can be reached and improved by 4% using graphemes only, when several grapheme-based models are combined. Furthermore, combining both grapheme and phoneme models yields the state-of-the-art error rate of 15.9% on the MGB 2018 dev17b test. For all four languages we also show that the language models perform reasonably well when only limited training data is available.

Item: Advancing Audio Emotion and Intent Recognition with Large Pre-Trained Models and Bayesian Inference (2023-10-27)
Porjazovski, Dejan; Getman, Yaroslav; Grósz, Tamás; Kurimo, Mikko; Department of Information and Communications Engineering; Speech Recognition
Large pre-trained models are essential in paralinguistic systems, demonstrating effectiveness in tasks like emotion recognition and stuttering detection. In this paper, we employ large pre-trained models for the ACM Multimedia Computational Paralinguistics Challenge, addressing the Requests and Emotion Share tasks. We explore audio-only and hybrid solutions leveraging audio and text modalities. Our empirical results consistently show the superiority of the hybrid approaches over the audio-only models. Moreover, we introduce a Bayesian layer as an alternative to the standard linear output layer. The multimodal fusion approach achieves an 85.4% UAR on HC-Requests and 60.2% on HC-Complaints. The ensemble model for the Emotion Share task yields the best score of 0.614. The Bayesian wav2vec2 approach explored in this study allows us to easily build ensembles at the cost of fine-tuning only one model, and it yields usable confidence values instead of the usual overconfident posterior probabilities.
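A sketch of the kind of Bayesian output layer the paper above describes, assuming a mean-field Gaussian posterior over the weights; sampling the layer several times at test time yields an ensemble from a single fine-tuned model. The layer and variable names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over weights."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.w_rho = nn.Parameter(torch.full((d_out, d_in), -5.0))
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        sigma = F.softplus(self.w_rho)                   # std > 0
        w = self.w_mu + sigma * torch.randn_like(sigma)  # one weight sample
        return F.linear(x, w, self.bias)

# "Free" ensemble: average class posteriors over several weight samples.
layer = BayesianLinear(768, 2)   # e.g. pooled wav2vec2 features -> 2 classes
x = torch.randn(4, 768)
probs = torch.stack([F.softmax(layer(x), dim=-1) for _ in range(10)]).mean(0)
```

The spread of the sampled predictions also provides the confidence estimates the paper highlights, in contrast to a single overconfident softmax.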
Item: Akustisten mallien adaptointi kielten yli puhujariippumattomassa puheentunnistuksessa (Aalto University, 2010)
Karhila, Reima; Kurimo, Mikko; Elektroniikan, tietoliikenteen ja automaation tiedekunta; Alku, Paavo
High-quality speech recognition requires a recognition system that can adapt to the speaker's voice and speaking style. Most speech recognition systems are built for linguistically homogeneous user groups. As user groups increasingly consist of people from different linguistic backgrounds, there is a growing need for effective multilingual speech recognition that accounts not only for dialects and accents but also for different languages. This work studied how acoustic models of English and Finnish speech can be combined to build a multilingual speech recognizer, and how speaker adaptation works in these systems within and across languages, using speech data from one language for adaptation in the other. Recognizers were built on large Finnish and English speech corpora and tested on both monolingual and bilingual material. The results show that, when combining English and Finnish acoustic models, the threshold for safe clustering is so low that combining them does little to improve recognizer efficiency. The results also show that recognition of native Finnish could be improved using data of English spoken as a foreign language. This mechanism worked in one direction only: recognition of English spoken as a foreign language could not be improved with native Finnish data.

Item: Application of Learning Vector Quantization and Self-Organizing Maps for training continuous density and semi-continuous Markov models (1994)
Kurimo, Mikko; Tietotekniikan osasto; Teknillinen korkeakoulu; Helsinki University of Technology; Kohonen, Teuvo

Item: Approaching human performance in noise robust automatic speech recognition (Aalto University, 2014)
Keronen, Sami; Palomäki, Kalle; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko
Modern automatic speech recognition systems can achieve human-like performance on read speech in relatively noise-free environments. In the presence of heavily deteriorating noise, however, the gap between human and machine recognition remains large. The work presented in this thesis aims to enhance speech recognition performance in varying noise and low signal-to-noise-ratio conditions by improving the short-time spectral analysis of the speech signal and the spectrographic mask estimation in the missing-data framework. In the thesis, the fast Fourier transform based spectrum estimation of Mel-frequency cepstral coefficients is substituted with extended weighted linear prediction. Temporal weighting in linear predictive analysis emphasizes high-amplitude samples, which are assumed less corrupted by noise, and attenuates the others. Extending the weighting to apply separately to each lag in the prediction of each sample arguably offers more modeling power for deteriorated speech. Extended weighted linear prediction is shown to exceed the recognition performance of conventional linear prediction, weighted linear prediction, and fast Fourier transform based feature extraction. Missing-data methods assume that only part of the spectro-temporal components of the deteriorated signal are corrupted by noise, while the speech-dominant components hold reliable information that can be used in recognition. Two spectrographic mask estimation techniques based on binary classification of features are proposed in the thesis: the first is founded on a comprehensive set of designed features, the second on a Gaussian-Bernoulli restricted Boltzmann machine that learns the feature set automatically. Both mask estimation methods are shown to outperform their respective reference methods in recognition accuracy. All the proposed noise-robust techniques are immediately applicable to automatic speech recognition. With further refinement, the mask estimation methods could also be applied to hearing aids, since they attenuate the background noise and thus increase speech intelligibility.
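Weighted linear prediction, the basis of the extended variant in the thesis above, fits the usual LP model but lets each sample contribute according to a weight, often the short-time energy, so high-amplitude and presumably cleaner samples dominate the fit. A minimal sketch of the simple (non-extended) weighting as a weighted least-squares problem, with illustrative names:

```python
import numpy as np

def weighted_lpc(x, order, weights):
    """Weighted linear prediction coefficients.

    Minimizes sum_n w[n] * (x[n] - sum_k a[k] * x[n-k])^2 by weighted
    least squares. In the extended variant each lag k would get its
    own weighting; here one weight per sample is used.
    """
    T = len(x)
    rows = np.array([x[n - order:n][::-1] for n in range(order, T)])
    targets = x[order:]
    w = np.sqrt(weights[order:])
    a, *_ = np.linalg.lstsq(rows * w[:, None], targets * w, rcond=None)
    return a

# Hypothetical usage with short-time-energy weights.
x = np.random.randn(400)
energy = np.convolve(x**2, np.ones(20), mode="same") + 1e-6
coeffs = weighted_lpc(x, order=10, weights=energy)
```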
Item: Attention-Based End-To-End Named Entity Recognition From Speech (2021)
Porjazovski, Dejan; Leinonen, Juho; Kurimo, Mikko; Dept Signal Process and Acoust; Ekštein, Kamil; Pártl, František; Konopík, Miloslav; Speech Recognition
Named entities are heavily used in the field of spoken language understanding, which uses speech as input. The standard way of doing named entity recognition from speech involves a pipeline of two systems: first the automatic speech recognition system generates the transcripts, and then the named entity recognition system produces the named entity tags from the transcripts. In such cases, the automatic speech recognition and named entity recognition systems are trained independently, so the automatic speech recognition branch is not optimized for named entity recognition and vice versa. In this paper, we propose two attention-based approaches for extracting named entities from speech in an end-to-end manner, which show promising results. We compare both approaches on Finnish, Swedish, and English datasets, underlining their strengths and weaknesses.

Item: Audio Event Classification Using Deep Learning Methods (2018-12-10)
Xu, Zhicun; Smit, Peter; Sähkötekniikan korkeakoulu; Kurimo, Mikko
Whether crossing the road or enjoying a concert, sound carries important information about the world around us. Audio event classification refers to recognition tasks that assign one or several labels, such as 'dog bark' or 'doorbell', to a particular audio signal. Teaching machines to conduct this classification task can therefore help humans in many fields. Since deep learning has shown great potential and usefulness in many AI applications, this thesis focuses on studying deep learning methods and building suitable neural networks for the audio event classification task. To evaluate the performance of different neural networks, we tested them on both Google AudioSet and the dataset for DCASE 2018 Task 2. Instead of providing the original audio files, AudioSet offers compact 128-dimensional embeddings produced by a modified VGG model for audio with a frame length of 960 ms. For DCASE 2018 Task 2, we first preprocessed the soundtracks and then fine-tuned the VGG model that AudioSet used as a feature extractor, so each soundtrack from both tasks is represented as a series of 128-dimensional features. We then compared DNN, LSTM, and multi-level attention models with different hyperparameters. The results show that fine-tuning the feature generation model for the DCASE task greatly improved the evaluation score. In addition, the attention models performed best in our settings for both tasks. The results indicate that utilizing a CNN-like model as a feature extractor for the log-mel spectrograms and modeling the dynamics with an attention model can achieve state-of-the-art results in audio event classification. For future research, the thesis suggests training a better CNN model for feature extraction, utilizing multi-scale and multi-level features for better classification, and combining the audio features with other multimodal information for audiovisual data analysis.
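A sketch of single-level attention pooling over a sequence of 128-dimensional AudioSet-style embeddings, in the spirit of the attention models the thesis above compares; the multi-level variant stacks several such blocks. This is an assumed, simplified form, not the thesis code:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weighted average of frame-level predictions with learned weights."""
    def __init__(self, d_emb=128, n_classes=41):
        super().__init__()
        self.attn = nn.Linear(d_emb, n_classes)   # per-frame attention scores
        self.clf = nn.Linear(d_emb, n_classes)    # per-frame class predictions

    def forward(self, x):                          # x: (B, T, 128)
        # Normalize attention over time so each class pools its own frames.
        alpha = torch.softmax(self.attn(x), dim=1) # (B, T, C)
        preds = torch.sigmoid(self.clf(x))         # (B, T, C)
        return (alpha * preds).sum(dim=1)          # (B, C) clip-level output

model = AttentionPooling()
clip = torch.randn(8, 10, 128)   # 8 clips, 10 frames of 0.96 s each
probs = model(clip)
```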
Item: Audiovisual Speaker Clustering for News Broadcast Videos (2015-06-10)
Kayal, Subhradeep; Laaksonen, Jorma; Perustieteiden korkeakoulu; Kurimo, Mikko

Item: Augmentation, Oversampling and Curriculum Learning for Small Imbalanced Speech Data (2023-12-11)
Lun, Tin; Voskoboinik, Ekaterina; Al-Ghezi, Ragheb; Sähkötekniikan korkeakoulu; Kurimo, Mikko
Automatic speech recognition (ASR) systems have seen remarkable breakthroughs in recent years, which has in turn fostered the development of ASR-supported automatic speaking assessment (ASA) systems. Their advancement, however, faces two main challenges: data scarcity and data imbalance, especially in languages such as Finnish and Finland Swedish. This thesis explores methods that alleviate these two challenges when training ASR and ASA systems for second-language (L2) speakers; such systems appear in applications like language learning apps and language proficiency tests. Training such ASR systems requires transcribed L2 speech data, which is scarce in most languages, and the proficiency scores required to train ASA systems are very expensive to obtain, so it is important to maximize the utilization of existing datasets. This study works with an L2 Finnish dataset and an L2 Finland Swedish dataset, both small (approx. 14 hours or less) and imbalanced: intermediate proficiency levels are well represented, while beginner and advanced levels have only very few samples. Four methods were explored: 1) audio augmentation, 2) augmentation using text-to-speech (TTS) synthesizers, 3) oversampling with augmentation, and 4) class-wise curriculum learning. For improving ASR performance on L2 speech, audio augmentation is shown to be effective, while augmentation with a TTS synthesizer has a positive impact mainly on speech of lower proficiency. For ASA training, audio augmentation alone does not yield significant improvement, while its combination with oversampling leads to the best results. Lastly, class-wise curriculum learning is shown to be less effective than the other methods in our experiments.
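A minimal sketch of the oversampling-with-augmentation idea from the thesis above: minority-class utterances are duplicated up to the majority count, with each copy perturbed so duplicates are not literally identical. The noise-based perturbation here is one simple choice among many (the thesis also uses speed and TTS-based augmentation); names are illustrative:

```python
import random
from collections import defaultdict
import numpy as np

def oversample_with_augmentation(samples, labels, noise_std=0.005):
    """Balance classes by duplicating minority-class audio with added noise.

    samples: list of 1-D numpy waveforms; labels: parallel list of
    class labels (e.g. proficiency levels).
    """
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)
    target = max(len(v) for v in by_label.values())
    out_x, out_y = [], []
    for y, xs in by_label.items():
        out_x.extend(xs)
        out_y.extend([y] * len(xs))
        for _ in range(target - len(xs)):          # top up minority classes
            x = random.choice(xs)
            out_x.append(x + noise_std * np.random.randn(len(x)))
            out_y.append(y)
    return out_x, out_y
```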
Item: Automaattiset foneemirajojen tunnistusmenetelmät (2023-05-31)
Ylä-Outinen, Juho; Kurimo, Mikko; Sähkötekniikan korkeakoulu; Aalto, Samuli

Item: Automated Assessment of Task Completion in Spontaneous Speech for Finnish and Finland Swedish Language Learners (2023-05-16)
Voskoboinik, Ekaterina; Getman, Yaroslav; Al-Ghezi, Ragheb; Kurimo, Mikko; Grosz, Tamas; Department of Information and Communications Engineering; Speech Recognition
This study investigates the feasibility of automated content scoring for spontaneous spoken responses from Finnish and Finland Swedish learners. Our experiments reveal that pretrained Transformer-based models outperform the tf-idf baseline in automatic task-completion grading. Furthermore, we demonstrate that pre-fine-tuning these models to differentiate between responses to distinct prompts enhances subsequent task-completion fine-tuning: the classifiers learn faster and produce predictions that correlate more strongly with human grading when task differences are accounted for. Additionally, we find that employing similarity learning, as opposed to conventional classification fine-tuning, further improves the results; it is especially helpful to learn not just the similarities between responses in one score bin, but the exact differences between the average human scores the responses received. Lastly, we demonstrate that models applied to manual and ASR transcripts yield comparable correlations with human grading.

Item: Automatic Assessment of Spoken Lexico-Grammatical Proficiency in L2 Finnish and Swedish (2022-07-29)
Akiki, Clara; Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Perustieteiden korkeakoulu; Kurimo, Mikko

Item: Automatic Construction of the Finnish Parliament Speech Corpus (2017-08)
Mansikkaniemi, Andre; Smit, Peter; Kurimo, Mikko; Dept Signal Process and Acoust; Centre of Excellence in Computational Inference, COIN; Speech Recognition
Automatic speech recognition (ASR) systems require large amounts of transcribed speech data for training state-of-the-art deep neural network (DNN) acoustic models. Transcribed speech is a scarce and expensive resource, and ASR systems are prone to underperform in domains where little training data is available. In this work, we open up a vast and previously unused resource of transcribed speech for Finnish by retrieving and aligning all the recordings and meeting transcripts from the web portal of the Parliament of Finland. Short speech-text segment pairs are retrieved from the audio and text material by using the Levenshtein algorithm to align the first-pass ASR hypotheses with the corresponding meeting transcripts. DNN acoustic models are trained on the automatically constructed corpus, and their performance is compared to models trained on a commercially available speech corpus. Model performance is evaluated on Finnish parliament speech, with the test set divided into seen and unseen speakers, and also on broadcast speech to test the general applicability of the parliament speech corpus. We also study the use of meeting transcripts in language model adaptation to achieve additional gains in recognition accuracy for Finnish parliament speech.
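The corpus construction in the entry above hinges on aligning first-pass ASR hypotheses with the official transcripts. A minimal sketch of word-level Levenshtein alignment by dynamic programming, from which long runs of exact matches can then be cut into speech-text segment pairs (simplified; the paper's pipeline involves additional filtering):

```python
def levenshtein_align(hyp, ref):
    """Word-level alignment of an ASR hypothesis with a transcript.

    Returns a list of (hyp_word_or_None, ref_word_or_None) pairs;
    long runs of exact matches indicate reliably aligned segments.
    """
    n, m = len(hyp), len(ref)
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match/substitution
    # Trace back from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            pairs.append((hyp[i - 1], ref[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((hyp[i - 1], None)); i -= 1
        else:
            pairs.append((None, ref[j - 1])); j -= 1
    return pairs[::-1]

hyp = "the meeting will now begin".split()
ref = "the meeting begins now".split()
print(levenshtein_align(hyp, ref))
```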