Browsing by Author "Smit, Peter"
Now showing 1 - 20 of 29
- The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general purpose audio tagging
A4 Article in conference proceedings (2018-11) Xu, Zhicun; Smit, Peter; Kurimo, Mikko
In this paper, we present a neural network system for DCASE 2018 task 2, general-purpose audio tagging. We fine-tuned the Google AudioSet feature generation model with different settings for the given 41 classes on top of a fully connected layer with 100 units. We then used the fine-tuned models to generate 128-dimensional features for each 0.960 s audio segment. We tried different neural network structures, including LSTM and multi-level attention models. In our experiments, the multi-level attention model showed its superiority over the others. Truncating the silent parts, repeating and splitting the audio into fixed-length segments, pitch-shifting augmentation, and mixup techniques were all used in our experiments. The proposed system achieved a MAP@3 score of 0.936, which outperforms the baseline result of 0.704 and places in the top 8% of the public leaderboard.
- Aalto system for the 2017 Arabic multi-genre broadcast challenge
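The attention pooling at the core of the multi-level attention models mentioned in the DCASE 2018 entry above can be sketched as a single-level weighted average over frame embeddings. This is an illustrative toy (pure Python, tiny 4-dimensional vectors), not the authors' implementation; `attention_pool` and the scoring vector `w` are hypothetical names:

```python
import math

def attention_pool(frames, w):
    """Pool a list of frame embeddings (each a list of floats) into one
    clip-level embedding using a softmax over per-frame relevance scores."""
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frames]
    m = max(scores)
    exp_s = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exp_s)
    alphas = [e / total for e in exp_s]         # attention weights, sum to 1
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]

# Toy example: three 4-dimensional "frame embeddings" (AudioSet features are
# 128-dimensional; 4 dimensions keep the sketch readable).
frames = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
clip = attention_pool(frames, w=[0.0, 0.0, 0.0, 0.0])  # uniform scores -> mean
```

With a zero scoring vector the softmax is uniform and the pooling reduces to a plain mean; a trained `w` would instead emphasize the frames that carry the tagged event.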
A4 Article in conference proceedings (2018) Smit, Peter; Gangireddy, Siva; Enarvi, Seppo; Virpioja, Sami; Kurimo, Mikko
We describe the speech recognition systems we have created for MGB-3, the 3rd Multi Genre Broadcast challenge, which this year consisted of the task of building a system for transcribing Egyptian Dialect Arabic speech, using a large audio corpus of primarily Modern Standard Arabic speech and only a small amount (5 hours) of Egyptian adaptation data. Our system, a combination of different acoustic models, language models, and lexical units, achieved a Multi-Reference Word Error Rate of 29.25%, the lowest in the competition. Also on the old MGB-2 task, which was run again to indicate progress, we achieved the lowest error rate: 13.2%. The result comes from the application of state-of-the-art speech recognition methods such as simple dialect adaptation for a Time-Delay Neural Network (TDNN) acoustic model (-27% errors compared to the baseline), Recurrent Neural Network Language Model (RNNLM) rescoring (an additional -5%), and system combination with Minimum Bayes Risk (MBR) decoding (yet another -10%). We also explored the use of morph and character language models, which was particularly beneficial in providing a rich pool of systems for the MBR decoding.
- Advances in subword-based HMM-DNN speech recognition across languages
A1 Original article in a scientific journal (2021-03) Smit, Peter; Virpioja, Sami; Kurimo, Mikko
We describe a novel way to implement subword language models in speech recognition systems based on weighted finite-state transducers, hidden Markov models, and deep neural networks. The acoustic models are built on graphemes so that no pronunciation dictionaries are needed, and they can be used together with any type of subword language model, including character models. The advantages of short subword units are good lexical coverage, reduced data sparsity, and the avoidance of vocabulary mismatches in adaptation. Moreover, constructing neural network language models (NNLMs) is more practical, because the input and output layers are small. We also propose methods for combining the benefits of different types of language model units by reconstructing and combining the recognition lattices. We present an extensive evaluation of various subword units on speech datasets of four languages: Finnish, Swedish, Arabic, and English. The results show that the benefits of short subwords are even more consistent with NNLMs than with traditional n-gram language models. Combination across different acoustic models and language models with various units improves the results further. For all four datasets we obtain the best results published so far. Our approach performs well even for English, where phoneme-based acoustic models and word-based language models typically dominate: the phoneme-based baseline performance can be reached and improved upon by 4% using graphemes only, when several grapheme-based models are combined. Furthermore, combining both grapheme and phoneme models yields the state-of-the-art error rate of 15.9% for the MGB 2018 dev17b test. For all four languages we also show that the language models perform reasonably well when only limited training data is available.
- Audio Event Classification Using Deep Learning Methods
School of Electrical Engineering | Master's thesis (2018-12-10) Xu, Zhicun
Whether crossing the road or enjoying a concert, sound carries important information about the world around us. Audio event classification refers to recognition tasks involving the assignment of one or several labels, such as ‘dog bark’ or ‘doorbell’, to a particular audio signal. Thus, teaching machines to conduct this classification task can help humans in many fields. Since deep learning has shown great potential and usefulness in many AI applications, this thesis focuses on studying deep learning methods and building suitable neural networks for the audio event classification task. In order to evaluate the performance of different neural networks, we tested them on both Google AudioSet and the dataset for DCASE 2018 Task 2. Instead of providing original audio files, AudioSet offers compact 128-dimensional embeddings output by a modified VGG model for audio with a frame length of 960 ms. For DCASE 2018 Task 2, we first preprocessed the soundtracks and then fine-tuned the VGG model that AudioSet used as a feature extractor. Thus, each soundtrack from both tasks is represented as a series of 128-dimensional features. We then compared the DNN, LSTM, and multi-level attention models with different hyperparameters. The results show that fine-tuning the feature generation model for the DCASE task greatly improved the evaluation score. In addition, the attention models were found to perform best in our settings for both tasks. The results indicate that utilizing a CNN-like model as a feature extractor for the log-mel spectrograms and modeling the dynamics with an attention model can achieve state-of-the-art results in audio event classification.
For future research, the thesis suggests training a better CNN model for feature extraction, utilizing multi-scale and multi-level features for better classification, and combining the audio features with other multimodal information for audiovisual data analysis.
- Automatic Construction of the Finnish Parliament Speech Corpus
A4 Article in conference proceedings (2017-08) Mansikkaniemi, André; Smit, Peter; Kurimo, Mikko
Automatic speech recognition (ASR) systems require large amounts of transcribed speech data for training state-of-the-art deep neural network (DNN) acoustic models. Transcribed speech is a scarce and expensive resource, and ASR systems are prone to underperform in domains where little training data is available. In this work, we open up a vast and previously unused resource of transcribed speech for Finnish by retrieving and aligning all the recordings and meeting transcripts from the web portal of the Parliament of Finland. Short speech-text segment pairs are retrieved from the audio and text material by using the Levenshtein algorithm to align the first-pass ASR hypotheses with the corresponding meeting transcripts. DNN acoustic models are trained on the automatically constructed corpus, and performance is compared to other models trained on a commercially available speech corpus. Model performance is evaluated on Finnish parliament speech by dividing the test set into seen and unseen speakers. Performance is also evaluated on broadcast speech to test the general applicability of the parliament speech corpus. We also study the use of meeting transcripts in language model adaptation to achieve additional gains in speech recognition accuracy on Finnish parliament speech.
- Automatic Speech Recognition for Human-Robot Interaction Using an Under-Resourced Language
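The Levenshtein-based alignment step in "Automatic Construction of the Finnish Parliament Speech Corpus" above, aligning first-pass ASR hypotheses with meeting transcripts, can be sketched as a word-level dynamic-programming alignment. This is a minimal sketch of the general technique, not the paper's segment-retrieval pipeline; the function name is illustrative:

```python
def levenshtein_align(hyp, ref):
    """Align two token sequences by minimum edit distance and return
    (hyp_token, ref_token) pairs; None marks an insertion or deletion."""
    n, m = len(hyp), len(ref)
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # token only in hyp
                           dp[i][j - 1] + 1,         # token only in ref
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace to recover the aligned pairs, preferring matches.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if hyp[i - 1] == ref[j - 1] else 1)):
            pairs.append((hyp[i - 1], ref[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((hyp[i - 1], None)); i -= 1
        else:
            pairs.append((None, ref[j - 1])); j -= 1
    return list(reversed(pairs))

# Toy example: the recognizer inserted one extra word.
hyp = "the meeting is now open".split()
ref = "the meeting is open".split()
pairs = levenshtein_align(hyp, ref)
```

Runs of matching pairs mark stretches where the hypothesis agrees with the transcript; in a corpus-construction setting those stretches are the candidates for reliable speech-text segment pairs.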
School of Electrical Engineering | Master's thesis (2015-08-24) Leinonen, Juho
Automatic speech recognition will soon be a part of everyday life. Even today many people use the speech recognizer in their smartphones, whether it is Google Now or Siri. Commercial applications have existed for years for automatic dictation and command-based voice user interfaces. The abundance of software divides languages in two: in well-resourced languages there is no shortage of products, while under-resourced languages might not even receive academic interest. In this thesis, an automatic speech recognizer is built for North Sami, a morphologically rich under-resourced language in the Uralic family. These properties create challenges for the recognition process, of which this thesis concentrates on the issue of out-of-vocabulary words. The use of whole words is compared with word fragments (morphs), and tests are conducted to optimize other language model variables such as vocabulary size and context length. The experiments show that morph-based language models solve the problem of out-of-vocabulary words and significantly improve the recognition results without slowing the process too much. In addition, increasing context length improves the morph models, while adding supervision to their generation does not. As such, this thesis recommends a high-order morph model generated with unsupervised methods to be used for North Sami.
- Automatic Speech Recognition for Northern Sámi with comparison to other Uralic Languages
School of Electrical Engineering | A4 Article in conference proceedings (2016) Smit, Peter; Leinonen, Juho; Jokinen, Kristiina; Kurimo, Mikko
Speech technology applications for major languages are becoming widely available, but for many other languages there is no commercial interest in developing speech technology. As the lack of technology and applications threatens the existence of these languages, it is important to study how to create speech recognizers with minimal effort and low resources. As a test case, we have developed a Large Vocabulary Continuous Speech Recognizer for Northern Sámi, a Finno-Ugric language with few speech technology resources available. Using only limited audio data, 2.5 hours, and the Northern Sámi Wikipedia for the language model, we achieved a 7.6% Letter Error Rate (LER). With a language model based on a higher-quality language corpus we achieved 4.2% LER. To put this in perspective, we also trained systems in other, better-resourced Finno-Ugric languages (Finnish and Estonian) with the same amount of data and compared those to state-of-the-art systems in those languages.
- Automatic Speech Recognition Principle and Demo
School of Electrical Engineering | Master's thesis (2018-02-12) Shi, Changtai
- Automatic Speech Recognition with Very Large Conversational Finnish and Estonian Vocabularies
A1 Original article in a scientific journal (2017-11) Enarvi, Seppo; Smit, Peter; Virpioja, Sami; Kurimo, Mikko
Today, the vocabulary size for language models in large-vocabulary speech recognition is typically several hundred thousand words. While this is already sufficient in some applications, out-of-vocabulary words still limit the usability in others. In agglutinative languages, the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important.
- CaptainA: Integrated pronunciation practice and data collection portal
A4 Article in conference proceedings (2018) Rouhe, Aku; Karhila, Reima; Elg, Aija; Toivola, Minnaleena; Khandelwal, Mayank; Smit, Peter; Smolander, Anna Riikka; Kurimo, Mikko
We demonstrate CaptainA, a computer-assisted pronunciation training portal. It is aimed at university students, who read passages aloud and receive automatic feedback based on speech recognition and phoneme classification. Later, their teacher can provide more accurate feedback and comments through the portal. The system enables better independent practice. It also acts as a data collection method: we aim to gather both good-quality second-language speech data with segmentations and teacher-given evaluations of pronunciation.
- Character-based units for Unlimited Vocabulary Continuous Speech Recognition
A4 Article in conference proceedings (2018) Smit, Peter; Gangireddy, Siva; Enarvi, Seppo; Virpioja, Sami; Kurimo, Mikko
We study character-based language models in a state-of-the-art speech recognition framework. This approach has advantages over both word-based systems and so-called end-to-end ASR systems that do not have separate acoustic and language models. We describe the modifications needed to build an effective character-based ASR system using the Kaldi toolkit and evaluate models based on words, statistical morphs, and characters for both Finnish and Arabic. The morph-based models yield the best recognition results for both well-resourced and lower-resourced tasks, but the character-based models come close to their performance in the lower-resourced tasks, outperforming the word-based models. Character-based models are especially good at predicting novel word forms that were not seen in the training data. Using character-based neural network language models is computationally efficient and provides a larger gain than it does for the morph- and word-based systems.
- Creating synthetic voices for children by adapting adult average voice using stacked transformations and VTLN
A4 Article in conference proceedings (2012) Karhila, Reima; Doddipatla, Rama Sanand; Kurimo, Mikko; Smit, Peter
- Development of the Finnish Spoken Dialog System for an Educational Robot
School of Electrical Engineering | Master's thesis (2017-04-03) Sallinen, Niklas
Spoken dialog systems are becoming part of everyday life, for example in personal assistants such as Siri from Apple. However, spoken dialog systems could be used in a vast range of products. In this thesis, a spoken dialog system prototype was developed for use in an educational robot. The main problem for an educational robot is recognizing children's speech. Children's speech varies significantly between speakers, which makes it more difficult to recognize with a single acoustic model. The main focus of the thesis is on speech recognition and adaptation. The acoustic model used is trained with data gathered from adults and then adapted with data from children. The adaptation is done both for each speaker separately and as an average child adaptation. The results are compared to the commercial speech recognizer developed by Google Inc. The experiments show that, when adapting the adult model with data from each speaker separately, the word error rate can be decreased from 8.1% to 2.4%, and with the average adaptation to 3.1%. The adaptation used was a combination of vocal tract length normalization (VTLN) and constrained maximum likelihood linear regression (CMLLR). In comparison, the word error rate of the commercial product is 7.4%.
- Digitala: An augmented test and review process prototype for high-stakes spoken foreign language examination
A4 Article in conference proceedings (2016) Karhila, Reima; Rouhe, Aku; Smit, Peter; Mansikkaniemi, André; Kallio, Heini; Lindroos, Erik; Hildén, Raili; Vainio, Martti; Kurimo, Mikko
This paper introduces the first prototype of a computerised examination procedure for spoken foreign languages in Finland, intended for national-scale upper secondary school final examinations. Speech technology and profiling of reviewers are used to minimise the otherwise massive reviewing effort.
- Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models
A4 Article in conference proceedings (2020) Singh, Mittul; Smit, Peter; Virpioja, Sami; Kurimo, Mikko
Character-based Neural Network Language Models (NNLMs) have the advantage of a smaller vocabulary and thus faster training times in comparison to NNLMs based on multi-character units. However, in low-resource scenarios, both character and multi-character NNLMs suffer from data sparsity. In such scenarios, cross-lingual transfer has improved multi-character NNLM performance by allowing information transfer from a source to the target language. In the same vein, we propose to use cross-lingual transfer for character NNLMs applied to low-resource Automatic Speech Recognition (ASR). However, applying cross-lingual transfer to character NNLMs is not as straightforward. We observe that the relatedness of the source language plays an important role in cross-lingual pretraining of character NNLMs. We evaluate this aspect on ASR tasks for two target languages: Finnish (with English and Estonian as sources) and Swedish (with Danish, Norwegian, and English as sources). Prior work has observed no difference between using a related or an unrelated language for multi-character NNLMs. We, however, show that for character-based NNLMs, only pretraining with a related language improves the ASR performance, and using an unrelated language may deteriorate it. We also observe that the benefits are larger when there is much less target data than source data.
- First-pass decoding with n-gram approximation of RNNLM: The problem of rare words
A4 Article in conference proceedings (2018-09) Singh, Mittul; Smit, Peter; Virpioja, Sami; Kurimo, Mikko
- Improved subword modeling for WFST-based speech recognition
A4 Article in conference proceedings (2017-08) Smit, Peter; Virpioja, Sami; Kurimo, Mikko
Because the number of observed word forms in agglutinative languages is very high, subword units are often utilized in speech recognition. However, the proper use of subword units requires careful consideration of details such as silence modeling, position-dependent phones, and the combination of the units. In this paper, we implement subword modeling in the Kaldi toolkit by creating modified lexicon finite-state transducers to represent the subword units correctly. We experiment with multiple types of word boundary markers and achieve the best results by adding a marker to the left or right side of a subword unit whenever it is not preceded or followed by a word boundary, respectively. We also compare three different toolkits that provide data-driven subword segmentations. In our experiments on a variety of Finnish and Estonian datasets, the best subword models outperform both word-based models and naive subword implementations. The largest relative reduction in WER is 23% over word-based models for a Finnish read speech dataset. The results are also better than any previously published ones for the same datasets, and the improvement on all datasets is more than 5%.
- Low-Resource Active Learning of Morphological Segmentation
A1 Original article in a scientific journal (2016) Grönroos, Stig-Arne; Hiovain, Katri; Smit, Peter; Rauhala, Ilona; Jokinen, Kristiina; Kurimo, Mikko; Virpioja, Sami
Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
- Modern subword-based models for automatic speech recognition
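The morph boundary F1-score reported in "Low-Resource Active Learning of Morphological Segmentation" above can, in essence, be computed by comparing the internal split positions of gold and predicted segmentations. The sketch below is a plausible reading of such a metric, not the paper's exact evaluation script; the function name and the toy Finnish segmentations are illustrative:

```python
def boundary_f1(gold_segs, pred_segs):
    """Precision, recall, and F1 over internal morph boundary positions.

    Each segmentation is a list of morphs for one word; a boundary is the
    character offset at which one morph ends inside the word."""
    def boundaries(morphs):
        positions, offset = set(), 0
        for morph in morphs[:-1]:          # the word-final edge is not a split
            offset += len(morph)
            positions.add(offset)
        return positions

    tp = fp = fn = 0
    for gold, pred in zip(gold_segs, pred_segs):
        g, p = boundaries(gold), boundaries(pred)
        tp += len(g & p)                   # boundaries found in both
        fp += len(p - g)                   # spurious predicted boundaries
        fn += len(g - p)                   # missed gold boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the second word is under-segmented by the model.
gold = [["talo", "ssa"], ["auto", "i", "lla"]]
pred = [["talo", "ssa"], ["autoi", "lla"]]
precision, recall, f1 = boundary_f1(gold, pred)
```

Here the model finds two of the three gold boundaries and predicts nothing spurious, so precision is perfect while recall (and thus F1) is penalized for the missed split.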
School of Electrical Engineering | Doctoral dissertation (article-based) (2019) Smit, Peter
In today's society, speech recognition systems have reached a mass audience, especially in the field of personal assistants such as Amazon Alexa or Google Home. Yet this does not mean that speech recognition has been solved. On the contrary, for many domains, tasks, and languages such systems do not exist. Subword-based automatic speech recognition has been studied in the past for many reasons, often to overcome limitations on the size of the vocabulary. Specifically for agglutinative languages, where new words can be created on the fly, these limitations can be handled using a subword-based automatic speech recognition (ASR) system. Over time, however, subword-based systems lost some popularity as system resources increased and word-based models with large vocabularies became possible. Still, subword-based models in modern ASR systems can predict words that have never been seen before and make better use of the available language modeling resources. Furthermore, subword models have smaller vocabularies, which makes neural network language models (NNLMs) easier to train and use. Hence, in this thesis, we study subword models for ASR and make two major contributions. First, this thesis reintroduces subword-based modeling in a modern framework based on weighted finite-state transducers (WFSTs) and describes the necessary tools for making a sound and effective system. It does this through careful modification of the lexicon FST part of a WFST-based recognizer. Second, extensive experiments are done using subwords, with different types of language models including n-gram models and NNLMs. These experiments are performed on six different languages, setting new best published results for these datasets. Overall, we show that subword-based models can outperform word-based models in terms of ASR performance for many different types of languages.
This thesis also details the design choices needed when building modern subword ASR systems, including the choice of segmentation algorithm, vocabulary size, and subword marking style. In addition, it includes techniques to combine speech recognition models trained on different units through system combination. Lastly, it evaluates the use of the smallest possible subword unit, characters, and shows that these models can be smaller and yet remain competitive with word-based models.
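The subword marking styles mentioned in the dissertation above (and studied in "Improved subword modeling for WFST-based speech recognition" earlier in this listing) can be illustrated with a small sketch: a marker is attached to a subword unit on the side that does not touch a word boundary, so that word boundaries can be reconstructed from the token stream. The '+' symbol and the function name are illustrative choices, not the papers' exact notation:

```python
def mark_subwords(segmented_words, style="left"):
    """Attach a '+' boundary marker to subword units.

    style='left':  '+' on the left of any unit not preceded by a word boundary
    style='right': '+' on the right of any unit not followed by a word boundary
    """
    marked = []
    for units in segmented_words:
        for i, unit in enumerate(units):
            if style == "left" and i > 0:
                unit = "+" + unit
            elif style == "right" and i < len(units) - 1:
                unit = unit + "+"
            marked.append(unit)
    return marked

# Toy Finnish example: 'autolla' ("by car") split into a stem and a case
# ending, followed by the unsegmented word 'menen'.
tokens = mark_subwords([["auto", "lla"], ["menen"]], style="right")
```

With right-marking this yields `auto+ lla menen`: every token ending in '+' glues onto the next one, so the original word sequence is recoverable, which is exactly what a lexicon FST for subword units needs to enforce.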