Browsing by Author "Kurimo, Mikko, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland"
Now showing 1 - 7 of 7
Item Continuous Unsupervised Topic Adaptation for Morph-based Speech Recognition (Aalto University, 2017) Mansikkaniemi, André; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Speech Recognition Research Group; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
Modern automatic speech recognition (ASR) systems are speaker independent and designed to recognize continuous large vocabulary speech. The key components of an ASR system are the acoustic model, language model, lexicon and decoder. A constant challenge for an ASR system is how to adapt over time to changing topics and the introduction of new names and words. Enabling continuous topic adaptation for ASR systems requires finding new relevant text sources for adapting the language model and identifying words which need new and modified pronunciation rules. In this thesis, unsupervised methods that enable continuous topic adaptation for a Finnish morph-based ASR system are studied. Based on first-pass ASR output, topic- and time-relevant text data is retrieved from a collection of pre-indexed Web texts. Adapting the background language model with the best matching texts improves recognition accuracy. The recognition accuracy of foreign names and acronyms, one of the focus areas in this thesis, is also improved. Further improvement is achieved by identifying foreign names and acronyms in the retrieved texts, and generating adapted pronunciation rules for them. In statistical morph-based ASR, words are sometimes oversegmented. To enable a more reliable and easier mapping of adapted pronunciation rules, oversegmented foreign names and acronyms are restored to their base forms. Morpheme restoration also improves recognition accuracy slightly. User feedback is also explored in this thesis for enabling ongoing lexicon adaptation of ASR systems.
Based on user corrections of ASR output, optimal pronunciation rules for misrecognized words are recovered by using forced alignment and Viterbi decoding. A collection of recovered pronunciation rules can be used for the recognition of new speech data. Experiments showed some minor improvements in the recognition of foreign names using lexicon adaptation based on user feedback.
Item Contributions to Morphology Learning using Conditional Random Fields (Aalto University, 2016) Ruokolainen, Teemu; Virpioja, Sami, Dr., Aalto University, Department of Signal Processing and Acoustics, Finland; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Speech and Language Processing; Perustieteiden korkeakoulu; School of Science; Kurimo, Mikko, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
Natural language processing (NLP) refers to the study of systems performing natural language related tasks in an automatic manner, that is, without human supervision or interference. This thesis work considers NLP problems related to morphology analysis, that is, the description of the internal structure of words. Acquiring knowledge of morphology is necessary in order for applications, such as search engines, machine translators, and speech recognizers, to successfully address rare and previously unseen word forms. In particular, we focus on two widely applied morphological analysis tasks, namely, morphological tagging and segmentation. In morphological tagging, the aim is to assign word class labels describing morphological properties to words in sentential contexts. Meanwhile, morphological segmentation considers describing the inner word structure by splitting word forms into their smallest meaning-bearing units, morphemes. In the scope of this thesis, we approach the morphological tagging and segmentation problems using statistical, data-driven machine learning methodology.
Using this approach, the processing systems are learned (estimated) based on training data prepared manually by a human expert. In particular, we focus on the highly influential conditional random field (CRF) model proposed for sequence tagging and segmentation in the early 2000s. As the first main contribution, the thesis discusses data-driven morphological segmentation employing the CRF model. A particular emphasis is placed on the semi-supervised learning setting, in which the available data consists of a small number of annotated segmentation examples and a large amount of unannotated raw word forms. The provided empirical evaluation on six languages shows that the proposed semi-supervised CRF-based approach is highly successful in the considered morphological segmentation task compared to earlier methods. In particular, the performed error analysis shows that closed class phenomena, such as suffixation of English and Finnish, can be learned already from a small number of annotated examples in a supervised manner. Meanwhile, open morpheme class phenomena, such as compounding of Finnish, can be learned by additionally exploiting the large unannotated word list using the semi-supervised approach. As the second main contribution, the thesis contains a presentation of FinnPos, the first open-source statistical morphological tagging and lemmatization toolkit designed specifically for Finnish. 
The CRF-based FinnPos system is readily applicable to tagging and lemmatization of running text with models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank.
Item Feature Enhancement and Uncertainty Estimation for Recognition of Noisy and Reverberant Speech (Aalto University, 2016) Kallasjoki, Heikki; Palomäki, Kalle, Doc., Aalto University, Department of Signal Processing and Acoustics, Finland; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
The task of automatic speech recognition has received considerable research attention and many systems have seen large-scale commercial deployment. However, lack of robustness is still a barrier to their use in novel applications. While human listeners are adept at understanding spoken language in diverse environments, the signal distortion caused by noise and reflected sounds severely degrades the accuracy of conventional systems. This thesis studies methods of reducing the effects of such distortions, improving the performance of speech recognition in challenging conditions. The emphasis of this thesis is on algorithms that enhance the sequence of input features observed by a speech recognition system, with the aim of making them more invariant to noise and reverberation. Research on several ways of addressing the problem is included. Weighted linear prediction is considered as a method to incorporate robustness in spectral modeling used for speech feature extraction. To counteract additive noise, improvements are proposed to algorithms based on the missing data framework and the use of non-negative matrix factorization as a tool for separating sound sources. Speech corrupted by reverberation is addressed by extending the source separation model to account for convolutional distortion.
Further, a method of transforming the corrupted features based on matching their distribution to that of uncorrupted speech is presented. The positive impact of the proposed approaches on speech recognition performance is confirmed and quantified by experimental evaluation on large vocabulary continuous speech recognition tasks. Complementing the work, methods to extract and utilize information about the varying uncertainty of the enhanced features are investigated. While no system is capable of perfectly removing all traces of noise from the speech features, it is often possible to estimate the local accuracy of the processed speech. This information can be used in the decoding stage of a speech recognition system, to de-emphasize the regions of the input where the uncertainty is high, and the input features are more likely to be incorrect. This thesis proposes and evaluates heuristic uncertainty metrics compatible with the missing data and non-negative matrix factorization feature enhancement systems.
Item Improving very large vocabulary language modeling and decoding for speech recognition in morphologically rich languages (Aalto University, 2020) Varjokallio, Matti; Virpioja, Sami, Dr., University of Helsinki, Finland; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
In the automatic speech recognition of agglutinative and morphologically rich languages, the recognition vocabulary may, in many tasks, need to cover several millions of word forms. This poses challenges for the search component of the speech recognizer, as in many cases, real-time recognition speed would be preferred, and the number of possible recognition hypotheses is large.
A typical modern large vocabulary speech recognizer utilizes a probabilistic language model to assign prior probabilities to word sequences. Estimating accurate language models from a text corpus also becomes harder due to increased data sparsity. So far, the most successful approach for the speech recognition of morphologically rich languages has been to segment the words into shorter, more frequently occurring units, thus alleviating the estimability problems. Also, if all concatenations of subwords are allowed, the recognition vocabulary is unlimited. This thesis concentrates on different approaches where a limited but very large recognition vocabulary is used. This type of recognizer can, in addition to the subword-based language models, also use language models trained over words and word classes to reach improved modeling accuracy. For the case where only a subword language model is used, the thesis shows a novel way of constructing the recognition graph. In this case, the recognition vocabulary is easy to augment with new word forms by utilizing resources like dictionaries and morphological analyzers. The constrained recognition vocabulary approaches are shown to be viable choices in many speech recognition use cases. Additionally, in this case, it is shown that the search may also operate in real time and even faster than in the case where the recognition vocabulary is unlimited. Also, the recognition of non-words is avoided, and the recognition accuracy may exceed that of the unlimited vocabulary approach if a low enough out-of-vocabulary rate is reached. In one part of the thesis, human word recognition performance is analyzed using statistical morphological models in a visual lexical decision task where the participants' eye movements were also recorded using eye tracking. The Morfessor Baseline method, which segments only the infrequent words, predicted the observations well in most of the experiments.
This finding supports the corresponding model of word recognition in humans.
Item Machine translation into morphologically rich low-resource languages (Aalto University, 2020) Grönroos, Stig-Arne; Virpioja, Sami, Dr., University of Helsinki, Finland; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Speech and Language Processing research group; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
Machine translation is an important natural language processing application, enabling widened access to information, cultural interchange, and business opportunities in a multilingual world. Driven by research into deep neural networks, machine translation has recently made rapid advances, particularly in the fluency of the translation output. As the methods tend to be data-hungry, high-resource languages have benefited more than low-resource ones. In this work, the aim is to improve machine translation into low-resource morphologically rich languages. Rich morphology leads to a combinatorial explosion in the number of word forms, resulting in very large vocabularies, containing many poorly modeled rare words. This thesis addresses these challenges with multiple approaches. The focus is on methods for segmenting words into subwords, to get more frequent and thus more easily learned representations, and to increase the symmetry between languages. It is important to exploit additional resources from related tasks, such as parallel data from related high-resource language pairs and monolingual data from both low- and high-resource languages. Useful auxiliary data sets for multimodal translation can be found from captioning and text-only translation tasks. The methods for exploiting this auxiliary data include cross-lingual learning and data augmentation, e.g. using denoising sequence autoencoders and subword regularization.
Learning setups used in the thesis include using unsupervised and language-independent methods, using active learning to guide an annotation effort to produce more informative data, and using scheduled multi-task learning to improve cross-lingual transfer. Contributions of the thesis include five novel segmentation methods: Morfessor FlatCat, Omorfi-restricted Morfessor, Cognate Morfessor, Morfessor EM+Prune, and a semi-supervised neural method. An active learning strategy for Morfessor FlatCat is presented. Evaluation of segmentation quality is performed using both intrinsic and extrinsic automatic methods. Morfessor EM+Prune finds models with both lower cost and better quality in unsupervised segmentation than Morfessor Baseline. Active learning is superior to random selection for collecting annotations. The best performance in semi-supervised segmentation is achieved when using Morfessor FlatCat segmentations as features in a conditional random field. Contributions to machine translation include a target-side multi-task learning scheme, and scheduled multi-task learning with a denoising sequence autoencoder. LeBLEU, an evaluation measure suitable for morphologically rich languages, is presented. Evaluation of translation quality is performed using both automatic and human evaluation. When resources are scarce, the most important auxiliary data comes from related languages.
Other types of auxiliary data, such as monolingual corpora, are also beneficial and the gains are partly cumulative.
Item Modeling Conversational Finnish for Automatic Speech Recognition (Aalto University, 2018) Enarvi, Seppo; Virpioja, Sami, Dr., Aalto University, Department of Signal Processing and Acoustics, Finland; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Speech Recognition Research Group; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
The accuracy of automatic speech recognizers has been constantly improving for decades. Aalto University has developed automatic recognition of Finnish speech and achieved very low error rates on clearly spoken standard Finnish, such as news broadcasts. Recognition of natural conversations is much more challenging. The language that is spoken in Finnish conversations also differs in many ways from standard Finnish, and its recognition requires data that has previously been unavailable. This thesis develops automatic speech recognition for conversational Finnish, starting with the collection of training and evaluation data. For language modeling, large amounts of text are collected from the Internet, and filtered to match the colloquial speaking style. An evaluation set is published and used to benchmark the progress in conversational Finnish speech recognition. The thesis addresses many difficulties that arise from the fact that the vocabulary that is used in Finnish conversations is very large.
Using deep neural networks for acoustic modeling and recurrent neural networks for language modeling, accuracy that is already useful in practical applications is achieved in conversational speech recognition.
Item Statistical methods for incomplete speech data (Aalto University, 2016) Remes, Ulpu; Palomäki, Kalle, Docent, Aalto University, Department of Signal Processing and Acoustics, Finland; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
Speech can be represented as an observation matrix where each node corresponds to a certain speech feature. However, when speech is mixed with environmental sounds, some features cannot be observed and the observation matrix remains incomplete. The missing values are a problem because incomplete observations can support incorrect conclusions and because most applications cannot process incomplete data. Methods that are used to handle incomplete observations are called missing-data methods. This thesis presents an overview of missing-data methods and discusses their application in noise-robust automatic speech recognition. Hence, we assume that the speech observations are incomplete due to environmental sounds. The methods studied in this work substitute unobserved feature values with estimates calculated based on the incomplete observations and statistical dependencies between the observed and unobserved features. This is called missing-data imputation. The main research directions include imputation methods that utilise temporal dependencies between observations and imputation methods that associate feature estimates with uncertainties. The experiments conducted in this work indicate that temporal dependencies and imputation uncertainties improve automatic speech recognition performance when speech is corrupted with environmental noise.
The thesis also discusses narrowband telephone speech and bandwidth extension. Narrowband speech can be considered incomplete since observations associated with certain features are not included in the narrowband transmission. Bandwidth extension means that the narrowband observations are converted into wideband observations which include more features. The bandwidth extension methods evaluated in this work estimate wideband observations based on narrowband observations and statistical dependencies between narrowband and wideband features.
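As an illustration only, the general idea of estimating an unobserved feature from an observed one through their statistical dependency, together with an uncertainty for the estimate, can be sketched with a two-component Gaussian model. This is a minimal sketch of textbook conditional-Gaussian imputation, not the method of the thesis; the class `BivariateGaussian`, the function `impute`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BivariateGaussian:
    """Hypothetical model of two speech feature components (assumed Gaussian)."""
    mean_obs: float  # mean of the component that remains observable
    mean_mis: float  # mean of the component that may be unobserved
    var_obs: float   # variance of the observable component
    var_mis: float   # variance of the possibly unobserved component
    cov: float       # covariance between the two components


def impute(model: BivariateGaussian, x_obs: float) -> Tuple[float, float]:
    """Return (estimate, variance) of the unobserved component given x_obs.

    Uses the conditional mean and conditional variance of a bivariate Gaussian.
    """
    gain = model.cov / model.var_obs
    estimate = model.mean_mis + gain * (x_obs - model.mean_obs)
    variance = model.var_mis - gain * model.cov
    return estimate, variance


# Illustrative (made-up) model parameters and observation.
model = BivariateGaussian(mean_obs=0.0, mean_mis=1.0,
                          var_obs=2.0, var_mis=3.0, cov=1.0)
estimate, variance = impute(model, x_obs=4.0)
print(estimate, variance)
```

The returned variance is smaller than the prior variance whenever the two components are correlated, which is one simple way an imputed value can carry an uncertainty that a later processing stage could take into account.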