Browsing by Author "Kurimo, Mikko, Assoc. Prof., Aalto University, Department of Signal Processing and Acoustics, Finland"
Now showing 1 - 2 of 2
Item: Building personalised speech technology systems with sparse, bad quality or out-of-domain data (Aalto University, 2019)
Karhila, Reima; Kurimo, Mikko, Assoc. Prof., Aalto University, Department of Signal Processing and Acoustics, Finland; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Speech Recognition Group; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Assoc. Prof., Aalto University, Department of Signal Processing and Acoustics, Finland

Automatic speech recognition and text-to-speech systems offer hands-free and eyes-free interfaces for applications on computers, telephones, and home and wearable electronics. The perceived quality and identity of a text-to-speech system's voice are essential to the user experience. The possibilities for different speaker identities are practically limitless if short or out-of-domain collections of speech can be used to transfer speaker identity to the synthetic voice. This thesis describes the background, methods and results of a group of experiments performed with statistical parametric speech synthesis and speech recognition, with a focus on speaker adaptation of the models and on evaluating the quality of the systems' output. All these systems rely on speech models that are trained on large collections of speech and text data; the speech data have been preprocessed into acoustic features using a vocoder.

The amount and quality of the available data are addressed in experiments on the effects of background noise in the adaptation data of speaker-adaptive HMM-GMM statistical parametric speech synthesis, on listener perception of speaker background in speaker-adapted speech synthesis with sparse, foreign-accented data, and on stacking group and speaker adaptations to improve the quality of speech synthesis for out-of-domain speakers. Cross-lingual adaptation is investigated in experiments on probabilistic cross-lingual speaker adaptation when a model for the source language is not available, and on bilingual speech synthesis with code-switching when source-language data is not available for one of the languages. In all these studies, the speaker characteristics were successfully transferred to a synthetic voice even when the adaptation data was noisy, in another language, or very limited in quantity. Cross-lingual adaptation was also investigated for automatic speech recognition of bilingual speakers and was found to improve recognition results.

Any system development relies on measuring the quality of the output, so this thesis also includes an overview of objective and subjective methods of quality evaluation for synthetic speech and natural foreign-language speech, as well as an analysis of different objective measures for evaluating the quality of HMM-GMM based speech synthesis systems. Building on components of speech recognition and synthesis systems, the thesis also presents a system for evaluating and scoring the pronunciation quality of foreign language learners' utterances. Rating the pronunciation quality of single utterances is a difficult problem, but our system manages to do it at a speed and reliability that are satisfactory for computer games used to study language learning.
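As a concrete illustration of the kind of objective measure the abstract above alludes to, the sketch below computes mel-cepstral distortion (MCD), a measure commonly used to compare the vocoder features of synthesized and natural speech. It is a minimal example, not code from the thesis, and it assumes the two mel-cepstral sequences have already been time-aligned (for example with dynamic time warping).

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Frame-averaged mel-cepstral distortion in dB.

    ref_mcep, syn_mcep: arrays of shape (frames, coefficients) holding
    time-aligned mel-cepstra of the natural and synthesized utterance.
    The 0th coefficient (energy) is conventionally excluded.
    """
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

A lower MCD indicates that the synthesized features are closer to the natural reference; the thesis compares several objective measures, of which this is only one common example.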
Item: Modern subword-based models for automatic speech recognition (Aalto University, 2019)
Smit, Peter; Virpioja, Sami, Dr., Aalto University, Department of Signal Processing and Acoustics, Finland; Signaalinkäsittelyn ja akustiikan laitos; Department of Signal Processing and Acoustics; Speech Recognition Research Group; Sähkötekniikan korkeakoulu; School of Electrical Engineering; Kurimo, Mikko, Assoc. Prof., Aalto University, Department of Signal Processing and Acoustics, Finland

In today's society, speech recognition systems have reached a mass audience, especially in the field of personal assistants such as Amazon Alexa or Google Home. Yet this does not mean that speech recognition has been solved; on the contrary, for many domains, tasks and languages such systems do not exist. Subword-based automatic speech recognition has been studied in the past for many reasons, often to overcome limitations on the size of the vocabulary. Specifically for agglutinative languages, where new words can be created on the fly, these limitations can be handled using a subword-based automatic speech recognition (ASR) system. Over time, subword-based systems lost some of their popularity as system resources increased and word-based models with large vocabularies became possible. Still, subword-based models in modern ASR systems can predict words that have never been seen before and make better use of the available language modeling resources. Furthermore, subword models have smaller vocabularies, which makes neural network language models (NNLMs) easier to train and use. Hence, in this thesis, we study subword models for ASR and make two major contributions.

First, this thesis reintroduces subword-based modeling in a modern framework based on weighted finite-state transducers (WFSTs) and describes the tools necessary for building a sound and effective system; it does this through careful modification of the lexicon FST part of a WFST-based recognizer. Secondly, extensive experiments are done with subwords, using different types of language models including n-gram models and NNLMs. These experiments are performed on six different languages, setting new best published results for these datasets. Overall, we show that subword-based models can outperform word-based models in terms of ASR performance for many different types of languages.

This thesis also details the design choices needed when building modern subword ASR systems, including the choice of segmentation algorithm, vocabulary size and subword marking style. In addition, it includes techniques to combine speech recognition models trained on different units through system combination. Lastly, it evaluates the use of the smallest possible subword unit, characters, and shows that these models can be smaller and yet remain competitive with word-based models.
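As a toy illustration of the subword marking styles mentioned in the abstract above, the sketch below attaches boundary markers to a word that has already been segmented into subwords. The segmentation itself (for example by Morfessor or another algorithm) is assumed to be given, and the marker symbols here are illustrative rather than taken from the thesis; markers are what allow the recognizer to join recognized subword sequences back into words.

```python
def mark_subwords(units, style="left+"):
    """Attach boundary markers to one word's subword segmentation.

    units: list of subword strings, e.g. ["koira", "kin"] for "koirakin".
    style: "left+"  -> marker prefixed to non-initial units,
           "right+" -> marker suffixed to non-final units,
           "<w>"    -> an explicit word-boundary token is appended.
    """
    if style == "left+":
        return [u if i == 0 else "+" + u for i, u in enumerate(units)]
    if style == "right+":
        return [u + "+" if i < len(units) - 1 else u for i, u in enumerate(units)]
    if style == "<w>":
        return units + ["<w>"]
    raise ValueError("unknown marking style: %s" % style)

# Example: the Finnish word "koirakin" segmented as koira + kin
print(mark_subwords(["koira", "kin"], "left+"))   # ['koira', '+kin']
print(mark_subwords(["koira", "kin"], "right+"))  # ['koira+', 'kin']
print(mark_subwords(["koira", "kin"], "<w>"))     # ['koira', 'kin', '<w>']
```

The choice of marking style interacts with the lexicon FST and the language model vocabulary, which is why the thesis treats it as one of the design choices alongside the segmentation algorithm and vocabulary size.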