Modern subword-based models for automatic speech recognition

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Virpioja, Sami, Dr., Aalto University, Department of Signal Processing and Acoustics, Finland
dc.contributor.author: Smit, Peter
dc.contributor.department: Signaalinkäsittelyn ja akustiikan laitos [fi]
dc.contributor.department: Department of Signal Processing and Acoustics [en]
dc.contributor.lab: Speech Recognition Research Group [en]
dc.contributor.school: Sähkötekniikan korkeakoulu [fi]
dc.contributor.school: School of Electrical Engineering [en]
dc.contributor.supervisor: Kurimo, Mikko, Assoc. Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
dc.date.accessioned: 2019-05-21T09:01:08Z
dc.date.available: 2019-05-21T09:01:08Z
dc.date.defence: 2019-06-17
dc.date.issued: 2019
dc.description.abstract: In today's society, speech recognition systems have reached a mass audience, especially in the field of personal assistants such as Amazon Alexa or Google Home. Yet this does not mean that speech recognition has been solved. On the contrary, for many domains, tasks, and languages such systems do not exist. Subword-based automatic speech recognition (ASR) has been studied in the past for many reasons, often to overcome limitations on the size of the vocabulary. Specifically for agglutinative languages, where new words can be created on the fly, a subword-based ASR system makes it possible to handle these limitations. Over time, however, subword-based systems lost some popularity as system resources increased and word-based models with large vocabularies became possible. Still, subword-based models in modern ASR systems can predict words that have never been seen before and make better use of the available language modeling resources. Furthermore, subword models have smaller vocabularies, which makes neural network language models (NNLMs) easier to train and use. Hence, in this thesis, we study subword models for ASR and make two major contributions. First, this thesis reintroduces subword-based modeling in a modern framework based on weighted finite-state transducers (WFSTs) and describes the necessary tools for making a sound and effective system. It does this through careful modification of the lexicon FST part of a WFST-based recognizer. Second, extensive experiments are done using subwords, with different types of language models including n-gram models and NNLMs. These experiments are performed on six different languages, setting the new best published result for any of these datasets. Overall, we show that subword-based models can outperform word-based models in terms of ASR performance for many different types of languages.
This thesis also details design choices needed when building modern subword ASR systems, including the choice of segmentation algorithm, vocabulary size, and subword marking style. In addition, it includes techniques to combine speech recognition models trained on different units through system combination. Lastly, it evaluates the use of the smallest possible subword unit, characters, and shows that these models can be smaller and yet competitive with word-based models. [en]
dc.format.extent: 62 + app. 136
dc.format.mimetype: application/pdf [en]
dc.identifier.isbn: 978-952-60-8566-1 (electronic)
dc.identifier.isbn: 978-952-60-8565-4 (printed)
dc.identifier.issn: 1799-4942 (electronic)
dc.identifier.issn: 1799-4934 (printed)
dc.identifier.issn: 1799-4934 (ISSN-L)
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/38073
dc.identifier.urn: URN:ISBN:978-952-60-8566-1
dc.language.iso: en [en]
dc.opn: Hain, Thomas, Prof., University of Sheffield, UK
dc.publisher: Aalto University [en]
dc.publisher: Aalto-yliopisto [fi]
dc.relation.haspart: [Publication 1]: Sami Virpioja, Peter Smit, Stig-Arne Grönroos, Mikko Kurimo. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Full text in Acris/Aaltodoc: http://urn.fi/URN:ISBN:978-952-60-5501-5. Aalto University publication series SCIENCE + TECHNOLOGY, 25/2013.
dc.relation.haspart: [Publication 2]: Peter Smit, Sami Virpioja, Mikko Kurimo. Improved Subword Modeling for WFST-Based Speech Recognition. In Annual Conference of the International Speech Communication Association (INTERSPEECH), Stockholm, pages 2551–2555, August 2017. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201710157202. DOI: 10.21437/Interspeech.2017-103
dc.relation.haspart: [Publication 3]: Peter Smit, Juho Leinonen, Kristiina Jokinen, Mikko Kurimo. Automatic Speech Recognition for Northern Sámi with comparison to other Uralic Languages. In Proceedings of the Second International Workshop on Computational Linguistics for Uralic Languages, Szeged, pages 80–91, January 2016. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201701191109.
dc.relation.haspart: [Publication 4]: Juho Leinonen, Peter Smit, Sami Virpioja, Mikko Kurimo. New Baseline in Automatic Speech Recognition for Northern Sámi. In Fourth International Workshop on Computational Linguistics for Uralic Languages, Helsinki, pages 89–99, January 2018. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201802091229. DOI: 10.18653/v1/W18-0208
dc.relation.haspart: [Publication 5]: Peter Smit, Siva Reddy Gangireddy, Seppo Enarvi, Sami Virpioja, Mikko Kurimo. Character-based units for Unlimited Vocabulary Continuous Speech Recognition. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, pages 149–156, December 2017. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201802091465. DOI: 10.1109/ASRU.2017.8268929
dc.relation.haspart: [Publication 6]: Peter Smit, Siva Reddy Gangireddy, Seppo Enarvi, Sami Virpioja, Mikko Kurimo. Aalto system for the 2017 Arabic multi-genre broadcast challenge. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, pages 338–345, December 2017. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201802091512. DOI: 10.1109/ASRU.2017.8268955
dc.relation.haspart: [Publication 7]: Seppo Enarvi, Peter Smit, Sami Virpioja, Mikko Kurimo. Automatic Speech Recognition with Very Large Conversational Finnish and Estonian Vocabularies. IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 25, issue 11, pages 2085–2097, November 2017. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201710157079. DOI: 10.1109/TASLP.2017.2743344
dc.relation.haspart: [Publication 8]: Peter Smit, Sami Virpioja, Mikko Kurimo. Advances in Subword-based HMM-DNN Speech Recognition Across Languages. Submitted to Language Resources and Evaluation, 29 November 2018.
dc.relation.ispartofseries: Aalto University publication series DOCTORAL DISSERTATIONS [en]
dc.relation.ispartofseries: 97/2019
dc.rev: Saraçlar, Murat, Prof., Boğaziçi University, Turkey
dc.rev: Ali, Ahmed, Dr., Qatar Computing Research Institute, Qatar
dc.subject.keyword: automatic speech recognition [en]
dc.subject.keyword: language modeling [en]
dc.subject.keyword: subword models [en]
dc.subject.other: Electrical engineering [en]
dc.title: Modern subword-based models for automatic speech recognition [en]
dc.type: G5 Artikkeliväitöskirja [fi]
dc.type.dcmitype: text [en]
dc.type.ontasot: Doctoral dissertation (article-based) [en]
dc.type.ontasot: Väitöskirja (artikkeli) [fi]
local.aalto.acrisexportstatus: checked 2019-07-02_1127
local.aalto.archive: yes
local.aalto.formfolder: 2019_05_20_klo_12_50
local.aalto.infra: Science-IT
Files
Original bundle
Name: isbn9789526085661.pdf
Size: 6.84 MB
Format: Adobe Portable Document Format