Advances in subword-based HMM-DNN speech recognition across languages

dc.contributor: Aalto University
dc.contributor.author: Smit, Peter
dc.contributor.author: Virpioja, Sami
dc.contributor.author: Kurimo, Mikko
dc.contributor.department: Department of Signal Processing and Acoustics
dc.contributor.groupauthor: Centre of Excellence in Computational Inference, COIN
dc.contributor.groupauthor: Speech Recognition
dc.date.accessioned: 2020-10-30T12:46:45Z
dc.date.available: 2020-10-30T12:46:45Z
dc.date.issued: 2021-03
dc.description: openaire: EC/H2020/780069/EU//MeMAD
dc.description.abstract: We describe a novel way to implement subword language models in speech recognition systems based on weighted finite-state transducers, hidden Markov models, and deep neural networks. The acoustic models are built on graphemes in a way that no pronunciation dictionaries are needed, and they can be used together with any type of subword language model, including character models. The advantages of short subword units are good lexical coverage, reduced data sparsity, and avoiding vocabulary mismatches in adaptation. Moreover, constructing neural network language models (NNLMs) is more practical, because the input and output layers are small. We also propose methods for combining the benefits of different types of language model units by reconstructing and combining the recognition lattices. We present an extensive evaluation of various subword units on speech datasets of four languages: Finnish, Swedish, Arabic, and English. The results show that the benefits of short subwords are even more consistent with NNLMs than with traditional n-gram language models. Combining different acoustic models and language models with various units improves the results further. For all four datasets we obtain the best results published so far. Our approach performs well even for English, where phoneme-based acoustic models and word-based language models typically dominate: the phoneme-based baseline performance can be reached and improved on by 4% using graphemes only, when several grapheme-based models are combined. Furthermore, combining both grapheme and phoneme models yields the state-of-the-art error rate of 15.9% for the MGB 2018 dev17b test. For all four languages we also show that the language models perform reasonably well when only limited training data is available.
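The abstract's subword approach depends on marking unit boundaries so that a sequence of recognized subword tokens can be rejoined into words when the lattices are reconstructed. As a minimal illustrative sketch only (the unit inventory and the greedy matcher below are hypothetical; the paper's subword units are learned, not hand-picked, and boundary-marker conventions vary), here is how a "+" continuation marker makes segmentation invertible while single-character fallback guarantees full lexical coverage:

```python
# Illustrative sketch, not the paper's implementation: segment words into
# subword units with "+" continuation markers (a common convention for
# subword language models), so word boundaries can be restored afterwards.

def segment(word, units):
    """Greedy longest-match segmentation of `word` against a unit set.
    Falls back to single characters, so every word is always covered."""
    parts, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            piece = word[i:i + length]
            if length == 1 or piece in units:
                parts.append(piece)
                i += length
                break
    # Mark every unit except the last one as word-internal.
    return [p + "+" for p in parts[:-1]] + [parts[-1]]

def join(tokens):
    """Invert the marking: glue each 'x+' token to the unit that follows."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith("+"):
            current += tok[:-1]
        else:
            words.append(current + tok)
            current = ""
    return words

# Toy (hypothetical) unit inventory for a Finnish compound word.
units = {"puhe", "en", "tunnistus"}
tokens = segment("puheentunnistus", units)  # "speech recognition"
print(tokens)        # ['puhe+', 'en+', 'tunnistus']
print(join(tokens))  # ['puheentunnistus']
```

The single-character fallback is what gives short subword units their "good lexical coverage": any out-of-vocabulary word still maps to a valid token sequence, which word-based models cannot guarantee.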
dc.description.version: Peer reviewed
dc.format.extent: 18
dc.format.mimetype: application/pdf
dc.identifier.citation: Smit, P., Virpioja, S. & Kurimo, M. 2021, 'Advances in subword-based HMM-DNN speech recognition across languages', Computer Speech and Language, vol. 66, 101158. https://doi.org/10.1016/j.csl.2020.101158
dc.identifier.doi: 10.1016/j.csl.2020.101158
dc.identifier.issn: 0885-2308
dc.identifier.issn: 1095-8363
dc.identifier.other: PURE UUID: c31c4057-b91c-4473-8b19-602ee2407192
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/c31c4057-b91c-4473-8b19-602ee2407192
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85092219457&partnerID=8YFLogxK
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/52484122/ELEC_Smit_Advances_in_Subword_based_CSL.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/47346
dc.identifier.urn: URN:NBN:fi:aalto-202010306229
dc.language.iso: en
dc.publisher: Academic Press Inc.
dc.relation: info:eu-repo/grantAgreement/EC/H2020/780069/EU//MeMAD
dc.relation.ispartofseries: Computer Speech and Language
dc.relation.ispartofseries: Volume 66
dc.rights: openAccess
dc.subject.keyword: Character units
dc.subject.keyword: Large vocabulary speech recognition
dc.subject.keyword: Recurrent neural network language models
dc.subject.keyword: Subword units
dc.title: Advances in subword-based HMM-DNN speech recognition across languages
dc.type: A1 Original article in a scientific journal
dc.type.version: publishedVersion