Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-hour Scale

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Rouhe, Aku [en_US]
dc.contributor.author: Grósz, Tamás [en_US]
dc.contributor.author: Kurimo, Mikko [en_US]
dc.contributor.department: Department of Information and Communications Engineering [en]
dc.contributor.groupauthor: Speech Recognition [en]
dc.date.accessioned: 2024-01-04T08:51:40Z
dc.date.available: 2024-01-04T08:51:40Z
dc.date.issued: 2024 [en_US]
dc.description.abstract: End-to-End speech recognition has become the center of attention for speech recognition research, but Hybrid Hidden Markov Model Deep Neural Network (HMM/DNN) systems remain a competitive approach in terms of performance. End-to-End models may be better at very large data scales, and HMM/DNN systems may have an advantage in low-resource scenarios, but the thousand-hour scale is particularly interesting for comparisons. At that scale, experiments have not been able to conclusively demonstrate which approach is best, or whether the heterogeneous approaches yield similar results. In this work, we move towards answering that question for Attention-based Encoder-Decoder models compared with HMM/DNN systems. We present two simple experimental design principles and show how to build systems adhering to those principles. We demonstrate how those principles remove confounding variables related both to data and to neural architecture and training. We apply the principles in a set of experiments on three diverse thousand-hour-scale tasks. In our experiments, the HMM/DNN systems yield equal or better results in almost all cases. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 16
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Rouhe, A, Grósz, T & Kurimo, M 2024, 'Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-hour Scale', IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 623-638. https://doi.org/10.1109/taslp.2023.3336517 [en]
dc.identifier.doi: 10.1109/taslp.2023.3336517 [en_US]
dc.identifier.issn: 2329-9290
dc.identifier.other: PURE UUID: 4a5092d0-4909-4675-bf78-ced18e2ad4c7 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/4a5092d0-4909-4675-bf78-ced18e2ad4c7 [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85178024451&partnerID=8YFLogxK
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/130999178/Principled_Comparisons_for_End-to-End_Speech_Recognition_Attention_vs_Hybrid_at_the_1000-Hour_Scale.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/125406
dc.identifier.urn: URN:NBN:fi:aalto-202401041095
dc.language.iso: en [en]
dc.publisher: IEEE
dc.relation.ispartofseries: IEEE/ACM Transactions on Audio, Speech, and Language Processing [en]
dc.relation.ispartofseries: Volume 32, pp. 623-638 [en]
dc.rights: openAccess [en]
dc.subject.keyword: ASR [en_US]
dc.subject.keyword: HMM/DNN [en_US]
dc.subject.keyword: End-to-End [en_US]
dc.title: Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-hour Scale [en]
dc.type: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) [fi]
dc.type.version: publishedVersion

Files