Multilingual TTS Accent Impressions for Accented ASR

dc.contributor | Aalto-yliopisto | fi
dc.contributor | Aalto University | en
dc.contributor.author | Karakasidis, Georgios | en_US
dc.contributor.author | Robinson, Nathaniel | en_US
dc.contributor.author | Getman, Yaroslav | en_US
dc.contributor.author | Ogayo, Atieno | en_US
dc.contributor.author | Al-Ghezi, Ragheb | en_US
dc.contributor.author | Ayasi, Ananya | en_US
dc.contributor.author | Watanabe, Shinji | en_US
dc.contributor.author | Mortensen, David R. | en_US
dc.contributor.author | Kurimo, Mikko | en_US
dc.contributor.department | Department of Information and Communications Engineering | en
dc.contributor.editor | Ekštein, Kamil | en_US
dc.contributor.editor | Pártl, František | en_US
dc.contributor.editor | Konopík, Miloslav | en_US
dc.contributor.groupauthor | Speech Recognition | en
dc.contributor.organization | Department of Information and Communications Engineering | en_US
dc.contributor.organization | Carnegie Mellon University | en_US
dc.contributor.organization | Speech Recognition | en_US
dc.date.accessioned | 2024-01-17T08:28:33Z
dc.date.available | 2024-01-17T08:28:33Z
dc.date.embargo | info:eu-repo/date/embargoEnd/2024-08-23 | en_US
dc.date.issued | 2023 | en_US
dc.description | Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
dc.description.abstract | Automatic Speech Recognition (ASR) for high-resource languages like English is often considered a solved problem. However, most high-resource ASR systems favor socioeconomically advantaged dialects. In the case of English, this leaves behind many L2 speakers and speakers of low-resource accents (a majority of English speakers). One way to mitigate this is to fine-tune a pre-trained English ASR model for a desired low-resource accent. However, collecting transcribed accented audio is costly and time-consuming. In this work, we present a method to produce synthetic L2-English speech via pre-trained text-to-speech (TTS) in an L1 language (target accent). This can be produced at a much larger scale and lower cost than authentic speech collection. We present initial experiments applying this augmentation method. Our results suggest that success of TTS augmentation relies on access to more than one hour of authentic training data and a diversity of target-domain prompts for speech synthesis. | en
dc.description.version | Peer reviewed | en
dc.format.extent | 11
dc.format.mimetype | application/pdf | en_US
dc.identifier.citation | Karakasidis, G, Robinson, N, Getman, Y, Ogayo, A, Al-Ghezi, R, Ayasi, A, Watanabe, S, Mortensen, D R & Kurimo, M 2023, Multilingual TTS Accent Impressions for Accented ASR. in K Ekštein, F Pártl & M Konopík (eds), Text, Speech, and Dialogue - 26th International Conference, TSD 2023, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14102 LNAI, Springer, pp. 317-327, International Conference on Text, Speech, and Dialogue, Pilsen, Czech Republic, 04/09/2023. https://doi.org/10.1007/978-3-031-40498-6_28 | en
dc.identifier.doi | 10.1007/978-3-031-40498-6_28 | en_US
dc.identifier.isbn | 978-3-031-40497-9
dc.identifier.issn | 0302-9743
dc.identifier.issn | 1611-3349
dc.identifier.other | PURE UUID: b64f5c76-8e0c-4a99-b061-a63d15ebdac2 | en_US
dc.identifier.other | PURE ITEMURL: https://research.aalto.fi/en/publications/b64f5c76-8e0c-4a99-b061-a63d15ebdac2 | en_US
dc.identifier.other | PURE FILEURL: https://research.aalto.fi/files/133739889/Multilingual_TTS_Accent_Impressions_for_Accented_ASR_TSD2023.pdf
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/125859
dc.identifier.urn | URN:NBN:fi:aalto-202401171534
dc.language.iso | en | en
dc.relation.ispartof | International Conference on Text, Speech, and Dialogue | en
dc.relation.ispartofseries | Text, Speech, and Dialogue - 26th International Conference, TSD 2023, Proceedings | en
dc.relation.ispartofseries | pp. 317-327 | en
dc.relation.ispartofseries | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) ; Volume 14102 LNAI | en
dc.rights | openAccess | en
dc.subject.keyword | accented speech recognition | en_US
dc.subject.keyword | data augmentation | en_US
dc.subject.keyword | low-resource speech technologies | en_US
dc.subject.keyword | speech synthesis | en_US
dc.title | Multilingual TTS Accent Impressions for Accented ASR | en
dc.type | A4 Article in conference proceedings (A4 Artikkeli konferenssijulkaisussa) | fi
dc.type.version | acceptedVersion
