Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Bollepalli, Bajibabu [en_US]
dc.contributor.author: Juvela, Lauri [en_US]
dc.contributor.author: Airaksinen, Manu [en_US]
dc.contributor.author: Valentini-Botinhao, Cassia [en_US]
dc.contributor.author: Alku, Paavo [en_US]
dc.contributor.department: Department of Signal Processing and Acoustics [en]
dc.contributor.groupauthor: Speech Communication Technology [en]
dc.contributor.organization: University of Edinburgh [en_US]
dc.date.accessioned: 2019-05-06T09:07:42Z
dc.date.available: 2019-05-06T09:07:42Z
dc.date.embargo: info:eu-repo/date/embargoEnd/2021-04-24 [en_US]
dc.date.issued: 2019-07-01 [en_US]
dc.description.abstract: In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context. The evaluated vocoders are: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In objective evaluations and speaking style similarity tests, we found that the FT method outperformed the other two adaptation methods. In speech intelligibility tests, we found that there were no significant differences between vocoders although the PML vocoder showed slightly better performance compared to the three other vocoders. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 12
dc.format.extent: 64-75
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Bollepalli, B, Juvela, L, Airaksinen, M, Valentini-Botinhao, C & Alku, P 2019, 'Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks', Speech Communication, vol. 110, pp. 64-75. https://doi.org/10.1016/j.specom.2019.04.008 [en]
dc.identifier.doi: 10.1016/j.specom.2019.04.008 [en_US]
dc.identifier.issn: 0167-6393
dc.identifier.issn: 1872-7182
dc.identifier.other: PURE UUID: 25aea363-f4b7-4bf0-9bac-8d5d3f3b04ab [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/25aea363-f4b7-4bf0-9bac-8d5d3f3b04ab [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85064711915&partnerID=8YFLogxK [en_US]
dc.identifier.other: PURE LINK: http://www.sciencedirect.com/science/article/pii/S0167639318303832 [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/33417584/ELEC_Bollepalli_Normal_to_lombard_Speech_Communication.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/37623
dc.identifier.urn: URN:NBN:fi:aalto-201905062743
dc.language.iso: en [en]
dc.publisher: Elsevier
dc.relation.ispartofseries: Speech Communication [en]
dc.relation.ispartofseries: Volume 110 [en]
dc.rights: openAccess [en]
dc.subject.keyword: Lombard [en_US]
dc.subject.keyword: Auxiliary features [en_US]
dc.subject.keyword: LHUC [en_US]
dc.subject.keyword: Fine-tuning [en_US]
dc.subject.keyword: LSTM [en_US]
dc.subject.keyword: Adaptation [en_US]
dc.subject.keyword: TTS [en_US]
dc.title: Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks [en]
dc.type: A1 Original article in a scientific journal (Alkuperäisartikkeli tieteellisessä aikakauslehdessä) [fi]
