Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Juvela, Lauri [en_US]
dc.contributor.author: Bollepalli, Bajibabu [en_US]
dc.contributor.author: Yamagishi, Junichi [en_US]
dc.contributor.author: Alku, Paavo [en_US]
dc.contributor.department: Department of Signal Processing and Acoustics [en]
dc.contributor.groupauthor: Speech Communication Technology [en]
dc.contributor.organization: Research Organization of Information and Systems, National Institute of Informatics [en_US]
dc.date.accessioned: 2019-06-03T14:12:14Z
dc.date.available: 2019-06-03T14:12:14Z
dc.date.issued: 2019-05-01 [en_US]
dc.description.abstract: The state-of-the-art in text-to-speech (TTS) synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more computationally expensive. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrative properties. By adopting recent advances in GAN training techniques, this investigation studies waveform generation for TTS in two domains (speech signal and glottal excitation). Listening test results show that while direct waveform generation with GAN is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 5
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Juvela, L, Bollepalli, B, Yamagishi, J & Alku, P 2019, Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks. in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)., 8683271, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, pp. 6915 - 6919, IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, United Kingdom, 12/05/2019. https://doi.org/10.1109/ICASSP.2019.8683271 [en]
dc.identifier.doi: 10.1109/ICASSP.2019.8683271 [en_US]
dc.identifier.isbn: 978-1-4799-8132-8
dc.identifier.isbn: 978-1-4799-8131-1
dc.identifier.issn: 1520-6149
dc.identifier.issn: 2379-190X
dc.identifier.other: PURE UUID: 4e11fdba-3ab9-4e73-a508-612fc052b4d9 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/4e11fdba-3ab9-4e73-a508-612fc052b4d9 [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/33439327/ELEC_Juvela_Waveform_Generation_2019_ICASSP.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/38251
dc.identifier.urn: URN:NBN:fi:aalto-201906033336
dc.language.iso: en [en]
dc.relation.ispartof: IEEE International Conference on Acoustics, Speech, and Signal Processing [en]
dc.relation.ispartofseries: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) [en]
dc.relation.ispartofseries: pp. 6915 - 6919 [en]
dc.relation.ispartofseries: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing [en]
dc.rights: openAccess [en]
dc.subject.keyword: Neural vocoding [en_US]
dc.subject.keyword: text-to-speech [en_US]
dc.subject.keyword: GAN [en_US]
dc.subject.keyword: glottal excitation model [en_US]
dc.title: Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks [en]
dc.type: A4 Artikkeli konferenssijulkaisussa (A4 Article in conference proceedings) [fi]
dc.type.version: acceptedVersion

Files

Original bundle

Name: ELEC_Juvela_Waveform_Generation_2019_ICASSP.pdf
Size: 596.29 KB
Format: Adobe Portable Document Format