Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks
Access rights
openAccess
acceptedVersion
A4 Article in a conference publication
This publication is imported from Aalto University research portal.
Date
2019-05-01
Language
en
Pages
5
Series
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6915 - 6919, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
Abstract
The state-of-the-art in text-to-speech (TTS) synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more computationally expensive. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrative properties. By adopting recent advances in GAN training techniques, this investigation studies waveform generation for TTS in two domains (speech signal and glottal excitation). Listening test results show that while direct waveform generation with GAN is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder.
Keywords
Neural vocoding, text-to-speech, GAN, glottal excitation model
Citation
Juvela, L, Bollepalli, B, Yamagishi, J & Alku, P 2019, Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8683271, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, pp. 6915 - 6919, IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, United Kingdom, 12/05/2019. https://doi.org/10.1109/ICASSP.2019.8683271