Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning

dc.contributor | Aalto-yliopisto | fi
dc.contributor | Aalto University | en
dc.contributor.author | Seshadri, Shreyas | en_US
dc.contributor.author | Juvela, Lauri | en_US
dc.contributor.author | Räsänen, Okko | en_US
dc.contributor.author | Alku, Paavo | en_US
dc.contributor.department | Department of Signal Processing and Acoustics | en
dc.contributor.groupauthor | Speech Communication Technology | en
dc.date.accessioned | 2019-04-02T06:54:40Z
dc.date.available | 2019-04-02T06:54:40Z
dc.date.issued | 2019-01-01 | en_US
dc.description.abstract | Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we aim to provide a general SSC system for converting styles with varying vocal effort and focus on normal-to-Lombard conversion as a case study of this problem. We propose a parametric approach that uses a vocoder to extract speech features. These features are mapped using parallel machine learning models from utterances spoken in normal style to the corresponding features of Lombard speech. Finally, the mapped features are converted to a Lombard speech waveform with the vocoder. A total of three vocoders (GlottDNN, STRAIGHT, and Pulse model in log domain (PML)) and three machine learning mapping methods (standard GMM, Bayesian GMM, and feed-forward DNN) were compared in the proposed normal-to-Lombard style conversion system. The conversion was evaluated using two subjective listening tests measuring perceived Lombardness and quality of the converted speech signals, and by using an instrumental measure called Speech Intelligibility in Bits (SIIB) for speech intelligibility evaluation under various noise levels. The results of the subjective tests show that the system is able to convert normal speech into Lombard speech and that there is a trade-off between quality and Lombardness of the mapped utterances. The GlottDNN and PML stand out as the best vocoders in terms of quality and Lombardness, respectively, whereas the DNN is the best mapping method in terms of Lombardness. PML with the standard GMM seems to give a good compromise between the two attributes. The SIIB experiments indicate that intelligibility of converted speech compared to that of normal speech improved in noisy conditions most effectively when DNN mapping was used with STRAIGHT and PML. | en
dc.description.version | Peer reviewed | en
dc.format.extent | 17
dc.format.extent | 17230-17246
dc.format.mimetype | application/pdf | en_US
dc.identifier.citation | Seshadri, S, Juvela, L, Räsänen, O & Alku, P 2019, 'Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning', IEEE Access, vol. 7, 8631106, pp. 17230-17246. https://doi.org/10.1109/ACCESS.2019.2895923 | en
dc.identifier.doi | 10.1109/ACCESS.2019.2895923 | en_US
dc.identifier.issn | 2169-3536
dc.identifier.other | PURE UUID: 6876ab7d-e037-47cf-9897-d87bc55f6081 | en_US
dc.identifier.other | PURE ITEMURL: https://research.aalto.fi/en/publications/6876ab7d-e037-47cf-9897-d87bc55f6081 | en_US
dc.identifier.other | PURE LINK: http://www.scopus.com/inward/record.url?scp=85061789099&partnerID=8YFLogxK | en_US
dc.identifier.other | PURE FILEURL: https://research.aalto.fi/files/32488027/ELEC_Seshadri_Vocal_effort_IEEEAccess.pdf | en_US
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/37322
dc.identifier.urn | URN:NBN:fi:aalto-201904022453
dc.language.iso | en | en
dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.relation.ispartofseries | IEEE Access | en
dc.relation.ispartofseries | Volume 7 | en
dc.rights | openAccess | en
dc.subject.keyword | Bayesian GMM | en_US
dc.subject.keyword | DNN | en_US
dc.subject.keyword | GlottDNN | en_US
dc.subject.keyword | Lombard speech | en_US
dc.subject.keyword | pulse model in log domain | en_US
dc.subject.keyword | speaking style conversion | en_US
dc.subject.keyword | vocal effort | en_US
dc.title | Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning | en
dc.type | A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) | fi
dc.type.version | publishedVersion
