Machine learning methods for suprasegmental analysis and conversion in speech

dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.advisor Räsänen, Okko, Asst. Prof., Tampere University, Finland
dc.contributor.author Seshadri, Shreyas
dc.date.accessioned 2020-12-01T10:00:10Z
dc.date.available 2020-12-01T10:00:10Z
dc.date.issued 2020
dc.identifier.isbn 978-952-64-0167-6 (electronic)
dc.identifier.isbn 978-952-64-0166-9 (printed)
dc.identifier.issn 1799-4942 (electronic)
dc.identifier.issn 1799-4934 (printed)
dc.identifier.issn 1799-4934 (ISSN-L)
dc.identifier.uri https://aaltodoc.aalto.fi/handle/123456789/67490
dc.description.abstract Speech technology is a field of technological research focusing on methods to process spoken language. Work in the area has largely relied on a combination of domain-specific knowledge and digital signal processing (DSP) algorithms, often combined with statistical (parametric) models. In this context, machine learning (ML) has played a central role in estimating the parameters of such models. Recently, better access to large quantities of data has opened the door to advanced ML models that are less constrained by the assumptions necessary for the DSP models and are potentially capable of achieving higher performance. The goal of this thesis is to investigate the applicability of recent state-of-the-art (SoA) developments in ML to the modelling and processing of speech at the so-called suprasegmental level to tackle the following topical problems in speech research: 1) zero-resource speech processing (ZS), which aims to learn language patterns from speech without access to annotated datasets, 2) automatic word (WCE) and syllable (SCE) count estimation, which focus on quantifying the amount of linguistic content in audio recordings, and 3) speaking style conversion (SSC), which deals with the conversion of the speaking style of an utterance while retaining the linguistic content, speaker identity and quality. In contrast to the segmental level, which consists of elementary speech units known as phone(me)s, the suprasegmental level encodes more slowly varying characteristics of speech such as the speaker identity, speaking style, prosody and emotion. The ML approaches used in the thesis are non-parametric Bayesian (NPB) models, which have a strong mathematical foundation based on Bayesian statistics, and artificial neural networks (NNs), which are universal function approximators capable of leveraging large quantities of training data.
The NN variants used include 1) end-to-end models that are capable of learning complicated mapping functions without the need to explicitly model the intermediate steps, and 2) generative adversarial networks (GANs), which are trained as a minimax game between two competing NNs. In ZS, NPB clustering methods were investigated for the discovery of syllabic clusters from speech and were shown to eliminate the need for model selection. In the WCE/SCE task, a novel end-to-end model was developed for automatic and language-independent syllable counting from speech. The method improved the syllable counting accuracy by approximately 10 percentage points over the previously published SoA method while relaxing the requirements of the data annotation used for the model training. For SSC, a new parametric approach was introduced. Bayesian models were first studied with parallel data, followed by GAN-based solutions for non-parallel data. GAN-based models were shown to achieve SoA performance in terms of both subjective and objective measures, without access to parallel data. Augmented CycleGANs also enable manual control of the degree of style conversion achieved in the SSC task. en
dc.format.extent 98 + app. 82
dc.format.mimetype application/pdf en
dc.language.iso en en
dc.publisher Aalto University en
dc.publisher Aalto-yliopisto fi
dc.relation.ispartofseries Aalto University publication series DOCTORAL DISSERTATIONS en
dc.relation.ispartofseries 201/2020
dc.relation.haspart [Publication 1]: Seshadri, S., Remes, U. & Räsänen, O. Comparison of non-parametric Bayesian mixture models for syllable clustering and zero-resource speech processing. In Annual Conference of the International Speech Communication Association (INTERSPEECH), Stockholm, Sweden, pp. 2744–2748, August 2017. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201711217678. DOI: 10.21437/Interspeech.2017-339
dc.relation.haspart [Publication 2]: Seshadri, S., Remes, U. & Räsänen, O. Dirichlet process mixture models for clustering i-vector data. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, pp. 5740–5744, March 2017. DOI: 10.1109/ICASSP.2017.7953202
dc.relation.haspart [Publication 3]: Räsänen, O., Seshadri, S., Karadayi, J., Riebling, E., Bunce, J., Cristia, A., Metze, F., Casillas, M., Rosemberg, C., Bergelson, E. & Soderstrom, M. Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech. Speech Communication, vol. 113, pp. 63–80, October 2019. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201909035123. DOI: 10.1016/j.specom.2019.08.005
dc.relation.haspart [Publication 4]: Seshadri, S. & Räsänen, O. SylNet: An adaptable end-to-end syllable count estimator for speech. IEEE Signal Processing Letters, vol. 26, no. 9, pp. 1359–1363, July 2019. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201909205373. DOI: 10.1109/LSP.2019.2929415
dc.relation.haspart [Publication 5]: Seshadri, S., Juvela, L., Räsänen, O. & Alku, P. Vocal effort based speaking style conversion using vocoder features and parallel learning. IEEE Access, vol. 7, pp. 17230–17246, January 2019. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201904022453. DOI: 10.1109/ACCESS.2019.2895923
dc.relation.haspart [Publication 6]: Seshadri, S., Juvela, L., Yamagishi, J., Räsänen, O. & Alku, P. Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 6835–6839, May 2019. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201906033383. DOI: 10.1109/ICASSP.2019.8682648
dc.relation.haspart [Publication 7]: Seshadri, S., Juvela, L., Alku, P. & Räsänen, O. Augmented CycleGANs for continuous scale normal-to-Lombard speaking style conversion. In Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, pp. 2838–2842, September 2019. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-202001021295. DOI: 10.21437/Interspeech.2019-1681
dc.subject.other Electrical engineering en
dc.title Machine learning methods for suprasegmental analysis and conversion in speech en
dc.type G5 Artikkeliväitöskirja fi
dc.contributor.school Sähkötekniikan korkeakoulu fi
dc.contributor.school School of Electrical Engineering en
dc.contributor.department Signaalinkäsittelyn ja akustiikan laitos fi
dc.contributor.department Department of Signal Processing and Acoustics en
dc.subject.keyword suprasegmental speech processing en
dc.subject.keyword Bayesian learning en
dc.subject.keyword deep learning en
dc.subject.keyword zero-resource speech processing en
dc.subject.keyword word and syllable count estimation en
dc.subject.keyword speaking style conversion en
dc.identifier.urn URN:ISBN:978-952-64-0167-6
dc.type.dcmitype text en
dc.type.ontasot Doctoral dissertation (article-based) en
dc.type.ontasot Väitöskirja (artikkeli) fi
dc.contributor.supervisor Alku, Paavo, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
dc.opn Hautamäki, Ville, Dr., University of Eastern Finland, Finland
dc.rev Lee, Hung-yi, Assoc. Prof., National Taiwan University, Taiwan
dc.rev Tang, Yan, Asst. Prof., University of Illinois, USA
dc.date.defence 2020-12-18
local.aalto.acrisexportstatus checked 2020-12-28_2043
local.aalto.infra Aalto Acoustics Lab
local.aalto.formfolder 2020_12_01_klo_10_45
local.aalto.archive yes
