Machine learning methods for suprasegmental analysis and conversion in speech

School of Electrical Engineering | Doctoral thesis (article-based) | Defence date: 2020-12-18
98 + app. 82 pages
Aalto University publication series DOCTORAL DISSERTATIONS, 201/2020
Speech technology is a field of technological research focusing on methods to process spoken language. Work in the area has largely relied on a combination of domain-specific knowledge and digital signal processing (DSP) algorithms, often combined with statistical (parametric) models. In this context, machine learning (ML) has played a central role in estimating the parameters of such models. Recently, better access to large quantities of data has opened the door to advanced ML models that are less constrained by the assumptions underlying the DSP models and are potentially capable of achieving higher performance.

The goal of this thesis is to investigate the applicability of recent state-of-the-art (SoA) developments in ML to the modelling and processing of speech at the so-called suprasegmental level, in order to tackle the following topical problems in speech research: 1) zero-resource speech processing (ZS), which aims to learn language patterns from speech without access to annotated datasets; 2) automatic word count estimation (WCE) and syllable count estimation (SCE), which quantify the amount of linguistic content in audio recordings; and 3) speaking style conversion (SSC), which converts the speaking style of an utterance while retaining its linguistic content, speaker identity, and quality.

In contrast to the segmental level, which consists of elementary speech units known as phone(me)s, the suprasegmental level encodes more slowly varying characteristics of speech such as speaker identity, speaking style, prosody, and emotion. The ML approaches used in the thesis are non-parametric Bayesian (NPB) models, which have a strong mathematical foundation in Bayesian statistics, and artificial neural networks (NNs), which are universal function approximators capable of leveraging large quantities of training data.
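As a minimal illustration of the non-parametric Bayesian idea (a toy sketch, not the thesis models), the snippet below fits a truncated Dirichlet process Gaussian mixture to synthetic 2-D data using scikit-learn's `BayesianGaussianMixture`. The DP prior concentrates the mixture weights on the components the data actually support, so the number of clusters does not have to be fixed in advance; this is the property that removes the model-selection step in clustering tasks such as syllable discovery.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Two well-separated Gaussian clusters in 2-D (a synthetic stand-in for
# acoustic feature vectors such as syllable embeddings or i-vectors).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-5.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=+5.0, scale=1.0, size=(200, 2)),
])

# Truncated Dirichlet process mixture: give a generous upper bound on the
# number of components and let the DP prior prune the superfluous ones.
dpmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,
    max_iter=500,
    random_state=0,
)
labels = dpmm.fit_predict(X)

# Components with non-negligible posterior weight are the clusters the
# model actually "uses"; typically 2 for this well-separated toy data.
effective_k = int(np.sum(dpmm.weights_ > 0.01))
print(effective_k)
```

Note that `n_components` here is only a truncation level, not a model-order choice: raising it leaves the effective number of clusters essentially unchanged.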
The NN variants used include 1) end-to-end models, which can learn complicated mapping functions without the need to explicitly model the intermediate steps, and 2) generative adversarial networks (GANs), which are trained through a minimax game between two competing NNs. In ZS, NPB clustering methods were investigated for the discovery of syllabic clusters from speech and were shown to eliminate the need for model selection. For the WCE/SCE task, a novel end-to-end model was developed for automatic, language-independent syllable counting from speech. The method improved syllable counting accuracy by approximately 10 percentage points over the previously published SoA method while relaxing the data-annotation requirements for model training. For SSC, a new parametric approach was introduced: Bayesian models were first studied with parallel data, followed by GAN-based solutions for non-parallel data. The GAN-based models were shown to achieve SoA performance in both subjective and objective measures without access to parallel data, and Augmented CycleGANs additionally enable manual control of the degree of style conversion.
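The cycle-consistency constraint used by the non-parallel GAN models can be illustrated with a toy numerical sketch (hypothetical linear style mappings, not the thesis networks): a forward converter F maps source-style features toward the target style, an inverse converter G maps them back, and the cycle loss penalizes any failure of G(F(x)) to reconstruct x. This constraint is what lets CycleGAN-type models learn a style mapping without parallel utterance pairs.

```python
import numpy as np

# Toy illustration of the CycleGAN cycle-consistency loss (hypothetical
# linear "style converters", not the models used in the thesis).
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))  # batch of source-style feature vectors


def F(v):
    """Stand-in forward converter: source style -> target style."""
    return 2.0 * v + 1.0


def G(v):
    """Stand-in inverse converter: target style -> source style."""
    return (v - 1.0) / 2.0


# Cycle-consistency (L1) loss: converting to the target style and back
# should reconstruct the input. Here G inverts F exactly, so the loss
# is zero up to floating-point error.
cycle_loss = float(np.mean(np.abs(G(F(x)) - x)))
print(cycle_loss)
```

In actual training, F and G are neural networks and this term is minimized jointly with the adversarial losses of the two discriminators.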
Supervising professor
Alku, Paavo, Prof., Aalto University, Department of Signal Processing and Acoustics, Finland
Thesis advisor
Räsänen, Okko, Asst. Prof., Tampere University, Finland
suprasegmental speech processing, Bayesian learning, deep learning, zero-resource speech processing, word and syllable count estimation, speaking style conversion
Other note
  • [Publication 1]: Seshadri, S., Remes, U. & Räsänen, O. Comparison of non-parametric Bayesian mixture models for syllable clustering and zero-resource speech processing. In Annual Conference of the International Speech Communication Association (INTERSPEECH), Stockholm, Sweden, pp. 2744–2748, August 2017.
    DOI: 10.21437/Interspeech.2017-339
  • [Publication 2]: Seshadri, S., Remes, U. & Räsänen, O. Dirichlet process mixture models for clustering i-vector data. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, pp. 5740–5744, March 2017.
    DOI: 10.1109/ICASSP.2017.7953202
  • [Publication 3]: Räsänen, O., Seshadri, S., Karadayi, J., Riebling, E., Bunce, J., Cristia, A., Metze, F., Casillas, M., Rosemberg, C., Bergelson, E. & Soderstrom, M. Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech. Speech Communication, vol. 113, pp. 63–80, October 2019.
    DOI: 10.1016/j.specom.2019.08.005
  • [Publication 4]: Seshadri, S. & Räsänen, O. SylNet: An adaptable end-to-end syllable count estimator for speech. IEEE Signal Processing Letters, vol. 26, no. 9, pp. 1359–1363, July 2019.
    DOI: 10.1109/LSP.2019.2929415
  • [Publication 5]: Seshadri, S., Juvela, L., Räsänen, O. & Alku, P. Vocal effort based speaking style conversion using vocoder features and parallel learning. IEEE Access, vol. 7, pp. 17230–17246, January 2019.
    DOI: 10.1109/ACCESS.2019.2895923
  • [Publication 6]: Seshadri, S., Juvela, L., Yamagishi, J., Räsänen, O. & Alku, P. Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 6835–6839, May 2019.
    DOI: 10.1109/ICASSP.2019.8682648
  • [Publication 7]: Seshadri, S., Juvela, L., Alku, P. & Räsänen, O. Augmented CycleGANs for continuous scale normal-to-Lombard speaking style conversion. In Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, pp. 2838–2842, September 2019.
    DOI: 10.21437/Interspeech.2019-1681