Browsing by Author "Airaksinen, Manu"
Now showing 1 - 20 of 23
Item: Alternating minimisation for glottal inverse filtering (2017-05-17)
Bleyer, Ismael Rodrigo; Lybeck, Lasse; Auvinen, Harri; Airaksinen, Manu; Alku, Paavo; Siltanen, Samuli; Dept Signal Process and Acoust; Speech Communication Technology; University of Helsinki

A new method is proposed for solving the glottal inverse filtering (GIF) problem. The goal of GIF is to separate an acoustical speech signal into two parts: the glottal airflow excitation and the vocal tract filter. Recovering this information requires solving a blind deconvolution problem. This ill-posed inverse problem is solved in a deterministic setting, with unknowns on both sides of the underlying operator equation. A stable reconstruction is obtained using a double regularization strategy that alternates between fixing either the glottal source signal or the vocal tract filter. This not only splits the nonlinear and nonconvex problem into two linear and convex subproblems, but also allows the best parameters and constraints to be used for recovering each variable in turn. The new technique, called alternating minimization glottal inverse filtering (AM-GIF), is compared with two other approaches, Markov chain Monte Carlo glottal inverse filtering (MCMC-GIF) and iterative adaptive inverse filtering (IAIF), using synthetic speech signals. The recent MCMC-GIF has good reconstruction quality but high computational cost. The state-of-the-art IAIF method is computationally fast, but its accuracy deteriorates, particularly for speech signals with high fundamental frequency (F0). The results show the competitive performance of the new method: with high F0, the reconstruction quality is better than that of IAIF and close to that of MCMC-GIF, while the computational complexity is reduced by two orders of magnitude.
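The alternating idea can be illustrated with a short numpy sketch. This is a generic Tikhonov-regularized alternation for the blind deconvolution model s = g * v, not the authors' AM-GIF implementation; the regularization weights, initialization, and signal lengths below are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import toeplitz

def conv_matrix(h, n_cols):
    """Matrix C such that C @ x == np.convolve(h, x) (full convolution)."""
    col = np.zeros(len(h) + n_cols - 1)
    col[:len(h)] = h
    row = np.zeros(n_cols)
    row[0] = h[0]
    return toeplitz(col, row)

def ridge_solve(A, y, lam):
    """Tikhonov-regularized least squares: argmin_x ||A x - y||^2 + lam ||x||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

def alternating_deconvolution(s, n_g, n_v, lam_g=1e-3, lam_v=1e-3, n_iter=20):
    """Estimate source g and filter v from s = g * v by alternating two
    convex subproblems; requires len(s) == n_g + n_v - 1."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(n_v) * 0.01
    v[0] = 1.0                          # anchor the filter gain to resolve scale ambiguity
    g = np.zeros(n_g)
    for _ in range(n_iter):
        g = ridge_solve(conv_matrix(v, n_g), s, lam_g)   # v fixed: linear, convex in g
        v = ridge_solve(conv_matrix(g, n_v), s, lam_v)   # g fixed: linear, convex in v
    return g, v
```

Each half-step is an ordinary regularized least-squares problem, which is what makes the alternation attractive: separate regularizers (lam_g, lam_v) can be tuned for the source and the filter independently.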
Item: Automatic Posture and Movement Tracking of Infants with Wearable Movement Sensors (Nature Publishing Group, 2020-12-01)
Airaksinen, Manu; Räsänen, Okko; Ilen, Elina; Häyrinen, Taru; Kivi, Anna; Marchi, Viviana; Gallen, Anastasia; Blom, Sonja; Varhe, Anni; Kaartinen, Nico; Haataja, Leena; Vanhatalo, Sampsa; Dept Signal Process and Acoust; Department of Design; Jorma Skyttä's Group; Fashion/Textile Futures; Helsinki University Central Hospital; IRCCS Fondazione Stella Maris - Calambrone; Kaasa Solution GmbH

Infants' spontaneous and voluntary movements mirror the developmental integrity of brain networks, since they require coordinated activation of multiple sites in the central nervous system. Accordingly, early detection of infants with atypical motor development holds promise for recognizing those infants who are at risk for a wide range of neurodevelopmental disorders (e.g., cerebral palsy, autism spectrum disorders). Wearable technology has previously shown promise for offering efficient, scalable and automated methods for movement assessment in adults. Here, we describe the development of an infant wearable, a multi-sensor smart jumpsuit that allows mobile accelerometer and gyroscope data collection during movements. Using this suit, we first recorded play sessions of 22 typically developing infants of approximately 7 months of age. These data were manually annotated for infant posture and movement based on video recordings of the sessions, using a novel annotation scheme specifically designed to assess the overall movement pattern of infants in the given age group. A machine learning algorithm based on deep convolutional neural networks (CNNs) was then trained for automatic detection of posture and movement classes using the data and annotations. Our experiments show that the setup can be used for quantitative tracking of infant movement activities with human-equivalent accuracy, i.e., it meets human inter-rater agreement levels in infant posture and movement classification. We also quantify the ambiguity of human observers in analyzing infant movements, and propose a method for utilizing this uncertainty to improve the performance of the automated classifier during training. A comparison of different sensor configurations also shows that four-limb recording leads to the best performance in posture and movement classification.

Item: Building an Open Source Classifier for the Neonatal EEG Background: A Systematic Feature-Based Approach From Expert Scoring to Clinical Visualization (FRONTIERS MEDIA SA, 2021-05-31)
Moghadam, Saeed Montazeri; Pinchefsky, Elana; Tse, Ilse; Marchi, Viviana; Kohonen, Jukka; Kauppila, Minna; Airaksinen, Manu; Tapani, Karoliina; Nevalainen, Päivi; Hahn, Cecil; Tam, Emily W.Y.; Stevenson, Nathan J.; Vanhatalo, Sampsa; Department of Computer Science; Dept Signal Process and Acoust; Helsinki Institute for Information Technology (HIIT); Professorship Kaski Petteri; Jorma Skyttä's Group; University of Helsinki; University of Montreal; University of Toronto; Queensland Institute of Medical Research

Neonatal brain monitoring in the neonatal intensive care unit (NICU) requires a continuous review of the spontaneous cortical activity, i.e., the electroencephalographic (EEG) background activity. This calls for the development of bedside methods for an automated assessment of the EEG background. In this paper, we present the development of the key components of a neonatal EEG background classifier, proceeding from visual background scoring to classifier design and, finally, to possible bedside visualization of the classifier results. A dataset of 13,200 five-minute EEG epochs (8–16 channels) from 27 infants with birth asphyxia was used for classifier training after scoring by two independent experts. We tested three classifier designs based on 98 computational features, and their performance was assessed with respect to the scoring system, pre- and post-processing of labels and outputs, choice of channels, and visualization in monitor displays. The optimal solution achieved an overall classification accuracy of 97%, with a range across subjects of 81–100%. We identified a set of 23 features that make the classifier highly robust to the choice of channels and to missing data due to artefact rejection. Our results showed that an automated bedside classifier of EEG background is achievable, and we publish the full classifier algorithm to allow further clinical replication and validation studies.
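A feature-based pipeline of this kind can be sketched compactly. The band definitions, sampling rate, and SVM classifier below are illustrative stand-ins for the paper's 98 computational features and three evaluated classifier designs; only the overall structure (per-epoch features, expert labels, supervised training) follows the described approach.

```python
import numpy as np
from scipy.signal import welch
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FS = 256                                          # Hz, assumed EEG sampling rate
BANDS = [(0.5, 3), (3, 8), (8, 15), (15, 30)]     # illustrative frequency bands

def epoch_features(epoch):
    """Log band powers per channel for one EEG epoch (n_channels x n_samples)."""
    f, psd = welch(epoch, fs=FS, nperseg=4 * FS, axis=-1)
    feats = []
    for lo, hi in BANDS:
        band = (f >= lo) & (f < hi)
        feats.append(np.log(psd[:, band].mean(axis=-1) + 1e-12))
    return np.concatenate(feats)

def train_background_classifier(epochs, expert_scores):
    """epochs: list of (n_channels x n_samples) arrays; expert_scores: one label per epoch."""
    X = np.stack([epoch_features(e) for e in epochs])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X, expert_scores)
    return clf
```

Restricting the feature set to channel-robust statistics, as the paper does with its 23-feature subset, is what allows a classifier like this to tolerate missing channels after artefact rejection.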
Item: A comparison between STRAIGHT, glottal, and sinusoidal vocoding in statistical parametric speech synthesis (2018-09)
Airaksinen, Manu; Juvela, Lauri; Bollepalli, Bajibabu; Yamagishi, Junichi; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology; National Institute of Informatics

A vocoder is used to express a speech waveform as a controllable parametric representation that can be converted back into a speech waveform. In this study, vocoders representing the main vocoder categories (mixed excitation, glottal, and sinusoidal) were compared with formal and crowd-sourced listening tests. Vocoder quality was measured in the context of analysis-synthesis as well as text-to-speech (TTS) synthesis in a modern statistical parametric speech synthesis framework. Furthermore, the TTS experiments were divided into synthesis with vocoder-specific features and synthesis with a shared envelope model, in which the waveform generation method of each vocoder is mainly responsible for the quality differences. Finally, all of the tests included four distinct voices as a way to investigate the effect of different speakers on the synthesized speech quality. The obtained results suggest that the choice of voice has a profound impact on the overall quality of the vocoder-generated speech, and the best vocoder can vary from voice to voice. The single best-rated TTS system was obtained with the glottal vocoder GlottDNN using a male voice with low expressiveness. However, the results indicate that the sinusoidal vocoder PML (pulse model in log-domain) has the best overall performance across the performed tests. Finally, when controlling for the spectral models of the vocoders, the observed differences are similar to the baseline results. This indicates that the waveform generation method of a vocoder is essential for quality improvements.

Item: Data augmentation strategies for neural network F0 estimation (2019-05-01)
Airaksinen, Manu; Juvela, Lauri; Alku, Paavo; Räsänen, Okko; Dept Signal Process and Acoust; Jorma Skyttä's Group; Speech Communication Technology

This study explores various speech data augmentation methods for the task of noise-robust fundamental frequency (F0) estimation with neural networks. The explored strategies are split into additive noise and channel-based augmentation on the one hand, and vocoder-based augmentation methods on the other. In vocoder-based augmentation, a glottal vocoder is used to enhance the accuracy of the ground-truth F0 used for training the neural network, as well as to expand the diversity of the training data in terms of F0 patterns and the vocal tract lengths of the talkers. Evaluations on the PTDB-TUG corpus indicate that noise and channel augmentation can greatly increase the noise robustness of trained models, and that vocoder-based ground-truth enhancement further increases model performance. For smaller datasets, vocoder-based diversity augmentation can also be used to increase performance. The best-performing proposed method greatly outperformed the compared F0 estimation methods in terms of noise robustness.
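Additive-noise augmentation of the kind explored here is straightforward to reproduce. The sketch below mixes each training utterance with a random noise segment at a randomly drawn signal-to-noise ratio; the SNR range and uniform sampling are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng):
    """Add a random segment of `noise` to `speech` at the requested SNR (dB).
    Assumes len(noise) >= len(speech)."""
    start = rng.integers(0, len(noise) - len(speech) + 1)
    seg = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(seg ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * seg

def augment_corpus(utterances, noises, snr_range=(0, 20), seed=0):
    """One noisy copy per utterance, with a random noise type and SNR."""
    rng = np.random.default_rng(seed)
    out = []
    for x in utterances:
        noise = noises[rng.integers(len(noises))]
        snr = rng.uniform(*snr_range)
        out.append(mix_at_snr(x, noise, snr, rng))
    return out
```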
Item: Effects of training data variety in generating glottal pulses from acoustic features with DNNs (2017-08)
Airaksinen, Manu; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology

The glottal volume velocity waveform, the acoustical excitation of voiced speech, cannot be acquired through direct measurements in the normal production of continuous speech. Glottal inverse filtering (GIF), however, can be used to estimate the glottal flow from recorded speech signals. Unfortunately, the usefulness of GIF algorithms is limited, since they are sensitive to noise and call for high-quality recordings. Recently, efforts have been made to expand the use of GIF by training deep neural networks (DNNs) to learn a statistical mapping between frame-level acoustic features and glottal pulses estimated by GIF. This framework has been successfully utilized in statistical speech synthesis in the form of the GlottDNN vocoder, which uses a DNN to generate glottal pulses to be used as the synthesizer's excitation waveform. In this study, we investigate how the DNN-based generation of glottal pulses is affected by the variety of the training data. The evaluation uses both objective measures and subjective listening tests of synthetic speech. The results suggest that the performance of glottal pulse generation with DNNs depends particularly on how well the training corpus suits GIF: processing low-pitched male speech and sustained phonations shows better performance than processing high-pitched female voices or continuous speech.

Item: Estimation of the glottal source from coded telephone speech using deep neural networks (2019-01-01)
Narendra, N.P.; Airaksinen, Manu; Story, Brad; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology; University of Arizona

Glottal source information can be estimated non-invasively from speech by using glottal inverse filtering (GIF) methods. However, the existing GIF methods are sensitive even to slight distortions in speech signals under realistic conditions, for example, in coded telephone speech. Therefore, there is a need for robust GIF methods that can accurately estimate glottal flows from coded telephone speech. To address this issue, this paper proposes a new deep neural net-based glottal inverse filtering (DNN-GIF) method for the estimation of the glottal source from coded telephone speech. The proposed DNN-GIF method utilizes both the coded and clean versions of the speech signal during training. A DNN is used to map the speech features extracted from coded speech to the glottal flows estimated from the corresponding clean speech. The glottal flows are estimated from the clean speech by using quasi-closed phase analysis (QCP). To generate coded telephone speech, the adaptive multi-rate (AMR) codec is utilized, which operates in two transmission bandwidths: narrow band (300 Hz - 3.4 kHz) and wide band (50 Hz - 7 kHz). The glottal source parameters were computed with the proposed and existing GIF methods using vowels obtained from natural speech data as well as from artificial speech production models. The errors in the glottal source parameters indicate that the proposed DNN-GIF method considerably improves glottal flow estimation under coded conditions for both low- and high-pitched vowels. The proposed DNN-GIF method can be utilized to accurately extract glottal source-based features from coded telephone speech, which can be used to improve the performance of speech technology applications such as speaker recognition, emotion recognition, and telemonitoring of neurodegenerative diseases. (In this article, the term "accurate/accuracy" is used only when referring to quantitative, objective measures.)
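The core of DNN-GIF is a supervised regression from coded-speech features to clean-speech glottal flow estimates. A minimal PyTorch sketch of that mapping is given below; the feature dimensionality, output frame length, and network size are illustrative assumptions, and the QCP target extraction is assumed to happen elsewhere.

```python
import torch
import torch.nn as nn

class GlottalFlowDNN(nn.Module):
    """Maps per-frame acoustic features (from coded speech) to a fixed-length
    glottal flow frame (QCP estimate from the corresponding clean speech)."""
    def __init__(self, n_feats=40, n_out=400, n_hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_out),
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, coded_feats, clean_flow):
    """One gradient step on a (coded features, clean glottal flow) minibatch."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(coded_feats), clean_flow)
    loss.backward()
    optimizer.step()
    return loss.item()

model = GlottalFlowDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```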
Item: Glottal source estimation from coded telephone speech using a deep neural network (2017-08)
Nonavinakere Prabhakera, Narendra; Airaksinen, Manu; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology

In speech analysis, information about the glottal source is obtained from speech by using glottal inverse filtering (GIF). The accuracy of state-of-the-art GIF methods is sufficiently high when the input speech signal is of high quality (i.e., with little noise or reverberation). However, in realistic conditions, particularly when GIF is computed from coded telephone speech, the accuracy of GIF methods deteriorates severely. To robustly estimate the glottal source under coded conditions, a deep neural network (DNN)-based method is proposed. The proposed method utilizes a DNN to map the speech features extracted from the coded speech to the glottal flow waveform estimated from the corresponding clean speech. To generate the coded telephone speech, the adaptive multi-rate (AMR) codec, a widely used speech compression method, is utilized. The proposed glottal source estimation method is compared with two existing GIF methods, closed phase covariance analysis (CP) and iterative adaptive inverse filtering (IAIF). The results indicate that the proposed DNN-based method is capable of estimating glottal flow waveforms from coded telephone speech with considerably better accuracy than CP and IAIF.

Item: Intelligent wearable allows out-of-the-lab tracking of developing motor abilities in infants (NATURE PORTFOLIO, 2022-06-15)
Airaksinen, Manu; Gallen, Anastasia; Kivi, Anna; Vijayakrishnan, Pavithra; Häyrinen, Taru; Ilen, Elina; Räsänen, Okko; Haataja, Leena; Vanhatalo, Sampsa; Department of Design; Fashion/Textile Futures; BABA Center; Helsinki University Hospital; Tampere University; University of Helsinki

Background: Early neurodevelopmental care needs better, effective and objective solutions for assessing infants' motor abilities. Novel wearable technology opens possibilities for characterizing spontaneous movement behavior. This work seeks to construct and validate a generalizable, scalable, and effective method to measure infants' spontaneous motor abilities across all motor milestones, from lying supine to fluent walking.

Methods: A multi-sensor infant wearable was constructed, and 59 infants (age 5–19 months) were recorded during their spontaneous play. A novel gross motor description scheme was used for human visual classification of postures and movements at second-level time resolution. A deep learning-based classifier was then trained to mimic human annotations, and aggregated recording-level outputs were used to provide posture- and movement-specific developmental trajectories, which enabled more holistic assessments of motor maturity.

Results: Recordings were technically successful in all infants, and the algorithmic analysis showed human-equivalent accuracy in quantifying the observed postures and movements. The aggregated recordings were used to train an algorithm for predicting a novel neurodevelopmental measure, the Baba Infant Motor Score (BIMS). This index estimates the maturity of infants' motor abilities, and it correlates very strongly with infants' age (Pearson's r = 0.89).

Conclusions: The results show that out-of-hospital assessment of infants' motor ability is possible using a multi-sensor wearable. The algorithmic analysis provides metrics of motility that are transparent, objective, intuitively interpretable, and strongly linked to infants' age. Such a solution could be automated and scaled to a global extent, holding promise for functional benchmarking in individualized patient care or early intervention trials.
Item: Methods for the application of glottal inverse filtering to statistical parametric speech synthesis (Aalto University, 2018)
Airaksinen, Manu; Department of Signal Processing and Acoustics; School of Electrical Engineering; Alku, Paavo, Academy Prof., Aalto University, Department of Signal Processing and Acoustics, Finland

Speech is a fundamental method of human communication that allows conveying information between people. Even though the linguistic content is commonly regarded as the main information in speech, the signal contains a richness of other information, such as prosodic cues that shape the intended meaning of a sentence. This information is largely generated by the quasi-periodic glottal excitation, the acoustic speech excitation airflow that originates from the lungs and makes the vocal folds oscillate in the production of voiced speech. By regulating the subglottal pressure and the tension of the vocal folds, humans learn to affect the characteristics of the glottal excitation, for example, in order to signal the emotional state of the speaker. Glottal inverse filtering (GIF) is an estimation method for the glottal excitation of a recorded speech signal. Various cues about the speech signal, such as the mode of phonation, can be detected and analyzed from an estimate of the glottal flow, both instantaneously and as a function of time. Aside from its use in fundamental speech research, such as phonetics, recent advances in GIF and machine learning enable a wider variety of GIF applications, such as emotional speech synthesis and the detection of paralinguistic information. However, GIF is a difficult inverse problem in which the target algorithm output is generally unattainable with direct measurements. Thus the algorithms and their evaluation need to rely on prior assumptions about the properties of the speech signal. A common thread in most of the studies in this thesis is the estimation of the vocal tract transfer function (the key problem in GIF) by temporally weighting the optimization criterion in GIF so that the effect of the main excitation peak is attenuated. This thesis studies GIF from various perspectives, including the development of two new GIF methods that improve GIF performance over state-of-the-art methods, and furthers basic research in the automated estimation of the glottal excitation. The estimation of the GIF-based vocal tract transfer function for formant tracking and perceptually weighted speech envelope estimation is also studied. The central speech technology application of GIF addressed in the thesis is the use of GIF-based spectral envelope models and glottal excitation waveforms as target training data for the generative neural network models used in statistical parametric speech synthesis. The obtained results show that even though the presented studies provide improvements to the previous methodology for all voice types, GIF-based speech processing continues to mainly benefit male voices in speech synthesis applications.
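The temporally weighted optimization criterion that runs through the thesis can be written down directly. The sketch below implements a generic weighted linear prediction (WLP) frame analysis: the caller supplies a weighting function w(n) that de-emphasizes samples around the main excitation peaks, and the vocal tract model is solved from weighted normal equations. The weighting function itself is method-specific (e.g., QCP-style attenuation around glottal closure instants) and is left as an input here.

```python
import numpy as np

def weighted_lp(x, order, w):
    """Weighted LP for one frame: minimize sum_n w(n) * e(n)^2, where
    e(n) = x(n) - sum_k a_k x(n-k). Returns A(z) = [1, -a_1, ..., -a_p]."""
    N = len(x)
    X = np.zeros((N, order))
    for k in range(1, order + 1):       # column k-1 holds x delayed by k samples
        X[k:, k - 1] = x[:N - k]
    R = X.T @ (w[:, None] * X)          # weighted normal-equation matrix
    r = X.T @ (w * x)
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r)
    return np.concatenate(([1.0], -a))

# Inverse filtering with the resulting model: the residual approximates the
# (derivative of the) glottal excitation when w attenuates the excitation peaks:
#   residual = scipy.signal.lfilter(A, [1.0], frame)
```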
Item: Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks (Elsevier, 2019-07-01)
Bollepalli, Bajibabu; Juvela, Lauri; Airaksinen, Manu; Valentini-Botinhao, Cassia; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology; University of Edinburgh

In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network-based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In the objective evaluations and speaking style similarity tests, the FT method outperformed the other two adaptation methods. In the speech intelligibility tests, there were no significant differences between the vocoders, although the PML vocoder showed slightly better performance than the other three.
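Of the three methods, fine-tuning is the most direct: training simply continues on the adaptation data. A minimal PyTorch sketch is shown below; the learning rate, epoch count, and optional layer freezing are illustrative choices, not the paper's configuration.

```python
import torch

def fine_tune(model, lombard_loader, epochs=5, lr=1e-4, freeze_prefix=None):
    """Continue training a normal-style acoustic model on Lombard data (FT).
    freeze_prefix (e.g. "lstm.") optionally keeps early layers fixed so that
    only the remaining parameters adapt to the new style."""
    if freeze_prefix is not None:
        for name, param in model.named_parameters():
            if name.startswith(freeze_prefix):
                param.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for linguistic_feats, acoustic_targets in lombard_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(linguistic_feats), acoustic_targets)
            loss.backward()
            optimizer.step()
    return model
```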
Item: An Open Source Classifier for Bed Mattress Signal in Infant Sleep Monitoring (FRONTIERS MEDIA SA, 2021-01-14)
Ranta, Jukka; Airaksinen, Manu; Kirjavainen, Turkka; Vanhatalo, Sampsa; Stevenson, Nathan J.; Dept Signal Process and Acoust; Jorma Skyttä's Group; University of Helsinki; Queensland Institute of Medical Research

Objective: To develop a non-invasive and clinically practical method for long-term monitoring of infant sleep cycling in the intensive care unit.

Methods: Forty-three infant polysomnography recordings were performed at 1–18 weeks of age, including a piezo element bed mattress sensor to record respiratory and gross body movements. The hypnogram scored from the polysomnography signals was used as the ground truth in training sleep classifiers based on 20,022 epochs of movement and/or electrocardiography signals. Three classifier designs were evaluated in the detection of deep sleep (N3 state): a support vector machine (SVM), a long short-term memory (LSTM) neural network, and a convolutional neural network (CNN).

Results: Deep sleep was accurately distinguished from the other states with all classifier variants. The SVM classifier based on a combination of movement and electrocardiography features had the highest performance (AUC 97.6%). An SVM classifier based on movement features alone had comparable accuracy (AUC 95.0%), and the feature-independent CNN performed roughly comparably (AUC 93.3%).

Conclusion: Automated non-invasive tracking of sleep state cycling is technically feasible using measurements from a piezo element situated under a bed mattress.

Significance: An open source infant deep sleep detector of this kind allows quantitative, continuous bedside assessment of an infant's sleep cycling.

Item: OPENGLOT – An open environment for the evaluation of glottal inverse filtering (Elsevier, 2019-02-01)
Alku, Paavo; Murtola, Tiina; Malinen, Jarmo; Kuortti, Juha; Story, Brad; Airaksinen, Manu; Salmi, Mika; Vilkman, Erkki; Geneid, Ahmed; Dept Signal Process and Acoust; Department of Mathematics and Systems Analysis; Department of Mechanical Engineering; Speech Communication Technology; Numerical Analysis; Jorma Skyttä's Group; Advanced Manufacturing and Materials; University of Arizona; University of Helsinki

Glottal inverse filtering (GIF) refers to technology that estimates the source of voiced speech, the glottal flow, from speech signals. When a new GIF algorithm is proposed, its accuracy needs to be evaluated. The evaluation of GIF is, however, problematic because the ground truth, the real glottal volume velocity signal generated by the vocal folds, cannot be recorded non-invasively from natural speech. This absence of the ground truth has been circumvented in most previous GIF studies by using simple linear source-filter synthesis techniques with known artificial glottal flow models and all-pole vocal tract filters. In a few previous studies, physical modeling of speech production has also been utilized in synthesizing test data for GIF evaluation. The evaluation strategy is, however, scattered between individual investigations, and there is currently no coherent, common platform for GIF evaluation. To address this shortcoming, the current study introduces a new environment, called OPENGLOT, for GIF evaluation. The key ideas of OPENGLOT are twofold: the environment is versatile (i.e., it provides different types of test signals for GIF evaluation) and open (i.e., the system can be used by anyone who wants to evaluate her or his new GIF method and compare it objectively to previously developed benchmark techniques). OPENGLOT consists of four main parts, Repositories I–IV, that contain data and sound synthesis software. Repository I contains a large set of synthetic glottal flow waveforms, and speech signals generated by using the Liljencrants–Fant (LF) waveform as an artificial excitation and a digital all-pole filter to model the vocal tract. Repository II contains glottal flow and speech pressure signals generated using physical modeling of human speech production. Repository III contains pairs of glottal excitation and speech pressure signals generated by exciting a 3D-printed plastic vocal tract replica with LF excitations via a loudspeaker. Finally, Repository IV contains multichannel recordings (speech pressure signal, electroglottogram, high-speed video of the vocal folds) from natural production of speech. After presenting these four core parts of OPENGLOT, the article demonstrates the platform by presenting a typical use case.
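The Repository I style of test material (known excitation plus all-pole vocal tract filter) is easy to emulate. The sketch below generates a vowel from a pulse-train excitation and a formant-based all-pole filter; for brevity it uses a simple Rosenberg-style pulse instead of the LF waveform used in OPENGLOT, and the formant frequencies and bandwidths are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

FS = 8000  # Hz

def rosenberg_pulse(n_open, n_close):
    """Simple glottal flow pulse (Rosenberg model) standing in for the LF waveform."""
    opening = 0.5 * (1 - np.cos(np.pi * np.arange(n_open) / n_open))
    closing = np.cos(0.5 * np.pi * np.arange(n_close) / n_close)
    return np.concatenate([opening, closing])

def all_pole_from_formants(formants_hz, bandwidths_hz, fs=FS):
    """Build A(z) with one complex-conjugate pole pair per formant."""
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * f / fs
        a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])
    return a

def synth_vowel(f0=120.0, formants=(660, 1720, 2410), bws=(80, 100, 120), dur=0.5):
    """Return (glottal flow, speech): a known ground-truth pair for GIF evaluation."""
    period = int(FS / f0)
    pulse = rosenberg_pulse(int(0.6 * period), int(0.3 * period))
    flow = np.zeros(int(FS * dur))
    for start in range(0, len(flow) - len(pulse), period):
        flow[start:start + len(pulse)] += pulse
    # differentiate the flow to include lip radiation, then apply the vocal tract
    speech = lfilter([1.0], all_pole_from_formants(formants, bws),
                     np.diff(flow, prepend=0.0))
    return flow, speech
```

Because the excitation is known exactly, any GIF estimate can be scored against it directly, which is the premise of Repositories I–III.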
Item: Puhesignaalin perustaajuusestimointi [Fundamental frequency estimation of the speech signal] (2010)
Airaksinen, Manu; Fagerlund, Seppo; Faculty of Electronics, Communications and Automation; Turunen, Markus

Item: Quasi-closed phase forward-backward linear prediction analysis of speech for accurate formant detection and estimation (2017-09-25)
Gowda, Dhananjaya; Airaksinen, Manu; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology

Recently, a quasi-closed phase (QCP) analysis of speech signals for accurate glottal inverse filtering was proposed. However, QCP analysis, which belongs to the family of temporally weighted linear prediction (WLP) methods, uses the conventional forward type of sample prediction. This may not be the best choice, especially when computing WLP models with a hard-limiting weighting function, because a sample-selective minimization of the prediction error in WLP reduces the effective number of samples available within a given window frame. To counter this problem, a modified quasi-closed phase forward-backward (QCP-FB) analysis is proposed, wherein each sample is predicted based on both its past and its future samples, thereby utilizing the available samples more effectively. Formant detection and estimation experiments on synthetic vowels generated using a physical modeling approach, as well as on natural speech utterances, show that the proposed QCP-FB method yields statistically significant improvements over the conventional linear prediction and QCP methods.

Item: Semi-supervised machine learning techniques for infant motility classification (2021-10-18)
Vijayakrishnan, Pavithra; Airaksinen, Manu; School of Science; Jung, Alex

Activity recognition (AR) is an emerging field due to its direct applications in various areas, including the fitness and health sector. AR involves the classification of human activities and movements into different categories. Specifically, infant activity recognition (IAR) assists in diagnosing motor disorders, such as cerebral palsy and hemiplegia, and early diagnosis opens avenues for improved care and treatment. IAR can be performed by applying machine learning (ML) techniques. Most studies on infant motility classification employ supervised ML methods that require data to be manually annotated. However, manual annotation of infant motility data into different categories is extremely laborious, expensive, and prone to ambiguity. Therefore, in order to reduce the heavy reliance on manually annotated data, the aim of this thesis is to evaluate the feasibility of using semi-supervised machine learning techniques for classifying infant movement data. The infant data used for this study were acquired with inertial measurement unit (IMU) sensors attached to a wearable Maiju jumpsuit. The semi-supervised learning methods applied here first utilize unannotated data for representation learning with unsupervised learning algorithms such as an autoencoder (AE) and contrastive predictive coding (CPC). These learned representations are then used to perform movement classification on annotated data using supervised learning algorithms. The optimal semi-supervised model is determined by tuning the hyperparameters based on the unweighted average F1-score (UWAF) metric. The results of this study indicate that the UWAF scores obtained with the optimal semi-supervised models are better than those of end-to-end supervised models, especially for smaller amounts of available annotated data. Therefore, semi-supervised learning, employing unsupervised pre-training for representation learning followed by supervised learning of the movement classes on the learned representations, provides a viable and cost-effective methodology for IAR.
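The two-stage recipe (unsupervised pre-training, then supervised classification on the learned representation) can be sketched as follows in PyTorch. The autoencoder architecture, window size, and latent dimensionality are illustrative assumptions; the thesis also evaluates CPC-based representations, which are not shown here.

```python
import torch
import torch.nn as nn

LATENT = 64                                   # assumed representation size

class IMUAutoencoder(nn.Module):
    """Representation learner for flattened IMU windows (no labels needed)."""
    def __init__(self, n_in=24 * 100):        # e.g. 24 IMU channels x 100 samples
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(),
                                     nn.Linear(256, LATENT))
        self.decoder = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                     nn.Linear(256, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def pretrain(ae, unlabeled_loader, epochs=10, lr=1e-3):
    """Stage 1: reconstruction training on unannotated windows."""
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        for x in unlabeled_loader:
            opt.zero_grad()
            nn.functional.mse_loss(ae(x), x).backward()
            opt.step()

def make_classifier(ae, n_classes):
    """Stage 2: freeze the encoder and train a small head on annotated data."""
    for p in ae.encoder.parameters():
        p.requires_grad = False
    return nn.Sequential(ae.encoder, nn.Linear(LATENT, n_classes))
```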
Item: Speaker-independent raw waveform model for glottal excitation (2018-09-02)
Juvela, Lauri; Tsiaras, Vassilis; Bollepalli, Bajibabu; Airaksinen, Manu; Yamagishi, Junichi; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology; University of Crete; National Institute of Informatics

Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., for generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to train a speaker-independent waveform generator more effectively with limited resources. We present a multi-speaker 'GlotNet' vocoder, which utilizes a WaveNet to generate glottal excitation waveforms that are then used to excite the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model compares favourably to a direct WaveNet vocoder trained with the same model architecture and data.

Item: Speech Waveform Synthesis from MFCC Sequences with Generative Adversarial Networks (2018-09-10)
Juvela, Lauri; Bollepalli, Bajibabu; Wang, Xin; Kameoka, Hirokazu; Airaksinen, Manu; Yamagishi, Junichi; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology; Nippon Telegraph and Telephone Corporation; National Institute of Informatics

This paper proposes a method for generating speech from filterbank mel-frequency cepstral coefficients (MFCCs), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis. First, fundamental frequency and voicing information are predicted from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information contained in the MFCCs is converted to all-pole filters, and a pitch-synchronous excitation model matched to these filters is trained. Finally, a generative adversarial network-based noise model is introduced to add a realistic high-frequency stochastic component to the modeled excitation signal. The results show that high-quality speech reconstruction can be obtained given only MFCC information at test time.
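The second stage, recovering spectral envelope information from MFCCs, hinges on approximately inverting the cepstral analysis chain. The sketch below shows one common way to do this (inverse DCT back to a log mel spectrum, then a pseudo-inverse of the mel filterbank); it is a generic reconstruction, not the paper's exact procedure, and the filterbank parameters are illustrative.

```python
import numpy as np
from scipy.fft import idct

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular mel filterbank, shape (n_mels, n_fft // 2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc_to_envelope(mfcc, n_mels=40, n_fft=512, fs=16000):
    """MFCC vector -> approximate linear-frequency magnitude envelope.
    The filterbank is non-square, so the pseudo-inverse only approximates
    the original envelope."""
    log_mel = idct(mfcc, n=n_mels, norm="ortho")   # undo the DCT
    mel_spec = np.exp(log_mel)                     # undo the log
    env = np.linalg.pinv(mel_filterbank(n_mels, n_fft, fs)) @ mel_spec
    return np.maximum(env, 1e-8)                   # clip small negatives from pinv
```

An all-pole filter can then be fitted to an envelope of this kind (e.g., via the inverse FFT of its squared magnitude followed by Levinson-Durbin recursion), which is the filter form that the excitation model described above is matched to.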
Item: Tilastollisessa parametrisessa puhesynteesissä käytettyjen vokooderien analyysi-synteesi-vertailu [An analysis-synthesis comparison of vocoders used in statistical parametric speech synthesis] (2012)
Airaksinen, Manu; Raitio, Tuomo; Department of Signal Processing and Acoustics; School of Electrical Engineering; Alku, Paavo

This thesis presents a literature review and an experimental study of vocoders used in statistical parametric speech synthesis. In the experimental part, the analysis-synthesis properties of three selected vocoders (GlottHMM, STRAIGHT, and Harmonic/Stochastic Model) are examined in several ways. The performed experiments were an analysis of the statistical distributions of the vocoder parameters, an analysis of the statistical effect of the emotional state of speech on the parameter distributions, and a subjective listening test measuring the relative analysis-synthesis quality of the vocoders. The results show that the STRAIGHT vocoder has the most Gaussian parameter distributions and the most consistent synthesis quality. The parameters of the GlottHMM vocoder were the most sensitive to the emotional state of speech, and the vocoder obtained the best, although variable, listening test result. The LSF parameters of the HSM vocoder were found to be more Gaussian than those of the GlottHMM vocoder, but the vocoder was found to suffer from noise sensitivity and received the worst listening test result.

Item: Time-regularized linear prediction for noise-robust extraction of the spectral envelope of speech (International Speech Communication Association, 2018-09-02)
Airaksinen, Manu; Juvela, Lauri; Räsänen, Okko; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology

Feature extraction from speech signals is typically performed in short-time frames by assuming that the signal is stationary within each frame. For the extraction of the spectral envelope of speech, which conveys the formant frequencies produced by the resonances of the slowly varying vocal tract, a commonly used frame length is 20-30 ms. However, this kind of conventional frame-based spectral analysis is oblivious to the broader temporal context of the signal and is prone to degradation by, for example, environmental noise. In this paper, we propose a new frame-based linear prediction (LP) analysis method that includes a regularization term penalizing energy differences between consecutive frames of an all-pole spectral envelope model. This integrates the slowly varying nature of the vocal tract into the analysis. Objective evaluations related to feature distortion and phonetic representational capability were performed by studying the properties of mel-frequency cepstral coefficient (MFCC) representations computed with different spectral estimation methods under noisy conditions using the TIMIT database. The results show that the proposed time-regularized LP approach exhibits superior MFCC distortion behavior while simultaneously having the greatest average separability of different phoneme categories in comparison to the other methods.
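The structure of such a regularized LP analysis can be sketched briefly. Note that the published method penalizes energy differences between consecutive all-pole envelopes; to keep the sketch short, the penalty below instead ties the LP coefficients of frame t to those of frame t-1, which shows the shape of the regularized normal equations but is not the paper's exact criterion. The model order and regularization weight are illustrative.

```python
import numpy as np

def lp_normal_equations(x, order):
    """Autocorrelation-method LP matrices: R (order x order) and r (order,).
    Assumes len(x) > order."""
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    R = np.array([[ac[abs(i - j)] for j in range(order)] for i in range(order)])
    return R, ac[1:order + 1]

def time_regularized_lp(frames, order=20, lam=0.1):
    """Per-frame LP with a smoothness penalty coupling frame t to frame t-1:
    a_t = argmin ||X a - x||^2 + lam * ||a - a_{t-1}||^2."""
    prev = np.zeros(order)
    coeffs = []
    for x in frames:
        R, r = lp_normal_equations(x, order)
        a = np.linalg.solve(R + lam * np.eye(order), r + lam * prev)
        coeffs.append(np.concatenate(([1.0], -a)))   # A(z) = [1, -a_1, ..., -a_p]
        prev = a
    return coeffs
```

With lam = 0 this reduces to ordinary frame-independent LP; increasing lam smooths the envelope trajectory across frames, which is the mechanism that provides the noise robustness studied in the paper.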