Browsing by Author "Juvela, Lauri"
Now showing 1 - 20 of 35
Item Äänisisällön suojaus vesileimauksella (2024-05-23)
Laitinen, Reeta; Juvela, Lauri; Sähkötekniikan korkeakoulu; Aalto, Samuli
A watermark is a digital identifier that can be used, for example, to verify the origin or authenticity of data. The traditional applications of watermarking are copyright protection and verification of the ownership of digital content. In recent years, new applications have emerged, such as watermarking AI-generated content. The most important properties of a watermark are robustness (the ability to withstand modifications targeting the watermark), imperceptibility, and capacity, and a balance between them must be found for each application. When designing a watermark, the central goal is to create one that withstands attacks effectively while preserving the quality of the original signal as well as possible. It is also important that the watermark can be detected and verified from the host signal. This bachelor's thesis covers watermarks based on traditional digital signal processing, such as time-domain and transform-domain methods, as well as approaches based on deep learning. Traditional watermarking methods have been developed for decades and many variants exist, but their problems include weak robustness and overly complex implementations, which hinder their wider adoption. In recent years, deep learning has been used to seek solutions to these problems, since deep learning models can be trained to operate across diverse conditions and against multiple attacks. Watermarking methods based on deep learning models can offer solutions for embedding complex watermarks and for automating watermarking, which could improve the accessibility of watermarking. The goal of this work was to compare the performance of different methods and their strengths and weaknesses. The work also examined the detection of watermarks from the host signal and future research directions. Based on the literature review, deep learning-based watermarking methods are a promising tool for protecting digital audio content. However, the use of deep learning in audio watermarking is still at an early stage of development, and traditional watermarks still play an important role, for example as a foundation for new research. Audio watermarking will remain an important research topic in the protection of digital content.

Item Adversarial Guitar Amplifier Modelling with Unpaired Data (2023-06-10)
Wright, Alec; Välimäki, Vesa; Juvela, Lauri; Department of Information and Communications Engineering; Audio Signal Processing; Speech Synthesis
We propose an audio effects processing framework that learns to emulate a target electric guitar tone from a recording. We train a deep neural network using an adversarial approach, with the goal of transforming the timbre of a guitar into the timbre of another guitar after audio effects processing has been applied, for example, by a guitar amplifier. The model training requires no paired data, and the resulting model emulates the target timbre well whilst being capable of real-time processing on a modern personal computer. To verify our approach, we present two experiments: one that carries out unpaired training using paired data, allowing us to monitor training via objective metrics, and another that uses fully unpaired data, corresponding to a realistic scenario where a user wants to emulate a guitar timbre using only audio data from a recording. Our listening test results confirm that the models are perceptually convincing.
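The unpaired adversarial setup in the entry above can be pictured with a short sketch: a generator processes clean guitar audio, and a discriminator is trained to tell its output apart from (unrelated) recordings of the target tone, so no aligned input/output pairs are needed. The PyTorch code below is a minimal illustration under that reading; the tiny generator/discriminator stacks and learning rates are placeholder assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

# Toy placeholder networks; the paper uses larger architectures.
generator = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.Tanh(),
    nn.Conv1d(16, 1, kernel_size=9, padding=4))
discriminator = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=15, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(16, 1, kernel_size=15, stride=4))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(clean_guitar, target_tone):
    """clean_guitar and target_tone are (batch, 1, samples) waveforms;
    crucially, they do not need to be paired or time-aligned."""
    fake = generator(clean_guitar)

    # Discriminator: separate real target-tone audio from processed audio.
    d_real, d_fake = discriminator(target_tone), discriminator(fake.detach())
    d_loss = (bce(d_real, torch.ones_like(d_real)) +
              bce(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: process the guitar so the discriminator is fooled.
    d_fake = discriminator(fake)
    g_loss = bce(d_fake, torch.ones_like(d_fake))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

training_step(torch.randn(4, 1, 8192), torch.randn(4, 1, 8192))
```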
Item Analyzing sentiments in social media posts discussing Posti: leveraging OmaPosti feedback and an open source Finnish sentiment dataset (2023-10-09)
Piippo, Alisa; Ali, Muhammad Irfan; Perustieteiden korkeakoulu; Juvela, Lauri

Item ASVspoof 2019: a large-scale public database of synthesized, converted and replayed speech (Academic Press Inc., 2020-11)
Wang, Xin; Yamagishi, Junichi; Todisco, Massimiliano; Delgado, Hector; Nautsch, Andreas; Evans, Nicholas; Sahidullah, Md; Vestman, Ville; Kinnunen, Tomi; Lee, Kong Aik; Juvela, Lauri; Alku, Paavo; Peng, Yu-Huai; Hwang, Hsin-Te; Tsao, Yu; Wang, Hsin-Min; Le Maguer, Sebastien; Becker, Markus; Henderson, Fergus; Clark, Rob; Zhang, Yu; Wang, Quan; Jia, Ye; Onuma, Kai; Mushika, Koji; Kaneda, Takashi; Jiang, Yuan; Liu, Li-Juan; Wu, Yi-Chiao; Huang, Wen-Chin; Toda, Tomoki; Tanaka, Kou; Kameoka, Hirokazu; Steiner, Ingmar; Matrouf, Driss; Bonastre, Jean-Francois; Govender, Avashna; Ronanki, Srikanth; Zhang, Jing-Xuan; Ling, Zhen-Hua; Dept Signal Process and Acoust; Speech Communication Technology; National Institute of Informatics; EURECOM; Université de Lorraine; University of Eastern Finland; NEC Corporation; Academia Sinica; Trinity College Dublin; Google, USA; HOYA Corporation; IFLYTEK Co., Ltd.; Nagoya University; NTT Communication Science Laboratories; AudEERING GmbH; Avignon Université; University of Edinburgh; University of Science and Technology of China
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof challenge initiative was created to foster research on anti-spoofing and to provide common platforms for the assessment and comparison of spoofing countermeasures. The first edition, ASVspoof 2015, focused upon the study of countermeasures for the detection of text-to-speech synthesis (TTS) and voice conversion (VC) attacks. The second edition, ASVspoof 2017, focused instead upon replay spoofing attacks and countermeasures. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and the same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than was previously possible. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment of spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona fide utterances even by human subjects. It is expected that the ASVspoof 2019 database, with its varied coverage of different types of spoofing data, could further foster research on anti-spoofing.
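The primary metric above is the tandem detection cost function; spoofing countermeasures in the ASVspoof challenges are also commonly summarised with an equal error rate (EER) over detection scores. As a small, self-contained illustration of that kind of threshold-based scoring (not the official challenge scoring tool), an EER can be estimated like this:

```python
import numpy as np

def equal_error_rate(bona_fide_scores, spoof_scores):
    """EER: the operating point where the miss rate on bona fide trials
    equals the false-alarm rate on spoofed trials (higher score = bona fide)."""
    thresholds = np.sort(np.concatenate([bona_fide_scores, spoof_scores]))
    miss = np.array([(bona_fide_scores < t).mean() for t in thresholds])
    false_alarm = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - false_alarm))  # closest crossing point
    return (miss[idx] + false_alarm[idx]) / 2

# Toy example with synthetic, partially overlapping score distributions.
rng = np.random.default_rng(0)
eer = equal_error_rate(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000))
print(f"EER: {eer:.3f}")
```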
Item Audioefektien mallintaminen neuroverkoilla (2024-05-20)
Kajasvirta, Mikael; Juvela, Lauri; Sähkötekniikan korkeakoulu; Aalto, Samuli
Audio effects process an incoming audio signal in a way that changes its characteristics; examples include distortion, reverberation, and compression effects. Many popular audio effects are analog devices. Traditionally, analog audio effects are modeled with digital signal processing by simulating the structure of the system digitally. A simpler alternative is to train a neural network that models the behavior of the device. Audio effects can be modeled using various neural network architectures; popular choices include recurrent neural networks, convolutional neural networks, and hybrid models combining several architectures. The purpose of this work was to gather the existing knowledge on modeling audio effects with neural networks and to survey the state of current research. Based on the literature review conducted in the work, the quality of the effects modeled in current research is at a good level, with convincing results in MUSHRA listening tests. In addition, effects have been modeled successfully in real time, which is an important criterion in many applications.

Item Augmented CycleGANs for continuous scale normal-to-Lombard speaking style conversion (2019)
Seshadri, Shreyas; Juvela, Lauri; Alku, Paavo; Räsänen, Okko; Dept Signal Process and Acoust; Speech Communication Technology; Jorma Skyttä's Group
Lombard speech is a speaking style associated with increased vocal effort that is naturally used by humans to improve intelligibility in the presence of noise. It is hence desirable to have a system capable of converting speech from normal to Lombard style. Moreover, it would be useful if one could adjust the degree of Lombardness in the converted speech so that the system is more adaptable to different noise environments. In this study, we propose the use of recently developed augmented cycle-consistent adversarial networks (Augmented CycleGANs) for conversion between normal and Lombard speaking styles. The proposed system gives smooth control over the degree of Lombardness of the mapped utterances by traversing different points in the latent space of the trained model. We utilize a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract features from normal speech that are then mapped to Lombard-style features using the Augmented CycleGAN. Finally, the mapped features are converted to Lombard speech with PML. The model is trained on multi-language data recorded in different noise conditions, and we compare its effectiveness to a previously proposed CycleGAN system in experiments on the intelligibility and quality of the mapped speech.
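The continuous control described above comes from traversing the latent space of the trained mapping. A minimal sketch of the idea follows, with a toy stand-in for the trained generator; the real model is an Augmented CycleGAN generator operating on PML vocoder features, and the module, feature dimension, and code dimension below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyMapper(nn.Module):
    """Toy stand-in for a trained mapping G(features, z)."""
    def __init__(self, feat_dim=60, z_dim=8):
        super().__init__()
        self.net = nn.Linear(feat_dim + z_dim, feat_dim)

    def forward(self, feats, z):
        z = z.expand(feats.shape[0], -1)      # broadcast one code over frames
        return self.net(torch.cat([feats, z], dim=-1))

def lombardness_sweep(G, feats, z_mild, z_strong, steps=5):
    """Interpolate the latent code to sweep the degree of Lombardness."""
    return [G(feats, (1 - a) * z_mild + a * z_strong)
            for a in torch.linspace(0.0, 1.0, steps)]

G = ToyMapper()
feats = torch.randn(100, 60)                  # 100 frames of vocoder features
outputs = lombardness_sweep(G, feats, torch.randn(1, 8), torch.randn(1, 8))
```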
Item Collaborative Watermarking for Adversarial Speech Synthesis (2024-03-18)
Juvela, Lauri; Wang, Xin; Department of Information and Communications Engineering; Speech Synthesis; National Institute of Informatics
Advances in neural speech synthesis have brought us technology that is not only close to human naturalness, but is also capable of instant voice cloning with little data, and is highly accessible with pre-trained models available. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic speech detection has been related to the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view to generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine, but remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests demonstrate that collaborative training has little adverse effect on the perceptual quality of vocoded speech.
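The collaborative scheme above can be contrasted with ordinary adversarial training in a few lines: the detector's classification loss is backpropagated into the vocoder without flipping its sign, so the vocoder learns to make its output easier, not harder, to detect. The sketch below illustrates that loss structure only; the one-layer "vocoder", the L1 quality term, and the label convention are assumptions for the example, not the paper's HiFi-GAN and ASVspoof countermeasure setup.

```python
import torch
import torch.nn as nn

# Toy stand-ins: mel (batch, 80, T) -> waveform (batch, 1, T) for simplicity.
vocoder = nn.Conv1d(80, 1, kernel_size=1)
detector = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=15, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 1))
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(list(vocoder.parameters()) +
                       list(detector.parameters()), lr=1e-4)

def collaborative_step(mel, real_wave, lam=1.0):
    fake_wave = vocoder(mel)
    # Label real speech 0 and vocoded speech 1. Unlike a GAN, the same loss
    # is minimised by BOTH networks, so the vocoder learns to watermark its
    # output in a way that aids the detector.
    logits = torch.cat([detector(real_wave), detector(fake_wave)])
    labels = torch.cat([torch.zeros(real_wave.shape[0], 1),
                        torch.ones(fake_wave.shape[0], 1)])
    detect_loss = bce(logits, labels)
    recon_loss = nn.functional.l1_loss(fake_wave, real_wave)  # keep quality
    loss = recon_loss + lam * detect_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

collaborative_step(torch.randn(2, 80, 1024), torch.randn(2, 1, 1024))
```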
Item A comparison between STRAIGHT, glottal, and sinusoidal vocoding in statistical parametric speech synthesis (2018-09)
Airaksinen, Manu; Juvela, Lauri; Bollepalli, Bajibabu; Yamagishi, Junichi; Alku, Paavo; Dept Signal Process and Acoust; National Institute of Informatics
A vocoder is used to express a speech waveform with a controllable parametric representation that can be converted back into a speech waveform. Vocoders representing the main categories (mixed excitation, glottal, and sinusoidal vocoders) were compared in this study with formal and crowd-sourced listening tests. Vocoder quality was measured within the context of analysis-synthesis as well as text-to-speech (TTS) synthesis in a modern statistical parametric speech synthesis framework. Furthermore, the TTS experiments were divided into synthesis with vocoder-specific features and synthesis with a shared envelope model, where the waveform generation method of the vocoders is mainly responsible for the quality differences. Finally, all of the tests included four distinct voices as a way to investigate the effect of different speakers on the synthesized speech quality. The obtained results suggest that the choice of the voice has a profound impact on the overall quality of the vocoder-generated speech, and the best vocoder for each voice can vary case by case. The single best-rated TTS system was obtained with the glottal vocoder GlottDNN using a male voice with low expressiveness. However, the results indicate that the sinusoidal vocoder PML (pulse model in log-domain) has the best overall performance across the performed tests. Finally, when controlling for the spectral models of the vocoders, the observed differences are similar to the baseline results. This indicates that the waveform generation method of a vocoder is essential for quality improvements.

Item Conditional Spoken Digit Generation with StyleGAN (International Speech Communication Association, 2020)
Palkama, Kasperi; Juvela, Lauri; Ilin, Alexander; Department of Computer Science; Dept Signal Process and Acoust; Professor of Practice Ilin Alexander; Aalto University
This paper adapts a StyleGAN model for speech generation with minimal or no conditioning on text. StyleGAN is a multi-scale convolutional GAN capable of hierarchically capturing data structure and latent variation on multiple spatial (or temporal) levels. The model has previously achieved impressive results on facial image generation, and it is appealing to audio applications due to similar multi-level structures present in the data. In this paper, we train a StyleGAN to generate mel-spectrograms on the Speech Commands dataset, which contains spoken digits uttered by multiple speakers in varying acoustic conditions. In a conditional setting our model is conditioned on the digit identity, while learning the remaining data variation remains an unsupervised task. We compare our model to the WaveGAN, the current unsupervised state-of-the-art GAN architecture for speech synthesis, and show that the proposed model outperforms it according to numerical measures and subjective evaluation in listening tests.

Item Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion (2019-05-01)
Seshadri, Shreyas; Juvela, Lauri; Yamagishi, Junichi; Räsänen, Okko; Alku, Paavo; Dept Signal Process and Acoust; Jorma Skyttä's Group; Speech Communication Technology; Research Organization of Information and Systems, National Institute of Informatics
Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we propose the use of cycle-consistent adversarial networks (CycleGANs) for converting styles with varying vocal effort, and focus on conversion between normal and Lombard styles as a case study of this problem. We propose a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract speech features. These features are mapped using the CycleGAN from utterances in the source style to the corresponding features of the target speech. Finally, the mapped features are converted to a Lombard speech waveform with the PML. The CycleGAN was compared in subjective listening tests with two other standard mapping methods used in conversion, and the CycleGAN was found to have the best performance in terms of both speech quality and the magnitude of the perceptual change between the two styles.
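The core constraint that makes the non-parallel training above possible is cycle consistency: mapping features to the other style and back should reconstruct the input, which lets the model learn from unpaired normal and Lombard utterances. A minimal sketch of just that term (the adversarial and identity losses are omitted, and the toy linear mappers and 60-dimensional feature size are placeholders):

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_n2l, G_l2n, normal_feats, lombard_feats):
    """Mapping to the other style and back should recover the input."""
    cycle_n = G_l2n(G_n2l(normal_feats))    # normal -> Lombard -> normal
    cycle_l = G_n2l(G_l2n(lombard_feats))   # Lombard -> normal -> Lombard
    return (F.l1_loss(cycle_n, normal_feats) +
            F.l1_loss(cycle_l, lombard_feats))

# Toy stand-ins for the two feature mappers.
G_n2l = torch.nn.Linear(60, 60)
G_l2n = torch.nn.Linear(60, 60)
loss = cycle_consistency_loss(G_n2l, G_l2n,
                              torch.randn(32, 60), torch.randn(32, 60))
```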
Item Data augmentation strategies for neural network F0 estimation (2019-05-01)
Airaksinen, Manu; Juvela, Lauri; Alku, Paavo; Räsänen, Okko; Jorma Skyttä's Group; Speech Communication Technology
This study explores various speech data augmentation methods for the task of noise-robust fundamental frequency (F0) estimation with neural networks. The explored augmentation strategies are split into additive-noise and channel-based augmentation methods and vocoder-based augmentation methods. In vocoder-based augmentation, a glottal vocoder is used to enhance the accuracy of the ground truth F0 used for training the neural network, as well as to expand the diversity of the training data in terms of F0 patterns and the vocal tract lengths of the talkers. Evaluations on the PTDB-TUG corpus indicate that noise and channel augmentation can be used to greatly increase the noise robustness of trained models, and that vocoder-based ground truth enhancement further increases model performance. For smaller datasets, vocoder-based diversity augmentation can also be used to increase performance. The best-performing proposed method greatly outperformed the compared F0 estimation methods in terms of noise robustness.

Item Exposure Bias and State Matching in Recurrent Neural Network Virtual Analog Models (2021-09-08)
Peussa, Aleksi; Damskägg, Eero-Pekka; Sherson, Thomas; Mimilakis, Stylianos; Juvela, Lauri; Gotsopoulos, Athanasios; Välimäki, Vesa; Dept Signal Process and Acoust; Evangelista, Gianpaolo; Holighaus, Nicki; Audio Signal Processing; Neural DSP Technologies
Virtual analog (VA) modeling using neural networks (NNs) has great potential for rapidly producing high-fidelity models. Recurrent neural networks (RNNs) are especially appealing for VA due to their connection with discrete nodal analysis. Furthermore, VA models based on NNs can be trained efficiently by directly exposing them to the circuit states in a gray-box fashion. However, exposure to ground truth information during training can leave the models susceptible to error accumulation in a free-running mode, also known as "exposure bias" in the machine learning literature. This paper presents a unified framework for treating the previously proposed state trajectory network (STN) and gated recurrent unit (GRU) networks as special cases of discrete nodal analysis. We propose a novel circuit state-matching mechanism for the GRU and experimentally compare the aforementioned networks for their performance in state matching, during training, and in exposure bias, during inference. Experimental results from modeling a diode clipper show that all the tested models exhibit some exposure bias, which can be mitigated by truncated backpropagation through time. Furthermore, the proposed state-matching mechanism improves the GRU modeling performance of an overdrive pedal and a phaser pedal, especially in the presence of external modulation, which is apparent in a phaser circuit.
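Truncated backpropagation through time, mentioned above as a mitigation for exposure bias, is easy to sketch: the recurrent state is carried across segment boundaries in free-running fashion but detached from the computation graph, so gradients only flow a fixed number of steps back while the network still sees its own states. A minimal PyTorch version (the GRU size, segment length, and loss are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=1, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 1)
opt = torch.optim.Adam(list(rnn.parameters()) +
                       list(readout.parameters()), lr=1e-3)

def tbptt_epoch(inputs, targets, trunc=512):
    """inputs/targets: (batch, samples, 1). The hidden state is carried
    across segments but detached, so each update backpropagates at most
    `trunc` steps while the model runs on its own (possibly drifting) state."""
    h = None
    for t in range(0, inputs.shape[1], trunc):
        out, h = rnn(inputs[:, t:t + trunc], h)
        loss = nn.functional.mse_loss(readout(out), targets[:, t:t + trunc])
        opt.zero_grad(); loss.backward(); opt.step()
        h = h.detach()                      # truncate the gradient path here

tbptt_epoch(torch.randn(4, 4096, 1), torch.randn(4, 4096, 1))
```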
Item GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram (2019-01-01)
Juvela, Lauri; Bollepalli, Bajibabu; Yamagishi, Junichi; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology
Recent advances in neural network-based text-to-speech have reached human-level naturalness in synthetic speech. The present sequence-to-sequence models can directly map text to mel-spectrogram acoustic features, which are convenient for modeling, but present additional challenges for vocoding (i.e., waveform generation from the acoustic features). High-quality synthesis can be achieved with neural vocoders, such as WaveNet, but such autoregressive models suffer from slow sequential inference. Meanwhile, their existing parallel inference counterparts are difficult to train and require increasingly large model sizes. In this paper, we propose an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, and integrate a linear predictive synthesis filter into the model. Results show that the proposed model achieves significant improvement in inference speed, while outperforming a WaveNet in copy-synthesis quality.
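The key structural idea in GELP is that the spectral envelope is imposed by a linear predictive (all-pole) synthesis filter integrated into the model, so the adversarially trained generator only has to produce the excitation signal. A sketch of that filtering step with a toy one-pole envelope; in the paper the LPC envelope is derived from the mel-spectrogram and the excitation comes from the GAN, whereas here both are synthetic placeholders.

```python
import numpy as np
from scipy.signal import lfilter

def lp_synthesis(excitation, lpc):
    """All-pole linear predictive synthesis filter 1/A(z): the spectral
    envelope comes from the LPC coefficients, so the excitation model
    does not need to capture it."""
    a = np.concatenate(([1.0], lpc))   # A(z) = 1 + a1*z^-1 + ... + ap*z^-p
    return lfilter([1.0], a, excitation)

# Toy example: white-noise "excitation" shaped by a one-pole envelope.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(16000)   # in GELP, produced by the GAN
lpc = np.array([-0.9])                    # stable one-pole envelope
speech = lp_synthesis(excitation, lpc)
```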
Item Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis (2017-08)
Bollepalli, Bajibabu; Juvela, Lauri; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology
Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the glottal excitation and the vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the glottal excitation waveform by using deep neural networks (DNNs). However, the squared-error-based training of the present glottal excitation models is limited to generating conditional average waveforms, which fails to capture the stochastic variation of the waveforms. As a result, shaped noise is added as post-processing. In this study, we propose a new method for predicting glottal waveforms by generative adversarial networks (GANs). GANs are generative models that aim to embed the data distribution in a latent space, enabling generation of new instances very similar to the original by randomly sampling the latent distribution. The glottal pulses generated by GANs show a stochastic component similar to natural glottal pulses. In our experiments, we compare synthetic speech generated using glottal waveforms produced by both DNNs and GANs. The results show that the newly proposed GANs achieve synthesis quality comparable to that of widely used DNNs, without using an additive noise component.

Item Generative Adversarial Networks for Speech Synthesis (2020-06-15)
Palkama, Kasperi; Juvela, Lauri; Perustieteiden korkeakoulu; Ilin, Alexander
This thesis adapts a style-based generator architecture for generative adversarial networks (StyleGAN) for speech generation with minimal or no conditioning on text. StyleGAN is a multi-scale convolutional GAN capable of hierarchically capturing data structure and latent variation on multiple spatial (or temporal) levels. The model has previously achieved impressive results on facial image generation, and it is appealing to audio applications due to similar multi-level structures present in the data. In this thesis, we train a StyleGAN to generate mel-frequency spectrograms on the Speech Commands dataset, which contains spoken digits uttered by multiple speakers in varying acoustic conditions. In a conditional setting our model is conditioned on the digit identity, while learning the remaining data variation remains an unsupervised task. We compare our model to the WaveGAN, the current unsupervised state-of-the-art GAN architecture for speech synthesis, and show that the proposed model outperforms it according to numerical measures and subjective evaluation in listening tests.

Item GlotNet - A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis (IEEE Advancing Technology for Humanity, 2019-06-01)
Juvela, Lauri; Bollepalli, Bajibabu; Tsiaras, Vassilis; Alku, Paavo; Dept Signal Process and Acoust; Speech Communication Technology; University of Crete
Recently, generative neural network models which operate directly on raw audio, such as WaveNet, have improved the state of the art in text-to-speech synthesis (TTS). Moreover, there is increasing interest in using these models as statistical vocoders for generating speech waveforms from various acoustic features. However, there is also a need to reduce model complexity without compromising synthesis quality. Previously, glottal pulseforms (i.e., time-domain waveforms corresponding to the source of the human voice production mechanism) have been successfully synthesized in TTS by glottal vocoders using straightforward deep feedforward neural networks. Therefore, it is natural to extend the glottal waveform modeling domain to use the more powerful WaveNet-like architecture. Furthermore, due to their inherent simplicity, glottal excitation waveforms permit scaling down the waveform generator architecture. In this study, we present a raw waveform glottal excitation model, called GlotNet, and compare its performance with the corresponding direct speech waveform model, WaveNet, using equivalent architectures. The models are evaluated as part of a statistical parametric TTS system. Listening test results show that both approaches are rated highly in voice similarity to the target speaker and obtain similar quality ratings with large models. Furthermore, when the model size is reduced, the quality degradation is less severe for GlotNet.

Item Glottispulssikirjaston faktorointi puhesynteesiä varten (2011)
Juvela, Lauri; Raitio, Tuomo; Sähkötekniikan korkeakoulu; Liinaharja, Markku

Item In Search of the Perfect Prompt (2023-10-09)
Frîncu, Ioana; Wu, Ronin; Botev, Victor; Perustieteiden korkeakoulu; Juvela, Lauri
This study investigates the efficacy of soft and hard prompt strategies in the scientific domain, specifically for conversational abstract generation. The proposed approach incorporates two distinct methods, prompt engineering and prompt tuning, within a Conversational Recommender System (CRS) whose primary objective is to aid users in generating abstracts for their research. The evaluation integrates user research with objective performance criteria, examining the strengths and weaknesses of both categories of prompts: it begins with an analysis of the existing literature on CRS and prompting, followed by original experiments. The study makes three primary contributions. First, an analysis of the problem yields a compilation of requirements and hypothetical scenarios; this wish list covers technological, user, and functional perspectives that can inform future work in the area. Second, user studies form an integral part of the evaluation methodology: for the six participants, we analyze factors including cognitive load, response time, and overall satisfaction when using hard prompts within the CRS. We also examine the behavior and needs of the target demographic of academics and researchers; the findings suggest that this group favors factual, question-and-answer interactions over more expansive conversational exchanges. Third, the study assesses the comprehensibility and relevance of the generated abstracts using established criteria such as ROUGE and F1 scores. In our experiments, combining prompts with text-generation tasks tended to produce scientific abstracts that are imprecise and overly general, which conflicts with user expectations. The findings shed light on the difficulties and advantages of applying prompting techniques within a CRS, highlight the importance of contextual comprehension, and approach prompting strategies from both technical and user-centric viewpoints. A key finding is that prompt tactics should be customized to user preferences and domain demands. These findings contribute to the body of knowledge on conversational recommender systems and their applications in natural language processing.

Item KLANN: Linearising Long-Term Dynamics in Nonlinear Audio Effects Using Koopman Networks (IEEE, 2024-04-16)
Huhtala, Ville; Juvela, Lauri; Schlecht, Sebastian J.; Department of Information and Communications Engineering; Department of Art and Media; Speech Synthesis; Virtual Acoustics
In recent years, neural network-based black-box modeling of nonlinear audio effects has improved considerably. Present convolutional and recurrent models can model audio effects with long-term dynamics, but the models require many parameters, thus increasing the processing time. In this paper, we propose KLANN, a Koopman-Linearised Audio Neural Network structure that uses a nonlinear mapping to lift a one-dimensional signal (mono audio) into a high-dimensional, approximately linear state-space representation, and then uses differentiable biquad filters to predict linearly within the lifted state-space. Results show that the proposed models match the high performance of state-of-the-art neural models while having a more compact architecture, reducing the number of parameters tenfold, and having interpretable components.
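The lift-then-predict-linearly structure described in the KLANN entry can be caricatured in a few lines: a nonlinear encoder lifts each sample into a higher-dimensional space, linear dynamics advance the lifted state, and a projection returns to mono audio. In the sketch below a single learned state-transition matrix stands in for the paper's differentiable biquad filters, and all sizes are arbitrary; it illustrates the Koopman-style decomposition, not the published architecture.

```python
import torch
import torch.nn as nn

class ToyKoopmanAudioNet(nn.Module):
    """Greatly simplified KLANN-style model: nonlinear lift, linear
    recursion in the lifted space, linear projection back to one channel."""
    def __init__(self, lift_dim=16):
        super().__init__()
        self.lift = nn.Sequential(nn.Linear(1, lift_dim), nn.Tanh(),
                                  nn.Linear(lift_dim, lift_dim))
        self.A = nn.Linear(lift_dim, lift_dim, bias=False)  # linear dynamics
        self.proj = nn.Linear(lift_dim, 1)

    def forward(self, x):                     # x: (batch, samples, 1)
        z = self.lift(x)                      # lift every sample
        state = torch.zeros_like(z[:, 0])
        outs = []
        for t in range(x.shape[1]):
            state = self.A(state) + z[:, t]   # advance in the lifted space
            outs.append(self.proj(state))     # project back to mono audio
        return torch.stack(outs, dim=1)

model = ToyKoopmanAudioNet()
y = model(torch.randn(2, 64, 1))              # -> (2, 64, 1)
```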
Item Learning neural discrete representations for speech (2022-05-16)
Tulensalo, Jarkko; Juvela, Lauri; Perustieteiden korkeakoulu; Ilin, Alexander
Current state-of-the-art models in the text-to-speech domain do not generate the raw waveform directly. Instead, they generate variations of mel-frequency representations, which a separately trained vocoder then translates into a raw waveform. This thesis studied two hypotheses. First, we studied whether neural discrete representations can be learned from raw-waveform speech using Vector Quantized Variational Autoencoders. The results show that the model learns neural discrete representations that can be used for speech compression with high speech quality; we report a perceptual evaluation of speech quality (PESQ) score of 2.8, which indicates speech quality comparable to or higher than recent neural vocoders in the literature, and we present speech samples from the proposed model. Second, we studied whether autoregressive Transformers can generate raw-waveform speech directly from text using the previously learned discrete speech representations, training on the labeled LJSpeech text-to-speech dataset. Our experiments show promising results, but the model does not generalise to all samples. For further research, we suggest conducting the same experiment with a larger dataset.
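The discrete representations in the last entry come from a vector-quantisation bottleneck. A minimal VQ layer in the spirit of VQ-VAE follows; the codebook size, dimensions, and commitment weight are generic defaults rather than the thesis configuration.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: each encoder vector is snapped to its nearest
    codebook entry, and the straight-through estimator copies gradients past
    the non-differentiable lookup (van den Oord et al., 2017)."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta  # commitment loss weight

    def forward(self, z):                       # z: (batch, frames, dim)
        flat = z.reshape(-1, z.shape[-1])
        # Squared Euclidean distance to every codebook vector.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=-1).reshape(z.shape[:-1])
        q = self.codebook(idx)                  # quantised vectors
        # Codebook loss pulls codes toward encodings; the commitment loss
        # keeps the encoder close to its assigned codes.
        vq_loss = ((q - z.detach()).pow(2).mean()
                   + self.beta * (z - q.detach()).pow(2).mean())
        q = z + (q - z).detach()                # straight-through gradient
        return q, idx, vq_loss

vq = VectorQuantizer()
quantised, codes, loss = vq(torch.randn(2, 100, 64))  # codes: (2, 100)
```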