Loudspeaker Modelling with Recurrent Neural Networks

School of Electrical Engineering | Master's thesis
Acoustics and Audio Technology
Degree programme
CCIS - Master’s Programme in Computer, Communication and Information Sciences (TS2013)
Digital twins of loudspeakers are useful assets for fine-tuning during the design and manufacturing phases. They can serve as an alternative to real-time measurement for the objective evaluation of adjustments made by digital signal processing. Binaural loudspeaker models could introduce a more repeatable framework for subjective listening and provide flexibility for remote work by reducing the need for physical devices. Neural networks are a well-proven tool for system identification of audio hardware devices. This thesis focuses on creating a digital twin of a multimedia stereo loudspeaker system, using a stereo audio waveform as the input and a binaural recording of the system's playback as the target waveform for Recurrent Neural Network (RNN) training. The RNN architecture is inspired by the current state-of-the-art method for single-channel audio effects modelling and is adapted to the stereo waveform use case. First, the RNN model is tested with different synthesized target data that simulate the real recorded data. This approach makes it possible to estimate which properties are the most challenging for the RNN to learn. Second, the experiments are run with a real recorded, time-aligned dataset, and the RNN's performance is objectively evaluated by the Error-to-Signal Ratio (ESR). In the current state-of-the-art method for single-channel audio modelling, the initial hidden state of the RNN is computed by no-gradient startup inference, which accumulates the hidden state over the first few hundred samples of the training sequence. The thesis proposes a new method called Discontinuous Sequence Training (DISCO). The method prepares the training dataset according to the RNN architecture's sequence-length hyper-parameter and the system's impulse response length, so that the initial hidden state can be initialized correctly without additional pre-training inference.
DISCO matches the training and inference precision of the hidden-state initialization used in the current state-of-the-art method for black-box modelling with RNNs, and it does so solely by modifying the dataset.
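The abstract names the Error-to-Signal Ratio as the objective metric but does not define it. As a minimal sketch, assuming the common definition (energy of the prediction error divided by the energy of the target) and evaluating one channel at a time, it could be computed as follows. The function name `error_to_signal_ratio` and the `eps` guard are illustrative choices, not taken from the thesis, and any pre-filtering the thesis may apply before computing the metric is not modelled here.

```python
def error_to_signal_ratio(target, prediction, eps=1e-12):
    """Return sum((y - y_hat)^2) / sum(y^2) for one audio channel.

    Assumed definition of ESR; `eps` (hypothetical) avoids division by
    zero when the target is silent.
    """
    if len(target) != len(prediction):
        raise ValueError("target and prediction must have the same length")
    # Energy of the sample-wise error between target and prediction.
    error_energy = sum((t - p) ** 2 for t, p in zip(target, prediction))
    # Energy of the target signal, used for normalization.
    signal_energy = sum(t * t for t in target)
    return error_energy / (signal_energy + eps)


# For a binaural target, one option is to evaluate each ear separately.
left_esr = error_to_signal_ratio([0.5, -0.5, 0.25], [0.5, -0.5, 0.25])
right_esr = error_to_signal_ratio([1.0, -1.0], [0.0, 0.0])
```

Under this definition a perfect prediction gives an ESR of 0, and a model that outputs silence gives an ESR close to 1, so lower values indicate a better match to the recorded target.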
Schlecht, Sebastian Jiro
Thesis advisor
Schlecht, Sebastian Jiro
loudspeaker modelling, digital twin, system identification, deep learning, stereo modelling, DISCO sequence training