Vector quantization in deep neural networks for speech and image processing
School of Electrical Engineering | Doctoral thesis (article-based) | Defence date: 2025-02-07
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Authors
Vali, M. H.
Date
2025
Language
en
Pages
86 + app. 60
Series
Aalto University publication series Doctoral Theses, 13/2025
Abstract
Vector quantization (VQ) is a classic signal processing technique that models the probability density function of a distribution using a set of representative vectors called the codebook, such that each codebook vector represents a subset of the distribution's samples. Deep neural networks (DNNs) are a branch of machine learning that has gained popularity in recent decades for its ability to solve complicated optimization problems. Since VQ provides an abstract, high-level discrete representation of a distribution, it is widely used as a beneficial tool in many DNN-based applications, such as image generation, speech recognition, text-to-speech synthesis, and speech and video coding. Given VQ's broad utilization in DNN-based applications, even a small improvement in VQ can yield a substantial performance boost across applications on devices handling different data types such as speech, image, video, and text. This thesis focuses on improving various VQ methods within deep learning frameworks.

We first propose using vector quantization instead of scalar quantization in a speech coding framework. Experiments show that the decoded speech has higher perceptual quality because VQ accounts for the correlation between different dimensions of the spectral envelopes. As another contribution, we propose a new solution to the gradient collapse problem, called noise substitution in vector quantization (NSVQ), in which we model VQ as the addition of a noise vector to the input. Experiments show that NSVQ provides faster convergence, more accurate gradients, and fewer hyperparameters to tune than two state-of-the-art solutions, the straight-through estimator and exponential moving average. We further demonstrate that NSVQ can also optimize VQ variants that use multiple codebooks, e.g., product VQ, residual VQ, and additive VQ. Experimental results in speech coding, image compression, and approximate nearest neighbor search scenarios show that VQ variants optimized with NSVQ perform comparably to the baselines.

By incorporating space-filling curves into VQ, we introduce a novel quantization technique called space-filling vector quantization (SFVQ), which quantizes the input on a continuous piecewise linear curve. Because of the inherent order in the SFVQ codebook, adjacent codebook vectors refer to similar content. We use this property to interpret the underlying phonetic structure of the latent space of a voice conversion model. Moreover, we use SFVQ to interpret the intermediate latent spaces of the StyleGAN2 and BigGAN image generative models. SFVQ gives good control over generation: we find the mapping between the latent space and generative factors (e.g., gender, age), and we discover interpretable directions for changing an image's attributes (e.g., smile, pose). In another work, we use SFVQ to cluster speaker embeddings to enhance speaker privacy in DNN-based speech processing tools.
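To make the NSVQ idea above concrete, the following is a minimal PyTorch sketch of the simulated quantization step. The function and variable names are our own, and the thesis's reference implementation (Publication 2) may differ in detail:

```python
import torch

def nsvq(x, codebook):
    """Noise substitution in vector quantization (NSVQ), sketched.

    x:        (batch, dim) input vectors
    codebook: (K, dim) learnable codebook (e.g., an nn.Parameter)
    """
    # Hard nearest-neighbour quantization; argmin alone breaks the gradient path.
    dists = torch.cdist(x, codebook)         # (batch, K) pairwise distances
    x_q = codebook[dists.argmin(dim=1)]      # nearest codeword per input

    # Replace the quantization error by random noise of the same norm.
    err_norm = (x_q - x).norm(dim=1, keepdim=True)
    noise = torch.randn_like(x)
    noise = noise / noise.norm(dim=1, keepdim=True)

    # Simulated quantization: differentiable w.r.t. both x and the
    # codebook, since err_norm depends on both. At inference time,
    # plain hard quantization (x_q) is used instead.
    return x + err_norm * noise
```

Because the substituted noise carries the norm of the true quantization error, gradients reach both the encoder and the codebook without a straight-through approximation.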
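The multiple-codebook variants mentioned above share a common pattern. As one example, residual (multi-stage) VQ can be sketched as follows, again as an assumed plain nearest-neighbour setup rather than the thesis code:

```python
import torch

def residual_vq(x, codebooks):
    """Residual VQ: each stage quantizes what earlier stages left over,
    and the per-stage reconstructions are summed.

    x:         (batch, dim) input vectors
    codebooks: list of (K, dim) tensors, one per stage
    """
    residual = x
    x_q = torch.zeros_like(x)
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)
        stage_q = cb[idx]
        x_q = x_q + stage_q            # accumulate the reconstruction
        residual = residual - stage_q  # pass the leftover to the next stage
    return x_q
```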
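The SFVQ quantization step can be pictured as projecting each input onto the piecewise linear curve that connects consecutive codebook vectors in their learned order. The sketch below is our own hypothetical illustration of that geometric step, not the published implementation:

```python
import torch

def sfvq_project(x, codebook):
    """Project inputs onto the piecewise linear curve through the
    ordered codebook rows (illustrative only).

    x:        (batch, dim) input vectors
    codebook: (K, dim) codebook, rows ordered along the curve
    """
    a, b = codebook[:-1], codebook[1:]      # segment endpoints, (K-1, dim)
    d = b - a
    # Orthogonal projection parameter of each x on each segment,
    # clamped so the projection stays on the segment.
    t = ((x[:, None, :] - a[None]) * d[None]).sum(-1)
    t = (t / ((d * d).sum(-1) + 1e-12)).clamp(0.0, 1.0)   # (batch, K-1)
    proj = a[None] + t[..., None] * d[None]               # (batch, K-1, dim)
    # Keep the projection on the nearest segment per input.
    best = (x[:, None, :] - proj).pow(2).sum(-1).argmin(dim=1)
    return proj[torch.arange(x.shape[0]), best]
```

Because neighbouring segments share endpoints, nearby inputs land at nearby points on the curve, which is the ordering property the abstract exploits for interpretability.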
Supervising professor
Bäckström, Tom, Prof., Aalto University, Department of Information and Communications Engineering, Finland
Keywords
vector quantization, deep neural networks, space-filling curves, space-filling vector quantization, gradient collapse, interpretability, speaker anonymization
Other note
Parts
- [Publication 1]: M. H. Vali, T. Bäckström. End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding. In Interspeech 2021, pp. 3355-3359, Brno, Czechia, August-September 2021.
  Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202109299353
  DOI: 10.21437/Interspeech.2021-867
- [Publication 2]: M. H. Vali, T. Bäckström. NSVQ: Noise Substitution in Vector Quantization for Machine Learning. IEEE Access, Volume 10, pp. 13598-13610, January 2022.
  Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202202161954
  DOI: 10.1109/ACCESS.2022.3147670
- [Publication 3]: M. H. Vali, T. Bäckström. Stochastic Optimization of Vector Quantization Methods in Application to Speech and Image Processing. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023, pp. 1-5, Rhodes, Greece, June 2023.
  Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202306073674
  DOI: 10.1109/ICASSP49357.2023.10096204
- [Publication 4]: M. H. Vali, T. Bäckström. Interpretable Latent Space Using Space-Filling Curves for Phonetic Analysis in Voice Conversion. In Interspeech 2023, pp. 306-310, Dublin, Ireland, August 2023.
  Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202310046155
  DOI: 10.21437/Interspeech.2023-1549
- [Publication 5]: M. H. Vali, T. Bäckström. Unsupervised Panoptic Interpretation of Latent Spaces in GANs Using Space-Filling Vector Quantization. Submitted to Conference on Computer Vision and Pattern Recognition, 10 pages, November 2024.
  DOI: 10.48550/arXiv.2410.20573
- [Publication 6]: M. H. Vali, T. Bäckström. Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector Quantization. In Interspeech 2024, pp. 2230-2234, Kos, Greece, September 2024.
  Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202409186426
  DOI: 10.21437/Interspeech.2024-117