Vector quantization in deep neural networks for speech and image processing

School of Electrical Engineering | Doctoral thesis (article-based) | Defence date: 2025-02-07

Date

2025

Language

en

Pages

86 + app. 60

Series

Aalto University publication series Doctoral Theses, 13/2025

Abstract

Vector quantization (VQ) is a classic signal processing technique that models the probability density function of a distribution with a set of representative vectors called the codebook, such that each codebook vector represents a subset of the distribution's samples. Deep neural networks (DNNs) are a branch of machine learning that has gained popularity in recent decades because they can solve complex optimization problems. Since VQ provides an abstract, high-level discrete representation of a distribution, it is widely used in DNN-based applications such as image generation, speech recognition, text-to-speech synthesis, and speech and video coding. Given this broad use, even a small improvement in VQ can yield a substantial performance gain across applications that handle speech, image, video, and text data. This thesis focuses on improving various VQ methods within deep learning frameworks.

First, we propose using vector quantization instead of scalar quantization in a speech coding framework. Experiments show that the decoded speech has higher perceptual quality because VQ accounts for the correlation between the dimensions of the spectral envelopes.

As another contribution, we propose a new solution to the gradient collapse problem, called noise substitution in vector quantization (NSVQ), which models VQ as the addition of a noise vector to the input. Experiments show that NSVQ converges faster, yields more accurate gradients, and has fewer hyperparameters to tune than two state-of-the-art solutions, the straight-through estimator and exponential moving average. We further demonstrate that NSVQ can optimize VQ variants that use multiple codebooks, such as product VQ, residual VQ, and additive VQ. Experiments on speech coding, image compression, and approximate nearest neighbor search show that VQ variants optimized with NSVQ perform comparably to the baselines.

Finally, by incorporating space-filling curves into VQ, we introduce a novel quantization technique called space-filling vector quantization (SFVQ), which quantizes the input on a continuous piecewise linear curve. Because of the inherent order in the SFVQ codebook, adjacent codebook vectors refer to similar content. We use this property to interpret the underlying phonetic structure of the latent space of a voice conversion model, and to interpret the intermediate latent spaces of the StyleGAN2 and BigGAN image generative models. SFVQ gives good control over generation: we find mappings between the latent space and generative factors (e.g., gender and age) and discover interpretable directions for changing image attributes (e.g., smile and pose). In another work, we use SFVQ to cluster speaker embeddings to enhance speaker privacy in DNN-based speech processing tools.

Supervising professor

Bäckström, Tom, Prof., Aalto University, Department of Information and Communications Engineering, Finland

Keywords

vector quantization, deep neural networks, space-filling curves, space-filling vector quantization, gradient collapse, interpretability, speaker anonymization

Parts

  • [Publication 1]: M. H. Vali, T. Bäckström. End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding. In Interspeech 2021, pp. 3355-3359, Brno, Czechia, August-September 2021.
    DOI: 10.21437/Interspeech.2021-867
  • [Publication 2]: M. H. Vali, T. Bäckström. NSVQ: Noise Substitution in Vector Quantization for Machine Learning. IEEE Access, Volume 10, pp. 13598-13610, January 2022.
    DOI: 10.1109/ACCESS.2022.3147670
  • [Publication 3]: M. H. Vali, T. Bäckström. Stochastic Optimization of Vector Quantization Methods in Application to Speech and Image Processing. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023, pp. 1-5, Rhodes, Greece, June 2023.
    DOI: 10.1109/ICASSP49357.2023.10096204
  • [Publication 4]: M. H. Vali, T. Bäckström. Interpretable Latent Space Using Space-Filling Curves for Phonetic Analysis in Voice Conversion. In Interspeech 2023, pp. 306-310, Dublin, Ireland, August 2023.
    DOI: 10.21437/Interspeech.2023-1549
  • [Publication 5]: M. H. Vali, T. Bäckström. Unsupervised Panoptic Interpretation of Latent Spaces in GANs Using Space-Filling Vector Quantization. Submitted to Conference on Computer Vision and Pattern Recognition, 10 pages, November 2024.
    DOI: 10.48550/arXiv.2410.20573
  • [Publication 6]: M. H. Vali, T. Bäckström. Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector Quantization. In Interspeech 2024, pp. 2230-2234, Kos, Greece, September 2024.
    DOI: 10.21437/Interspeech.2024-117
