Browsing by Author "Vali, Mohammadhassan"
Now showing 1 - 7 of 7
- End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding
A4 Conference paper (2021-09). Vali, Mohammadhassan; Bäckström, Tom.
Spectral envelope modeling is an instrumental part of speech and audio codecs, which can be used to enable efficient entropy coding of spectral components. Overall optimization of codecs, including envelope models, has however been difficult due to the complicated interactions between different modules of the codec. In this paper, we study an end-to-end optimization methodology that optimizes all modules of a codec with respect to each other, capturing their complex interactions with a global loss function. For the quantization of the spectral envelope parameters at a fixed bitrate, we use multistage vector quantization, which gives high quality yet has a computational complexity low enough for realistic deployment on embedded devices. The obtained results demonstrate benefits in terms of PESQ and PSNR in comparison to the 3GPP EVS as well as our recently proposed PyAWNeS codecs.

- Interpretable Latent Space Using Space-Filling Curves for Phonetic Analysis in Voice Conversion
A4 Conference paper (2023). Vali, Mohammadhassan; Bäckström, Tom.
Vector quantized variational autoencoders (VQ-VAE) are well-known deep generative models that map input data to a latent space used for data generation. Such latent spaces are unstructured and can thus be difficult to interpret. Some earlier approaches have introduced structure to the latent space through supervised learning by defining data labels as latent variables. In contrast, we propose an unsupervised technique that incorporates space-filling curves into vector quantization (VQ), yielding an arranged form of latent vectors such that adjacent elements in the VQ codebook refer to similar content. We applied this technique to the latent codebook vectors of a VQ-VAE, which encode the phonetic information of a speech signal in a voice conversion task. Our experiments show a clear arrangement in the latent vectors representing speech phones, which clarifies which phone each latent vector corresponds to and facilitates other detailed interpretations of the latent vectors.

- Low-complexity Real-time Neural Network for Blind Bandwidth Extension of Wideband Speech
A4 Conference paper (2023-09-04). Gómez Mellado, Esteban; Vali, Mohammadhassan; Bäckström, Tom.
Speech is streamed at 16 kHz or lower sample rates in many applications (e.g. VoIP, Bluetooth headsets). Extending its bandwidth can produce significant quality improvements. We introduce BBWEXNet, a lightweight neural network that performs blind bandwidth extension of speech from 16 kHz (wideband) to 48 kHz (fullband) in real time on a CPU. Our low-latency approach runs the model with a maximum algorithmic delay of 16 ms, enabling end-to-end communication in streaming services and in scenarios where the GPU is busy or unavailable. We propose a series of optimizations that take advantage of the U-Net architecture and of vector quantization methods commonly used in speech coding, producing a model whose performance is comparable to previous real-time solutions while approximately halving the memory footprint and computational cost. Moreover, we show that the model complexity can be further reduced with a marginal impact on the perceived output quality.

- NSVQ: Noise Substitution in Vector Quantization for Machine Learning
A1 Original journal article (2022). Vali, Mohammadhassan; Bäckström, Tom.
Machine learning algorithms have been shown to be highly effective in solving optimization problems in a wide range of applications. Such algorithms typically use gradient descent with backpropagation and the chain rule. Hence, backpropagation fails if intermediate gradients are zero for some functions in the computational graph, because multiplication by zero collapses the gradients. Vector quantization is one such challenging function for machine learning algorithms, since it is a piecewise constant function whose gradient is zero almost everywhere. A typical solution is the straight-through estimator, which simply copies the gradients over the vector quantization function in backpropagation. Other solutions are based on smooth or stochastic approximation. This study proposes a vector quantization technique called NSVQ, which approximates the behavior of vector quantization by substituting multiplicative noise, so that it can be used in machine learning problems. Specifically, the vector quantization error is replaced by the product of the original error and a normalized noise vector whose samples are drawn from a zero-mean, unit-variance normal distribution. We test the proposed NSVQ in three scenarios with various types of applications. Based on the experiments, the proposed NSVQ achieves higher accuracy and faster convergence than the straight-through estimator, exponential moving averages, and the MiniBatchKmeans approaches.

- Privacy and Quality Improvements in Open Offices Using Multi-Device Speech Enhancement
A4 Conference paper (2023-08-19). Rech, Silas; Vali, Mohammadhassan; Bäckström, Tom.
Teleconferencing has increased in popularity and often takes place around other people, such as in open offices. A particular problem of such environments is that multiple users can have independent conversations simultaneously, which leak into each other's devices. This poses problems of both privacy and quality. In this work, we introduce a multi-device, targeted speech separation network. We call this network IsoNet, as it isolates the dominant speech in a mixture of multiple speakers by generating a mask from the interfering speakers. This mask is used to remove speech from other simultaneous conversations in the enhanced speech signal. The privacy improvement is measured by mutual information, and the enhancement quality is evaluated with a MUSHRA test, PESQ, and SI-SNR. Our experiments show a statistically significant improvement with IsoNet from 27 to 75 in MUSHRA score and a 60% decrease in mutual information. IsoNet improves privacy, as sensitive speech content is effectively attenuated.

- Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector Quantization
A4 Conference paper (2024-09). Vali, Mohammadhassan; Bäckström, Tom.
Speech signals contain a vast range of private information, such as the spoken text, speaker identity, emotions, and state of health. Privacy-preserving speech processing seeks to filter out any private information that is not needed for downstream tasks, for example with an information bottleneck tight enough that only the desired information can pass through. We however demonstrate that codebook elements in bottlenecks based on vector quantization occur with uneven frequency and thus carry an uneven information rate, threatening privacy. We therefore propose to use space-filling vector quantization (SFVQ) together with occurrence normalization, which balances the information rate and thus protects privacy. Our experiments with speaker identification validate the proposed method. This approach provides a generic tool for quantizing information bottlenecks in any speech application such that their privacy disclosure is predictable and quantifiable.

- Stochastic Optimization of Vector Quantization Methods in Application to Speech and Image Processing
A4 Conference paper (2023). Vali, Mohammadhassan; Bäckström, Tom.
Vector quantization (VQ) methods have been used in a wide range of applications for speech, image, and video data. While classic VQ methods often use expectation maximization, in this paper we investigate stochastic optimization employing our recently proposed noise substitution in vector quantization technique. We consider three variants of VQ, namely additive VQ, residual VQ, and product VQ, and evaluate their quality, complexity, and bitrate in speech coding, image compression, approximate nearest neighbor search, and a selection of toy examples. Our experimental results demonstrate the trade-offs between accuracy, complexity, and bitrate, so that, using our open-source implementations and complexity calculator, the best vector quantization method can be chosen for a particular problem.
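As a rough illustration of the ideas recurring in the abstracts above, the following is a minimal pure-Python sketch of residual (multistage) vector quantization together with the NSVQ-style noise substitution (replacing the quantization error by a random vector of equal norm). The codebooks and input here are toy values, and the helper names `nearest`, `residual_vq`, and `nsvq` are hypothetical, not the papers' implementations.

```python
import math
import random

def nearest(codebook, x):
    """Return the codeword in `codebook` closest to `x` (squared Euclidean)."""
    return min(codebook, key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def residual_vq(x, codebooks):
    """Residual (multistage) VQ: each stage quantizes the residual left over
    by the previous stages; the sum of selected codewords reconstructs x."""
    residual = list(x)
    reconstruction = [0.0] * len(x)
    for codebook in codebooks:
        q = nearest(codebook, residual)
        reconstruction = [ri + qi for ri, qi in zip(reconstruction, q)]
        residual = [ri - qi for ri, qi in zip(residual, q)]
    return reconstruction

def nsvq(x, x_hat, rng):
    """NSVQ-style noise substitution (sketch): replace the quantization error
    x_hat - x by a random Gaussian vector scaled to the same norm, so that
    the operation stays differentiable with respect to x during training."""
    err_norm = math.sqrt(sum((hi - xi) ** 2 for xi, hi in zip(x, x_hat)))
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    noise_norm = math.sqrt(sum(n * n for n in noise)) or 1.0
    return [xi + err_norm * n / noise_norm for xi, n in zip(x, noise)]

# Toy two-stage codebooks (hypothetical values, not trained codebooks).
stage1 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]                    # coarse stage
stage2 = [[-0.25, 0.0], [0.0, 0.25], [0.25, -0.25], [0.0, 0.0]]  # refines the residual

x = [1.2, 0.8]
x_hat = residual_vq(x, [stage1, stage2])  # → [1.25, 0.75]
```

The second stage shrinks the reconstruction error left by the first, which is the trade-off between quality and codebook size that multistage VQ exploits; the noise-substituted output of `nsvq` has the same error magnitude as the true quantization but admits useful gradients.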