From Raw Speech to Fixed Representations: A Comprehensive Evaluation of Speech Embedding Techniques
Access rights
openAccess
publishedVersion
A1 Original article in a scientific journal
This publication is imported from Aalto University research portal.
Authors
Porjazovski, Dejan
Grosz, Tamas
Kurimo, Mikko
Date
2024
Language
en
Pages
15
Series
IEEE/ACM Transactions on Audio Speech and Language Processing, Volume 32, pp. 3546-3560
Abstract
Speech embeddings, fixed-size representations derived from raw audio data, play a crucial role in diverse machine learning applications. Despite the abundance of speech embedding techniques, selecting the most suitable one remains challenging. Existing studies often focus on intrinsic or extrinsic aspects, seldom exploring both simultaneously. Furthermore, comparing the state-of-the-art pre-trained models with prior speech embedding solutions is notably scarce in the literature. To address these gaps, we undertake a comprehensive evaluation of both small and large-scale speech embedding models, which, in our opinion, needs to incorporate both intrinsic and extrinsic assessments. The intrinsic experiments delve into the models' ability to pick speaker-related characteristics and assess their discriminative capacities, providing insights into their inherent capabilities and internal workings. Concurrently, the extrinsic experiments evaluate whether the models learned semantic cues during pre-training. The findings underscore the superior performance of the large-scale pre-trained models, albeit at an elevated computational cost. The base self-supervised models show comparable results to their large counterparts, making them a better choice for many applications. Furthermore, we show that by selecting the most crucial dimensions, the models' performance often does not suffer drastically and even improves in some cases. This research contributes valuable insights into the nuanced landscape of speech embeddings, aiding researchers and practitioners in making informed choices for various applications.
Description
Publisher Copyright: Authors
Keywords
Computational modeling, Data models, dimension contribution, extrinsic evaluation, Feature extraction, intrinsic evaluation, Speech embeddings, Speech processing, Task analysis, Training, Transformers
Citation
Porjazovski, D, Grosz, T & Kurimo, M 2024, 'From Raw Speech to Fixed Representations: A Comprehensive Evaluation of Speech Embedding Techniques', IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 32, pp. 3546-3560. https://doi.org/10.1109/TASLP.2024.3426301