INVESTIGATING THE CLUSTERS DISCOVERED BY PRE-TRAINED AV-HUBERT
Access rights
openAccess
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
Date
2024
Language
en
Pages
5
Series
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Abstract
Self-supervised models, such as HuBERT and its audio-visual version AV-HuBERT, have demonstrated excellent performance on various tasks. The main factor in their success is the pre-training procedure, which requires only raw data without human transcription. During the self-supervised pre-training phase, HuBERT is trained to discover latent clusters in the training data, but these clusters are discarded, and only the last hidden layer is used by the conventional finetuning step. We investigate what latent information the AV-HuBERT model managed to uncover via its clusters and whether we can use them directly for speech recognition. To achieve this, we treat the sequence of cluster ids as a 'language' developed by AV-HuBERT and attempt to translate it to English text via small LSTM-based models. These translation models enable us to investigate the relations between the clusters and the English alphabet, shedding light on groups of latent clusters specialized to recognise specific phonetic groups. Our results demonstrate that, using the pre-trained system as a quantizer, we are able to compress the video to as low as 275 bit/sec while maintaining acceptable speech recognition accuracy. Furthermore, compared to the conventional finetuning step, our solution has considerably lower computational cost.
Description
Publisher Copyright: © 2024 IEEE.
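To make the bitrate claim in the abstract concrete, the following minimal sketch computes the bitrate of a stream of cluster ids when each id is stored with a fixed-length binary code. The function name, the codebook size of 2000 and the rate of 25 ids per second are illustrative assumptions, not values stated in this record; they happen to give 275 bit/sec, but the abstract does not say how the authors derived that figure.

```python
def cluster_stream_bitrate(num_clusters: int, ids_per_second: float) -> float:
    """Bitrate (bit/sec) of a cluster-id stream with fixed-length codes.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    if ids_per_second <= 0:
        raise ValueError("ids_per_second must be positive")
    # Smallest number of bits that can index ids 0 .. num_clusters - 1.
    bits_per_id = (num_clusters - 1).bit_length()
    return bits_per_id * ids_per_second


# Assumed values (not from the abstract): 2000 clusters, 25 ids per second.
print(cluster_stream_bitrate(2000, 25))  # 11 bits per id * 25 ids/sec = 275
```

A fixed-length code is only an upper bound on what is needed: if some cluster ids occur far more often than others, entropy coding of the id sequence could lower the rate further.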
Keywords
ASR, audiovisual, AV-HuBERT, machine translation, SSL
Citation
Virkkunen, A, Huang, G, Grosz, T & Kurimo, M 2024, Investigating the Clusters Discovered by Pre-trained AV-HuBERT. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, IEEE, pp. 11196-11200, IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Korea, Republic of, 14/04/2024. https://doi.org/10.1109/ICASSP48485.2024.10447434