From Raw Speech to Fixed Representations: A Comprehensive Evaluation of Speech Embedding Techniques

dc.contributor | Aalto-yliopisto | fi
dc.contributor | Aalto University | en
dc.contributor.author | Porjazovski, Dejan | en_US
dc.contributor.author | Grosz, Tamas | en_US
dc.contributor.author | Kurimo, Mikko | en_US
dc.contributor.department | Department of Information and Communications Engineering | en
dc.contributor.groupauthor | Speech Recognition | en
dc.date.accessioned | 2024-08-06T07:46:01Z
dc.date.available | 2024-08-06T07:46:01Z
dc.date.issued | 2024 | en_US
dc.description | Publisher Copyright: Authors
dc.description.abstract | Speech embeddings, fixed-size representations derived from raw audio data, play a crucial role in diverse machine learning applications. Despite the abundance of speech embedding techniques, selecting the most suitable one remains challenging. Existing studies often focus on intrinsic or extrinsic aspects, seldom exploring both simultaneously. Furthermore, comparing the state-of-the-art pre-trained models with prior speech embedding solutions is notably scarce in the literature. To address these gaps, we undertake a comprehensive evaluation of both small and large-scale speech embedding models, which, in our opinion, needs to incorporate both intrinsic and extrinsic assessments. The intrinsic experiments delve into the models' ability to pick speaker-related characteristics and assess their discriminative capacities, providing insights into their inherent capabilities and internal workings. Concurrently, the extrinsic experiments evaluate whether the models learned semantic cues during pre-training. The findings underscore the superior performance of the large-scale pre-trained models, albeit at an elevated computational cost. The base self-supervised models show comparable results to their large counterparts, making them a better choice for many applications. Furthermore, we show that by selecting the most crucial dimensions, the models' performance often does not suffer drastically and even improves in some cases. This research contributes valuable insights into the nuanced landscape of speech embeddings, aiding researchers and practitioners in making informed choices for various applications. | en
dc.description.version | Peer reviewed | en
dc.format.extent | 15
dc.format.mimetype | application/pdf | en_US
dc.identifier.citation | Porjazovski, D, Grosz, T & Kurimo, M 2024, 'From Raw Speech to Fixed Representations: A Comprehensive Evaluation of Speech Embedding Techniques', IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 32, pp. 3546-3560. https://doi.org/10.1109/TASLP.2024.3426301 | en
dc.identifier.doi | 10.1109/TASLP.2024.3426301 | en_US
dc.identifier.issn | 2329-9290
dc.identifier.other | PURE UUID: 77ebb9d7-c792-43c2-90eb-9fd413307172 | en_US
dc.identifier.other | PURE ITEMURL: https://research.aalto.fi/en/publications/77ebb9d7-c792-43c2-90eb-9fd413307172 | en_US
dc.identifier.other | PURE LINK: http://www.scopus.com/inward/record.url?scp=85198358967&partnerID=8YFLogxK
dc.identifier.other | PURE FILEURL: https://research.aalto.fi/files/153176742/From_Raw_Speech_to_Fixed_Representations_A_Comprehensive_Evaluation_of_Speech_Embedding_Techniques.pdf | en_US
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/129682
dc.identifier.urn | URN:NBN:fi:aalto-202408065255
dc.language.iso | en | en
dc.publisher | IEEE
dc.relation.ispartofseries | IEEE/ACM Transactions on Audio Speech and Language Processing | en
dc.relation.ispartofseries | Volume 32, pp. 3546-3560 | en
dc.rights | openAccess | en
dc.subject.keyword | Computational modeling | en_US
dc.subject.keyword | Data models | en_US
dc.subject.keyword | dimension contribution | en_US
dc.subject.keyword | extrinsic evaluation | en_US
dc.subject.keyword | Feature extraction | en_US
dc.subject.keyword | intrinsic evaluation | en_US
dc.subject.keyword | Speech embeddings | en_US
dc.subject.keyword | Speech processing | en_US
dc.subject.keyword | Task analysis | en_US
dc.subject.keyword | Training | en_US
dc.subject.keyword | Transformers | en_US
dc.title | From Raw Speech to Fixed Representations: A Comprehensive Evaluation of Speech Embedding Techniques | en
dc.type | A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) | fi
dc.type.version | publishedVersion
