Classification of vocal intensity category from speech using the wav2vec2 and whisper embeddings

Access rights
openAccess
A4 Article in conference proceedings
Date
2023
Language
en
Pages
5 (pp. 4134-4138)
Series
Proceedings of Interspeech'23, Volume 2023-August, Interspeech
Abstract
In speech communication, talkers regulate their vocal intensity, which results in speech signals of different intensity categories (e.g., soft, loud). Intensity category carries important information about the speaker's health and emotions. However, many speech databases lack calibration information, and therefore sound pressure level cannot be measured from the recorded data. Machine learning can nevertheless be used to classify intensity category even when calibration information is not available. This study investigates pre-trained model embeddings (wav2vec2 and Whisper) in the classification of vocal intensity category (soft, normal, loud, and very loud) from speech signals expressed on arbitrary amplitude scales. We use a new database consisting of two speaking tasks (sentence and paragraph). A support vector machine is used as the classifier. Our results show that the pre-trained model embeddings outperformed three baseline features, providing improvements of up to 7% (absolute) in accuracy.
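The calibration problem described above can be illustrated with a short Python sketch. It is not taken from the paper: the synthetic 200 Hz tone is a hypothetical stand-in for speech, and the helper names (`rms_dbfs`, `apply_gain`, `rms_normalize`) and the -26 dBFS normalization target are arbitrary choices for illustration. The sketch shows that the digital level of a recording shifts by 20*log10(g) dB under an unknown recording gain g, and that after level normalization two recordings made with the same vocal effort but different gains become identical.

```python
import math


def rms_dbfs(signal):
    """RMS level in dB relative to digital full scale (amplitude 1.0).

    This is only a relative level: without a calibration reference it
    cannot be converted to sound pressure level (dB SPL).
    """
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    return 20.0 * math.log10(rms)


def apply_gain(signal, gain):
    """Scale every sample by a constant, as an unknown recording gain would."""
    return [gain * x for x in signal]


def rms_normalize(signal, target_dbfs=-26.0):
    """Rescale a signal so that its RMS level equals target_dbfs (arbitrary target)."""
    gain = 10.0 ** ((target_dbfs - rms_dbfs(signal)) / 20.0)
    return apply_gain(signal, gain)


# Hypothetical stand-in for speech: 0.1 s of a 200 Hz tone sampled at 16 kHz.
fs = 16000
tone = [0.1 * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs // 10)]

# The same "utterance" captured with two different, unknown gains.
quiet_mic = apply_gain(tone, 0.5)
hot_mic = apply_gain(tone, 4.0)

# Measured levels differ by 20*log10(4.0 / 0.5), about 18.06 dB,
# even though the underlying vocal effort is identical.
level_gap = rms_dbfs(hot_mic) - rms_dbfs(quiet_mic)

# After normalization the two recordings no longer differ in level,
# so absolute level carries no information about the intensity category.
norm_quiet = rms_normalize(quiet_mic)
norm_hot = rms_normalize(hot_mic)
max_diff = max(abs(a - b) for a, b in zip(norm_quiet, norm_hot))

print(f"level gap: {level_gap:.2f} dB, max sample difference: {max_diff:.2e}")
```

On signals with arbitrary amplitude scales, any cue to the intensity category therefore has to come from properties of the signal other than its absolute level, which is what a classifier trained on embeddings must learn to exploit.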
Citation
Kodali, M, Kadiri, S & Alku, P 2023, Classification of vocal intensity category from speech using the wav2vec2 and whisper embeddings. in Proceedings of Interspeech'23. vol. 2023-August, Interspeech, International Speech Communication Association (ISCA), pp. 4134-4138, Interspeech, Dublin, Ireland, 20/08/2023. https://doi.org/10.21437/Interspeech.2023-2038