Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Kodali, Manila [en_US]
dc.contributor.author: Kadiri, Sudarsana [en_US]
dc.contributor.author: Alku, Paavo [en_US]
dc.contributor.department: Department of Information and Communications Engineering [en]
dc.contributor.groupauthor: Speech Communication Technology [en]
dc.date.accessioned: 2024-12-17T16:21:56Z
dc.date.available: 2024-12-17T16:21:56Z
dc.date.issued: 2024 [en_US]
dc.description.abstract: Speakers regulate vocal intensity on many occasions, for example, to be heard over a long distance or to express vocal emotions. Humans can regulate vocal intensity over a wide sound pressure level (SPL) range, and therefore speech can be categorized into different vocal intensity categories. Recent machine learning experiments have studied classification of vocal intensity category from speech signals that have been recorded without SPL information and that are represented on arbitrary amplitude scales. By fine-tuning four pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, HuBERT, audio speech transformers), this paper studies classification of speech into four intensity categories (soft, normal, loud, very loud) when speech is presented on such an arbitrary amplitude scale. The fine-tuned model embeddings showed absolute improvements of 5% and 10-12% in accuracy compared to baselines for the target intensity category label and the SPL-based intensity category label, respectively. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 5
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Kodali, M, Kadiri, S & Alku, P 2024, Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals. in Interspeech 2024. Interspeech, International Speech Communication Association (ISCA), pp. 482-486, Interspeech, Kos Island, Greece, 01/09/2024. https://doi.org/10.21437/Interspeech.2024-2237 [en]
dc.identifier.doi: 10.21437/Interspeech.2024-2237 [en_US]
dc.identifier.issn: 2958-1796
dc.identifier.other: PURE UUID: 7ef2ef86-a5c3-4e39-a9e2-70fc4969d0f5 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/7ef2ef86-a5c3-4e39-a9e2-70fc4969d0f5 [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85214799393&partnerID=8YFLogxK
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/167202121/kodali24_interspeech.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/132408
dc.identifier.urn: URN:NBN:fi:aalto-202412177885
dc.language.iso: en [en]
dc.relation.ispartof: Interspeech [en]
dc.relation.ispartofseries: Interspeech 2024 [en]
dc.relation.ispartofseries: pp. 482-486 [en]
dc.relation.ispartofseries: Interspeech [en]
dc.rights: openAccess [en]
dc.subject.keyword: speech [en_US]
dc.subject.keyword: audio speech transformers [en_US]
dc.subject.keyword: HuBERT [en_US]
dc.subject.keyword: sound pressure level [en_US]
dc.subject.keyword: Vocal intensity [en_US]
dc.subject.keyword: wav2vec2 [en_US]
dc.title: Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals [en]
dc.type: A4 Artikkeli konferenssijulkaisussa [fi]
dc.type.version: publishedVersion
