Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals
dc.contributor | Aalto-yliopisto | fi |
dc.contributor | Aalto University | en |
dc.contributor.author | Kodali, Manila | en_US |
dc.contributor.author | Kadiri, Sudarsana | en_US |
dc.contributor.author | Alku, Paavo | en_US |
dc.contributor.department | Department of Information and Communications Engineering | en |
dc.contributor.groupauthor | Speech Communication Technology | en |
dc.date.accessioned | 2024-12-17T16:21:56Z | |
dc.date.available | 2024-12-17T16:21:56Z | |
dc.date.issued | 2024 | en_US |
dc.description.abstract | Speakers regulate vocal intensity on many occasions, for example to be heard over a long distance or to express vocal emotions. Humans can regulate vocal intensity over a wide sound pressure level (SPL) range, and therefore speech can be categorized into different vocal intensity categories. Recent machine learning experiments have studied classification of vocal intensity category from speech signals that have been recorded without SPL information and that are represented on arbitrary amplitude scales. By fine-tuning four pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, HuBERT, audio speech transformers), this paper studies classification of speech into four intensity categories (soft, normal, loud, very loud) when speech is presented on such an arbitrary amplitude scale. The fine-tuned model embeddings showed absolute improvements of 5% and 10-12% in accuracy compared to baselines for the target intensity category label and the SPL-based intensity category label, respectively. | en |
dc.description.version | Peer reviewed | en |
dc.format.extent | 5 | |
dc.format.mimetype | application/pdf | en_US |
dc.identifier.citation | Kodali, M, Kadiri, S & Alku, P 2024, Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals. in Interspeech 2024. Interspeech, International Speech Communication Association (ISCA), pp. 482-486, Interspeech, Kos Island, Greece, 01/09/2024. https://doi.org/10.21437/Interspeech.2024-2237 | en |
dc.identifier.doi | 10.21437/Interspeech.2024-2237 | en_US |
dc.identifier.issn | 2958-1796 | |
dc.identifier.other | PURE UUID: 7ef2ef86-a5c3-4e39-a9e2-70fc4969d0f5 | en_US |
dc.identifier.other | PURE ITEMURL: https://research.aalto.fi/en/publications/7ef2ef86-a5c3-4e39-a9e2-70fc4969d0f5 | en_US |
dc.identifier.other | PURE LINK: http://www.scopus.com/inward/record.url?scp=85214799393&partnerID=8YFLogxK | |
dc.identifier.other | PURE FILEURL: https://research.aalto.fi/files/167202121/kodali24_interspeech.pdf | en_US |
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/132408 | |
dc.identifier.urn | URN:NBN:fi:aalto-202412177885 | |
dc.language.iso | en | en |
dc.relation.ispartof | Interspeech | en |
dc.relation.ispartofseries | Interspeech 2024 | en |
dc.relation.ispartofseries | pp. 482-486 | en |
dc.relation.ispartofseries | Interspeech | en |
dc.rights | openAccess | en |
dc.subject.keyword | speech | en_US |
dc.subject.keyword | audio speech transformers | en_US |
dc.subject.keyword | HuBERT | en_US |
dc.subject.keyword | sound pressure level | en_US |
dc.subject.keyword | Vocal intensity | en_US |
dc.subject.keyword | wav2vec2 | en_US |
dc.title | Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals | en |
dc.type | A4 Article in conference proceedings | en |
dc.type.version | publishedVersion | |