Automatic classification of vocal intensity categories from amplitude-normalized speech signals by comparing acoustic features and classifier models
Access rights
openAccess
CC BY
Creative Commons license
Except where otherwise noted, this item's license is described as openAccess
publishedVersion
A1 Original article in a scientific journal
This publication is imported from Aalto University research portal.
View publication in the Research portal (opens in new window)
View/Open full text file from the Research portal (opens in new window)
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Language
en
Pages
27
Series
Speech Communication, Volume 174
Abstract
Regulation of vocal intensity is a fundamental phenomenon in speech communication. Speakers use different intensity categories (e.g., soft, normal, and loud voice) to generate different vocal emotions or to communicate in noisy conditions or over varying distances. Vocal intensity categories have been studied in fundamental research on speech, but much less is known about their automatic classification. This study investigates the classification of vocal intensity categories from speech signals in a scenario where the original level information of speech is absent and the signal is presented on a normalized amplitude scale. Different acoustic features were studied together with machine learning (ML) and deep learning (DL) classifiers using two different labeling approaches. Speech signals recorded from 50 speakers reciting sentences in four intensity categories (soft, normal, loud, and very loud) were analyzed. Altogether, 15 feature sets including different cepstral, spectral and handcrafted (eGeMAPS) features were compared. Three ML classifiers (support vector machine, random forest and AdaBoost) and four DL classifiers (deep neural network, convolutional neural network, recurrent neural network and bidirectional long short-term memory network) were compared. The best classification accuracy of 86.0% was obtained by combining the best-performing cepstral and spectral features and using the bidirectional long short-term memory classifier.
Citation
Kodali, M, Ansari, L, Kadiri, S, Narayanan, S & Alku, P 2025, 'Automatic classification of vocal intensity categories from amplitude-normalized speech signals by comparing acoustic features and classifier models', Speech Communication, vol. 174, 103288. https://doi.org/10.1016/j.specom.2025.103288
