AVID: A speech database for machine learning studies on vocal intensity

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Alku, Paavo [en_US]
dc.contributor.author: Kodali, Manila [en_US]
dc.contributor.author: Laaksonen, Laura [en_US]
dc.contributor.author: Kadiri, Sudarsana [en_US]
dc.contributor.department: Department of Information and Communications Engineering [en]
dc.contributor.groupauthor: Speech Communication Technology [en]
dc.contributor.organization: Huawei Technologies [en_US]
dc.date.accessioned: 2024-03-06T10:30:48Z
dc.date.available: 2024-03-06T10:30:48Z
dc.date.issued: 2024-02 [en_US]
dc.description.abstract: Vocal intensity, which is quantified typically with the sound pressure level (SPL), is a key feature of speech. To measure SPL from speech recordings, a standard calibration tone (with a reference SPL of 94 dB or 114 dB) needs to be recorded together with speech. However, most of the popular databases that are used in areas such as speech and speaker recognition have been recorded without calibration information by expressing speech on arbitrary amplitude scales. Therefore, information about vocal intensity of the recorded speech, including SPL, is lost. In the current study, we introduce a new open and calibrated speech/electroglottography (EGG) database named Aalto Vocal Intensity Database (AVID). AVID includes speech and EGG produced by 50 speakers (25 males, 25 females) who varied their vocal intensity in four categories (soft, normal, loud and very loud). Recordings were conducted using a constant mouth-to-microphone distance and by recording a calibration tone. The speech data was labelled sentence-wise using a total of 19 labels that support the utilisation of the data in machine learning (ML)-based studies of vocal intensity based on supervised learning. In order to demonstrate how the AVID data can be used to study vocal intensity, we investigated one multi-class classification task (classification of speech into soft, normal, loud and very loud intensity classes) and one regression task (prediction of SPL of speech). In both tasks, we deliberately warped the level of the input speech by normalising the signal to have its maximum amplitude equal to 1.0, that is, we simulated a scenario that is prevalent in current speech databases. The results show that using the spectrogram feature with the support vector machine classifier gave an accuracy of 82% in the multi-class classification of the vocal intensity category. In the prediction of SPL, using the spectrogram feature with the support vector regressor gave a mean absolute error of about 2 dB and a coefficient of determination of 92%. We welcome researchers interested in classification and regression problems to utilise AVID in the study of vocal intensity, and we hope that the current results could serve as baselines for future ML studies on the topic. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 11
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Alku, P, Kodali, M, Laaksonen, L & Kadiri, S 2024, 'AVID: A speech database for machine learning studies on vocal intensity', Speech Communication, vol. 157, 103039. https://doi.org/10.1016/j.specom.2024.103039 [en]
dc.identifier.doi: 10.1016/j.specom.2024.103039 [en_US]
dc.identifier.issn: 0167-6393
dc.identifier.issn: 1872-7182
dc.identifier.other: PURE UUID: 1cf61ba7-93da-4051-9ea0-dd9f34a28f43 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/1cf61ba7-93da-4051-9ea0-dd9f34a28f43 [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/140113248/1-s2.0-S0167639324000116-main.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/126876
dc.identifier.urn: URN:NBN:fi:aalto-202403062511
dc.language.iso: en [en]
dc.publisher: Elsevier
dc.relation.ispartofseries: Speech Communication [en]
dc.relation.ispartofseries: Volume 157 [en]
dc.rights: openAccess [en]
dc.subject.keyword: Vocal intensity [en_US]
dc.subject.keyword: convolutional neural network [en_US]
dc.subject.keyword: machine learning [en_US]
dc.subject.keyword: sound pressure level [en_US]
dc.subject.keyword: speech database [en_US]
dc.subject.keyword: support vector machine [en_US]
dc.title: AVID: A speech database for machine learning studies on vocal intensity [en]
dc.type: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) [fi]
dc.type.version: publishedVersion
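
The abstract's point about calibration can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not use AVID data: the function names (`rms`, `spl_db`, `peak_normalise`, `sine`) are hypothetical, and the signals are synthetic sine waves standing in for a calibration tone and for soft and loud speech. It assumes the tone and the signal were captured with the same gain and mouth-to-microphone distance, and that the tone's SPL is known to be 94 dB.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def spl_db(signal, calibration_tone, reference_spl_db=94.0):
    """Estimate the SPL of `signal` in dB from a calibration tone of known SPL.

    Assumption: both recordings share the same gain and microphone distance,
    so their RMS ratio reflects the ratio of sound pressures.
    """
    return reference_spl_db + 20.0 * math.log10(rms(signal) / rms(calibration_tone))


def peak_normalise(signal):
    """Scale the signal so that its maximum absolute amplitude equals 1.0."""
    peak = max(abs(x) for x in signal)
    return [x / peak for x in signal]


def sine(amplitude, freq_hz, sample_rate=16000, duration_s=0.1):
    """Synthetic sine wave used here as a stand-in for recorded audio."""
    n = int(sample_rate * duration_s)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


# Synthetic stand-ins (illustrative values, not measurements).
tone = sine(0.5, freq_hz=1000.0)   # pretend this tone was played at 94 dB SPL
soft = sine(0.05, freq_hz=200.0)   # ten times smaller amplitude than the tone
loud = sine(0.5, freq_hz=200.0)    # same amplitude as the tone

soft_spl = spl_db(soft, tone)      # about 74 dB: 20 dB below the reference
loud_spl = spl_db(loud, tone)      # about 94 dB: equal to the reference

# After peak normalisation both signals have the same amplitude, so the
# 20 dB level difference between them can no longer be recovered.
soft_norm_spl = spl_db(peak_normalise(soft), tone)
loud_norm_spl = spl_db(peak_normalise(loud), tone)

print(round(soft_spl, 2), round(loud_spl, 2))
print(round(soft_norm_spl - loud_norm_spl, 6))
```

With the calibration tone available, the two signals are separated by 20 dB. Once each is scaled to a maximum amplitude of 1.0, they yield the same level estimate, which is the loss of intensity information that the abstract describes for uncalibrated databases.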

Files