Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Räsänen, Okko [en_US]
dc.contributor.author: Seshadri, Shreyas [en_US]
dc.contributor.author: Karadayi, Julien [en_US]
dc.contributor.author: Riebling, Eric [en_US]
dc.contributor.author: Bunce, John [en_US]
dc.contributor.author: Cristia, Alejandrina [en_US]
dc.contributor.author: Metze, Florian [en_US]
dc.contributor.author: Casillas, Marisa [en_US]
dc.contributor.author: Rosemberg, Celia [en_US]
dc.contributor.author: Bergelson, Elika [en_US]
dc.contributor.author: Soderstrom, Melanie [en_US]
dc.contributor.department: Dept Signal Process and Acoust [en]
dc.contributor.groupauthor: Jorma Skyttä's Group [en]
dc.contributor.organization: CNRS [en_US]
dc.contributor.organization: Carnegie Mellon University [en_US]
dc.contributor.organization: University of Manitoba [en_US]
dc.contributor.organization: Max Planck Institute for Psycholinguistics [en_US]
dc.contributor.organization: Consejo Nacional de Investigaciones Científicas y Técnicas [en_US]
dc.contributor.organization: Duke University [en_US]
dc.date.accessioned: 2019-09-03T13:46:54Z
dc.date.available: 2019-09-03T13:46:54Z
dc.date.issued: 2019-10-01 [en_US]
dc.description.abstract: Automatic word count estimation (WCE) from audio recordings can be used to quantify the amount of verbal communication in a recording environment. One key application of WCE is to measure language input heard by infants and toddlers in their natural environments, as captured by daylong recordings from microphones worn by the infants. Although WCE is nearly trivial for high-quality signals in high-resource languages, daylong recordings are substantially more challenging due to the unconstrained acoustic environments and the presence of near- and far-field speech. Moreover, many use cases of interest involve languages for which reliable ASR systems or even well-defined lexicons are not available. A good WCE system should also perform similarly for low- and high-resource languages in order to enable unbiased comparisons across different cultures and environments. Unfortunately, the current state-of-the-art solution, the LENA system, is based on proprietary software and has only been optimized for American English, limiting its applicability. In this paper, we build on existing work on WCE and present the steps we have taken towards a freely available system for WCE that can be adapted to different languages or dialects with a limited amount of orthographically transcribed speech data. Our system is based on language-independent syllabification of speech, followed by a language-dependent mapping from syllable counts (and a number of other acoustic features) to the corresponding word count estimates. We evaluate our system on samples from daylong infant recordings from six different corpora consisting of several languages and socioeconomic environments, all manually annotated with the same protocol to allow direct comparison. We compare a number of alternative techniques for the two key components in our system: speech activity detection and automatic syllabification of speech. As a result, we show that our system can reach relatively consistent WCE accuracy across multiple corpora and languages (with some limitations). In addition, the system outperforms LENA on three of the four corpora consisting of different varieties of English. We also demonstrate how an automatic neural network-based syllabifier, when trained on multiple languages, generalizes well to novel languages beyond the training data, outperforming two previously proposed unsupervised syllabifiers as a feature extractor for WCE. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 18
dc.format.extent: 63-80
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Räsänen, O, Seshadri, S, Karadayi, J, Riebling, E, Bunce, J, Cristia, A, Metze, F, Casillas, M, Rosemberg, C, Bergelson, E & Soderstrom, M 2019, 'Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech', Speech Communication, vol. 113, pp. 63-80. https://doi.org/10.1016/j.specom.2019.08.005 [en]
dc.identifier.doi: 10.1016/j.specom.2019.08.005 [en_US]
dc.identifier.issn: 0167-6393
dc.identifier.issn: 1872-7182
dc.identifier.other: PURE UUID: 870c0a6b-f491-44aa-97e8-bf0a1f9668bb [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/870c0a6b-f491-44aa-97e8-bf0a1f9668bb [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85070952723&partnerID=8YFLogxK [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/36516899/ELEC_rasanen_automatic_word_count_speechComm.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/40081
dc.identifier.urn: URN:NBN:fi:aalto-201909035123
dc.language.iso: en [en]
dc.publisher: Elsevier
dc.relation.ispartofseries: Speech Communication [en]
dc.relation.ispartofseries: Volume 113 [en]
dc.rights: openAccess [en]
dc.subject.keyword: Automatic syllabification [en_US]
dc.subject.keyword: Daylong recordings [en_US]
dc.subject.keyword: Language acquisition [en_US]
dc.subject.keyword: Noise robustness [en_US]
dc.subject.keyword: Word count estimation [en_US]
dc.title: Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech [en]
dc.type: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) [fi]
dc.type.version: publishedVersion