Estimation of the glottal source from coded telephone speech using deep neural networks

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Narendra, N.P. [en_US]
dc.contributor.author: Airaksinen, Manu [en_US]
dc.contributor.author: Story, Brad [en_US]
dc.contributor.author: Alku, Paavo [en_US]
dc.contributor.department: Department of Signal Processing and Acoustics [en]
dc.contributor.groupauthor: Speech Communication Technology [en]
dc.contributor.organization: University of Arizona [en_US]
dc.date.accessioned: 2018-12-21T10:28:03Z
dc.date.available: 2018-12-21T10:28:03Z
dc.date.embargo: info:eu-repo/date/embargoEnd/2020-12-19 [en_US]
dc.date.issued: 2019-01-01 [en_US]
dc.description.abstract: Estimation of glottal source information can be performed non-invasively from speech by using glottal inverse filtering (GIF) methods. However, the existing GIF methods are sensitive even to slight distortions in speech signals under different realistic scenarios, for example, in coded telephone speech. Therefore, there is a need for robust GIF methods that can accurately estimate glottal flows from coded telephone speech. To address the issue of robust GIF, this paper proposes a new deep neural net-based glottal inverse filtering (DNN-GIF) method for estimation of the glottal source from coded telephone speech. The proposed DNN-GIF method utilizes both coded and clean versions of the speech signal during training. A DNN is used to map the speech features extracted from coded speech to the glottal flows estimated from the corresponding clean speech. The glottal flows are estimated from the clean speech by using quasi-closed phase (QCP) analysis. To generate coded telephone speech, the adaptive multi-rate (AMR) codec is used, which operates in two transmission bandwidths: narrow band (300 Hz - 3.4 kHz) and wide band (50 Hz - 7 kHz). The glottal source parameters were computed with the proposed and existing GIF methods by using vowels obtained from natural speech data as well as from artificial speech production models. The errors in glottal source parameters indicate that the proposed DNN-GIF method considerably improves glottal flow estimation under coded conditions for both low- and high-pitched vowels. The proposed DNN-GIF method can be utilized to accurately extract glottal source-based features from coded telephone speech, which can be used to improve the performance of speech technology applications such as speaker recognition, emotion recognition and telemonitoring of neurodegenerative diseases. (Footnote: In this article, the term "accurate/accuracy" is used only when referring to quantitative, objective measures.) [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 10
dc.format.extent: 95-104
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Narendra, N P, Airaksinen, M, Story, B & Alku, P 2019, 'Estimation of the glottal source from coded telephone speech using deep neural networks', Speech Communication, vol. 106, pp. 95-104. https://doi.org/10.1016/j.specom.2018.12.002 [en]
dc.identifier.doi: 10.1016/j.specom.2018.12.002 [en_US]
dc.identifier.issn: 0167-6393
dc.identifier.issn: 1872-7182
dc.identifier.other: PURE UUID: 043a09f7-a8d7-4b69-b7c8-63da08f96c7b [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/043a09f7-a8d7-4b69-b7c8-63da08f96c7b [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85058619832&partnerID=8YFLogxK [en_US]
dc.identifier.other: PURE LINK: http://www.sciencedirect.com/science/article/pii/S0167639318301444 [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/30450896/narendra_et_al_speech_communication_2018.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/35606
dc.identifier.urn: URN:NBN:fi:aalto-201812216614
dc.language.iso: en [en]
dc.relation.ispartofseries: Speech Communication [en]
dc.relation.ispartofseries: Volume 106 [en]
dc.rights: openAccess [en]
dc.subject.keyword: Glottal source estimation [en_US]
dc.subject.keyword: Glottal inverse filtering [en_US]
dc.subject.keyword: Deep neural network [en_US]
dc.subject.keyword: Coded telephone speech [en_US]
dc.title: Estimation of the glottal source from coded telephone speech using deep neural networks [en]
dc.type: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) [fi]

Files