Applying DNN adaptation to reduce the session dependency of ultrasound tongue imaging-based silent speech interfaces

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Gosztolya, Gábor [en_US]
dc.contributor.author: Grósz, Tamás [en_US]
dc.contributor.author: Tóth, László [en_US]
dc.contributor.author: Markó, Alexandra [en_US]
dc.contributor.author: Csapó, Tamás Gábor [en_US]
dc.contributor.department: Department of Signal Processing and Acoustics [en]
dc.contributor.groupauthor: Speech Recognition [en]
dc.contributor.organization: Hungarian Academy of Sciences [en_US]
dc.contributor.organization: University of Szeged [en_US]
dc.contributor.organization: Eötvös Loránd University [en_US]
dc.contributor.organization: Budapest University of Technology and Economics [en_US]
dc.date.accessioned: 2020-10-02T06:22:09Z
dc.date.available: 2020-10-02T06:22:09Z
dc.date.issued: 2020-01-01 [en_US]
dc.description.abstract: Silent Speech Interfaces (SSI) perform articulatory-to-acoustic mapping to convert articulatory movement into synthesized speech. Their main goal is to aid people with speech impairments, or to serve as part of a communication system operating in silence-required environments or in those with high background noise. Although many previous studies addressed the speaker-dependency of SSI models, session-dependency is also an important issue due to the possible misalignment of the recording equipment. In particular, there are currently no solutions available for tongue ultrasound recordings. In this study, we investigate the degree of session-dependency of standard feed-forward DNN-based models for ultrasound-based SSI systems. Besides examining the amount of training data required for speech synthesis parameter estimation, we also show that DNN adaptation can be useful for handling session dependency. Our results indicate that, with adaptation, less training data and training time are needed to achieve the same speech quality than when training a new DNN from scratch. Our experiments also suggest that the sub-optimal cross-session behavior is caused by the misalignment of the recording equipment, as adapting just the lower, feature-extractor layers of the neural network proved sufficient to achieve a comparable level of performance. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 16
dc.format.extent: 109-124
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Gosztolya, G, Grósz, T, Tóth, L, Markó, A & Csapó, T G 2020, 'Applying DNN adaptation to reduce the session dependency of ultrasound tongue imaging-based silent speech interfaces', Acta Polytechnica Hungarica, vol. 17, no. 7, pp. 109-124. https://doi.org/10.12700/APH.17.7.2020.7.6 [en]
dc.identifier.doi: 10.12700/APH.17.7.2020.7.6 [en_US]
dc.identifier.issn: 1785-8860
dc.identifier.other: PURE UUID: 0aa0397a-a5d7-4da7-8631-472e0379ea3e [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/0aa0397a-a5d7-4da7-8631-472e0379ea3e [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85091064685&partnerID=8YFLogxK [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/51694294/Gosztolya_Grosz_Toth_Marko_Csapo_104_1.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/46763
dc.identifier.urn: URN:NBN:fi:aalto-202010025728
dc.language.iso: en [en]
dc.publisher: Obuda University
dc.relation.ispartofseries: ACTA POLYTECHNICA HUNGARICA [en]
dc.relation.ispartofseries: Volume 17, issue 7 [en]
dc.rights: openAccess [en]
dc.subject.keyword: Articulatory-to-acoustic mapping [en_US]
dc.subject.keyword: Deep Neural Networks [en_US]
dc.subject.keyword: DNN adaptation [en_US]
dc.subject.keyword: Session dependency [en_US]
dc.subject.keyword: Silent speech interfaces [en_US]
dc.title: Applying DNN adaptation to reduce the session dependency of ultrasound tongue imaging-based silent speech interfaces [en]
dc.type: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) [fi]
dc.type.version: publishedVersion
