Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets
dc.contributor | Aalto-yliopisto | fi |
dc.contributor | Aalto University | en |
dc.contributor.author | Lindgren, Matias | en_US |
dc.contributor.author | Jauhiainen, Tommi | en_US |
dc.contributor.author | Kurimo, Mikko | en_US |
dc.contributor.department | Department of Signal Processing and Acoustics | en |
dc.contributor.groupauthor | Speech Recognition | en |
dc.contributor.groupauthor | Centre of Excellence in Computational Inference, COIN | en |
dc.contributor.organization | University of Helsinki | en_US |
dc.date.accessioned | 2021-01-25T10:12:02Z | |
dc.date.available | 2021-01-25T10:12:02Z | |
dc.date.issued | 2020 | en_US |
dc.description | openaire: EC/H2020/780069/EU//MeMAD | |
dc.description.abstract | In this paper, we propose a software toolkit for easier end-to-end training of deep-learning-based spoken language identification models across several speech datasets. We apply our toolkit to implement three baseline models, one speaker recognition model, and three x-vector architecture variations, which are trained on three datasets previously used in spoken language identification experiments. All models are trained separately on each dataset (closed task) and on a combination of all datasets (open task), after which we compare whether the open task training yields better language embeddings. We begin by training all models end-to-end as discriminative classifiers of spectral features, labeled by language. Then, we extract language embedding vectors from the trained end-to-end models, train separate Gaussian Naive Bayes classifiers on the vectors, and compare which model provides the best language embeddings for the back-end classifier. Our experiments show that the open task condition leads to improved language identification performance on only one of the datasets. In addition, we discovered that increasing x-vector model robustness with random frequency channel dropout significantly reduces its end-to-end classification performance on the test set, while not affecting back-end classification performance of its embeddings. Finally, we note that two baseline models consistently outperformed all other models. | en |
dc.description.version | Peer reviewed | en |
dc.format.extent | 5 | |
dc.format.mimetype | application/pdf | en_US |
dc.identifier.citation | Lindgren, M, Jauhiainen, T & Kurimo, M 2020, Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets. in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. vol. 2020-October, Interspeech, International Speech Communication Association (ISCA), pp. 467-471, Interspeech, Shanghai, China, 25/10/2020. https://doi.org/10.21437/Interspeech.2020-2706 | en |
dc.identifier.doi | 10.21437/Interspeech.2020-2706 | en_US |
dc.identifier.issn | 2308-457X | |
dc.identifier.other | PURE UUID: 57006ede-d074-41fe-b29a-eb6e028e6b35 | en_US |
dc.identifier.other | PURE ITEMURL: https://research.aalto.fi/en/publications/57006ede-d074-41fe-b29a-eb6e028e6b35 | en_US |
dc.identifier.other | PURE LINK: http://www.scopus.com/inward/record.url?scp=85098199407&partnerID=8YFLogxK | |
dc.identifier.other | PURE FILEURL: https://research.aalto.fi/files/55067030/Releasing_a_Toolkit_and_Comparing_the_Performance.pdf | en_US |
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/102153 | |
dc.identifier.urn | URN:NBN:fi:aalto-202101251463 | |
dc.language.iso | en | en |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/780069/EU//MeMAD | en_US |
dc.relation.ispartof | Interspeech | en |
dc.relation.ispartofseries | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | en |
dc.relation.ispartofseries | Volume 2020-October, pp. 467-471 | en |
dc.relation.ispartofseries | Interspeech | en |
dc.rights | openAccess | en |
dc.subject.keyword | Deep learning | en_US |
dc.subject.keyword | Language embedding | en_US |
dc.subject.keyword | Spoken language identification | en_US |
dc.subject.keyword | TensorFlow | en_US |
dc.subject.keyword | X-vector | en_US |
dc.title | Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets | en |
dc.type | A4 Article in conference proceedings (A4 Artikkeli konferenssijulkaisussa) | fi |
dc.type.version | publishedVersion |
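
The back-end step described in the abstract (training a Gaussian Naive Bayes classifier on extracted language embedding vectors) can be illustrated with a minimal sketch. This is not the toolkit's code: the function names `fit_gaussian_nb` and `predict`, the language labels `fin` and `eng`, the three-dimensional vectors, and the synthetic data are all assumptions made for illustration, and the random vectors merely stand in for embeddings that a trained end-to-end model would produce.

```python
import math
import random
from collections import defaultdict


def fit_gaussian_nb(vectors, labels, var_floor=1e-6):
    """Estimate a log prior, per-feature means, and per-feature variances for each class."""
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        dims = len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dims)]
        # A small floor keeps variances strictly positive.
        variances = [
            sum((r[d] - means[d]) ** 2 for r in rows) / n + var_floor
            for d in range(dims)
        ]
        model[label] = (math.log(n / len(vectors)), means, variances)
    return model


def predict(model, vec):
    """Return the label with the highest log prior plus Gaussian log likelihood."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(vec, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical stand-in for language embeddings: two well-separated clusters.
rng = random.Random(0)
centers = {"fin": [0.0, 0.0, 0.0], "eng": [3.0, 3.0, 3.0]}


def sample(label):
    """Draw one synthetic embedding vector around the class center."""
    return [c + rng.gauss(0.0, 0.5) for c in centers[label]]


train_labels = ["fin", "eng"] * 50
train_vectors = [sample(label) for label in train_labels]
model = fit_gaussian_nb(train_vectors, train_labels)

test_labels = ["fin", "eng"] * 20
accuracy = sum(predict(model, sample(label)) == label for label in test_labels) / len(test_labels)
```

In practice an existing implementation such as scikit-learn's `GaussianNB` would likely replace the hand-written estimator; the pure-Python version here only makes the per-class mean and variance estimation explicit.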