Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets

Access rights

openAccess
publishedVersion

A4 Article in conference proceedings

Date

2020

Language

en

Pages

5

Series

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Volume 2020-October, pp. 467-471, Interspeech

Abstract

In this paper, we propose a software toolkit for easier end-to-end training of deep-learning-based spoken language identification models across several speech datasets. We apply our toolkit to implement three baseline models, one speaker recognition model, and three x-vector architecture variations, which are trained on three datasets previously used in spoken language identification experiments. All models are trained separately on each dataset (closed task) and on a combination of all datasets (open task), after which we examine whether open task training yields better language embeddings. We begin by training all models end-to-end as discriminative classifiers of spectral features, labeled by language. Then, we extract language embedding vectors from the trained end-to-end models, train separate Gaussian Naive Bayes classifiers on the vectors, and compare which model provides the best language embeddings for the back-end classifier. Our experiments show that the open task condition leads to improved language identification performance on only one of the datasets. In addition, we discovered that increasing x-vector model robustness with random frequency channel dropout significantly reduces its end-to-end classification performance on the test set, while not affecting the back-end classification performance of its embeddings. Finally, we note that two baseline models consistently outperformed all other models.
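
The back-end step described in the abstract (fitting a Gaussian Naive Bayes classifier on language embedding vectors) can be illustrated with a minimal sketch. This is not the toolkit's code: the classifier is hand-written in plain Python for self-containment, the embeddings are synthetic random vectors standing in for vectors extracted from a trained end-to-end model, and the language labels, dimensionality, and function names (`fit_gaussian_nb`, `predict`, `make_embeddings`) are assumptions made for illustration only.

```python
import math
import random
from collections import defaultdict


def fit_gaussian_nb(vectors, labels, var_floor=1e-6):
    """Estimate a log prior and per-dimension mean/variance for each class."""
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    total = len(vectors)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        dims = list(zip(*rows))  # transpose: one tuple per embedding dimension
        means = [sum(d) / n for d in dims]
        # var_floor keeps the variance strictly positive (hypothetical choice)
        variances = [
            sum((x - m) ** 2 for x in d) / n + var_floor
            for d, m in zip(dims, means)
        ]
        model[label] = (math.log(n / total), means, variances)
    return model


def predict(model, vec):
    """Return the class with the highest log posterior under the naive assumption."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(vec, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


def make_embeddings(rng, centre, count, dim=8, spread=0.5):
    """Synthetic stand-in for embeddings taken from a trained end-to-end model."""
    return [[rng.gauss(centre, spread) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
centres = {"fin": -1.0, "eng": 0.0, "swe": 1.0}  # hypothetical language labels
train_x, train_y, test_x, test_y = [], [], [], []
for lang, centre in centres.items():
    train_x += make_embeddings(rng, centre, 50)
    train_y += [lang] * 50
    test_x += make_embeddings(rng, centre, 20)
    test_y += [lang] * 20

model = fit_gaussian_nb(train_x, train_y)
predictions = [predict(model, vec) for vec in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(f"back-end accuracy on synthetic embeddings: {accuracy:.2f}")
```

In practice the synthetic vectors would be replaced by embeddings extracted from each trained model, so that the same back-end classifier can be used to compare embedding quality across models.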

Description

openaire: EC/H2020/780069/EU//MeMAD

Keywords

Deep learning, Language embedding, Spoken language identification, TensorFlow, X-vector

Citation

Lindgren, M, Jauhiainen, T & Kurimo, M 2020, Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets. in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. vol. 2020-October, Interspeech, International Speech Communication Association (ISCA), pp. 467-471, Interspeech, Shanghai, China, 25/10/2020. https://doi.org/10.21437/Interspeech.2020-2706