Voice-quality Features for Deep Neural Network Based Speaker Verification Systems

Access rights

openAccess
publishedVersion

A4 Article in conference proceedings

Language

en

Pages

5

Series

29th European Signal Processing Conference, EUSIPCO 2021 - Proceedings, pp. 176-180, European Signal Processing Conference

Abstract

Jitter and shimmer are voice-quality features that have been used successfully to detect voice pathologies and to classify speaking styles. In this paper, we investigate the usefulness of such voice-quality features in neural-network-based speaker verification systems. To combine the voice-quality features with mel-spectrogram features, the cosine distance scores estimated from the two feature sets are linearly weighted to obtain a single fused score, which is then used to accept or reject a given speaker. Experiments on the VoxCeleb1 dataset demonstrate that fusing the cosine distance scores obtained from the mel-spectrogram and voice-quality features provides a 15% relative improvement in Equal Error Rate (EER) over the baseline system, which uses only mel-spectrogram features.
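The score-level fusion described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the fusion weight `alpha`, and the decision `threshold` are hypothetical, and the score is computed here as cosine similarity between two embedding vectors (higher means more similar), which may differ from the exact scoring convention used in the paper.

```python
import numpy as np


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring rule)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def fused_score(mel_score, vq_score, alpha=0.9):
    """Linearly weight the two scores into one fused score.

    alpha is a hypothetical weight for the mel-spectrogram score;
    the voice-quality score receives the remaining 1 - alpha.
    """
    return alpha * mel_score + (1.0 - alpha) * vq_score


def accept(score, threshold=0.5):
    """Accept the trial if the fused score reaches a hypothetical threshold."""
    return score >= threshold


# Toy enrollment/test embeddings, invented purely for illustration.
mel_enroll, mel_test = [0.2, 0.9, 0.1], [0.25, 0.85, 0.05]
vq_enroll, vq_test = [0.6, 0.4], [0.5, 0.5]

score = fused_score(cosine_score(mel_enroll, mel_test),
                    cosine_score(vq_enroll, vq_test))
decision = accept(score)
```

In practice, the weight and the threshold would presumably be tuned on held-out trials; the values above are placeholders.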

Citation

Zewoudie, A, Koivisto, L & Bäckström, T 2021, Voice-quality Features for Deep Neural Network Based Speaker Verification Systems. in 29th European Signal Processing Conference, EUSIPCO 2021 - Proceedings. European Signal Processing Conference, IEEE, pp. 176-180, European Signal Processing Conference, Dublin, Ireland, 23/08/2021. https://doi.org/10.23919/EUSIPCO54536.2021.9616242