Speaker verification is a subtask of speaker recognition that uses speech, the most natural means of communication, as a form of biometric analysis. To do so, a system extracts and models the characteristic features of a speaker's voice from the speech signal. Speaker verification is an essential tool in many applications, ranging from law enforcement to the voice-controlled smart assistants (e.g., Siri) that are now widespread in daily life.
However, speech exhibits a large degree of variability from different sources, which can severely degrade the performance of these systems. Recent developments have therefore focused on mitigating this variability, aided by the creation of large datasets tailored to speaker recognition and by advances in deep learning that have significantly boosted performance. In particular, deep speaker embeddings have proven to be a successful technique for representing a speaker as a fixed-dimensional feature vector.
This thesis focuses on implementing two speaker verification systems that extract deep speaker embeddings using deep neural networks and an advanced objective function. The models are then analyzed on various test sets, including recordings from "in the wild" environments and speech in unseen languages, specifically Finnish. The experiments demonstrated the models' excellent generalization ability, their robustness to adverse conditions, and their capacity to operate in a language-agnostic manner.