A Comparison of Data Augmentation Methods in Voice Pathology Detection

A1 Original article in a scientific journal
Computer Speech and Language, Volume 83
To distinguish pathological voices from healthy voices, automatic voice pathology detection systems can be built using machine learning (ML) and deep learning (DL) techniques. To fully exploit such systems, large quantities of training data are typically required. The amount of training data is, however, small in the area of pathological voice, and therefore data augmentation (DA) is a potential way to artificially increase the quantity of training data. This study presents a systematic comparison between various DA methods in the detection of pathological voice, including three time domain methods (noise addition, pitch shifting and time stretching), one time-frequency domain method (SpecAugment), and two vocoder-based methods (harmonic-to-noise ratio (HNR) modification and glottal pulse length modification). Detection systems were built using four popular spectral feature representations (static mel-frequency cepstral coefficients (MFCCs), dynamic MFCCs, spectrogram and mel-spectrogram). As classifiers, two widely used ML models (support vector machine (SVM) and random forest (RF)) and two DL models (long short-term memory (LSTM) network and convolutional neural network (CNN) with 1-dimensional (1-D) and 2-dimensional (2-D) architectures) were used. These systems were trained using a small number of training samples from two popular databases of pathological voice (HUPA and SVD) to find the best feature/classifier combination for each database. As a result, one ML-based detection system (mel-spectrogram/SVM for HUPA and SVD) and two DL-based detection systems (dynamic MFCCs/2-D CNN for HUPA and mel-spectrogram/2-D CNN for SVD) were selected for the comparison of the DA methods. The results show that using DA in system training increased detection accuracy compared to the baseline systems that were trained without DA. This improvement in accuracy was, however, clearly larger for the 2-D CNN system than for the SVM system.
Furthermore, all six DA methods improved the accuracy of the 2-D CNN system compared to the baseline system for both databases. The highest improvements were achieved using the time-frequency domain SpecAugment DA method, which improved accuracy by 1.5% and 3.8% (absolute) for the HUPA and SVD databases, respectively.
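
As a rough illustration of two of the DA methods named in the abstract, the following minimal Python sketch shows noise addition in the time domain and SpecAugment-style masking in the time-frequency domain. It is not the authors' implementation. The function names, the use of white Gaussian noise, the single mask per axis, the zero fill value and all parameter values are assumptions made for illustration. The time-warping step of the original SpecAugment method is omitted, and plain Python lists stand in for arrays so that the sketch has no external dependencies.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio (dB).

    Hypothetical helper: the noise type and the SNR parameterisation are
    assumptions, not details taken from the study.
    """
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def spec_augment(spec, max_freq_width, max_time_width, rng):
    """Zero one random band of frequency bins and one random span of frames.

    `spec` is a list of rows, one row per frequency bin and one column per
    time frame. The input is left unmodified; a masked copy is returned.
    Mask widths are drawn uniformly from 0..max width (inclusive) and are
    assumed not to exceed the spectrogram dimensions.
    """
    n_freq, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]

    # Frequency mask: blank out f consecutive frequency bins.
    f = rng.randint(0, max_freq_width)
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_frames

    # Time mask: blank out t consecutive frames in every bin.
    t = rng.randint(0, max_time_width)
    t0 = rng.randint(0, n_frames - t)
    for row in out:
        row[t0:t0 + t] = [0.0] * t

    return out
```

Each call with a different random state yields a different augmented copy of the same recording or feature matrix, which is how such methods enlarge a small training set without new recordings.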
voice pathology, data augmentation, deep learning, CNNs, mel-spectrogram
Javanmardi, F, Kadiri, S & Alku, P 2023, 'A Comparison of Data Augmentation Methods in Voice Pathology Detection', Computer Speech and Language, vol. 83, 101552, pp. 1-16. https://doi.org/10.1016/j.csl.2023.101552