Segment phoneme classification from speech under noisy conditions: Using amplitude-frequency modulation based two-dimensional auto-regressive features with deep neural networks

School of Electrical Engineering | Master's thesis
Signal Processing
Degree programme
TLT - Master’s Programme in Communications Engineering (TS2005)
This thesis investigates, at the acoustic-phonetic level, the noise robustness of features derived from AM-FM analysis of speech signals. The robustness of these features is analysed using various neural network models and is based on the segment classification of phonemes. The analysis is also extended to compare the robustness of the AM-FM based features with that of traditional features, such as Mel-frequency cepstral coefficients (MFCC), under similar noise conditions. We begin with an important aspect of segment phoneme classification experiments: the study of architectural and training strategies of the various neural network models used. The results of these experiments showed that the training pattern differs between the neural network models. Before over-fitting, models that undergo pre-training are seen to train for many more epochs than their counterparts that do not undergo pre-training. Taking this difference in training pattern into account, and based on phoneme classification rate, the Gaussian restricted Boltzmann machine and the single-layer perceptron are selected as the best-performing models of the two groups, respectively. Using these two models for classification, segment phoneme classification experiments under different noise conditions are performed for both the AM-FM based and the traditional features. The experiments showed that AM-FM based frequency domain linear prediction features, with or without feature compensation, are more robust than the traditional features in the classification of 61 phonemes under white noise and 0 dB signal-to-noise ratio (SNR) conditions. However, when the phonemes are folded to 39 phonemes, the results are ambiguous under all noise conditions, and no single feature emerges as the most robust.
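To make the 0 dB SNR white-noise condition concrete, the following minimal sketch shows one common way to corrupt a clean signal at a chosen SNR. It is an illustration only and not taken from the thesis: the function names `add_white_noise` and `measured_snr_db`, the fixed seed, and the sine tone standing in for a speech segment are all assumptions. The scaling follows the standard definition SNR_dB = 10 log10(P_signal / P_noise).

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR (dB).

    Hypothetical helper for illustration; not the procedure used in the thesis.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Target noise power follows from SNR_dB = 10 * log10(P_signal / P_noise).
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal relative to its clean reference."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# Assumed stand-in for a speech segment: a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(16000)]
noisy = add_white_noise(clean, snr_db=0.0)
```

At 0 dB the noise carries the same average power as the signal, which is why this condition is a demanding test of feature robustness.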
Alku, Paavo
Thesis advisor
Gowda, Dhananjaya
robust speech recognition, AM-FM based features, segment phoneme classification, deep neural networks