Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Kethireddy, Rashmi [en_US]
dc.contributor.author: Kadiri, Sudarsana Reddy [en_US]
dc.contributor.author: Gangashetty, Suryakanth V. [en_US]
dc.contributor.department: Dept Signal Process and Acoust [en]
dc.contributor.groupauthor: Speech Communication Technology [en]
dc.contributor.organization: International Institute of Information Technology Hyderabad [en_US]
dc.contributor.organization: Koneru Lakshmaiah Education Foundation [en_US]
dc.date.accessioned: 2022-03-28T09:41:37Z
dc.date.available: 2022-03-28T09:41:37Z
dc.date.embargo: info:eu-repo/date/embargoEnd/2022-08-03 [en_US]
dc.date.issued: 2022-02-01 [en_US]
dc.description.abstract: The goal of this study is to investigate advanced signal processing approaches [single frequency filtering (SFF) and zero-time windowing (ZTW)] with modern deep neural networks (DNNs) [convolutional neural networks (CNNs), temporal convolutional neural networks (TCN), time-delay neural network (TDNN), and emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN)] for dialect classification of major dialects of English. Previous studies indicated that SFF and ZTW methods provide higher spectro-temporal resolution. To capture the intrinsic variations in articulations among dialects, four feature representations [spectrogram (SPEC), cepstral coefficients, mel filter-bank energies, and mel-frequency cepstral coefficients (MFCCs)] are derived from SFF and ZTW methods. Experiments with and without data augmentation using CNN classifiers revealed that the proposed features performed better than baseline short-time Fourier transform (STFT)-based features on the UT-Podcast database [Hansen, J. H., and Liu, G. (2016). "Unsupervised accent classification for deep data fusion of accent and language information," Speech Commun. 78, 19-33]. Even without data augmentation, all the proposed features showed a relative improvement of approximately 15%-20% over the best baseline feature (SPEC-STFT). TCN, TDNN, and ECAPA-TDNN classifiers, which capture wider temporal context, further improved the performance for many of the proposed and baseline features. Among all the baseline and proposed features, the best performance is achieved with single frequency filtered cepstral coefficients for TCN (81.30%), TDNN (81.53%), and ECAPA-TDNN (85.48%). An investigation of data-driven filters, instead of the fixed mel scale, improved the performance by 2.8% and 1.4% (relative) for SPEC-STFT and SPEC-SFF, respectively, and gave nearly equal performance for SPEC-ZTW. To assist related work, we have made the code available [Kethireddy, R., and Kadiri, S. R. (2022). "Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations," https://github.com/r39ashmi/e2e_dialect (Last viewed 21 December 2021)]. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 16
dc.format.extent: 1077-1092
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Kethireddy, R, Kadiri, S R & Gangashetty, S V 2022, 'Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations', The Journal of the Acoustical Society of America, vol. 151, no. 2, pp. 1077-1092. https://doi.org/10.1121/10.0009405 [en]
dc.identifier.doi: 10.1121/10.0009405 [en_US]
dc.identifier.issn: 0001-4966
dc.identifier.issn: 1520-8524
dc.identifier.other: PURE UUID: 7cdb5410-d78f-4ad1-92df-bbc535d79999 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/7cdb5410-d78f-4ad1-92df-bbc535d79999 [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85125598875&partnerID=8YFLogxK [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/80833494/Kethireddy_Deep_neural_architectures_for_dialect_classification.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/113772
dc.identifier.urn: URN:NBN:fi:aalto-202203282649
dc.language.iso: en [en]
dc.publisher: ACOUSTICAL SOCIETY OF AMERICA
dc.relation.ispartofseries: The Journal of the Acoustical Society of America [en]
dc.relation.ispartofseries: Volume 151, issue 2 [en]
dc.rights: openAccess [en]
dc.title: Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations [en]
dc.type: A1 Original article in a scientific journal [en]