Audio Event Classification Using Deep Learning Methods

dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.advisor Smit, Peter
dc.contributor.author Xu, Zhicun
dc.date.accessioned 2018-12-14T16:00:13Z
dc.date.available 2018-12-14T16:00:13Z
dc.date.issued 2018-12-10
dc.identifier.uri https://aaltodoc.aalto.fi/handle/123456789/35445
dc.description.abstract Whether crossing the road or enjoying a concert, sound carries important information about the world around us. Audio event classification refers to recognition tasks that assign one or several labels, such as ‘dog bark’ or ‘doorbell’, to a particular audio signal. Teaching machines to perform this classification task can therefore help humans in many fields. Since deep learning has shown great potential in many AI applications, this thesis studies deep learning methods and builds suitable neural networks for the audio event classification task. To evaluate the performance of different neural networks, we tested them on both Google AudioSet and the dataset for DCASE 2018 Task 2. Instead of providing the original audio files, AudioSet offers compact 128-dimensional embeddings produced by a modified VGG model for audio, with a frame length of 960 ms. For DCASE 2018 Task 2, we first preprocessed the soundtracks and then fine-tuned the VGG model that AudioSet used as a feature extractor. Thus, each soundtrack from both tasks is represented as a series of 128-dimensional features. We then compared DNN, LSTM, and multi-level attention models with different hyperparameters. The results show that fine-tuning the feature generation model for the DCASE task greatly improved the evaluation score. In addition, the attention models performed best in our settings for both tasks. The results indicate that using a CNN-like model as a feature extractor on log-mel spectrograms and modeling the temporal dynamics with an attention model can achieve state-of-the-art results in the task of audio event classification. For future research, the thesis suggests training a better CNN model for feature extraction, utilizing multi-scale and multi-level features for better classification, and combining audio features with other multimodal information for audiovisual data analysis. en
dc.format.extent 67+6
dc.format.mimetype application/pdf en
dc.language.iso en en
dc.title Audio Event Classification Using Deep Learning Methods en
dc.type G2 Pro gradu, diplomityö fi
dc.contributor.school Sähkötekniikan korkeakoulu fi
dc.subject.keyword audio event classification en
dc.subject.keyword AudioSet en
dc.subject.keyword multi-level attention model en
dc.subject.keyword VGG en
dc.identifier.urn URN:NBN:fi:aalto-201812146460
dc.programme.major Acoustics and Audio Technology fi
dc.programme.mcode ELEC3030 fi
dc.type.ontasot Master's thesis en
dc.type.ontasot Diplomityö fi
dc.contributor.supervisor Kurimo, Mikko
dc.programme CCIS - Master’s Programme in Computer, Communication and Information Sciences (TS2013) fi
dc.location P1 fi
local.aalto.electroniconly yes
local.aalto.openaccess yes
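
The abstract above describes a pipeline in which each clip is represented as a sequence of 128-dimensional embeddings from a VGGish-style feature extractor (one embedding per 960 ms log-mel frame) and then classified by a multi-level attention model. Below is a minimal PyTorch sketch of that idea; it is not the thesis's code, and the hidden size, number of attention levels, and 41-class output (the DCASE 2018 Task 2 label set) are illustrative assumptions.

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    # Attention pooling: learned per-frame weights collapse a
    # (batch, time, dim) sequence into a single (batch, dim) clip vector.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                         # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)   # (batch, time, 1)
        return (w * x).sum(dim=1)                 # (batch, dim)

class MultiLevelAttention(nn.Module):
    # Stack of fully connected blocks; each level contributes its own
    # attention-pooled clip summary, and the summaries are concatenated
    # before the final classification layer.
    def __init__(self, in_dim=128, hidden=512, levels=2, n_classes=41):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.pools = nn.ModuleList()
        dim = in_dim
        for _ in range(levels):
            self.blocks.append(nn.Sequential(nn.Linear(dim, hidden), nn.ReLU()))
            self.pools.append(AttentionPool(hidden))
            dim = hidden
        self.classifier = nn.Linear(hidden * levels, n_classes)

    def forward(self, x):                         # x: (batch, time, 128)
        pooled = []
        for block, pool in zip(self.blocks, self.pools):
            x = block(x)                          # per-frame transform
            pooled.append(pool(x))                # clip-level summary at this level
        return self.classifier(torch.cat(pooled, dim=-1))  # class logits

# A 10-second clip yields roughly ten 960 ms VGGish-style frames:
model = MultiLevelAttention()
logits = model(torch.randn(4, 10, 128))          # -> shape (4, 41)

Attention pooling lets the clip-level decision weight informative frames more heavily than silence or background, which is one reason such models can outperform plain frame averaging on weakly labelled clips.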

