Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Adavanne, Sharath [en_US]
dc.contributor.author: Politis, Archontis [en_US]
dc.contributor.author: Nikunen, Joonas [en_US]
dc.contributor.author: Virtanen, Tuomas [en_US]
dc.contributor.department: Department of Signal Processing and Acoustics [en]
dc.contributor.organization: Tampere University [en_US]
dc.date.accessioned: 2019-03-05T10:13:52Z
dc.date.available: 2019-03-05T10:13:52Z
dc.date.issued: 2019-03 [en_US]
dc.description: openaire: EC/H2020/637422/EU//EVERYSOUND
dc.description.abstract: In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 15
dc.identifier.citation: Adavanne, S, Politis, A, Nikunen, J & Virtanen, T 2019, 'Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks', IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, 8567942, pp. 34-48. https://doi.org/10.1109/JSTSP.2018.2885636 [en]
dc.identifier.doi: 10.1109/JSTSP.2018.2885636 [en_US]
dc.identifier.issn: 1932-4553
dc.identifier.issn: 1941-0484
dc.identifier.other: PURE UUID: 00ca3dd1-6298-4528-bf1e-998303871f81 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/00ca3dd1-6298-4528-bf1e-998303871f81 [en_US]
dc.identifier.other: PURE LINK: https://arxiv.org/pdf/1807.00129.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/36994
dc.identifier.urn: URN:NBN:fi:aalto-201903052140
dc.language.iso: en [en]
dc.publisher: IEEE
dc.relation: info:eu-repo/grantAgreement/EC/H2020/637422/EU//EVERYSOUND [en_US]
dc.relation.ispartofseries: IEEE Journal of Selected Topics in Signal Processing [en]
dc.relation.ispartofseries: Volume 13, issue 1, pp. 34-48 [en]
dc.rights: openAccess [en]
dc.subject.keyword: convolutional recurrent neural network [en_US]
dc.subject.keyword: direction of arrival estimation [en_US]
dc.subject.keyword: Sound event detection [en_US]
dc.title: Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks [en]
dc.type: A1 Original article in a scientific journal (A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä) [fi]
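
Note on the two outputs described in the abstract: the following is a minimal Python sketch, not the authors' implementation, of what per-frame targets for the two parallel outputs could look like. It assumes azimuth is measured in the x-y plane and elevation from that plane, both in degrees, and that inactive classes get an all-zero DOA target; neither convention is stated in this record. The function names (doa_to_cartesian, cartesian_to_doa, frame_targets) are hypothetical.

```python
import math


def doa_to_cartesian(azimuth_deg, elevation_deg):
    """Map a direction of arrival to a point on the unit sphere.

    Assumed convention: azimuth in the x-y plane, elevation from that plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = math.cos(el) * math.cos(az)
    y = math.cos(el) * math.sin(az)
    z = math.sin(el)
    return (x, y, z)


def cartesian_to_doa(x, y, z):
    """Recover (azimuth, elevation) in degrees from Cartesian coordinates."""
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return (azimuth, elevation)


def frame_targets(num_classes, active_events):
    """Build the two targets for one time-frame.

    active_events maps a class index to its (azimuth, elevation) in degrees.
    Returns a multi-label SED vector (one 0/1 entry per class) and a DOA
    regression vector holding x, y, z for each class in turn.
    Assumption: inactive classes keep a DOA target of (0, 0, 0).
    """
    sed = [0] * num_classes
    doa = [0.0] * (3 * num_classes)
    for class_idx, (az, el) in active_events.items():
        sed[class_idx] = 1
        doa[3 * class_idx:3 * class_idx + 3] = doa_to_cartesian(az, el)
    return sed, doa
```

For example, frame_targets(3, {0: (30, 10), 2: (-120, 0)}) marks two overlapping events in the same frame, each with its own DOA. Because every class has its own x, y, z slots, each estimated DOA stays tied to its sound event label, which is the association the abstract describes.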
