Are 3D convolutional networks inherently biased towards appearance?

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Byvshev, Petr [en_US]
dc.contributor.author: Xiao, Yu [en_US]
dc.contributor.author: Mettes, Pascal [en_US]
dc.contributor.department: Department of Communications and Networking [en]
dc.contributor.groupauthor: Mobile Cloud Computing [en]
dc.contributor.organization: University of Amsterdam [en_US]
dc.date.accessioned: 2022-05-24T05:13:49Z
dc.date.available: 2022-05-24T05:13:49Z
dc.date.issued: 2022-07 [en_US]
dc.description.abstract: 3D convolutional networks, as direct inheritors of 2D convolutional networks for images, have placed their mark on action recognition in videos. Combined with pretraining on large-scale video data, high classification accuracies have been obtained on numerous video benchmarks. In an effort to better understand why 3D convolutional networks are so effective, several works have highlighted their bias towards static appearance and towards the scenes in which actions occur. In this work, we seek to find the source of this bias and question whether the observed biases towards static appearances are inherent to 3D convolutional networks or represent limited significance of motion in the training data. We resolve this by presenting temporality measures that estimate the data-to-model motion dependency at both the layer-level and the kernel-level. Moreover, we introduce two synthetic datasets where motion and appearance are decoupled by design, which allows us to directly observe their effects on the networks. Our analysis shows that 3D architectures are not inherently biased towards appearance. When trained on the most prevalent video sets, 3D convolutional networks are indeed biased throughout, especially in the final layers of the network. However, when training on data with motions and appearances explicitly decoupled and balanced, such networks adapt to varying levels of temporality. To this end, we see the proposed measures as a reliable method to estimate motion relevance for activity classification in datasets and use them to uncover the differences between popular pre-training video collections, such as Kinetics, IG-65M and HowTo100M. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 12
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Byvshev, P, Xiao, Y & Mettes, P 2022, 'Are 3D convolutional networks inherently biased towards appearance?', Computer Vision and Image Understanding, vol. 220, no. 103437, 103437. https://doi.org/10.1016/j.cviu.2022.103437 [en]
dc.identifier.doi: 10.1016/j.cviu.2022.103437 [en_US]
dc.identifier.issn: 1077-3142
dc.identifier.issn: 1090-235X
dc.identifier.other: PURE UUID: cf4c2bf5-6d89-4335-8900-5d78f665fe34 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/cf4c2bf5-6d89-4335-8900-5d78f665fe34 [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/83213062/1_s2.0_S1077314222000534_main.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/114565
dc.identifier.urn: URN:NBN:fi:aalto-202205243412
dc.language.iso: en [en]
dc.publisher: Elsevier
dc.relation.ispartofseries: Computer Vision and Image Understanding [en]
dc.relation.ispartofseries: Volume 220, issue 103437 [en]
dc.rights: openAccess [en]
dc.subject.keyword: 3D models [en_US]
dc.subject.keyword: Temporality measure [en_US]
dc.subject.keyword: Motion analysis [en_US]
dc.subject.keyword: Large-scale videosets [en_US]
dc.title: Are 3D convolutional networks inherently biased towards appearance? [en]
dc.type: A1 Original article in a scientific journal (A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä) [fi]
dc.type.version: publishedVersion

Files

Original bundle

Name: 1_s2.0_S1077314222000534_main.pdf
Size: 4.64 MB
Format: Adobe Portable Document Format