Are 3D convolutional networks inherently biased towards appearance?
Access rights
openAccess
publishedVersion
A1 Original article in a scientific journal
This publication is imported from Aalto University research portal.
View publication in the Research portal
View/Open full text file from the Research portal
Other link related to publication
Authors
Byvshev, P; Xiao, Y; Mettes, P
Date
2022-07
Language
en
Pages
12
Series
Computer Vision and Image Understanding, Volume 220, article 103437
Abstract
3D convolutional networks, as direct inheritors of 2D convolutional networks for images, have placed their mark on action recognition in videos. Combined with pretraining on large-scale video data, high classification accuracies have been obtained on numerous video benchmarks. In an effort to better understand why 3D convolutional networks are so effective, several works have highlighted their bias towards static appearance and towards the scenes in which actions occur. In this work, we seek to find the source of this bias and question whether the observed biases towards static appearances are inherent to 3D convolutional networks or represent limited significance of motion in the training data. We resolve this by presenting temporality measures that estimate the data-to-model motion dependency at both the layer-level and the kernel-level. Moreover, we introduce two synthetic datasets where motion and appearance are decoupled by design, which allows us to directly observe their effects on the networks. Our analysis shows that 3D architectures are not inherently biased towards appearance. When trained on the most prevalent video sets, 3D convolutional networks are indeed biased throughout, especially in the final layers of the network. However, when training on data with motions and appearances explicitly decoupled and balanced, such networks adapt to varying levels of temporality. To this end, we see the proposed measures as a reliable method to estimate motion relevance for activity classification in datasets and use them to uncover the differences between popular pre-training video collections, such as Kinetics, IG-65M and HowTo100M.
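The temporality measures themselves are defined in the full paper. As a rough illustration of the underlying idea only, the sketch below (not the authors' measure) probes how strongly each stage of a 3D CNN depends on temporal order by comparing its activations on a clip with those on a frame-shuffled copy of the same clip; the torchvision r3d_18 model, the chosen layer names, and the random stand-in clip are all assumptions made for the example.

```python
# Illustrative proxy for layer-level motion dependency, NOT the paper's temporality measure.
# If a layer's output is nearly unchanged after destroying temporal order, that layer is
# effectively relying on static appearance rather than motion.
import torch
import torch.nn.functional as F
from torchvision.models.video import r3d_18  # any 3D convolutional network would do

model = r3d_18().eval()  # random weights here; swap in pretrained weights for a real probe

activations = {}
def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Residual stage names follow torchvision's r3d_18 implementation.
stages = ["layer1", "layer2", "layer3", "layer4"]
for name in stages:
    getattr(model, name).register_forward_hook(make_hook(name))

clip = torch.randn(1, 3, 16, 112, 112)        # stand-in video clip, shape (N, C, T, H, W)
shuffled = clip[:, :, torch.randperm(16)]      # shuffle frames: same appearance, no temporal order

with torch.no_grad():
    model(clip)
    original = {k: v.clone() for k, v in activations.items()}
    model(shuffled)

for name in stages:
    sim = F.cosine_similarity(original[name].flatten(1), activations[name].flatten(1)).item()
    # sim close to 1.0 -> stage output barely changes without temporal order (appearance-dominated)
    print(f"{name}: cosine similarity, original vs. frame-shuffled = {sim:.3f}")
```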
Keywords
3D models, Temporality measure, Motion analysis, Large-scale video sets
Citation
Byvshev, P, Xiao, Y & Mettes, P 2022, 'Are 3D convolutional networks inherently biased towards appearance?', Computer Vision and Image Understanding, vol. 220, 103437. https://doi.org/10.1016/j.cviu.2022.103437