Motion and Appearance Representation Learning of Human Activities From Videos and Wearable Sensor

School of Electrical Engineering | Doctoral thesis (article-based) | Defence date: 2023-04-19
78 + app. 88
Aalto University publication series DOCTORAL THESES, 38/2023
Recently we have observed substantial growth in video data and its consumption, and the most successful video models are deep learning networks trained on large-scale video datasets. The aim of this thesis is to enhance video representations learned with such deep learning networks. Noting that three-dimensional (3D) models inherited their design from two-dimensional (2D) image understanding models, the goal of this project is to identify the differences that come with the temporal dimension by studying how temporal dependencies are learned by the models. First, we explored how temporal information can be learned from individual frames that cover a continuous process. The task was framed as an ordinal image classification problem in which classes represent the sequence of stages that the process goes through. Siamese networks were used to extract temporal information from pairs of images and improve the ordinal classification and temporal precision of the prediction. Second, we investigated how to make video activity recognition more robust to noise and occlusions by using complementary information provided by smart gloves. We proposed heterogeneous fusion framed as a nonlocal operation in which one modality served to enforce the other, which improved activity recognition. Third, we developed a simulation-driven platform for creating smart-glove-based human activity recognition systems by effectively using large pools of video data to create synthetic smart-glove sensor data. Subsequently, when applying 3D deep video features, a bias was observed towards appearance information rather than motion. To identify the source of that phenomenon, we designed a temporality measure for 3D convolutional networks at both the layer level and the kernel level. Our analysis suggests that 3D architectures are not inherently biased towards appearance.
When trained on the most prevalent video datasets, 3D convolutional networks are indeed biased throughout, especially in the final layers of the network. However, when trained on data with motions and appearances explicitly decoupled and balanced, or on data with more pronounced temporal information, such networks adapted to varying levels of temporality. Lastly, two fundamental factors of sampling regimes for training deep networks were identified: frame density and temporal extent. We outlined an approach to estimate the preferred sampling regime at the level of individual actions in a data-driven manner. We have concluded that video understanding models may benefit from considering the varied nature of video sampling regimes. This thesis contributes to the pool of methods that aim to extract temporal information from videos in an optimal manner and combine it with other sensory modalities. We illustrated a clear connection between the training video data and the network's ability to model dynamical patterns in the videos. We believe this thesis advances network architecture design and brings insights into open video data and its impact on the learned video representations.
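The two sampling factors named in the abstract can be illustrated with a minimal sketch. It assumes that temporal extent means the number of consecutive source frames a clip spans and that frame density means the fraction of those frames actually kept; these definitions, and the function names `sample_frame_indices` and `frame_density`, are hypothetical illustrations and are not taken from the thesis.

```python
def sample_frame_indices(num_video_frames, extent, num_samples, start=0):
    """Pick num_samples evenly spaced frame indices spanning `extent` frames.

    Assumed definitions (not from the thesis): `extent` is the number of
    consecutive source frames covered by the clip, and `num_samples` is
    how many of those frames are kept.
    """
    if num_samples < 1 or extent < num_samples:
        raise ValueError("need 1 <= num_samples <= extent")
    if start < 0 or start + extent > num_video_frames:
        raise ValueError("clip does not fit inside the video")
    if num_samples == 1:
        return [start]
    # Integer arithmetic keeps the first and last indices exactly on the
    # boundaries of the covered span.
    return [start + (i * (extent - 1)) // (num_samples - 1)
            for i in range(num_samples)]


def frame_density(extent, num_samples):
    """Fraction of frames kept within the covered span (assumed definition)."""
    return num_samples / extent


# Same number of frames, two different regimes:
dense_short = sample_frame_indices(64, extent=16, num_samples=16)
sparse_long = sample_frame_indices(64, extent=64, num_samples=16)
```

Both clips contain 16 frames, but the first covers a short span at full density while the second covers the whole 64-frame video at a quarter of that density, which is the kind of trade-off a per-action sampling regime would choose between.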
Supervising professor
Yu Xiao, Prof., Aalto University, Department of Information and Communications Engineering, Finland
learning, video data, models
Other note
  • [Publication 1]: Petr Byvshev, Pham-An Truong, and Yu Xiao. 2020. Image-based Renovation Progress Inspection with Deep Siamese Networks. In Proceedings of the 12th International Conference on Machine Learning and Computing (ICMLC 2020). Association for Computing Machinery, New York, NY, USA, 96–104.
    DOI: 10.1145/3383972.3384036
  • [Publication 2]: Petr Byvshev, Pascal Mettes, and Yu Xiao. 2020. Heterogeneous Non-Local Fusion for Multimodal Activity Recognition. In Proceedings of the International Conference on Multimedia Retrieval (ICMR 2020). Association for Computing Machinery, New York, NY, USA, 63–72.
    DOI: 10.1145/3372278.3390675
  • [Publication 3]: Petr Byvshev, Pascal Mettes, and Yu Xiao. 2022. Are 3D convolutional networks inherently biased towards appearance? Computer Vision and Image Understanding, Volume 220, Article 103437 (July 2022), 12 pages.
    DOI: 10.1016/j.cviu.2022.103437
  • [Publication 4]: Clayton Frederick Souza Leite, Petr Byvshev, Henry Mauranen, and Yu Xiao. 2022. Simulation-driven Design of Smart Gloves for Gesture Recognition. Manuscript submitted for publication, 30 pages.
    DOI: 10.2139/ssrn.4195252
  • [Publication 5]: Petr Byvshev, Robert-Jan Bruintjes, Xin Liu, Strafforello Ombretta, Jan van Gemert, Pascal Mettes, and Yu Xiao. 2022. The Density-Extent Map of Video Representation Learning. Manuscript submitted for publication, 11 pages.