Motion and Appearance Representation Learning of Human Activities From Videos and Wearable Sensor

dc.contributorAalto Universityen
dc.contributor.authorByvshev, Petr
dc.contributor.departmentInformaatio- ja tietoliikennetekniikan laitosfi
dc.contributor.departmentDepartment of Information and Communications Engineeringen
dc.contributor.labMobile Cloud Computingen
dc.contributor.schoolSähkötekniikan korkeakoulufi
dc.contributor.schoolSchool of Electrical Engineeringen
dc.contributor.supervisorYu Xiao, Prof., Aalto University, Department of Information and Communications Engineering, Finland
dc.description.abstractRecently we have observed substantial growth in video data and its consumption, and the most successful video models are deep learning networks trained on large-scale video datasets. The aim of this thesis is to enhance the video representations learned with such deep learning networks. Noting that three-dimensional (3D) models inherited their design from two-dimensional (2D) image understanding models, the goal of this project is to identify the differences that come with the temporal dimension by studying how the models learn temporal dependencies. First, we explored how temporal information can be learned from individual frames that cover a continuous process. This was framed as an ordinal image classification problem in which the classes represent the sequence of stages that the process goes through. Siamese networks were used to extract temporal information from pairs of images and to improve the ordinal classification and the temporal precision of the prediction. Second, we investigated how to make video activity recognition more robust to noise and occlusions by using complementary information provided by smart gloves. We proposed a heterogeneous fusion framed as a non-local operation in which one modality served to enforce the other, which improved activity recognition. Third, we developed a simulation-driven platform for creating smart-glove-based human activity recognition systems by effectively using large pools of video data to create synthetic smart-glove sensor data. Subsequently, when we applied 3D deep video features, we observed a bias towards appearance information rather than motion. To identify the source of that phenomenon, we designed a temporality measure for 3D convolutional networks at both the layer level and the kernel level. Our analysis suggests that 3D architectures are not inherently biased towards appearance.
When trained on the most prevalent video datasets, 3D convolutional networks are indeed biased throughout, especially in the final layers of the network. However, when trained on data in which motion and appearance are explicitly decoupled and balanced, or on data with more pronounced temporal information, such networks adapted to varying levels of temporality. Lastly, two fundamental factors of sampling regimes for training deep networks were identified: frame density and temporal extent. We outlined an approach to estimate the preferred sampling regime at the level of individual actions in a data-driven manner. We concluded that video understanding models may benefit from considering the varying nature of video sampling regimes. This thesis contributes to the pool of methods that aim to extract temporal information from videos in an optimal manner and combine it with other sensory modalities. We illustrated a clear connection between the training video data and the network’s ability to model dynamic patterns in the videos. We believe this thesis advances network architecture design and brings insights into open video data and its impact on the learned video representations.en
dc.format.extent78 + app. 88
dc.identifier.isbn978-952-64-1193-4 (electronic)
dc.identifier.isbn978-952-64-1192-7 (printed)
dc.identifier.issn1799-4942 (electronic)
dc.identifier.issn1799-4934 (printed)
dc.identifier.issn1799-4934 (ISSN-L)
dc.opnGong, Shaogang, Prof., Queen Mary University of London, UK
dc.publisherAalto Universityen
dc.relation.haspart[Publication 1]: Petr Byvshev, Pham-An Truong, and Yu Xiao. 2020. Image-based Renovation Progress Inspection with Deep Siamese Networks. In Proceedings of the 12th International Conference on Machine Learning and Computing (ICMLC 2020). Association for Computing Machinery, New York, NY, USA, 96–104. Full text in Acris/Aaltodoc: DOI: 10.1145/3383972.3384036
dc.relation.haspart[Publication 2]: Petr Byvshev, Pascal Mettes, and Yu Xiao. 2020. Heterogeneous Non-Local Fusion for Multimodal Activity Recognition. In Proceedings of the International Conference on Multimedia Retrieval (ICMR 2020). Association for Computing Machinery, New York, NY, USA, 63–72. Full text in Acris/Aaltodoc: DOI: 10.1145/3372278.3390675
dc.relation.haspart[Publication 3]: Petr Byvshev, Pascal Mettes, and Yu Xiao. 2022. Are 3D convolutional networks inherently biased towards appearance? Computer Vision and Image Understanding Volume 220 (103437), Issue C (Jul 2022, 12 pages). Full text in Acris/Aaltodoc: DOI: 10.1016/j.cviu.2022.103437
dc.relation.haspart[Publication 4]: Clayton Frederick Souza Leite, Petr Byvshev, Henry Mauranen and Yu Xiao. 2022. Simulation-driven Design of Smart Gloves for Gesture Recognition. (30 pages), Manuscript submitted for publication. DOI: 10.2139/ssrn.4195252
dc.relation.haspart[Publication 5]: Petr Byvshev, Robert-Jan Bruintjes, Xin Liu, Strafforello Ombretta, Jan van Gemert, Pascal Mettes and Yu Xiao. 2022. The Density-Extent Map of Video Representation Learning. (11 pages), Manuscript submitted for publication
dc.relation.ispartofseriesAalto University publication series DOCTORAL THESESen
dc.revGall, Juergen, Dr., Max Planck Institute for Intelligent Systems, Germany
dc.subject.keywordvideo dataen
dc.subject.otherElectrical engineeringen
dc.titleMotion and Appearance Representation Learning of Human Activities From Videos and Wearable Sensoren
dc.typeG5 Artikkeliväitöskirjafi
dc.type.ontasotDoctoral dissertation (article-based)en
dc.type.ontasotVäitöskirja (artikkeli)fi
local.aalto.acrisexportstatuschecked 2023-04-20_1100