Spatio-Temporal Video Grounding using Transformers
School of Science | Master's thesis
Machine Learning, Data Science and Artificial Intelligence
Master’s Programme in Computer, Communication and Information Sciences
Abstract

Artificial intelligence has seen significant advances, particularly in Computer Vision and Natural Language Processing. Among the tasks that have emerged at their intersection, Spatio-Temporal Video Grounding (STVG) aims to localize the segment of a video, in both space and time, that corresponds to a given textual description. This thesis examines the challenges of aligning video content with text, especially under the ambiguities inherent in natural language. Through a comprehensive study of STCAT, an encoder-decoder Transformer-based STVG architecture, this research investigates modifications to the architecture and evaluates the performance of the resulting variants. The study is guided by three research questions, targeting modifications to the Anchor Queries module, alterations to the Attention Unit, and the model's robustness to varying object and human sizes within datasets. By building on state-of-the-art models and methodologies, this research contributes to the ongoing development of video understanding, particularly STVG. The findings, presented through quantitative and qualitative analyses, offer insights into potential enhancements and future directions in the domain.
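To make the task concrete: STVG models are commonly evaluated with a volume IoU (vIoU) score, which compares the predicted spatio-temporal tube (a bounding box per frame over a predicted time span) against the ground-truth tube. The sketch below is illustrative only, assuming tubes are represented as frame-index-to-box dictionaries; it is not the thesis's evaluation code.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU: per-frame IoU summed over temporally overlapping frames,
    normalized by the union of predicted and ground-truth frames."""
    overlap = set(pred) & set(gt)
    all_frames = set(pred) | set(gt)
    if not all_frames:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in overlap) / len(all_frames)
```

A temporally mis-aligned prediction is thus penalized even if its boxes are perfect on the overlapping frames, since the normalizer counts every frame in either tube.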
Thesis advisor: Pehlivan Tort, Selen
Keywords: spatio-temporal video grounding, temporal video grounding, transformer, STCAT, VidSTG, HCSTVG