Spatio-Temporal Video Grounding using Transformers

Perustieteiden korkeakoulu (School of Science) | Master's thesis
Date
2023-10-09
Major/Subject
Machine Learning, Data Science and Artificial Intelligence
Mcode
SCI3044
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
Language
English (en)
Pages
63
Abstract
Artificial intelligence has advanced rapidly, particularly in Computer Vision and Natural Language Processing. Among the many tasks that have emerged, Spatio-Temporal Video Grounding (STVG) aims to localize, in both space and time, the video segment that corresponds to a given textual description. This thesis examines the challenges of aligning video content with text, especially in the face of ambiguities inherent in natural language. Through a comprehensive study of STCAT, an encoder-decoder Transformer-based STVG model, this research investigates modifications to its architecture and evaluates the performance of the resulting variants. The study is guided by three research questions targeting modifications to the Anchor Queries module, alterations to the Attention Unit, and the model's adaptability to varying object and human sizes across datasets. By building on state-of-the-art models and methodologies, this research contributes to ongoing work in video understanding, particularly in STVG. The findings, presented through quantitative and qualitative analyses, offer insights into potential enhancements and future directions in the domain.
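The anchor-query mechanism mentioned in the abstract can be illustrated with a minimal, dependency-free sketch: a small set of learned query vectors attends over a sequence of fused video-text features via scaled dot-product cross-attention, in the spirit of DETR-style Transformer decoders. This is a hypothetical illustration of the general technique, not STCAT's actual implementation; all names and shapes below are assumptions for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(anchor_queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    anchor_queries: list of N_q query vectors (the learned anchors)
    keys, values:   lists of T key/value vectors (fused video-text features)
    Returns N_q output vectors, each a softmax-weighted mix of the values.
    """
    d = len(anchor_queries[0])  # feature dimension, used for scaling
    outputs = []
    for q in anchor_queries:
        # Similarity of this anchor query to every time step, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors
        outputs.append([sum(weights[t] * values[t][j] for t in range(len(values)))
                        for j in range(len(values[0]))])
    return outputs

# Toy example: two anchor queries, two feature vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys    = [[10.0, 0.0], [0.0, 10.0]]
values  = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention(queries, keys, values)
```

In a full STVG decoder, the attended outputs would then pass through feed-forward layers and prediction heads that regress bounding boxes and temporal boundaries per query; this sketch only captures the attention step itself.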
Supervisor
Laaksonen, Jorma
Thesis advisor
Pehlivan Tort, Selen
Keywords
temporal video grounding, VidSTG, spatio-temporal video grounding, STCAT, transformer, HCSTVG