Spatio-Temporal Video Grounding using Transformers


Perustieteiden korkeakoulu (School of Science) | Master's thesis

Date

2023-10-09

Major/Subject

Machine Learning, Data Science and Artificial Intelligence

Mcode

SCI3044

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

63

Abstract

Artificial intelligence has advanced significantly, particularly in Computer Vision and Natural Language Processing. Among the many tasks that have emerged, Spatio-Temporal Video Grounding (STVG) stands out: it aims to localize video segments in both spatial and temporal dimensions that correspond to a given textual description. This thesis examines STVG, focusing on the challenges of aligning video content with textual descriptions, especially given the ambiguities inherent in natural language. Through a comprehensive study of STCAT, an encoder-decoder Transformer-based STVG architecture, this research investigates modifications to the model and evaluates the performance of the resulting variants. The study is guided by three research questions, targeting modifications to the Anchor Queries module, alterations to the Attention Unit, and the model's adaptability to varying object and human sizes across datasets. By building on state-of-the-art models and methodologies, this research contributes to the ongoing development of video understanding, particularly STVG. The findings, presented through quantitative and qualitative analyses, offer insights into potential enhancements and future directions in the domain.

Supervisor

Laaksonen, Jorma

Thesis advisor

Pehlivan Tort, Selen

Keywords

temporal video grounding, VidSTG, spatio-temporal video grounding, STCAT, transformer, HCSTVG
