Size-Modulated Deformable Attention in Spatio-Temporal Video Grounding Pipelines
Access rights
openAccess
acceptedVersion
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
View publication in the Research portal (opens in new window)
View/Open full text file from the Research portal (opens in new window)
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Language
en
Series
Pattern Recognition - 27th International Conference, ICPR 2024, Proceedings, pp. 308-324, Lecture Notes in Computer Science ; Volume 15318
Abstract
The integration of attention mechanisms into computer vision tasks, inspired by the success of Transformers in natural language processing, has revolutionized various applications such as object detection and visual grounding. In this paper, we focus on spatio-temporal video grounding (STVG), a computer vision task that aims to jointly extract spatial and temporal regions from videos based on textual descriptions. Leveraging recent advancements in attention-based Transformer architectures, particularly in object detectors, and building upon a recent baseline model, we integrate two enhancements in attention modules: Width-Height Modulation and Deformable Attention units. These enhancements aim to improve the accuracy and efficiency of STVG techniques in two datasets, HC-STVG and VidSTG, by addressing challenges related to feature inconsistencies and prediction reliability across video frames. As a result, our study contributes to advancing the baseline models in spatio-temporal video grounding, bridging the gap between computer vision and natural language processing domains.
Citation
Tiwari, H, Pehlivan Tort, S & Laaksonen, J 2024, Size-Modulated Deformable Attention in Spatio-Temporal Video Grounding Pipelines. in A Antonacopoulos, S Chaudhuri, R Chellappa, C-L Liu, S Bhattacharya & U Pal (eds), Pattern Recognition - 27th International Conference, ICPR 2024, Proceedings. Lecture Notes in Computer Science, vol. 15318, Springer, pp. 308-324, International Conference on Pattern Recognition, Kolkata, India, 01/12/2024. https://doi.org/10.1007/978-3-031-78456-9_20