Dense video captioning: Update module and cosine constraint on transformers


Volume Title

Perustieteiden korkeakoulu | Master's thesis

Mcode

SCI3044

Language

en

Pages

44

Abstract

Transformer-based models are widely adopted in multi-modal learning, as the cross-attention mechanism has been shown to produce effective representations across modalities. The attention mechanism takes two modalities as the queries and keys and maps their combination into the query domain. This thesis studies the use of the attention mechanism specifically for the dense video captioning task, which concentrates on generating a paragraph describing the events in a video segment. When applying the attention mechanism to dense video captioning, the textual and visual contexts are normally taken as the queries and keys, respectively. Additionally, the vision-language contexts from the current segment and the history segments can also serve as the queries and keys. Conceptually, the combination of the queries and the keys is treated as the cross-attentive output used to predict the next word in the caption. However, the queries and the cross-attentive outputs can be weakly correlated or misaligned, in which case the resulting representations become a confusing signal for the prediction. In this thesis, a novel update module is proposed for Transformer-based models that exploits the similarities between the queries and the cross-attentive outputs. The proposed module refines the attentive outputs by either interpolating or extrapolating between the queries and the attentive outputs, without adding extra learnable parameters. To prevent weakly correlated or misaligned conditions, a constraint is imposed in the update module to modulate the similarities between the queries and the cross-attentive outputs. Experiments are performed on two benchmark datasets, ActivityNet Captions and YouCook2, and the results show the effectiveness of the method on two Transformer baselines. To evaluate how the proposed method generalizes to different features, two types of video features are tested.
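The abstract does not spell out the exact update rule, so the following is only a minimal sketch of one parameter-free, cosine-weighted blend that is consistent with the description; the function name, the linear blend, and the treatment of negative similarities are all assumptions, not the thesis's actual formulation:

```python
import numpy as np

def cosine_update(queries, outputs, eps=1e-8):
    """Hypothetical parameter-free update module.

    Blends each cross-attentive output with its query, weighted by
    their cosine similarity: a similarity near 1 keeps the attentive
    output, a low similarity pulls the result back toward the query
    (interpolation), and a negative similarity pushes it past the
    query (extrapolation). No learnable parameters are introduced.
    """
    # Per-position cosine similarity between query and attentive output.
    num = np.sum(queries * outputs, axis=-1, keepdims=True)
    denom = (np.linalg.norm(queries, axis=-1, keepdims=True)
             * np.linalg.norm(outputs, axis=-1, keepdims=True) + eps)
    sim = num / denom  # in [-1, 1]
    # Linear blend: convex combination for sim in [0, 1],
    # extrapolation past the query when sim < 0.
    return sim * outputs + (1.0 - sim) * queries
```

Under this sketch, the cosine constraint mentioned in the abstract would act on `sim`, penalizing or clipping weakly aligned query-output pairs before the blend is applied.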

Supervisor

Laaksonen, Jorma

Thesis advisor

Wang, Tzu-Jui Julius
