Domain-Agnostic Multi-Modal Video Retrieval


Perustieteiden korkeakoulu | Master's thesis

Date

2023-10-09

Major/Subject

Machine Learning, Data Science and Artificial Intelligence (Macadamia)

Mcode

SCI3044

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

56

Abstract

The rapid proliferation of multimedia content has necessitated the development of efficient video retrieval systems. Multi-modal video retrieval is a non-trivial task involving the retrieval of relevant information across different modalities, such as text, audio, and video. Traditional approaches to multi-modal retrieval often rely on domain-specific techniques and models, limiting their generalizability across domains. This thesis aims to develop a domain-agnostic approach for multi-modal video retrieval, enabling effective retrieval irrespective of the specific domain or data modality. The research explores techniques such as transfer learning, where pre-trained models from different domains are fine-tuned using domain-agnostic strategies. Additionally, attention mechanisms and fusion techniques are investigated to leverage cross-modal interactions and capture relevant information from diverse modalities. An important aspect of the research is finding robust methods for audio-video integration, as each modality individually provides retrieval cues for the text query. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the text and audio-video features. The proposed approach is quantitatively evaluated on standard video benchmark datasets such as MSR-VTT and YouCook2. The results show that the approach not only matches state-of-the-art methods but also outperforms them in certain scenarios, with a notable 6% improvement in R@5 and R@10 metrics in the best-performing cases. Qualitative evaluations further illustrate the utility of audio, especially in instances where there is a direct word match between text and audio, exemplified by queries like "A man is calling his colleagues" aligning with video audio containing the word "colleague".
In essence, the findings of this research pave the way for a versatile and integrated solution for multi-modal retrieval, with potential applications spanning a wide range of domains.
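The mutual-information objective described in the abstract is commonly realised as a symmetric contrastive (InfoNCE) loss over a batch of matched text and fused audio-video embeddings. The sketch below is an illustrative NumPy version of that general technique, not the thesis's actual implementation; the function name, batch layout, and temperature value are assumptions.

```python
import numpy as np

def info_nce_loss(text_emb, av_emb, temperature=0.07):
    """Symmetric InfoNCE loss between text embeddings and fused
    audio-video embeddings; row i of each array is a matched pair."""
    # L2-normalise so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = av_emb / np.linalg.norm(av_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature       # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])  # matched pairs lie on the diagonal

    def xent(lg):
        # Cross-entropy of the diagonal entries, computed stably.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the text-to-video and video-to-text retrieval directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimising this loss pulls each text embedding toward its paired audio-video embedding and pushes it away from the other pairs in the batch, which is one standard way to lower-bound the mutual information between the two representations.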

Supervisor

Laaksonen, Jorma

Thesis advisor

Pehlivan Tort, Selen

Keywords

fusion, transformers, contrastive learning, unified, video retrieval, multi-modal
