Domain-Agnostic Multi-Modal Video Retrieval
School of Science | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Machine Learning, Data Science and Artificial Intelligence (Macademia)
Master’s Programme in Computer, Communication and Information Sciences
Abstract

The rapid proliferation of multimedia content has necessitated the development of efficient video retrieval systems. Multi-modal video retrieval is a non-trivial task involving the retrieval of relevant information across different modalities, such as text, audio, and video. Traditional approaches to multi-modal retrieval often rely on domain-specific techniques and models, limiting their generalizability across domains. This thesis aims to develop a domain-agnostic approach for multi-modal video retrieval, enabling effective retrieval irrespective of the specific domain or data modality. The research explores techniques such as transfer learning, where pre-trained models from different domains are fine-tuned using domain-agnostic strategies. Additionally, attention mechanisms and fusion techniques are investigated to leverage cross-modal interactions and capture relevant information from diverse modalities. An important aspect of the research is finding robust methods for audio-video integration, as both modalities individually provide retrieval cues for the text query. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the text and the audio-video features. The proposed approach is quantitatively evaluated on video benchmark datasets such as MSR-VTT and YouCook2. The results show that the approach not only holds its own against state-of-the-art methods but also outperforms them in certain scenarios, with a notable 6% improvement in the R@5 and R@10 metrics in the best-performing cases. Qualitative evaluations further illustrate the utility of audio, especially in instances where there is a direct word match between text and audio, exemplified by queries like "A man is calling his colleagues" aligning with video audio containing the word "colleague".
In essence, the findings of this research pave the way for a versatile and integrated solution for multi-modal retrieval, with potential applications spanning a wide range of domains.
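An objective of this kind is commonly implemented as a symmetric InfoNCE (contrastive) loss between text embeddings and fused audio-video embeddings, where matched pairs in a batch are positives and all other pairs are negatives; minimizing it maximizes a lower bound on the mutual information between the modalities. The following is a minimal NumPy sketch of that general technique, not the thesis's actual implementation; all names, shapes, and the temperature value are illustrative assumptions:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def symmetric_infonce(text_emb, av_emb, temperature=0.07):
    """Symmetric InfoNCE loss between text and fused audio-video embeddings.

    Matched (text_i, av_i) rows are positives; all other pairs in the batch
    act as negatives. Both inputs are (batch, dim) arrays; shapes and the
    temperature are illustrative choices, not taken from the thesis.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(av_emb)
    logits = t @ v.T / temperature          # (B, B) cosine-similarity matrix
    labels = np.arange(len(logits))         # positives lie on the diagonal

    def ce(lg):
        # Numerically stable cross-entropy with the diagonal as targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the text-to-video and video-to-text retrieval directions.
    return 0.5 * (ce(logits) + ce(logits.T))

# Toy batch: 4 text queries paired with 4 fused audio-video features.
rng = np.random.default_rng(0)
loss = symmetric_infonce(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```

Averaging both directions keeps the loss symmetric, so the model is trained for text-to-video retrieval and the reverse ranking task at the same time.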
Thesis advisor: Pehlivan Tort, Selen
Keywords: fusion, transformers, contrastive learning, unified, video retrieval, multi-modal