Domain-Agnostic Multi-Modal Video Retrieval

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Pehlivan Tort, Selen
dc.contributor.author: Arora, Pranav
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.supervisor: Laaksonen, Jorma
dc.date.accessioned: 2023-10-15T17:15:19Z
dc.date.available: 2023-10-15T17:15:19Z
dc.date.issued: 2023-10-09
dc.description.abstract [en]: The rapid proliferation of multimedia content has necessitated efficient video retrieval systems. Multi-modal video retrieval is a non-trivial task involving the retrieval of relevant information across different modalities, such as text, audio, and video. Traditional approaches to multi-modal retrieval often rely on domain-specific techniques and models, limiting their generalizability across domains. This thesis develops a domain-agnostic approach to multi-modal video retrieval, enabling effective retrieval irrespective of the specific domain or data modality. The research explores techniques such as transfer learning, where pre-trained models from different domains are fine-tuned using domain-agnostic strategies. Attention mechanisms and fusion techniques are also investigated to leverage cross-modal interactions and capture relevant information from diverse modalities. An important aspect of the research is finding robust methods for audio-video integration, since each modality individually provides retrieval cues for the text query. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between text and audio-video features. The proposed approach is quantitatively evaluated on video benchmark datasets such as MSR-VTT and YouCook2. The results show that the approach not only holds its own against state-of-the-art methods but also outperforms them in certain scenarios, with a notable 6% improvement in the R@5 and R@10 metrics in the best-performing cases. Qualitative evaluations further illustrate the utility of audio, especially where there is a direct word match between text and audio, exemplified by queries like "A man is calling his colleagues" aligning with video audio containing the word "colleague".
In essence, the findings of this research pave the way for a versatile and integrated solution for multi-modal retrieval, with potential applications spanning a wide range of domains.
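The abstract's emphasis on increasing the mutual information between text and audio-video features suggests a contrastive (InfoNCE-style) training objective over paired embeddings. The sketch below is illustrative only, assuming pre-computed, batch-aligned embeddings; the function names (`fuse_audio_video`, `symmetric_infonce`), the simple averaging fusion, and the temperature value are hypothetical stand-ins, not the thesis's actual architecture or loss:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fuse_audio_video(audio_emb, video_emb):
    """Illustrative late fusion: average the normalized modality embeddings,
    then re-normalize the result (an assumed, simple fusion scheme)."""
    return l2_normalize(l2_normalize(audio_emb) + l2_normalize(video_emb))

def symmetric_infonce(text_emb, av_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched (text, audio-video) pairs sit on the
    diagonal of the similarity matrix; all other batch entries are negatives."""
    t = l2_normalize(text_emb)
    v = l2_normalize(av_emb)
    logits = t @ v.T / temperature          # cosine-similarity logits
    n = logits.shape[0]
    idx = np.arange(n)

    def xent(lg):
        # Row-wise cross-entropy against the diagonal (matched-pair) targets.
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the text-to-video and video-to-text retrieval directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls matched text and audio-video embeddings together and pushes mismatched ones apart, which is one standard way to lower-bound and thereby increase cross-modal mutual information.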
dc.format.extent: 56
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/124101
dc.identifier.urn: URN:NBN:fi:aalto-202310156444
dc.language.iso: en
dc.programme: Master's Programme in Computer, Communication and Information Sciences [fi]
dc.programme.major: Machine Learning, Data Science and Artificial Intelligence (Macademia) [fi]
dc.programme.mcode: SCI3044 [fi]
dc.subject.keyword: fusion [en]
dc.subject.keyword: transformers [en]
dc.subject.keyword: contrastive learning [en]
dc.subject.keyword: unified [en]
dc.subject.keyword: video retrieval [en]
dc.subject.keyword: multi-modal [en]
dc.title: Domain-Agnostic Multi-Modal Video Retrieval [en]
dc.type: G2 Pro gradu, diplomityö [fi]
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Diplomityö [fi]
local.aalto.electroniconly: yes
local.aalto.openaccess: yes
Files
Original bundle
Name: master_Arora_Pranav_2023.pdf
Size: 21.66 MB
Format: Adobe Portable Document Format