Unleashing the Potential of LLMs for Audio-Visual Question Answering
School of Science |
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Mcode
SCI3042
Language
en
Pages
69+1
Abstract
Current audio-visual question answering (AVQA) methods are hindered by the scarcity of open-ended AVQA datasets. Most existing datasets rely on classification and multiple-choice tasks rather than exploiting large language models (LLMs) to generate answers to open-ended questions. This thesis addresses this gap by using ChatGPT-3.5 to construct an unbiased dataset, VALOR32K-AVQA, derived from the existing VALOR32K captioning dataset. Both the original dataset and the generated questions and answers cover the audio and visual modalities and span a diverse range of real-world scenarios. Specifically, ChatGPT-3.5 is used to generate open-ended questions and answers from video captions, classify them into visual, audio, or audio-visual categories, and group them by various aspects. Furthermore, a LLaMA-based multimodal AVQA model, LLaMA-AVQA, is introduced to enhance generative question-answering capabilities. LLaMA-AVQA takes video and audio inputs and provides audio-visual instructions for multimodal reasoning. Empirical evaluations show that, in VALOR32K-AVQA, both the audio and video modalities contribute substantially to answering questions about real-world scenarios, in this respect surpassing current datasets, whose applicability is more limited. Additionally, LLaMA-AVQA performs remarkably well on purely generative tasks, a capability not yet explored by current AVQA methods, and achieves performance comparable to the current state of the art on multiple-choice tasks.