Unleashing the Potential of LLMs for Audio-Visual Question Answering

School of Science (Perustieteiden korkeakoulu) | Master's thesis

Mcode

SCI3042

Language

en

Pages

69+1

Abstract

Current audio-visual question answering (AVQA) methods are hindered by the scarcity of open-ended AVQA datasets. Most existing datasets rely on classification and multiple-choice tasks rather than exploiting large language models (LLMs) to generate answers to open-ended questions. This thesis addresses this gap by using ChatGPT-3.5 to construct VALOR32K-AVQA, an unbiased dataset derived from the existing VALOR32K captioning dataset. Both the original dataset and the generated questions and answers cover the audio and visual modalities and span a diverse range of real-world scenarios. Specifically, ChatGPT-3.5 is used to generate open-ended questions and answers from video captions, classify them into visual, audio, or audio-visual categories, and group them by various aspects. In addition, LLaMA-AVQA, a LLaMA-based multimodal AVQA model, is introduced to enhance generative question answering. LLaMA-AVQA takes video and audio inputs and provides audio-visual instructions for multimodal reasoning. Empirical evaluations on VALOR32K-AVQA show that both the audio and visual modalities play a significant role in answering questions about real-world scenarios, an advantage over current datasets, whose design limits their applicability. LLaMA-AVQA also performs strongly on purely generative tasks, a capability not yet explored by current AVQA methods, and achieves performance comparable to the current state of the art on multiple-choice tasks.

Supervisor

Laaksonen, Jorma

Thesis advisor

Saif, Abduljalil
