AV-PEA: Parameter-Efficient Adapter for Audio-Visual Multimodal Learning
Access rights
openAccess
CC BY-NC-ND
publishedVersion
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
View publication in the Research portal
View/Open full text file from the Research portal
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Authors
Radman, A.; Laaksonen, J.
Date
2024
Language
en
Pages
8
Series
Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP, pp. 730-737, Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Abstract
Fine-tuning has emerged as a widely used transfer learning technique for leveraging pre-trained vision transformers in various downstream tasks. However, its success relies on tuning a large number of trainable parameters, which can incur substantial costs in both model training and storage. In audio-visual multimodal learning, a further challenge lies in effectively incorporating both audio and visual cues into the transfer learning process, especially when the original model has been trained on unimodal samples only. This paper introduces a novel audio-visual parameter-efficient adapter (AV-PEA) designed to improve multimodal transfer learning for audio-visual tasks. By integrating AV-PEA into a frozen vision transformer, such as the Vision Transformer (ViT), the transformer becomes adept at processing audio inputs without any prior audio pre-training. The adapter also facilitates the exchange of essential cues between the audio and visual modalities, while introducing only a limited set of trainable parameters into each block of the frozen transformer. The experimental results demonstrate that AV-PEA consistently achieves superior or comparable performance to state-of-the-art methods across a range of audio-visual tasks, including audio-visual event localization (AVEL), audio-visual question answering (AVQA), audio-visual retrieval (AVR), and audio-visual captioning (AVC). Furthermore, it distinguishes itself from competitors by enabling seamless integration into these tasks while maintaining a consistent number of trainable parameters, typically less than 3.7% of the total parameters per task.
Description
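To make the adapter idea concrete, below is a minimal PyTorch sketch of a bottleneck-style audio-visual adapter attached to a frozen transformer block. Everything here is an illustrative assumption, not the paper's actual architecture: the class name AVPEABlock, the bottleneck width, and the mean-pooled cross-modal exchange are hypothetical stand-ins for the mechanism the abstract describes.

import torch
import torch.nn as nn

class AVPEABlock(nn.Module):
    """Hypothetical sketch of a parameter-efficient audio-visual adapter.

    Assumptions (not taken from the paper): one bottleneck projection
    per modality plus a lightweight cross-modal exchange via pooled
    summaries. Only the adapter parameters are trainable; the host
    ViT block stays frozen.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        # Down/up projections keep the trainable parameter count small
        # relative to the frozen transformer block they attach to.
        self.down_v = nn.Linear(dim, bottleneck)
        self.down_a = nn.Linear(dim, bottleneck)
        self.up_v = nn.Linear(bottleneck, dim)
        self.up_a = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis, aud: (batch, tokens, dim) token sequences emitted by the
        # frozen block for the visual and audio streams.
        v = self.act(self.down_v(vis))
        a = self.act(self.down_a(aud))
        # Cross-modal exchange: each modality receives the other's
        # mean-pooled summary (a simple stand-in for the paper's
        # audio-visual cue exchange).
        v_ctx = a.mean(dim=1, keepdim=True)
        a_ctx = v.mean(dim=1, keepdim=True)
        # Residual updates keep the frozen features intact.
        vis = vis + self.up_v(v + v_ctx)
        aud = aud + self.up_a(a + a_ctx)
        return vis, aud

In such a setup, only the adapter projections would be optimized while the backbone parameters are frozen (e.g., by setting requires_grad = False on the ViT), which is how the trainable share can stay at a few percent of the total, in line with the sub-3.7% figure reported in the abstract.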
Publisher Copyright: © 2024 by SciTePress. Paper published under CC license (CC BY-NC-ND 4.0).
Citation
Radman, A & Laaksonen, J 2024, AV-PEA: Parameter-Efficient Adapter for Audio-Visual Multimodal Learning. in Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP. Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, SciTePress, pp. 730-737, International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Rome, Italy, 27/02/2024. https://doi.org/10.5220/0012431500003660