Agent-based multimodal sentiment analysis with vision-language models



School of Electrical Engineering | Master's thesis


Language

en

Pages

49

Abstract

Multimodal sentiment analysis (MSA) aims to understand human emotions by integrating cues from text, audio, and visual data. The rise of large language models (LLMs) and vision-language models (VLMs) has significantly enhanced multimodal understanding, but these models demand substantial computational resources and are often difficult to interpret, making them hard to adapt to specific tasks or domains. This thesis introduces an agent-based MSA framework that coordinates locally fine-tuned VLMs with specialized traditional models. The system combines lightweight feature extraction, parameter-efficient fine-tuning (PEFT) with LoRA, and modular agents to handle text, audio, and video. Beyond locally fine-tuned models, the design also lays the groundwork for future integration with API-based online multimodal systems. Experiments on the CMU-MOSEI dataset show that the locally fine-tuned models, especially Qwen2-VL with LoRA, achieve performance comparable to traditional multimodal systems (such as Transformer-based architectures) while preserving control and data privacy. These results indicate that parameter-efficient fine-tuning can make large models as effective as smaller, specialized multimodal models. The study further examines model performance across datasets, the impact of ablating system components, and how the modular agent design supports flexible multimodal processing. Overall, the results show that efficient local coordination of VLMs can nearly match specialized systems, providing a scalable, interpretable, and privacy-preserving foundation for real-world affective computing applications.
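The abstract's central technique is LoRA, a PEFT method in which the pretrained weights stay frozen and only a low-rank additive update is trained. A minimal sketch of one LoRA-adapted linear layer is below; the dimensions, rank, and scaling factor are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 16, 16, 4, 8  # illustrative sizes, not from the thesis

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))                   # trainable; zero-init so the
                                              # adapter starts as a no-op

def lora_forward(x):
    # y = W x + (alpha / r) * B A x  -- only A and B receive gradients,
    # so trainable parameters drop from d_out*d_in to rank*(d_in + d_out).
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Here the adapter trains 128 parameters instead of the 256 in the full weight matrix; at the scale of a VLM such as Qwen2-VL the same ratio shrinks the trainable set by orders of magnitude, which is what makes local fine-tuning on modest hardware feasible.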

Supervisor

Zhou, Quan

Thesis advisor

Wang, Haoming
