RGB-Th-Bench: A Dense Benchmark for Visual-Thermal Understanding of Vision Language Models

School of Electrical Engineering | Master's thesis

Language

en

Pages

58

Abstract

Recent advancements in Vision-Language Models (VLMs) have demonstrated their potential to tackle complex multimodal tasks. However, the performance of VLMs on RGB-thermal image pairs remains underexplored, despite the growing need for robust multimodal understanding in domains such as heat loss detection, anomaly detection, and risk assessment. To address this gap, we introduce RGB-Th-Bench, a novel benchmark designed to comprehensively evaluate VLM capabilities on RGB-thermal paired data. The benchmark includes over 1,800 expertly curated Yes/No questions spanning 16 skill dimensions, together with a dual accuracy metric framework, question-level accuracy (QAcc) and skill-level accuracy (SAcc), to rigorously assess performance. Evaluation of two state-of-the-art open-source VLMs on RGB-Th-Bench reveals significant challenges in comprehending RGB-thermal data. The highest QAcc and SAcc achieved were below 60 percent and 9 percent, respectively, underscoring the complexity of RGB-thermal understanding. The results indicate that the lack of large-scale, application-specific, expert-annotated thermal-caption-pair datasets in pre-training is the most likely reason for the observed performance gap. Performance varied significantly across skill dimensions due to differences in task complexity and training data composition. Skills such as scene understanding and object presence, which align closely with pre-training data, exhibited stronger results, while more complex skills with less annotated data, such as instance spatial relations and instance interaction, showed notable gaps. Importantly, the stricter skill-level accuracy metric (SAcc) revealed additional weaknesses, particularly in scenarios with adversarial question structures. Finally, our experiments indicate that the tested VLMs exhibit excellent instruction-following behavior, strictly adhering to the Yes/No response format. These findings highlight the need for substantial advancements before VLMs can effectively act as multimodal agents in such applications. They also provide valuable insights into current VLM limitations and pave the way for future research to improve RGB-thermal multimodal understanding.
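
As a rough illustration of the dual metric framework described above, the short Python sketch below computes a question-level accuracy (QAcc) and a stricter skill-level accuracy (SAcc) over a list of per-question results. It is only a sketch under one plausible reading of the abstract: QAcc as the fraction of correctly answered questions, and SAcc as the fraction of skill dimensions in which every question is answered correctly. The function names, data layout, and grouping rule are illustrative assumptions; the authoritative definitions are given in the thesis.

# Minimal sketch (not from the thesis) of the dual accuracy metrics.
# Assumption: QAcc is the share of individual Yes/No questions answered
# correctly, and SAcc credits a skill dimension only when every question
# in it is correct; the exact grouping used in RGB-Th-Bench is defined
# in the thesis.

from collections import defaultdict

def question_level_accuracy(results):
    # results: list of (skill, is_correct) pairs, one entry per question
    return sum(is_correct for _, is_correct in results) / len(results)

def skill_level_accuracy(results):
    # a skill scores 1 only if all of its questions are answered correctly
    per_skill = defaultdict(list)
    for skill, is_correct in results:
        per_skill[skill].append(is_correct)
    return sum(all(answers) for answers in per_skill.values()) / len(per_skill)

# Illustrative example with three skills and five questions:
results = [
    ("scene understanding", True),
    ("scene understanding", True),
    ("object presence", True),
    ("instance spatial relations", False),
    ("instance spatial relations", True),
]
print(question_level_accuracy(results))  # 0.8
print(skill_level_accuracy(results))     # 0.666..., 2 of 3 skills fully correct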

Supervisor

Pajarinen, Joni

Thesis advisor

Khajavi Haghighat, Siavash
