RGB-Th-bench: A dense benchmark for visual-thermal understanding of vision language models

dc.contributor: Aalto-yliopisto (fi)
dc.contributor: Aalto University (en)
dc.contributor.advisor: Khajavi Haghighat, Siavash
dc.contributor.author: Moshtaghi, Mehdi
dc.contributor.school: Sähkötekniikan korkeakoulu (fi)
dc.contributor.school: School of Electrical Engineering (en)
dc.contributor.supervisor: Pajarinen, Joni
dc.date.accessioned: 2025-01-27T18:04:14Z
dc.date.available: 2025-01-27T18:04:14Z
dc.date.issued: 2024-12-26
dc.description.abstract: Recent advancements in Vision-Language Models (VLMs) have demonstrated their potential to tackle complex multimodal tasks. However, the performance of VLMs on RGB-thermal image pairs remains underexplored, despite the growing need for robust multimodal understanding in domains such as heat-loss detection, anomaly detection, and risk assessment. To address this gap, we introduce RGB-Th-Bench, a novel benchmark designed to comprehensively evaluate VLM capabilities on RGB-thermal paired data. The benchmark includes over 1,800 expertly curated Yes/No questions spanning 16 skill dimensions, with a dual accuracy framework, question-level accuracy (QAcc) and skill-level accuracy (SAcc), to rigorously assess performance. Evaluation of two state-of-the-art open-source VLMs on RGB-Th-Bench reveals significant challenges in comprehending RGB-thermal data: the highest QAcc and SAcc achieved were below 60 percent and 9 percent, respectively, underscoring the complexity of RGB-thermal understanding. The results suggest that the lack of large-scale, application-specific, expert-annotated thermal-caption-pair datasets in pre-training is the most likely reason for the observed performance gap. Performance varied significantly across skill dimensions owing to differences in task complexity and training-data composition: skills that align closely with pre-training data, such as scene understanding and object presence, showed stronger results, while more complex and sparsely annotated skills, such as instance spatial relations and instance interaction, showed notable gaps. Importantly, the stricter skill-level accuracy metric (SAcc) revealed additional weaknesses, particularly in scenarios with adversarial question structures. Finally, our experiments indicate that the tested VLMs exhibit excellent instruction-following behavior, strictly adhering to the Yes/No response format.
These findings highlight the need for substantial advancements before VLMs can act effectively as multimodal agents in such applications; they provide valuable insight into current VLM limitations and pave the way for future research on RGB-thermal multimodal understanding. (en)
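The abstract's dual-metric framework can be sketched in a few lines. This is an illustrative reading, not the thesis's actual scoring code: it assumes each question is tagged with its skill dimension, that QAcc is the fraction of individual Yes/No questions answered correctly, and that the stricter SAcc credits a skill only when every question in that skill is answered correctly (one plausible interpretation of why SAcc stays below 9 percent while QAcc nears 60; the thesis may group questions differently).

```python
from collections import defaultdict

def qacc_sacc(results):
    """Compute question-level (QAcc) and skill-level (SAcc) accuracy.

    results: list of (skill, is_correct) pairs, one per Yes/No question.
    QAcc: fraction of individual questions answered correctly.
    SAcc (assumed rule): a skill counts as passed only if *all* of its
    questions are answered correctly, making it far stricter than QAcc.
    """
    by_skill = defaultdict(list)
    for skill, correct in results:
        by_skill[skill].append(correct)
    total_questions = sum(len(v) for v in by_skill.values())
    qacc = sum(sum(v) for v in by_skill.values()) / total_questions
    sacc = sum(all(v) for v in by_skill.values()) / len(by_skill)
    return qacc, sacc
```

Under this rule a single wrong answer zeroes out an entire skill, which is consistent with the large gap the abstract reports between the two metrics.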
dc.format.extent: 58
dc.format.mimetype: application/pdf (en)
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/133626
dc.identifier.urn: URN:NBN:fi:aalto-202501271911
dc.language.iso: en (en)
dc.location: P1 (fi)
dc.programme: Master's Programme in ICT Innovation (en)
dc.programme.major: Autonomous Systems (en)
dc.subject.keyword: Vision Language Models (VLMs) (en)
dc.subject.keyword: Large Language Models (LLMs) (en)
dc.subject.keyword: thermal image (en)
dc.subject.keyword: benchmark (en)
dc.subject.keyword: dataset (en)
dc.subject.keyword: Visual Question Answering (VQA) (en)
dc.title: RGB-Th-bench: A dense benchmark for visual-thermal understanding of vision language models (en)
dc.type: G2 Pro gradu, diplomityö (fi)
dc.type.ontasot: Master's thesis (en)
dc.type.ontasot: Diplomityö (fi)
local.aalto.electroniconly: yes
local.aalto.openaccess: no
