Understanding the capabilities of vision language models on UML sequence diagrams

School of Electrical Engineering | Master's thesis

Language

en

Pages

97

Abstract

This thesis investigates the capabilities of vision language models (VLMs) to perform visual question answering (VQA) on UML sequence diagrams. For this purpose, a custom dataset of annotated sequence diagrams with varying levels of visual complexity was curated from documents. The dataset supports three main tasks: information extraction, location recognition, and sequence flow analysis. The evaluation focused on five factors: prompt type (zero-shot versus few-shot), context granularity (none, sparse, detailed), the effects of fine-tuning, task type, and diagram complexity. The experiments were carried out on the Gemma family of vision language models, with parameter-efficient fine-tuning applied to the 4B variant using LoRA adapters targeting both the vision and language components. The results indicate that detailed contextual input significantly enhances performance across all task types, while few-shot prompting yields only marginal gains beyond what detailed contextual information alone provides. Fine-tuning led to limited improvements in the current setup, indicating the need for further experimentation with training strategies. The findings highlight key bottlenecks in applying vision language models to basic perception tasks, including a strong reliance on handcrafted contextual input and performance that varies with diagram complexity. The thesis concludes by outlining several future research directions, such as full-model fine-tuning, training with longer context lengths, evaluating models with multi-token outputs that involve reasoning, and comparing across a wider range of model architectures and UML diagram types.
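The prompting conditions evaluated in the abstract (zero-shot versus few-shot, crossed with none/sparse/detailed context) can be sketched as a simple prompt builder. This is a minimal illustrative sketch: the template text, context strings, and example question–answer pairs below are assumptions for demonstration, not the thesis's actual prompts or data.

```python
# Hypothetical sketch of the compared prompting conditions:
# prompt type (zero-shot vs. few-shot) x context granularity (none/sparse/detailed).
# All wording here is illustrative, not taken from the thesis dataset.

CONTEXTS = {
    "none": "",
    "sparse": "The image is a UML sequence diagram.",
    "detailed": (
        "The image is a UML sequence diagram. Lifelines are vertical dashed "
        "lines under participant boxes; horizontal arrows between lifelines "
        "are messages, ordered top to bottom in time."
    ),
}

# Few-shot exemplars shown before the real question (hypothetical pairs).
FEW_SHOT_EXAMPLES = [
    ("How many lifelines are in the diagram?", "3"),
    ("Which participant sends the first message?", "Client"),
]

def build_prompt(question: str, context: str = "detailed",
                 few_shot: bool = False) -> str:
    """Assemble the text part of a VQA query about an attached diagram image."""
    parts = []
    if CONTEXTS[context]:
        parts.append(CONTEXTS[context])
    if few_shot:
        for q, a in FEW_SHOT_EXAMPLES:
            parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

For example, `build_prompt("What message follows 'login'?", context="none")` produces a bare zero-shot question, while `build_prompt(..., context="detailed", few_shot=True)` prepends the diagram description and the worked examples, matching the richest condition the abstract reports as most effective.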

Supervisor

Laaksonen, Jorma

Thesis advisor

Zumot, Laith
