Understanding the capabilities of vision language models on UML sequence diagrams

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Zumot, Laith
dc.contributor.author: Abdul Rehman, Ali
dc.contributor.school: Sähkötekniikan korkeakoulu [fi]
dc.contributor.school: School of Electrical Engineering [en]
dc.contributor.supervisor: Laaksonen, Jorma
dc.date.accessioned: 2026-01-07T18:00:48Z
dc.date.available: 2026-01-07T18:00:48Z
dc.date.issued: 2025-11-23
dc.description.abstract: This thesis investigates the capabilities of vision language models (VLMs) to perform visual question answering (VQA) on UML sequence diagrams. For this purpose, a custom dataset was curated from documents containing annotated sequence diagrams with different levels of visual complexity. The dataset supports three main tasks: information extraction, location recognition, and sequence flow analysis. The evaluation focused on five factors: prompt type (zero-shot versus few-shot), context granularity (none, sparse, detailed), the effects of fine-tuning, task type, and diagram complexity. The experiments were carried out on the Gemma family of vision language models, with parameter-efficient fine-tuning applied to the 4B variant using LoRA adapters targeting both the vision and language components. The results indicate that detailed contextual input significantly enhances performance across all task types, while few-shot prompting yields only marginal gains beyond what detailed contextual information alone provides. Fine-tuning led to limited improvements in the current setup, indicating the need for further experimentation with training strategies. The findings highlight key bottlenecks in applying vision language models to basic perception tasks, including a strong reliance on handcrafted contextual input and variability in performance across diagram complexities. The thesis concludes by outlining several future research directions, such as full-model fine-tuning, training with longer context lengths, evaluating models with multi-token outputs that involve reasoning, and comparing across a wider range of model architectures and UML diagram types. [en] (A hedged sketch of the LoRA setup described here appears after this record.)
dc.format.extent: 97
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/141716
dc.identifier.urn: URN:NBN:fi:aalto-202601071105
dc.language.iso: en [en]
dc.location: P1 [fi]
dc.programme: Master's Programme in Automation and Electrical Engineering [en]
dc.programme: Automaation ja sähkötekniikan maisteriohjelma [fi]
dc.programme: Magisterprogrammet i automation och elektroteknik [sv]
dc.programme.major: Control, Robotics and Autonomous Systems [en]
dc.subject.keyword: vision language models [en]
dc.subject.keyword: visual question answering [en]
dc.subject.keyword: UML sequence diagrams [en]
dc.subject.keyword: few-shot learning [en]
dc.subject.keyword: fine-tuning [en]
dc.subject.keyword: model evaluation [en]
dc.title: Understanding the capabilities of vision language models on UML sequence diagrams [en]
dc.type: G2 Pro gradu, diplomityö [fi]
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Diplomityö [fi]
local.aalto.electroniconly: yes
local.aalto.openaccess: yes
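The abstract above mentions parameter-efficient fine-tuning of the Gemma 4B vision-language model with LoRA adapters targeting both the vision and language components. The following is a minimal, hypothetical sketch of such a setup using Hugging Face transformers and peft; the checkpoint id, target module names, and hyperparameters are illustrative assumptions and are not taken from the thesis.

```python
# Hypothetical sketch only: attach LoRA adapters to both the vision and
# language towers of a Gemma-family VLM, in the spirit of the setup the
# abstract describes. Checkpoint id, module names, and hyperparameters are
# assumptions for illustration.
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3-4b-it"  # assumed 4B vision-language checkpoint
processor = AutoProcessor.from_pretrained(model_id)  # prepares image + text inputs
model = AutoModelForImageTextToText.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,               # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projection layers; matching by suffix reaches both the language
    # model and the vision encoder, but the exact module names depend on the
    # checkpoint's architecture.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

The wrapped model can then be trained with a standard supervised fine-tuning loop on image-question-answer triples; only the small adapter matrices are updated, which is what makes the approach parameter-efficient.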

Files

Original bundle

Name: master_Abdul_Rehman_Ali_2026.pdf
Size: 3.69 MB
Format: Adobe Portable Document Format