Understanding the capabilities of vision language models on UML sequence diagrams
| dc.contributor | Aalto-yliopisto | fi |
| dc.contributor | Aalto University | en |
| dc.contributor.advisor | Zumot, Laith | |
| dc.contributor.author | Abdul Rehman, Ali | |
| dc.contributor.school | Sähkötekniikan korkeakoulu | fi |
| dc.contributor.school | School of Electrical Engineering | en |
| dc.contributor.supervisor | Laaksonen, Jorma | |
| dc.date.accessioned | 2026-01-07T18:00:48Z | |
| dc.date.available | 2026-01-07T18:00:48Z | |
| dc.date.issued | 2025-11-23 | |
| dc.description.abstract | This thesis investigates the capabilities of vision language models (VLMs) to perform visual question answering (VQA) on UML sequence diagrams. For this purpose, a custom dataset of annotated sequence diagrams with varying levels of visual complexity was curated from documents. The dataset supports three main tasks: information extraction, location recognition, and sequence flow analysis. The evaluation focused on five factors: prompt type (zero-shot versus few-shot), context granularity (none, sparse, detailed), the effects of fine-tuning, task type, and diagram complexity. The experiments were carried out on the Gemma family of vision language models, with parameter-efficient fine-tuning applied to the 4B variant using LoRA adapters targeting both the vision and language components. The results indicate that detailed contextual input significantly enhances performance across all task types, while few-shot prompting yields only marginal gains beyond what detailed contextual information alone provides. Fine-tuning led to limited improvements in the current setup, indicating the need for further experimentation with training strategies. The findings highlight key bottlenecks in applying vision language models to basic perception tasks, including a strong reliance on handcrafted contextual input and variable performance across diagram complexities. The thesis concludes by outlining several future research directions, such as full-model fine-tuning, training with longer context lengths, evaluating models with multi-token outputs that involve reasoning, and comparing across a wider range of model architectures and UML diagram types. | en |
| dc.format.extent | 97 | |
| dc.format.mimetype | application/pdf | en |
| dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/141716 | |
| dc.identifier.urn | URN:NBN:fi:aalto-202601071105 | |
| dc.language.iso | en | en |
| dc.location | P1 | fi |
| dc.programme | Master's Programme in Automation and Electrical Engineering | en |
| dc.programme | Automaation ja sähkötekniikan maisteriohjelma | fi |
| dc.programme | Magisterprogrammet i automation och elektroteknik | sv |
| dc.programme.major | Control, Robotics and Autonomous Systems | en |
| dc.subject.keyword | vision language models | en |
| dc.subject.keyword | visual question answering | en |
| dc.subject.keyword | UML sequence diagrams | en |
| dc.subject.keyword | few-shot learning | en |
| dc.subject.keyword | fine-tuning | en |
| dc.subject.keyword | model evaluation | en |
| dc.title | Understanding the capabilities of vision language models on UML sequence diagrams | en |
| dc.type | G2 Pro gradu, diplomityö | fi |
| dc.type.ontasot | Master's thesis | en |
| dc.type.ontasot | Diplomityö | fi |
| local.aalto.electroniconly | yes | |
| local.aalto.openaccess | yes |
Files
Original bundle
- Name: master_Abdul_Rehman_Ali_2026.pdf
- Size: 3.69 MB
- Format: Adobe Portable Document Format