ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations


Access rights

openAccess
CC BY
publishedVersion

A4 Article in conference proceedings

Language

en

Pages

17

Series

IUI 2025 - Proceedings of the 2025 International Conference on Intelligent User Interfaces, pp. 861-877, International Conference on Intelligent User Interfaces, Proceedings IUI

Abstract

Multimodal Vision-Language Models (VLMs) enable powerful applications through their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations and can be applied to any dataset of UI screenshots. We generate a dataset of 353K conversational examples paired with UIs, covering Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to UI verification.

Description

Publisher Copyright: © 2025 Copyright held by the owner/author(s).

Citation

Jiang, Y, Schoop, E, Swearngin, A & Nichols, J 2025, ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations. in IUI 2025 - Proceedings of the 2025 International Conference on Intelligent User Interfaces. International Conference on Intelligent User Interfaces, Proceedings IUI, ACM, pp. 861-877, International Conference on Intelligent User Interfaces, Cagliari, Italy, 24/03/2025. https://doi.org/10.1145/3708359.3712129