aalto1 untyped-item.component.html
Enhancing knowledge graph interactions : A comprehensive Text-to-Cypher pipeline with large language models
Loading...
Access rights
openAccess
CC BY
CC BY
Creative Commons license
Except where otherwised noted, this item's license is described as openAccess
publishedVersion
URL
Journal Title
Journal ISSN
Volume Title
A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä
This publication is imported from Aalto University research portal.
View publication in the Research portal (opens in new window)
View/Open full text file from the Research portal (opens in new window)
View publication in the Research portal (opens in new window)
View/Open full text file from the Research portal (opens in new window)
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Authors
Date
Major/Subject
Mcode
Degree programme
Language
en
Pages
20
Series
Information Processing and Management, Volume 63, issue 1
Abstract
Knowledge Graphs (KGs) store structured information but typically require specialized query languages, such as Cypher for Neo4j, creating accessibility challenges for users unfamiliar with graph syntax. Large Language Models (LLMs) offer a solution by translating natural language into Cypher queries. However, existing models—including large-scale LLMs (e.g., ChatGPT) and smaller open-source models (e.g., Llama-7B, 8B) often struggle with accurately generating domain-specific queries due to inadequate alignment with KG schemas and limited domain-specific training data. To address these limitations, we propose a training pipeline tailored specifically for domain-aligned Cypher query generation, emphasizing usability for smaller-scale models. Our method integrates template-based synthetic data generation for diverse, high-quality training samples. We combine supervised fine-tuning with preference learning to enhance domain knowledge and Cypher syntax understanding. Additionally, our approach includes a context-aware retrieval mechanism that dynamically incorporates relevant schema elements at inference, improving alignment with domain-specific knowledge. We evaluated our method on the Hetionet biomedical KG using a benchmark dataset of 240 queries across three complexity levels. Our results show that our context-aware prompting achieves a substantial improvement, increasing component matching accuracy by 23.6% for ChatGPT-4o over the vanilla prompt baseline. When applying our full training pipeline to smaller-scale models, CodeLlama-13B* achieves an execution accuracy of 69.2%, nearly matching ChatGPT-4o's 72.1%. Importantly, our approach significantly narrows the performance gap, enabling smaller models to effectively manage complex, domain-specific tasks previously dominated by larger models. These findings demonstrate that our method is scalable, computationally efficient, and robust for practical Cypher query generation applications.
Description
Other note
Citation
Yang, C, Li, C, Hu, X, Yu, H & Lu, J 2026, 'Enhancing knowledge graph interactions : A comprehensive Text-to-Cypher pipeline with large language models', Information Processing and Management, vol. 63, no. 1, 104280. https://doi.org/10.1016/j.ipm.2025.104280
