Enhancing knowledge graph interactions : A comprehensive Text-to-Cypher pipeline with large language models

Access rights

openAccess
CC BY

Creative Commons license

Except where otherwise noted, this item's license is described as openAccess
publishedVersion

A1 Original article in a scientific journal

Language

en

Pages

20

Series

Information Processing and Management, Volume 63, issue 1

Abstract

Knowledge Graphs (KGs) store structured information but typically require specialized query languages, such as Cypher for Neo4j, which creates an accessibility barrier for users unfamiliar with graph syntax. Large Language Models (LLMs) offer a solution by translating natural language into Cypher queries. However, existing models, including large-scale LLMs (e.g., ChatGPT) and smaller open-source models (e.g., Llama-7B, 8B), often struggle to generate accurate domain-specific queries because they are poorly aligned with KG schemas and have limited domain-specific training data. To address these limitations, we propose a training pipeline tailored to domain-aligned Cypher query generation, with an emphasis on usability for smaller-scale models. Our method uses template-based synthetic data generation to produce diverse, high-quality training samples, and combines supervised fine-tuning with preference learning to strengthen domain knowledge and Cypher syntax understanding. In addition, a context-aware retrieval mechanism dynamically incorporates relevant schema elements at inference time, improving alignment with domain-specific knowledge. We evaluated our method on the Hetionet biomedical KG using a benchmark dataset of 240 queries across three complexity levels. Context-aware prompting alone increases component matching accuracy by 23.6% for ChatGPT-4o over the vanilla prompt baseline. With the full training pipeline applied to smaller-scale models, CodeLlama-13B* achieves an execution accuracy of 69.2%, nearly matching ChatGPT-4o's 72.1%. Our approach thus significantly narrows the performance gap, enabling smaller models to handle complex, domain-specific tasks previously dominated by larger models. These findings indicate that our method is scalable, computationally efficient, and robust for practical Cypher query generation applications.
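
The abstract does not spell out how the template-based synthetic data generation works, so the following is only a minimal sketch of the general idea: question templates and Cypher templates share slots that are filled from a schema fragment and a list of entity names. The templates, the `SCHEMA_EDGES` and `ENTITY_NAMES` values, and the `generate_pairs` helper are all hypothetical illustrations and are not taken from the paper or from the actual Hetionet schema.

```python
from itertools import product

# Hypothetical templates: each pairs a natural-language pattern with a
# Cypher pattern that uses the same slots. Doubled braces are literal braces.
TEMPLATES = [
    {
        "question": "Which {target}s are linked to the {source} {name} by {rel}?",
        "cypher": "MATCH (s:{source} {{name: '{name}'}})-[:{rel}]->(t:{target}) "
                  "RETURN t.name",
    },
    {
        "question": "How many {target}s are linked to the {source} {name} by {rel}?",
        "cypher": "MATCH (s:{source} {{name: '{name}'}})-[:{rel}]->(t:{target}) "
                  "RETURN count(t)",
    },
]

# Illustrative schema fragment (assumed, not the real Hetionet schema):
# (source label, relationship type, target label).
SCHEMA_EDGES = [
    ("Compound", "TREATS", "Disease"),
    ("Disease", "ASSOCIATES", "Gene"),
]

# Illustrative entity names per node label (assumed values).
ENTITY_NAMES = {
    "Compound": ["Aspirin"],
    "Disease": ["asthma", "epilepsy"],
}


def generate_pairs(templates, edges, names):
    """Fill every template with every schema edge and source entity name."""
    pairs = []
    for tpl, (source, rel, target) in product(templates, edges):
        for name in names.get(source, []):
            slots = {"source": source, "rel": rel, "target": target, "name": name}
            pairs.append(
                {
                    "question": tpl["question"].format(**slots),
                    "cypher": tpl["cypher"].format(**slots),
                }
            )
    return pairs


pairs = generate_pairs(TEMPLATES, SCHEMA_EDGES, ENTITY_NAMES)
for pair in pairs[:2]:
    print(pair["question"])
    print(pair["cypher"])
```

A real pipeline would presumably draw the schema and entity names from the graph itself and would need to escape quotes in entity names before placing them in a query string; this sketch skips both.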

Citation

Yang, C, Li, C, Hu, X, Yu, H & Lu, J 2026, 'Enhancing knowledge graph interactions : A comprehensive Text-to-Cypher pipeline with large language models', Information Processing and Management, vol. 63, no. 1, 104280. https://doi.org/10.1016/j.ipm.2025.104280
