CBR-to-SQL: Rethinking retrieval-based text-to-SQL using case-based reasoning in the medical domain
School of Science
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Language
en
Pages
84
Abstract
Electronic Health Records (EHR) databases contain valuable data for healthcare research, yet extracting information out of these databases requires Structured Query Language (SQL) expertise and schema knowledge, limiting accessibility for non-technical users. While a promising approach is to use Large Language Models (LLMs) to translate natural language questions to SQL via Retrieval-Augmented Generation (RAG), adapting this approach to the medical domain is non-trivial. Standard RAG relies on single-step retrieval from a static pool of examples, which struggles with the variability and noise of medical terminology and jargon. This often leads to anti-patterns such as expanding the task demonstration pool to improve coverage, which in turn introduces noise and scalability problems. To address the issue, this research proposes CBR-to-SQL, a novel retrieval-based framework that integrates Case-based Reasoning (CBR) principles into medical text-to-SQL tasks. The framework operates in two phases: an offline Case Retain phase, which constructs reusable templates from question–SQL pairs by abstracting schema-specific details, and an online inference phase, where a two-step retrieval process—Template Construction and Source Discovery—reuses logical structures and adapts them to entities in natural language questions. Under this formulation, CBR-to-SQL allows sample-efficient reasoning, decomposes standard retrieval pipelines into focused subproblems, and provides interpretable, targeted optimization for SQL generation. Evaluated on the MIMICSQL benchmark, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy. It outperforms strong baselines, especially in low-resource data settings where training examples are limited. These results demonstrate that CBR-to-SQL can address the limitations of existing retrieval-based methods and inspire further CBR adaptations in standard RAG-to-SQL systems. 
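The two-phase idea described above can be illustrated with a toy sketch. This is not the thesis implementation: the function names (`retain_case`, `retrieve_template`, `adapt`), the token-overlap similarity, the regex-based slot filling, and the table and column names are all assumptions made for illustration. The second step in particular is a simple stand-in and does not reproduce the Source Discovery step named in the abstract.

```python
import re

def retain_case(question, sql):
    """Offline step (hypothetical): mask quoted SQL literals in both the
    question and the SQL to obtain a reusable template."""
    values = re.findall(r'"([^"]*)"', sql)
    q_tmpl, sql_tmpl = question, sql
    for i, value in enumerate(values):
        slot = f"<V{i}>"
        q_tmpl = q_tmpl.replace(value, slot)
        sql_tmpl = sql_tmpl.replace(f'"{value}"', f'"{slot}"')
    return {"question": q_tmpl, "sql": sql_tmpl}

def tokens(text):
    """Lowercased word tokens; slot markers such as <v0> count as tokens."""
    return set(re.findall(r"[a-z0-9_<>]+", text.lower()))

def jaccard(a, b):
    """Token-set overlap, used here as a stand-in similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_template(question, case_base):
    """Online step 1 (hypothetical): pick the most similar stored template."""
    q = tokens(question)
    return max(case_base, key=lambda c: jaccard(q, tokens(c["question"])))

def adapt(question, case):
    """Online step 2 (hypothetical): read slot values off the new question
    and substitute them into the SQL template. Returns None if the question
    does not fit the template wording."""
    parts = re.split(r"(<V\d+>)", case["question"])
    pattern = "".join(
        f"(?P<{p[1:-1]}>.+?)" if re.fullmatch(r"<V\d+>", p) else re.escape(p)
        for p in parts
    )
    match = re.fullmatch(pattern, question)
    if match is None:
        return None
    sql = case["sql"]
    for slot, value in match.groupdict().items():
        sql = sql.replace(f"<{slot}>", value)
    return sql

# Toy case base; table and column names are illustrative only.
case_base = [
    retain_case(
        "how many patients have diagnosis sepsis",
        'SELECT COUNT(*) FROM diagnoses WHERE short_title = "sepsis"',
    ),
    retain_case(
        "what is the age of patient 10026",
        'SELECT age FROM demographic WHERE subject_id = "10026"',
    ),
]

new_question = "how many patients have diagnosis pneumonia"
best_case = retrieve_template(new_question, case_base)
print(adapt(new_question, best_case))
```

In this sketch a single stored case covers any diagnosis name that fits the same question wording, which is the sample-efficiency argument in miniature. A realistic system would need a far more tolerant matching and adaptation step than exact template wording.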
Overall, the key contributions of this thesis include introducing CBR formulations to medical text-to-SQL, designing a CBR-based proof-of-concept, proposing a more challenging incomplete database environment to simulate limited data conditions, and developing a brittleness metric to quantify model reliance on retrieved cases.
Supervisor
Marttinen, Pekka
Thesis advisor
Moen, Hans
Kumar, Anmol