CBR-to-SQL: Rethinking retrieval-based text-to-SQL using case-based reasoning in the medical domain

School of Science | Master's thesis

Language

en

Pages

84

Abstract

Electronic Health Records (EHR) databases contain valuable data for healthcare research, yet extracting information from them requires Structured Query Language (SQL) expertise and schema knowledge, which limits accessibility for non-technical users. Using Large Language Models (LLMs) to translate natural language questions to SQL via Retrieval-Augmented Generation (RAG) is a promising approach, but adapting it to the medical domain is non-trivial. Standard RAG relies on single-step retrieval from a static pool of examples, which struggles with the variability and noise of medical terminology and jargon. This often leads to anti-patterns such as expanding the task demonstration pool to improve coverage, which in turn introduces noise and scalability problems. To address these issues, this research proposes CBR-to-SQL, a novel retrieval-based framework that integrates Case-based Reasoning (CBR) principles into medical text-to-SQL tasks. The framework operates in two phases: an offline Case Retain phase, which constructs reusable templates from question–SQL pairs by abstracting schema-specific details, and an online inference phase, in which a two-step retrieval process (Template Construction followed by Source Discovery) reuses logical structures and adapts them to the entities in a natural language question. Under this formulation, CBR-to-SQL allows sample-efficient reasoning, decomposes standard retrieval pipelines into focused subproblems, and provides interpretable, targeted optimization for SQL generation. Evaluated on the MIMICSQL benchmark, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy. It outperforms strong baselines, especially in low-resource data settings where training examples are limited. These results demonstrate that CBR-to-SQL can address the limitations of existing retrieval-based methods and may inspire further CBR adaptations in standard RAG-to-SQL systems.
Overall, the key contributions of this thesis include introducing CBR formulations to medical text-to-SQL, designing a CBR-based proof-of-concept, proposing a more challenging incomplete database environment to simulate limited data conditions, and developing a brittleness metric to quantify model reliance on retrieved cases.
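
To make the two-phase idea in the abstract concrete, the following is a minimal illustrative sketch, not the thesis implementation. It assumes that "abstracting schema-specific details" can be approximated by masking literal values in the SQL, and that template retrieval can be approximated by token overlap between questions. The function names (`mask_literals`, `retain`, `retrieve_template`), the placeholder tokens, and the example questions and queries are all hypothetical. The second online step, Source Discovery, which would map the entities in the new question to concrete values, is not shown.

```python
import re


def mask_literals(sql: str) -> str:
    """Turn a SQL query into a template by masking literal values (assumed abstraction)."""
    # Mask quoted string literals first so digits inside them are left alone.
    masked = re.sub(r'"[^"]*"|\'[^\']*\'', "[VALUE]", sql)
    # Then mask standalone numeric literals.
    return re.sub(r"\b\d+(?:\.\d+)?\b", "[NUM]", masked)


def retain(pairs):
    """Offline phase: store each question together with its SQL template."""
    return [(question, mask_literals(sql)) for question, sql in pairs]


def tokens(text: str) -> set:
    """Lowercased alphabetic tokens, used for a simple overlap score."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_template(question: str, case_base):
    """Online step 1: return the template of the most similar stored question (Jaccard)."""
    q = tokens(question)

    def score(case):
        c = tokens(case[0])
        return len(q & c) / (len(q | c) or 1)

    return max(case_base, key=score)[1]


# Hypothetical question-SQL pairs in the style of a medical benchmark.
pairs = [
    ("how many female patients are younger than 60?",
     'SELECT COUNT(DISTINCT SUBJECT_ID) FROM DEMOGRAPHIC WHERE GENDER = "F" AND AGE < 60'),
    ("what is the date of birth of patient 2560?",
     'SELECT DOB FROM DEMOGRAPHIC WHERE SUBJECT_ID = "2560"'),
]
case_base = retain(pairs)
template = retrieve_template("how many male patients are younger than 45?", case_base)
print(template)
```

In this toy setup, a single stored case can serve any question with the same logical structure, which is one way to read the abstract's claim about sample efficiency; a real system would likely use learned embeddings rather than token overlap.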

Supervisor

Marttinen, Pekka

Thesis advisor

Moen, Hans
Kumar, Anmol
