Investigating retrieval-augmented code generation for domain-specific tasks

School of Science | Master's thesis

Language

en

Pages

55

Abstract

In recent years, Large Language Models (LLMs) have increasingly influenced both everyday life and a wide range of research domains, attracting considerable interest from the research community. Although previous studies have shown that LLMs trained specifically on code can solve a variety of programming tasks, including code completion, code generation, and code translation, they still face fundamental limitations arising from their reliance on static training data. In particular, most models lack access to up-to-date, domain- or project-specific information at inference time, leading to incorrect or outdated outputs. Retrieval-Augmented Generation (RAG) has been proposed as an effective approach to address this limitation by incorporating relevant external knowledge into the generation process. While RAG has demonstrated promising results, its effectiveness for domain-specific code generation remains underexplored. This thesis evaluates the impact of retrieval-augmented code generation compared to standard code generation without retrieval across multiple domain-specific programming tasks. A general experimental framework is introduced to standardize different domain-specific benchmarks, and identical model and generation settings are used throughout to ensure a fair comparison. The evaluation covers three distinct domains: quantum, cryptography, and game development. The generated code is evaluated with execution-based metrics (pass@k) when unit tests are available and with similarity-based metrics (CodeBLEU) otherwise. The results show that retrieval augmentation affects models differently across domains and model sizes: while retrieval improves some models, it can degrade the performance of others. Similarity-based metrics exhibit smaller differences between the baseline and retrieval-augmented settings, suggesting that structural similarity does not necessarily imply functional correctness.
Qualitative analysis further shows that retrieval can reduce code-related errors but may also introduce noise that negatively impacts generation. Overall, this study provides an empirical analysis of how retrieval augmentation supports domain-specific code generation, examining both its potential and its limitations.
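The execution-based pass@k metric mentioned in the abstract is commonly computed with the unbiased estimator introduced by Chen et al. (2021): given n generated samples per task, of which c pass the unit tests, it estimates the probability that at least one of k drawn samples is correct. A minimal sketch (the function name is illustrative, not taken from the thesis):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations (c of which
    pass the unit tests) is functionally correct."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # must include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 samples of which c = 3 pass, pass@1 is 0.3, while pass@5 rises because drawing five samples gives more chances of including a correct one.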

Supervisor

Zhao, Bo

Thesis advisor

Yu, Cong
