Reinforcement Learning for Programming Feedback: Aligning Small Language Models Without Human Preferences


Access rights

openAccess
CC BY
publishedVersion

A4 Article in conference proceedings

Language

en

Series

Proceedings of 9th Educational Data Mining in Computer Science Education (CSEDM) Workshop, CEUR Workshop Proceedings ; Volume 4019

Abstract

Providing students with timely and effective feedback remains a critical challenge in programming education. Locally deployed Small Language Models (SLMs) offer a cost-effective solution that enables educators to generate feedback while avoiding third-party reliance and privacy concerns associated with Large Language Models (LLMs). However, SLMs often produce misleading or inaccurate feedback, limiting their practical use. This paper presents a fully automated reinforcement learning framework for aligning SLMs to generate high-quality programming feedback without any human-labelled examples or preference annotations. Our approach transfers the feedback capabilities of powerful LLMs ("teacher models") to smaller, low-resource models ("student models") that can run locally on consumer hardware, with the optional assistance of medium-sized "assistant" models. The framework supports two configurations: an off-policy setup that uses assistant model generations to bootstrap alignment and a lightweight online on-policy variant that trains directly on student model outputs. We evaluate both approaches by fine-tuning two SLMs on a real-world dataset of CS1 programming submissions collected across semesters. Our experiments simulate realistic deployment scenarios, training on data from past semesters and evaluating on future ones. Results show that both methods significantly improve feedback quality and generalize across new course offerings. We provide practical considerations for aligning SLMs in educational settings and outline a promising direction for future work. Our code is made available on GitHub: https://github.com/KoutchemeCharles/rlpf
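The paper's actual implementation lives in the linked GitHub repository. Purely as an illustrative toy of the preference-free idea the abstract describes, the sketch below scores several student-model feedback candidates with a stand-in "teacher" reward and ranks them, so that the best and worst generations can serve as a synthetic (chosen, rejected) pair for alignment. The `teacher_reward` heuristic here is a hypothetical placeholder, not the paper's reward model, which would be an LLM grader.

```python
def teacher_reward(feedback: str) -> float:
    """Hypothetical stand-in for an LLM teacher grader: rewards feedback
    that names the specific bug and penalizes verbosity."""
    score = 0.0
    if "off-by-one" in feedback:
        score += 1.0          # mentions the actual defect
    score -= 0.01 * len(feedback)  # mild length penalty
    return score

def rank_candidates(candidates: list[str]) -> list[str]:
    """Score each student-model generation with the teacher reward and
    return the candidates ranked best-first. The top/bottom pair can then
    be used as a synthetic preference pair, with no human annotation."""
    return sorted(candidates, key=teacher_reward, reverse=True)

# Toy generations a small student model might produce for a buggy loop.
candidates = [
    "Your code looks fine.",
    "There is an off-by-one error in the loop bound on line 3.",
    "Something is wrong somewhere; check everything again and rewrite it.",
]

ranked = rank_candidates(candidates)
chosen, rejected = ranked[0], ranked[-1]
```

In an actual alignment loop, pairs like `(chosen, rejected)` would feed a preference-optimization update (off-policy, from assistant-model generations, or on-policy, from the student model's own outputs); this sketch only shows where the teacher-derived signal replaces human labels.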

Citation

Koutcheme, C, Dainese, N & Hellas, A 2025, Reinforcement Learning for Programming Feedback: Aligning Small Language Models Without Human Preferences. in B Akram, Y Shi, P Brusilovsky, T Price, K Koedinger, P Carvalho, S Zhang, A Lan & J Leinonen (eds), Proceedings of 9th Educational Data Mining in Computer Science Education (CSEDM) Workshop. CEUR Workshop Proceedings, vol. 4019, CEUR, Educational Data Mining in Computer Science Education Workshop, Palermo, Italy, 20/07/2025. <https://ceur-ws.org/Vol-4019/paper_01.pdf>