Reinforcement Learning for Programming Feedback: Aligning Small Language Models Without Human Preferences
Access rights
openAccess
CC BY
publishedVersion
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
View publication in the Research portal
View/Open full text file from the Research portal
Other link related to publication
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Language
en
Series
Proceedings of 9th Educational Data Mining in Computer Science Education (CSEDM) Workshop, CEUR Workshop Proceedings; Volume 4019
Abstract
Providing students with timely and effective feedback remains a critical challenge in programming education. Locally deployed Small Language Models (SLMs) offer a cost-effective solution that enables educators to generate feedback while avoiding third-party reliance and privacy concerns associated with Large Language Models (LLMs). However, SLMs often produce misleading or inaccurate feedback, limiting their practical use. This paper presents a fully automated reinforcement learning framework for aligning SLMs to generate high-quality programming feedback without any human-labelled examples or preference annotations. Our approach transfers the feedback capabilities of powerful LLMs ("teacher models") to smaller, low-resource models ("student models") that can run locally on consumer hardware, with the optional assistance of medium-sized "assistant" models. The framework supports two configurations: an off-policy setup that uses assistant model generations to bootstrap alignment and a lightweight online on-policy variant that trains directly on student model outputs. We evaluate both approaches by fine-tuning two SLMs on a real-world dataset of CS1 programming submissions collected across semesters. Our experiments simulate realistic deployment scenarios, training on data from past semesters and evaluating on future ones. Results show that both methods significantly improve feedback quality and generalize across new course offerings. We provide practical considerations for aligning SLMs in educational settings and outline a promising direction for future work. Our code is made available on GitHub: https://github.com/KoutchemeCharles/rlpf
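To give a rough intuition for the on-policy configuration the abstract describes (sampling from the student model and scoring with a teacher, with no human preference labels), here is a toy sketch. It is not the authors' implementation: the "student" is reduced to a categorical policy over three canned feedback messages, the "teacher" to a hand-written scoring function, and the update to a plain REINFORCE step with an expected-reward baseline; all names (`FEEDBACK`, `teacher_score`, etc.) are hypothetical stand-ins.

```python
import math
import random

random.seed(0)

# Hypothetical stand-ins: in the real framework the "student" is an SLM
# generating free-form feedback and the "teacher" is an LLM grading it.
FEEDBACK = [
    "Looks fine.",                        # vague        -> low reward
    "Check your loop bounds on line 3.",  # actionable   -> high reward
    "Try again.",                         # unhelpful    -> low reward
]

def teacher_score(feedback: str) -> float:
    """Toy teacher: rewards specific, actionable feedback."""
    return 1.0 if "line" in feedback else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# On-policy loop: sample from the current student policy, score with the
# teacher, and push probability mass toward high-reward outputs (REINFORCE
# with the policy's expected reward as a variance-reducing baseline).
logits = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    action = sample(probs)
    reward = teacher_score(FEEDBACK[action])
    baseline = sum(p * teacher_score(f) for p, f in zip(probs, FEEDBACK))
    advantage = reward - baseline
    for i in range(len(logits)):
        grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * advantage * grad_log_prob

# After training, the policy concentrates on the teacher-preferred feedback.
final_probs = softmax(logits)
```

The same skeleton, with the tabular policy swapped for an SLM's token-level log-probabilities and the scoring function for an LLM judge, is the shape of teacher-scored on-policy alignment; the off-policy variant would instead score pre-generated assistant-model outputs.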
Citation
Koutcheme, C, Dainese, N & Hellas, A 2025, Reinforcement Learning for Programming Feedback: Aligning Small Language Models Without Human Preferences. in B Akram, Y Shi, P Brusilovsky, T Price, K Koedinger, P Carvalho, S Zhang, A Lan & J Leinonen (eds), Proceedings of 9th Educational Data Mining in Computer Science Education (CSEDM) Workshop. CEUR Workshop Proceedings, vol. 4019, CEUR, Educational Data Mining in Computer Science Education Workshop, Palermo, Italy, 20/07/2025. <https://ceur-ws.org/Vol-4019/paper_01.pdf>