Towards automated programming feedback with open-weight language models

School of Science | Doctoral thesis (article-based) | Defence date: 2026-02-20

Language

en

Pages

106 + app. 111

Series

Aalto University publication series Doctoral Theses, 42/2026

Abstract

The increasing demand for computer science experts highlights the importance of supporting novice learners effectively, particularly in introductory programming courses. A key factor in maintaining learner engagement and progress is the provision of timely, constructive feedback on student code. Yet, delivering such feedback at scale remains a significant challenge: human-centered support does not scale easily, and existing automated assessment systems often lack the ability to provide meaningful and nuanced guidance. Recent advances in language model (LM) research offer promising new avenues to address this gap. Large language models (LLMs) in particular have demonstrated strong capabilities for generating educational feedback. Much of the work in this space has relied on proprietary models such as those behind ChatGPT. However, reliance on these models raises concerns around cost, control, and long-term accessibility. These challenges are motivating a shift toward open-weight alternatives. This dissertation addresses several challenges in integrating open-weight language models into educational feedback systems, presenting contributions across three complementary dimensions. First, it explores methods for enabling pre-trained LMs to repair student programs. These methods combine infilling models with search algorithms to repair programs in place, and leverage automated repair tools with distillation pipelines to bootstrap training. Second, it proposes two automated evaluation approaches to assess feedback capabilities without requiring human annotation: (i) an LLM-as-a-Judge approach that leverages language models' reasoning abilities to score feedback, and (ii) a repair-as-proxy approach that uses program repair performance to measure feedback proficiency. Third, it contributes reinforcement learning techniques to align small language models (SLMs) with pedagogical objectives. These approaches allow practitioners to choose between two forms of automated supervision: alignment via model preferences (i.e., AI-generated rankings) and self-supervised alignment using verifiable program repairs. Empirical evaluations on student code datasets and public programming benchmarks demonstrate that small, open-weight models can be selected, tuned, and evaluated almost automatically to generate useful explanations, hints, and corrections for novice programmers.
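As a concrete illustration of the repair-as-proxy evaluation mentioned above, the sketch below (not drawn from the dissertation; the function name and test-harness layout are hypothetical) scores a model-proposed repair by the fraction of instructor tests it passes, the kind of machine-verifiable signal the abstract refers to:

    import subprocess
    import sys
    import tempfile
    from pathlib import Path

    def repair_pass_rate(repaired_code: str, tests: list[str]) -> float:
        """Fraction of instructor tests passed by a model-proposed repair.

        Illustrative sketch only: assumes each test is a standalone Python
        snippet that imports the student submission from `submission.py`.
        """
        passed = 0
        with tempfile.TemporaryDirectory() as tmp:
            Path(tmp, "submission.py").write_text(repaired_code)
            for i, test in enumerate(tests):
                test_file = Path(tmp, f"test_{i}.py")
                test_file.write_text(test)
                try:
                    result = subprocess.run(
                        [sys.executable, test_file.name],
                        cwd=tmp, capture_output=True, timeout=10,
                    )
                    passed += result.returncode == 0  # exit code 0: assertions held
                except subprocess.TimeoutExpired:
                    pass  # a non-terminating repair counts as a failure
        return passed / len(tests)

    # Hypothetical usage: score one candidate repair against one instructor test.
    repair = "def median(xs):\n    xs = sorted(xs)\n    return xs[len(xs) // 2]\n"
    tests = ["from submission import median\nassert median([3, 1, 2]) == 2\n"]
    print(repair_pass_rate(repair, tests))  # 1.0 -> the repair passes all tests

Because test outcomes are machine-checkable, such a score requires no human annotation; a repair that passes more tests is taken as evidence that the underlying model located and understood the student's bug.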

Supervising professor

Kivelä, Mikko, Prof., Department of Computer Science, Aalto University, Finland

Thesis advisor

Hellas, Arto, Dr., Aalto University, Finland
Haaranen, Lassi, Dr., Aalto University, Finland

Parts

  • [Publication 1]: Charles Koutcheme, Sami Sarsa, Juho Leinonen, Arto Hellas, and Paul Denny. Automated Program Repair Using Generative Models for Code Infilling. In Artificial Intelligence in Education (AIED 2023), Lecture Notes in Computer Science, Volume 13916, June 2023.
    DOI: 10.1007/978-3-031-36272-9_74
  • [Publication 2]: Charles Koutcheme. Training Language Models for Programming Feedback Using Automated Repair Tools. In Artificial Intelligence in Education (AIED 2023), Lecture Notes in Computer Science, Volume 13916, June 2023.
    DOI: 10.1007/978-3-031-36272-9_79
  • [Publication 3]: Charles Koutcheme, Sami Sarsa, Juho Leinonen, Lassi Haaranen, and Arto Hellas. Evaluating Distance Measures for Program Repair. In Proceedings of the 2023 ACM Conference on International Computing Education Research, Volume 1 (ICER ’23), New York, United States, pp. 495–507, August 2023.
    DOI: 10.1145/3568813.3600130
  • [Publication 4]: Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, Paul Denny. Open Source Language Models Can Provide Feedback: Evaluating LLMs’ Ability to Help Students Using GPT-4-As-A-Judge. In Proceedings of the 2024 ACM Conference on Innovation and Technology in Computer Science Education, Volume 1 (ITiCSE ’24), Milan, Italy, pp. 52–58, July 2024.
    DOI: 10.1145/3649217.3653612
  • [Publication 5]: Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, Syed Ashraf, Paul Denny. Evaluating Language Models for Generating and Judging Programming Feedback. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education, Volume 1 (SIGCSE ’25), Pittsburgh, United States, pp. 624–630, February 2025.
    DOI: 10.1145/3641554.3701791
  • [Publication 6]: Charles Koutcheme, Nicola Dainese, Arto Hellas. Using Program Repair as a Proxy for Language Models’ Feedback Ability in Programming Education. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico, pp. 165–181, June 2024.
  • [Publication 7]: Charles Koutcheme, Nicola Dainese, Arto Hellas. Reinforcement Learning for Programming Feedback: Aligning Small Language Models Without Human Preferences. In Proceedings of the 9th Educational Data Mining in Computer Science Education (CSEDM) Workshop, CEUR Workshop Proceedings, Volume 4019, Palermo, Italy, pp. 1–18, July 2025.
  • [Publication 8]: Charles Koutcheme, Nicola Dainese, and Arto Hellas. Direct Repair Optimization: Training Small Language Models For Educational Program Repair Improves Feedback. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), Vienna, Austria, pp. 564–581, July 2025.
