Browsing by Author "Koutcheme, Charles"
Now showing 1 - 14 of 14
- Automated Program Repair Using Generative Models for Code Infilling
A4 Conference article (2023) Koutcheme, Charles; Sarsa, Sami; Leinonen, Juho; Hellas, Arto; Denny, Paul
In educational settings, automated program repair techniques serve as a feedback mechanism to guide students working on their programming assignments. Recent work has investigated using large language models (LLMs) for program repair. In this area, most of the attention has been focused on using proprietary systems accessible through APIs. However, the limited access and control over these systems remain a barrier to their adoption and use in education. The present work studies the repair capabilities of open large language models. In particular, we focus on a recent family of generative models which, on top of standard left-to-right program synthesis, can also predict missing spans of code at any position in a program. We experiment with one of these models on four programming datasets and show that we can obtain good repair performance even without additional training.
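As a toy illustration of the infilling idea (a minimal sketch, not the paper's actual pipeline or models), the snippet below masks a suspect line of a buggy program and splices in a predicted span; `predict_infill` is a hypothetical stand-in that returns a canned fix where a real system would query a fill-in-the-middle model.

```python
# Minimal sketch of infilling-based repair (illustrative only, not the paper's setup).
# `predict_infill` is a hypothetical stand-in for a fill-in-the-middle language model.

def predict_infill(prefix: str, suffix: str) -> str:
    """Pretend model call: return a plausible span for the masked region."""
    return "    return total / len(numbers)\n"

def repair_by_infilling(source: str, bad_line_index: int) -> str:
    """Mask one suspect line and splice in the model's predicted span."""
    lines = source.splitlines(keepends=True)
    prefix = "".join(lines[:bad_line_index])
    suffix = "".join(lines[bad_line_index + 1:])
    return prefix + predict_infill(prefix, suffix) + suffix

buggy = (
    "def average(numbers):\n"
    "    total = sum(numbers)\n"
    "    return total / numbers\n"    # bug: divides by the list instead of its length
)
print(repair_by_infilling(buggy, bad_line_index=2))
```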
- Designing and Building a Platform for Teaching Introductory Programming supported by Large Language Models
School of Science | Master's thesis (2024-01-22) Joy Kulangara, Kiran
Large language models (LLMs) have the potential to improve programming education by providing feedback and guidance to students. Despite their potential benefits, the integration of LLMs into education presents unique challenges, including the risk of over-reliance on their feedback and the inconsistency of feedback quality. Addressing these concerns requires research to identify effective ways of integrating LLMs into programming education, which is itself challenging due to the rapid evolution of LLMs. To meet this challenge, this thesis introduces a flexible platform that can integrate multiple LLMs, providing an experimental space for research and innovative approaches to enhance programming education through LLMs. Guided by the Design Science Research Methodology framework, the thesis outlines the design, development, and evaluation of this educational platform. Conducted at Aalto University's LeTech research group, the thesis presents an introductory programming learning platform specifically tailored to the group's research objectives. The platform facilitates data collection and enables students to have a personalized learning experience with the help of LLM feedback. The work advances our understanding of LLMs in education and of the importance of feedback mechanisms. The developed platform demonstrates the feasibility of integrating LLMs into programming education. A small-scale study evaluating the platform's overall usability received an average rating of 4.21 out of 5.00, while the LLM feedback received an average usefulness rating of 4.28 out of 5.00, highlighting its effectiveness and value in assisting students. Though the study sample size was small, the findings are encouraging. Future research could use the platform to explore multiple LLMs and conduct studies to improve the feedback mechanisms.
- Evaluating Distance Measures for Program Repair
A4 Conference article (2023-09-10) Koutcheme, Charles; Sarsa, Sami; Leinonen, Juho; Haaranen, Lassi; Hellas, Arto
Background and Context: Struggling with programming assignments while learning to program is a common phenomenon in programming courses around the world. Supporting struggling students is a common theme in Computing Education Research (CER), where a wide variety of support methods have been created and evaluated. An important stream of research here focuses on program repair, where methods for automatically fixing erroneous code are used to support students as they debug their code. Work in this area has so far assessed the performance of the methods by evaluating the closeness of the proposed fixes to the original erroneous code. The evaluations have mainly relied on edit distance measures such as the sequence edit distance, and there is a lack of research on which distance measure is the most appropriate. Objectives: Provide insight into measures for quantifying the distance between erroneous code written by a student and a proposed change. We conduct the evaluation in an introductory programming context, where insight into the distance measures can help in choosing a suitable metric to inform which fixes should be suggested to novices. Method: A team of five experts annotated a subset of the Dublin dataset, creating solutions for over a thousand erroneous programs written by students. We evaluated how the prominent edit distance measures from the CER literature compare against measures used in Natural Language Processing (NLP) tasks for retrieving the experts' solutions from a pool of proposed solutions. We also evaluated how the expert-generated solutions compare against the solutions proposed by common program repair algorithms. The annotated dataset and the evaluation code are published as part of the work. Findings: Our results highlight that the ROUGE score, classically used for evaluating the performance of machine summarization tasks, performs well as an evaluation and selection metric for program repair. We also highlight the practical utility of NLP metrics, which allow an easier interpretation and comparison of the performance of repair techniques when compared to the classic methods used in the CER literature. Implications: Our study highlights the variety of distance metrics used for comparing source code. We find issues with the classically used distance measures that can be combated by using NLP metrics. Based on our findings, we recommend including NLP metrics, and in particular the ROUGE metric, in evaluations when considering new program repair methodologies. We also suggest incorporating NLP metrics into other areas where source code is compared, including plagiarism detection.
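To see how two such measures can behave differently, here is a hand-rolled sketch (my own illustration, not the paper's published evaluation code): a token-level Levenshtein distance versus a ROUGE-1-style unigram-overlap F1, applied to an invented buggy program and candidate repair.

```python
from collections import Counter

def token_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over token sequences (rolling-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ta != tb)))    # substitution
        prev = cur
    return prev[-1]

def rouge1_f1(reference: list[str], candidate: list[str]) -> float:
    """ROUGE-1-style unigram-overlap F1 between two token lists."""
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

buggy = "def area ( r ) : return 3.14 * r".split()
repair = "def area ( r ) : return 3.14 * r * r".split()
print(token_edit_distance(buggy, repair))   # two inserted tokens -> distance 2
print(round(rouge1_f1(buggy, repair), 3))   # high unigram overlap -> F1 about 0.91
```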
- Evaluating the Use of Retrieval-augmented Generation for Enhancing Online Courses
School of Science | Master's thesis (2024-11-24) Pasquarelli, Leonardo
Providing sufficient and adequate teaching assistance to students in programming education for online courses requires substantial resources, especially considering growing enrolment numbers. To tackle the problem of scalable course assistance, we developed a chatbot specific to the Web Software Development (WSD) course at Aalto, using a novel technology called retrieval-augmented generation (RAG), which harnesses large language models (LLMs) and augments the produced answer with search results from an external data source: in our case the course material, vectorised and embedded into a vector database. Our evaluations include a benchmark in which we compare the faithfulness and relevancy of answers generated by 54 different configurations, determined by the LLM, the embedding model, the chunk size and number of chunks, and the retrieval mode. The 28 questions used were mainly collected from participants taking the WSD course. The findings suggest that, in the context of this experiment, larger chunk sizes work better, a vector-only retrieval mode produces better results, the choice of LLM in itself had only a mild effect on answer quality, and text-embedding-3-large and all-MiniLM-v6 performed significantly better than RoBERTa. Furthermore, we conducted an in-person user survey (N = 14) in which students worked on course tasks with the assistance of our chatbot and a search functionality. The goal was to assess satisfaction with RAG compared to the search functionality, as well as search performance using RAG compared to the search functionality. The findings suggest that users perceive both assistants as useful or highly useful, and that the bot produces factually correct results. The preference for a specific assistant, and performance, depended on various factors, including the exercise type.
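The retrieval step can be sketched as follows (a toy bag-of-words illustration with invented course snippets; the thesis used real embedding models and a vector database):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words vector (a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

course_chunks = [            # invented stand-ins for chunks of the course material
    "Deno serves HTTP requests with the Deno.serve function.",
    "Templates are rendered on the server before sending HTML to the client.",
    "Sessions can be stored in cookies or in a server-side store.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    return sorted(course_chunks, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)[:k]

question = "How do I serve HTTP requests?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the course material below.\n\n{context}\n\nQuestion: {question}"
print(prompt)   # this augmented prompt would then be sent to an LLM
```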
- Exploring How Students Solve Open-ended Assignments: A Study of SQL Injection Attempts in a Cybersecurity Course
A4 Conference article (2022-07-07) Koutcheme, Charles; Tilanterä, Artturi; Peltonen, Aleksi; Hellas, Arto; Haaranen, Lassi
Research into computing and learning how to program has been ongoing for decades. Commonly, this research has focused on novice learners and the difficulties they encounter, especially during CS1. Cybersecurity is a critical aspect of computing, both as a topic in university education and as a core skill in the industry. In this study, we investigate how students solve open-ended assignments in a cybersecurity course offered to university students after two years of CS studies. Specifically, we looked at how students perform SQL injection attacks on a web application system, and studied to what extent we can characterize the process by which they come up with successful injections. Our results show that individual students who seek to hack the system use distinguishable strategies, and that these approaches revolve around exploration and exploitation tactics. We also find evidence of learning, reflected in a more pronounced use of exploitation in a subsequent similar assignment.
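The class of vulnerability the students exploit can be illustrated with a generic, self-contained example (not the course's actual application): string-concatenated SQL is injectable, while a parameterised query is not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

user_input = "nobody' OR '1'='1"   # a classic injection attempt

# Vulnerable: user input is spliced directly into the query string.
unsafe = f"SELECT secret FROM users WHERE name = '{user_input}'"
print(conn.execute(unsafe).fetchall())        # leaks every row: [('hunter2',)]

# Safe: a placeholder makes the driver treat the input as data, not SQL.
safe = "SELECT secret FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())   # no match: []
```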
- Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests
A4 Conference article (2023-09-10) Hellas, Arto; Leinonen, Juho; Sarsa, Sami; Koutcheme, Charles; Kujanpää, Lilja; Sorva, Juha
Background and Context: Over the past year, large language models (LLMs) have taken the world by storm. In computing education, as in other walks of life, many opportunities and threats have emerged as a consequence. Objectives: In this article, we explore such opportunities and threats in a specific area: responding to student programmers' help requests. More specifically, we assess how good LLMs are at identifying issues in problematic code that students request help on. Method: We collected a sample of help requests and code from an online programming course. We then prompted two different LLMs (OpenAI Codex and GPT-3.5) to identify and explain the issues in the students' code and assessed the LLM-generated answers both quantitatively and qualitatively. Findings: GPT-3.5 outperforms Codex in most respects. Both LLMs frequently find at least one actual issue in each student program (GPT-3.5 in 90% of the cases). Neither LLM excels at finding all the issues (GPT-3.5 finds them 57% of the time). False positives are common (a 40% chance for GPT-3.5). The advice that the LLMs provide on the issues is often sensible. The LLMs perform better on issues involving program logic than on output formatting. Model solutions are frequently provided even when the LLM is prompted not to. LLM responses to prompts in a non-English language are only slightly worse than responses to English prompts. Implications: Our results continue to highlight the utility of LLMs in programming education. At the same time, the results highlight the unreliability of LLMs: LLMs make some of the same mistakes that students do, perhaps especially when formatting output as required by automated assessment systems. Our study informs teachers interested in using LLMs as well as future efforts to customize LLMs for the needs of programming education.
- Let's Ask AI About Their Programs: Exploring ChatGPT's Answers To Program Comprehension Questions
A4 Conference article (2024-05-24) Lehtinen, Teemu; Koutcheme, Charles; Hellas, Arto
Recent research has explored the creation of questions from code submitted by students. These Questions about Learners' Code (QLCs) are created through program analysis, exploring execution paths, and then creating code comprehension questions from these paths and the broader code structure. Responding to the questions requires reading and tracing the code, which is known to support students' learning. At the same time, computing education researchers have witnessed the emergence of Large Language Models (LLMs) that have taken the community by storm. Researchers have demonstrated the applicability of these models especially in the introductory programming context, outlining their performance in solving introductory programming problems and their utility in creating new learning resources. In this work, we explore the capability of state-of-the-art LLMs (GPT-3.5 and GPT-4) in answering QLCs that are generated from code that the LLMs themselves have created. Our results show that although the state-of-the-art LLMs can create programs and trace program execution when prompted, they easily succumb to errors similar to those previously recorded for novice programmers. These results demonstrate the fallibility of these models and perhaps dampen the expectations fueled by the recent LLM hype. At the same time, we also highlight future research possibilities, such as using LLMs to mimic students, as their behavior can indeed be similar for some specific tasks.
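One way to generate a QLC of this kind programmatically (a sketch in the spirit of the approach, not the authors' generator; the student program and the asked line are invented) is to trace execution and ask how many times a given line runs:

```python
import sys
from collections import Counter

student_code = """\
total = 0
for i in range(1, 4):
    total += i
print(total)
"""

line_hits = Counter()

def tracer(frame, event, arg):
    # Count 'line' events, but only for the student's frame (marked via its globals).
    if event == "line" and frame.f_globals.get("__qlc__"):
        line_hits[frame.f_lineno] += 1
    return tracer

globs = {"__qlc__": True}
sys.settrace(tracer)
exec(compile(student_code, "<student>", "exec"), globs)
sys.settrace(None)

asked_line = 3   # the loop body: "total += i"
print(f"QLC: How many times does line {asked_line} of your program execute?")
print(f"Ground truth: {line_hits[asked_line]}")   # 3
```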
- Methodological Considerations for Predicting At-risk Students
A4 Conference article (2022-02-14) Koutcheme, Charles; Sarsa, Sami; Hellas, Arto; Haaranen, Lassi; Leinonen, Juho
Educational researchers have long sought to increase student retention. One stream of research focusing on this seeks to automatically identify students who are at risk of dropping out. Studies tend to agree that earlier identification of at-risk students is better, providing more room for targeted interventions. We looked at the interplay between the data and the predictive power of machine learning models used to identify at-risk students. We critically examine the often-used approach where data collected from weeks 1, 2, ..., n is used to predict whether a student becomes inactive in the subsequent weeks w, w ≥ n + 1, pointing out issues with this approach that may inflate models' predictive power. Specifically, our empirical analysis highlights that including students who have become inactive in week n or before, where n > 1, in the data used to identify students who are inactive in the following weeks is a significant cause of bias. Including students who dropped out during the first week makes the problem significantly easier, since they have no data in the subsequent weeks. Based on our results, we recommend including only students who remain active through week n when building and evaluating models for predicting dropouts in subsequent weeks, and evaluating and reporting the particularities of the respective course contexts.
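The recommendation can be phrased as a simple data-preparation rule, sketched below with invented activity data: when predicting inactivity after week n, keep only students who were still active at week n.

```python
COURSE_WEEKS = 6
last_active_week = {"s1": 1, "s2": 3, "s3": 5, "s4": 6}   # invented: last week with any activity

def build_labels(n: int) -> dict[str, bool]:
    """Label = 'becomes inactive after week n', computed only for students active at week n."""
    eligible = {s: w for s, w in last_active_week.items() if w >= n}
    return {s: w < COURSE_WEEKS for s, w in eligible.items()}

# n = 1: everyone qualifies, including s1, who already vanished after week 1 (the easy cases
# that inflate predictive power).
print(build_labels(1))   # {'s1': True, 's2': True, 's3': True, 's4': False}
# n = 3, as recommended: s1 is excluded because they were no longer active at week 3.
print(build_labels(3))   # {'s2': True, 's3': True, 's4': False}
```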
- Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge
A4 Conference article (2024-07-03) Koutcheme, Charles; Dainese, Nicola; Sarsa, Sami; Hellas, Arto; Leinonen, Juho; Denny, Paul
Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. This is a concern, as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. We find that some models offer competitive performance with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings.
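A minimal sketch of the LLM-as-a-judge setup (illustrative only; `call_judge_model`, the rubric, and the example feedback are invented stand-ins, with the stub returning a canned response where a real system would query a strong model such as GPT-4):

```python
import json

RUBRIC = (
    "Rate the feedback on a 1-5 scale for (a) identifying the actual bug and "
    "(b) not making misleading claims. Reply as JSON: "
    '{"identifies_bug": int, "no_misleading_claims": int}'
)

def call_judge_model(prompt: str) -> str:
    """Hypothetical judge call: a real system would send `prompt` to a strong LLM here."""
    return '{"identifies_bug": 4, "no_misleading_claims": 5}'

def judge_feedback(student_code: str, feedback: str) -> dict:
    prompt = f"{RUBRIC}\n\nStudent code:\n{student_code}\n\nFeedback to rate:\n{feedback}"
    return json.loads(call_judge_model(prompt))

scores = judge_feedback(
    "print('Hello' + 5)",
    "You cannot concatenate a string and an integer; convert 5 with str() first.",
)
print(scores)   # {'identifies_bug': 4, 'no_misleading_claims': 5}
```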
- Propagating Large Language Models Programming Feedback
A4 Conference article (2024-07-09) Koutcheme, Charles; Hellas, Arto
Large language models (LLMs) such as GPT-4 have emerged as promising tools for providing programming feedback. However, effective deployment of LLMs in massive classes and Massive Open Online Courses (MOOCs) raises financial concerns, calling for methods to minimize the number of calls to the APIs and systems serving such powerful models. In this article, we revisit the problem of 'propagating feedback' within the contemporary landscape of LLMs. Specifically, we explore feedback propagation as a way to reduce the cost of leveraging LLMs for providing programming feedback at scale. Our study investigates the effectiveness of this approach in the context of students requiring next-step hints for Python programming problems, presenting initial results that support the viability of the approach. We discuss the implications of our findings and suggest directions for future research in optimizing feedback mechanisms for large-scale educational environments.
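A minimal sketch of the propagation idea (my own illustration; the paper's propagation strategy may differ, and `ask_llm` is a hypothetical stand-in for an API call): feedback is cached and reused across submissions that normalise to the same program, so each distinct program costs at most one model call.

```python
import ast

def normalise(code: str) -> str:
    """Canonical form: parse and unparse so formatting and comments do not matter."""
    return ast.unparse(ast.parse(code))

def ask_llm(code: str) -> str:
    """Hypothetical stand-in for an API call that produces a next-step hint."""
    return "Hint: check where your loop should stop."

feedback_cache: dict[str, str] = {}

def next_step_hint(code: str) -> str:
    key = normalise(code)
    if key not in feedback_cache:      # only previously unseen programs cost an API call
        feedback_cache[key] = ask_llm(code)
    return feedback_cache[key]

a = "for i in range(10):\n    print(i)\n"
b = "for i in range(10) :  # loop\n    print( i )\n"
print(next_step_hint(a))
print(next_step_hint(b))               # propagated from the cache, no second call
print(len(feedback_cache))             # 1
```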
- Speeding Up Automated Assessment of Programming Exercises
A4 Conference article (2022) Sarsa, Sami; Leinonen, Juho; Koutcheme, Charles; Hellas, Arto
Introductory programming courses around the world use automatic assessment. Automatic assessment of programming code is typically performed via unit tests, which require computation time to execute, at times in significant amounts, leading to computation costs and delays in feedback to students. To address the issue, we present a step-based approach for speeding up automated assessment, consisting of (1) a cache of past programming exercise submissions and their associated test results to avoid retesting equivalent new submissions; (2) static analysis to heuristically detect, for example, infinite loops; (3) a machine learning model to evaluate programs without running them; and (4) a traditional set of unit tests. When a student submits code for an exercise, the code is evaluated sequentially through each step, providing feedback to the student at the earliest possible time and reducing the need to run tests. We evaluate the impact of the proposed approach using data collected from an introductory programming course and demonstrate a considerable reduction in the number of exercise submissions that require running the tests (up to 80%). Using the approach leads to faster feedback in a more sustainable way, and also provides opportunities for precise non-exercise-specific feedback in steps (2) and (3).
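A toy sketch of such a step-based pipeline (the cache, checks, and test below are invented, not the course's actual grader): cheap steps run first, and unit tests run only when earlier steps cannot decide.

```python
import hashlib

result_cache: dict[str, str] = {}          # normalised submission hash -> verdict

def normalise(code: str) -> str:
    return "\n".join(line.strip() for line in code.strip().splitlines())

def static_check(code: str) -> str | None:
    """Very rough heuristic stand-in for the static analysis step."""
    if "while True" in code and "break" not in code:
        return "fail: possible infinite loop"
    return None

def run_unit_tests(code: str) -> str:
    namespace: dict = {}
    exec(code, namespace)                   # toy check only; never exec untrusted code like this
    return "pass" if namespace["double"](3) == 6 else "fail: wrong result"

def assess(code: str) -> str:
    key = hashlib.sha256(normalise(code).encode()).hexdigest()
    if key in result_cache:                 # step 1: cache of earlier, equivalent submissions
        return result_cache[key]
    verdict = static_check(code)            # step 2: static analysis heuristics
    if verdict is None:                     # (a learned model would sit here as step 3)
        verdict = run_unit_tests(code)      # step 4: traditional unit tests
    result_cache[key] = verdict
    return verdict

first = "def double(x):\n    return x * 2\n"
second = "def double(x):\n        return x * 2\n"   # same program, different indentation
print(assess(first))    # reaches step 4 and runs the unit test: pass
print(assess(second))   # answered from the cache, tests are not re-run
```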
- Training Language Models for Programming Feedback Using Automated Repair Tools
A4 Conference article (2023) Koutcheme, Charles
In introductory programming courses, automated repair tools (ARTs) are used to provide feedback to students struggling with debugging. Most successful ARTs take advantage of context-specific educational data to construct repairs to students' buggy code. Recent work in student program repair using large language models (LLMs) has also started to utilize such data. An underexplored area in this field is the use of ARTs in combination with LLMs. In this paper, we propose to transfer the repair capabilities of existing ARTs to open large language models by finetuning LLMs on ART corrections to buggy code. We experiment with this approach using three large datasets of Python programs written by novices. Our results suggest that a finetuned LLM provides more reliable and higher-quality repairs than the repair tool used for finetuning the model. This opens avenues for further deploying and using educational LLM-based repair techniques.
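The corpus-construction step can be sketched as follows (illustrative; `repair_with_art`, the example program, and the output filename are invented stand-ins for a real repair tool and dataset):

```python
import json

def repair_with_art(buggy_code: str) -> str:
    """Hypothetical stand-in for an automated repair tool (ART) call."""
    return buggy_code.replace("range(1, n)", "range(1, n + 1)")

buggy_submissions = [
    "def triangle_sum(n):\n    return sum(range(1, n))\n",   # off-by-one bug
]

with open("repair_finetune.jsonl", "w") as out:              # invented output filename
    for buggy in buggy_submissions:
        fixed = repair_with_art(buggy)
        if fixed != buggy:                                   # keep only programs the ART could fix
            pair = {"prompt": f"Fix this program:\n{buggy}", "completion": fixed}
            out.write(json.dumps(pair) + "\n")
```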
- Understanding student behaviors when learning online materials using fine-grained browsing data
School of Science | Master's thesis (2021-08-23) Koutcheme, Charles
Many institutions in the higher education sector now offer courses fully available through interactive web pages. The rise of such courses has encouraged multiple research endeavors to leverage the data that can be automatically collected from these courses to study how students learn in online contexts. The subject of our work comes from the observation that there is a knowledge gap about the specific details of how students navigate courses delivered through web pages. This study thus illustrates how analyzing fine-grained data about students' browsing can improve our understanding of how they learn in online courses. We show that such analysis can shed light on fundamental aspects of students' learning processes, mainly what knowledge they acquire and how they acquire it. For that purpose, we collected data from an introductory programming course given at Aalto University. We also created an interpretable model that defines students' browsing as a sequence of two simple actions: reading and moving. We specialized these two actions further by leveraging our previous knowledge about students' behaviors and the structure of the course studied. The model also introduces several new concepts, such as the ordering and position of material elements and the distance traveled. We then computed several statistics related to different dimensions of the model that leverage the newly introduced concepts. Using the computed statistics, we predicted students dropping out of the course. Our results show that (1) analyzing the distribution of and the relationships between these statistics gives significant insights into students' relationship with the course, and (2) we can predict at least 80 percent of students dropping out of the course during the first weeks with a minimum of 82 percent accuracy. Our findings imply that the way students navigate an online course is predictive of their interest in the course and suggest that further research in this area is necessary.
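The reading/moving abstraction can be sketched over a toy event log (invented data; the thesis model distinguishes further specialised actions and concepts):

```python
events = [   # invented log: (seconds since the page was opened, position of the material element)
    (0, 1), (40, 1), (95, 2), (120, 5), (180, 4),
]

actions = []
for (t0, p0), (t1, p1) in zip(events, events[1:]):
    if p1 == p0:
        actions.append(("read", t1 - t0, 0))
    else:
        actions.append(("move", t1 - t0, abs(p1 - p0)))   # distance traveled in elements

reading_time = sum(duration for kind, duration, _ in actions if kind == "read")
distance = sum(step for kind, _, step in actions if kind == "move")
print(actions)
print(f"time spent reading: {reading_time}s, distance traveled: {distance} elements")
```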
- Using Program Repair as a Proxy for Language Models' Feedback Ability in Programming Education
A4 Conference article (2024-06) Koutcheme, Charles; Dainese, Nicola; Hellas, Arto
One of the key challenges in programming education is being able to provide high-quality feedback to learners. Such feedback often includes explanations of the issues in students' programs coupled with suggestions on how to fix these issues. Large language models (LLMs) have recently emerged as valuable tools that can help in this effort. In this article, we explore the relationship between the program repair ability of LLMs and their proficiency in providing natural language explanations of coding mistakes. We outline a benchmarking study that evaluates leading LLMs (including open-source ones) on program repair and explanation tasks. Our experiments study the capabilities of LLMs both at the course level and at the programming concept level, allowing us to assess whether the programming concepts practised in exercises with faulty student programs relate to the performance of the models. Our results highlight that LLMs proficient in repairing student programs tend to provide more complete and accurate natural language explanations of code issues. Overall, these results enhance our understanding of the role and capabilities of LLMs in programming education. Using program repair as a proxy for explanation evaluation opens the door for cost-effective assessment methods.
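The proxy argument can be illustrated with a small correlation computation over made-up model scores (not the paper's results):

```python
from statistics import correlation   # Pearson correlation, available since Python 3.10

# Invented numbers for four models (not the paper's results).
repair_rate = [0.35, 0.52, 0.61, 0.78]          # fraction of student programs repaired
explanation_rating = [2.1, 3.0, 3.4, 4.2]       # mean explanation quality on a 1-5 scale

r = correlation(repair_rate, explanation_rating)
print(f"Pearson r = {r:.2f}")   # a strong positive r is what would support the proxy argument
```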