Automatic Algorithm Recognition Based on Programming Schemas and Beacons - A Supervised Machine Learning Classification Approach

Thumbnail Image
Journal Title
Journal ISSN
Volume Title
School of Science | Doctoral thesis (article-based) | Defence date: 2013-03-08
Checking the digitized thesis and permission for publishing
Instructions for the author
Degree programme
117 + app. 137
Aalto University publication series DOCTORAL DISSERTATIONS, 17/2013
In this thesis, we present techniques to recognize basic algorithms covered in computer science education from source code. The techniques use various software metrics, language constructs and other characteristics of source code, as well as the concept of schemas and beacons from program comprehension models. Schemas are high level programming knowledge with detailed knowledge abstracted out. Beacons are statements that imply specific structures in a program. Moreover, roles of variables constitute an important part of the techniques. Roles are concepts that describe the behavior and usage of variables in a program. They have originally been introduced to help novices learn programming. We discuss two methods for algorithm recognition. The first one is a classification method based on a supervised machine learning technique. It uses the vectors of characteristics and beacons automatically computed from the algorithm implementations of a training set to learn what characteristics and beacons can best describe each algorithm. Based on these observed instance-class pairs, the system assigns a class to each new input algorithm implementation according to its characteristics and beacons. We use the C4.5 algorithm to generate a decision tree that performs the task. In the second method, the schema detection method, algorithms are defined as schemas that exist in the knowledge base of the system. To identify an algorithm, the method searches the source code to detect schemas that correspond to those predefined schemas. Moreover, we present a method that combines these two methods: it first applies the schema detection method to extract algorithmic schemas from the given program and then proceeds to the classification method applied to the schema parts only. This enhances the reliability of the classification method, as the characteristics and beacons are computed only from the algorithm implementation code, instead of the whole given program. We discuss several empirical studies conducted to evaluate the performance of the methods. Some results are as follows: evaluated by leave-one-out cross-validation, the estimated classification accuracy for sorting algorithms is 98,1%, for searching, heap, basic tree traversal and graph algorithms 97,3% and for the combined method (on sorting algorithms and their variations from real student submissions) 97,0%. For the schema detection method, the accuracy is 88,3% and 94,1%, respectively. In addition, we present a study for categorizing student-implemented sorting algorithms and their variations in order to find problematic solutions that would allow us to give feedback on them. We also explain how these variations can be automatically recognized.
Supervising professor
Malmi, Lauri, Prof., Aalto University, Finland
Thesis advisor
Korhonen, Ari, Dr., Aalto University, Finland
algorithm recognition, schema detection, beacons, roles of variables, program comprehension, automated assessment
Other note
  • [Publication 1]: Ahmad Taherkhani, Ari Korhonen and Lauri Malmi. Recognizing Algorithms Using Language Constructs, Software Metrics and Roles of Variables: An Experiment with Sorting Algorithms. The Computer Journal, Volume 54 Issue 7, pages 1049–1066, June 2011.
  • [Publication 2]: Ahmad Taherkhani. Using Decision Tree Classifiers in Source Code Analysis to Recognize Algorithms: An Experiment with Sorting Algorithms.The Computer Journal, Volume 54 Issue 11, pages 1845–1860, October 2011.
  • [Publication 3]: Ahmad Taherkhani, Ari Korhonen and Lauri Malmi. Categorizing Variations of Student-Implemented Sorting Algorithms. Computer Science Education, Volume 22 Issue 2, pages 109–138, June 2012.
  • [Publication 4]: Ahmad Taherkhani, Ari Korhonen and Lauri Malmi. Automatic Recognition of Students’ Sorting Algorithm Implementations in a Data Structures and Algorithms Course. In Proceedings of the 12th Koli Calling International Conference on Computing Education Research, Tahko, Finland,10 pages, 15–18 November 2012, to appear.
  • [Publication 5]: Ahmad Taherkhani. Automatic Algorithm Recognition Based on Programming Schemas. In Proceedings of the 23th Annual Workshop on the Psychology of Programming Interest Group (PPIG ’11), University of York, UK, 12 pages, 6–8 September 2011.
  • [Publication 6]: Ahmad Taherkhani and Lauri Malmi. Beacon- and Schema-Based Method for Recognizing Algorithms from Students’ Source Code. Submitted to Journal of Educational Data Mining, 23 pages, submitted in June 2012.
  • [Publication 7]: Ahmad Taherkhani. Schema Detection and Beacon-Based Classification for Algorithm Recognition. In Proceedings of the 24th Annual Workshop on the Psychology of Programming Interest Group (PPIG ’12), London Metropolitan University, UK, 12 pages, 21–23 November 2012.