Explaining Machine Learning Models by Generating Counterfactuals

School of Science (Perustieteiden korkeakoulu) | Master's thesis
Date
2019-08-19
Major/Subject
Machine Learning, Data Science and Artificial Intelligence
Mcode
SCI3044
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
Language
en
Pages
47+4
Abstract
Nowadays, machine learning is applied in many domains, including safety-critical areas that directly affect our lives. These systems are so complex, and rely on such large amounts of training data, that we risk creating systems we do not understand. This may lead to undesired behavior such as fatal decisions, discrimination, ethnic bias, and racism. Moreover, the European Union recently adopted the General Data Protection Regulation (GDPR), which requires companies to provide a meaningful explanation of the logic behind decisions made by machine learning systems when these decisions directly affect a human being. We address the problem of explaining various machine learning models by generating counterfactuals for given data points. A counterfactual is a transformation that shows how to alter an input object so that a classifier predicts a different class. Counterfactuals help us better understand why particular classification decisions are made. By examining the alterations that must be made to data instances, they can aid in troubleshooting a classifier and identifying biases. For example, suppose a loan application system denies a loan to a particular person, and we find a counterfactual indicating that the person's gender or race must change for the loan to be approved. Then we have identified a bias in the model, and we need to study the classifier more closely and retrain it to avoid such undesired behavior. In this thesis, we propose a new framework for generating counterfactuals for a set of data points. The framework aims to find a set of similar transformations of the data points such that these changes significantly reduce the probability of the target class. We argue that finding similar transformations for a set of data points yields more robust explanations of classifiers. We demonstrate our framework on three types of data: tabular data, images, and text.
We evaluate our model on both simple and real-world datasets, including ImageNet and 20 Newsgroups.
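The idea of one shared transformation that lowers the target-class probability for a whole set of points can be illustrated with a minimal sketch. This is not the framework proposed in the thesis, whose exact formulation this abstract does not give. The sketch assumes a differentiable binary logistic-regression classifier with hypothetical weights `w` and bias `b`. It uses plain gradient descent to find a single perturbation `delta`, added to every point, that reduces the mean target-class probability, while an L2 penalty with an assumed weight `lam` keeps the change small. The function name `shared_counterfactual` and the example data are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping logits to target-class probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def shared_counterfactual(X, w, b, lam=0.1, lr=0.5, steps=200):
    """Hypothetical sketch: find ONE perturbation delta, added to every row
    of X, that lowers the mean target-class probability of a logistic
    regression model p(x) = sigmoid(w.x + b). The L2 penalty lam*||delta||^2
    keeps the shared transformation small."""
    delta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid((X + delta) @ w + b)  # target-class probability per point
        # For a linear logit, d mean(p) / d delta = mean(p * (1 - p)) * w;
        # the penalty contributes 2 * lam * delta.
        grad = np.mean(p * (1.0 - p)) * w + 2.0 * lam * delta
        delta -= lr * grad  # gradient-descent step on the shared delta
    return delta


if __name__ == "__main__":
    # Toy classifier and a small set of points it assigns to the target class.
    w = np.array([2.0, -1.0])
    b = 0.0
    X = np.array([[1.0, 0.0], [1.5, 0.5], [2.0, 1.0]])
    delta = shared_counterfactual(X, w, b)
    print("shared delta:", delta)
    print("mean p before:", sigmoid(X @ w + b).mean())
    print("mean p after: ", sigmoid((X + delta) @ w + b).mean())
```

Because the same `delta` is applied to every point, its largest components hint at which features the classifier relies on across the whole set rather than for a single instance. A per-instance counterfactual would instead optimize a separate perturbation for each row.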
Supervisor
Gionis, Aristides
Thesis advisor
Gionis, Aristides
Keywords
machine learning, interpretability, counterfactuals, transparency