Missing Fairness: The Discriminatory Effect of Missing Values in Datasets on Fairness in Machine Learning
School of Science (Perustieteiden korkeakoulu)
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Authors
Date
2020-12-14
Department
Major/Subject
Machine Learning, Data Science and Artificial Intelligence
Mcode
SCI3044
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
Language
en
Pages
62+4
Series
Abstract
As we enter a new decade, more and more governance in our society is assisted by autonomous decision-making systems, enabled by artificial intelligence and machine learning. Recently, a growing number of academic and general-audience publications have raised awareness of the negative side effects accompanying such systems, under the umbrella term of algorithmic fairness. While most of these articles focus on a small number of well-studied cases, to the best of our knowledge, none have dealt with the kind of large real-world datasets one might use to train models in an industrial setting. Datasets are collections of observations recorded by humans and carry many different forms of bias. Many proposed solutions to combat this structural discrimination focus on detecting and mitigating unfairness in datasets and machine learning models. The readily available implementations and services adhere to the common practice of complete-case analysis, filtering out samples that contain missing values. This often discards large portions of the recorded data, further increasing subgroup imbalances and biases. In this thesis, we analyze a sparse real-world dataset and the effect of missing values on the predictive power and measurable discrimination of models trained on it. We start with a brief review of the current literature on algorithmic fairness, that is, the causes of unfairness in the form of various biases, as well as the most current fairness definitions and measures. For our dataset, we acquired self-reported law school admissions data from a popular internet platform in the USA. We explore patterns of missingness in the data and ways of imputing values with established methods prior to training and tuning our models. Finally, we evaluate the performance of the models with respect to well-established fairness measures and detect a significant decrease in discriminatory biases for the subset of data with missing values.
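The abstract contrasts complete-case analysis with imputation and mentions evaluation against group fairness measures. The following Python sketch is not from the thesis: the synthetic data, column names, the mean-imputation strategy, and the demographic-parity statistic are all illustrative assumptions. It shows how dropping samples with missing values versus imputing them can change a simple fairness measure when missingness is more frequent in one subgroup.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "gpa": rng.normal(3.0, 0.4, n),
    "lsat": rng.normal(155.0, 8.0, n),
    "group": rng.integers(0, 2, n),  # hypothetical binary protected attribute
})
df["admitted"] = ((df["gpa"] > 3.0) & (df["lsat"] > 152.0)).astype(int)

# Values are more often missing for group 1, so complete-case analysis
# shrinks that subgroup disproportionately.
df.loc[rng.random(n) < 0.1 + 0.3 * df["group"], "lsat"] = np.nan

features = ["gpa", "lsat"]

# Complete-case analysis: drop every sample containing a missing value.
complete_case = df.dropna(subset=features)

# Imputation: keep all samples, fill missing values with the column mean.
imputed = df.copy()
imputed[features] = SimpleImputer(strategy="mean").fit_transform(imputed[features])

def demographic_parity_difference(frame):
    # |P(pred = 1 | group = 0) - P(pred = 1 | group = 1)| for a classifier
    # trained and evaluated on `frame` (in-sample, for illustration only).
    model = LogisticRegression().fit(frame[features], frame["admitted"])
    pred = pd.Series(model.predict(frame[features]))
    rates = pred.groupby(frame["group"].to_numpy()).mean()
    return abs(rates[0] - rates[1])

print("complete-case DPD:", demographic_parity_difference(complete_case))
print("imputed       DPD:", demographic_parity_difference(imputed))

Comparing the two printed values hints at the thesis's setup: the choice between discarding and imputing incomplete samples directly affects the measured disparity, not only the predictive performance.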
Supervisor
Gionis, Aristides
Thesis advisor
Žliobaitė, Indrė
Keywords
fairness, missing values, data imputation, algorithmic bias