Missing Fairness: The Discriminatory Effect of Missing Values in Datasets on Fairness in Machine Learning

Perustieteiden korkeakoulu | Master's thesis

Date

2020-12-14

Major/Subject

Machine Learning, Data Science and Artificial Intelligence

Mcode

SCI3044

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

62+4

Abstract

As we enter a new decade, more and more governance in our society is assisted by autonomous decision-making systems enabled by artificial intelligence and machine learning. Recently, a growing number of academic and general-audience publications have drawn attention to the negative side effects accompanying such systems, under the umbrella term of algorithmic fairness. While most of these articles focus on a small number of well-studied cases, to the best of our knowledge, none have dealt with the kind of large real-world datasets one might use to train models in an industrial setting. Datasets are collections of observations recorded by humans and therefore contain many different forms of bias. Many proposed solutions to combat structural discrimination focus on detecting and mitigating unfairness in datasets and machine learning models. The readily available implementations and services adhere to the common practice of complete-case analysis, filtering out samples that contain missing values. This often discards large portions of the recorded data, further increasing subgroup imbalances and biases. In this thesis, we analyze a sparse real-world dataset and the effect of missing values on the predictive power and measurable discrimination of models trained on it. We start with a brief review of the current literature on algorithmic fairness, covering the causes of unfairness in the form of various biases as well as current fairness definitions and measures. As our dataset, we acquired self-reported law school admissions data from a popular internet platform in the USA. We explore patterns of missingness in the data and ways of imputing values using established methods prior to training and tuning our models. Finally, we evaluate the models with respect to well-established fairness measures and detect a significant decrease in discriminatory bias for the subset of the data with missing values.
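To make the methodological contrast concrete, below is a minimal, self-contained sketch (not the thesis code; the synthetic data, group labels, missingness rates, and decision threshold are illustrative assumptions) comparing complete-case analysis with simple mean imputation and measuring the statistical parity difference, one well-established fairness measure of the kind evaluated in the thesis:

```python
# Minimal sketch: complete-case analysis vs. mean imputation under a
# group-dependent missingness pattern, evaluated with the statistical
# parity difference. All data below is synthetic and purely illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n, p=[0.7, 0.3]),  # protected attribute
    "score": rng.normal(0.5, 0.15, size=n),                 # model input feature
})
# Assumption: the minority group's records are missing more often.
missing = rng.random(n) < np.where(df["group"] == "B", 0.4, 0.1)
df.loc[missing, "score"] = np.nan

def statistical_parity_difference(frame):
    """P(decision=1 | group=A) - P(decision=1 | group=B)."""
    rates = frame.groupby("group")["decision"].mean()
    return rates["A"] - rates["B"]

# 1) Complete-case analysis: drop incomplete rows, then decide.
cc = df.dropna(subset=["score"]).copy()
cc["decision"] = (cc["score"] > 0.5).astype(int)
print(f"complete-case SPD: {statistical_parity_difference(cc):+.3f}")

# 2) Mean imputation: keep every record, fill the gaps, then decide.
imp = df.copy()
imp["score"] = imp["score"].fillna(imp["score"].mean())
imp["decision"] = (imp["score"] > 0.5).astype(int)
print(f"mean-imputed SPD: {statistical_parity_difference(imp):+.3f}")
```

Under this assumed missingness pattern, dropping incomplete rows shrinks the minority group disproportionately before any fairness measurement takes place, which is the effect the thesis investigates on real admissions data with more sophisticated imputation methods.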

Supervisor

Gionis, Aristides

Thesis advisor

Žliobaitė, Indrė

Keywords

fairness, missing values, data imputation, algorithmic bias
