Handling missing values with hybrid approaches in supervised setting

School of Science (Perustieteiden korkeakoulu) | Master's thesis

Date

2022-03-21

Major/Subject

Computer Science

Mcode

SCI3042

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

40+7

Abstract

Missing data has become an increasingly important issue for training deep neural networks, especially on large-scale datasets. Missingness falls into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Several state-of-the-art (SOTA) data imputation algorithms have been proposed to improve the handling of incomplete datasets. However, the effect of the label distribution in supervised learning has been overlooked. Recent studies show that learning the label distribution conditionally on the missing data is important and leads to large performance gains. The aim of this thesis is to implement a hybrid approach that combines a generative deep latent variable model (DLVM) with a discriminative model to impute MAR data using importance weighted variational inference, including three training strategies that outperform zero imputation. Furthermore, we introduce the label distribution into a hybrid model consisting of a DLVM and a convolutional neural network (CNN), in the context of MNAR image data. The experiments show that the joint model achieves high prediction accuracy and imputation quality on the MNIST dataset.
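The three missingness categories named in the abstract, and the zero-imputation baseline, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `missingness_masks` and `zero_impute`, the two-column toy layout, and the logistic missingness model are all assumptions made here for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used here as an assumed missingness model."""
    return 1.0 / (1.0 + np.exp(-z))


def missingness_masks(x, rate=0.3, seed=0):
    """Return boolean masks (True = missing) for column 1 of a 2-column array.

    Hypothetical helper: column 0 is treated as always observed,
    column 1 is the feature that may go missing.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # MCAR: missingness is independent of every value in the data.
    mcar = rng.random(n) < rate
    # MAR: missingness depends only on the observed column 0.
    mar = rng.random(n) < sigmoid(x[:, 0] - x[:, 0].mean())
    # MNAR: missingness depends on the (unobserved) value of column 1 itself.
    mnar = rng.random(n) < sigmoid(x[:, 1] - x[:, 1].mean())
    return {"MCAR": mcar, "MAR": mar, "MNAR": mnar}


def zero_impute(x, mask, column=1):
    """Baseline imputation: replace missing entries of one column with 0."""
    out = x.copy()
    out[mask, column] = 0.0
    return out
```

Under the MNAR mask, large values of column 1 are more likely to be missing, so the observed entries are a biased sample. That bias is what makes MNAR harder than MCAR or MAR for any imputation method that ignores the missingness mechanism.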

Supervisor

Marttinen, Pekka

Thesis advisor

Cui, Tianyu

Keywords

missing data, data imputation, auto-encoder, variational inference, importance sampling, supervised learning
