Handling missing values with hybrid approaches in supervised setting

School of Science | Master's thesis
Date
2022-03-21
Major/Subject
Computer Science
Mcode
SCI3042
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
Language
en
Pages
40+7
Abstract
Missing data has become an increasingly important issue in training deep neural networks, especially on large-scale datasets. Missingness mechanisms fall into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Several state-of-the-art (SOTA) data imputation algorithms have been proposed to improve the handling of incomplete datasets. However, the effect of the label distribution in supervised learning has been overlooked. Recent studies show that learning the label distribution conditionally on the missing data leads to substantial performance gains. The aim of this thesis is to implement a hybrid approach that combines a generative deep latent variable model (DLVM) with a discriminative model to impute MAR data using importance-weighted variational inference, including three training strategies that outperform zero imputation. Furthermore, we introduce the label distribution into a hybrid model consisting of a DLVM and a convolutional neural network (CNN), in the context of MNAR image data. The experiments show that the joint model achieves high prediction accuracy and imputation quality on the MNIST dataset.
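The three missingness mechanisms and the zero-imputation baseline named in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis code: the helper names `make_masks` and `zero_impute`, the two-column toy data, and the logistic link between values and missingness probability are all assumptions made here for illustration only.

```python
import numpy as np


def make_masks(x, rate=0.3, seed=0):
    """Return boolean masks (True = observed) for column 1 of x.

    Hypothetical helper: column 0 is assumed to be always observed.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]

    # MCAR: missingness is independent of every value in the data.
    mcar = rng.random(n) >= rate

    # MAR: missingness of column 1 depends only on the observed column 0.
    p_mar = 1.0 / (1.0 + np.exp(-x[:, 0]))  # assumed logistic link
    mar = rng.random(n) >= p_mar

    # MNAR: missingness of column 1 depends on its own, unobserved value.
    p_mnar = 1.0 / (1.0 + np.exp(-x[:, 1]))  # assumed logistic link
    mnar = rng.random(n) >= p_mnar

    return {"MCAR": mcar, "MAR": mar, "MNAR": mnar}


def zero_impute(x, observed):
    """Zero-imputation baseline: replace missing entries of column 1 with 0."""
    out = x.copy()
    out[~observed, 1] = 0.0
    return out


# Toy usage on standard-normal data (illustrative only).
rng = np.random.default_rng(1)
x = rng.standard_normal((5000, 2))
masks = make_masks(x)
x_zero = zero_impute(x, masks["MNAR"])
```

Under the MNAR mask in this sketch, larger values of column 1 are more likely to be missing, so the observed entries form a biased sample of that column. This is the kind of setting in which a plain baseline such as zero imputation can be expected to struggle.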
Supervisor
Marttinen, Pekka
Thesis advisor
Cui, Tianyu
Keywords
missing data, data imputation, auto-encoder, variational inference, importance sampling, supervised learning