Handling missing values with hybrid approaches in supervised setting
School of Science | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Master’s Programme in Computer, Communication and Information Sciences
Abstract: Missing data has become an increasingly important issue for training deep neural networks, especially on large-scale datasets. Missingness is commonly categorized into three mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Several state-of-the-art (SOTA) data imputation algorithms have been proposed to improve the handling of incomplete datasets. However, the effect of the label distribution in supervised learning has been overlooked. Recent studies show that learning the label distribution conditionally on the missing data is important and leads to large performance gains. The aim of this thesis is to implement a hybrid approach that combines a generative deep latent variable model (DLVM) with a discriminative model to impute MAR data using importance-weighted variational inference, including three training strategies that outperform zero imputation. Furthermore, we introduce the label distribution into a hybrid model consisting of a DLVM and a convolutional neural network (CNN), in the context of MNAR image data. The experiments show that the joint model achieves strong prediction accuracy and imputation results on the MNIST dataset.
Thesis advisor: Cui, Tianyu
Keywords: missing data, data imputation, auto-encoder, variational inference, importance sampling, supervised learning