Handling missing values with hybrid approaches in supervised setting
School of Science | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Authors
Date
2022-03-21
Department
Major/Subject
Computer Science
Mcode
SCI3042
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
Language
en
Pages
40+7
Series
Abstract
Missing data has become an increasingly important issue for training deep neural networks, especially in the case of large-scale datasets. Missing data can be categorized into three groups: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Several state-of-the-art (SOTA) data imputation algorithms have been proposed to improve the handling of incomplete datasets. However, the effect of the label distribution in supervised learning has been overlooked. Recent studies show that it is important to learn the label distribution conditionally on the missing data, which leads to large performance gains. The aim of this thesis is to implement a hybrid approach that combines a generative deep latent variable model (DLVM) and a discriminative model to impute MAR data with importance weighted variational inference, including three training strategies that outperform zero imputation. Furthermore, we introduce the label distribution into a hybrid model consisting of a DLVM and a convolutional neural network (CNN), in the context of MNAR image data. The experiments show that the joint model achieves high prediction accuracy and imputation quality on the MNIST dataset.
Description
Supervisor
Marttinen, Pekka
Thesis advisor
Cui, Tianyu
Keywords
missing data, data imputation, auto-encoder, variational inference, importance sampling, supervised learning