Imputation methods for left-censored data in mass spectrometry-based metabolomics

School of Science (Perustieteiden korkeakoulu) | Master's thesis

Date

2022-05-16

Major/Subject

Machine Learning, Data Science and Artificial Intelligence

Mcode

SCI3044

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

68 + 13

Abstract

Mass spectrometry-based metabolomics data usually contains a large number of missing values. One crucial type is left-censored (LC) values, which correspond to small concentrations, often below the limit of detection of the measuring equipment. An adequate imputation approach for LC values is needed because missing values may be present in some features for only one study group, and such features may contain vital information for discriminating between patients with different disease states. Most current missing value imputation methods focus on scenarios where values are missing by chance and pay less attention to values missing because of low concentration. We present a comprehensive comparison of common missing value imputation methods for mass spectrometry data. Furthermore, we evaluate a recently introduced method, hybrid random tail imputation, which aims to preserve significant association patterns. The methods were evaluated on real-world datasets with different numbers of features, sample sizes, and missing value proportions. The evaluation process involved generating complete datasets and artificially amputated datasets. Imputation quality was assessed by the normalized root mean squared error and by the preservation of statistically significant association patterns. The experiments show that the chosen data amputation method strongly affects the subsequent imputation. In agreement with previous work, we found that simple imputation methods generally perform poorly compared with more complex ones. One drawback of complex imputation methods is their long execution time, which may be prohibitive in setups requiring multiple imputations. We also found that hybrid random tail imputation is among the methods that best preserve real association patterns after imputation while achieving competitive normalized root mean squared error. Finally, we discuss possible improvements to hybrid random tail imputation, focusing on better detection of the type of missingness.
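
The evaluation loop described in the abstract (amputate a complete dataset, impute it, score the result) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the function names are hypothetical, the amputation simply masks the smallest values of each feature, half-minimum substitution stands in for a "simple" imputation method, and the error is normalized by the variance of the true values at the masked positions, which is one common convention among several. Hybrid random tail imputation itself is not implemented here.

```python
import numpy as np


def amputate_left_censored(X, proportion):
    """Mask the smallest `proportion` of values in each feature (column) as NaN.

    Assumption: a per-feature quantile acts as a stand-in detection limit.
    """
    X = np.asarray(X, dtype=float)
    thresholds = np.quantile(X, proportion, axis=0)  # one cutoff per column
    X_missing = X.copy()
    X_missing[X < thresholds] = np.nan
    return X_missing


def impute_half_minimum(X_missing):
    """Replace each NaN with half of the smallest observed value in its column."""
    half_min = np.nanmin(X_missing, axis=0) / 2.0
    return np.where(np.isnan(X_missing), half_min, X_missing)


def nrmse(X_true, X_imputed, mask):
    """Root mean squared error over masked entries, normalized by their variance.

    Assumption: other normalizations (range, overall standard deviation) exist.
    """
    diff = X_imputed[mask] - X_true[mask]
    return np.sqrt(np.mean(diff ** 2) / np.var(X_true[mask]))


# Usage on synthetic, log-normally distributed "intensities" (not real data).
rng = np.random.default_rng(0)
X_complete = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 4))
X_amputated = amputate_left_censored(X_complete, proportion=0.2)
missing_mask = np.isnan(X_amputated)
X_filled = impute_half_minimum(X_amputated)
print("NRMSE:", nrmse(X_complete, X_filled, missing_mask))
```

Because the complete data are known, the same mask can be reused to score any other imputation method, which is what makes a comparison across methods and amputation schemes possible.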

Supervisor

Hämäläinen, Wilhelmiina

Thesis advisor

Zhang, Yinjia

Keywords

missing value imputation, mass spectrometry, association patterns, metabolomics, data amputation
