Imputation methods for left-censored data in mass spectrometry-based metabolomics

dc.contributor: Aalto-yliopisto (fi)
dc.contributor: Aalto University (en)
dc.contributor.advisor: Zhang, Yinjia
dc.contributor.author: Gijon Leyva, Erick
dc.contributor.school: Perustieteiden korkeakoulu (fi)
dc.contributor.supervisor: Hämäläinen, Wilhelmiina
dc.date.accessioned: 2022-05-22T17:08:19Z
dc.date.available: 2022-05-22T17:08:19Z
dc.date.issued: 2022-05-16
dc.description.abstract: Mass spectrometry-based metabolomics data usually contain a large number of missing values. One crucial type is left-censored (LC) values, which correspond to small values, often below the limit of detection of the measuring equipment. An adequate imputation approach for LC values is needed because some features may have missing values in only one study group, and such features may contain vital information for discriminating between patients with different disease states. Most current missing value imputation methods focus on scenarios where values are missing by chance and pay less attention to values missing because of low concentration. We present a comprehensive comparison of common missing value imputation methods for mass spectrometry data. Furthermore, we evaluate a recently introduced method, hybrid random tail imputation, which aims to preserve significant association patterns. The methods were evaluated on real-world datasets that differ in number of features, sample size, and proportion of missing values. The evaluation process involved generating complete datasets and artificially amputated datasets. Imputation quality was then measured by the normalized root mean squared error and by the preservation of statistically significant association patterns. The experiments show that the chosen data amputation method strongly affects the subsequent imputation. In agreement with previous work, we found that, in general, simple imputation methods perform poorly compared with more complex ones. One drawback of complex imputation methods is their long execution time, which may be prohibitive in setups requiring multiple imputations. We also found that hybrid random tail imputation is among the methods that best preserve real association patterns after imputation while achieving competitive normalized root mean squared error. Finally, we discuss possible improvements to hybrid random tail imputation, focusing on better detection of the type of missingness. (en)
dc.format.extent: 68 + 13
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/114521
dc.identifier.urn: URN:NBN:fi:aalto-202205223368
dc.language.iso: en (en)
dc.programme: Master’s Programme in Computer, Communication and Information Sciences (fi)
dc.programme.major: Machine Learning, Data Science and Artificial Intelligence (fi)
dc.programme.mcode: SCI3044 (fi)
dc.subject.keyword: missing value imputation (en)
dc.subject.keyword: mass spectrometry (en)
dc.subject.keyword: association patterns (en)
dc.subject.keyword: metabolomics (en)
dc.subject.keyword: data amputation (en)
dc.title: Imputation methods for left-censored data in mass spectrometry-based metabolomics (en)
dc.type: G2 Pro gradu, diplomityö (fi)
dc.type.ontasot: Master's thesis (en)
dc.type.ontasot: Diplomityö (fi)
local.aalto.electroniconly: yes
local.aalto.openaccess: no
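
The evaluation loop summarized in the abstract (amputate a complete dataset, impute it, score the result) can be illustrated with a minimal sketch for a single feature. Everything here is an assumption made for illustration and is not taken from the thesis: the function names are hypothetical, the amputation simply hides the smallest values, half-minimum imputation stands in as an example of a simple method, and the error is normalized by the variance of the true values, which is only one of several common NRMSE definitions.

```python
import math
import statistics


def amputate_left_censored(values, proportion):
    """Hide roughly the smallest `proportion` of values (set to None).

    Assumption: this mimics a detection limit. Ties at the threshold
    are all hidden, so slightly more values may be removed.
    """
    n_missing = int(len(values) * proportion)
    if n_missing == 0:
        return list(values)
    threshold = sorted(values)[n_missing - 1]
    return [None if v <= threshold else v for v in values]


def impute_half_min(masked):
    """Replace each missing entry with half of the smallest observed value."""
    observed = [v for v in masked if v is not None]
    fill = min(observed) / 2
    return [fill if v is None else v for v in masked]


def nrmse(true_values, imputed, masked):
    """RMSE over the imputed entries, normalized by the variance of the truth.

    Assumption: sqrt(mean squared error / population variance); other
    normalizations (range, standard deviation) are also in use.
    """
    missing = [i for i, v in enumerate(masked) if v is None]
    mse = sum((true_values[i] - imputed[i]) ** 2 for i in missing) / len(missing)
    return math.sqrt(mse / statistics.pvariance(true_values))


# Toy feature with ten complete measurements (made-up numbers).
complete = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
masked = amputate_left_censored(complete, 0.2)
imputed = impute_half_min(masked)
print(round(nrmse(complete, imputed, masked), 3))  # 0.174
```

A lower score means the imputed values lie closer to the hidden ones. The second criterion named in the abstract, preservation of statistically significant association patterns, is not captured by this sketch.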
