Benchmarking of deep neural networks for integration of multiomics data for cancer subtyping

School of Science | Master's thesis

Language

en

Pages

60

Abstract

Molecular subtyping of cancer is crucial for precision medicine. However, single-omics analyses capture only part of tumor diversity, which leads to incomplete characterization of disease mechanisms and limits their ability to guide personalized therapies. Integrating multiple omics layers provides a more comprehensive view of cancer, yet systematic benchmarking of deep learning methods on bulk data remains limited. This thesis evaluates five approaches for unsupervised multi-omics integration: scMM, Concerto, OmiEmbed, MOCSS, and MultiVI. Two TCGA datasets were analyzed: breast invasive carcinoma (BRCA) and liver hepatocellular carcinoma (LIHC). Each dataset comprised mRNA, DNA methylation, and miRNA data. Performance was evaluated using clustering accuracy, normalized mutual information, k-nearest neighbors classification, and UMAP visualizations. The results indicate that method performance varies considerably between datasets. In BRCA, the VAE-based OmiEmbed achieved the highest clustering accuracy, while scMM performed best in classification. In LIHC, MOCSS, which combines autoencoders with a contrastive learning objective, outperformed the other methods in clustering, while scMM again gave the best classification results. Across both datasets, mRNA was the most informative omics layer, whereas methylation-miRNA pairings were less helpful. In general, variational autoencoder frameworks performed better on datasets such as BRCA, which have more samples but fewer features, where they could exploit their generative modeling and reconstruction capabilities more effectively. By contrast, hybrid approaches such as MOCSS proved more effective in settings such as LIHC, which has fewer samples but a much higher number of features, resulting in high-dimensional and noisy data.
These findings suggest that the suitability of each approach depends less on dataset size alone and more on the interaction between sample size, dimensionality, and data complexity.
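
Two of the evaluation metrics named above, clustering accuracy and normalized mutual information, can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names, the toy labels, the brute-force cluster-to-class matching (practical only for a handful of clusters), and the arithmetic-mean normalization of the mutual information are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations
from math import log


def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Pad the class list with None so surplus clusters can stay unmatched.
    classes += [None] * max(0, len(clusters) - len(classes))
    pair_counts = Counter(zip(cluster_ids, true_labels))
    best_hits = 0
    # Brute force over mappings; fine for a few clusters, slow for many.
    for assigned in permutations(classes, len(clusters)):
        hits = sum(pair_counts[(c, k)] for c, k in zip(clusters, assigned))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, cluster_ids):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true_labels)
    count_true = Counter(true_labels)
    count_cluster = Counter(cluster_ids)
    joint = Counter(zip(true_labels, cluster_ids))
    mi = sum(
        (c / n) * log(c * n / (count_true[t] * count_cluster[k]))
        for (t, k), c in joint.items()
    )
    h_true, h_cluster = entropy(true_labels), entropy(cluster_ids)
    if h_true == 0 or h_cluster == 0:
        # Degenerate case: at least one labeling has a single group.
        return 1.0 if h_true == h_cluster else 0.0
    return mi / ((h_true + h_cluster) / 2)


# Toy example with hypothetical subtype labels and cluster assignments.
subtypes = ["A", "A", "B", "B", "C", "C"]
clusters = [0, 0, 1, 1, 2, 1]
acc = clustering_accuracy(subtypes, clusters)
nmi = normalized_mutual_information(subtypes, clusters)
```

Both scores are invariant to how the cluster ids are numbered, which is what makes them usable for unsupervised embeddings. For larger numbers of clusters, the brute-force search would normally be replaced by the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`), and `sklearn.metrics.normalized_mutual_info_score` offers a library implementation of NMI.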

Supervisor

Marttinen, Pekka

Thesis advisor

Safinianaini, Negar
