Advancing towards personalized medicine: probabilistic machine learning and deep learning for health and genetics
School of Science | Doctoral thesis (article-based) | Defence date: 2025-04-30
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Authors
Wharrie, Sophie
Date
2025
Language
en
Pages
65 + app. 103
Series
Aalto University publication series Doctoral Theses, 84/2025
Abstract
This thesis advances probabilistic machine learning and deep learning methods for personalized medicine applications. Personalized medicine aims to tailor diagnosis, prevention and treatment choices for diseases to individual patient characteristics, and machine learning supports this by offering powerful tools for analyzing various types of individual-level health and biological data. The core machine learning challenge that this thesis aims to address is how to meet the statistical inference needs of individual-level analyses for personalized medicine applications, while effectively utilizing the power of large datasets that capture complex relationships explaining patient outcomes. Addressing this enables more effective machine learning models for the generative and predictive machine learning applications of interest for personalized medicine, which are explored in the articles of the thesis for large-scale health and biological data sources, including genetic biobanks, population-scale health registers, and longitudinal data from electronic health record (EHR) systems.
The first research question asks how to create individual-level synthetic datasets for high-dimensional genetic sequences and complex disease phenotypes. Synthetic data is an important tool for researchers developing and evaluating new computational methods for personalized medicine applications, such as polygenic risk scoring, but is difficult to generate effectively at scale from high-dimensional reference datasets with limited samples. The first contribution of the thesis is a new probabilistic machine learning approach and software tool that implements statistical models of the underlying generative processes and simulation-based inference techniques to create high-fidelity synthetic data for a large number of individuals, phenotypic traits and genetic variants.
The second and third research questions concern deep learning methods for modeling longitudinal health data to predict various individual-level health-related outcomes. The thesis introduces two techniques to more effectively utilize the informative statistical relationships in large data sources: a geometric deep learning approach that leverages biological relationships between individuals to improve predictive performance and explainability; and a Bayesian meta-learning approach that improves generalizability by pooling information from related supervised learning tasks based on similarities in the causal relationships underlying the outcomes being predicted. These methods are validated through two case studies: modeling the influence of family history on an individual's disease risk using data from Finland's nationwide health registry system, and early prediction of various stroke outcomes using data from the UK Biobank and FinnGen projects.
Description
Supervising professor
Kaski, Samuel, Prof., Aalto University, Department of Computer Science, Finland, and Prof., University of Manchester, United Kingdom
Keywords
probabilistic machine learning, deep learning, personalized medicine, synthetic data generation, geometric deep learning, Bayesian meta-learning, causal relationships, electronic health records, human genetics
Parts
- [Publication 1]: Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O’Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, and Andrea Ganna. HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes. Bioinformatics, Volume 39, Issue 9, September 2023.
Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202310046172
DOI: 10.1093/bioinformatics/btad535
- [Publication 2]: Sophie Wharrie, Zhiyu Yang, Andrea Ganna, and Samuel Kaski. Characterizing personalized effects of family information on disease risk using graph representation learning. In Proceedings of the 8th Machine Learning for Healthcare Conference, New York, United States, PMLR, 219:824-845, August 2023.
Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202401312218
- [Publication 3]: Sophie Wharrie, Lisa Eick, Lotta Mäkinen, Andrea Ganna, and Samuel Kaski. Bayesian Meta-Learning for Improving Generalizability of Health Prediction Models With Similar Causal Mechanisms. Submitted to a journal, December 2024. arXiv preprint arXiv:2310.12595