Efficient contextual embeddings to improve the transferability of disease trajectory projections with generative models
| dc.contributor | Aalto-yliopisto | fi |
| dc.contributor | Aalto University | en |
| dc.contributor.advisor | Hartonen, Tuomo | |
| dc.contributor.advisor | Ganna, Andrea | |
| dc.contributor.author | Burian, Jonas | |
| dc.contributor.school | Perustieteiden korkeakoulu | fi |
| dc.contributor.school | School of Science | en |
| dc.contributor.supervisor | Marttinen, Pekka | |
| dc.date.accessioned | 2025-08-19T17:08:24Z | |
| dc.date.available | 2025-08-19T17:08:24Z | |
| dc.date.issued | 2025-07-13 | |
| dc.description.abstract | A growing global burden of non-communicable diseases demands enhanced preventive healthcare measures to avert escalating societal costs. Electronic Health Records (EHRs) hold considerable promise in this regard, as they allow past disease trajectories to be used to predict future risks. However, the transferability of current projection methods is limited by their dependence on the specific biomedical vocabularies on which they were trained, which restricts their applicability across countries. This thesis proposes a new embedding model based on the BERT architecture that integrates codes from disparate medical vocabularies. The model is trained using a contrastive objective that ensures the produced embeddings reflect the semantics and relationships of the underlying concepts. Integrating the model into Delphi, a transformer-based method for producing disease risk predictions, aims to enable disease trajectory projections based on medical events encoded in numerous medical vocabularies, possibly ones not included in the training data of the model. In an external evaluation, the proposed embedding model aligned related concepts and differentiated unrelated concepts better than the other biomedical concept embedding models in the comparison, with the exception of the OpenAI embeddings. It achieves this at only 20% of the size of the smallest embedding models in the comparison. Moreover, after integration with the embedding model, Delphi transfers effectively between synonymous codes from different vocabularies, with minimal impact on performance. Transferability was also achieved between countries, specifically from the United Kingdom to Finland. Here, the model improved upon the original Delphi by incorporating codes that were not part of the training vocabularies and by utilizing completely different vocabularies. Given these results, the proposed model has the potential to facilitate enhanced disease prevention, for instance, through the development of cross-country risk prediction models and by utilizing the complete array of available EHR data to make more precise risk predictions. | en |
| dc.format.extent | 118 | |
| dc.format.mimetype | application/pdf | en |
| dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/138103 | |
| dc.identifier.urn | URN:NBN:fi:aalto-202508196332 | |
| dc.language.iso | en | en |
| dc.programme | Master's Programme in Computer, Communication and Information Sciences | en |
| dc.programme.major | Machine Learning, Data Science and Artificial Intelligence | en |
| dc.subject.keyword | biomedical concept embeddings | en |
| dc.subject.keyword | contrastive learning | en |
| dc.subject.keyword | disease risk prediction | en |
| dc.subject.keyword | transferability | en |
| dc.subject.keyword | electronic health records | en |
| dc.subject.keyword | generative models | en |
| dc.title | Efficient contextual embeddings to improve the transferability of disease trajectory projections with generative models | en |
| dc.type | G2 Pro gradu, diplomityö | fi |
| dc.type.ontasot | Master's thesis | en |
| dc.type.ontasot | Diplomityö | fi |
| local.aalto.electroniconly | yes | |
| local.aalto.openaccess | no | |