Efficient contextual embeddings to improve the transferability of disease trajectory projections with generative models

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Hartonen, Tuomo
dc.contributor.advisor: Ganna, Andrea
dc.contributor.author: Burian, Jonas
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.school: School of Science [en]
dc.contributor.supervisor: Marttinen, Pekka
dc.date.accessioned: 2025-08-19T17:08:24Z
dc.date.available: 2025-08-19T17:08:24Z
dc.date.issued: 2025-07-13
dc.description.abstract: A growing global burden of non-communicable diseases demands enhanced preventive healthcare measures to avert escalating societal costs. Electronic Health Records (EHRs) hold considerable promise in this regard, as they make it possible to use past disease trajectories to predict future risks. However, the transferability of current projection methods is limited because they depend on the specific biomedical vocabularies on which they were trained, which restricts their applicability across countries. This thesis proposes a new embedding model based on the BERT architecture that integrates codes from disparate medical vocabularies. The model is trained using a contrastive objective that ensures the produced embeddings reflect the semantics and relationships of the underlying concepts. Integrating it into Delphi, a transformer-based method for producing disease risk predictions, ultimately aims to enable disease trajectory projections based on medical events encoded in numerous medical vocabularies, including vocabularies that may not be part of the model's training data. In an external evaluation, the proposed embedding model aligned related concepts and separated unrelated ones better than other models for embedding biomedical concepts, with the exception of the OpenAI embeddings. It achieves this at only 20% of the size of the smallest embedding models in the comparison. Moreover, after integration with the embedding model, Delphi transfers effectively between synonymous codes from different vocabularies, with minimal impact on performance. Transferability was also achieved between countries, specifically from the United Kingdom to Finland. Here, the model improved upon the original Delphi by incorporating codes that are not part of the training vocabularies and by using entirely different vocabularies.
Given these results, the proposed model has the potential to facilitate enhanced disease prevention, for instance through the development of cross-country risk prediction models and by using the complete array of available EHR data to make more precise risk predictions. [en]
dc.format.extent: 118
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/138103
dc.identifier.urn: URN:NBN:fi:aalto-202508196332
dc.language.iso: en [en]
dc.programme: Master's Programme in Computer, Communication and Information Sciences [en]
dc.programme.major: Machine Learning, Data Science and Artificial Intelligence [en]
dc.subject.keyword: biomedical concept embeddings [en]
dc.subject.keyword: contrastive learning [en]
dc.subject.keyword: disease risk prediction [en]
dc.subject.keyword: transferability [en]
dc.subject.keyword: electronic health records [en]
dc.subject.keyword: generative models [en]
dc.title: Efficient contextual embeddings to improve the transferability of disease trajectory projections with generative models [en]
dc.type: G2 Pro gradu, diplomityö [fi]
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Diplomityö [fi]
local.aalto.electroniconly: yes
local.aalto.openaccess: no
