Collaborative learning from distributed data with differentially private synthetic data

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Prediger, Lukas [en_US]
dc.contributor.author: Jälkö, Joonas [en_US]
dc.contributor.author: Honkela, Antti [en_US]
dc.contributor.author: Kaski, Samuel [en_US]
dc.contributor.department: Department of Computer Science [en]
dc.contributor.groupauthor: Probabilistic Machine Learning [en]
dc.contributor.groupauthor: Professorship Kaski Samuel [en]
dc.contributor.groupauthor: Computer Science Professors [en]
dc.contributor.groupauthor: Computer Science - Artificial Intelligence and Machine Learning (AIML) - Research area [en]
dc.contributor.groupauthor: Finnish Center for Artificial Intelligence, FCAI [en]
dc.contributor.groupauthor: Helsinki Institute for Information Technology (HIIT) [en]
dc.contributor.organization: University of Helsinki [en_US]
dc.date.accessioned: 2024-08-28T08:53:16Z
dc.date.available: 2024-08-28T08:53:16Z
dc.date.issued: 2024-06-14 [en_US]
dc.description.abstract: Background: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population-level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy-preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank. Methods: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores. Results: We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups. Conclusions: Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 14
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Prediger, L, Jälkö, J, Honkela, A & Kaski, S 2024, 'Collaborative learning from distributed data with differentially private synthetic data', BMC Medical Informatics and Decision Making, vol. 24, no. 1, 167, pp. 1-14. https://doi.org/10.1186/s12911-024-02563-7 [en]
dc.identifier.doi: 10.1186/s12911-024-02563-7 [en_US]
dc.identifier.issn: 1472-6947
dc.identifier.other: PURE UUID: d187bb29-0251-472e-8b99-3caab67eaa0b [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/d187bb29-0251-472e-8b99-3caab67eaa0b [en_US]
dc.identifier.other: PURE LINK: https://github.com/DPBayes/Collaborative-Learning-with-DP-Synthetic-Twin-Data [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/155087423/Collaborative_learning_from_distributed_data_with_differentially_private_synthetic_data.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/130442
dc.identifier.urn: URN:NBN:fi:aalto-202408286003
dc.language.iso: en [en]
dc.publisher: BioMed Central
dc.relation.fundinginfo: This work was supported by the Research Council of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI; and grants 325572, 325573), the Strategic Research Council (SRC) established within the Research Council of Finland (Grant 336032), UKRI Turing AI World-Leading Researcher Fellowship (EP/W002973/1), as well as the European Union (Project 101070617).
dc.relation.ispartofseries: BMC Medical Informatics and Decision Making [en]
dc.relation.ispartofseries: Volume 24, issue 1, pp. 1-14 [en]
dc.rights: openAccess [en]
dc.subject.keyword: collaborative learning [en_US]
dc.subject.keyword: differential privacy [en_US]
dc.subject.keyword: health informatics [en_US]
dc.subject.keyword: synthetic data [en_US]
dc.title: Collaborative learning from distributed data with differentially private synthetic data [en]
dc.type: A1 Original article in a scientific journal (Alkuperäisartikkeli tieteellisessä aikakauslehdessä) [fi]
dc.type.version: publishedVersion
