HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes
Access rights
openAccess
A1 Original article in a scientific journal
This publication is imported from the Aalto University research portal.
Date
2023-09-02
Language
en
Pages
10
1-10
Series
Bioinformatics (Oxford, England), Volume 39, issue 9
Abstract
MOTIVATION: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. RESULTS: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven methods to generate polygenic risk scoring across multiple ancestry groups and different genetic architectures. AVAILABILITY AND IMPLEMENTATION: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
Description
openaire: EC/H2020/101016775/EU//INTERVENE
Citation
Wharrie, S, Yang, Z, Raj, V, Monti, R, Gupta, R, Wang, Y, Martin, A, O'Connor, L J, Kaski, S, Marttinen, P, Palamara, P F, Lippert, C & Ganna, A 2023, 'HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes', Bioinformatics (Oxford, England), vol. 39, no. 9, btad535, pp. 1-10. https://doi.org/10.1093/bioinformatics/btad535