HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

Abstract Motivation Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. Results We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven polygenic risk scoring methods across multiple ancestry groups and different genetic architectures. Availability and implementation A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.

Figure 1: Architecture diagram for the software tool, available at https://github.com/intervene-EU-H2020/synthetic_data. Given a YAML configuration file and reference dataset as input, the software tool provides modules for preprocessing the reference data, generating synthetic datasets, evaluating data quality, and selecting optimal model parameters. The collection of modules has been containerised using both Docker and Singularity for ease of use.
Alternatively, users can specify the proportion of real haplotypes to sample from each ancestry group; we refer to the Discussion section of the paper regarding the complications in interpreting admixed samples. Segments of length $\ell$ (in centimorgans) are sampled from the real haplotypes (Figure 1b) based on a simplified stochastic model of the coalescent and recombination processes,

$\ell \sim \mathrm{Exp}(2T\rho_s), \qquad T \sim \mathrm{Gamma}(2,\, N_s/N_{e,s}), \tag{1}$

where $\rho_s$ is the population-specific recombination rate, $N_{e,s}$ is the population-specific mean effective population size, and $N_s$ is the number of reference samples for population $s$. The simulation of a coalescence time, $T$, is one aspect in which HAPNEST differs from previous methods such as HAPGEN2. Another feature we introduce, to reduce close copying of genotypes from the reference, is that the genetic variant at position $i$ is only copied if $T \le m_i$, where $m_i$ is the variant's age of mutation (obtained from [1]). Two synthetic haplotypes, $h_j$, $j \in \{1, 2\}$, constructed in this way are added element-wise to create a synthetic genotype, $g$ (Figure 1c).
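As a concrete illustration, the segment-length model in Eq. (1) can be sketched as follows (the parameter values in the example are arbitrary placeholders, not those inferred by HAPNEST):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_segment(rho_s, n_s, n_e_s):
    """Sample a coalescence time T ~ Gamma(2, N_s / N_{e,s}) and a segment
    length l ~ Exp(2 * T * rho_s), following Eq. (1)."""
    # NumPy's gamma takes (shape, scale); here the scale is N_s / N_{e,s}.
    t = rng.gamma(2.0, n_s / n_e_s)
    # NumPy's exponential takes a scale parameter, i.e. 1 / rate.
    ell = rng.exponential(1.0 / (2.0 * t * rho_s))
    return t, ell
```

Segments sampled this way are shorter when the recombination rate or the coalescence time is larger, which is what limits long-range copying from any single reference haplotype.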
The algorithm is implemented efficiently in the Julia programming language, using a reference dataset of 4062 genotypes from the publicly available 1000 Genomes Project and Human Genome Diversity Project datasets. These data are downloaded from the Genome Aggregation Database (gnomAD) [2] and filtered to retain the HapMap3 variants. Users of the HAPNEST software can choose to download this pre-processed reference or process their own using a module included as part of the HAPNEST software. To enable biobank-scale data generation, the number of computers, threads of execution, and memory constraints can be configured so that HAPNEST utilises the available computing resources. This enables highly parallelisable, multi-threaded generation of millions of samples for millions of variants in a matter of hours.
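For illustration, a configuration file might look like the following hypothetical sketch; all key names and values here are invented for this example and do not reflect the tool's actual schema (see the repository for the real configuration format):

```yaml
# Hypothetical HAPNEST-style configuration sketch (illustrative keys only).
global:
  reference_dir: data/reference
  output_dir: data/output
genotype:
  nsamples: 10000
  chromosomes: [21]
  superpopulations: [EUR, AFR]
  threads: 8
phenotype:
  ntraits: 2
  heritability: [0.3, 0.5]
  polygenicity: [0.01, 0.1]
  prevalence: [0.1, 0.1]
```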

Posterior distributions of model parameters
We use approximate Bayesian computation (ABC) rejection sampling to estimate the posterior distributions of the unknown model parameters, $\rho_s$ and $N_{e,s}$, for each $s \in S$. ABC involves simulating many synthetic datasets and comparing the real and synthetic datasets using a discrepancy measure, giving high probability to parameter values that yield a small discrepancy. This discrepancy is measured by $d(Q(D_{real}), Q(D_{synthetic}))$, where $d$ is the Euclidean distance and $Q$ is a vector concatenating LD decay and cross-relatedness (the proportion of duplicate/MZ-twin, first-degree and second-degree relatives between the real and synthetic datasets for $D_{synthetic}$, and between two random partitions of the real dataset for $D_{real}$).
This approach favours models that preserve the LD structure between the real and synthetic datasets, while limiting the degree of cross-relatedness to that found within the reference data, to encourage diversification and generalisation of the synthetic samples. For computationally expensive simulations of large synthetic datasets, our software pipeline runs ABC with a small number of simulations and trains a Gaussian process regression model to estimate the remaining simulation results [3].
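The ABC rejection step can be sketched as follows; the prior ranges and the toy summary-statistic function are illustrative placeholders, not HAPNEST's actual choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def abc_rejection(simulate, q_real, n_sims=2000, accept_frac=0.2):
    """ABC rejection sampling sketch: draw (rho, Ne) pairs from uniform
    priors, compute summary statistics Q for each simulated dataset, and
    keep the pairs whose Euclidean discrepancy to Q(real) falls in the
    smallest `accept_frac` fraction."""
    # Prior ranges below are illustrative placeholders.
    rhos = rng.uniform(0.1, 1.0, n_sims)
    nes = rng.uniform(1e3, 5e4, n_sims)
    d = np.array([np.linalg.norm(q_real - simulate(r, ne))
                  for r, ne in zip(rhos, nes)])
    keep = d <= np.quantile(d, accept_frac)
    return rhos[keep], nes[keep]  # samples from the approximate posterior
```

In practice the summary statistics would be the LD decay and cross-relatedness vector described above; here a toy simulator suffices to show the acceptance mechanics.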

Overview
Once the genotypes have been generated, we assign each individual a phenotype liability as the sum of a genetic effect, the effect of user-supplied covariates, and noise $\epsilon$:

$Y = G + U + \epsilon,$

where $Y$ is the liability of the phenotype being simulated, $G$ is the total genetic effect and $U$ is the total effect of all covariates. Users can specify the proportion of variance attributed to each component in the phenotype construction. Further, based on the given trait prevalence, a binary case-control status can be assigned as

$Y_b = \mathbb{1}\left[\, Y > \Phi^{-1}(1-k) \,\right],$

where $Y_b$ is the Bernoulli random variable denoting a sample's case/control status, $k$ is the trait population prevalence and $\Phi$ is the cumulative distribution function of the standard normal distribution.
A continuous liability is output by default if the phenotype prevalence is not provided by the user.
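A minimal sketch of the liability-threshold assignment described above (the variance proportions and prevalence are example values, not defaults of the tool):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)

def assign_case_control(liability, k):
    """Threshold the standardised liability at the (1 - k) quantile of
    N(0, 1): Y_b = 1 if Y > Phi^{-1}(1 - k), so ~k of samples are cases."""
    threshold = NormalDist().inv_cdf(1.0 - k)
    return (np.asarray(liability) > threshold).astype(int)

# Example: h2 = 0.5, no covariates, so Var(Y) = 1 (values illustrative).
g = rng.normal(0.0, np.sqrt(0.5), 100_000)  # genetic component G
e = rng.normal(0.0, np.sqrt(0.5), 100_000)  # environmental noise epsilon
labels = assign_case_control(g + e, k=0.1)  # roughly 10% cases
```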

Genetic liability of single trait
The genetic liability of an individual for a given trait is defined as a weighted sum of risk allele counts, where the weights are causal effect sizes randomly sampled from a normal distribution with zero mean. Users can either input a list of causal SNPs or customise the polygenicity parameter for the phenotype simulation, defined as the proportion of total SNPs that are causal (i.e. polygenicity = 0.01 indicates that approximately 1% of all SNPs have a causal effect on the trait). Since the heritability contributed by a SNP can be affected by many properties, such as the SNP's minor allele frequency (MAF) or local linkage structure, we model the variance of the effect size distribution for any causal SNP as a function of its MAF, LD score, and functional annotation,

$\beta_i \sim \mathcal{N}\!\left(0,\; \sigma^2\,[2p_i(1-p_i)]^{a}\, r_i^{\,b}\, s_i^{\,c}\right),$

where $\beta_i$ is the effect size of causal SNP $i$, $p_i$ is the MAF of the SNP, $r_i$ is its local LD score, and $s_i$ is its functional annotation index. Because computing local LD scores for all causal SNPs from the input genome is computationally intensive, especially when the polygenicity is high, precomputed LD scores should be provided along with the variant annotation in a reference file for this functionality to be used. Along with the release of our software, we provide a sample reference with precomputed LD scores from the African population in the HapMap 3 project [4], which users can further customise. Adjusting the parameters $a$, $b$ and $c$ changes the relationship between SNP heritability and these properties, which in reality reflects the strength of negative selection.
For example, choosing $a = 0$ makes the SNP effect size distribution invariant to its MAF, and similarly for $b$ with the LD score and $c$ with the SNP annotation. In addition, we allow stratifying the causal SNP effects into multiple levels (e.g. separating SNPs with large, medium and small effects) by adopting a Gaussian mixture distribution whose components have different widths,

$\beta_i \sim \sum_{j=1}^{k} \pi_j\, \mathcal{N}(0, \sigma_j^2),$

where $k$ is the number of components and $\sigma_j^2$ is the parameter controlling the width of the $j$-th component; both can be specified by the user. Each SNP falls into mixture component $j$ with probability $\pi_j$, which is proportional to the reciprocal of the width parameter $\sigma_j^2$ and scaled so that $\sum_{j=1}^{k} \pi_j = 1$. For a more realistic setup, we suggest using a wide range of width parameters, covering a small number of SNPs each with relatively high heritability through to a massive number of SNPs each harbouring low heritability [5].
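The mixture model for causal effect sizes can be sketched as follows; the exponents a, b, c and the component widths are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_causal_effects(p, r, s, a=-0.25, b=-0.1, c=0.0,
                          widths=(1e-4, 1e-3, 1e-2)):
    """Each causal SNP falls into mixture component j with probability pi_j
    proportional to 1 / sigma2_j, then draws its effect from
    N(0, sigma2_j * [2 p_i (1 - p_i)]^a * r_i^b * s_i^c)."""
    sigma2 = np.asarray(widths, dtype=float)
    pi = (1.0 / sigma2) / (1.0 / sigma2).sum()  # scaled so sum(pi_j) = 1
    j = rng.choice(len(sigma2), size=len(p), p=pi)
    var = sigma2[j] * (2.0 * p * (1.0 - p)) ** a * r ** b * s ** c
    return rng.normal(0.0, np.sqrt(var))
```

Setting a = 0, b = 0 and c = 0 recovers a plain Gaussian mixture that ignores MAF, LD score and annotation, matching the special cases discussed above.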

Genetic effects on different populations
Our phenotype generation also encompasses population-specific genetic effects, not only by incorporating MAF differences across populations, but also by introducing a cross-ancestry genetic correlation parameter. We adopt the common assumption that, for the same phenotype, causal variants are shared across ancestries but not necessarily their exact effect sizes, and generate correlated base causal effects for different populations from a multivariate normal distribution. On top of the base effect size, we further integrate the effect of the population-specific MAF of the SNP. More specifically, for each causal SNP, we first generate population-specific base effect sizes $\beta_i$ from a multivariate Gaussian mixture distribution,

$\beta_i \sim \sum_{j=1}^{k} \pi_j\, \mathcal{N}(\mathbf{0},\, \sigma_j^2 \Sigma),$

where $\Sigma$ is the population covariance matrix, with each element $[\Sigma]_{mn}$ denoting the covariance of standardised SNP effects in populations $m$ and $n$. As mentioned in the previous section, these base effect sizes can also be stratified into multiple levels using the component width parameter $\sigma_j^2$.
Then, the final per-allele effect size of SNP $i$ in population $m$ is

$\beta^{*}_{im} = \dfrac{\beta_{im}}{\sqrt{2\, p_{im} (1 - p_{im})}},$

where $p_{im}$ is the allele frequency of SNP $i$ in population $m$ and $\beta_{im}$ is the $m$-th component of the base effect $\beta_i$.
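A sketch of the cross-ancestry effect-size model; the equicorrelated covariance structure and the standardised-to-per-allele MAF scaling shown are assumptions consistent with the description above:

```python
import numpy as np

rng = np.random.default_rng(4)

def cross_population_effects(p_freqs, corr=0.8, sigma2=1e-3):
    """Draw correlated standardised base effects across populations from
    N(0, sigma2 * Sigma), with [Sigma]_mn = corr for m != n and 1 on the
    diagonal, then convert to per-allele effects via the population-specific
    allele frequency: beta*_im = beta_im / sqrt(2 p_im (1 - p_im))."""
    n_snps, n_pops = p_freqs.shape
    cov = sigma2 * (np.full((n_pops, n_pops), corr)
                    + (1.0 - corr) * np.eye(n_pops))
    base = rng.multivariate_normal(np.zeros(n_pops), cov, size=n_snps)
    return base / np.sqrt(2.0 * p_freqs * (1.0 - p_freqs))
```

Because the correlation is imposed on the standardised effects, recovering them from the per-allele effects (by multiplying back the MAF factor) reproduces the target cross-ancestry correlation.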

Simulating multiple correlated phenotypes
Our tool also allows simultaneous generation of multiple phenotypes with a predefined genetic correlation and pleiotropic model. Each trait has its own heritability and polygenicity. Moreover, SNP effects on different traits can be correlated under a user-defined coefficient. Unlike the cross-ancestry genetic correlation, different traits can have different sets of causal SNPs. In this case, the pleiotropy parameter can be specified to determine the proportion of causal SNPs shared across traits. Note that the phenotype correlation parameter applies to the correlation of pleiotropic SNP effects, as well as to the effects of non-genetic covariates and environmental noise.
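The causal-SNP sharing implied by the pleiotropy parameter can be sketched as follows (function and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

def assign_causal_sets(n_snps, polygenicity, pleiotropy, n_traits=2):
    """Build one causal-SNP set per trait: a shared pool (fraction
    `pleiotropy` of each trait's causal set) plus trait-specific SNPs
    drawn from the remaining variants."""
    n_causal = int(n_snps * polygenicity)
    n_shared = int(n_causal * pleiotropy)
    order = rng.permutation(n_snps)
    shared = order[:n_shared]
    sets, offset = [], n_shared
    for _ in range(n_traits):
        own = order[offset:offset + (n_causal - n_shared)]
        offset += n_causal - n_shared
        sets.append(np.sort(np.concatenate([shared, own])))
    return sets
```

Effect sizes for the shared SNPs would then be drawn with the user-specified correlation, while trait-specific SNPs receive independent effects.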

Evaluation
To evaluate the quality of data generated by the synthetic data generation algorithm, we consider three key aspects of the data for downstream tasks.
1. Fidelity: Realistic synthetic genetic data should retain the key summary statistics of the original genetic data.
2. Diversity: Synthetic genetic data should be sufficiently different from the real-world genetic data, to preserve privacy.
3. Validity: The synthetic phenotypes should behave similarly to real-world phenotypes in downstream genetics applications.
For each of these aspects, we design and evaluate multiple metrics to analyse the reliability as well as the robustness of the synthetic data generation algorithm.

Fidelity of synthetic samples
We compare four properties of the synthetic dataset against the real dataset to understand the fidelity of the synthetic samples in retaining key statistical properties.

1. Minor Allele Frequency
We measure the discrepancy in minor allele frequency (MAF) from the original dataset using two statistical distances. First, we consider the observed value of each haplotype as an independent Bernoulli random variable; hence, a genomic sample can be seen as a collection of realised Bernoulli random variables with different parameters. Considering the real data as one realisation and the synthetic data as another realisation of this collection of random variables, we can compute statistical discrepancies between them. We use the KL divergence and the Wasserstein distance between Bernoulli random variables as metrics of the discrepancy in MAF.
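For two Bernoulli variables these distances have simple closed forms, sketched below:

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-8):
    """Mean KL(Bern(p_i) || Bern(q_i)) over SNPs, clipped for stability."""
    p = np.clip(np.asarray(p, float), eps, 1.0 - eps)
    q = np.clip(np.asarray(q, float), eps, 1.0 - eps)
    return float(np.mean(p * np.log(p / q)
                         + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))))

def bernoulli_wasserstein(p, q):
    """The 1-Wasserstein distance between Bern(p) and Bern(q) is |p - q|;
    we report the mean over SNPs."""
    return float(np.mean(np.abs(np.asarray(p) - np.asarray(q))))
```

Both metrics are zero when the real and synthetic allele frequencies agree exactly, and grow as the per-SNP frequencies drift apart.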
2. Population structure: Since most downstream PRS tasks analyse the effects of SNPs at the population level, it is critical to maintain the genetic structure of the real population in the synthetic samples. We use principal component analysis (PCA) to check that the synthetic samples are well aligned with the real samples.
To quantify the misalignment between the population structures, we designed a new metric, 'PC alignment'. Let $\{x_i\}_{i=1}^{P}$ be the set of principal components of the real data, where $x_i \in \mathbb{R}^{n_{SNP}}$, and let $\{\tilde{x}_i\}_{i=1}^{P}$ be the principal components of the synthetic dataset. The alignment between two principal components is measured using the cosine similarity,

$\cos(x_i, \tilde{x}_i) = \dfrac{|x_i^{\top}\tilde{x}_i|}{\lVert x_i \rVert \, \lVert \tilde{x}_i \rVert},$

which quantifies the degree to which the principal components are aligned. Further, we weight the cosine similarity between principal components by the proportion of variance explained by that PC in the real data, denoted $\eta_i$. Hence, the final value is calculated as

$\mathrm{PC\ alignment} = \sum_{i=1}^{K} \eta_i \cos(x_i, \tilde{x}_i),$

where the top $K$ components are considered for computational simplicity.
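A sketch of the PC alignment metric using SVD-based PCA; the absolute cosine handles the sign ambiguity of principal components:

```python
import numpy as np

def pc_alignment(x_real, x_syn, k=10):
    """Weighted cosine similarity between the top-k principal axes of the
    real and synthetic genotype matrices (individuals x SNPs), weighted by
    the variance explained (eta_i) in the real data."""
    xr = x_real - x_real.mean(axis=0)
    xs = x_syn - x_syn.mean(axis=0)
    _, s_r, v_r = np.linalg.svd(xr, full_matrices=False)
    _, _, v_s = np.linalg.svd(xs, full_matrices=False)
    k = min(k, v_r.shape[0], v_s.shape[0])
    eta = (s_r**2 / (s_r**2).sum())[:k]      # variance explained per PC
    # |cos| between matching principal axes (rows of V are unit vectors).
    cos = np.abs(np.sum(v_r[:k] * v_s[:k], axis=1))
    return float(np.sum(eta * cos))
```

A synthetic dataset whose principal axes match the real data scores higher; identical datasets attain the maximum, the sum of the top-k variance fractions.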
3. Linkage Disequilibrium: While matching MAF preserves the marginal probability of occurrence of each SNP, the co-occurrence of neighbouring SNPs is not ensured under the independent and identically distributed (IID) assumption. To verify that the local co-occurrence structure is maintained in the synthetic dataset, we compute the linkage disequilibrium correlation matrix and compare it between the real and synthetic data.
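One illustrative way to compare the LD structure numerically (the Frobenius-norm summary below is our own choice for this sketch, not necessarily the comparison used in the paper):

```python
import numpy as np

def ld_matrix(genotypes):
    """Pairwise SNP correlation (r) matrix from a genotype matrix of shape
    (individuals, SNPs); r^2 is the usual LD measure."""
    return np.corrcoef(genotypes, rowvar=False)

def ld_discrepancy(g_real, g_syn):
    """Frobenius-norm distance between the real and synthetic r^2 matrices,
    summarising how well local co-occurrence structure is preserved."""
    r2_real = ld_matrix(g_real) ** 2
    r2_syn = ld_matrix(g_syn) ** 2
    return float(np.linalg.norm(r2_real - r2_syn))
```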
Here, $AA_{truth}$ measures the fraction of real samples whose nearest neighbour in the metric space is a synthetic sample, and $AA_{syn}$ measures the same fraction for synthetic samples. When the synthetic data are very close to the real data, real samples cannot be distinguished from synthetic samples and hence $AA_{TS} \to 0.5$.

Diversity of synthetic samples
A main aim of creating synthetic datasets is to preserve the privacy of the individual samples in the original dataset while allowing the scientific community to develop new tools from the vast amount of data available. We study the 'closeness' of synthetic samples to real samples using two key metrics.
1. Kinship analysis: For kinship analysis [7], we compute the kinship between samples within the generated dataset, as well as the kinship across the reference and generated datasets.
Within-dataset kinship analysis is useful for analysing the diversity of the synthetic samples within the cohort, while cross-dataset kinship is useful for addressing the privacy concern of reference data leaking into the synthetic data, as well as for quantifying the similarity between synthetic and real samples.
2. IBS analysis: Identity-by-state (IBS) analysis identifies samples that share an identical sequence at a particular locus and is useful for analysing the closeness between samples. We analyse the distribution of IBS values to compare the real data with the generated synthetic data.
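A minimal sketch of pairwise mean IBS for genotypes coded 0/1/2:

```python
import numpy as np

def ibs_matrix(g1, g2):
    """Mean identity-by-state between every pair of samples: at a single
    SNP, IBS is (2 - |g_a - g_b|) / 2 for genotypes coded 0/1/2; we average
    over SNPs to get one value per sample pair."""
    diff = np.abs(g1[:, None, :] - g2[None, :, :])  # pairwise |g_a - g_b|
    return (2.0 - diff).mean(axis=2) / 2.0
```

The resulting matrix equals 1 for identical samples and decreases towards 0 as genotypes diverge, so its distribution summarises how close synthetic samples sit to real ones.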

Validity of phenotypes
To ensure the phenotype generation algorithm returns realistic phenotypes for each individual given the input parameters, we created phenotypes for 5,000 European-ancestry samples across a range of heritabilities (0.1, 0.3, 0.5, 0.7, 0.9) and polygenicities (0.0001 or 0.1). Assuming a prevalence of 0.1, we also assigned binary case/control labels to all individuals for each continuous phenotype. We computed heritability using the genomic-relatedness-based restricted maximum-likelihood method implemented in GCTA (GCTA-GREML) [8] and compared the estimated heritability to our input parameters. Under each condition, the genetic relationship matrix (GRM) used for heritability estimation was computed using the causal SNPs underlying the simulated phenotypes. We also calculated the genetic correlation of two traits simulated under various pleiotropic relationships using LDSC [9] as another benchmark of the phenotype generation.

Parameters used in methods comparison
All methods use equivalent reference inputs (converted to the file formats used by each tool).
Parameters used for Sim1000G: function generateUnrelatedIndividuals from the package, with N = 1000. Parameters used for HAPGEN2: Ne = 500, theta = 0, null variant effect. RAM usage was calculated as the maximum resident set size (the maximum portion of memory occupied by a process in RAM). All experiments were carried out on computing instances with 32GB RAM, where processes that significantly exceeded this memory limit were automatically stopped by the system; in such cases, the final RAM usage prior to the process being stopped is reported in the corresponding table.

Figure 2: Posterior distributions inferred using rejection sampling, plotted as marginal and bivariate kernel density estimates for the effective population size, $N_{e,s}$, and the recombination rate, $\rho_s$, for six superpopulation groups, $s$. The rejection sampling uses a 20% acceptance rate for 2000 parameter pairs (inferred after 500 iterations) sampled from uniform priors.

Figure 4: Scatter plots corresponding to the posterior distribution samples plotted in Figure 2. The colourmap represents the number of samples in the simulated datasets that have close (twin or first-degree) relatives in the real dataset (generalisability objective).

Figure 5: LD decay comparison for the reference data, HAPNEST-generated synthetic data optimised for the multi-objective (fidelity and generalisability), and HAPNEST-generated synthetic data optimised for the fidelity objective. Four reference datasets are considered, for s = EUR, AFR and k = 18267, 109673 (corresponding to the HapMap3 variants and all SNPs with non-zero MAF on chromosome 21, respectively). LD decay plots were generated using the PopLDdecay tool [10].

Figure 6: LD correlation for 500 contiguous SNPs selected at random. Four reference datasets are considered, for s = EUR, AFR and k = 18267, 109673 (corresponding to the HapMap3 variants and all SNPs with non-zero MAF on chromosome 21, respectively). The LD plot for the reference data is compared with N_syn = 1000, 25000 synthetic samples, for HAPNEST model parameters selected using the multi-objective and the fidelity objective.

Figure 7: Regional plot around the causal SNP on chromosome 21 under different genotype generation parameters, compared with the real reference genome. GWAS was run on the different genomes with phenotypes generated under the same parameters and randomisation seed: heritability = 0.05 and a single causal variant on chromosome 21. Each GWAS group contains 4062 EUR samples, which is the sample size of the reference genome of the same ancestry. Panel A shows the regional plot for GWAS run on the real reference genome. The remaining three panels each show a representative case when different aspects are prioritised during parameter selection. B: genotypes simulated with rho = 0.35, Ne = 8000, where fidelity is prioritised; C: genotypes simulated with rho = 0.68, Ne = 11700, where fidelity and generalisability are balanced; D: genotypes simulated with rho = 0.75, Ne = 25000, where generalisability is prioritised. See Table 9 for a quantified comparison of GWAS hits.

Figure 10: Pearson correlation between predicted and observed values, for various PRS methods and European-ancestry phenotypes with various genetic architectures (H corresponds to the heritability value and P corresponds to the polygenicity value). The optimal parameters for each PRS method are identified using cross-validation (CV), or pseudovalidation (PseudoVal) if CV is not available. Scatter plots show the average value obtained for each PRS method, from 3 iterations of the experiment, for HapMap3 variants across 22 chromosomes.

Figure 11: Example GWAS Manhattan plots for phenotypes under various genetic architectures. Causal SNPs for the trait liability under each setup are coloured in green. a. Phenotype with low heritability (0.1) and low polygenicity (0.0001, i.e. approximately 0.01% of total SNPs having causal effects on trait liability); b. Phenotype with high heritability (0.9) and high polygenicity (0.1, i.e. approximately 10% of total SNPs having causal effects on trait liability).

4. Nearest-Neighbour Adversarial Accuracy: A sufficiently good synthetic data sample must be indistinguishable from a real data sample. Based on this principle, we use the adversarial accuracy [6] of a classifier to quantify the closeness of the synthetic dataset to the real dataset. Let $N$ be the number of real samples as well as synthetic samples, let $\mathbb{1}(\cdot)$ evaluate to 1 if the condition is true and 0 otherwise, let $d_{TS}(i)$ be the distance between the $i$-th real data point and its nearest neighbour in the synthetic dataset, $d_{ST}(i)$ the distance between the $i$-th synthetic data point and its nearest neighbour in the real dataset, $d_{TT}(i)$ the distance from the $i$-th real data point to its nearest neighbour in the real dataset, and $d_{SS}(i)$ the analogous distance in the synthetic dataset. Then the nearest-neighbour adversarial accuracy $AA_{TS}$ is defined as

$\mathrm{AA}_{TS} = \frac{1}{2}\left[\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big(d_{TS}(i) > d_{TT}(i)\big) + \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big(d_{ST}(i) > d_{SS}(i)\big)\right].$
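The adversarial accuracy defined above can be sketched as:

```python
import numpy as np

def adversarial_accuracy(real, syn):
    """Nearest-neighbour adversarial accuracy AA_TS: for each real point,
    test whether its nearest synthetic neighbour is farther than its
    nearest real neighbour (and vice versa); average the two indicator
    means. Values near 0.5 mean the two datasets are indistinguishable."""
    def nn_dists(a, b, same):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        if same:
            np.fill_diagonal(d, np.inf)  # exclude the point itself
        return d.min(axis=1)
    d_ts, d_tt = nn_dists(real, syn, False), nn_dists(real, real, True)
    d_st, d_ss = nn_dists(syn, real, False), nn_dists(syn, syn, True)
    return 0.5 * (np.mean(d_ts > d_tt) + np.mean(d_st > d_ss))
```

Values near 1 indicate the synthetic data are easily distinguishable from the real data, while values near 0 indicate over-copying of real samples.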

Table 1 :
Comparison of the evaluation metrics nearest neighbour adversarial accuracy, LD decay (Euclidean distance between LD decay vectors), MAF (Wasserstein divergence) and PC alignment, for four synthetic data generation tools. ↓ means lower is better, ↑ means higher is better; for the nearest neighbour adversarial accuracy, a value near 0.5 is better. Each synthetic dataset contains 1000 samples, based on a European-ancestry reference panel for HapMap3 variants on chromosome 21. HAPNEST (multi-objective) uses parameter values determined by the ABC algorithm to balance the fidelity and generalisability objectives, while HAPNEST (LD objective) uses parameters optimised for the fidelity objective. Best results are indicated in bold.

Table 2 :
Comparison of relatedness metrics for four synthetic data generation tools, measuring the number of genetically related genotype pairs within the synthetic datasets. ↓ means lower is better, ↑ means higher is better. Best results are indicated in bold.

Table 3 :
Comparison of cross-relatedness metrics for four synthetic data generation tools, measuring the number of genetically related genotype pairs between the synthetic and reference datasets. ↓ means lower is better, ↑ means higher is better. Best results are indicated in bold.

Table 4 :
The generalisability and diversity scores for various sample sizes, averaged across five trials for chromosome 21 HapMap3 variants, for the HAPNEST and HAPGEN2 methods (standard deviation given in brackets). The ratio N/Ne is fixed to ensure a fair comparison with the same average segment lengths.

Table 5 :
Additional evaluation experiments for HAPNEST-generated datasets of European ancestry. Four reference datasets are considered, for N_syn = 1000, 25000 and k = 18267, 109673 (corresponding to the HapMap3 variants and all SNPs with non-zero MAF on chromosome 21, respectively). HAPNEST model parameters are selected using either the multi-objective or the fidelity (LD) objective. Datasets selected by the fidelity objective therefore tend to have a lower LD decay distance, while datasets selected by the multi-objective have fewer related (twin and first-degree) genotype pairs within the synthetic datasets (Kin. syn.) and between the synthetic and reference datasets (Kin. cross).

Table 6 :
Additional evaluation experiments for HAPNEST-generated datasets of African ancestry. Four reference datasets are considered, for N_syn = 1000, 25000 and k = 18267, 109673 (corresponding to the HapMap3 variants and all SNPs with non-zero MAF on chromosome 21, respectively). HAPNEST model parameters are selected using either the multi-objective or the fidelity (LD) objective. Datasets selected by the fidelity objective therefore tend to have a lower LD decay distance, while datasets selected by the multi-objective have fewer related (twin and first-degree) genotype pairs within the synthetic datasets (Kin. syn.) and between the synthetic and reference datasets (Kin. cross).

Table 7 :
Genetic correlation of simulated traits with and without causal variants.

Table 9 :
Comparison of SNPs surpassing various p-value thresholds in GWAS for genomes generated under different parameters; see also Figure 7. The number of SNPs with GWAS p-value below each threshold is reported.

Table 10 :
Average RAM usage in GB (and standard deviation) for the scalability experiments, for the smaller reference panel, calculated as the maximum resident set size.

Table 11 :
Average RAM usage in GB (and standard deviation) for the scalability experiments, for the larger reference panel, calculated as the maximum resident set size.