Efficient approaches for large-scale GWAS with genotype uncertainty

Abstract Association studies using genetic data from SNP-chip-based imputation or low-depth sequencing provide a cost-efficient design for large-scale association studies. We explore methods for performing association studies applicable to such genetic data and investigate how using different priors when estimating genotype probabilities affects the association results. Our proposed method, ANGSD-asso's latent model, models the unobserved genotype as a latent variable in a generalized linear model framework. The software is implemented in C/C++ and can be run multi-threaded. ANGSD-asso is based on genotype probabilities, which can be estimated using either the sample allele frequency or the individual allele frequencies as a prior. We explore through simulations how genotype probability-based methods compare with using genetic dosages. Our simulations show that in a structured population the individual allele frequency prior has better power than the sample allele frequency prior. In scenarios where sequencing depth and phenotype are correlated, ANGSD-asso's latent model has higher statistical power and less bias than using dosages. When adding additional covariates to the linear model, ANGSD-asso's latent model has higher statistical power and less bias than other methods that accommodate genotype uncertainty, while also being much faster. This is shown with imputed data from the UK Biobank and with simulations.

The score test as described in Skotte et al. [2012] only has to estimate the parameters of the null model, where uncertainty in the variables does not have to be taken into account. It is therefore faster than our approach, where both the null and the alternative model have to be estimated. The idea behind the hybrid model is to combine the speed of the score test with the desirable properties of ANGSD-asso's latent model, where estimates of the effect size and standard error can be obtained. It works by first running the score test and then, if the site has a P-value below a certain threshold, additionally running the slower ANGSD-asso latent model. The threshold can be set by the user; the default value is 0.05.
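As an illustration, here is a minimal sketch of the hybrid strategy in Python. The function names and the simplified statistics are ours, not ANGSD-asso's API: the score test below is the simple intercept-only-null version for a quantitative trait, and plain OLS on dosages stands in for the latent model on the slow path.

```python
# Minimal sketch of the hybrid strategy (illustrative Python, not the
# C/C++ implementation in ANGSD-asso).
import numpy as np
from scipy import stats

def score_test_p(y, d):
    """Score test for beta = 0 in y = alpha + beta*d + e.
    Only the null (intercept-only) model has to be fitted."""
    yc, dc = y - y.mean(), d - d.mean()
    sigma2 = np.mean(yc ** 2)                   # variance under the null
    stat = (yc @ dc) ** 2 / (sigma2 * (dc @ dc))
    return stats.chi2.sf(stat, df=1)

def hybrid_test(y, d, threshold=0.05):
    p = score_test_p(y, d)                      # fast path, run for every site
    if p >= threshold:
        return {"p": p, "beta": None, "se": None}
    # slow path: fit the full model to obtain an effect size and standard
    # error; in ANGSD-asso this is the latent model, here plain OLS
    X = np.column_stack([np.ones_like(d, dtype=float), d])
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = rss[0] / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return {"p": p, "beta": beta[1], "se": se}
```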

ANGSD-asso's dosage model
The expected genotype E[G|x] can easily be calculated from the genotype probabilities and is a simple way to accommodate some of the genotype uncertainty in association studies. The genotype probability p(G|x) can be calculated using the genotype likelihood p(x|G) and the frequency f of the genetic variant, which is used to form the prior p(G|f). This is done using Bayes' formula

$$p(G\mid x, f) = \frac{p(x\mid G)\,p(G\mid f)}{p(x)} = \frac{p(x\mid G)\,p(G\mid f)}{\sum_{g\in\{0,1,2\}} p(x\mid G=g)\,p(G=g\mid f)}. \quad (3)$$

Here it is assumed that we have one homogeneous population, where f describes the frequency of the genetic variant well across all individuals. Other priors can be used, for instance a prior based on the individual allele frequency (π) or on haplotype frequencies. We then perform standard ordinary least squares using E[G|x] as our explanatory variable.
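A small sketch of this computation, assuming a Hardy-Weinberg prior p(G|f) and genotype likelihoods stored as an (n, 3) array; the function names are illustrative, not the ANGSD-asso interface.

```python
# A sketch of eq. (3): genotype likelihoods p(x|G) are combined with a
# Hardy-Weinberg prior p(G|f) to give genotype probabilities, from which
# the dosage E[G|x] follows.
import numpy as np

def genotype_posteriors(gl, f):
    """gl: (n, 3) likelihoods p(x|G = 0, 1, 2); f: allele frequency.
    Returns the (n, 3) posteriors p(G|x, f)."""
    prior = np.array([(1 - f) ** 2, 2 * f * (1 - f), f ** 2])  # p(G|f), HWE
    num = gl * prior                        # numerator of Bayes' formula
    return num / num.sum(axis=1, keepdims=True)

def dosage(gl, f):
    """E[G|x] = sum_g g * p(G = g|x, f)."""
    return genotype_posteriors(gl, f) @ np.array([0.0, 1.0, 2.0])
```

The dosage model is then ordinary least squares of the phenotype on dosage(gl, f) together with any covariates.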

Simulated sequencing data
Sequencing data was simulated by first choosing an average depth for a group of individuals and then sampling each individual's depth from a Poisson distribution. For simplicity we assumed a constant error rate of 1%; furthermore, we assumed that only two bases exist and sampled the reads from these two alleles conditional on the simulated genotype and the error rate. For the run-times in Figure 5 and Supplementary Figure 13 and the effects of priming in Supplementary Figure 11, genetic data was simulated using frequencies from the Yoruba population from Lazaridis et al. [2014b], where the curated Human Origins data set was selected. We chose 6 different simulation scenarios, as summarised in Table 1. In scenario 1 we evaluate the false positive rate when sequencing depth and phenotype are correlated, under our null hypothesis of no effect of the genotype. In scenario 2 we examine the statistical power when simulating under our alternative hypothesis, without correlation between sequencing depth and phenotype. Scenario 3 is similar to scenario 1 and scenario 4 is similar to scenario 2, but with the addition of population structure. Scenarios 5 and 6 are similar to scenarios 2 and 4 respectively, but with correlation between sequencing depth and phenotype. The sequencing depth and phenotype correlation was simulated using a logistic function modelling the probability of being in the group with high average sequencing depth, p(D_i = high). As this function maps its input to probabilities, the steepness of the curve can conveniently be controlled with a single parameter.
The higher δ is in absolute value, the more strongly the phenotype correlates with being in the group of high or low average sequencing depth: for δ > 0, the lowest values of the phenotype have the lowest probability of being in the high-depth group and the highest values of the phenotype have the highest probability of being in the high-depth group (and vice versa for δ < 0). From the simulated data we estimate frequencies from the genotype likelihoods. For the admixed individuals we assume that the admixture proportions are known and estimate the population frequencies using the approach from Skotte et al. [2013]. The sequencing depth and phenotype correlation is simulated as described in eq. 5.
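A condensed sketch of this simulation scheme follows. We assume here that eq. 5 is the logistic function described above, applied to the phenotype with steepness parameter δ; the exact parameterisation in the paper may differ.

```python
# Illustrative sketch of the read simulation and the logistic
# depth-phenotype correlation described above.
import numpy as np

rng = np.random.default_rng(1)
ERR = 0.01                                  # constant 1 % error rate

def simulate_reads(genotype, mean_depth):
    """Reads for one diallelic site; genotype in {0, 1, 2} counts the
    alternative allele. Returns (depth, number of alternative reads)."""
    depth = rng.poisson(mean_depth)
    p_alt = genotype / 2.0
    # with only two bases, an error flips the read to the other allele
    p_read_alt = p_alt * (1 - ERR) + (1 - p_alt) * ERR
    return depth, rng.binomial(depth, p_read_alt)

def p_high_depth(y, delta):
    """Logistic depth-phenotype correlation, p(D_i = high): for delta > 0,
    individuals with high phenotype values are more likely to be in the
    high-depth group, and vice versa for delta < 0."""
    return 1.0 / (1.0 + np.exp(-delta * y))
```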

Simulating estimated admixture proportions
We added simulations where we performed association analysis of the variant rs2951755. We used the curated Human Origins data set [Lazaridis et al., 2014a] with individuals from the Human Genetic Diversity Project (HGDP). In this data set rs2951755 has a frequency of 0.1 in French and 0.63 in Yoruba. We chose this variant because of its large difference in population-specific frequencies. We simulated genotypes for rs2951755 and between 50 and 50,000 additional variants based on frequencies from randomly sampled variants from the HGDP data. We simulated individuals as a mix between the French and Yoruba with admixture proportions similar to those of scenario 4 (see Figure 2), with the same simulated effect size of 0.3 and an effect of ancestry of 1. The genotype likelihoods were then used to estimate the admixture proportions using NGSadmix [Skotte et al., 2013]. The results are shown in Supplementary Tables 3 & 4.

Figure 1: Simulation scenario 1, varying the sequencing depth and phenotype correlation (δ) (eq. 5). We have a population of 1,000 individuals without population structure or an effect of genotype. We use a significance threshold of 10^-3. Each point is the mean value from 10,000 simulations.

Figure 5: Simulation scenario 6, varying the genotype effect size (β); the sequencing depth and phenotype correlation (δ) is fixed at 5 (eq. 5). There is an effect of ancestry of population 1 (γ = 1). The phenotype is simulated as a quantitative trait; for each tested effect size the mean power is calculated from 10,000 simulations.

4 Sequencing depth phenotype correlation with a binary phenotype and population structure

The binary phenotype and the depth-phenotype correlation were simulated analogously to the quantitative case (eq. 5); the higher δ is, the more probable it is that cases are in the high-depth category.

5 Sequencing depth phenotype correlation with a binary phenotype

Figure 9: Estimated odds ratios (OR) from all 3 methods used in Table 2, shown to illustrate the bias of the estimated effect size. The data were simulated using a relative risk (RR) of 1.14. To show the bias of the estimated effect size the OR is used, as this can be obtained from the logistic regression. An RR of 1.14 with a disease prevalence of 0.1 is equivalent to an OR of 1.158. The formula for converting OR to RR from Zhang and Yu [1998] was used.
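As a quick check of the numbers in the Figure 9 caption, the conversion formula of Zhang and Yu [1998], RR = OR / (1 − p0 + p0 · OR), can be inverted and evaluated; p0 is taken here to be the 0.1 disease prevalence quoted above.

```python
# A worked check of the OR/RR conversion used in the Figure 9 caption.
def or_from_rr(rr, p0):
    return rr * (1 - p0) / (1 - rr * p0)

def rr_from_or(odds_ratio, p0):
    return odds_ratio / (1 - p0 + p0 * odds_ratio)

print(or_from_rr(1.14, 0.1))    # ~1.158, the OR quoted in the caption
print(rr_from_or(1.158, 0.1))   # ~1.140, the round trip back to the RR
```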

Genotype likelihoods from NGS data
Next generation sequencing (NGS) produces reads with the observed nucleotide bases. These reads are then aligned to a reference genome, and each position will be covered by a certain number of reads; the more reads covering a position, the more certain we are of the true genotype at that position. The number of reads covering a position is called the sequencing depth. Each base in a sequenced read has a quality score denoting the certainty of the called nucleotide base. The genotype likelihood (GL) is the likelihood of observing the sequence data x_j given an unknown genotype G, meaning p(x_j|G) for a given position j in the genome. The sequence data x_j consist of R observed bases. The GL p(x_j|G) is calculated using the quality scores of x_j; one method for doing so is to assume that the R observed bases are independent given the genotype G,

$$p(x_j\mid G) = \prod_{r=1}^{R} p(b_r\mid G),$$

where A_1 and A_2 are the two alleles of the genotype G. Summing over the alleles A and rewriting the likelihood gives

$$p(x_j\mid G) = \prod_{r=1}^{R} \sum_{A\in\{A_1,A_2\}} p(b_r\mid A)\,p(A) = \prod_{r=1}^{R} \frac{1}{2}\left(p(b_r\mid A_1) + p(b_r\mid A_2)\right),$$

since p(A), the probability of observing one of the two alleles, must be one half.
The probability of a base b given an allele A relates to the error rate ε,

$$p(b_r\mid A) = \begin{cases} 1 - \epsilon_r & \text{if } b_r = A,\\ \epsilon_r/3 & \text{otherwise.}\end{cases}$$

Here R is the depth at site j, b_r is the rth observed base, and ε_r is the probability of an error as calculated from the quality score of b_r. This is the approach used in the GATK framework [McKenna et al., 2010]; other approaches exist where the quality score is not used directly, for example SAMtools [Li et al., 2009].
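A minimal sketch of this genotype likelihood calculation, assuming the standard Phred relation ε = 10^(−Q/10); the names are illustrative, not the ANGSD implementation.

```python
# GATK-style genotype likelihood: bases independent given the genotype,
# p(A) = 1/2 for each allele, per-base error from the quality score.
def p_base_given_allele(base, allele, eps):
    """p(b|A): 1 - eps if the base matches the allele, eps/3 otherwise."""
    return 1.0 - eps if base == allele else eps / 3.0

def genotype_likelihood(bases, quals, a1, a2):
    """p(x_j|G) = prod_r sum_{A in {a1, a2}} 0.5 * p(b_r|A), assuming the
    R bases are independent given the genotype G = a1/a2."""
    gl = 1.0
    for b, q in zip(bases, quals):
        eps = 10.0 ** (-q / 10.0)          # quality score to error probability
        gl *= 0.5 * p_base_given_allele(b, a1, eps) \
            + 0.5 * p_base_given_allele(b, a2, eps)
    return gl

# three reads covering a site: the heterozygote explains the data best
bases, quals = ["A", "A", "C"], [30, 20, 30]
for genotype in [("A", "A"), ("A", "C"), ("C", "C")]:
    print(genotype, genotype_likelihood(bases, quals, *genotype))
```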
The likelihood can be used to calculate the posterior probability of the genotype G (the genotype probability), using a genotype prior p(G) and Bayes' theorem, as in eq. 3.

Poisson distribution written as an exponential family
For the Poisson distribution, the probability that the phenotype y_i of the ith individual takes the value k can be written as

$$p(y_i = k) = \frac{\lambda_i^{k} e^{-\lambda_i}}{k!},$$

where the linear predictor for the Poisson regression is η_i = log(λ_i). Writing exp(log(p(y_i))) gives

$$\exp\left(y_i \log(\lambda_i) - \lambda_i - \log(y_i!)\right).$$

We can then write it as an exponential family, using the notation from Dobson and Barnett [2008],

$$\exp\left(\frac{y_i \eta_i - b(\eta_i)}{a(\phi)} + c(y_i, \phi)\right),$$

giving us a(φ) = 1, b(η_i) = λ_i = exp(η_i) and c(y_i, φ) = −log(y_i!).
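A quick numeric check of this identity (illustrative only):

```python
# The Poisson pmf matches its exponential-family form with a(phi) = 1,
# b(eta) = exp(eta) and c(y, phi) = -log(y!).
import math

lam, y = 2.5, 3
eta = math.log(lam)                         # linear predictor eta = log(lambda)
pmf = lam ** y * math.exp(-lam) / math.factorial(y)
expfam = math.exp(y * eta - math.exp(eta) - math.lgamma(y + 1))
print(pmf, expfam)                          # agree up to floating point error
```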

EM algorithm in ANGSD-asso's latent model
Starting from eq. 4 in the main text, where we have

$$p(y_i\mid x_i, z_i, \theta) = \sum_{G\in\{0,1,2\}} p(y_i\mid G, z_i, \theta)\, p(G\mid x_i),$$

we base the term p(y_i|G_i, z_i, θ) on a linear regression model. We assume that, given the genotype G, the covariates Z and the parameters θ, the phenotype follows a normal distribution with mean given by the linear predictor η_i in (15). We can then apply the EM algorithm to maximise the likelihood. First we do the E-step; the term p(y_i|G_i, z_i, θ) is the only one that depends on θ, so this is equivalent to maximising the expected complete-data log-likelihood

$$\sum_{i}\sum_{G\in\{0,1,2\}} p(G\mid y_i, x_i, z_i, \theta')\, \log p(y_i\mid G, z_i, \theta),$$

where θ' denotes the parameters from the previous iteration. The M-step then differentiates this with respect to each parameter, for example ∂/∂β, as in Lake et al. [2003], using the rule for differentiating a sum when going from (18). The term p(y_i|G_i, z_i, θ) can be written as an exponential family, where a(σ) = σ², b(η_i) = η_i²/2 and c(y_i, σ) = −y_i²/(2σ²) − log(2πσ²)/2. For a logistic regression we have a(σ) = 1, b(η_i) = log(1 + e^{η_i}) and c(y_i, σ) = 0, and for a Poisson regression we have a(φ) = 1, b(η_i) = λ_i = exp(η_i) and c(y_i, φ) = −log(y_i!).
In going from (23) to (24) we again use the rule for differentiating a sum; for α we simply replace z_i with 1.
We recognise (24) and (25) as the score functions of a weighted regression with respect to the respective parameters (α, β, γ) [Dutang, 2017]. Each individual i contributes one observation per possible genotype G; the weights are given by p(G_i|y_i, x_i, z_i, θ), the probability of genotype G given the phenotype y, covariates z and parameters θ. This is maximised by weighted least squares, where the parameters θ are chosen to maximise the likelihood.
The term p(G_i|y_i, x_i, z_i, θ) can be estimated using Bayes' theorem again, making use of the assumption p(G_i|x_i, z_i, θ) = p(G_i|x_i) and of the fact that we can ignore the sequence data when we have the genotype, p(y_i|G_i, x_i, z_i, θ) = p(y_i|G_i, z_i, θ):

$$p(G_i\mid y_i, x_i, z_i, \theta) = \frac{p(y_i\mid G_i, z_i, \theta)\, p(G_i\mid x_i)}{\sum_{g\in\{0,1,2\}} p(y_i\mid g, z_i, \theta)\, p(g\mid x_i)}. \quad (26)$$
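A small sketch of these E-step weights for a normal phenotype; the names are illustrative, and eta_by_g is assumed to hold the linear predictor α + βg + zᵀγ evaluated at g = 0, 1, 2 for each individual.

```python
# E-step weights of eq. (26): the genotype probability p(G|x) is combined
# with the normal phenotype likelihood p(y|G, z, theta).
import numpy as np
from scipy import stats

def e_step_weights(y, gp, eta_by_g, sigma):
    """y: (n,) phenotypes; gp: (n, 3) genotype probabilities p(G|x);
    eta_by_g: (n, 3) linear predictors; sigma: residual sd.
    Returns the (n, 3) weights."""
    lik = stats.norm.pdf(y[:, None], loc=eta_by_g, scale=sigma)
    w = gp * lik                             # numerator of Bayes' theorem
    return w / w.sum(axis=1, keepdims=True)  # normalise over genotypes
```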

Optimisation strategy for a normally distributed phenotype
First, an initial guess of the standard deviation is calculated from the phenotype using the sample standard deviation. Then linear regression is done with the full model, using the dosages calculated from the genotype probabilities, to obtain an initial guess of the coefficients for ANGSD-asso's latent model for faster convergence; this is referred to in this article as priming the coefficients. Regression weights are then calculated according to (26), and weighted least squares is done using the parameters and weights from the last iteration of the EM algorithm. Each individual has three entries in the design matrix (G, Z), where G is a vector with each of the three possible genotypes for each individual, each weighted by p(G_i|y_i, x_i, z_i, θ) as estimated for that individual and that genotype. Then σ is updated using the weighted sum of squared residuals from the weighted least squares, with n − o degrees of freedom, where n is the number of individuals and o is the number of coefficients in the linear model as described in (15). The term p(y_i|G_i, z_i, θ) can be calculated using a normal distribution with mean η_i from (15) and standard deviation σ.
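A condensed sketch of this optimisation strategy, assuming a normal phenotype, an (n, 3) matrix gp of genotype probabilities and an (n, k) covariate matrix Z; this is an illustration of the algorithm as described above, not the multi-threaded C/C++ implementation.

```python
# Priming with a dosage regression, then iterated weighted least squares
# on an expanded design with three rows per individual (one per genotype),
# weighted by eq. (26).
import numpy as np
from scipy import stats

def latent_model_em(y, gp, Z, n_iter=50):
    n = len(y)
    g = np.array([0.0, 1.0, 2.0])
    # priming: OLS on dosages gives the initial coefficients, and the
    # sample standard deviation the initial sigma
    Xd = np.column_stack([np.ones(n), gp @ g, Z])
    theta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    sigma = y.std(ddof=1)
    # expanded design matrix: three rows per individual, one per genotype
    X = np.column_stack([np.ones(3 * n), np.tile(g, n),
                         np.repeat(Z, 3, axis=0)])
    yy = np.repeat(y, 3)
    for _ in range(n_iter):
        eta = (X @ theta).reshape(n, 3)
        w = gp * stats.norm.pdf(y[:, None], eta, sigma)   # eq. (26) weights
        w /= w.sum(axis=1, keepdims=True)
        sw = np.sqrt(w.ravel())
        # M-step: weighted least squares on the expanded design
        theta = np.linalg.lstsq(X * sw[:, None], yy * sw, rcond=None)[0]
        resid = yy - X @ theta
        dof = n - X.shape[1]                              # n - o
        sigma = np.sqrt((w.ravel() * resid ** 2).sum() / dof)
    return theta, sigma
```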
8 SNPTEST bias with simulated data

Figure 12: This data is simulated with 10,000 individuals with average depths of 0.1, 1, 10 and 20X, 2,500 individuals each, varying the effect size of the genotype (β). There is an effect of ancestry of population 1 (γ = 0.5); the admixture proportions for population 1 are, for the same groups of 2,500 individuals, Q_1 = 0.15, 0.4, 0.95 and 1. We use a significance threshold of 10^-3. The linear models are adjusted for ancestry; SNPTEST was run without transforming the phenotype or covariates to make it as comparable to ANGSD-asso's latent model as possible. Each point is based on 1,000 simulations. (a): the statistical power to detect a true association using ANGSD-asso's latent model and SNPTEST's latent model, respectively, with a sample frequency prior (f) and an individual allele frequency prior (π). (b): the bias of the estimated effect size.
9 Running times for a quantitative trait

Figure 13: Running times for an analysis of a simulated quantitative trait with 442,769 genetic variants, varying the number of individuals (1,000, 5,000, 10,000, 20,000, 50,000 and 100,000); the model is run with 2 covariates (age and gender). The genetic data has an average depth of 1X. For each point we ran the analysis 3 times and used the mean running time. All analyses except SNPTEST were run in ANGSD.