Iterative hard thresholding in genome-wide association studies: Generalized linear models, prior weights, and double sparsity

Abstract

Background: Consecutive testing of single-nucleotide polymorphisms (SNPs) is usually employed to identify genetic variants associated with complex traits. Ideally one should model all covariates in unison, but most existing analysis methods for genome-wide association studies (GWAS) perform only univariate regression.

Results: We extend and efficiently implement iterative hard thresholding (IHT) for multiple regression, treating all SNPs simultaneously. Our extensions accommodate generalized linear models, prior information on genetic variants, and grouping of variants. In our simulations, IHT recovers up to 30% more true predictors than SNP-by-SNP association testing and exhibits a 2-3 orders of magnitude decrease in false-positive rates compared with lasso regression. We also test IHT on hypertension phenotypes from the UK Biobank and cardiovascular phenotypes from the Northern Finland Birth Cohort of 1966. We find that IHT scales to the large datasets of contemporary human genetics and recovers the plausible genetic variants identified by previous studies.

Conclusions: Our real data analysis and simulation studies suggest that IHT can (i) recover highly correlated predictors, (ii) avoid over-fitting, (iii) deliver better true-positive and false-positive rates than either marginal testing or lasso regression, (iv) recover unbiased regression coefficients, (v) exploit prior information and group sparsity, and (vi) be used with biobank-sized datasets. Although these advances are studied in the context of GWAS inference, our extensions are pertinent to other regression problems with large numbers of predictors.


Introduction
In genome-wide association studies (GWAS), modern genotyping technology coupled with imputation algorithms can produce an $n \times p$ genotype matrix $X$ with $n \approx 10^6$ subjects and $p \approx 10^7$ genetic predictors [1,2]. Datasets of this size require hundreds of gigabytes of disk space to store in compressed form. Decompressing data to floating point numbers for statistical analyses leads to matrices too large to fit into standard computer memory. The computational burden of dealing with massive GWAS datasets limits statistical analysis and interpretation. This article discusses and extends a class of algorithms capable of meeting the challenge of multiple-regression models scaled to the size of modern GWAS datasets.
Traditionally, GWAS analysis has focused on SNP-by-SNP (single-nucleotide polymorphism) association testing [1,3], with a P-value computed for each SNP via linear regression. This approach enjoys the advantages of simplicity, interpretability, and a low computational complexity of O(np). Furthermore, marginal linear regressions make efficient use of computer memory because computations are carried out on genotype vectors one at a time, as opposed to running on the full genotype matrix in multiple regression. Some authors further increase association power by reframing GWAS as a linear mixed-model problem and proceeding with variance component selection [4,5]. These advances remain within the scope of marginal analysis.
Despite its numerous successes [2], marginal regression is less than ideal for GWAS. It implicitly assumes that all SNPs have independent effects. In contrast, multiple regression can in principle model the effects of all SNPs simultaneously. This approach captures the biology behind GWAS more realistically because traits are usually determined by multiple SNPs acting in unison. Marginal regression selects associated SNPs one by one on the basis of a preset significance threshold. Given the stringency of the P-value threshold, marginal regression can miss many causal SNPs with low effect sizes. As a result, heritability is underestimated. When $p \gg n$, one usually assumes that the number of variants $k$ associated with a complex trait is much smaller than $n$. If this is true, we can expect multiple-regression models to perform better because they (i) offer better outlier detection [6] and better prediction, (ii) account for the correlations among SNPs, and (iii) allow investigators to model interactions. Of course, these advantages are predicated on finding the truly associated SNPs.
Adding penalties to the loss function is one way of achieving parsimony in multiple regression. The lasso [7,8] is the most popular model selection device in current use. The lasso model selects non-zero parameters by minimizing the criterion

$$f(\beta) = \ell(\beta) + \lambda \|\beta\|_1,$$

where $\ell(\beta)$ is a convex loss, $\lambda$ is a sparsity tuning constant, and $\|\beta\|_1 = \sum_j |\beta_j|$ is the $\ell_1$ norm of the parameters. The lasso has the virtues of preserving convexity and driving most parameter estimates to 0. Minimization can be conducted efficiently via cyclic coordinate descent [9,10]. The magnitude of the nonzero tuning constant $\lambda$ determines the number of predictors selected.
Despite its widespread use, the lasso penalty has some drawbacks. First, the $\ell_1$ penalty tends to shrink parameters toward 0, sometimes severely so. Second, $\lambda$ must be tuned to achieve a given model size. Third, $\lambda$ is chosen by cross-validation, a costly procedure. Fourth and most importantly, the shrinkage caused by the penalty leaves a lot of unexplained trait variance, which tends to encourage too many false-positive results to enter the model ultimately identified by cross-validation.
Inflated false-positive rates can be mitigated by substituting nonconvex penalties for the $\ell_1$ penalty. For example, the minimax concave penalty (MCP) [11]

$$\lambda p(\beta_j) = \lambda \int_0^{|\beta_j|} \left(1 - \frac{s}{\lambda\gamma}\right)_+ ds$$

starts out at $\beta_j = 0$ with slope $\lambda$ and gradually transitions to a slope of 0 at $\beta_j = \lambda\gamma$. With minor adjustments, the coordinate descent algorithm for the lasso carries over to MCP penalized regression [12,13]. Model selection is achieved without severe shrinkage, and inference in GWAS improves [14]. However, in our experience its false-negative rate is considerably higher than the rate of iterative hard thresholding (IHT) [15]. A second remedy for the lasso, stability selection, weeds out false-positive results by looking for consistent predictor selection across random halves of the data [16]. However, it is known to be under-powered for GWAS compared with standard univariate selection [17]. In contrast, IHT minimizes a loss $\ell(\beta)$ subject to the nonconvex sparsity constraint $\|\beta\|_0 \le k$, where $\|\beta\|_0$ counts the number of non-zero components of $\beta$ [18-20]. Fig. 1 explains graphically how the $\ell_0$ constraint of IHT reduces the bias of the selected parameters compared with the $\ell_1$ and MCP penalties. In the figure $\lambda$, $\gamma$, and $k$ are chosen so that the same range of $\beta$ values is sent to zero. To its detriment, the lasso penalty shrinks all $\beta$'s, no matter how large their absolute values. The nonconvex MCP penalty avoids shrinkage for large $\beta$'s but exerts shrinkage for intermediate $\beta$'s. IHT, which is both nonconvex and discontinuous, avoids shrinkage altogether. For GWAS, the sparsity model-size constant $k$ also has a simpler and more intuitive interpretation than the lasso tuning constant $\lambda$. Finally, both false-positive and false-negative rates are well controlled. Balanced against these advantages is the loss of convexity in optimization and the concomitant loss of computational efficiency. In practice, the computational barriers are surmountable and are compensated by the excellent results delivered by IHT in high-dimensional regression problems such as multiple GWAS regression.

This article has 4 interrelated goals. First, we extend IHT to generalized linear models (GLM). These models encompass most of applied statistics. Previous IHT algorithms focused on normal or logistic sparse regression scenarios. Our software can also perform sparse regression under Poisson and negative binomial response distributions and can be easily extended to other GLM distributions as needed. The key to our extension is the derivation of a nearly optimal step size $s$ for improving the loglikelihood at each iteration. Second, we introduce doubly sparse regression to IHT. Previous authors have considered group sparsity [21]. The latter tactic limits the number of groups selected. It is also useful to limit the number of predictors selected per group. Double sparsity strikes a compromise that encourages selection of correlated causative variants in linkage disequilibrium (LD). Notably, this technique generalizes group IHT. Third, we demonstrate how to incorporate predetermined SNP weights in IHT. Our simple and interpretable weighting option allows users to introduce prior knowledge into sparse projection. Thus, one can favor predictors whose association to the response is supported by external evidence. Fourth, we present MendelIHT.jl: a scalable, open source, and user-friendly software for IHT in the high-performance programming language Julia [22].

Model Development
This section sketches our extensions of IHT.

IHT background
IHT was originally formulated for sparse signal reconstruction, which is framed as sparse linear least-squares regression. In classical linear regression, we are given an $n \times p$ design matrix $X$ and a corresponding $n$-component response vector $y$. We then postulate that $y$ has mean $E(y) = X\beta$ and that the residual vector $y - X\beta$ has independent Gaussian components with a common variance. The parameter (regression coefficient) vector $\beta$ is estimated by minimizing the sum of squares $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$. The solution to this problem is known as the ordinary least-squares estimator and can be written explicitly as $\hat\beta = (X^tX)^{-1}X^ty$. This paradigm breaks down in the high-dimensional regime $n \ll p$, where the parameter vector $\beta$ is underdetermined. In the spirit of parsimony, IHT seeks a sparse version of $\beta$ that gives a good fit to the data. This is accomplished by minimizing $f(\beta)$ subject to $\|\beta\|_0 \le k$ for a small value of $k$, where $\|\cdot\|_0$ counts the number of nonzero entries of a vector. The optimization problem is formally

$$\min_{\beta}\; f(\beta) \quad \text{subject to} \quad \|\beta\|_0 \le k. \tag{1}$$

IHT abandons the explicit formula for $\hat\beta$ because it fails to respect sparsity and involves the numerically intractable matrix inverse $(X^tX)^{-1}$. IHT combines 3 core ideas. The first is steepest descent. Elementary calculus tells us that the negative gradient $-\nabla f(x)$ is the direction of steepest descent of $f(\beta)$ at $x$. First-order optimization methods like IHT define the next iterate in minimization by the formula $\beta_{n+1} = \beta_n + s_n v_n$, where $v_n = -\nabla f(\beta_n)$ and $s_n > 0$ is some optimally chosen step size. In the case of linear regression $-\nabla f(\beta) = X^t(y - X\beta)$. To reduce the error at each iteration, the optimal step size $s_n$ can be selected by minimizing the second-order Taylor expansion

$$f(\beta_n + s_n v_n) \approx f(\beta_n) + s_n \nabla f(\beta_n)^t v_n + \frac{s_n^2}{2}\, v_n^t\, d^2 f(\beta_n)\, v_n$$

with respect to $s_n$. Here $d^2 f(\beta) = X^tX$ is the Hessian matrix of second partial derivatives. Because $f(\beta)$ is quadratic, the expansion is exact. Its minimum occurs at the step size

$$s_n = \frac{\|\nabla f(\beta_n)\|_2^2}{\nabla f(\beta_n)^t X^t X\, \nabla f(\beta_n)}. \tag{2}$$

This formula summarizes the second core idea. The third component of IHT involves projecting the steepest descent update $\beta_n + s_n v_n$ onto the sparsity set $S_k = \{\beta : \|\beta\|_0 \le k\}$.
The relevant projection operator $P_{S_k}(\beta)$ sets all but the $k$ largest entries of $\beta$ in magnitude to 0. In summary, IHT solves problem (1) by updating the parameter vector $\beta$ according to the recipe

$$\beta_{n+1} = P_{S_k}(\beta_n + s_n v_n),$$

with the step size given by formula (2). An optional debiasing step can be added to improve parameter estimates. This involves replacing $\beta_{n+1}$ by the exact minimum point of $f(\beta)$ in the subspace defined by the support $\{j : \beta_{n+1,j} \ne 0\}$. Debiasing is efficient because it solves a low-dimensional problem. Several versions of hard-thresholding algorithms have been proposed in the signal-processing literature. The first of these, NIHT [20], omits debiasing. The rest, HTP [23], GraHTP [24], and CoSaMP [25], offer debiasing.
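To make the recipe concrete, here is a minimal Julia sketch of linear-regression IHT built from the 3 core ideas above: the negative gradient, the step size of formula (2), and the projection $P_{S_k}$. The function names and toy data are ours for illustration; this is not the MendelIHT.jl implementation, which operates on compressed genotypes.

```julia
using LinearAlgebra, Random

# Keep the k largest-magnitude entries of β; zero out the rest (the operator P_Sk).
function project_sparse!(β::Vector{Float64}, k::Int)
    keep = partialsortperm(abs.(β), 1:k, rev = true)
    mask = falses(length(β)); mask[keep] .= true
    β[.!mask] .= 0
    return β
end

# Plain IHT for f(β) = ½‖y − Xβ‖², using the step size of formula (2).
function iht_linear(X::Matrix{Float64}, y::Vector{Float64}, k::Int;
                    maxiter::Int = 200, tol::Float64 = 1e-4)
    n, p = size(X)
    β = zeros(p)
    for iter in 1:maxiter
        v = X' * (y - X * β)               # negative gradient
        s = dot(v, v) / dot(X * v, X * v)  # step size, formula (2)
        βnew = project_sparse!(β + s * v, k)
        norm(βnew - β, Inf) / (norm(β, Inf) + 1) < tol && return βnew
        β = βnew
    end
    return β
end

# Toy usage: 10 causal predictors out of 1,000.
Random.seed!(1)
X = randn(500, 1000)
βtrue = zeros(1000); βtrue[1:10] .= 0.1:0.1:1.0
y = X * βtrue + randn(500)
βhat = iht_linear(X, y, 10)
```

A debiasing step would replace the thresholded iterate by the least-squares fit restricted to its support; it is omitted here for brevity.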

IHT for generalized linear models
A GLM involves responses $y$ following a natural exponential distribution with density in the canonical form

$$f(y \mid \theta, \phi) = \exp\left[\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right],$$

where $y$ is the data, $\theta$ is the natural parameter, $\phi > 0$ is the scale (dispersion), and $a(\phi)$, $b(\theta)$, and $c(y, \phi)$ are known functions that vary depending on the distribution [26,27]. Simple calculations show that $y$ has mean $\mu = b'(\theta)$ and variance $\sigma^2 = b''(\theta)a(\phi)$; accordingly, $\sigma^2$ is a function of $\mu$. Table 1 summarizes the mean domains and variances of a few common exponential families. [Table 1 caption: In GLM, $\mu = g(x^t\beta)$ denotes the mean, $s = x^t\beta$ the linear response, $g$ the inverse link function, and $\phi$ the dispersion. Except for the negative binomial, all inverse links are canonical.] Covariates enter GLM modeling through the inverse link representation $\mu = g(x^t\beta)$, where $x$ is a vector of covariates (predictors) and $\beta$ is a vector of regression coefficients (parameters). In statistical practice, data arrive as a sample of independent responses $y_1, \ldots, y_m$ with different covariate vectors $x_1, \ldots, x_m$. To put each predictor on an equal footing, each should be standardized to have mean 0 and variance 1. Including an additional intercept term is standard practice. If we assemble a design matrix $X$ by stacking the row vectors $x_i^t$, then we can calculate the loglikelihood $L(\beta)$, its gradient (score), and the expected information [26-29]:

$$\nabla L(\beta) = X^t W_1 (y - \mu), \qquad J(\beta) = X^t W_2 X, \tag{3}$$

where $W_1$ and $W_2$ are two diagonal matrices. The second has positive diagonal entries; the two coincide under the identity inverse link $g(s) = s$.
In the GLM version of IHT, we maximize $L(\beta)$ [equivalent to minimizing $f(\beta) = -L(\beta)$] and substitute the expected information $J(\beta_n) = X^t W_2 X$ for the Hessian $d^2 f(\beta_n)$ in formula (2). This translates into the step size

$$s_n = \frac{\|\nabla L(\beta_n)\|_2^2}{\nabla L(\beta_n)^t X^t W_2 X\, \nabla L(\beta_n)} \tag{4}$$

in GLM estimation. This substitution is a key ingredient of our extended IHT. It simplifies computations and guarantees that the step size is nonnegative.
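For illustration, the following Julia fragment evaluates the score and the GLM step size of formula (4) for logistic regression under the canonical logit link, for which $W_1 = I$ and $W_2 = \mathrm{diag}[\mu_i(1 - \mu_i)]$. It sketches the algebra only and does not reproduce the MendelIHT.jl internals.

```julia
using LinearAlgebra

logistic(s) = 1 / (1 + exp(-s))

# Score ∇L(β) and GLM step size of formula (4) for logistic regression
# with the canonical logit link (so W₁ = I and W₂ = diag[μ(1 − μ)]).
function logistic_score_and_stepsize(X::Matrix{Float64}, y::Vector{Float64},
                                     β::Vector{Float64})
    μ = logistic.(X * β)                 # fitted means
    v = X' * (y .- μ)                    # score ∇L(β) = Xᵗ(y − μ)
    w2 = μ .* (1 .- μ)                   # diagonal of W₂
    Xv = X * v
    s = dot(v, v) / dot(Xv, w2 .* Xv)    # step size of formula (4)
    return v, s
end
```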

Doubly sparse projections
The effectiveness of group sparsity in penalized regression has been demonstrated in general [30,31] and for GWAS [32] in particular. Group IHT [21] enforces group sparsity but does not enforce within-group sparsity. In GWAS, model selection is desired within groups as well, to pinpoint causal SNPs. Furthermore, a concern in GWAS is that two causative SNPs can be highly correlated with each other owing to LD. When sensible group information is available, doubly sparse IHT encourages the detection of causative yet correlated SNPs while enforcing sparsity within groups. Here we discuss how to carry out a doubly sparse projection that enforces both within- and between-group sparsity. Suppose we divide the SNPs of a study into a collection $G$ of nonoverlapping groups. Given a parameter vector $\beta$ and a group $g \in G$, let $\beta_g$ denote the components of $\beta$ corresponding to the SNPs in $g$. Now suppose we want to select at most $j$ groups and at most $\lambda_g \in \mathbb{Z}_+$ SNPs for each group $g$. In projecting $\beta$, the component $\beta_i$ is untouched for a selected SNP $i$. For an unselected SNP, $\beta_i$ is reset to 0. By analogy with our earlier discussion, we can define a sparsity projection operator $P_g(\beta_g)$ for each group $g$; $P_g(\beta_g)$ selects the $\lambda_g$ most prominent SNPs in group $g$. The potential reduction in the squared distance offered by group $g$ is

$$r_g = \|P_g(\beta_g)\|_2^2,$$

the sum of squares of the $\lambda_g$ retained entries. The $j$ selected groups are determined by selecting the $j$ largest values of $r_g$. If desired, we can set the sparsity level $\lambda_g$ for each group high enough so that all SNPs in group $g$ come into play. Thus, doubly sparse IHT generalizes group IHT. In Algorithm 1, we write $P(\beta)$ for the overall projection with the component projections $P_g(\beta_g)$ on the $j$ selected groups and projection to zero on the remaining groups.
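The following Julia sketch illustrates one way to implement the doubly sparse projection $P(\beta)$ just described: project each group to its within-group sparsity level $\lambda_g$, score groups by $r_g$, and zero out all but the $j$ best groups. The function names and the data layout (a vector of group labels) are our own illustrative choices, not the MendelIHT.jl API.

```julia
# Keep the λ largest-magnitude entries of a vector, zeroing the rest.
function keep_top!(v::AbstractVector{Float64}, λ::Int)
    keep = partialsortperm(abs.(v), 1:min(λ, length(v)), rev = true)
    out = zeros(length(v)); out[keep] = v[keep]
    v .= out
    return v
end

# Doubly sparse projection: at most j groups, at most λg[g] entries per group.
# `groups[i]` gives the group label (1..G) of predictor i.
function project_doubly_sparse(β::Vector{Float64}, groups::Vector{Int},
                               λg::Vector{Int}, j::Int)
    proj = copy(β)
    G = maximum(groups)
    r = zeros(G)                                 # squared-distance reduction r_g
    for g in 1:G
        idx = findall(==(g), groups)
        keep_top!(view(proj, idx), λg[g])        # within-group projection P_g
        r[g] = sum(abs2, proj[idx])              # r_g = ‖P_g(β_g)‖²
    end
    chosen = partialsortperm(r, 1:j, rev = true) # j groups with largest r_g
    for g in 1:G
        g in chosen || (proj[groups .== g] .= 0) # zero out unselected groups
    end
    return proj
end
```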

Prior weights in IHT
Zhou et al. [32] treat prior weights in penalized GWAS. Before calculating the lasso penalty, they multiply each component of the parameter vector $\beta$ by a positive weight $w_i$. We can do the same in IHT before projection. Thus, instead of projecting the steepest descent step $\beta = \beta_n + s_n v_n$, we project the Hadamard (pointwise) product $w \circ \beta$ of $\beta$ with a weight vector $w$. This produces a vector with a sensible support $S$. The next iterate $\beta_{n+1}$ is defined to have support $S$ and to be equal to $\beta_n + s_n v_n$ on $S$.
In GWAS, weights can and should be informed by prior biological knowledge. A simple scheme for choosing nonconstant weights relies on minor-allele frequencies (MAFs). For instance, Zhou et al. [33] assign SNP $i$ with MAF $p_i$ the weight $[2p_i(1 - p_i)]^{-1/2}$. Giving rare SNPs greater weight in this fashion is most appropriate for traits under strong negative selection [34,35]. Alternatively, our software permits users to assign weights geared to specific pathway and gene information.
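A minimal Julia sketch of the weighted projection follows: rank the entries of $w \circ \beta$, keep the support of the $k$ largest, and copy the unweighted values of $\beta$ on that support. The MAF-based weight shown is the formula reconstructed above, and both helper names are ours rather than MendelIHT.jl functions.

```julia
# Weighted hard-thresholding projection: rank entries of w ∘ β, keep the
# support of the k largest, then restore the unweighted values of β on it.
function project_weighted(β::Vector{Float64}, w::Vector{Float64}, k::Int)
    support = partialsortperm(abs.(w .* β), 1:k, rev = true)
    out = zeros(length(β))
    out[support] = β[support]
    return out
end

# Example MAF-based weight, wᵢ = [2pᵢ(1 − pᵢ)]^(−1/2), favoring rarer variants.
maf_weight(p) = 1 / sqrt(2p * (1 - p))
```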
de Lamare and Rodrigo [36] incorporate prior weights into IHT by adding an element-wise logarithm of a weight vector q before projection. The weight vector q is updated iteratively and requires 2 additional tuning constants that in practice are only obtained through cross-validation. Our weighting scheme is simpler, more computationally efficient, and more interpretable.

Algorithm summary
The final algorithm combining doubly sparse projections, prior weight scaling, and debiasing is summarized in Algorithm 1.

Results
Readers can reproduce our results by accessing the software, documentation, and Jupyter notebooks on our GitHub page: https://github.com/OpenMendel/MendelIHT.jl

Scalability of IHT
To test the scalability of our implementation, we ran IHT on $p = 10^6$ SNPs for sample sizes n = 10,000, 20,000, ..., 120,000 with 5 independent replicates per n. All simulations rely on a true sparsity level of k = 10. Using a machine with 63 GB of RAM and a single 3.3 GHz Intel E5-2670 core, Fig. 2 plots the IHT median CPU time per iteration, median iterations to convergence, and median memory usage under Gaussian, logistic, Poisson, and negative binomial models. Benchmarks were carried out on compressed data. The largest matrix simulated here is 30 GB in size and still fits into our personal computer's memory. Of course, it is possible to test even larger sample sizes using cloud or cluster resources, which are often needed in practice.
The formation of the vector μ of predicted values requires only a limited number of nonzero regression coefficients. Consequently, the computational complexity of this phase of IHT is relatively light. In contrast, calculation of the Fisher score (gradient) and information (expected negative Hessian) depends on the entire genotype matrix X. Fortunately, each of the np entries of X can be compressed to 2 bits. Fig. 2b and e show that IHT memory demands beyond storing X never exceeded a few gigabytes. Fig. 2a and d show that IHT run time per iteration increases linearly in problem size n. Similarly, we expect increasing p will increase run time linearly because the bottleneck of IHT is the matrix-vector multiplication step in computing the gradient, which scales as O(np). Debiasing increases run time per iteration only slightly. Except for negative binomial responses, debiasing is effective in reducing the number of iterations required for convergence and hence overall run time.

Cross-validation in model selection
In actual studies, the true number of genetic predictors $k_{\text{true}}$ is unknown. This section investigates how q-fold cross-validation can determine the best model size on simulated data. Under normal, logistic, Poisson, and negative binomial models, we considered 50 different combinations of $X$, $y$, and $\beta_{\text{true}}$ with $k_{\text{true}} = 10$, n = 5,000 samples, and p = 50,000 SNPs fixed in all replicates. Here $k_{\text{true}}$ is chosen to be close to the model sizes found in our NFBC and UK Biobank analyses. On these datasets we conducted 5-fold cross-validation across 20 model sizes k ranging from 1 to 20. Fig. 3 plots deviance residuals on the holdout dataset for each of the 4 GLM responses (mean squared error in the case of normal responses) and the best estimate $\hat{k}$ of $k_{\text{true}}$. Fig. 3 shows that $k_{\text{true}}$ can be effectively recovered by cross-validation. In general, prediction error starts off high when the proposed sparsity level k severely underestimates $k_{\text{true}}$ and plateaus once $k_{\text{true}}$ is reached (Fig. 3a-d). Furthermore, the estimated sparsity $\hat{k}$ for each run is narrowly centered around $k_{\text{true}} = 10$ (Fig. 3e and f). In fact, $|\hat{k} - k_{\text{true}}| \le 4$ always holds. When $\hat{k}$ exceeds $k_{\text{true}}$, the estimated regression coefficients for the false predictors tend to be very small. In other words, IHT is robust to overfitting, in contrast to lasso penalized regression. We see qualitatively similar results when $k_{\text{true}}$ is large; this proved to be the case in our previous article [15] for Gaussian models with $k_{\text{true}} \in \{100, 200, 300\}$.

Comparing IHT to lasso and marginal tests in model selection
Comparison of the true-positive and false-positive rates of IHT and its main competitors is revealing. For lasso regression we use the glmnet implementation of cyclic coordinate descent [9,10,37] (v2.0-16 implemented in R 3.5.2); for marginal testing we use the beta version of MendelGWAS [38]. As explained later, zero-inflated Poisson regression relies on the pscl package in R. Unfortunately, glmnet does not accommodate negative binomial regression. Because both glmnet and pscl operate on floating point numbers, we limit our comparisons to small problems with 1,000 subjects, 10,000 SNPs, 50 replicates, and k = 10 causal SNPs. IHT performs model selection by 3-fold cross-validation across model sizes ranging from 1 to 50. This range is generous enough to cover the models selected by lasso regression. We adjust for multiple testing in the marginal case by applying a P-value cut-off of $5 \times 10^{-6}$.

Table 2 demonstrates that IHT achieves the best balance between maximizing true-positive results and minimizing false-positive results. IHT finds more true-positive results than marginal testing and almost as many as lasso regression. IHT also finds far fewer false-positive results than lasso regression. Poisson regression is exceptional in yielding an excessive number of false-positive results in marginal testing. A similar but less extreme trend is observed for lasso regression. The marginal false-positive rate is reduced by switching to zero-inflated Poisson regression. This alternative model is capable of handling overdispersion due to an excess of 0 values. Interestingly, IHT controls Poisson false-positive results well without resorting to the zero-inflated model.

As the magnitude of $\beta_{\text{true}}$ decreases, IHT estimates show an upward absolute bias, consistent with the winner's curse phenomenon. When sample sizes are small, small effect sizes make most predictors imperceptible amid the random noise. The winner's curse operates in this regime and cannot be eliminated by IHT. Lasso's strong shrinkage overwhelms the bias of the winner's curse and yields estimates smaller than true values. The results displayed in Table 3 reflect n = 5,000 subjects, p = 10,000 SNPs, 100 replicates, and a sparsity level k fixed at its true value $k_{\text{true}} = 10$. The $\lambda$ value for lasso is chosen by cross-validation. To avoid datasets with monomorphic SNPs, the minimum MAF is set at 0.05. For linear, logistic, and Poisson regressions in marginal tests, we first screen for potential SNPs via a score test. Only the top SNPs are used in the more rigorous and more computationally intensive likelihood ratio tests, which give the $\beta$ estimates. This procedure is described in Zhou et al. [38]. We ran likelihood ratio tests for all SNPs in the negative binomial model because the screening procedure is not yet implemented. However, the inflation in parameter estimates is present throughout all marginal tests.

Correlated covariates and doubly sparse projections
Next we study how well IHT works on correlated data and whether doubly sparse projection can enhance model selection. Table 4 shows that, in the presence of extensive LD, IHT performs reasonably well even without grouping information. When grouping information is available, group IHT enhances model selection. The results displayed in Table 4 reflect n = 1,000 samples, p = 10,000 SNPs, and 100 replicates. Each SNP belongs to 1 of 500 disjoint groups containing 20 SNPs each; $j = 5$ distinct groups are each assigned 1, 2, ..., 5 causal SNPs with effect sizes randomly chosen from {−0.2, 0.2}. In all there were 15 causal SNPs. For grouped IHT, we assume perfect group information; i.e., groups containing 1-5 causative SNPs are assigned $\lambda_g \in \{1, 2, \ldots, 5\}$. The remaining groups are assigned $\lambda_g = 1$. As described in the Methods section, the simulated data show LD within each group, with the degree of LD between 2 SNPs decreasing as their separation increases. Although the conditions of this simulation are somewhat idealized, they mimic what might be observed if small genetic regions of whole-exome data were used with IHT.
We repeated this examination of doubly sparse projection for the first 30,000 SNPs of the NFBC1966 [40] data for all samples passing the quality control measures outlined in our Methods section. We arbitrarily assembled 2 large groups with 2,000 SNPs, 5 medium groups with 500 SNPs, and 10 small groups with 100 SNPs, representing genes of different length. The remaining SNPs are lumped into a final group representing non-coding regions. In all there are 18 groups. Because group assignments are essentially random beyond choosing neighboring SNPs, this example represents the worst-case scenario of a relatively sparse marker map with undifferentiated SNP groups. We randomly selected 1 large group, 2 medium groups, and 3 small groups to harbor the causal SNPs; the results appear in Table 5. We find that even in this worst-case scenario where group information is completely lacking, grouped IHT does no worse than ungrouped IHT.

Introduction of prior weights
This section considers how scaling by prior weights helps in model selection. Table 6 compares weighted IHT reconstructions with unweighted reconstructions where all weights $w_i = 1$. The weighted version of IHT consistently finds ∼10% more true predictors than the unweighted version. Here we simulated 50 replicates involving 1,000 subjects, 10,000 uncorrelated variants, and k = 10 true predictors for each GLM. For the sake of simplicity, we defined a prior weight $w_i = 2$ for 10% of all variants, including the 10 true predictors. For the remaining SNPs the prior weight is $w_i = 1$. These choices reflect a scenario where 10% of all genotyped variants fall in protein-coding regions, including the 10 true predictors, and where such variants are twice as likely to influence a trait as those falling in non-coding regions.

Hypertension GWAS in the UK Biobank
In this section we test IHT on the second release of UK Biobank [41] data. This dataset contains ∼500,000 samples and ∼800,000 SNPs without imputation. Phenotypes are systolic blood pressure (SBP) and diastolic blood pressure (DBP), averaged over 4 or fewer readings. To adjust for ancestry and relatedness, we included the following nongenetic covariates: sex, hospital center, age, age$^2$, BMI, and the top 10 principal components computed with FlashPCA2 [42]. After various quality control procedures that are outlined in the Methods section, the final dataset used in our analysis contains 185,565 samples and 470,228 SNPs. For UK Biobank analysis, we omitted debiasing, prior weighting, and doubly sparse projections.

Stage 2 hypertension under a logistic model
Consistent with the clinical definition of Stage 2 hypertension (S2 Hyp) [43], we designated patients as hypertensive if their SBP ≥ 140 mmHg or DBP ≥ 90 mmHg. We ran a 5-fold cross-validated logistic model across model sizes k = 1, 2, ..., 50. The workload was distributed to 50 computers, each with 5 CPU cores.
Each computer was assigned one model size, and each completed its task within 24 hours. The model size that minimizes the deviance residuals is $\hat{k} = 39$. The selected predictors include the intercept, sex, age, age$^2$, BMI, the fifth principal component, and the 33 SNPs listed in Table 7 (where $\beta$ denotes the estimated effect size [45,46]). Fig. 4, generated by MendelPlots.jl [44], compares univariate logistic GWAS with logistic IHT. SNPs recovered by IHT are circled. Our GitHub page records the full list of significant SNPs detected by univariate GWAS. There are 10 SNPs selected by IHT that have a P-value $< 5 \times 10^{-8}$; 83 SNPs pass the threshold in the univariate analysis but remain unselected by IHT. IHT tends to pick the most significant SNP among a group of SNPs in LD. Table 7 shows 25 SNPs selected by IHT that were previously reported to be associated with elevated SBP/DBP [45] or that exhibit genome-wide significance when the same data are analyzed as an ordinal trait [46]. Ordinal univariate GWAS treats the different stages of hypertension as ordered categories and has higher power than logistic or multinomial GWAS [46]. The known SNPs displayed in Table 7 tend to have larger absolute effect sizes (mean = 0.033) than the unknown SNPs (mean = 0.027). Finally, IHT is able to recover 2 pairs of highly correlated SNPs, (rs1374264, rs1898841) and (rs7497304, rs2677738), with pairwise correlations of $r_{1,2} = 0.59$ and $r_{3,4} = 0.49$.

Cardiovascular GWAS in NFBC1966
We also tested IHT on data from the 1966 Northern Finland Birth Cohort (NFBC1966) [40]. Although this dataset is relatively modest with 5,402 participants and 364,590 SNPs, it has 2 virtues. First, it has been analyzed multiple times [15,40,47], so comparison with earlier analysis is easy. Second, due to a population bottleneck [48], the participants' chromosomes exhibit more extensive LD than is typically found in less isolated populations. Multiple-regression methods, including the lasso, have been criticized for their inability to deal with the dependence among predictors induced by LD. Therefore, this dataset provides an interesting test case.

High-density lipoprotein phenotype as a normal model
Using IHT we find previously associated SNPs as well as a few new potential associations. We model the high-density lipoprotein (HDL) phenotype as normally distributed and find a best model size $\hat{k} = 9$ based on 5-fold cross-validation across model sizes k = 1, 2, ..., 20. Without debiasing, the analysis completed in 2 hours and 4 minutes with 30 CPU cores on a single machine. Table 8 displays the recovered predictors. SNP rs1800961 is replaced by rs7499892, with a similar effect size, when we add the debiasing step in obtaining the final model. Importantly, IHT is able to simultaneously recover effects for SNPs (1) rs9261224, (2) rs6917603, and (3) rs6917603, with pairwise correlations of $r_{1,2} = 0.618$, $r_{1,3} = 0.984$, and $r_{2,3} = 0.62$. This result is achieved without grouping of SNPs, which could further increase association power. Compared with earlier analyses of these data, we find 3 SNPs that were not listed in our previous IHT article [15], presumably owing to slight algorithmic modifications. The authors of the NFBC study [40] found 5 SNPs associated with HDL under SNP-by-SNP testing. We did not find SNPs rs2167079 and rs255049. SNP rs255049 has been replicated elsewhere [47], whereas rs2167079 has been reported to be associated with an unrelated phenotype [49].

Discussion
Multiple-regression methods like IHT provide a principled way of model fitting and variable selection. With increasing computing power and better software, multiple-regression methods are likely to prevail over univariate methods. This article introduces a scalable implementation of IHT for GLMs. Because lasso regression can handle group and prior weights, we have also extended IHT to incorporate such prior knowledge. When this prior knowledge is available, enhanced IHT outperforms standard IHT. Given its sharper parameter estimates and more robust model selection, IHT is clearly superior to lasso selection or marginal association testing in GWAS.
Our real data analyses and simulation studies suggest that IHT can (i) recover highly correlated SNPs, (ii) avoid overfitting, (iii) deliver better true-positive and false-positive rates than either marginal testing or lasso regression, (iv) recover unbiased regression coefficients, and (v) exploit prior information and group sparsity. Our Julia implementation of IHT exploits parallel computing strategies that scale to biobank-level data. In our opinion, the time is ripe for the genomics community to embrace multiple-regression models as a supplement to and possibly a replacement for marginal analysis.
Although we focused our attention on GWAS, the potential applications of IHT reach far beyond gene mapping. Our IHT implementation accepts arbitrary numeric data and is suitable for a variety of applied statistics problems. Genetics and the broader field of bioinformatics are blessed with rich, ultra-high-dimensional data. IHT is designed to solve such problems. By extending IHT to the realm of GLMs, it becomes possible to fit regression models with more exotic distributions than the Gaussian distributions implicit in ordinary linear regression. In our view IHT will eventually join and probably supplant lasso regression as the method of choice in GWAS and other high-dimensional regression settings.

Data simulation
Our simulations mimic scenarios for a range of rare and common SNPs with or without LD. Unless otherwise stated, we designate 10 SNPs to be causal with effect sizes of 0.1, 0.2, ..., 1.0.
To generate independent SNP genotypes, we first sample a MAF $\rho_j \sim \text{Uniform}(0, 0.5)$ for each SNP $j$. To construct the genotype of person $i$ at SNP $j$, we then sample from a binomial distribution with success probability $\rho_j$ and 2 trials. The vector of genotypes (minor-allele counts) for person $i$ forms row $x_i^t$ of the design matrix $X$. To generate SNP genotypes with LD, we divide all SNPs into blocks of length 20. Within each block, we first sample $x_1 \sim \text{Bernoulli}(0.5)$. Then we form a single haplotype block of length 20 by the following Markov chain procedure:

$$x_{i+1} = \begin{cases} x_i & \text{with probability } p \\ 1 - x_i & \text{with probability } 1 - p, \end{cases}$$

with default $p = 0.75$. For each block we form a pool of 20 haplotypes using this procedure, ensuring that each of the 40 alleles (2 at each SNP) is represented at least once. For each person, the genotype vector in a block is formed by sampling 2 haplotypes with replacement from the pool and summing the number of minor alleles at each SNP.
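The following Julia sketch mimics the haplotype-pool scheme just described for a single LD block. For brevity it omits the check that all 40 alleles are represented in the pool, and the function name is ours.

```julia
using Random

# Simulate one LD block of `blocklen` SNPs for `n` people following the
# Markov-chain haplotype-pool scheme described above (default p = 0.75).
function simulate_ld_block(n::Int; blocklen::Int = 20, npool::Int = 20, p::Float64 = 0.75)
    # Build a pool of 0/1 haplotypes with Markov dependence along the block.
    pool = zeros(Int, npool, blocklen)
    for h in 1:npool
        pool[h, 1] = rand(Bool)
        for i in 2:blocklen
            pool[h, i] = rand() < p ? pool[h, i-1] : 1 - pool[h, i-1]
        end
    end
    # Genotype = sum of 2 haplotypes drawn with replacement from the pool.
    G = zeros(Int, n, blocklen)
    for person in 1:n
        h1, h2 = rand(1:npool), rand(1:npool)
        G[person, :] = pool[h1, :] .+ pool[h2, :]
    end
    return G
end

Random.seed!(2020)
X_block = simulate_ld_block(1_000)   # 1,000 people × 20 SNPs in LD
```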
Depending on the simulation, the number of subjects ranges from 1,000 to 120,000, and the number of independent SNPs ranges from 10,000 to 1,000,000. We simulate data under 4 GLM distributions: normal (Gaussian), Bernoulli, Poisson, and negative binomial. We generate component $y_i$ of the response vector $y$ by sampling from the corresponding distribution with mean $\mu_i = g(x_i^t\beta)$, where $g$ is the inverse link function. For normal models we assume unit variance, and for negative binomial models we assume 10 required failures. To avoid overflows, we clamp the linear predictor $x_i^t\beta$ to stay within [−20, 20] (see Ad Hoc Tactics to Prevent Overflows for a detailed explanation). We apply the canonical link for each distribution, except for the negative binomial, where we apply the log link.
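As an illustration of the response simulation, the Julia fragment below draws Poisson responses with the log link, clamping the linear predictor as described. It uses the Distributions.jl package and is a sketch rather than our exact simulation script.

```julia
using Distributions, Random

# Simulate GLM responses y with mean μᵢ = g(xᵢᵗβ), clamping the linear
# predictor to [−20, 20] to avoid overflow (log link shown for Poisson).
function simulate_poisson_response(X::Matrix{Float64}, β::Vector{Float64})
    η = clamp.(X * β, -20, 20)       # clamped linear predictors
    μ = exp.(η)                      # inverse (log) link
    return rand.(Poisson.(μ))        # one Poisson draw per person
end

Random.seed!(2020)
X = randn(1_000, 100)
β = zeros(100); β[1:10] .= 0.1:0.1:1.0
y = simulate_poisson_response(X, β)
```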

UK Biobank
Following the UK Biobank's own quality control procedures, we first filtered all samples for sex discordance and high heterozygosity or missingness. Second, we included only participants of European ancestry and excluded first- and second-degree relatives on the basis of empiric kinship coefficients. Third, we excluded participants who had taken hypertension-related medications at baseline. Finally, we included only participants with a ≥98% genotyping success rate over all chromosomes and only SNPs with a ≥99% genotyping success rate over all included individuals. Calculation of kinship coefficients and filtering were carried out via the OpenMendel module SnpArrays [50].
Remaining missing genotypes were imputed using modal genotypes at each SNP. After these quality control procedures, our UK Biobank data are the same data that were used by German et al. [46].

Northern Finland Birth Cohort
We imputed missing genotypes with Mendel [51]. Following Keys et al. [15], we excluded participants with missing phenotypes, fasting participants, and participants receiving diabetes medication. We conducted quality control measures using the OpenMendel module SnpArrays [50]. On the basis of these measures, we excluded SNPs with MAF ≤ 0.01 and Hardy-Weinberg equilibrium P-values $\le 10^{-5}$. Concerning non-genetic predictors, we included sex (the sexOCPG factor defined in Sabatti et al. [40]) as well as the first 2 principal components of the genotype matrix computed via PLINK 2.0 alpha [52]. To put predictors, genetic and non-genetic, on an equal footing, we standardized all predictors to have mean zero and unit variance.

Linear algebra with compressed genotype files
The genotype count matrix stores minor-allele counts. The PLINK genotype compression protocol [52] compactly stores the corresponding 0's, 1's, and 2's in 2 bits per SNP, achieving a compression ratio of 32:1 compared with storage as floating point numbers. For a model of sparsity level $k$, we use OpenBLAS (a highly optimized linear algebra library) to compute predicted values. This requires transforming the $k$ pertinent columns of $X$ into a floating point matrix $X_k$ and multiplying it by the corresponding entries $\beta_k$ of $\beta$. The inverse link is then applied to $X_k\beta_k$ to give the mean vector $\mu = g(X_k\beta_k)$. In computing the GLM gradient (formula 3), formation of the vector $W_1(y - \mu)$ involves no matrix multiplications. Computation of the gradient $X^t W_1(y - \mu)$ is more complicated because the full matrix $X$ can no longer be avoided. Fortunately, the OpenMendel module SnpArrays [50] can be invoked to perform compressed matrix-times-vector multiplication. Calculation of the step length of IHT requires computation of the quadratic form $\nabla L(\beta_n)^t X^t W_2 X\, \nabla L(\beta_n)$. Given the gradient, this computation requires a single compressed matrix-times-vector multiplication. Finally, good statistical practice calls for standardizing covariates. To standardize the genotype counts for SNP $j$, we estimate its MAF $p_j$ and then substitute the ratio $(x_{ij} - 2p_j)/[2p_j(1 - p_j)]^{1/2}$ for the genotype count $x_{ij}$ of person $i$ at SNP $j$. This procedure is predicated on a binomial distribution for the count $x_{ij}$. Our previous article [15] shows how to accommodate standardization in the matrix operations of IHT without actually forming or storing the standardized matrix.
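The identity underlying standardization on the fly is simple: writing the standardized matrix as $S = (X - \mathbf{1}(2p)^t)D$ with $D = \mathrm{diag}\{[2p_j(1 - p_j)]^{-1/2}\}$, products with $S$ and $S^t$ reduce to products with the raw count matrix $X$ plus rank-one corrections. The Julia sketch below spells this out with an ordinary dense matrix standing in for the compressed genotypes; it illustrates the algebra only and is not the SnpArrays implementation.

```julia
using LinearAlgebra

# Multiply by the standardized genotype matrix S and its transpose without
# ever forming S, where S[i,j] = (X[i,j] − 2pⱼ) / √(2pⱼ(1 − pⱼ)).
# X holds raw minor-allele counts (0, 1, 2); p holds estimated MAFs.
struct StandardizedGenotypes
    X::Matrix{Float64}
    p::Vector{Float64}
end

scales(S::StandardizedGenotypes) = 1 ./ sqrt.(2 .* S.p .* (1 .- S.p))

# S * u = X(Du) − (2pᵗDu)·1, with D = diag(scales)
function times_vec(S::StandardizedGenotypes, u::Vector{Float64})
    du = scales(S) .* u
    return S.X * du .- dot(2 .* S.p, du)
end

# Sᵗ * v = D(Xᵗv − 2p·(1ᵗv))
function trans_times_vec(S::StandardizedGenotypes, v::Vector{Float64})
    return scales(S) .* (S.X' * v .- 2 .* S.p .* sum(v))
end
```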

Although multiplication via the OpenMendel module SnpArrays [50] is slower than OpenBLAS multiplication on small datasets, it can be as much as 10 times faster on large datasets. OpenBLAS has advantages in parallelization, but it requires floating point arrays. Once the genotype matrix X exceeds the memory available in RAM, expensive data swapping between RAM and disk memory sets in. This dramatically slows matrix multiplication. SnpArrays is less vulnerable to this hazard owing to compression. Once compressed data exceed RAM, SnpArrays also succumbs to the swapping problem. Current laptop and desktop computers seldom have >32 GB of RAM, so we may wish to resort to cluster or cloud computing when input files exceed 32 GB.

Computations involving non-genetic covariates
Non-genetic covariates are stored as double or single precision floating point entries in an $n \times r$ design matrix $Z$. To accommodate an intercept, the first column should be a vector of 1's. Let $\gamma$ denote the $r$-vector of regression coefficients corresponding to $Z$. The full design matrix is the block matrix $(X\ Z)$. Matrix multiplications involving $(X\ Z)$ should be carried out blockwise via the identities

$$(X\ Z)\begin{pmatrix}\beta \\ \gamma\end{pmatrix} = X\beta + Z\gamma \quad\text{and}\quad (X\ Z)^t v = \begin{pmatrix}X^t v \\ Z^t v\end{pmatrix}.$$

Adherence to these rules ensures a low memory footprint. Multiplication involving $X$ can be conducted as previously explained. Multiplication involving $Z$ can revert to BLAS.
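A two-line Julia sketch of these blockwise rules follows; in practice the $X\beta$ and $X^t v$ pieces would be routed through the compressed multiplication described above rather than dense BLAS.

```julia
# Blockwise products with the full design matrix (X Z) without forming it,
# where X holds genotypes and Z the non-genetic covariates.
full_times(X, Z, β, γ) = X * β .+ Z * γ          # (X Z)(β; γ) = Xβ + Zγ
full_trans_times(X, Z, v) = (X' * v, Z' * v)     # (X Z)ᵗ v, returned blockwise
```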

Parallel computation
The OpenBLAS library accessed by Julia is inherently parallel. Beyond that we incorporate parallel processing in cross-validation.
Recall that in q-fold cross-validation we separate subjects into q disjoint subsets. We then fit a training model using q − 1 of those subsets at every desired sparsity level and record the mean-squared prediction error on the omitted subset. Each of the q subsets serves as the testing set exactly once. Testing error is averaged across the different folds for each sparsity level k. The lowest average testing error determines the recommended sparsity.
MendelIHT.jl offers 2 parallelism strategies in cross-validation. Either the q training sets are each loaded onto q different CPUs, where each computes and tests different sparsity levels sequentially, or the q training sets are cycled through sequentially and each sparsity level is fitted and tested in parallel. The former tactic requires enough disk space and RAM to store q different training datasets (each typically holding a fraction (q − 1)/q of the full data) but offers immense parallel power because one can assign different computers to handle different sparsity levels. This tactic allows one to fit biobank-scale data in less than one day, assuming enough storage space and computers are available. The latter tactic requires cycling through the training sets sequentially. Because intermediate data can be deleted, this tactic only requires enough disk space and RAM to store one copy of the training set. MendelIHT.jl uses one of Julia's [22] standard libraries, Distributed.jl, to achieve the aforementioned parallel strategies.
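The sketch below shows the shape of the second strategy with Distributed.jl: folds are cycled through sequentially, the candidate sparsity levels of each fold are fitted and tested in parallel with pmap, and the sparsity level with the lowest average testing error is returned. The helpers fit_fold and deviance_fold are hypothetical stand-ins, not MendelIHT.jl functions.

```julia
using Distributed
addprocs(4)     # spawn worker processes; adjust to the available cores

@everywhere begin
    # Hypothetical stand-ins for an IHT fit at sparsity k and its held-out
    # deviance; they are illustrative only, not the MendelIHT.jl API.
    fit_fold(train, k) = k
    deviance_fold(model, test) = rand()
end

# q-fold cross-validation, cycling through folds sequentially while fitting
# the candidate sparsity levels of each fold in parallel.
function cv_sparsity(folds::Vector, ks::AbstractVector{Int})
    errors = zeros(length(ks))
    for (train, test) in folds
        errors .+= pmap(k -> deviance_fold(fit_fold(train, k), test), ks)
    end
    errors ./= length(folds)                    # average testing error per k
    return ks[argmin(errors)], errors
end
```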

Ad hoc tactics to prevent overflows
In Poisson and negative binomial regressions, the inverse link argument $\exp(x_i^t\beta)$ experiences numerical overflows when the inner product $x_i^t\beta$ is too large. In general, we avoid running Poisson regression when response means are large; in this regime a normal approximation is preferred. As a safety feature, MendelIHT.jl clamps values of $x_i^t\beta$ to the interval [−20, 20].
Note that penalized regression is hindered by the same overflow catastrophes.

Convergence and backtracking
For each proposed IHT step we check whether the objective $L(\beta)$ increases. When it does not, we step-halve at most 5 times to restore the ascent property. Convergence is declared when

$$\frac{\|\beta_{n+1} - \beta_n\|_\infty}{\|\beta_n\|_\infty + 1} < \text{tolerance},$$

with the default tolerance being 0.0001. The addition of 1 in the denominator of the convergence criterion guards against division by 0.

The code to generate simulated data, as well as their subsequent analysis, is available in our GitHub repository under the "figures" folder. The Project.toml and Manifest.toml files can be used together to instantiate the same computing environment used in our article. Notably, MendelIHT.jl interfaces with the OpenMendel [38] package SnpArrays.jl [50] and JuliaStats's packages Distributions.jl [53] and GLM.jl [54].
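For completeness, a small Julia sketch of the backtracking and convergence checks described above; loglik and iht_step are hypothetical helpers standing in for the loglikelihood evaluation and a single thresholded gradient step.

```julia
using LinearAlgebra

# Convergence test applied after each IHT step (default tolerance 1e-4).
converged(βnew, βold; tol = 1e-4) =
    norm(βnew - βold, Inf) / (norm(βold, Inf) + 1) < tol

# Backtracking sketch: if the loglikelihood did not increase, halve the step
# size at most 5 times before accepting the new iterate.
function backtracked_step(loglik, iht_step, β, s)
    βnew = iht_step(β, s)
    halvings = 0
    while loglik(βnew) < loglik(β) && halvings < 5
        s /= 2
        βnew = iht_step(β, s)
        halvings += 1
    end
    return βnew, s
end
```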