Greater power and computational efficiency for kernel-based association testing of sets of genetic variants

Motivation: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test—a score test—with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene–gene interactions are sought, state-of-the art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods. Results: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test—up to 23 more associations—whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene–gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500. Availability: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/. Contact: heckerma@microsoft.com Supplementary information: Supplementary data are available at Bioinformatics online.

of the main paper (no background variance component used).
Red points denote hypotheses for which the numerical routine to compute the p-value for the score test lost precision and returned p=0. For these points, we assigned them a score p-value equal to the lowest observed p-value in that experiment. Table 1. Type I error for WTCCC data using half of chromosomes as polygenic background. Linear LR test 7.9 × 10 −6 1.2 × 10 −4 1.0 × 10 −3

Supplementary
There are no statistically significant deviations from expectation according to binomial test with significance level of 0.05.

Supplementary Table 2.
Validation of the three immune-related phenotypes from the WTCCC analysis. Raw number of validated hits found by each method. Data for validation were downloaded on 5/27/2014 from http://immunobase.org. "Max # could validate" denotes the total number of genes in our analysis found in the immune database, while, "# in db" denotes the total number genes listsed in the immune database. "1K" denotes the analysis with no background variance component, while "2K" denotes the analysis with a background variance component, as described in the main paper. Supplementary Methods

Definitions and frequently used identities
• I J denotes a J-by-J identity matrix.
• I denotes an identity matrix, where the dimensionality follows from the context.
• N -by-1 vector of phenotypes y • N -by-D covariates matrix X with full column rank D and N ≥ D • N -by-N covariance matrix Σ θ , for the Gaussian-distributed phenotype y, parameterized by the vector of covariance parameters θ ∈ Θ. In this paper we assume a simple weighted sum of fixed individual positive semi-definite matrices. So in the case of no background kernel (that is, no kernel appearing in the null model), Σ θ = σ 2 e I + σ 2 1 K 1 , whereas for the case where there is a background kernel, Σ θ = σ 2 e I + σ 2 g K g + σ 2 1 K 1 . In both cases, under the null hypothesis σ 2 1 = 0, as this is the parameter being tested. Σ θ is assumed to have full rank N (which it will so long as σ 2 e > 0).
• N -by-N symmetric covariate orthogonal projection matrix S = I N − X X ⊤ X −1 X ⊤ and has rank N − D.
and has rank N − D.
• For any I-by-J matrix A, the J-by-I matrix A † denotes the Moore-Penrose pseudo-inverse of A.
• For any square I-by-I matrix A, the expression |A| + denotes the pseudo-determinant of A. If A is positive semi-definite, then |A| + may be computed as the product of the non-zero eigenvalues. If all eigenvalues are zero, then the pseudo-determinant is 1.
• Tr(AB) = Tr(BA) for any two matrices A and B.
• A ⊥ is a matrix with columns that are all orthogonal to the columns of the matrix A.

Restricted Maximum Likelihood (REML)
For REML estimation [7] of the mixed model variance parameters, we use the restricted log likelihood [3] for N individuals with the N -by-1 vector of phenotypes y, the N -by-D matrix of covariates X and the covariance matrix Σ θ , parameterized by θ ∈ Θ, given by Using Proposition 19, the restricted likelihood can be simplified using P θ only.

Variance component tests
We perform variance component tests, by modifying the matrix Σ θ (and consequently the matrix P θ ) with respect to the variance parameters σ 2 e , σ 2 g and σ 2 1 , and the kernel matrices K 1 and K g . Recall that in the case of a variance component test without a background kernel (that is, no kernel K g in the null model) Σ θ = σ 2 e I + σ 2 1 K 1 , whereas for the case where there is a background kernel, Σ θ = σ 2 e I + σ 2 g K g + σ 2 1 K 1 . In both cases, under the null hypothesis we set the parameter of interest σ 2 1 to 0 and perform REML estimation on the remaining parameters only.

Efficient likelihood ratio tests in the presence of full-rank background kernels
In [6] it was shown how the likelihood of the mixed model can efficiently be maximized, when both K g and K 1 are low rank and can be factored, respectively, as G g G ⊤ g and G 1 G ⊤ 1 . Here, we make no special assumptions (e.g. rank) on the background kernel K g , only assuming that the variance component K 1 (the component being tested, corresponding to, say, variants in a gene set) is low rank and can be factored as Introducing δ = σ 2 e /σ 2 g and γ 1 = σ 2 1 /σ 2 g , we obtain There are several variations of this basic setting that we want to handle efficiently (and do handle efficiently). These cases are as follows.
The first case occurs when there are variants, W g , among the variants we are testing (G 1 ), which are also used in construction of the background kernel, K g . In this setting, we wish to remove these variants from the background kernel, in order to correct for proximal contamination [4,5]) without explicitly computing a new background kernel from scratch.
As K g decomposes in a sum over contributions from individual variants, we can simply subtract off the contribution of W g from the complete matrix K g [5].
A special case of this occurs when the set of variants being tested G 1 is a subset of the set being removed from the genetic background kernel. In this case, one can remove this subset from the matrix W g and instead modify the variance parameters as follows for additional computational savings: All of these cases can be treated by aggregating all of the "updating factors" (those factors which are added/subtracted from K g ) into a single N -by-k the matrix W and maintaining their relative "weighting" (i.e., variance parameters) by additionally introducing the k-by-k diagonal weight matrix Γ, which contains the variance parameters for each column in W on the diagonal, Now that we have presented the cases of interest, we next show how to perform efficient computations for these cases by using the idea of low-rank updates [5] to the full rank matrix K g . In Sections 4.1.1 and 4.1.2 we provide the updated squared form part and the updated determinant part of the REML likelihood respectively.

Low rank update
The computational bottleneck of the likelihood ratio test is computation of the restricted log likelihood (Equation 2), which contains several expensive terms if computed naively (the squared form, 1 2 y T P θ y, and the pseudo determinant of P θ , 1 2 log|P θ | + . Next we show how to compute each of them efficiently (in terms of time).
Assuming that the eigenvalue decomposition of S (δI + K g ) S has been pre-computed, the additional computations required are multiplication of W with the matrix U , an O(N 2 k) operation. Once this has been computed once for the corresponding matrix W , the remaining computations can be performed in O(N k 2 ).

Squared form update
Given the derivation of the update on P θ , we can efficiently plug this into the squared form.
For the determinant we need to modify the low rank update a bit in order to avoid numerical instability. The problem is that the update matrix is not necessarily positive-semi-definite.
where we have used Sylvester's theorem for matrix determinants, which states that for matrices A and B respectively of size, p × n and n × p, that For numerical reasons it is preferable to avoid computation of any matrix determinants and instead diagonalize the matrix its eigenvalue decomposition and adding up the logarithms of the respective eigenvalues. Yet the matrix AB ⊤ will have negative eigenvalues if there is at least one negative diagonal entry in Γ, such that that we cannot just take logarithms of its eigenvalues.
has an even number of negative eigenvalues.
Proof. By construction we know that the matrix P θ is positive semi-definite, implying that its pseudodeterminant is positive. As the matrix δI N −D + Λ also is positive definite, implying that its determinant is positive, it follows that AB ⊤ must also have a positive determinant. As the determinant equals the product of the eigenvalues, positivity of the determinant implies that the number of negative eigenvalues must be even.
As a result we can compute the log determinant as the sum of the logarithms of the absolute values of the eigenvalues and avoid taking logarithms of any negative numbers.
The computational complexity for computing the determinant update equals that for the squared form update.

Score-based test
We here define the score function of any given covariance parameter, θ i , as the derivative of the log restricted likelihood in Equation 1 with respect to θ i Note that this score function is the difference between a quadratic form in the phenotype, y, and a trace term that does not involve the phenotype.
Noting that the trace term does not depend on the phenotype data, one can choose to use only the quadratic form in the score function, ∂L(θ) as the test statistic. For the variance components model considered in this paper (which is additive/independent in the variance components) this quadratic form is equal to 1 2 y ⊤ P θ K 1 P θ y. We will refer to this as the score-based test statistic, although we will leave lemmas and proofs to apply more generally, valid for any of the score components (i.e., for testing any parameter in the mixed model). Note that this form of this test statistic is the same for the single and two kernel cases. However, once we set σ 2 1 = 0 (the parameter value corresponding to the null hypothesis used to compute the test statistic), then the score-based test statistic can be re-factored in different ways for the one and two kernel cases, as shown in Sections 6 and 6.1.
Next we describe what the distribution of the score-based test statistic is, and how to efficiently compute it. Following that, we describe how to efficiently compute the score-based test statistic itself, for various models.

Distribution of the score-based test statistic
In the following, we derive the null distribution of the score-based test statistic under the assumption that all variance parameters in the null model are known. In particular, for the matrix Σ θ , it is assumed that the nuisance parameters θ are known-that is, Σ θ is assumed to be the true covariance matrix of y under the null distribution. The resulting distribution is a linear combination of χ 2 variables, meaning that we can use Davies method [2] to evaluate significance. Lemma 2. Assuming that y ∼ N ( Xβ ; Σ θ ), the score-based test statistic, under the null hypothesis, is distributed as a weighted sum of χ 2 1 variables, where the weights, φ n , equal the eigenvalues of the matrix 1/2P θ , and where the matrix P 1/2 θ is any matrix square root of P θ .
Proof. We start out by re-writing the test statistic, factoring the matrix P θ as the product of its square roots (P θ = P 1/2 θ P ⊤/2 θ ).
It is our assumption that under the null distribution y is normal distributed with mean Xβ and covariance Σ θ . As shown in Proposition 20, the (N − D)-by-1 vector u = P ⊤/2 θ y is normally distributed with zero mean and unit variance.
Now let V ΦV ⊤ be the spectral decomposition of 1/2P θ , with eigenvalues given by the diagonal of Φ. If we replace this matrix by its spectral decomposition, we obtain the squared form which in turn can be written as a sum weighted by the eigenvalues, φ n , Because the matrix of eigenvectors V is orthonormal, it follows that v is also normally distributed with mean zero and unit variance (and therefore each v 2 n ∼ χ 2 1 ). Therefore, the term n φ n · v 2 n is distributed as the weighted sum of χ 2 1 variables with weights φ n equal to the eigenvalues of 1 2 P T /2 θ

Efficient computation of terms needed for the the score-based test statistic null distribution
To compute a p-value for each set test, we need to compute the null distribution from Lemma 2, which we now show how to do efficiently. In particular, we show how to efficiently compute the eigenvalues of θ , assuming that ∂Σ θ ∂θi factors as GG ⊤ . This assumption is true for our score-based test both with and without the presence of a background kernel. In particular, ∂Σ θ ∂σ 2 Lemma 3. For the case where ∂Σ θ ∂θi factors as GG ⊤ , the non-zero weights of the χ 2 1 distributions in Equation 10 can be computed from the matrix 1 2 G ⊤ P θ G. Proof.
This matrix has the same non-zero eigenvalues as the matrix Note that 1 2 G ⊤ P θ G can be efficiently computed by efficiently using the tricks shown in Section 7.3.

The single kernel score-based test statistic
Let the covariance Σ θ be defined as σ 2 e I + σ 2 1 K 1 , with the parameters θ = [σ 2 e , σ 2 1 ] and σ 2 e > 0 and σ 2 1 ≥ 0. We are interested in testing the null hypothesis σ 2 1 = 0 against the alternative hypothesis σ 2 1 > 0. Under the null hypothesis, σ 2 1 = 0, the matrix P θ reduces to σ −2 e S. It follows that the score with respect to the parameter σ 2 1 is 6.0.2 Low rank single kernel score-based test statistic Let the matrix K 1 be defined as G 1 G ⊤ 1 , where G 1 is of dimension N × k) contains the set of SNPs being tested. Then, the single kernel score with respect to the parameter σ 2 1 can be evaluated as ∂L σ 2 e , σ 2 1 = 0
The score with respect to the parameter σ 2 1 is ∂L σ 2 e , σ 2 g , σ 2 1 = 0 trace term (13) 6.1.1 Low rank two kernel score-based test statistic Let the matrix K 1 be defined as G 1 G ⊤ 1 . Further, let the matrix K g = G g G ⊤ g be low rank. Then, the single kernel score with respect to the parameter σ 2 1 can be evaluated as ∂L σ 2 e , σ 2 g , σ 2 1 = 0 Using Lemma 11 and Proposition 9, we can efficiently evaluate the squared form without computing P θ , and do so in a manner which is linear in N .

Additional derivations
The following derivations and proofs have been adapted from [1]. Please see the preamble of this document for variable definitions when no reminder is provided.
Recall that S = I N − X X ⊤ X −1 X ⊤ and has rank N −D, and that P θ = Σ −1 is an N -by-N symmetric matrix with has rank N − D.
The resulting term is nothing but the ordinary least squares (OLS) regression residuals after regressing out X. Note, that the pseudoinverse of the covariates, X † , need only be computed once (in O(N D 2 )) and then can be re-used across different a.For multiplication of matrices by S the result is applied by treating each row or column as a vector in the multiplication.
7.3 Spectral decomposition of P θ and efficient computation of squared forms in P θ One computational bottleneck for the score-based test is computation of a matrix square root of P θ (P 1/2 θ ), which can be obtained directly from the spectral decomposition of P θ . Therefore, we seek to efficiently compute P 1/2 θ . To do so, we note that σ 2 g P θ = (S (K g + δI N ) S) † , which can be seen using Lemma 12.
Below we show how to compute the a matrix square root of (S (K g + δI N ) S) † .
The following lemma was stated without proof in [7] and is proved in Proposition C.15 of [1]: Proof.
where we used idempotency of S and Lemma 17 to replace S by U U ⊤ .
The proof relies on K g + δI to be full rank, which is always true for δ > 0.
Under certain conditions ("low rank"), these computations can be made more efficient. In particular, for the case, where K g is factored as K = G g G ⊤ g , with the N -by-k g factor G g , only the singular value decomposition of the N -by-k g matrix SG g has to be computed (which is linear in N , rather than cubic as taking the singular value decomposition of K g would be). This can be seen by noting that SG g is a matrix square root of SK g S, and therefore that the economy SVD U Λ 1/2 V ⊤ of SG g gives the economy eigenvalue decomposition of SK g S, which is given by U ΛU ⊤ .
Let U ⊥ be an N -by-(N − k g − D) matrix with orthogonal columns that are all orthogonal to the columns of both U , as well as X. Then the economy eigenvalue decomposition of S (K g + δI N ) S is given by Then the economy eigenvalue decomposition of P θ can be obtained as follows: Lemma 11. It follows that when K g is factored as K g = G g G ⊤ g , then for all a and b, any squared form in P θ , written here as a ⊤ P θ b can be computed efficiently only using the singular value decomposition of SG g . Proof.
Proof. The four properties of the Moore-Penrose pseudoinverse are proven below in Propositions 13 to 16, thereby completing the proof.
This Lemma was stated and used in [7]. A proof can be found in Lemma C.10 of [1].
As S is symmetric, (SΣ θ S) P θ is also symmetric.
As S is symmetric, P θ (SΣ θ S) is also symmetric.
Proposition 15. P θ is a weak inverse of SΣ θ S. Proof.
Proposition 16. SΣ θ S is a weak inverse of P θ . Proof.
Lemma 17. Let U ΛU ⊤ be the economy spectral decomposition of (SΣ θ S), where Λ is an (N − D)-by-(N − D) diagonal matrix, holding the non-zero eigenvalues of (SΣ θ S), and U is the N -by-(N − D) matrix, holding the corresponding N − D eigenvectors of (SΣ θ S) as columns. Then S = U U ⊤ [7]. Proof.
This Proposition was stated without a proof in [3]. A proof was provided in Proposition C.7 in [1].