biMM: efficient estimation of genetic variances and covariances for cohorts with high-dimensional phenotype measurements

Abstract

Summary: Genetic research utilizes a decomposition of trait variances and covariances into genetic and environmental parts. Our software package biMM is a computationally efficient implementation of a bivariate linear mixed model for settings where hundreds of traits have been measured on partially overlapping sets of individuals.

Availability and Implementation: Implementation in R freely available at www.iki.fi/mpirinen.

Supplementary information: Supplementary data are available at Bioinformatics online.


S1.1 Northern Finland Birth Cohort 1966 (NFBC1966)
NFBC1966 is a birth cohort study of children born in 1966 in the two northernmost provinces of Finland, originally designed to focus on factors affecting pre-term birth, low birth weight, and subsequent morbidity and mortality [10]. The blood sample for DNA extraction and all phenotype data used in the present study were collected at a follow-up visit when the participants were 31 years of age. The phenotypes (Table S1) were adjusted for sex and the first ten principal components of genetic structure. Additionally, blood pressure measurements were adjusted for BMI. The phenotype data are the same as used by Tukiainen et al. [12].
Genotyping was done using the Illumina 370K chip. We computed the genetic relationship matrix R using K = 319,445 genotyped SNPs with minor-allele frequency (MAF) above 0.01.

Table S1: Sixteen quantitative traits analysed in the NFBC1966 data. The heritability estimate (h^2) and its standard error (SE) are from biMM. The exact (ex) analysis was run with t_i = t_d = 0 and the approximate (ap) analysis was run with t_i = t_d = 200. N is the number of individuals with a measurement for the trait.

S1.2 Simulated data with 20,000 individuals
We generated data for 20,000 individuals with HAPGEN2 using 117 Utah residents with Northern and Western European ancestry from the CEPH collection (CEU panel) from HapMap3 (r2.b36) data on chromosome 2 (116,430 SNPs). We computed the genetic relationship matrix R using K = 98,516 SNPs with minor-allele frequency (MAF) above 0.01. We generated 100 phenotype vectors with heritabilities between 0.2 and 0.8. The phenotype vectors were uncorrelated and contained no missing data.

S2.1 Northern Finland Birth Cohort 1966 (NFBC1966)
Below we show scatter plots for the six pairwise comparisons of the four methods biMM (complete, t_i = t_d = 0), GEMMA, BOLT-REML and GCTA across the 120 trait pairs reported in the Example analysis of the main paper, for heritabilities V_G, genetic correlations ρ_G and their standard errors SE(V_G) and SE(ρ_G).

Figure S1: Heritability estimates across 120 pairs of traits from NFBC1966 data.

Figure S2: Genetic correlation estimates across 120 pairs of traits from NFBC1966 data.

S2.2 Simulated data with 20,000 individuals
We ran biMM, GEMMA and BOLT-REML on the data set with 20,000 individuals but were unable to run GCTA with 16 GB of memory. We focused on the completely observed (or completely pre-imputed) data because there GEMMA's option to reuse an eigendecomposition can speed up the analysis. After the eigendecomposition of the R matrix was done (70 CPU minutes with biMM and 450 CPU minutes with GEMMA), the run time per each additional pair of traits was 1.6 CPU seconds for biMM and 75 CPU seconds for GEMMA. BOLT-REML took 45 CPU minutes to analyse one pair of traits and does not benefit from a stored eigendecomposition. The difference between biMM and GEMMA is mainly that GEMMA reads the eigendecomposition from a file for each pair of analyzed traits whereas biMM is able to hold it in RAM. Next, we evaluated whether the analysis using GEMMA or BOLT-REML could be sped up by analyzing more than two traits at a time.
Let us assume that we have T traits, i.e. T(T − 1)/2 trait pairs, to analyze and that we use GEMMA or BOLT-REML by applying it to t traits at a time. This can be done by first analyzing B = T/t non-overlapping blocks of the traits (block 1: 1, ..., t; block 2: t + 1, ..., 2t, etc.) and then computing the covariances for the remaining pairs by applying GEMMA/BOLT-REML four times per each pair of blocks. For example, to compute the cross-covariances between blocks 1 and 2, we do the following four analyses, each having exactly t traits: (1, ..., t/2, t + 1, ..., 3t/2), (1, ..., t/2, 3t/2 + 1, ..., 2t), (t/2 + 1, ..., t, t + 1, ..., 3t/2) and (t/2 + 1, ..., t, 3t/2 + 1, ..., 2t). Thus, altogether we need B + 4·B(B − 1)/2 = B(2B − 1) applications of a t-dimensional analysis. Table S2 reports the run time of GEMMA and BOLT-REML for t = 2, 4, ..., 20 dimensional analyses as well as the expected run time to analyze T = 500 traits, i.e., 124,750 trait pairs. We see that a multi-trait analysis strategy can drop the total run time from 100 days to 14 days with GEMMA and from 11 years to 7 years with BOLT-REML. For both methods, there seems to be an optimal number of traits per analysis (6-8 with GEMMA, 4-6 with BOLT-REML) after which an increase in the number of traits also increases the total run time.

Figure S4: Estimates of standard errors of genetic correlation estimates across 120 pairs of traits from NFBC1966 data. Not available for GEMMA.

Table S2: Run times of GEMMA and BOLT-REML for t = 2, 4, ..., 20 dimensional analysis for one data set with 20,000 individuals in CPU minutes (row 'one') and for all pairs of T = 500 traits in CPU days or CPU years (row 'all'). For GEMMA, the eigendecomposition is read from a file. Run time was not measured for BOLT-REML with t > 8.
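The block-analysis accounting above can be sketched as follows. The per-analysis run times in this snippet are illustrative placeholders only (not the measured values of Table S2); the counting function itself follows directly from the formula B(2B − 1).

```python
# Number of t-dimensional analyses needed to cover all pairs of T traits:
# B = T/t blocks, analyzed once each, plus 4 analyses per pair of blocks,
# giving B + 4*B*(B-1)/2 = B*(2B - 1) applications in total.
def n_applications(T, t):
    assert T % t == 0, "for simplicity, assume t divides T"
    B = T // t
    return B * (2 * B - 1)

T = 500  # traits, i.e. T*(T-1)//2 = 124,750 trait pairs

# Hypothetical run time (CPU minutes) of one t-dimensional analysis,
# growing super-linearly in t; an illustrative assumption, not a
# measurement from the paper.
def minutes_per_analysis(t):
    return 1.5 * t ** 1.8

for t in (2, 4, 10, 20):
    total_days = n_applications(T, t) * minutes_per_analysis(t) / (60 * 24)
    print(f"t={t:2d}: {n_applications(T, t):6d} analyses, {total_days:8.1f} CPU days")
```

Note that for t = 2 the formula reduces to exactly one analysis per trait pair (B(2B − 1) = T(T − 1)/2), as it should.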

S3 Polygenic model for genome-wide data
To derive the linear mixed model, we consider a polygenic model that assumes that the genetic component of the traits is distributed among a large number of individual variants having additive effects. When applied to "unrelated" individuals (or, more precisely, only very distantly related individuals from our population cohorts), the model decomposes the trait variance into an additive genetic component (G) that is due to the available panel of SNPs, and an environmental component (ε) that also includes higher-order genetic components together with the part of the additive component that is not captured by G.
In particular, the bivariate model can be used for estimating a lower bound for the additive genetic variance of both phenotypes, the correlation between the additive genetic components of the two phenotypes and the correlation between the environmental components of the two phenotypes. For large-scale human genetic studies with a univariate phenotype, the corresponding model was introduced by Yang et al. in their software package GCTA [15] and further explained by Visscher et al. [13] and Zaitlen and Kraft [17]. The bivariate extension was recently considered by Deary et al. [4], Korte et al. [6] and Davis et al. [3], and an extension to more than two traits by the GEMMA algorithm of Zhou and Stephens [19] and by BOLT-REML of Loh et al. [8]. A completely new way to estimate some of the parameters of the univariate and bivariate polygenic model via LD-score regression using summary statistics from genome-wide association studies was recently introduced by Bulik-Sullivan et al. [2,1].
Next we formulate the univariate (Section S3.1) and bivariate (Section S3.2) versions of the model, explain how we do the computation efficiently (Section S3.3) and give simulation results to verify that we have a valid interpretation of the parameter estimates (Section S3.4).

S3.1 Univariate polygenic model
We start by describing the model for a univariate phenotype (say phenotype 1). Let Y_1 be a vector of observed quantitative measurements for n individuals. Let g*_ik be the standardized genotype of individual i at SNP k, that is,

g*_ik = (g_ik − 2 f_k) / sqrt(2 f_k (1 − f_k)),

where g_ik is the minor allele count (0, 1 or 2) that i carries at locus k and f_k = (1/(2n)) Σ_{i=1}^n g_ik is the sample allele frequency. The linear model assumes that

Y_i1 = μ_i1 + Σ_{k=1}^K g*_ik β_k1 + ε_i1,   (S1)

where β_k1 is the effect of SNP k on Y_1, ε_i1 is the environmental term of Y_1 for individual i and the sum is over all the SNPs. The term μ_i1 describes the expected phenotype of individual i after all the non-genetic covariates (such as age, sex and cohort) have been taken into account, for example, by considering residuals from a regression model. Following [15], we assume that β_k1 ~ N(0, v_g1) for each SNP k and ε_i1 ~ N(0, V_ε1) for each individual i. Then the additive genetic variance of Y_1 due to the SNPs, V_G1 = K v_g1, can be estimated by a linear mixed model formulation of model (S1) [15,17]:

Y_1 ~ N(μ_1, V_G1 R + V_ε1 I),   (S2)

where R is the genetic relationship matrix and I is the identity matrix. Note that since the variance of the effect size distribution, v_g1, is the same for all standardized SNPs, it follows that the variance of the effect size distribution for an allele at SNP k, v_g1 / (2 f_k (1 − f_k)), grows as the minor allele frequency f_k decreases. Thus, this model assumes larger per-allele effect sizes at rarer SNPs than at more common SNPs.
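The last point can be illustrated numerically. The snippet below is a minimal sketch, not part of biMM, and the value of v_g1 is an arbitrary illustrative choice:

```python
# Per-allele effect-size variance implied by the standardized-SNP model:
# var(per-allele effect at SNP k) = v_g1 / (2 * f_k * (1 - f_k)).
# v_g1 = 1.0 is an arbitrary illustrative value, not a fitted parameter.
def per_allele_variance(f_k, v_g1=1.0):
    return v_g1 / (2 * f_k * (1 - f_k))

rare = per_allele_variance(0.01)    # MAF 1%  -> variance ~50.5 * v_g1
common = per_allele_variance(0.50)  # MAF 50% -> variance 2 * v_g1
print(rare, common)
assert rare > common  # rarer SNPs are assumed to have larger per-allele effects
```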

S3.2 Bivariate polygenic model
We follow the extension of the linear model to the bivariate case given by [4]. This corresponds to the model

(Y_1, Y_2) ~ N( (μ_1, μ_2), Σ ),  Σ = ( V_G1 R + V_ε1 I    V_G12 R + V_ε12 I
                                        V_G12 R + V_ε12 I   V_G2 R + V_ε2 I ),   (S3)

expressed as n × n blocks, where n is the number of individuals. From this model, an estimate of V_Gt gives a lower bound for the additive genetic variance of each trait (t = 1, 2) and thus can be used to estimate a lower bound for the (narrow-sense) heritability. We can also estimate the genetic correlation ρ_G = V_G12 / sqrt(V_G1 V_G2), which can be interpreted as the correlation of the additive genetic components. Similarly we can estimate ρ_ε = V_ε12 / sqrt(V_ε1 V_ε2), the correlation in the environmental terms between the phenotypes.
At the level of individual genetic effects, this model corresponds to the assumption that

(β_k1, β_k2) ~ N( (0, 0), ( v_g1   v_g12
                            v_g12  v_g2 ) )  for all k ≤ K,

where v_gt is the variance of the individual genetic effect on trait t = 1, 2 and v_g12 is the covariance between the effects of a single variant on the two traits. We will consider the consequences of this assumption below in Section S3.4.

S3.3 biMM algorithm
Our goal is to provide an efficient algorithm for estimating (co)variance parameters in the setting where hundreds of traits have been measured on the same set of individuals. For this purpose, we extend computational solutions of previous univariate linear mixed model algorithms [18,7,9] to the bivariate case. The biMM algorithm described here has not been published before, although M Pirinen has applied some ideas of biMM's likelihood computation in an earlier study of reading and mathematics skills [3].
Recently, general algorithms for multivariate linear mixed models have been introduced by Zhou and Stephens [19] and Loh et al. [8] with applications to genome-wide data sets. We see two main contributions of the biMM algorithm given the existing work in the field: (1) a new and simple way to write down the likelihood function of the bivariate LMM using O(n) operations; and (2) an implementation that is particularly efficient for estimating pairwise variance components across thousands of trait pairs, including the functionality to order traits suitably and to impute or ignore only some trait values as necessary for efficient computation.
While an appropriately adjusted version of the GEMMA algorithm of Zhou and Stephens [19] might also lead to a quick bivariate likelihood computation, to our knowledge no publicly available implementation efficiently handles our setting of hundreds of traits with possibly missing values.
BOLT-REML is the only existing linear mixed model implementation that can run on very large numbers of individuals (Loh et al. reported test runs with 50,000 individuals [8]). Since BOLT-REML bypasses the generation of the explicit relationship matrix, it is based on very different approximations compared to biMM, GEMMA or GCTA. The differences in heritability estimates were somewhat larger between BOLT-REML and the other methods (Figure S1) than among the other methods themselves. Additionally, BOLT-REML is much slower than biMM or GEMMA for the cohort sizes between 5,000 and 20,000 individuals tested in the main paper. Thus, while BOLT-REML importantly extends the range of linear mixed model computations to very large cohorts, there are good reasons to still consider the approach based on explicit matrix computations when those are feasible to carry out, i.e., when the cohort size is up to a few tens of thousands of individuals.

S3.3.1 Likelihood computation
The log-likelihood function for model (S3), after the covariates and the population mean in μ have been regressed out from Y = (Y_1, Y_2), is

l(V) = −(1/2) ( 2n log(2π) + log|Σ| + Y^T Σ^{−1} Y ),

where the variance parameters V = (V_G1, V_G2, V_G12, V_ε1, V_ε2, V_ε12) are included in the matrix Σ = Σ_G + Σ_ε, and |Σ| denotes the determinant of Σ. The eigenvalue decomposition of the positive semi-definite matrix R yields an orthonormal n × n matrix U of eigenvectors and a diagonal n × n matrix D of non-negative eigenvalues d_1, ..., d_n for which R = U D U^T (see e.g. [5]). Because U U^T = I (orthonormality) it follows that

(I_2 ⊗ U^T) Σ (I_2 ⊗ U) = ( ∆(a_i1)   ∆(a_i12)
                             ∆(a_i12)  ∆(a_i2) ),

where ∆(a_i1) is the diagonal matrix whose diagonal elements are a_11, ..., a_n1 and we have used the notation

a_i1 = V_G1 d_i + V_ε1,  a_i2 = V_G2 d_i + V_ε2,  a_i12 = V_G12 d_i + V_ε12.

To compute the determinant, we use the fact [11] that the determinant of a 2 × 2 block matrix with commuting blocks factorizes, giving

log|Σ| = Σ_{i=1}^n log(e_i),  with e_i = a_i1 a_i2 − a_i12^2.

To compute the inverse Σ^{−1} we use a formula for block matrices. Applying it to the transformed matrix above shows that the inverse has diagonal blocks ∆(a_i2 / e_i) and ∆(a_i1 / e_i) and off-diagonal blocks −∆(a_i12 / e_i). By the transformation Z = (I_2 ⊗ U^T) Y, with components z_i1 and z_i2 for traits 1 and 2 of individual i, the quadratic form becomes

Y^T Σ^{−1} Y = Σ_{i=1}^n ( a_i2 z_i1^2 − 2 a_i12 z_i1 z_i2 + a_i1 z_i2^2 ) / e_i.

Hence, after the eigendecomposition, each evaluation of the log-likelihood requires only O(n) operations. To optimize the likelihood we use a combination of Nelder-Mead and BFGS algorithms as implemented in the optim function of the R software package. For BFGS we need the gradient of the log-likelihood.

Derivatives
With the notation defined above, write e_i = a_i1 a_i2 − a_i12^2 and q_i = (a_i2 z_i1^2 − 2 a_i12 z_i1 z_i2 + a_i1 z_i2^2)/e_i, where z_i1 and z_i2 are the components of the transformed data Z = (I_2 ⊗ U^T) Y. The log-likelihood is then l(V) = −n log(2π) − (1/2) Σ_i (log e_i + q_i), and for each variance parameter θ in V the chain rule gives

∂l/∂θ = −(1/2) Σ_{i=1}^n ( (1/e_i) ∂e_i/∂θ + ∂q_i/∂θ ),

where the inner derivatives follow from ∂a_i1/∂V_G1 = ∂a_i2/∂V_G2 = ∂a_i12/∂V_G12 = d_i and ∂a_i1/∂V_ε1 = ∂a_i2/∂V_ε2 = ∂a_i12/∂V_ε12 = 1.

S3.3.2 Ordering pairs, imputing and dropping values
To make the eigendecomposition as reusable as possible, biMM orders the trait pairs (or traits in the univariate case) to maximize the sample overlap between consecutive pairs. Given the first pair, we greedily choose the next pair from among the pairs having the smallest distance to the current pair. Here the distance between two pairs is the number of individuals for which one of the trait pairs has complete data and the other does not. After the second pair is chosen, the same process continues through the subsequent trait pairs until the full ordering is determined. If the first pair is not given, biMM iterates over all possible starting pairs and chooses the one that leads to the smallest number of eigendecompositions given the parameters t_i and t_d.
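The greedy ordering can be sketched as follows. This is an illustrative Python sketch, not biMM's R implementation; the trait-pair labels and complete-data sets in the toy example are hypothetical.

```python
# Each trait pair is represented by the set of individual IDs with
# complete data for both of its traits. The distance between two pairs
# is the size of the symmetric difference of these sets, i.e. the number
# of individuals complete for one pair but not the other.

def distance(ids_a, ids_b):
    return len(ids_a ^ ids_b)  # symmetric difference of ID sets

def greedy_order(pairs, first):
    """pairs: dict mapping pair label -> set of complete-data IDs."""
    order = [first]
    remaining = set(pairs) - {first}
    while remaining:
        current = pairs[order[-1]]
        # sorted() makes tie-breaking deterministic
        nxt = min(sorted(remaining), key=lambda p: distance(current, pairs[p]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy example with hypothetical complete-data sets for four trait pairs.
pairs = {
    "AB": {1, 2, 3, 4},
    "AC": {1, 2, 3},
    "BC": {1, 2, 3, 5},
    "BD": {4, 5, 6},
}
print(greedy_order(pairs, "AB"))  # ['AB', 'AC', 'BC', 'BD']
```

In this toy example the ordering keeps consecutive pairs as similar as possible, so few individuals need to be imputed or dropped between consecutive analyses.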
The input values t_i and t_d determine, respectively, for how many individuals biMM is allowed to impute the missing trait values and for how many individuals biMM is allowed to set the trait values missing, i.e., to drop the individuals from the analysis. A new eigendecomposition will be made if and only if the analysis of the next trait / trait pair would require imputing more than t_i individuals or dropping more than t_d individuals.
For the trait imputation, the user specifies k_imp, the number of traits that are used for imputing the missing values. When imputing trait T, for each individual with a missing value, biMM chooses the k_imp traits that are most correlated (in absolute value) with trait T in the whole cohort and are observed for the individual considered. Trait T is imputed as the expected value of the conditional distribution of trait T given the observed values of the k_imp other traits, under the assumption that their (k_imp + 1)-dimensional joint distribution is multivariate Gaussian. The default value k_imp = 0 corresponds to mean value imputation.
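For the special case k_imp = 1 the conditional Gaussian expectation has a simple closed form, which the sketch below illustrates (this is an illustrative sketch, not biMM's R code, and all numbers are hypothetical):

```python
# Conditional-Gaussian imputation with a single predictor trait X:
#   E[T | X = x] = mu_T + rho * (sd_T / sd_X) * (x - mu_X),
# where rho is the cohort-wide correlation between T and X.
# With rho = 0 this reduces to mean imputation (the k_imp = 0 default).
def impute_k1(mu_T, sd_T, mu_X, sd_X, rho, x_obs):
    return mu_T + rho * (sd_T / sd_X) * (x_obs - mu_X)

# Trait T missing for one individual; its most correlated observed trait
# X (rho = 0.6) takes the value 2.0 for this individual.
value = impute_k1(mu_T=0.0, sd_T=1.0, mu_X=0.5, sd_X=1.5, rho=0.6, x_obs=2.0)
print(value)  # 0.6 * (1.0/1.5) * (2.0 - 0.5) = 0.6
```

For k_imp > 1 the same idea applies with the vector form of the conditional mean, which requires solving a small k_imp × k_imp linear system per missing value.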

S3.4 Testing biMM
Assume that there are S shared variants that affect both traits and T_1 and T_2 variants that affect only trait 1 and trait 2, respectively. Assume further that the effect sizes for all these variants follow the normal distribution with zero mean and variances v_g1 for trait 1 and v_g2 for trait 2, and that for each shared variant j, cor(β_j1, β_j2) = ρ_S. Then the covariance between the genetic components of individual i is

cov(G_i1, G_i2) = S ρ_S sqrt(v_g1 v_g2),

and the genetic variances are V_G1 = (S + T_1) v_g1 and V_G2 = (S + T_2) v_g2. It follows that the genetic correlation is

ρ_G = S ρ_S / sqrt((S + T_1)(S + T_2)).

In particular, if T_1 = T_2 = 0 (all variants shared), then ρ_G = ρ_S, but if max{T_1, T_2} ≫ S (most variants trait-specific), then ρ_G ≈ 0, independent of ρ_S. In general, the genetic correlation ρ_G is an average property of all variants affecting the traits that results from shrinking the shared correlation ρ_S by (the geometric mean of) the proportions of shared genetic effects. We hope that via the linear mixed model we obtain unbiased estimates of the genetic variances V_G1 and V_G2 as well as the correlation parameters ρ_G and ρ_ε. However, this is not self-evident, because the model assumes that every one of the K variants in the genome comes from the non-zero bivariate effect size distribution, which is clearly violated whenever only a minority of the variants truly have a non-zero effect. Thus, to validate our interpretation of the variance parameters in realistic scenarios, and to verify our implementation of biMM, we carried out simulation studies. Recently, some theoretical conditions for unbiasedness of univariate variance parameters have been listed by Yang et al. [14].

Table S3: Parameters for data simulation. V_Gt is the total genetic variance for trait t = 1, 2, S is the number of shared variants, T_t is the number of variants specific to trait t = 1, 2, ρ is the total correlation between the traits, ρ_S is the correlation of the effect sizes of the shared variants, ρ_G is the genetic correlation between the traits, ρ_ε is the environmental correlation between the traits.
Simulations: We chose 5,000 individuals from the NFBC1966 data and used the R matrix of K = 319,445 genotyped SNPs with minor-allele frequency (MAF) above 0.01. We further extracted from the total set of SNPs a subset of 20,000 approximately equally spaced SNPs, from which we picked those that truly affected the phenotypes. We generated 1,000 data sets under each of the six scenarios by varying the values of the parameters in columns 2 to 8 of Table S3, which uniquely determine the ρ_G and ρ_ε given in the last two columns of Table S3. We always kept the total variance of both traits at 1 and sampled the effect sizes for the variants from the Gaussian distribution with variance V_Gt/(S + T_t) for trait t, and with correlation ρ_S for the shared variants.
Results: The distributions of the maximum likelihood estimates over the 1,000 data sets for the four parameters V_G1, V_G2, ρ_G and ρ_ε are given in Figure S5. Figure S5 shows that biMM gives (nearly) unbiased estimates of the parameters in all six scenarios, which vary in heritability, in the sign and magnitude of the correlations, and in the proportion of variants shared.
These results give confidence that we can interpret the estimates from biMM according to the heritability and correlation parameters of the polygenic model, even when the polygenic assumption is violated by a large majority of the variants having no effect on the traits, at least as long as the variants with effects are a random sample from all variants included in the model. The second derivatives allow analytic estimation of the standard error for each parameter. If the sampling distribution of a parameter estimate were Gaussian, then the standard error would estimate the standard deviation of that sampling distribution. Table S4 shows the empirical standard deviation of the parameter estimates over the simulations together with the median of the standard errors over the same simulations. In general, the results show that the estimated standard error gives a good estimate of the magnitude of the empirical standard deviation. The standard errors typically slightly underestimate the standard deviations for the correlation parameters and for smaller heritability values (V_G1), whereas for larger heritability values (V_G2) the standard errors slightly overestimate the standard deviations. We note that the sampling distributions of these parameters might not follow a Gaussian near the boundary values of the parameter space (0 for variances, ±1 for correlations), where the interpretation of the standard error is not as straightforward.

Table S4: Uncertainty of parameters for data simulation. V_Gt is the total genetic variance for trait t = 1, 2, ρ_G is the genetic correlation between the traits, ρ_ε is the environmental correlation between the traits. For each scenario, SD is the empirical standard deviation of the point estimates and SE is the median of the analytically calculated standard errors.