Estimation of dynamic SNP-heritability with Bayesian Gaussian process models

Abstract

Motivation: Improved DNA technology has made it practical to estimate single-nucleotide polymorphism (SNP) heritability among distantly related individuals with unknown relationships. For growth- and development-related traits, it is meaningful to base SNP-heritability estimation on longitudinal data due to the time dependency of the process. However, few statistical methods have been developed so far for estimating dynamic SNP-heritability and quantifying its full uncertainty.

Results: We introduce a completely tuning-free Bayesian Gaussian process (GP)-based approach for estimating dynamic variance components and heritability as their function. For parameter estimation, we use a modern Markov chain Monte Carlo method which allows full uncertainty quantification. Several datasets are analysed and our results clearly illustrate that the 95% credible intervals of the proposed joint estimation method (which 'borrows strength' from adjacent time points) are significantly narrower than those of a two-stage baseline method that first estimates the variance components at each time point independently and then performs smoothing. We compare the method with a random regression model implemented in the MTG2 and BLUPF90 software packages, and quantitative measures indicate superior performance of our method. Results are presented for simulated and real data with up to 1000 time points. Finally, we demonstrate the scalability of the proposed method on simulated data with tens of thousands of individuals.

Availability and implementation: The C++ implementation dynBGP and the simulated data are available on GitHub: https://github.com/aarjas/dynBGP. The programmes can be run in R. Real datasets are available in the QTL archive: https://phenome.jax.org/centers/QTLA.

Supplementary information: Supplementary data are available at Bioinformatics online.

The covariance function of a discrete-time Gaussian process is represented through a covariance matrix C that has a Cholesky decomposition C = LLᵀ. In the joint method, a square root of the covariance matrix (for example the Cholesky factor L) is needed to dewhiten the proposal vector w. Generally this is done as x = Lw, where x is the dewhitened proposal vector. Obtaining the Cholesky factor of a matrix is time consuming in general, so we use a sparse approximation for its inverse. Specifically, the approximate covariance matrix can be expressed as C ≈ (HᵀH)⁻¹, where H is a sparse matrix that is defined in detail below. To dewhiten w, we solve x from the system of equations

Hx = w,   (S1)

which yields x = H⁻¹w ≈ Lw. Since H is tridiagonal, the system can be solved efficiently using the linear algebra library Eigen (Guennebaud et al., 2010).

The matrix H is obtained by applying the finite-element method to the stochastic partial differential equation representing the Matérn covariance function. The details can be found in Roininen et al. (2014); here we only give the results. The Matérn correlation function (with the magnitude parameter of the covariance function set to 1) is given as

C(t, t′) = (2^(1−ν)/Γ(ν)) (√(2ν)|t − t′|/λ)^ν K_ν(√(2ν)|t − t′|/λ),

where ν is the smoothness parameter, λ is the length scale and K_ν is the modified Bessel function of the second kind. When ν is set to 1.5, H is a tridiagonal matrix of the same dimension as C, with diagonal entries κ² + 2 and off-diagonal entries −1, where κ = √(2ν)/λ = √3/λ, up to a normalising constant given in Roininen et al. (2014). Here we assume that the time points are equidistant with distance 1.

We note that this approximation is especially poor at the boundaries of the domain (near the first and the last measured time points). A possible remedy is the domain extension also discussed by Roininen et al. (2014): instead of using just the T time points, we add T/4 time points to the beginning and the end of the domain, yielding 1.5T time points in total. Hence, when dewhitening the proposal vector with Eq. (S1), the length of the vector is 1.5T, but in our calculations (e.g. evaluation of the likelihood) we only use the middle part. This way we avoid producing artefacts at the boundaries. The domain extension increases the computation times slightly, but in our tests this increase was minimal compared to the cost of the exact method without the sparse approximation, and the results with the approximation and the exact method were practically indistinguishable.
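The dewhitening step and the domain extension can be sketched as follows in Python (a minimal illustration, not the Eigen-based dynBGP implementation; the normalising constant and boundary corrections of H from Roininen et al. (2014) are omitted, and a dense solve is used for brevity):

```python
import numpy as np

def matern15_precision_factor(T, lam):
    """Unnormalised tridiagonal factor H with C ~ (H^T H)^(-1) for a
    Matern(nu = 1.5) process on an equidistant grid with spacing 1."""
    kappa2 = 3.0 / lam**2          # kappa^2 = 2 * nu / lambda^2 with nu = 1.5
    H = np.diag(np.full(T, kappa2 + 2.0))
    idx = np.arange(T - 1)
    H[idx, idx + 1] = -1.0         # off-diagonal entries
    H[idx + 1, idx] = -1.0
    return H

def dewhiten(w, lam):
    """Dewhiten a white-noise vector w on the extended domain of length
    1.5 * T by solving H x = w, then keep only the middle T entries
    (T / 4 padding points at each end)."""
    T_ext = len(w)
    T = round(T_ext / 1.5)
    H = matern15_precision_factor(T_ext, lam)
    x = np.linalg.solve(H, w)      # x = H^(-1) w ~ L w
    pad = (T_ext - T) // 2
    return x[pad:pad + T]

# Example: dewhiten a standard normal vector for T = 40 observed points
rng = np.random.default_rng(0)
w = rng.standard_normal(60)        # extended length 1.5 * T with T = 40
x = dewhiten(w, lam=2.0)
```

In practice the tridiagonal sparsity of H is exploited (e.g. with Eigen's sparse solvers), which makes the cost of the solve linear in T.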

3 Traceplots for the joint model

Figure S4: Comparison of the results from the joint estimation method and ACEt on twin data. Posterior means for the joint method and maximum likelihood estimates for ACEt are drawn with solid lines; 95% credible intervals for the joint method and 95% confidence intervals for ACEt are drawn with dashed lines.

The accuracies of the joint method and the RRM as implemented in MTG2 were compared by simulating data and computing mean squared errors (MSE) between the estimated variances and the ground truths. We simulated ten datasets with 1000 individuals and 50 time points with the variances σ²_G(t) = cos t + 2 and σ²_E(t) = sin t + 2, where the time points were equidistant in the interval [0, 2π] hours. The relationship matrix was created by first simulating a 1000 × 1000 matrix S in which each element is standard normally distributed. The relationship matrix was then computed as G = SSᵀ/1000 + 0.1I; the small number added to the diagonal makes the matrix positive definite. To simulate realistic data, the longitudinal dependencies also need to be taken into account. This was done by generating a 50 × 50 Gaussian process covariance matrix C in which the element in row i and column j, [C]_{ij}, is the covariance between time points t_i and t_j

. The genetic and environmental components of the data were simulated from the distributions N(0, C ⊗ G) and N(0, C ⊗ I), respectively. Finally, the components were scaled with the corresponding variance at each time point and summed. The datasets were analysed with the joint method with 300 000 MCMC iterations and with the RRM using Legendre polynomials of degree 5 to model the genetic and residual covariances. The RRM at time t was defined as

y_t = α_t + aΦ_tᵀ + bΦ_tᵀ + ε_t,   (S3)

where α_t is the overall mean at time t, a ∈ R^{N×6} contains the genetic random regression coefficients, b ∈ R^{N×6} contains the residual random regression coefficients, Φ_t is the tth row of a T × 6 matrix that contains the Legendre polynomials up to degree 5 evaluated at the T measurement points, and ε_t are the residuals at time t, which are assumed to be independently normally distributed but with different variances across time. Here bΦ_tᵀ and ε_t can be interpreted as the dependent and independent residuals, respectively. To fit the model we used the software MTG2 (Lee and van der Werf, 2016). The software only supports frequentist statistical inference; in particular, it uses the average information restricted maximum likelihood (AI-REML) algorithm to maximise the likelihood function.
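The T × 6 Legendre design matrix used in the RRM can be constructed as in the following Python sketch (the standardisation of the measurement times to [−1, 1] is a common convention and an assumption here; software packages may scale the columns differently):

```python
import numpy as np

def legendre_design(T, degree=5):
    """T x (degree + 1) matrix whose row t contains the Legendre
    polynomials of degree 0..degree evaluated at the tth of T
    equidistant time points standardised to [-1, 1]."""
    t_std = np.linspace(-1.0, 1.0, T)
    return np.polynomial.legendre.legvander(t_std, degree)

Phi = legendre_design(50)   # T = 50 measurement points, degree 5
# The genetic part of individual i at time t is then a[i] @ Phi[t]
```

Row t of Phi plays the role of Φ_t above; multiplying it with an individual's coefficient vector yields that individual's smooth genetic (or residual) trajectory value at time t.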
The MSE for the joint method was computed from the posterior mean. The MSEs for the estimated environmental variance (permanent environmental variance + residual variance), genetic variance and heritability, for both the joint method and MTG2, can be found in Table S1. Additionally, we present estimates averaged over all of the datasets in Fig. S6.
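The simulation design above can be reproduced with the following Python sketch. The time-covariance kernel is a squared-exponential stand-in (the exact kernel used in the study is not restated here), and a smaller N keeps the example fast:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 200, 50                     # the study uses N = 1000; smaller here for speed
t = np.linspace(0.0, 2.0 * np.pi, T)

# Ground-truth variance functions of the simulation design
sigma2_G = np.cos(t) + 2.0
sigma2_E = np.sin(t) + 2.0

# Relationship matrix: G = S S^T / N + 0.1 I (jitter ensures positive definiteness)
S = rng.standard_normal((N, N))
G = S @ S.T / N + 0.1 * np.eye(N)

# Longitudinal GP covariance over time (squared-exponential stand-in kernel)
C = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2) + 1e-8 * np.eye(T)

# X = L_C Z L_G^T with Z iid standard normal realises the
# Kronecker-structured component, with rows indexing time
L_C = np.linalg.cholesky(C)
L_G = np.linalg.cholesky(G)
genetic = L_C @ rng.standard_normal((T, N)) @ L_G.T
environ = L_C @ rng.standard_normal((T, N))   # component with covariance C (x) I

# Scale each time point by its standard deviation and sum the parts
y = np.sqrt(sigma2_G)[:, None] * genetic + np.sqrt(sigma2_E)[:, None] * environ

# An MSE against the ground truth would then be, e.g.,
# np.mean((sigma2_G_hat - sigma2_G) ** 2) for a posterior-mean estimate
```

Each column of y is one individual's simulated trajectory over the 50 time points; the marginal variance at time point t is σ²_G(t) + σ²_E(t) by construction.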
Additionally, we fitted a Bayesian version of model (S3) using the BLUPF90 family of programs (Misztal et al., 2002), in particular GIBBS2F90. This allows us to compare the uncertainties of the variance components between the RRM and the joint model. The program uses an inverse Wishart distribution as a prior for the covariance matrices of the random regression coefficients. To assign the priors, we first ran the frequentist inference using MTG2 and used the results from that analysis as prior scale matrices for the inverse Wishart distribution. The degree of belief was set to 10 for both the genetic and residual covariance matrices, and to 1 for the independent residual. The results from this analysis are presented in Fig. S7. Due to the long computation time of GIBBS2F90 (over 24 hours), only one dataset was analysed and an MSE comparison was infeasible to carry out.
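Under the common parameterisation in which an inverse Wishart IW(ν, V) prior on a k × k covariance matrix has mean V/(ν − k − 1), the chosen hyperparameters imply the following prior centring (a sketch with a hypothetical diagonal stand-in for the MTG2 REML estimate; BLUPF90's internal convention for the degree of belief may differ):

```python
import numpy as np

# Hypothetical REML estimate of the 6 x 6 genetic covariance matrix of
# the random regression coefficients (stand-in for the MTG2 output)
V_reml = np.diag([2.0, 1.0, 0.5, 0.25, 0.1, 0.05])

df = 10                  # degree of belief used for the covariance matrices
k = V_reml.shape[0]      # 6 coefficients for Legendre polynomials of degree 5

# With the REML estimate used as the scale matrix, the implied prior
# mean is V_reml / (df - k - 1) = V_reml / 3 under this convention
prior_mean = V_reml / (df - k - 1)
```

This makes explicit that a degree of belief of 10 with k = 6 yields a proper but weakly informative prior centred on a rescaling of the REML estimate.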