Mega-scale Bayesian regression methods for genome-wide prediction and association studies with thousands of traits

Abstract Large-scale phenotype data are expected to increase the accuracy of genome-wide prediction and the power of genome-wide association analyses. However, genomic analyses of high-dimensional, highly correlated traits are challenging. We developed a method for implementing high-dimensional Bayesian multivariate regression to simultaneously analyze genetic variants underlying thousands of traits. As a demonstration, we implemented the BayesC prior in the R package MegaLMM. Applied to Genomic Prediction, MegaBayesC effectively integrated hyperspectral reflectance data from 620 hyperspectral wavelengths to improve the accuracy of genetic value prediction on grain yield in a wheat dataset. Applied to Genome-Wide Association Studies, we used simulations to show that MegaBayesC can accurately estimate the effect sizes of QTL across a range of genetic architectures and causes of correlations among traits. To apply MegaBayesC to a realistic scenario involving whole-genome marker data, we developed a 2-stage procedure involving a preliminary step of candidate marker selection prior to multivariate regression. We then used MegaBayesC to identify genetic associations with flowering time in Arabidopsis thaliana, leveraging expression data from 20,843 genes. MegaBayesC selected 15 single nucleotide polymorphisms as important for flowering time, with 13 located within 100 kb of known flowering-time related genes, a higher validation rate than achieved by a single-stage analysis using only the flowering time data itself. These results demonstrate that MegaBayesC can efficiently and effectively leverage high-dimensional phenotypes in genetic analyses.

For simplicity, let $(Y_{cor}^{T})_i = y_{cor_i}$, $\Lambda^{T} = \Lambda_{T}$, $(F^{T})_i = f_i$, $\mu_{(F^{T})_i} = \mu_{f_i}$ and $D_{(Y^{T})_i} = D_Y$. The full conditional posterior distribution for $(F^{T})_i$ is derived as:
Therefore, $(F^{T})_i \mid \text{ELSE} \sim N(\mu, \Sigma)$ with

Sample parameters in Λ
A. Full conditional posterior distribution of $\Lambda$
The prior for $\lambda_j$ is specified as follows:
This mixture prior for $\lambda_j$ can be parameterized as:
Conditional on $F$, Eq. 1 can be simplified into $t$ independent univariate linear mixed models for the columns of $Y$. For the $j$th column of $Y$: where $e_{R_j} \sim N(0, \sigma^2_{R_j} I)$.
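The indicator-times-effect parameterization of the mixture prior can be illustrated with a short Python sketch. It assumes that each loading is written as the product of a normal effect and a 0/1 indicator, that `pi_exclude` is the prior probability that the indicator is 0, and that `sigma2_beta` is the slab variance. The function and argument names are hypothetical and chosen only for illustration.

```python
import random

def draw_spike_slab_prior(n_markers, pi_exclude, sigma2_beta, rng):
    """Draw effects from a BayesC-style mixture prior (illustrative sketch).

    Assumption: effect = gamma * beta, with gamma ~ Bernoulli(1 - pi_exclude)
    and beta ~ N(0, sigma2_beta).
    """
    effects = []
    for _ in range(n_markers):
        gamma = 1 if rng.random() >= pi_exclude else 0  # inclusion indicator
        beta = rng.gauss(0.0, sigma2_beta ** 0.5)       # draw from the normal slab
        effects.append(gamma * beta)                    # zero whenever gamma == 0
    return effects

rng = random.Random(1)
lam = draw_spike_slab_prior(10000, pi_exclude=0.9, sigma2_beta=0.5, rng=rng)
frac_zero = sum(1 for v in lam if v == 0.0) / len(lam)
print(f"fraction of effects set to zero: {frac_zero:.3f}")
```

With a large number of draws, the fraction of exact zeros should be close to `pi_exclude`, which is what makes the indicator variables convenient targets for a Gibbs update.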
Besides the full conditional posterior distribution for the multivariate $\beta_{\lambda_j}$ derived above, a univariate version for the elements of $\beta_{\lambda_j}$ is derived as follows, in preparation for the derivation for $\gamma_{\lambda_{kj}}$.


B. Full conditional posterior distribution of $\gamma_{\lambda_{kj}}$
From the model specification, each $\gamma$ variable takes the value 0 or 1. Let $\theta$ denote all parameters other than $\beta_{\lambda_{kj}}$ and $\gamma_{\lambda_{kj}}$. The marginal full conditional distribution of $\gamma_{\lambda_{kj}}$, which integrates out $\beta_{\lambda_{kj}}$, is:

Given Eq. S16, we have

C. Full conditional posterior distribution of $\delta_l$
To sample $\tau_k$, we first need to sample $\delta_l$ when $K > 1$. To derive the full conditional posterior distribution of $\delta_l$, vectorize $\Lambda$ as $\lambda$. Then, we have
Note that the determinant of a diagonal matrix is the product of its diagonal elements.
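The dependence of $\tau_k$ on the $\delta_l$ can be sketched in Python. This assumes the multiplicative gamma process form $\tau_k = \prod_{l=1}^{k} \delta_l$, which is the usual construction for this kind of shrinkage prior but is an assumption here rather than something stated in the text above. The function name is hypothetical.

```python
from itertools import accumulate
import operator

def cumulative_precisions(deltas):
    """Return tau_k = delta_1 * ... * delta_k for k = 1..K (assumed form)."""
    return list(accumulate(deltas, operator.mul))

# Three illustrative delta values; each tau_k is the running product.
taus = cumulative_precisions([2.0, 1.5, 3.0])
print(taus)
```

Under this assumption, changing a single $\delta_l$ rescales every $\tau_k$ with $k \ge l$, which is why the $\delta_l$ are sampled first and the $\tau_k$ are then recomputed.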

Parallel Model Setting
Given $F$ and $\Lambda$, although the design matrices may differ between the columns of $Y$ and the columns of $F$, both sets of conditional models can be expressed in the same form:
Conditional on $F$ and $\Lambda$, Eq. 1 can be simplified into $t$ independent univariate linear mixed models for the columns of $Y_{cor} = Y - F\Lambda$:
The columns of $F$ can similarly be expressed as $K$ independent univariate linear mixed models:
Here, factor-specific and trait-specific priors on the marker exclusion probability ($\pi_{F_k}$ and $\pi_j$) and on the variance of marker effects ($\sigma^2_{B_{2F_k}}$ and $\sigma^2_{B_{2R_j}}$) are used for each latent factor and each observed trait. The columns of $Y$ and $F$ can therefore both be expressed by Eq. S17. Furthermore, we define the following term based on the notation in Eq. S17:

D. Full conditional posterior distribution of $\alpha$
The conditional posterior distribution for $\alpha$ (i.e., $b_{1j}$), integrating out $\beta$, is derived as: where the dimension of $V_\beta$ is $n \times n$.

E. Full conditional posterior distribution of $\sigma^2$
The conditional posterior distribution for $\sigma^2$ (i.e., $\sigma^2_{R_j}$ and $\sigma^2_{F_k}$) is derived as:
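A generic conjugate variance update of this kind can be sketched in Python as follows. The sketch assumes a scaled inverse chi-square prior with `nu` degrees of freedom and scale `s2`, and independent normal residuals; these choices are assumptions for illustration and may differ from the prior used in this model. All names are hypothetical.

```python
import random

def sample_variance(residuals, nu, s2, rng):
    """Draw sigma^2 from a scaled inverse chi-square full conditional.

    Assumption: prior sigma^2 ~ nu * s2 / chi2(nu), residuals ~ N(0, sigma^2).
    """
    n = len(residuals)
    sum_sq = sum(e * e for e in residuals)
    # A chi-square variate with (n + nu) degrees of freedom is
    # Gamma(shape=(n + nu) / 2, scale=2).
    chi2 = rng.gammavariate((n + nu) / 2.0, 2.0)
    return (sum_sq + nu * s2) / chi2

rng = random.Random(42)
residuals = [rng.gauss(0.0, 1.0) for _ in range(200)]
draw = sample_variance(residuals, nu=4.0, s2=1.0, rng=rng)
print(f"one draw of sigma^2: {draw:.3f}")
```

The same function could be applied to each trait or factor in turn by passing the corresponding residual vector.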

F. Full conditional posterior distribution of $\beta$
The conditional posterior distribution for $\beta$ is derived as:
Besides the full conditional posterior distribution of the multivariate $\beta$ derived above, a univariate version for each element $\beta_l$ of $\beta$ is written as follows.

G. Full conditional posterior distribution of $\gamma_l$
Let $\theta$ denote all parameters other than $\beta_l$ and $\gamma_l$. The marginal full conditional distribution of $\gamma_l$, which integrates out $\beta_l$, is: Since $f(y \mid \theta, \gamma_l) = \int f(y, \beta_l \mid \theta, \gamma_l)\, d\beta_l$, the derivation of $f(y \mid \theta, \gamma_l)$ is as follows.
Given Eq. S37, we have
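The single-marker update discussed in sections F and G can be sketched in Python under explicit assumptions: the slab prior is $\beta_l \sim N(0, \sigma^2_\beta)$, the residual variance is $\sigma^2_e$, `pi_exclude` is the prior probability that $\gamma_l = 0$, and `r` is the response with every term except marker $l$ subtracted. With these assumptions, integrating out $\beta_l$ gives a closed-form Bayes factor. The function names are hypothetical, and the sketch is not a reproduction of the MegaLMM implementation.

```python
import math
import random

def _posterior_terms(x, r, sigma2_e, sigma2_beta):
    """Posterior precision c and linear term b for one marker effect."""
    c = sum(v * v for v in x) / sigma2_e + 1.0 / sigma2_beta
    b = sum(xi * ri for xi, ri in zip(x, r)) / sigma2_e
    return c, b

def inclusion_probability(x, r, sigma2_e, sigma2_beta, pi_exclude):
    """P(gamma_l = 1 | r, theta) with beta_l integrated out (assumed normal slab)."""
    c, b = _posterior_terms(x, r, sigma2_e, sigma2_beta)
    # log of f(r | gamma_l = 1) / f(r | gamma_l = 0)
    log_bf = -0.5 * math.log(sigma2_beta * c) + 0.5 * b * b / c
    log_odds = math.log(1.0 - pi_exclude) - math.log(pi_exclude) + log_bf
    if log_odds >= 0:  # numerically stable logistic transform
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

def sample_marker(x, r, sigma2_e, sigma2_beta, pi_exclude, rng):
    """Draw (gamma_l, beta_l): indicator first, then the effect if included."""
    p = inclusion_probability(x, r, sigma2_e, sigma2_beta, pi_exclude)
    if rng.random() >= p:
        return 0, 0.0
    c, b = _posterior_terms(x, r, sigma2_e, sigma2_beta)
    return 1, rng.gauss(b / c, math.sqrt(1.0 / c))

x = [1.0, -0.5, 0.3, 0.8]
r = [0.9, -0.2, 0.1, 0.5]
p_incl = inclusion_probability(x, r, sigma2_e=1.0, sigma2_beta=0.5, pi_exclude=0.9)
gamma_l, beta_l = sample_marker(x, r, 1.0, 0.5, 0.9, random.Random(0))
print(f"inclusion probability: {p_incl:.4f}; draw: gamma={gamma_l}, beta={beta_l:.4f}")
```

Sampling the indicator from its marginal distribution before drawing the effect avoids conditioning the indicator on the current effect value.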