iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data

Motivation: Analyzing data from multi-platform genomics experiments combined with patients’ clinical outcomes helps us understand the complex biological processes that characterize a disease, as well as how these processes relate to the development of the disease. Current data integration approaches are limited in that they do not consider the fundamental biological relationships that exist among the data obtained from different platforms. Statistical Model: We propose an integrative Bayesian analysis of genomics data (iBAG) framework for identifying important genes/biomarkers that are associated with clinical outcome. This framework uses hierarchical modeling to combine the data obtained from multiple platforms into one model. Results: We assess the performance of our methods using several synthetic and real examples. Simulations show our integrative methods to have higher power to detect disease-related genes than non-integrative methods. Using the Cancer Genome Atlas glioblastoma dataset, we apply the iBAG model to integrate gene expression and methylation data to study their associations with patient survival. Our proposed method discovers multiple methylation-regulated genes that are related to patient survival, most of which have important biological functions in other diseases but have not been previously studied in glioblastoma. Availability: http://odin.mdacc.tmc.edu/∼vbaladan/. Contact: veera@mdanderson.org Supplementary information: Supplementary data are available at Bioinformatics online.


S1: Full conditional posterior distributions of the iBAG model for continuous responses
We first present the directed acyclic graph (DAG) representation of the iBAG model in Figure   S1. The observed data are represented by square nodes (e.g., the methylation levels, M, the gene expression levels, g k s, and the patient clinical outcome Y ), and the parameters are represented by circular nodes (e.g., the methylation effects, Ω, and the gene expression effects, γ M and γM .). The nodes within the grey rectangle have K replicates. The arrows in the figure show the dependency structures for different nodes.
We provide the detailed, full confidential likelihood of the iBAG model that is introduced in Section 2.3 of the main article. All of the notations used here follow those introduced in the main article.
Based on the results obtained from Andrews and Mallows (1974), the double exponential distribution can be written as a mixture of a normal distribution with an exponential density. Specifically, the prior for the type M effect for the kth gene ([γ M k |λ M , σ]) can be rewritten as: for k = 1, . . . , K.
The full conditional distributions for the parameters in the mechanistic model can be expressed as follows.
• [ω j0,k0 |others ], the methylation level of the j 0 th methylation effect on the mRNA level of the k 0 th gene: for k 0 = 1, . . . , K, and m j0 within the promoter of the k 0 th gene, where • [σ 2 k0 |others], the random error modeling the part of the expression of the k 0 th gene that is not explained by methylation: The full conditional distributions for the parameters in the clinical model can be expressed as follows.
• [γ C | others], the effects of clinical factors (e.g., tumor stage) on clinical outcome: • [γ M | others], the effects of gene expression modulated by methylation: • [γM | others], the effects of gene expression modulated by other sources of genomic changes, follows • [σ 2 k0 | others], the standard deviation for the error term that accounts for variation not accounted for by the observed genomic and clinical factors: • [(λM ) 2 | others ]: The full conditional distributions for the shrinkage parameters can be expressed as follows.
• [(λ M ) 2 | others], the common shrinkage parameter for the gene expression effects modulated by methylation: • [(τM k ) −2 = ηM k | others], the common shrinkage parameter for the gene expression effects modulated by methylation: For a right-censored response variable, the relationship between Z and the response variable Y can be expressed as log(t n ) = Z n if δ n = 1 log(t n ) > Z n if δ n = 0 for n = 1, . . . , N.
Conditionally on Z, iBAG models for both discrete and right-censored responses are the same as that for continuous responses.
The drawing scheme for parameters of the iBAG model with binary or censored outcomes has the following steps: 1) Update (γ M , γM , Ω) using Gibbs sampling.
3) Update Z according to equation (S2) if patient clinical outcomes are binary or equation (S3) if patient clinical outcomes are censored.

S3.1: ROC analysis
In Section 4 of the main paper, we generate 12 different scenarios based on different combinations of total number of gene expression features, and the correlations between methylation and gene expression features. We fit four models, iBAG unified , iBAG 2-stage , single gene (SG), and non-integrative (nonINT) model, for all the simulated datasets. In this section, we show the receiver operating characteristic (ROC) curves in identifying the true effects of gene expressions and calculate the areas under these ROC curves (AUCs) for all 12 simulated scenarios in Figures S2.1 to S2.3. Figure   S2.1 summarizes the true positive rate (TPR) versus the false positive rate (FPR) for discovering genes with only type M effects (effects modulated only by methylation). Figure S2.2 summarizes the TPR versus the FPR for discovering genes with only typeM effects (effects modulated only by other mechanisms). Figure S2.3 summarizes the TPR versus the FPR for discovering genes with In Figure S2.1, since only the first 200 genes in the gene expression dataset are modulated by methylation, and genes 181 to 200 have effects modulated by both methylation and other mechanisms, we include only the first 180 genes when we calculate the TPR and the FPR. As expected, the nonINT model performs the worst of all three models, followed by the iBAG 2-stage model and the SG model. The proposed iBAG unified model performs the best of all three models.
All of the genes are included when we calculated the TPR and the FPR shown in Figure S2.2, except genes 1 to 20, which have effects modulated by only methylation, and genes 181 to 200, which have effects modulated by both methylation and other mechanisms. This figure shows that the SG model performs the worst in discovering the genes with typeM effects. The nonINT model performs slightly better than the iBAG 2-stage model in discovering this group of genes. The proposed iBAG unified model performs the best of all three models.
All of the genes are included when we calculate the TPR and the FPR shown in Figure S2.3, except genes 1 to 20, which have effects modulated only by methylation, and genes 201 to 220, which have effects modulated only by other mechanisms. This figure shows that the SG model performs the worst in discovering the genes with effects modulated by both methylation and other mechanisms.
The iBAG 2-stage model performs as well as the nonINT model in discovering the genes in this group of genes. The proposed iBAG unified model performs the best of all three models.

S3.2: Model performance and consistency
For the situation that best mimics our real data scenario (K = 1000 and ρ = −0.6), we ranked the performance of four different models in identifying different sets of genes based on their AUC values. Table S1 shows that our proposed iBAG unified model performs the best of all three models in discovering all three group of genes. The iBAG 2-stage model ranks second except for in one case.
The SG model and the nonINT model have the worst performances.
In addition, we evaluate the consistency of the iBAG unified model, both in terms of model selection and estimation. We generate 100 datasets for the simulation scenario in which the total number of genes (K) is 1000 and the methylation-gene expression correlation is equal to −0.6. To evaluate the consistency of the iBAG unified model in variable (gene) selection, we summarize the posterior probabilities for the genes with nonzero and zero effects using barplots with standard errors across the 100 datasets (see panels A and B in Figure S3). In addition, we calculate the AUCs for identifying genes with nonzero effects for each dataset. The barplots of the AUCs from 100 datasets are shown in Figure S3, panel C. Panels A and B show that for the 100 simulated datasets, the genes with nonzero gene expression effects consistently have higher posterior probabilities than the genes with zero gene expression effects modulated both by methylation (γ M ) and other genomic changes (γM ).
This leads to higher AUCs in selecting genes that are important (with nonzero effects) to clinical outcomes. As shown in panel C, the AUCs based on the iBAG unified model are about 0.8, with very small standard errors in identifying both gene effects modulated by methylation and gene effects modulated by other genomic changes. Hence, from Figure S3, we conclude that the iBAG unified model demonstrates good (empirical) consistency in selecting relevant variable (genes).
The mean biases for some of the regression and variance parameters over the 100 datasets, along with the standard errors, are summarized in Table S2. We observe some bias in our parameter estimates, with the bias being slightly higher for true (non-zero) effects than for zero effects. Although we observe good (empirical) consistency in our model selection, biases are observed in our parameter estimates are for two reasons. First, our construction of the double exponential prior has a single (global) shrinkage parameter for all the regression coefficients (large and small), which may tend to over-shrink large effects and under-shrink the small effects. This problem can be alleviated by assuming adaptive versions of the lasso prior such as normal-gamma-based priors Brown, 2010, 2011), which allows for additional flexibility, but at the cost of additional computation. Second, the consistency of the estimations of regression coefficients in the lasso model has been proven by Knight and Fu (2000) as the sample size goes to infinity. However, in our case, since the sample size (number of patients, N ) is much smaller than the total number of predictors (total number of genes, K), this may lead to bias in our parameter estimates.

S4.2 Additional analysis results
As explained in the main paper, we fit three models to the processed GBM dataset. The first model is the nonINT model, with only gene expression information as explanatory variables. The second model is the additive (ADD) model, with both gene expression and methylation information as explanatory variables and assuming their effects on patients' survival times are additive. The third model is the iBAG unified model for censored outcomes, which integrates both gene expression and methylation information in a hierarchical manner.
In Table S3, we provide a list of 136 genes that are identified as significantly modulated by at least one methylation feature using the iBAG unified model, based on 95% credible intervals. In Tables S4.1 and S4.2, we provide a summary of the number of significant prognostic genes for the iBAG unified and ADD models, respectively, along with their type of effects. In addition, we use a Venn diagram to compare the gene lists derived by the iBAG unified and ADD models ( Figure S4).
We observe that 59 out of 78 genes with significant gene expression effects obtained by the ADD model overlap with the genes with nonzero typeM effects obtained by the iBAG unified model, and affect survival time in the same direction (both are positive or negative). There are 12 common genes identified from the comparison of genes with significant methylation effects (using the ADD model) to genes with nonzero type M effects (effects modulated by methylation).
Of the 22 genes identified by effects modulated by methylation, 14 are negatively associated with survival, while 8 genes are positively associated with survival. Functional analysis with the database for annotation, visualization and integrated discovery (DAVID, Dennis et al., 2003) for these 22 genes are shown in Table S5.1 for genes negatively associated with survival, and in Table S5.2 for genes positively associated with survival.          (Gelman and Rubin, 1992), which are denoted as Gelman and Rubin plots, for important parameters in the iBAG unified model ( Figure S5). Gelman and Rubin's diagnostic statistics measure the convergence of MCMC samples for multiple chains by comparing the variance within chains to the variance between chains. Values substantially above 1 indicate a lack of convergence. The plots on the left side of Figure S5 are the trace plots of important parameters (those controlling the overall sparsity of the iBAG unified models and those with meaningful biological interpretations, e.g., shrinkage parameters for gene expression effects (λ M and λM ), some of the significant gene expression effects, and some of the significant methylation effects) in the applications of the iBAG unified model. The three plots on the right side of Figure