A powerful Bayesian meta-analysis method to integrate multiple gene set enrichment studies

Motivation: Much research effort has been devoted to the identification of enriched gene sets for microarray experiments. However, identified gene sets are often found to be inconsistent among independent studies. This is probably owing to the noisy data of microarray experiments coupled with small sample sizes of individual studies. Therefore, combining information from multiple studies is likely to improve the detection of truly enriched gene classes. As more and more data become available, it calls for statistical methods to integrate information from multiple studies, also known as meta-analysis, to improve the power of identifying enriched gene sets. Results: We propose a Bayesian model that provides a coherent framework for joint modeling of both gene set information and gene expression data from multiple studies, to improve the detection of enriched gene sets by leveraging information from different sources available. One distinct feature of our method is that it directly models the gene expression data, instead of using summary statistics, when synthesizing studies. Besides, the proposed model is flexible and offers an appropriate treatment of between-study heterogeneities that frequently arise in the meta-analysis of microarray experiments. We show that under our Bayesian model, the full posterior conditionals all have known distributions, which greatly facilitates the MCMC computation. Simulation results show that the proposed method can improve the power of gene set enrichment meta-analysis, as opposed to existing methods developed by Shen and Tseng (2010, Bioinformatics, 26, 1316–1323), and it is not sensitive to mild or moderate deviations from the distributional assumption for gene expression data. We illustrate the proposed method through an application of combining eight lung cancer datasets for gene set enrichment analysis, which demonstrates the usefulness of the method. Availability: http://qbrc.swmed.edu/software/ Contact: Min.Chen@UTSouthwestern.edu Supplementary information: Supplementary data are available at Bioinformatics online.

To simplify the notations, we use Θ /θ to denote the parameter set that includes all the parameters except for θ. The following list gives the full posterior conditionals derived from the full probability model (i = 1, ..., I k , j = 1, ..., J, k = 1, ..., K, d ∈ {0, +, −}); and R code for the corresponding Gibbs sampler is provided at the URL http://qbrc.swmed.edu/software/a-powerful-bayesian-metaanalysis-method-to-integrate-multiple-gene-set-enrichment-studies/ S1 Next, we relax the assumption that a common variance of error σ 2 k is shared by all genes in study k, and outline the changes in the full conditionals. That is, we replace σ 2 k by σ 2 jk s when study k has sufficient samples to produce stable estimates of the gene-wise variances. Then for such studies (say k), 1. When sampling β jk s and α jk s given V jk = 1, replace σ 2 k by σ 2 jk in (S2) and (S3).
2. The step of sampling σ 2 k should be replaced by the step of sampling σ 2 jk s. That is, (S4) should be replaced by the following: for any gene j with V jk = 1, under independent Inverse-Gamma(w, v) priors for all σ 2 jk s.
3. All the other steps remain the same as before.

S2 Additional Tables and Figures from Simulation
Gene Type Study Para. Scenarios of Simulation I 1 2 3 4 5 UR genes Table S1: Simulation I settings of five scenarios. In the first scenario, two studies with the same effect size are considered; in the second, two studies are considered but with different effect sizes. In the third, four studies instead of two are considered and everything else is the same as the first scenario. The last two scenarios consider varying effect sizes across genes, where the fourth considers two studies with the same mean of the effect sizes while the fifth considers two studies with different means of the effect sizes.
(a) Gene expression for cases Gene ID Study 1 Study 2 1-200    Figure S4: Simulation III-ROC curve comparison with the expression intensities of DE genes for cases from t-distributions with df = 4 (with heavier tails than normal distributions). In each subpanel, the blue dash-dot line represents MAPE_P; the green dotted line represents MAPE_G; the red dash line represents MAPE_I; and the black solid line represents our Bayesian method.  Figure S5: Simulation III-ROC curve comparison with the expression intensities of DE genes for cases from gamma distributions (skewed compared to normal distributions). In each subpanel, the blue dash-dot line represents MAPE_P; the green dotted line represents MAPE_G; the red dash line represents MAPE_I; and the black solid line represents our Bayesian method.

S3 Additional Tables and Figures from Data Analysis
In our data example, patients in all data sets were classified into two groups based on their survival time. We used the pamr.surv.to.class2 function in the pamr R package (Tibshirani et al. 2002;Tibshirani et al. 2003) to determine the two groups (i.e., long and short survival groups). The function splits observations into the two groups based on the Kaplan-Meier estimates. For each observation (survival time, censoring status), it computes the probability of that observation falling into one of the two survival groups. The probability is 1 or 0 for an uncensored observation depending on the survival time. For a censored observation, the probability is between 0 and 1 based on the Kaplan Meier estimate.
The following are tables and figures for the data example.
Data Set Name Number of (Controls, Cases) GSE10245 (Kuner et al. 2009) 13, 27 GSE14814 (Zhu et al. 2010) 7, 21 GSE3141  22, 36 GSE3593  12, 31 CL (Shedden et al. 2008) 17, 65 Moff (Shedden et al. 2008) 27, 52 NCI_U133A (Shedden et al. 2008) 18, 86 NCI_Lung_U133A (Shedden et al. 2008) 44, 131    We also carried out the following analysis for the data example, as suggested by one of the reviewers. First, we randomly split each data set in two with the requirement that both groups will have at least three samples. Then we ran our proposed method and the MAPEs at their default settings. After obtaining an ordered list of pathways from each method, we calculated the percent of common pathways identified in the two halves among the top 5%, 10%, 15% and 20% pathways, respectively. We repeated this procedure 10 times and report the average percent values of pathways in common in the table below. The result shows that the reproducibility of the Bayesian method appears to be better or in par with the MAPE ones.  Table S6: Data example-Comparison in percentage of common pathways identified based on random half splits. S10