Joint Analysis of Differential Gene Expression in Multiple Studies using Correlation Motifs

The standard methods for detecting differential gene expression are mostly designed for analyzing a single gene expression experiment. When data from multiple related gene expression studies are available, separately analyzing each study is not an ideal strategy as it may fail to detect important genes with consistent but relatively weak differential signals in multiple studies. Jointly modeling all data allows one to borrow information across studies to improve the analysis. However, a simple concordance model, in which each gene is assumed to be differential in either all studies or none of the studies, is incapable of handling genes with study-specific differential expression. In contrast, a model that naively enumerates and analyzes all possible differential patterns across all studies can deal with study-specificity and allow information pooling, but the complexity of its parameter space grows exponentially as the number of studies increases. Here we propose a "correlation motif" approach to address this dilemma. This approach automatically searches for a small number of latent probability vectors called "correlation motifs" to capture the major correlation patterns among multiple studies. The motifs provide the basis for sharing information among studies and genes. The approach improves detection of differential expression and overcomes the barrier of exponentially growing parameter space. It is capable of handling all possible study-specific differential patterns in a large number of studies. The advantages of this new approach over existing methods are illustrated using both simulated and real data.

We have
\[
\ln \mathrm{Pr}(\pi, Q, A, B \mid T) = \sum_{g=1}^{G} \sum_{k=1}^{K} 1\{b_g = k\} \Big[ \ln \pi_k + \sum_{d=1}^{D} \big( a_{gd} \ln q_{kd} + (1 - a_{gd}) \ln(1 - q_{kd}) + a_{gd} \ln f_{d1}(t_{gd}) + (1 - a_{gd}) \ln f_{d0}(t_{gd}) \big) \Big] + \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{d=1}^{D} \big[ \ln q_{kd} + \ln(1 - q_{kd}) \big] + \text{constant} \tag{A.1}
\]
Therefore,
\[
Q(\pi, Q \mid \pi^{old}, Q^{old}) = E^{old}\big[\ln \mathrm{Pr}(\pi, Q, A, B \mid T)\big] = \sum_{g=1}^{G} \sum_{k=1}^{K} \mathrm{Pr}^{old}(b_g = k) \ln \pi_k + \sum_{g=1}^{G} \sum_{k=1}^{K} \sum_{d=1}^{D} \Big[ \mathrm{Pr}^{old}(b_g = k, a_{gd} = 1) \ln q_{kd} + \big( \mathrm{Pr}^{old}(b_g = k) - \mathrm{Pr}^{old}(b_g = k, a_{gd} = 1) \big) \ln(1 - q_{kd}) \Big] + \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{d=1}^{D} \big[ \ln q_{kd} + \ln(1 - q_{kd}) \big] + \text{constant} \tag{A.2}
\]
In the M-step, one finds the π and Q that maximize the Q-function Q(π, Q | π^{old}, Q^{old}). Denote them as π^{new} and Q^{new}; they will be used in the next iteration.
By solving
\[
\frac{\partial Q(\pi, Q \mid \pi^{old}, Q^{old})}{\partial \pi_k} = 0 \quad \text{subject to } \sum_{k=1}^{K} \pi_k = 1 \tag{A.3}
\]
\[
\frac{\partial Q(\pi, Q \mid \pi^{old}, Q^{old})}{\partial q_{kd}} = 0 \tag{A.4}
\]
we have
\[
\pi_k^{new} = \frac{\sum_{g=1}^{G} \mathrm{Pr}^{old}(b_g = k) + 1}{G + K} \tag{A.5}
\]
\[
q_{kd}^{new} = \frac{\sum_{g=1}^{G} \mathrm{Pr}^{old}(b_g = k, a_{gd} = 1) + 1}{\sum_{g=1}^{G} \mathrm{Pr}^{old}(b_g = k) + 2} \tag{A.6}
\]
In the formulae above, Pr^{old}(b_g = k) and Pr^{old}(b_g = k, a_{gd} = 1) can be computed as
\[
\mathrm{Pr}^{old}(b_g = k) = \frac{\pi_k^{old} \prod_{d=1}^{D} \big[ q_{kd}^{old} f_{d1}(t_{gd}) + (1 - q_{kd}^{old}) f_{d0}(t_{gd}) \big]}{\sum_{l=1}^{K} \pi_l^{old} \prod_{d=1}^{D} \big[ q_{ld}^{old} f_{d1}(t_{gd}) + (1 - q_{ld}^{old}) f_{d0}(t_{gd}) \big]} \tag{A.7}
\]
\[
\mathrm{Pr}^{old}(b_g = k, a_{gd} = 1) = \mathrm{Pr}^{old}(a_{gd} = 1 \mid b_g = k) \cdot \mathrm{Pr}^{old}(b_g = k), \quad \text{where } \mathrm{Pr}^{old}(a_{gd} = 1 \mid b_g = k) = \frac{q_{kd}^{old} f_{d1}(t_{gd})}{q_{kd}^{old} f_{d1}(t_{gd}) + (1 - q_{kd}^{old}) f_{d0}(t_{gd})}
\]
The E-step and M-step iterate until convergence. The algorithm stops when none of the parameters in π and Q changes by more than 0.1% compared to its value in the previous iteration. Using this algorithm, we can obtain estimates for π and Q.
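As a concrete illustration, one E-step/M-step iteration can be sketched in a few lines of numpy. This is a minimal sketch, not the CorMotif implementation; it assumes the density values f_{d1}(t_{gd}) and f_{d0}(t_{gd}) have already been evaluated and are passed in as arrays, and the function name `em_step` is illustrative.

```python
import numpy as np

def em_step(f1, f0, pi, Q):
    """One EM iteration for the correlation motif model (sketch).

    f1, f0 : (G, D) arrays of f_d1(t_gd) and f_d0(t_gd), the per-study
             densities of the moderated t-statistics under the
             differential / non-differential components.
    pi : (K,) motif probabilities; Q : (K, D) motif matrix q_kd.
    """
    G, D = f1.shape
    K = pi.shape[0]
    # Mixture density of t_gd under motif k: q_kd*f_d1 + (1-q_kd)*f_d0,
    # broadcast to shape (G, K, D).
    mix = Q[None, :, :] * f1[:, None, :] + (1 - Q[None, :, :]) * f0[:, None, :]
    # E-step: Pr_old(b_g = k) proportional to pi_k * prod_d mix  (cf. A.7)
    log_post = np.log(pi)[None, :] + np.log(mix).sum(axis=2)   # (G, K)
    log_post -= log_post.max(axis=1, keepdims=True)            # stabilize
    post_b = np.exp(log_post)
    post_b /= post_b.sum(axis=1, keepdims=True)                # Pr_old(b_g=k)
    # Pr_old(a_gd = 1 | b_g = k), then the joint Pr_old(b_g=k, a_gd=1)
    post_a = Q[None, :, :] * f1[:, None, :] / mix              # (G, K, D)
    joint = post_a * post_b[:, :, None]
    # M-step with the +1 / +2 smoothing from (A.5)-(A.6)
    pi_new = (post_b.sum(axis=0) + 1) / (G + K)
    Q_new = (joint.sum(axis=0) + 1) / (post_b.sum(axis=0)[:, None] + 2)
    return pi_new, Q_new
```

Iterating `em_step` until no parameter changes by more than 0.1% reproduces the stopping rule described above.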
A.4. Bayesian information criterion (BIC) and algorithm for choosing K

BIC is computed as
\[
BIC(K) = -2 \ln \mathrm{Pr}(T \mid \hat{\pi}, \hat{Q}) + (K - 1 + KD) \ln G
\]
Here K is the number of motifs in the data, K − 1 is the number of parameters for π, KD is the number of parameters involved in Q, and G is the number of genes.
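Given the maximized log-likelihood of a fitted model, the BIC formula above is a one-liner; the function name `bic` and its argument names are illustrative.

```python
from math import log

def bic(loglik, K, D, G):
    """BIC(K) = -2 * ln Pr(T | pi, Q) + (K - 1 + K*D) * ln G,
    where K-1 parameters come from pi and K*D from Q."""
    return -2.0 * loglik + (K - 1 + K * D) * log(G)
```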
In order to choose K, the BIC for different values of K is calculated, and the K with the smallest BIC is chosen. Intuitively, to implement this, one can start with K = 1. After evaluating the BIC at K = 1, one increases K by one and evaluates the BIC again. This procedure is repeated until one finds a K such that BIC(K) < BIC(K + 1), BIC(K) < BIC(K + 2), and BIC(K) < BIC(K′) for all K′ < K, at which point the algorithm stops and this K is reported as the final motif number.
The algorithm above does not impose any upper limit on K. If the optimal K is big, it may require one to compute BIC for many different Ks which could make the computation slow. In order to make the computation faster, CorMotif actually uses a modified algorithm as follows.
1. Set a start point K_0 ← 1 and a step size s. The initial step size can be relatively big (e.g., s = 10) and is set by the user.

2. Start with K = K_0 and evaluate BIC(K_0). Then increase K by the step size s and evaluate the BIC again. This procedure is repeated until one finds a K such that BIC(K) < BIC(K + s), BIC(K) < BIC(K + 2s), and BIC(K) < BIC(K′) for all K′ < K (note: here K − K′ is a multiple of the step size s). This K is recorded and denoted as K_m.
3. If the step size s = 1, report K_m as the optimal K and exit the algorithm. Otherwise, the optimal K lies between K_m − s and K_m + s, and one can search within this range using a smaller step size. To do so, reset the start point K_0 ← max(K_m − s, 1) and reset the step size s ← ⌊s/2⌋, where ⌊x⌋ denotes the largest integer not exceeding x. Go back to step 2.
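The three steps above can be sketched as follows. Here `compute_bic(K)` is a hypothetical user-supplied routine that fits the model with K motifs and returns BIC(K); results are cached so that each K is fitted only once across refinement rounds.

```python
def search_motif_number(compute_bic, s=10):
    """Step-halving search for the K minimizing BIC (sketch of steps 1-3).

    compute_bic : hypothetical callable mapping K -> BIC(K).
    s : initial step size, set by the user (e.g., 10).
    """
    cache = {}
    def bic(K):
        # Fit each K at most once across all refinement rounds.
        if K not in cache:
            cache[K] = compute_bic(K)
        return cache[K]

    K0 = 1  # step 1: start point
    while True:
        # Step 2: walk upward from K0 in steps of s until BIC(K) beats
        # BIC(K + s), BIC(K + 2s), and all smaller grid points.
        K = K0
        while not (bic(K) < bic(K + s) and bic(K) < bic(K + 2 * s)
                   and all(bic(Kp) >= bic(K) for Kp in range(K0, K, s))):
            K += s
        Km = K
        # Step 3: stop at step size 1, otherwise refine around Km.
        if s == 1:
            return Km
        K0 = max(Km - s, 1)
        s = max(s // 2, 1)  # floor(s/2), kept at least 1
```

For example, with a BIC curve whose minimum is at K = 7 and an initial step size of 10, the search visits the grids {1, 11}, {1, 6}, {1, 3, 5, 7} and finally steps of 1 around 7.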
Again, this algorithm does not impose any upper limit on the motif number K. We note, however, that in real applications we seldom see cases where the optimal K is big; in our experience, the optimal K is often smaller than 10.
In all simulations and real data analyses in this article, CorMotif was run using s = 1, and the optimal K with the minimal BIC was always below 10.

A.5. Data for real data based simulations

Simulations 5-10 were based on real data characteristics. Each simulation contained multiple studies, and each study was composed of six samples from the same GEO experiment with the same biological condition, as detailed in Table A.4. The six samples were further split into three pseudo cases and three pseudo controls. These served as the simulated background, since one does not expect any real differential signals between replicate samples. We then spiked in differential signals by adding random N(0, 1) numbers to the three cases according to the patterns shown in Figures A.2(a,d,g) and A.4(a-b,e-f,i-j,m-n). Data simulated in this way preserved the background characteristics of real data.
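A minimal sketch of the spike-in procedure for one study, assuming a (G × 6) matrix of real replicate samples from the same GEO experiment and condition; the function name and arguments are illustrative, not part of CorMotif.

```python
import numpy as np

def simulate_study(expr_six, diff_genes, rng):
    """Build one simulated study from six replicate samples (sketch).

    expr_six : (G, 6) expression matrix of real replicate samples.
    diff_genes : boolean (G,) indicator of genes spiked as differential
                 in this study, taken from the chosen pattern.
    Returns (cases, controls), each of shape (G, 3).
    """
    # Split the six samples into three pseudo cases and three pseudo
    # controls; replicates carry no real differential signal, so they
    # provide the simulated background.
    cases = expr_six[:, :3].copy()
    controls = expr_six[:, 3:]
    # Spike in differential signal by adding random N(0, 1) numbers to
    # the three case samples of the selected genes, as described above.
    n_diff = int(diff_genes.sum())
    cases[diff_genes, :] += rng.normal(0.0, 1.0, (n_diff, 3))
    return cases, controls
```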
A.6. Discussion on the two-stage design

Currently, CorMotif is based on modeling the moderated t-statistics t_gd. Instead of using this two-stage approach, a potential future extension is to introduce a single coherent Bayesian model that fully integrates the correlation motifs with a model directly describing the raw expression values x_gdlj. In the present study, we chose the two-stage approach for several reasons.
First, it allows us to better present the core idea of this paper, namely how to use correlation motifs to integrate multiple studies. By taking advantage of the well-documented limma approach, the two-stage design lets us simplify the presentation of some model details (i.e., those related to the moderated t-statistics) and devote the limited space in the main article to the core idea of correlation motifs. Although a more coherent and sophisticated model for x_gdlj may bring some additional performance gain, the burden of presenting, and of readers digesting, additional notation and model details may distract from the core part of our approach. The space limit forces a trade-off between the two, and we believe it is more important for readers to understand the correlation motif idea.
Once one grasps this idea, one can easily improve it by extending the data models. Moreover, the two-stage approach as presented also represents a very general framework: conceptually, one can modify f_d0 and f_d1 to accommodate other data types, and because of the two-stage design this does not change the correlation motif model or the corresponding EM algorithm.
Second, using the two-stage framework, one can develop a simple EM algorithm to fit the model. This approach is computationally more efficient than running a Markov Chain Monte Carlo (MCMC) algorithm on a fully Bayesian model with many levels of unknown parameters (e.g., mean and variances of x gdlj s and parameters in their prior distributions, missing indicators A and B, and motif parameters π and Q).
Third, the present design also allowed us to perform a well-controlled comparison with the state-of-the-art approach limma. In our two-stage design, the first stage of CorMotif uses the same model as limma to compute the moderated t-statistics; the only difference between CorMotif and limma is in the second stage, that is, the correlation motif part. For this reason, the comparison between CorMotif and limma can unambiguously demonstrate the gain of using correlation motifs to integrate multiple studies. This gain is not confounded with other factors such as differences in the data distributions f_d0 and f_d1. By contrast, differences in performance between CorMotif and other methods such as SAM and eb1 can be caused by a number of different factors, such as differences in the models for the data x_gdlj. The two-stage design therefore helped us perform a clean comparison to show the effectiveness of correlation motifs. As a result, we were able to contribute a general tool with proven effectiveness (i.e., the correlation motif framework for data integration) to the toolbox that others can use to build future data analysis methods.

A.7. Discussion on computational time
We compared the computation time of different algorithms. We used the real data based simulations 5-7 (with study numbers D = 4, 8, and 20, respectively) as well as the real SHH data for this comparison in order to provide a realistic picture. All algorithms were run on a single 2.7 GHz CPU with 4 GB RAM. The results are shown in Table A.9. The computation time shown for CorMotif and eb10best includes the time used for searching for the optimal motif number K (see Section A.4 for the algorithm used by CorMotif to search for K). For these two algorithms, the model was fitted at multiple different K values, and the average computation time for each K (i.e., the mean time required for a single K, also called the "per K time") is also shown as "CorMotif (mean)" and "eb10best (mean)". The other methods do not need to search for K.
Based on the results, the total computation time required by CorMotif was between that of eb1 and eb10best when the study number D was relatively small (i.e., simulation 5 and the SHH data). However, both eb1 and eb10best became very slow or failed to run when D became big (i.e., simulations 6 and 7). CorMotif, eb1 and eb10best were all slower than SAM and all concord.
SAM analyzes each study separately and does not involve computation-intensive iterations. All concord assumes concordant signals and usually converges in a few iterations. This explains why they were fast. Separate limma also analyzes each study separately. However, recall that in this article an iterative EM algorithm was added to separate limma to call differential expression, in order to match CorMotif and better evaluate the gain in statistical power brought by correlation motifs. For this reason, separate limma was much slower than SAM on some datasets (e.g., simulation 7 and the real SHH data) because the EM algorithm took more iterations to converge there. In fact, when we used the original limma without the added EM algorithm, the computation time was reduced to the same level as SAM (see "limma (original)" in Table A.9).

When the study number D is small, the number of unknown parameters for CorMotif is comparable to, or may even be bigger than, that of full motif. For instance, if the study number D = 4 and the motif number K = 4, the number of parameters in π and Q in CorMotif is K − 1 + K * D = 4 − 1 + 4 * 4 = 19, whereas the number of equivalent parameters in full motif is 2^D = 2^4 = 16 < 19. This and many other factors, including the number of iterations to convergence and implementation details, may all affect the computation time. Consistent with this, the average computation time required by CorMotif for a single K (i.e., the per K time) was longer than that for full motif in simulation 5. In simulation 6 and the real SHH data, their per K computation times became comparable. However, in simulation 7, where D was big, full motif failed to run.

One additional thing to note is that the computation time is also affected by the signal-to-noise ratio. For example, the per K time for CorMotif in simulation 7, which involved more studies, was slightly smaller than that in simulation 6, which involved fewer studies. This was because, with signals integrated from 20 studies, the signal-to-noise ratio in simulation 7 was stronger, leading to faster convergence. Together, our results show that CorMotif is computationally tractable and able to handle a large number of studies without the exponential complexity problem. We also note that CorMotif has a parameter that allows users to fit the model with a fixed, user-specified motif number K. Therefore, if one has multiple processors, the computation could be accelerated by running the model fitting jobs for different K values in parallel.

[Figure caption: TP_d(r), the number of genes that are truly differentially expressed in study d among the top r ranked genes by a given method, is plotted against the rank cutoff r. For each simulation, results for a few representative studies are shown. Each plot is for one study.]

Table A.9. Comparison of computation time. The time is shown in units of seconds; in some cases it is also converted to hours (hr), shown in parentheses. For CorMotif and eb10best, the displayed number includes the time used to search for the optimal motif number K. For these two algorithms, the average computation time per K (i.e., the mean time required for a single K) is also shown as "CorMotif (mean)" and "eb10best (mean)". "limma (original)" corresponds to the original limma without using the EM algorithm to declare differential expression.