Abstract
A switching mechanism in gene expression, where two genes are positively correlated in one condition and negatively correlated in the other condition, is a key to elucidating complex biological systems. There already exist methods for detecting switching mechanisms from microarrays. However, current approaches have problems under three real cases: outliers, expression values with a very small range and a small number of examples. ROSDET overcomes these three problems, keeping the computational complexity of current approaches. We demonstrated that ROSDET outperformed existing methods, under that all these three situations are considered. Furthermore, for each of the top 10 pairs ranked by ROSDET, we attempted to identify a pathway, i.e. consecutive biological phenomena, being related with the corresponding two genes by checking the biological literature. In 8 out of the 10 pairs, we found two parallel pathways, one of the two genes being in each of the two pathways and two pathways coming to (or starting with) the same gene. This indicates that two parallel pathways would be cooperatively used under one experimental condition, corresponding to the positive correlation, and the two pathways might be alternatively used under the other condition, corresponding to the negative correlation. ROSDET is available from http://www.bic.kyotou.ac.jp/pathway/kayano/rosdet.htm.
INTRODUCTION
Gene expression analysis is a basic and important technique in molecular biology. There are two typical and simple concepts for expression analysis: (i) differential expression, which examines the difference in expression for a single gene between different experimental conditions (classes), such as case and control patients (1), and (ii) coexpression, which focuses on a combination of multiple genes, checking whether they are over or underexpressed simultaneously (2). One notion with both of these two properties is differential coexpression, in which coexpression patterns differ depending upon the experimental conditions (3,4). We address an issue of finding one type of differential coexpression, which hereafter we call a ‘switching mechanism’. The switching mechanism has two experimental conditions for expression of two genes, where two genes are positively correlated under one experimental condition while they are negatively correlated under the other condition (3,5–8). Figure 1 shows one simulated example of the switching mechanism. A simple, wellknown case of the switching mechanism is Max, a transcription factor, which plays a role of an activator or a suppressor, depending on whether it binds to Myc (i.e. MycMax) or Mad (i.e. MadMax) (9). Another case is thyroid hormone receptor (TR), which forms a complex called TRRXR and can be also an activator or a suppressor, depending on the absence or presence (amount) of thyroid hormone (10). Finding the switching mechanisms would be a key step to elucidating complex biological systems.
There are two typical techniques for finding switching mechanisms: the absolute difference of two correlation coefficients (3,5–7) and interaction test (8). Interaction test, being popular as a standard approach for detecting epistasis in genetics (11), is the loglikelihood ratio test between two logistic regressions with/without an interaction term. This test examines the strength of the interaction between two genes, which is in general equivalent to checking whether two genes can take the switching mechanism. However, a serious disadvantage of interaction test is its high computational burden, making it hard to apply to a large number of gene combinations practically (8).
We then focus on the absolute difference of two correlation coefficients. The procedure of this approach is that we first compute the correlation coefficient of two genes in expression for each of the two experimental conditions and check the absolute difference of the two correlation coefficients. More concretely, in Figure 1, we first compute the correlation coefficient r_{1} between gene X and gene Y for one experimental condition, say class 1 (shown by +), and r_{2} between them for the other condition, say class 2 (shown by •). We then compute the absolute difference between these two correlation coefficients as the score of two genes X and Y as shown below:
A switching mechanism must have this of a larger value, because two correlation coefficients should be different in the switching mechanism. We can consider any measure of correlation coefficients in this approach, such as the Pearson's correlation coefficient (5,7) and the Spearman's rank correlation (6) [(3) is a review over differential coexpression, including the switching mechanism]. Because of various correlation coefficients, there can be many approaches of using the absolute difference of two correlation coefficients, but they still have problems by which negative cases can be detected as positives. The negatives which can be detected incorrectly are caused by the following three main reasons: (i) outliers, (ii) range bias and (iii) a small number of examples. Figure 2a illustrates a typical negative case of (i), where only two points of class 2 are in the upperright area, which are outliers and make this case pretend to be a switching mechanism. Figure 2b shows a case of (ii), where the range of expression values of class 2 is much smaller than that by class 1. Regardless of the very small range, class 2 can show a strong negative correlation in its figure, by which the absolute difference between the two correlation coefficients would become large. In reality, however, the small range of class 2 makes us unconvincing whether class 2 is negatively correlated or not. Thus, we cannot say that Figure 2b is a switching mechanism, and this means that Figure 2b should be a negative case. Figure 2c shows a typical case of (iii), where the number of points in two classes is so small that it cannot be considered as a switching mechanism, implying that this cannot be a positive gene pair.We propose a method, which we call ROSDET (standing for RObust Switching mechanisms DETector) to avoid detecting the negative cases as positives (or to reduce the socalled false positive rates) in the paradigm of the absolute difference of two correlation coefficients. First, ROSDET uses a robust measure of correlation coefficient, socalled biweight midcorrelation (12), to avoid detecting gene pairs with outliers such as Figure 2a as switching mechanisms. Most typical correlation coefficients robust against outliers are the Spearman's and Kendall rank correlations, but they use the ranking after sorting all the given values, which might lose some information in the original values. On the other hand, the biweight midcorrelation is a modification of the Pearson's correlation coefficient which allows to keep the original values and consider outliers carefully. The biweight midcorrelation was thus used in microarray data analysis already for detecting coexpression gene pairs (13), and the performance advantage of the biweight midcorrelation in expression analysis over other correlation coefficients was already shown (14). Second, ROSDET multiplies the difference of correlation coefficients by a weight, which reflects upon the range bias between two classes so that the weight becomes smaller as the range bias increases. This means that the absolute difference of correlation coefficients is lowered by the weight more as the range bias is larger. Third, ROSDET uses Pvalues to discard gene pairs with only a small number of examples even if their weighted absolute difference of correlation coefficients is large. We use hypothesis testing to check the equality of biweight midcorrelation coefficients derived from two experimental conditions, meaning that gene pairs with higher Pvalues are more insignificant in difference of two biweight midcorrelation coefficients. On the whole, ROSDET has two steps: WCOR (sorting gene pairs by the Weighted absolute difference between two biweight midCORrelation) and ECOR (the hypothesis testing on the Equality of two biweight midCORrelation). All gene pairs, which are generated from a given expression dataset, are first ranked by the weighted difference of two biweight midcorrelations in WCOR, and if Pvalues of gene pairs are high in ECOR, they are then deleted.
We experimentally confirmed the effectiveness of ROSDET through experiments with synthetic and real datasets. We first generated synthetic datasets to examine the performance of WCOR on two problems, i.e. outliers and range bias, comparing to existing approaches. The results showed that WCOR outperformed all the other competing methods. We then checked the effectiveness of ECOR by using real data consisting of over 2.60 × 10^{10} gene combinations generated from 46 datasets in Gene Expression Omnibus (GEO). The result showed that removing gene pairs with high Pvalues relaxed the bias to the data with a smaller number of examples. Finally, we focused on the top 10 gene pairs in the final output of ROSDET, to check the biological relevance of each pair. For 8 out of the 10 pairs, we found two (parallel) pathways, which start with or come to the same gene and, for each pair, have one of two genes at each pathway separately. The parallel pathways imply that two parallel pathways might be cooperatively used under one experimental condition, corresponding to the positive correlation of two genes, and the two pathways might be alternatively used under the other condition, corresponding to the negative correlation of two genes.
In summary, the main contribution of this article can be the following three: (i) we have developed ROSDET based on the weighted midcorrelation coefficients, (ii) empirically validated the performance of ROSDET and (iii) applied ROSDET to real data to find new gene pairs with switching mechanisms.
MATERIALS AND METHODS
The main input of ROSDET is an expression dataset measured under two experimental conditions, and the output is gene pairs with switching mechanisms between two experimental conditions. ROSDET first generates all possible gene pairs from a given dataset and the following two steps are run over all pairs: (i) WCOR: all gene pairs are sorted by the weighted absolute difference of biweight midcorrelation coefficients and (ii) ECOR: out of the sorted gene pairs, pairs with high Pvalues regarding the equality of two biweight midcorrelation coefficients are removed. The feature of WCOR is biweight midcorrelation coefficient and the weighted absolute difference of two correlation coefficients, while that of ECOR is hypothesis testing on the equality of two biweight midcorrelation coefficients. Below, we explain each of the three features in detail and finally show the entire procedure of our approach.
Preliminaries
Let be a given (microarray expression) dataset with G genes and two classes C_{1} and C_{2} (with N_{1} and N_{2} examples, respectively) corresponding to the two experimental conditions. We consider all possible pairs of genes over G genes. Let X and Y be two genes of an arbitrary pair. Let x_{i} be the expression value of gene X for an example i and similarly y_{i} be that of gene Y for i. Let m_{x,k} be the median of expression values in class C_{k} for gene X, and m_{y,k} be that for gene Y. Furthermore, let δ_{x,k} be the sample median of x_{i} − m_{x,k} over all values x_{i} satisfying i ∈ C_{k}, and δ_{y,k} be that of y_{i} − m_{y,k} over all y_{i} where i ∈ C_{k}. Let r_{1} and r_{2} be correlation coefficients for C_{1} and C_{2}, respectively. When we already know the true value of the correlation coefficient, we write the true correlation coefficients by ρ_{1} and ρ_{2} for C_{1} and C_{2}, respectively. For example, in order to generate synthetic data, we can use ρ_{1} and ρ_{2}.
WCOR: biweight midcorrelation
The biweight midcorrelation r_{k} (12) for class C_{k} between two genes X and Y can be given as follows:
where u_{i} = (x_{i} − m_{x,k})/(K · δ_{x,k}) and v_{i} = (y_{i} − m_{y,k})/(K · δ_{y,k}). [K was set at nine in our experiments, according to (12,15,16).] Here can be normalized into the ‘biweight midvariance’ (12), which is given in the Supplementary Data and hereafter we write by q_{x,k}. Similarly, we can write q_{y,k} for the biweight midvariance of gene Y. We note that if weights w(u_{i}) and w(v_{i}) are always both 1 over all i and medians m_{x,k} and m_{y,k} are both the means (denoted by μ_{x,k} and μ_{y,k}, respectively), then r_{k} is exactly the same as the Pearson's correlation coefficient, which is given by:WCOR: weighted absolute difference of two correlation coefficients
We modify Equation (1) into the following weighted form to deal with the range bias between two classes:
We here describe how we can compute c of Equation (2). The problem is to find a gene pair with a high bias (difference) in the range of expression values between two classes and then to discard it. To check the range of expression values, we can employ the variances, i.e. biweight midvariances, q_{x,k} and q_{y,k} for class C_{k}. We can say that the range of values is small if both q_{x,k} and q_{y,k} are small in class C_{k}. Thus, for class C_{k} we can keep the maximum of q_{x,k} and q_{y,k}, and if this value is small, we can see that the range is small. The range bias can then be checked by the ratio of the range (the maximum variance) of one class to that of the other. In summary, we can first take the maximum of q_{x,k} and q_{y,k} for each class C_{k} and then take the ratio of the maximum variance of one class to that of the other class. Here, we note that the ratio can be in two ways, i.e. the ratio of class 1 to class 2 and the ratio of class 2 to class 1. We further note that Equation (2) must become lower as the range bias increases more, and so c in Equation (2) must take 1 for the case with no range bias and be reduced to zero as the range bias increases. Thus to make c hold this property, we can compute the possible two ratios and take the minimum of the two ratios.Overall c can be given as follows:
By using Equations (2) and (3), we can then compute the score of each gene pair, indicating the possibility that the gene expression of this pair can be a switching mechanism. Then by computing scores of all possible combinations of genes, WCOR can rank these combinations by the computed scores.
ECOR: hypothesis testing for the equality of two correlation coefficients
The purpose of the hypothesis testing is to check the equality of two correlation coefficients (each being derived from the corresponding one of two classes), meaning that the hypotheses are H_{0}: ρ_{1} = ρ_{2} and H_{1}: ρ_{1} ≠ ρ_{2}. Given the expression values of two genes with two classes, if we assume that they take a bivariate normal distribution for each class and the correlation coefficient is the Pearson's correlation coefficient for each class, test statistic T of the likelihood ratio test for H_{0} and H_{1} follows the chisquare distribution with one degree of freedom under H_{0} (17,18):
where is a maximum likelihood estimator for ρ_{1} and ρ_{2} under H_{0}, where ρ_{1} = ρ_{2}, and satisfies . Interested readers should see the Supplementary Data for the derivation of Equation (4).We apply this test statistic to the biweight midcorrelation. More concretely, we use the biweight midcorrelation given by Equation (2) for r_{k} in Equation (4). We show the empirical validation on this application of Equation (4) to the biweight midcorrelation in the Supplementary Data. We compute Pvalues by using this hypothesis testing with the biweight midcorrelation. We can say that two correlation coefficients of pairs with higher Pvalues than a specified significance level are not different statistically. Thus, out of the ranked list generated in WCOR, ECOR removes pairs with higher Pvalues than a prespecified significance level.
The entire procedure
Given an expression dataset with two classes, ROSDET first generates all possible pairs of G genes. WCOR then computes, for each of all pairs, the score given by Equation (2), and further sorts all pairs according to the computed scores. Finally, ECOR computes Pvalues of the sorted pairs by using Equation (4) to remove pairs with higher Pvalues than a prefixed significance level. Figure 3 shows a pseudocode of ROSDET with WCOR and ECOR.
RESULTS
Validating WCOR (the first step of ROSDET) with synthetic data
Experimental setting
The procedure of this experiment is that we first generate datasets according to some distribution, changing its parameters, by which some datasets are true positives and the rest are true negatives. That is, the problem setting is binary classification over the generated datasets, i.e. predicting whether each dataset is a positive or a negative. We then apply WCOR to this problem and compare its performance with those of the other competing methods.
For each dataset in class C_{k}, we generated examples according to a bivariate gandh distribution with four parameters, g_{k}(≥0), h_{k}(≥0), σ_{k} and ρ_{k} (19) [More concretely, we first generated examples according to bivariate normal distributions (for genes X and Y) and transformed them into those in polar coordinates, in which we used only the distance from the origin to each example by which we generated examples following an univariate gandh distribution. We show the detail of generating the bivariate gandh distribution in the Supplementary Data]. Here g_{k} controls the skewness of the distribution, and the distribution is symmetrical at g_{k} = 0, which was used throughout all our experiments, meaning that this parameter had nothing to do with our experiment. h_{k}, a nonnegative parameter, controls the kurtosis (peakedness) of the distribution. We note that if g_{k} = h_{k} = 0, the distribution is the standard normal distribution. As h_{k} becomes larger, the distribution becomes broader (or the distribution tail becomes more emphasized), meaning that outliers can be generated more easily. σ_{k} and ρ_{k} are the variance of the distribution and the (true) correlation coefficient between two variates (genes X and Y), respectively. We note that σ_{k} can be used to generate the dataset with a different range between genes X and Y. Under a fixed h_{k}, we change σ_{k} and/or ρ_{k} according to some manner, where under each parameter setting, we generate a certain number of datasets.
We tested three values of h_{k}: 0, 0.5 and 1, fixing N_{1} = N_{2} = 100. Under each of them, we changed σ_{k} and/or ρ_{k} in the following three types of manners:

Randomness: under a certain h_{k}, we generated datasets, changing ρ_{k} such that ρ_{1} = −ρ_{2} = 0, 0.01, … , 1, fixing σ_{1} = σ_{2} = 1. ρ_{k} = 0 corresponds to that of no correlations among examples in C_{k}, while ρ_{k} = 1 corresponds to the state that examples are totally correlated between two variates in C_{k}. This means that examples are totally random when ρ_{1} = −ρ_{2} = 0 and examples must be the switching mechanism when ρ_{1} = −ρ_{2} = 1. Thus, datasets with ρ_{1} and ρ_{2} closer to ρ_{1} = −ρ_{2} = 1 should be positives and the rest should be negatives. In fact, we used ρ_{k} for r_{k} in Equation (5), and in Equation (5) can be computed from each generated dataset, and then if the Pvalue was less than the significance level (5%), we assigned ‘positive’ to the true class label of the dataset; otherwise, ‘negative’ was assigned. In prediction, we again emphasize that we used WCOR only, in which we computed score s(X, Y) in Equation (2) for each dataset and sorted all datasets by scores. We generated datasets 100 times for each pair of ρ_{1} and ρ_{2} and the results were averaged over total runs under each setting.

Range bias: under a certain h_{k}, we generated datasets, changing the ratio of ranges (variances) between two classes such that σ_{2} = 1, 1.1, … , 10, fixing σ_{1} = 1 and ρ_{1} = −ρ_{2} = 0.5. Thus, σ_{2} = 1 shows that the ratio is 1, while σ_{2} = 10 shows that the ratio is 10 (or ). In this case, datasets with lower σ_{2} should be positives and those with higher σ_{2} should be negatives. The true class label is assigned and predicted in a similar manner in the previous experiment, but in true class label assignment, we used the Bartlett test (20), which is the hypothesis testing on the equality of four different variances, to be generated in our case by the combinations of two classes and genes X and Y. We used the significance level at 5% for the Pvalue of the Bartlett test, meaning that a dataset was positive if its Pvalue is higher than 5%. We generated datasets 100 times for each σ_{2} and the results were averaged over total runs under each setting.

Randomness + Range bias: we changed both ρ_{k} and σ_{k} here so that ρ_{1} = −ρ_{2} = 0, 0.01, … , 1, and . To assign a class label to each dataset, we used both Equation (4) and the Bartlett test and regarded the datasets which were labeled as positives by both of these two tests as positives and the rest as negatives. We generated datasets 10 times for each parameter setting and the results were averaged over total runs under each h_{k}.
For each of the above three settings, we show the distribution of data points of representative samples in the Supplementary Data.
Results
We evaluated the performance of each competing method by using receiver operator characteristics (ROC) curves. Figure 4 shows the average ROC curves of WCOR (colored by red), being compared with those of four competing methods: (i) the absolute difference of two correlation coefficients when using the Pearson's correlation coefficient (Pearson), (ii) the Spearman's correlation coefficient (Spearman), (iii) the (unweighted) biweight midcorrelation (Biweight) and (iv) the interaction test (IT). Figure 4a1–a3 show the results of Randomness for h_{k} = 0, 0.5 and 1.0, respectively, where outliers were increased as h_{k} increased. Figure 4a1 shows that all ROC curves were lowered as increasing h_{k} as shown by Figure 4a2 and a3 except those using robust correlation coefficients, i.e. WCOR, Spearman and Biweight, implying that these methods were robust against outliers. Figure 4b1–b3 show the results of Range bias for h_{k} = 0, 0.5 and 1.0, respectively. These results show that WCOR outperformed other competing methods clearly, for any amount of outliers. In particular, the performance advantage of WCOR over other methods was clear in Figure 4b2 and b3, where the performances of the other methods were almost on the diagonal line, implying that their performances were similar to random guessing while WCOR showed good performance by showing a typical ROC curve. Figure 4c1–c3 show the results of Randomness + Range bias, which is a more real situation, for h_{k} = 0, 0.5 and 1.0, respectively. This case, again, WCOR outperformed other four competing methods clearly for all cases as shown by Range bias. Overall, WCOR outperformed other methods in a more real situation, keeping its performance under Randomness as a comparable one to noiserobust methods, such as the Spearman's rank correlation and the biweight midcorrelation. We run our experiments over the cases with N_{1} = N_{2} = 50 and N_{1} = N_{2} = 20 and confirmed that the performance advantage of WCOR over other competing methods was all kept, and this result is attached to the Supplementary Data. Furthermore, we checked the case with N_{1} = N_{2} = 10 (shown in the Supplementary Data), where however the advantage of WCOR was not significant but this case was mostly removed by ECOR in real situations, implying that this result will not affect the reliability of ROSDET.
Validating ECOR (the second step of ROSDET) with real data
Experimental setting
Out of 2089 GEO DataSets (GDSs) of the GEO database (21) of the latest update in July, 2008, we extracted 46 datasets (or GDSs) which satisfy the following two conditions: The 46 datasets are listed in the Supplementary Data. Using the 46 datasets, ROSDET first generated 2.60 × 10^{10} gene combinations. WCOR sorted out 2.60 × 10^{10} pairs, according to the weighted absolute difference of biweight midcorrelation. ECOR then discarded gene pairs with Pvalues higher than a prefixed threshold which was set to 1.92 × 10^{−12} ≃ 0.05/(2.60 × 10^{10}) by considering the Bonferroni correction and the significance level of 5%. We used the Bonferroni correction, which is known to be relatively conservative, to keep reliable gene pairs only. The top 100 gene pairs by ECOR, i.e. the final output, is shown in the Supplementary Data.
Experimental conditions can be divided into two or more classes.
Each class has 10 or more experiments.
We first compared the distribution of expression values of the top 10 gene pairs by WCOR with that by ECOR. We then focused on gene pairs, each being derived from a GDS with, for each class, a certain number of replicates (experiments) which we call ‘the number of examples’ or ‘sample size’. We showed the distribution (histogram) of the number of examples, being generated by the top 1000 gene pairs by WCOR, comparing with those by the original dataset and the top 1000 gene pairs by ECOR.
Results
Figure 5 shows expression values under two experimental conditions (• and +) of the top 10 gene pairs in the output of WCOR. This figure reveals that the number of examples was very small for any gene pair, by which some pairs cannot necessarily be switching mechanisms. For example, the eighth ranked pair consisted of three distant islands, which could not have been a switching mechanisms, and the 10th ranked pair had two nonoverlapped distributions which also could not have been a switching mechanism. On the other hand, Figure 6 shows the top 10 gene pairs in the output of ECOR. Each distribution of Figure 6 can be seen as a switching mechanism more clearly than those of Figure 5. This result indicates that the outputs of WCOR were likely to be the cases with the smaller number of examples, and ECOR works for removing dubious cases in the outputs of WCOR.
Figure 7a shows the distribution of the number of examples (sample sizes), by the original 46 datasets. This distribution is rather uniform, meaning that 46 datasets have a variety of sample sizes relatively equally, in the range of 10–70. Figure 7b shows the distribution by the top 1000 gene pairs obtained from WCOR. We here note that the total sum of the number of all gene pairs in (Figure 7b) is 2000 (=1000 × 2), since each gene pair has two classes and the top 1000 gene pairs are obtained for each class. This figure clearly reveals that the top 1000 gene pairs by WCOR were so biased, where the number of examples was all less than 20, being consistent with the figures in Figure 5, all with only a small number of examples. We then show that this bias can be relaxed by ECOR. Figure 7c shows the distribution by the top 1000 gene pairs by ECOR, when we changed the cutoff value for Pvalues from 10^{−8} (Figure 7c1) to 10^{−9} (Figure 7c2) and further to 1.92 × 10^{−12} (Figure 7c3) [Again note that that the total sum of yaxis of Figure 7c1 and Figure 7c2 is equal to 2000 (=1000 × 2). Further, note that the distribution of Figure 7c3 is shown by 191 gene pairs only, which were all pairs obtained at this cutoff.]. The distribution in Figure 7c1 was still biased and very similar to Figure 7b, while that of Figure 7c2 was rather similar to Figure 7a, meaning that the bias in Figure 7c1 was relaxed by lowering the cutoff value. Finally, Figure 7c3 also shows an uniform distribution, though the shape is awkward because of the small number of gene pairs. We note that Figure 7c3 was obtained by a systematic manner in which we used the significance level of 0.05 as a cutoff value after applying the Bonferroni correction to Pvalues. Overall, these results confirmed that ECOR relaxed the bias in the result of WCOR by removing dubious pairs.
Validating highly ranked gene pairs with the literature
Table 1 shows the information of the top 10 gene pairs ranked by ROSDET. For example, • and + in Figure 6 correspond to the left and righthand sides, respectively, of annotations in Table 1. In Table 1, two genes of each pair, say genes X and Y, have a clear switching mechanism in gene expression, to be related with some biological reason. Thus, we could make a pathway of genes that connects genes X and Y, where each step of the pathway would show a biological function like ‘binding’ or ‘positive regulation’. This case, one naturally arising question is what biological system causes a switching mechanism, i.e. that genes X and Y are correlated under one condition and negatively correlated under the other condition. This can be explained by the following mechanism, which we call ‘parallel pathways’.
Score  Pvalue  Gene pair  GDS  No. of examples in classes 1 and 2  Annotation (Two classes: •/+)  

1  1.710  1.56×10^{−12}  {NSMCE4A,USP4}  GDS2656  14,14  Fetal/adult 
2  1.688  2.77×10^{−13}  {HS3ST1,GANAB}  GDS2545  18,25  Normal prostate tissue/ 
metastatic prostate tumor  
3  1.684  1.30×10^{−12}  {ACTG1,LITAF}  GDS1726  12,16  Control/HIV encephalopathy 
4  1.639  8.29×10^{−13}  {HS3ST1,TAF10}  GDS2545  18,25  Normal prostate tissue/ 
metastatic prostate tumor  
5  1.606  1.31×10^{−12}  {FARP1,NONO}  GDS2545  18,25  Normal prostate tissue/ 
metastatic prostate tumor  
6  1.584  9.73×10^{−13}  {GATAD2A,LOC149643}  GDS1917  14,14  Control/schizophrenia 
7  1.581  2.68×10^{−13}  {EZR,HHEX}  GDS1650  19,20  Adjacent normal/tumor 
8  1.532  8.05×10^{−13}  {Ncl,Prkcb1}  GDS1455  10,10  Medial motoneuron/ 
intermediolateral column motoneuron  
9  1.522  3.66×10^{−13}  {ACAA1,EPHX2}  GDS2545  18,25  Normal prostate tissue/ 
metastatic prostate tumor  
10  1.488  1.64×10^{−12}  {INTS1,NPTX1}  GDS963  18,18  Normal/macular degeneration 
Score  Pvalue  Gene pair  GDS  No. of examples in classes 1 and 2  Annotation (Two classes: •/+)  

1  1.710  1.56×10^{−12}  {NSMCE4A,USP4}  GDS2656  14,14  Fetal/adult 
2  1.688  2.77×10^{−13}  {HS3ST1,GANAB}  GDS2545  18,25  Normal prostate tissue/ 
metastatic prostate tumor  
3  1.684  1.30×10^{−12}  {ACTG1,LITAF}  GDS1726  12,16  Control/HIV encephalopathy 
4  1.639  8.29×10^{−13}  {HS3ST1,TAF10}  GDS2545  18,25  Normal prostate tissue/ 
metastatic prostate tumor  
5  1.606  1.31×10^{−12}  {FARP1,NONO}  GDS2545  18,25  Normal prostate tissue/ 
metastatic prostate tumor  
6  1.584  9.73×10^{−13}  {GATAD2A,LOC149643}  GDS1917  14,14  Control/schizophrenia 
7  1.581  2.68×10^{−13}  {EZR,HHEX}  GDS1650  19,20  Adjacent normal/tumor 
8  1.532  8.05×10^{−13}  {Ncl,Prkcb1}  GDS1455  10,10  Medial motoneuron/ 
intermediolateral column motoneuron  
9  1.522  3.66×10^{−13}  {ACAA1,EPHX2}  GDS2545  18,25  Normal prostate tissue/ 
metastatic prostate tumor  
10  1.488  1.64×10^{−12}  {INTS1,NPTX1}  GDS963  18,18  Normal/macular degeneration 
Parallel pathways
Parallel pathways are two pathways which both reach (or start with) the same gene, which we call the destination gene, and these pathways satisfy the following two conditions: (i) gene X is in one pathway and gene Y is in the other, and (ii) under one experimental condition, genes X and Y are positively correlated in expression (meaning that two pathways are cooperatively used), and under the other experimental condition, genes X and Y are negatively correlated in expression (meaning that two pathways are alternatively used). One possible scenario of a switching mechanism with parallel pathways is that the expression of the destination gene can be controlled by (or can control) the two conditions. That is, the expression of the destination gene can be changed by (or can change) the cooperative or alternative expression of upstream (or downstream) genes.
One typical example found by Li (5) has two genes, each being in a different metabolic pathway from glutamate to ornithine, controlled by the expression of CPA2, which is in upstream of glutamate. That is, if the expression of CPA2 is low, two genes are positively correlated in expression, while if that of CPA2 is high, two genes are negatively correlated, meaning that these two genes are alternatively expressed.
Figure 8 illustrates a simulated example of two genes, genes X and Y, with a switching mechanism and parallel pathways, where gene Z is the destination gene. We further note that genes X and Y must be correlated with the gene Z in a particular manner: genes X and Y should be both positively or negatively correlated with gene Z under one condition (corresponding to + in Figure 8 showing both positive correlations), while genes X and Y should be correlated with gene Z in two different ways, i.e. one being positive and the other being negative, under the other condition (corresponding to • in Figure 8).
We here check how frequently the parallel pathways can be found in the top 10 gene pairs ranked by ROSDET and further confirm their validity by computing the biweight midcorrelations between highly ranked paired genes and the destination genes.
Top ranked gene pair
NSMCE4A [or NSE4A: nonSMC element 4 homolog A (Saccharomyces cerevisiae)] and USP4 (ubiquitin specific peptidase 4 or UNP) from GDS2656.
NSMCE4A has a human homolog, NSE4B (or EID3). The EID family has EID1, EID2 and EID3, in which both EID1 and EID2 inhibit cell differentiation while both EID1 and EID3 inhibit transcription under the existence of EP300 (22). This makes us assume that NSMCE4A has the same biological function as EID1. EID1 interacts with (probably binds to) MDM2 (23), which is a negative regulator of tumor suppressor protein TP53 (24), implying that MDM2 is rather a positive factor on tumor growth. On the other hand, USP7, a ubiquitin specific peptidase with the same function as USP4, stabilizes the activation of TP53 (25), implying some negative effect on tumor growth.
Figure 9 summarizes these genes into two parallel pathways which both go to TP53. This figure shows a direct implication to the mechanism between EID1 and USP4 (or USP7), being consistent with parallel pathways. Two conditions of GDS2656 are fetal and adult. Figure 6a shows that two genes are positively correlated under fetal while negatively correlated under adult, implying that TP53 might be alternatively regulated by NSMCE4A and USP4 under adult, while TP53 could be regulated in a cooperative manner under fetal. To confirm this finding, we attempted to check the expression of TP53 but we could not find TP53 in GDS2656, and so instead we checked the expression of TP53RK, a TP53 regulating kinase. Table 2 shows the biweight midcorrelation between TP53RK and each of two genes of the top ranked pair. The result shows that TP53RK is positively correlated with both NSMCE4A and USP4 under fetal, while under adult, TP53RK is positively and negatively correlated with NSMCE4A and USP4, respectively. This result is consistent with our finding, which implies cooperative regulation under fetal and alternative regulation under adult.
Gene 1  Gene 2  r_{1}(•: Fetal)  r_{2} (+: Adult) 

NSMCE4A  TP53RK  0.25  0.50 
USP4  TP53RK  0.25  −0.48 
MDM2  TP53RK  0.35  0.40 
USP7  TP53RK  −0.27  −0.40 
Gene 1  Gene 2  r_{1}(•: Fetal)  r_{2} (+: Adult) 

NSMCE4A  TP53RK  0.25  0.50 
USP4  TP53RK  0.25  −0.48 
MDM2  TP53RK  0.35  0.40 
USP7  TP53RK  −0.27  −0.40 
We further checked the biweight midcorrelation in expression between two neighboring genes which are in the parallel pathways and indicated by positive or negative regulation, e.g. negative regulation of MDM2 on TP53. We here note that we did not check neighboring pairs labeled by ‘binding’, since binding is elusive by including both positive and negative regulation. Thus, the neighboring pairs we checked are MDM2 → TP53 (negative regulation) and USP7 → TP53 (positive stabilization), and we used TP53RK instead of TP53. Table 2 shows the result, indicating that under both fetal and adult TP53RK was positively and negatively correlated with MDM2 and USP7, respectively, which are both reverse to the labels assigned to MDM2 → TP53 and USP7 → TP53. This implies that TP53RK might be negatively regulated by TP53, and this finding is consistent with (26). We note that even if TP53RK is negatively regulated by TP53, it is not contradictory to our finding that NSMCE4A and USP4 are cooperatively correlated under fetal and are alternatively correlated under adult.
Second ranked gene pair
HS3ST1 [heparan sulfate (glucosamine) 3Osulfotransferase 1] and GANAB (αglucosidase) from GDS2545.
Both of these two genes are related with generating glycans (27,28). HS3ST1 is involved with cancer cells (28–30), while GANAB is related with carbohydrate absorption and digestion, especially metastatic process of tumor cells (31). Two GANAB inhibitors, 1,6epicyclophellitol and castanospermine, inhibit experimental metastasis of tumor and tumor growth (31,32), meaning that GANAB may be rather an activator for tumor (metastasis), since these two GANAB inhibitors prevent tumor growth. GANAB stably interacts with (probably binds to) the external domain of PTPRC (CD45) (33), where Lck is completely dysfunctional in the absence of CD45 (34), and CD44 binds to Lck (35). On the other hand, HS3ST1 activates anticoagulant heparan sulfate (HS) (36) and can elevate the level of anticoagulant HS (37). Anticoagulant HS binds to FGFR2 (36), which further binds to FGF1 to produce a protein complex (38). FGF1 binds to HSPA9 (39), which binds to TP53 (40), an inhibitor of CD44 (41). CD44 selectively associates with active Lck, meaning that CD44 and Lck can form a complex.
Figure 10 summarizes these relations into parallel pathways both reach to the complex of CD44 and Lck. Two classes of GDS2545 are normal tissues and prostate tumor tissues. Figure 6b shows that GANAB and HS3ST1 are negatively correlated in expression under normal tissues, while they are positively correlated under tumor tissues. Here, TP53 is the inhibitor of CD44 in the pathway, by which some switching mechanism might be generated between GANAB and HS3ST1. That is, under normal tissues, the negative correlation between GANAB and HS3ST1 might lead to Lck and CD44 being highly expressed when the expression of one of HS3ST1 and GANAB is high (and the other is low). On the other hand, under tumor tissues, the correlation in expression between GANAB and HS3ST1 might indicate that the expression of CD44 and Lck is not well balanced. For example, even if both GANAB and HS3ST1 are well expressed, one of CD44 and Lck might be poorly expressed despite of the high expression of the other, which might be a result of the disorder, i.e. prostate tumor. To confirm this inference, we checked the biweight midcorrelation in expression between each gene of the second ranked pair and Lck (and CD44). Table 3 shows that under normal tissues Lck (and CD44) is correlated negatively and positively with GANAB and HS3ST1, respectively, implying that both CD44 and Lck can be expressed well when HST3ST1 is expressed highly and GANAB is expressed poorly. On the other hand, under tumor tissues, the expression of CD44 and Lck is unbalanced under any situation in expression of GANAB and HS3ST1. This result is clearly consistent with the above inference.
Gene 1  Gene 2  r_{1}(•: Normal)  r_{2} (+: Tumor) 

GANAB  Lck  −0.38  −0.44 
HS3ST1  CD44  0.28  0.34 
HS3ST1  Lck  0.24  −0.47 
CD45  Lck  −0.12  0.55 
TP53  CD44  0.50  −0.40 
Gene 1  Gene 2  r_{1}(•: Normal)  r_{2} (+: Tumor) 

GANAB  Lck  −0.38  −0.44 
HS3ST1  CD44  0.28  0.34 
HS3ST1  Lck  0.24  −0.47 
CD45  Lck  −0.12  0.55 
TP53  CD44  0.50  −0.40 
We further checked the biweight midcorrelation of neighboring two genes, which are in the parallel pathways and expected to be positively or negatively coexpressed. Again we did not check pairs labeled by ‘binding’, since binding includes both positive and negative regulation. We then checked CD45 → Lck (coexpression) and TP53 → CD44 (inhibitor). Table 3 shows that under tumor tissues, CD45 and Lck are positively correlated and TP53 and CD44 are negatively correlated, being consistent with the labels assigned to these two edges in Figure 10.
Third ranked gene pair use
ACTG1 (actin gamma 1) and LITAF [lipopolysaccharideinduced tumor necrorosis factor (TNF)α factor] from GDS1726.
LITAF activates the production of infectionfighting substance called TNFα. Both ACTG1 and TNF can be found in a HIV pathway (42), where TNF is an upstream signal while ACTG1 appears in downstream. In fact, the two experimental conditions measured in GDS1726 are the HIV encelophathy and its control. This pathway appears in the apoptosis pathway of KEGG. Figure 11 shows the two corresponding pathways. In both of these pathways, simply LITAF or TNF is in upstream and ACTG1 is in downstream, implying that this pair is not parallel pathways, but TNF and ACTG1 were connected by two different pathways, implying that this pair might be explained by another parallel association, which might cause a switching mechanism.
Fourth ranked gene pair
HS3ST1 [heparan sulfate (glucosamine) 3Osulfotransferase 1] and TAF10 [RNA polymerase II and TATA box binding protein (TBP)associated factor] from GDS2545.
HS3ST1 is one of two genes appeared in the second ranked gene pair. We can then use the same pathway as that of the second ranked gene pair, and TP53 binds to a TBP and represses its transcription (43). On the other hand, TAF10 forms a complex (TFIID) with TBP and other proteins, where TAF10 is important for stabilizing the complex. We can summarize these genes into Figure 12, which shows that HS3ST1 and TAF10 have parallel pathways. Two conditions of GDS2545 are normal tissues and prostate tumor tissues. Figure 6d shows that TAF10 and HS3ST1 are negatively correlated in expression under normal tissues, while they are positively correlated under tumor tissues. Here, TP53 represses transcription of TBP, with which TAF10 forms a complex, by which under normal tissues, the negative correlation between TAF10 and HS3ST1 might indicate that both TBP and TAF10 are highly expressed when the expression of HS3ST1 is low (and that of TAF10 is high), while they are both not expressed highly when the expression of HS3ST1 is high (and that of TAF10 is low). On the other hand, under tumor tissues, the expression of TAF10 and that of HS3ST1 are positively correlated, possibly implying that the balance in expression between TBP and TAF10 (by which a complex will be formed) is not kept well, maybe because of the disorder, i.e. prostate tumor. In order to confirm this inference, we checked the biweight midcorrelation between TBP and two genes in the fourth ranked pair, and Table 4 shows the result. From this table, we can see that under normal tissues, TBP is negatively and positively correlated with HS3ST1 and TAF10, respectively, being consistent with our scenario. That is, under normal tissues, both TBP and TAF10 can be expressed highly when HS3ST1 is not expressed well, while TBP and TAF10 will not be expressed highly if HS3ST1 is expressed well. On the other hand, under tumor tissues, TBP can be positively correlated with both HS3ST1 and TAF10, although the correlation values are relatively slight, indicating that the above scenario or mechanism under normal tissues would not work well under tumor tissues.
Gene 1  Gene 2  r_{1}(•: Normal)  r_{2} (+: Tumor) 

HS3ST1  TBP  −0.46  0.18 
TAF10  TBP  0.43  0.26 
TP53  TBP  −0.59  −0.33 
Gene 1  Gene 2  r_{1}(•: Normal)  r_{2} (+: Tumor) 

HS3ST1  TBP  −0.46  0.18 
TAF10  TBP  0.43  0.26 
TP53  TBP  −0.59  −0.33 
We further checked the biweight midcorrelation between two neighboring genes, being labeled by positive or negative regulation (as mentioned earlier, we did not check those labeled by ‘binding’). We then checked TP53 → TBP (transcription repression) (since anticoagulant HS is a chemical compound and not in GDS2545). Table 4 shows the biweight midcorrelation between TP53 and TBP, indicating negative correlation under both normal and tumor tissues, which is consistent with the fact that TP53 represses TBP.
Fifth ranked gene pair
FARP1 [FERM RhoGEF (ARHGEF) and pleckstrin domain protein] and NONO (nonPOU domain containing octamerbinding) from GDS2545.
NONO is deeply involved with human diseases, being a cause of cell carcinoma by a translocation with TFE3. TFE3 binds to a tumor suppressor p68 (DDX5) (44), which binds to EP300 (45), which further binds to TP53 (46). TP53 is an inhibitor of CD44 (41). On the other hand, CD44 forms a complex with ezrin family proteins (47). FARP1 has three domains including an ezrinlike domain. Figure 13 summarizes these genes into pathways which have CD44 as their final destination, indicating parallel pathways. Two classes are normal tissues and prostate tumor tissues, which are the same as those of the second and fourth ranked gene pairs. Figure 6e shows that NONO and FAPR1 are positively correlated in expression under normal tissues, while they are negatively correlated under tumor tissues, which is reverse against the second and fourth ranked gene pairs. However, these pathways are totally different from the second and fourth pathways, except TP53 → CD44, and so these pathways might explain the switching mechanism of FARP1 and NONO. Table 5 shows the biweight midcorrelation between CD44 and each of the two genes in this pair. From the table, we can see that under normal tissues CD44 is positively correlated with both NONO and FAPR1, while under tumor tissues CD44 is positively and negatively correlated with NONO and FAPR1, respectively. This result is consistent with our finding that NONO and FAPR are cooperatively expressed under normal tissues while they are alternatively expressed under tumor tissues.
Gene 1  Gene 2  r_{1}(•: Normal)  r_{2} (+: Tumor) 

NONO  CD44  0.43  0.45 
FAPR1  CD44  0.42  −0.58 
TP53  CD44  0.15  −0.49 
FAPR1  EZR  0.61  0.49 
Gene 1  Gene 2  r_{1}(•: Normal)  r_{2} (+: Tumor) 

NONO  CD44  0.43  0.45 
FAPR1  CD44  0.42  −0.58 
TP53  CD44  0.15  −0.49 
FAPR1  EZR  0.61  0.49 
We then further checked the biweight midcorrelation of two neighboring genes which are in the pathway and not shown by ‘binding’ and related labels. We then checked TP53 → CD44 (inhibitor) and FAPR1 → EZR (domain). The results are shown in Table 5, in which TP53 and CD44 are negatively correlated under tumor tissues and EZR and FAPR1 are positively correlated under both two conditions. This result is also consistent with the labels assigned to these neighboring genes.
Overall, to keep a switching mechanism with parallel pathways, two genes in each pair should have the following correlation with the destination gene. Two paired genes are both positively or negatively correlated with the destination gene under one condition, while under the other condition, two paired genes are correlated with the destination gene in two different ways, i.e. one being positive and the other being negative. In fact, all four pairs, i.e. the top, second, fourth and fifth ranked gene pairs have such correlations. This result also implies that the switching mechanisms found by ROSDET are reliable.
The detail of the sixth to 10th ranked gene pairs is shown in the Supplementary Data due to space limitations. We found that each of the seventh to 10th gene pairs has parallel pathways, meaning that totally eight out of the top 10 gene pairs had parallel pathways.
DISCUSSION AND CONCLUSION
We have developed an efficient and robust method, ROSDET, for detecting switching mechanisms in gene expression. ROSDET clearly outperformed current approaches in a variety of experimental settings. Particularly under the case of expression values with a very small range, where the performance of all competing methods was almost equal to random guessing, ROSDET achieved a significantly better accuracy. We examined the literature on the top five pairs ranked by ROSDET and found that each pair has been involved with a biological pathway, which can connect two genes of the pair. Furthermore, four out of the top five pairs have parallel pathways, which were suggested by Li (5) as a typical case of switching mechanisms, implying that four pairs have real switching mechanisms. A possible explanation on the parallel pathways which come to (or start with) the destination gene is that the destination gene controls (or is controlled by) two cases: two pathways are cooperatively (or positively) correlated or two pathways are alternatively (or negatively) correlated. In fact, in each of all four pairs, the biweight midcorrelation between the destination gene and two genes in the pair was consistent with the above explanation. This result also supports the performance of ROSDET in detecting switching mechanisms in gene expression.
Although the real computation time is not shown in our experimental results, ROSDET is very time efficient, because the time complexity of computing the biweight midcorrelation coefficient is totally the same as that of a simple correlation coefficient, such as the Pearson's correlation coefficient.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Funding for open access charge: Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST).
Conflict of interest statement. None declared.
Comments