Abstract

Motivation: The differential expression analysis focusing on inter-group comparison can capture only differentially expressed genes (DE genes) at the population level, which may mask the heterogeneity of differential expression in individuals. Thus, to provide patient-specific information for personalized medicine, it is necessary to conduct differential expression analysis at the individual level.

Results: We proposed a method to detect DE genes in individual disease samples by using the disrupted ordering in individual disease samples. In both simulated data and real paired cancer-normal sample data, this method showed excellent performance. It was found to be insensitive to experimental batch effects and data normalization. The landscape of stable gene pairs in a particular type of normal tissue could be predetermined using previously accumulated data, based on which dysregulated genes and pathways for any disease sample can be readily detected. The usefulness of the RankComp method in clinical settings was exemplified by the identification and application of prognostic markers for lung cancer.

Availability and Implementation:  RankComp is implemented in R script that is freely available from Supplementary Materials.

Contact:  guoz@ems.hrbmu.edu.cn or slliu@ucalgary.ca

Supplementary information:  Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The differential expression analysis methods on intergroup comparison for identifying disease-related genes can be classified into two broad categories: the intensity-based methods such as T-Test ( Rice, 1995 ), SAM ( Tusher et al. , 2001 ) and Limma ( Smyth, 2004 ), and the rank-based methods such as Wilcoxon signed rank test ( Wilcoxon, 1945 ), Wilcoxon rank sum test ( Hollander and Wolfe, 1973 ) and RP ( Breitling et al. , 2004 ). Of the rank-based methods, there is a special type of method using the fold-change ordering information, such as RP ( Breitling et al. , 2004 ), RcoS ( Navon et al. , 2009 ) and FCROS ( Dembele and Kastner, 2014 ). In general, these methods first entail ranking fold-change values within each pairwise comparison between two types of samples, then calculate a fold-change rank ordering statistic for each gene. The significance of the observed statistic could be determined by using a permutation test (e.g. RP) or probability distribution function (e.g. FCROS). The rank-based methods usually have some advantages in comparison with the intensity-based methods, such as robustness against outlier values and favorable power efficiency in some cases ( Lehmann, 1975 ). However, both the above-mentioned intensity-based and rank-based methods are designed to detect the population-level DE genes and cannot provide patient-specific differential expression information.

Taking the heterogeneous nature of disease into account, several outlier detection methods, including COPA ( Tomlins et al. , 2005 ), OS ( Tibshirani and Hastie, 2007 ), ORT ( Wu, 2007 ), MOST ( Lian, 2008 ) and others ( de Ronde et al. , 2013 ; Hu, 2008 ; Wang et al. , 2012 ), are developed to detect genes that are dysregulated in subsets of disease samples. However, they are more sensitive to various technical artifacts, especially experimental batch effects caused by differences in laboratory conditions, reagent lots and personnel ( Leek et al. , 2010 ). In general, batch effects can introduce serious problems in translating experimental findings to clinical settings. For example, because of this problem, an optimized threshold value of the risk score summarizing the expression levels of signature genes determined from a set of training samples cannot be directly applied to other samples. Usually, in validation studies, data normalization is required for independently extracted samples to make all data having nearly the same scale as the training samples. However, current normalization methods that are used to adjust for systematic technical artifacts are usually not able to remove batch effects ( Lazar et al. , 2012 ). They may even distort biological signals ( Wang et al. , 2011 ). More fundamentally, large interindividual variation in gene expression will exacerbate the problem of threshold setting for risk stratification. To tackle this difficult problem, researchers have proposed to make use of the relative ordering information of gene expression within each sample, considering that the relative ordering of gene expression within each sample would be rather robust against batch effects and insensitive to data normalization ( Geman et al. , 2004 ; Tan et al. , 2005 ).

The relative ordering of gene expression is overall stable in a particular type of normal human tissue but widely disturbed in diseased tissue. Taking this into account, a method of detecting DE genes in individual disease samples is here proposed. Using both simulated data and real paired cancer–normal data, this method was shown to have excellent performance for individual-level analyses of dysregulated genes and pathways. The use of this method in clinical contexts was exemplified by the identification and application of prognostic markers to risk stratification of lung cancer patients according to the dysregulation status of signature in each patient rather than a predefined risk threshold value.

2 MATERIALS AND METHODS

2.1 Data and preprocessing

Multiple microarray datasets of gene expression, generated with the Affymetrix platform (HG-U133 Plus2.0), were downloaded from Gene Expression Omnibus (GEO; Edgar et al. , 2002 ). Notably, normal samples of each type of tissue (lung or breast) were collected from datasets for studying different disorders on this tissue, as described in detail in Table 1 . Paired cancer–normal samples taken from the same patients were used to evaluate the performance of RankComp . The cancer sample data with survival information were used for survival analysis. More detailed clinical information on survival data can be found in the Supplementary Material .

Table 1.

Description of normal sample data, paired cancer–normal sample data and survival data used in this study

TissueGEO Acc Number of Normal aNumber of CancerReference
The normal sample data used for identifying stable gene pairs
LungGSE1884245 ( Sanchez-Palencia et al. , 2010 )
GSE1918865 ( Hou et al. , 2010 )
GSE3121020 ( Lu et al. , 2010 )
GSE1980460 ( Okayama et al. , 2011 )
GSE3776820
BreastGSE37447 ( Alimonti et al. , 2010 )
GSE73072
GSE79047 ( Richardson et al. , 2006 )
GSE1078060 ( Chen et al. , 2009 )
GSE1081027 ( Pedraza et al. , 2009 )
GSE207112 ( Dedeurwaerder et al. , 2011 )
GSE214225 ( Kretschmer et al. , 2011 )
GSE225444 ( Hawthorn et al. , 2010 )
GSE26457113 ( Russo et al. , 2011 )
GSE2943112
GSE30010107
GSE4256817 ( Clarke et al. , 2013 )

The paired cancer–normal sample data used for evaluating the performance of RankComp

LungGSE272622525 ( Wei et al. , 2012 )
BreastGSE107801111 ( Chen et al. , 2009 )

The data with survival information used for survival analysis

LungGSE31210204 ( Okayama et al. , 2011 )
GSE290138 ( Xie et al. , 2011 )
GSE3021983 ( Rousseaux et al. , 2013 )
GSE3154613
GSE3774578 ( Botling et al. , 2012 )
TissueGEO Acc Number of Normal aNumber of CancerReference
The normal sample data used for identifying stable gene pairs
LungGSE1884245 ( Sanchez-Palencia et al. , 2010 )
GSE1918865 ( Hou et al. , 2010 )
GSE3121020 ( Lu et al. , 2010 )
GSE1980460 ( Okayama et al. , 2011 )
GSE3776820
BreastGSE37447 ( Alimonti et al. , 2010 )
GSE73072
GSE79047 ( Richardson et al. , 2006 )
GSE1078060 ( Chen et al. , 2009 )
GSE1081027 ( Pedraza et al. , 2009 )
GSE207112 ( Dedeurwaerder et al. , 2011 )
GSE214225 ( Kretschmer et al. , 2011 )
GSE225444 ( Hawthorn et al. , 2010 )
GSE26457113 ( Russo et al. , 2011 )
GSE2943112
GSE30010107
GSE4256817 ( Clarke et al. , 2013 )

The paired cancer–normal sample data used for evaluating the performance of RankComp

LungGSE272622525 ( Wei et al. , 2012 )
BreastGSE107801111 ( Chen et al. , 2009 )

The data with survival information used for survival analysis

LungGSE31210204 ( Okayama et al. , 2011 )
GSE290138 ( Xie et al. , 2011 )
GSE3021983 ( Rousseaux et al. , 2013 )
GSE3154613
GSE3774578 ( Botling et al. , 2012 )

Note : a To determine stable gene pairs on a particular type of normal tissue, only normal samples were collected from datasets for studying different disorders on this tissue. Thus, the information of disease samples in each dataset was not presented.

Table 1.

Description of normal sample data, paired cancer–normal sample data and survival data used in this study

TissueGEO Acc Number of Normal aNumber of CancerReference
The normal sample data used for identifying stable gene pairs
LungGSE1884245 ( Sanchez-Palencia et al. , 2010 )
GSE1918865 ( Hou et al. , 2010 )
GSE3121020 ( Lu et al. , 2010 )
GSE1980460 ( Okayama et al. , 2011 )
GSE3776820
BreastGSE37447 ( Alimonti et al. , 2010 )
GSE73072
GSE79047 ( Richardson et al. , 2006 )
GSE1078060 ( Chen et al. , 2009 )
GSE1081027 ( Pedraza et al. , 2009 )
GSE207112 ( Dedeurwaerder et al. , 2011 )
GSE214225 ( Kretschmer et al. , 2011 )
GSE225444 ( Hawthorn et al. , 2010 )
GSE26457113 ( Russo et al. , 2011 )
GSE2943112
GSE30010107
GSE4256817 ( Clarke et al. , 2013 )

The paired cancer–normal sample data used for evaluating the performance of RankComp

LungGSE272622525 ( Wei et al. , 2012 )
BreastGSE107801111 ( Chen et al. , 2009 )

The data with survival information used for survival analysis

LungGSE31210204 ( Okayama et al. , 2011 )
GSE290138 ( Xie et al. , 2011 )
GSE3021983 ( Rousseaux et al. , 2013 )
GSE3154613
GSE3774578 ( Botling et al. , 2012 )
TissueGEO Acc Number of Normal aNumber of CancerReference
The normal sample data used for identifying stable gene pairs
LungGSE1884245 ( Sanchez-Palencia et al. , 2010 )
GSE1918865 ( Hou et al. , 2010 )
GSE3121020 ( Lu et al. , 2010 )
GSE1980460 ( Okayama et al. , 2011 )
GSE3776820
BreastGSE37447 ( Alimonti et al. , 2010 )
GSE73072
GSE79047 ( Richardson et al. , 2006 )
GSE1078060 ( Chen et al. , 2009 )
GSE1081027 ( Pedraza et al. , 2009 )
GSE207112 ( Dedeurwaerder et al. , 2011 )
GSE214225 ( Kretschmer et al. , 2011 )
GSE225444 ( Hawthorn et al. , 2010 )
GSE26457113 ( Russo et al. , 2011 )
GSE2943112
GSE30010107
GSE4256817 ( Clarke et al. , 2013 )

The paired cancer–normal sample data used for evaluating the performance of RankComp

LungGSE272622525 ( Wei et al. , 2012 )
BreastGSE107801111 ( Chen et al. , 2009 )

The data with survival information used for survival analysis

LungGSE31210204 ( Okayama et al. , 2011 )
GSE290138 ( Xie et al. , 2011 )
GSE3021983 ( Rousseaux et al. , 2013 )
GSE3154613
GSE3774578 ( Botling et al. , 2012 )

Note : a To determine stable gene pairs on a particular type of normal tissue, only normal samples were collected from datasets for studying different disorders on this tissue. Thus, the information of disease samples in each dataset was not presented.

The use of relative expression obviates the need of between-chip normalization because all direct comparisons between genes occur within individual samples and inter-chip normalization can preserve order ( Heinaniemi et al. , 2013 ). For this reason, the raw data (.CEL files) for each dataset was processed using the RMA algorithm for background adjustment without quantile normalization ( Irizarry et al. , 2003 ). Then, each probeset ID was mapped to Entrez gene ID with the custom CDF file. If multiple probesets were mapped to the same gene, the expression value for the gene was summarized as the arithmetic mean of the values of multiple probesets (on the log2 scale).

2.2 The MSigDB pathway

The 674 canonical Reactome pathways were downloaded from the C2-CP collection of the Molecular Signatures Database (MSigDB version 4.0, updated May 31, 2013; Subramanian et al. , 2005 ). These pathways covered 6025 unique genes for pathway enrichment analysis.

2.3 Rank comparison

First, each gene expression value is converted to its rank within each sample (the smallest expression value corresponding to the minimum rank, and the largest expression value corresponding to the maximum rank). Then, pairwise comparisons are performed for all genes to identify gene pairs with stable ordering in accumulated normal samples for a particular type of tissue from different data sources. For each gene pair ( G i , G j ), being viewed as an event with only two possible outcomes ( G i > G j or G i < G j ), the frequency of samples in normal samples for which the rank of G i is greater (or less) than that of G j is estimated as follows:
Pnorm(Gi>Gj)=1n1t=1n1I[Git>Gjt]Pnorm(Gi<Gj)=1n1t=1n1I[Git<Gjt]
where n 1 is the total number of normal samples and I is the indicator function. Stable gene pairs are defined as gene pairs with P norm ( G i > G j )>0.99 or P norm ( G i < G j ) > 0.99. Next, reversal gene pairs are defined for each disease sample as gene pairs with reversal ordering in comparison with their stable ordering in normal samples ( G i > G j G i < G j or G i < G j G i > G j ). Afterwards, the Fisher’s exact test is used to determine whether a given gene ( G i ) is differentially expressed in a given disease sample ( k ) by testing the null hypothesis that the numbers of reversal gene pairs supporting its upregulation and downregulation are equal. For G i , if its ordering is consistently lower (or higher) than that of G j in normal samples but the opposite in the disease sample k , then this reversal gene pair is considered to support upregulation (or downregulation) of G i in that sample. Let G denote the set of stable gene pairs including G i in normal samples, a and b denote the numbers of gene pairs belonging to G with ordering patterns { G i > G j } and { G i < G j' } and c and d denote the corresponding numbers of gene pairs belonging to G with ordering patterns { G i > G j } and { G i < G j' } in the disease sample k. To this end, the proportions of gene pairs with ordering patterns { G i > G j } and { G i < G j' } are Obs 11 /Obs 12 = a/b in normal samples and Obs 21 /Obs 22 = c/d in the disease sample k , respectively. Under the null hypothesis, we will expect that Obs 11 /Obs 12 = Obs 21 /Obs 22 , which can be tested by the Fisher’s exact test, wherein G i is defined as upregulated if c/d > a/b, downregulated if c/d < a/b and stable expression if c/d = a/b. Finally, a filtering process is adopted to minimize the potential effect of the expression changes of other genes on the upward or downward shift in the rank of a gene detected as an up- or downregulated DE gene: if and only if it is still significant after excluding the downregulated (or upregulated) partner genes involved in the reversal gene pairs supporting its upregulation (or downregulation), it will be retained.
Notably, besides its major application for detecting DE gene at the individual level, the RankComp method could also be applied to identify DE genes at the subpopulation level by using the binomial test to find a non-randomly high percentage ( f ) of disease samples sharing certain DE genes. Here, the minimum value of f ( f = k/n 2 ) that meets a prespecified significance level for the binomial test can be determined as follows:
P=kn2Cn2kp0k(1p0)(n2k)
where k is the number of the observed dysregulated samples for each gene, n 2 is the total number of disease samples and p 0 is the probability of observing a gene being differentially expressed in a disease sample by chance, calculated as the average of the frequencies of DE genes among all the genes on the array for individual disease samples.

2.4 Evaluation of performance

First, a simulation is performed to evaluate the performance of the RankComp method. To retain the intrinsic structure of real microarray data, the simulations were conducted based on real microarray dataset rather than beginning by assuming normal distribution (see Section 3 for detailed description of simulation experiments). The simulation experiment enables us to know both the DE genes and non-DE genes and facilitate the calculation of sensitivity, specificity and F-score. For each simulation scenario, the sensitivity, specificity and F-score of the RankComp method for detection of DE genes are estimated at different FDR control levels. Here, the sensitivity is defined as the ratio of correctly identified DE genes to all DE genes and the specificity is defined as the ratio of correctly identified non-DE genes to all non-DE genes. The F-score, a harmonic mean of sensitivity and specificity, is calculated as follows:
Fscore=2(sensitivity×specificity)(sensitivity+specificity)

Then, the real microarray data of paired cancer–normal samples, as described in Table 1 , is used as a benchmark to evaluate the performance of the RankComp method. For a paired samples of cancer and normal tissues, a gene is defined as up-regulated if its expression level in the cancer sample is larger than that in the normal sample, and defined as down-regulated if its expression level in the cancer sample is smaller than that in the normal sample, regardless of whether the changes in its expression level are significant or not. Taking the change directions (up- or down-regulation) of genes observed in the paired cancer-normal samples as the golden standard, we then define a DE gene, detected by the RankComp method in the same cancer sample, whose change direction is consistent with the golden standard as a consistent DE gene. For each cancer sample, the consistency score, as representative of the precision of DE detection, is calculated as the ratio of the consistent DE genes to all DE genes, and the number of DE genes is used to access the detection power of the method. Notably, the corresponding paired normal tissue samples are not used to determine stable gene pairs in normal tissues.

2.5 Pathway analysis at the individual level

The hypergeometric distribution model was used to determine the significance of biological pathways enriched with up- and downregulated DE genes separately ( Hong et al. , 2013 ). The P -values were adjusted using the Benjamini and Hochberg procedure, controlling FDR at the 5% level ( Hochberg and Benjamini, 1995 ).

2.6 Survival analysis and pathway signature selection

The overall survival was calculated from the date of surgery until death or last follow-up contact. To avoid the bias of patient follow-up duration, patients with >120 months of follow-up were truncated at 120 months. The survival curves were estimated by the Kaplan–Meier method ( Meier, 1958 ) and were compared using the log-rank test ( Gray, 1988 ). The Cox proportional hazard model was used for univariate and multivariate survival analysis ( Cox, 1972 ).

The forward stepwise algorithm was used to identify an optimal pathway predictor of survival. Beginning with the pathway with the highest concordance index (C-index; Harrell et al. , 1996 ) as the initial signature, candidate pathways were added one at a time to the signature until the next one to be added did not improve prognostic performance. At each step, predictive performance was gauged for all possible additions and evaluated using the c-index to select the optimal addition yielding the largest increase in the c-index value. The C-index is a measure of the probability that, given a pair of randomly selected patients, the model correctly predicts which patient will have a better outcome. This measure quantifies discriminatory ability, ranging from 0.5 (indicating random chance) to 1 (indicating perfect discrimination).

3 RESULTS

3.1 Stable expression ordering of gene pairs in normal human tissues

To determine whether relative expression ordering of gene pairs is stable in normal samples for a particular tissue, normal tissue samples from different datasets were collected for the study of different disorders. For normal lung tissue, a total of 210 samples were collected from multiple datasets of several lung diseases, including lung cancer and chronic obstructive pulmonary disease. They were then divided into two groups: one group of 65 samples was derived from the GSE19188 dataset and another group of 145 samples was derived from four other datasets ( Table 1 ). In each group, pairwise comparisons were performed for all genes to identify stable gene pairs present in at least 99% of normal samples. More than 95% (112 970 616 over 118 725 563) of the stable gene pairs identified in the second group were included in the 135 020 175 stable gene pairs identified in the first group. Remarkably, all of the 112 970 616 overlapping gene pairs had the same ordering patterns across the two groups of samples. This was highly unlikely to happen by chance (binomial test, P < 1.0e-16), indicating that stable gene pairs are highly reproducible in different sets of normal tissue samples. Similar results were observed in normal breast tissue ( Table 1 and Supplementary Table S1 ). Notably, all genes on the array were involved in stable gene pairs for each of the two types of tissue named above, indicating that stable gene pairs widely exist in normal human tissues and that stable gene pairs are an inherent feature of gene expression in normal human tissues. These results provide a basis for individualized differential analysis using gene ranking information.

In order to further show the inherent advantage of stable expression ordering of gene pairs over distribution variations of gene expression, a pair of genes ( GCKR and GATA1 ) with stable expression ordering obtained from normal lung tissue samples was taken as an example here. As illustrated in Figure 1 , even after quantile normalization, the expression distributions of the two genes in normal samples from different datasets are still different so that the expression intensities of these genes in different datasets are incomparable. These distribution variations hamper data integration from different data sources. However, the relative ordering of gene expression within each sample is insensitive to these distribution variations and thus would facilitate multi-source data integration.

Fig. 1.

An example of a pair of genes showing stable relative expression orderings in normal samples of lung tissue. The plot illustrates that the expression distributions of two genes (GCKR and GATA1) in normal samples from different datasets are still different even after quantile normalization. For example, the expression intensities of these two genes in all normal samples of GSE19804 are always higher than those in GSE19188. However, the relative expression ordering of these two genes within each sample is insensitive to the distribution variations from different datasets: the expression intensity of gene GCKR keeps larger than that of gene GATA1 within each sample in any of the datasets

3.2 Evaluation of performance

First, the performance of the RankComp method was evaluated using a simulation. To retain the intrinsic structure of the data, data were simulated for 60 disease samples and 100 disease samples on the basis of the expression profiles of 15 000 genes for 60 normal lung tissue samples and 100 normal breast tissue samples extracted from the GSE19804 and GSE26457 datasets, respectively. For each normal sample, 1500 upregulated and 1500 downregulated genes were randomly generated and used to produce a disease sample by setting these genes with different magnitudes of differential expression (e.g. log 2 FC =± 0.8, ± 1.0 and ± 1.2). Here, the fold-change (log 2 FC) was used to quantify the difference in expression levels of each gene between the disease sample and the corresponding normal sample. For the large dataset of 100 disease samples and 100 normal samples, when the magnitude of differential expression (|log 2 FC|) was increased from 0.8 to 1.2, the sensitivity increased from 79 to 96% at the cost of a slight decrease in specificity, from 94 to 85%. As shown in Table 2 , the method exhibited rather good performance with sensitivity >90%, specificity >85% and F-score >90% when the magnitude of differential expression was not too small (|log 2 FC| ≥ 1.0). To determine the effect of normal sample size, the performance of the method was studied in the small dataset of 60 disease samples and 60 normal samples. As expected, a slight decline in sensitivity, specificity and F-score was observed for each scenario. Similar results were observed for the scenarios with 5% FDR level.

Table 2.

Sensitivity, specificity and F-score for the RankComp method in simulated data

Dataset
100 versus 100
60 versus 60
|log 2 FC| FDR1%5%1%5%
0.8Sensitivity0.79150.83290.70980.7539
Specificity0.94390.93110.93040.9120
F-score0.86100.87930.80530.8254
1.0Sensitivity0.91530.93840.85200.8836
Specificity0.90780.87420.87930.8345
F-score0.91150.90520.86540.8583
1.2Sensitivity0.96330.97480.92480.9452
Specificity0.85270.78590.79700.7163
F-score0.90460.87020.85620.8150
Dataset
100 versus 100
60 versus 60
|log 2 FC| FDR1%5%1%5%
0.8Sensitivity0.79150.83290.70980.7539
Specificity0.94390.93110.93040.9120
F-score0.86100.87930.80530.8254
1.0Sensitivity0.91530.93840.85200.8836
Specificity0.90780.87420.87930.8345
F-score0.91150.90520.86540.8583
1.2Sensitivity0.96330.97480.92480.9452
Specificity0.85270.78590.79700.7163
F-score0.90460.87020.85620.8150
Table 2.

Sensitivity, specificity and F-score for the RankComp method in simulated data

Dataset
100 versus 100
60 versus 60
|log 2 FC| FDR1%5%1%5%
0.8Sensitivity0.79150.83290.70980.7539
Specificity0.94390.93110.93040.9120
F-score0.86100.87930.80530.8254
1.0Sensitivity0.91530.93840.85200.8836
Specificity0.90780.87420.87930.8345
F-score0.91150.90520.86540.8583
1.2Sensitivity0.96330.97480.92480.9452
Specificity0.85270.78590.79700.7163
F-score0.90460.87020.85620.8150
Dataset
100 versus 100
60 versus 60
|log 2 FC| FDR1%5%1%5%
0.8Sensitivity0.79150.83290.70980.7539
Specificity0.94390.93110.93040.9120
F-score0.86100.87930.80530.8254
1.0Sensitivity0.91530.93840.85200.8836
Specificity0.90780.87420.87930.8345
F-score0.91150.90520.86540.8583
1.2Sensitivity0.96330.97480.92480.9452
Specificity0.85270.78590.79700.7163
F-score0.90460.87020.85620.8150

Then, the real microarray data of paired cancer-normal samples was used as a benchmark to evaluate the performance of the method. Using the stable gene pairs obtained from the 210 normal lung tissue samples, with 5% FDR control, DE genes for each lung cancer sample from an independent paired cancer-normal sample dataset (GSE27262) were detected using the RankComp method described in Section 2.3. To ensure the association between the individualized DE genes and cancer, the analysis was restricted to genes that were dysregulated in a non-randomly high percentage of cancer samples. For each cancer sample, average 2722 DE genes showing dysregulated in at least 44% of samples were identified with an average precision of 90.35%, as indicated by the consistency between the detected dysregulation directions of the DE genes and their actual dysregulation directions observed in the paired samples. In particular, when focusing on the analysis of the top 100 ranked DE genes for each cancer sample, the average precision reached 99.80%. Similar evaluation results were observed in an independent paired cancer–normal sample dataset for breast cancer (GSE10780; Supplementary Table S2 ). These results indicated that the RankComp method can reliably capture differential expression signals, especially the top-ranked signals, in cancer patients based on the landscape of stable gene pairs obtained in advance using previously accumulated data.

Notably, the criteria for identifying stable gene pairs in normal samples may affect the results of differential expression analysis. When the criteria were decreased from strict control (99%) to loose control (95%), the average number of DE genes for lung cancer increased by about 44% and the average precision decreased by about 5%. Similar results were observed for breast cancer ( Supplementary Table S3 ). A strict control of 99% would minimize false positives.

In principle, the RankComp method is an incremental learning process, during which its performance can be improved along with the accumulation of normal samples for a particular type of tissue by setting an increasingly reliable landscape of stable gene pairs. This advantage here was demonstrated by investigating the impact of normal samples on the performance. From the 210 normal lung tissue samples, subsets of different sample sizes, ranging from 50 samples per group to the highest sample size, were randomly extracted. As shown in Figure 2 , when the normal sample size increased from 50 to 210, the average precision increased from 88.78 to 90.35% at the cost of a decline in average number of DE genes, from 3726 to 2722. However, the rate of decline gradually decreased, indicating a stably improving trend of DE genes detection as the normal sample size increased. The same trend was observed for breast cancer ( Fig. 2 ).

Fig. 2.

Impact of normal sample size on the performance of the RankComp method. Blue bars represent the average precision, and orange line represents the average number of DE genes per sample

3.3 Use of individual-level DE analysis for outlier detection

As described in the Methods, besides the individual-level analysis of DE genes, the RankComp method can be applied to identify DE genes at the subpopulation level, similar to outlier detection methods such as COPA, OS, ORT and MOST. Because it has been shown that COPA and MOST for outlier detection perform relatively better in comparison with several other methods such as OS and ORT ( Karrila et al. , 2011 ), we compared only RankComp with COPA and MOST in simulation data. The detailed descriptions of simulation experiments and parameter settings were presented in the Supplementary Material , and the detailed results were presented in Supplementary Table S4 . Briefly, the F-score comparison results showed that RankComp performed similarly with COPA and better than MOST at φ = 5%, 10% and 20%, but obviously better than COPA and slightly worse than MOST at φ = 50% and 80%. These results demonstrated that the RankComp method for detecting DE genes in subsets of disease samples can perform better in many scenarios, indicating a potential advantage of applying the RankComp method for detecting DE genes in various proportions of disease samples. In addition, we mimicked systematic batch effects in the simulated datasets to evaluate the robustness of the RankComp method against systematic batch effects. As shown in Supplementary Table S5 , the RankComp method had the advantage of insensitivity to systematic batch effects (for details, see Supplementary Material ).

Notably, we could also apply the RankComp method to identify the population-level DE genes, defined as the combination of subpopulation-level DE genes, considering that the population-level DE genes might not be dysregulated in all samples. In this sense, we compared RankComp with the Wilcoxon signed-rank test and Limma. As shown in Supplementary Table S4 , the F-score comparison showed that RankComp performed better than both the Wilcoxon signed-rank test and Limma for all the scenarios with φ =5% and still better than Limma for the scenarios with φ =10%, but worse than both the Wilcoxon signed-rank test and Limma for all the scenarios with φ =20, 50 and 80%. These results suggested that RankComp is generally uncompetitive for detecting the population-level DE genes. Thus, we do not recommend using the RankComp method to detect DE genes at the population level.

3.4 Use of individual-level DE analysis in survival analysis

Individualized DE gene analysis provides a basis for individual-level pathway analysis, which could be readily applied in clinical contexts. To exemplify this application, a publicly available microarray dataset of 204 early-stage lung adenocarcinoma samples with survival data was analyzed (GSE31210). Using the landscape of stable gene pairs obtained from the 235 normal lung tissue samples, dysregulated genes were detected for each of 204 cancer samples at the FDR level of 0.05. Focusing on these cancer-associated genes for each cancer sample, biological pathways that were significantly enriched with up- and down-DE genes in this sample were detected at the FDR level of 0.05. The results indicated that the dysregulation of biological pathways in lung cancer patients is heterogeneous ( Supplementary Fig. S1 ). For example, the ‘deposition of new CENPA-containing nucleosomes at the centromere pathway’ was significantly dysregulated in 50% of cancer patients but not in the others. Dividing the cohort of 204 patients into two groups according to the dysregulation status of this pathway showed that the dysregulation of this pathway was significantly associated with poor overall survival in lung cancer patients (Log-rank test, P = 1.29e-05). Multivariate analysis showed that high-risk and low-risk designation remained statistically significant after adjustment for age, gender and smoking status ( Supplementary Fig. S2A and Supplementary Table S6 ). This result was validated in an independent pooled cohort of 182 lung cancer samples ( Supplementary Fig. S2B and Supplementary Table S6 ).

Many other pathways whose dysregulation status was predictive for survival of lung cancer patients were also found ( Supplementary Table S7 ). To look into this further, a forward stepwise algorithm was used to select an optimal set of pathways with the highest predictive power for prognosis. A two-pathway signature consisting of the ‘deposition of new CENPA-containing nucleosomes at the centromere pathway’ and ‘DNA replication pathway’ was generated in the training cohort ( Fig. 3 A), with a poor prognosis for patients with simultaneous dysregulation of these two pathways confirmed in the validation cohort ( Fig. 3 B). Multivariate analysis showed that the two-pathway signature remained significantly associated with overall survival after adjustment for age, gender and stage ( Supplementary Table S8 ). It has been found that dysregulation of the former pathway can cause ectopic formation resulting in multicentric chromosomes and consequential genome instability that drives tumor progression ( Amato et al. , 2009 ). For instance, the overexpression of CENPA gene involved in this pathway promotes cancer progression by increasing genome instability ( Amato et al. , 2009 ). Dysregulation of the latter pathway can trigger replicative stress and replication-associated DNA damage, favoring the accumulation of genetic alterations in cancer cells ( Allera-Moreau et al. , 2012 ; Bartkova et al. , 2006 ). For instance, the overexpression of PLK1 gene involved in this pathway promotes cancer progression by increasing resistance to replication stress respectively ( Allera-Moreau et al. , 2012 ). These studies provided the additional evidence to support our findings from the aspect of biological importance.

Fig. 3.

Kaplan–Meier estimates of overall survival according to the optimal two-pathway signature in the training and validation cohorts. ( A ) Overall survival curves in the training cohort. ( B ) Overall survival curves in the validation cohort

Taken together, these analyses demonstrated the capacity of the RankComp method to add value in a prognostic setting and provide patient-specific information for personalized medicine.

4 DISCUSSION

The overall stable ordering of gene expression in a normal human tissue may reflect the biological reality that the normal state should be robust against various perturbations ( Shiraishi et al. , 2010 ). During the transition from the normal state to a disease state, the ordering of gene expression residing in a diseased tissue may be subject to extensive changes, which could be sufficient to reveal patient-specific differential expression information, as demonstrated by the results based on both simulated data and real paired cancer–normal sample data. One unique advantage of the present relative ordering-based method is that it is insensitive to batch effects and data normalization and thus can directly use microarray data from different data sources. In particular, the landscape of stable gene pairs in a particular type of normal tissue can be pre-determined using previously accumulated data of normal samples collected for the study of different disorders on the tissue and could become increasingly stable and reliable along with data accumulation of normal samples. Based on the landscape, dysregulated genes and pathways for an individual disease sample of this tissue can be readily detected.

Analysis of individualized DE genes could have important applications. First, it can be of value in the practice of personalized medicine. For prognostic risk stratification, the present method can directly stratify patients at the individual level based on the dysregulation status of signature in each patient. Optimal risk threshold value that is determined from a set of training samples using a risk-scoring method usually needs to be redetermined in independent samples because of technical and biological variations. Another possible application of the RankComp method is in addressing the scarcity of normal tissue samples, which are often rare because of the invasive nature of sample collection. Fortunately, normal samples for a particular type of tissue, generated in different laboratories for the study of different disorders, are often collected in public repositories. Using these normal samples, this method can be used to predetermine the landscape of stable expression ordering of gene pairs for that particular type of tissue. Then, based on this landscape, dysregulated genes and pathways for any disease sample of this tissue can be readily detected. In this way, the present method makes it possible to maximize the reuse of accumulated data for normal tissue samples and facilitate the research of human disease.

Nevertheless, the present method also has several limitations. First, the RankComp method may have insufficient power to detect genes whose differential expression causes minor changes in the gene ranking profile. Fortunately, even if a certain number of DE genes go undetected in a given disease sample, shifting the focus of the analysis from individual genes to pathways tends to produce relatively robust results despite insufficient power ( Yang et al. , 2008 ; Zou et al. , 2012 ). Second, the analysis presented here was performed only on microarray data from the same platform because the ordering of gene expression is sensitive to microarray platforms to some degree. One possible way of addressing this limitation is to filter out gene pairs with unstable ordering in datasets produced by different platforms.

Funding : The Natural Science Foundation of China (91029717, 81071646, 81372213, 81201822 and 81030029) and Research Fund for the Doctoral Program of Higher Education of China (20112307110011).

Conflict of interest : none declared.

REFERENCES

Alimonti
A
, et al. 
Subtle variations in Pten dose determine cancer susceptibility
Nat. Genet.
2010
, vol. 
42
 (pg. 
454
-
458
)
Allera-Moreau
C
, et al. 
DNA replication stress response involving PLK1, CDC6, POLQ, RAD51 and CLASPIN upregulation prognoses the outcome of early/mid-stage non-small cell lung cancer patients
Oncogenesis
2012
, vol. 
1
 pg. 
e30
 
Amato
A
, et al. 
CENPA overexpression promotes genome instability in pRb-depleted human cells
Mol. Cancer
2009
, vol. 
8
 pg. 
119
 
Bartkova
J
, et al. 
Oncogene-induced senescence is part of the tumorigenesis barrier imposed by DNA damage checkpoints
Nature
2006
, vol. 
444
 (pg. 
633
-
637
)
Botling
J
, et al. 
Biomarker discovery in non-small cell lung cancer: integrating gene expression profiling, meta-analysis, and tissue microarray validation
Clin. Cancer Res.
2012
, vol. 
19
 (pg. 
194
-
204
)
Breitling
R
, et al. 
Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments
FEBS Lett.
2004
, vol. 
573
 (pg. 
83
-
92
)
Chen
DT
, et al. 
Proliferative genes dominate malignancy-risk gene signature in histologically-normal breast tissue
Breast Cancer Res. Treat.
2009
, vol. 
119
 (pg. 
335
-
346
)
Clarke
C
, et al. 
Correlating transcriptional networks to breast cancer survival: a large-scale coexpression analysis
Carcinogenesis
2013
, vol. 
34
 (pg. 
2300
-
2308
)
Cox
DR
Regression Models and Life-Tables
J. R. Stat. Soc. Ser. B Methodol.
1972
, vol. 
34
 (pg. 
187
-
220
)
de Ronde
JJ
, et al. 
Identifying subgroup markers in heterogeneous populations
Nucleic Acids Res.
2013
, vol. 
41
 pg. 
e200
 
Dedeurwaerder
S
, et al. 
DNA methylation profiling reveals a predominant immune component in breast cancers
EMBO Mol. Med
2011
, vol. 
3
 (pg. 
726
-
741
)
Dembele
D
Kastner
P
Fold change rank ordering statistics: a new method for detecting differentially expressed genes
BMC Bioinformatics
2014
, vol. 
15
 pg. 
14
 
Edgar
R
, et al. 
Gene expression omnibus: NCBI gene expression and hybridization array data repository
Nucleic Acids Res.
2002
, vol. 
30
 (pg. 
207
-
210
)
Geman
D
, et al. 
Classifying gene expression profiles from pairwise mRNA comparisons
Stat. Appl. Genet. Mol. Biol.
2004
, vol. 
3
  
Article19
Gray
RJ
A class of K-sample tests for comparing the cumulative incidence of a competing risk
Inst. Math. Stat.
1988
, vol. 
16
 (pg. 
1141
-
1154
)
Harrell
FE
Jr
, et al. 
Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors
Stat. Med
1996
, vol. 
15
 (pg. 
361
-
387
)
Hawthorn
L
, et al. 
Integration of transcript expression, copy number and LOH analysis of infiltrating ductal carcinoma of the breast
BMC Cancer
2010
, vol. 
10
 pg. 
460
 
Heinaniemi
M
, et al. 
Gene-pair expression signatures reveal lineage control
Nat. Methods
2013
, vol. 
10
 (pg. 
577
-
583
)
Hochberg
Y
Benjamini
Y
Controlling the false discovery rate: a practical and powerful approach to multiple testing
J. R. Stat. Soc. Ser. B Methodol.
1995
, vol. 
57
 (pg. 
289
-
300
)
Hollander
M
Wolfe
DA
Nonparametric Statistical Methods
1973
New York
John Wiley & Sons
Hong
G
, et al. 
Separate enrichment analysis of pathways for up- and downregulated genes
J. R. Soc. Interface
2013
, vol. 
11
 pg. 
20130950
 
Hou
J
, et al. 
Gene expression-based classification of non-small cell lung carcinomas and survival prediction
PLoS One
2010
, vol. 
5
 pg. 
e10312
 
Hu
J
Cancer outlier detection based on likelihood ratio test
Bioinformatics
2008
, vol. 
24
 (pg. 
2193
-
2199
)
Irizarry
RA
, et al. 
Exploration, normalization, and summaries of high density oligonucleotide array probe level data
Biostatistics
2003
, vol. 
4
 (pg. 
249
-
264
)
Karrila
S
, et al. 
A comparison of methods for data-driven cancer outlier discovery, and an application scheme to semisupervised predictive biomarker discovery
Cancer Inform.
2011
, vol. 
10
 (pg. 
109
-
120
)
Kretschmer
C
, et al. 
Identification of early molecular markers for breast cancer
Mol. Cancer
2011
, vol. 
10
 pg. 
15
 
Lazar
C
, et al. 
Batch effect removal methods for microarray gene expression data integration: a survey
Brief Bioinform.
2012
, vol. 
14
 (pg. 
469
-
490
)
Leek
JT
, et al. 
Tackling the widespread and critical impact of batch effects in high-throughput data
Nat. Rev. Genet.
2010
, vol. 
11
 (pg. 
733
-
739
)
Lehmann
EL
Nonparametrics: Statistical Methods Based on Ranks
1975
San Francisco
Holden-Day
Lian
H
MOST: detecting cancer differential gene expression
Biostatistics
2008
, vol. 
9
 (pg. 
411
-
418
)
Lu
TP
, et al. 
Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in nonsmoking women
Cancer Epidemiol. Biomarkers Prev.
2010
, vol. 
19
 (pg. 
2590
-
2597
)
Meier
P
Kalpan
EL
Nonparametric estimation from incomplete observations
J. Am. Stat. Assoc.
1958
, vol. 
53
 (pg. 
457
-
481
)
Navon
R
, et al. 
Novel rank-based statistical methods reveal microRNAs with differential expression in multiple cancer types
PLoS One
2009
, vol. 
4
 pg. 
e8003
 
Okayama
H
, et al. 
Identification of genes upregulated in ALK-positive and EGFR/KRAS/ALK-negative lung adenocarcinomas
Cancer Res.
2011
, vol. 
72
 (pg. 
100
-
111
)
Pedraza
V
, et al. 
Gene expression signatures in breast cancer distinguish phenotype characteristics, histologic subtypes, and tumor invasiveness
Cancer
2009
, vol. 
116
 (pg. 
486
-
496
)
Rice
JA
Mathematical Statistics and Data Analysis
1995
 
second edition. Belmont, CA: Duxbury Press
Richardson
AL
, et al. 
X chromosomal abnormalities in basal-like human breast cancer
Cancer Cell
2006
, vol. 
9
 (pg. 
121
-
132
)
Rousseaux
S
, et al. 
Ectopic activation of germline and placental genes identifies aggressive metastasis-prone lung cancers
Sci. Transl. Med.
2013
, vol. 
5
  
186ra166
Russo
J
, et al. 
Pregnancy-induced chromatin remodeling in the breast of postmenopausal women
Int. J. Cancer
2011
, vol. 
131
 (pg. 
1059
-
1070
)
Sanchez-Palencia
A
, et al. 
Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer
Int. J. Cancer
2010
, vol. 
129
 (pg. 
355
-
364
)
Shiraishi
T
, et al. 
Large-scale analysis of network bistability for human cancers
PLoS Comput. Biol.
2010
, vol. 
6
 pg. 
e1000851
 
Smyth
GK
Linear models and empirical bayes methods for assessing differential expression in microarray experiments
Stat. Appl. Genet. Mol. Biol.
2004
, vol. 
3
  
Article3
Subramanian
A
, et al. 
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
Proc. Natl Acad. Sci. USA
2005
, vol. 
102
 (pg. 
15545
-
15550
)
Tan
AC
, et al. 
Simple decision rules for classifying human cancers from gene expression profiles
Bioinformatics
2005
, vol. 
21
 (pg. 
3896
-
3904
)
Tibshirani
R
Hastie
T
Outlier sums for differential gene expression analysis
Biostatistics
2007
, vol. 
8
 (pg. 
2
-
8
)
Tomlins
SA
, et al. 
Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer
Science
2005
, vol. 
310
 (pg. 
644
-
648
)
Tusher
VG
, et al. 
Significance analysis of microarrays applied to the ionizing radiation response
Proc. Natl Acad. Sci. USA
2001
, vol. 
98
 (pg. 
5116
-
5121
)
Wang
D
, et al. 
Extensive increase of microarray signals in cancers calls for novel normalization assumptions
Comput. Biol. Chem.
2011
, vol. 
35
 (pg. 
126
-
130
)
Wang
Y
, et al. 
Weighted change-point method for detecting differential gene expression in breast cancer microarray data
PLoS One
2012
, vol. 
7
 pg. 
e29860
 
Wei
TY
, et al. 
Protein arginine methyltransferase 5 is a potential oncoprotein that upregulates G1 cyclins/cyclin-dependent kinases and the phosphoinositide 3-kinase/AKT signaling cascade
Cancer Sci.
2012
, vol. 
103
 (pg. 
1640
-
1650
)
Wilcoxon
F
Individual comparisons by ranking methods
Biometr. Bull.
1945
, vol. 
1
 (pg. 
80
-
83
)
Wu
B
Cancer outlier differential gene expression detection
Biostatistics
2007
, vol. 
8
 (pg. 
566
-
575
)
Xie
Y
, et al. 
Robust gene expression signature from formalin-fixed paraffin-embedded samples predicts prognosis of non-small-cell lung cancer patients
Clin. Cancer Res.
2011
, vol. 
17
 (pg. 
5705
-
5714
)
Yang
D
, et al. 
Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories
Bioinformatics
2008
, vol. 
24
 (pg. 
265
-
271
)
Zou
J
, et al. 
Revealing weak differential gene expressions and their reproducible functions associated with breast cancer metastasis
Comput. Biol. Chem.
2012
, vol. 
39
 (pg. 
1
-
5
)

Author notes

Associate Editor: Ziv Bar-Joseph

Supplementary data