Abstract

Motivation: Gene class testing (GCT) is a statistical approach to determine whether some functionally predefined classes of genes express differently under two experimental conditions. GCT computes the P-value of each gene class based on the null distribution and the gene classes are ranked for importance in accordance with their P-values. Currently, two null hypotheses have been considered: the Q1 hypothesis tests the relative strength of association with the phenotypes among the gene classes, and the Q2 hypothesis assesses the statistical significance. These two hypotheses are related but not equivalent.

Method: We investigate three one-sided and two two-sided test statistics under Q1 and Q2. The null distributions of gene classes under Q1 are generated by permuting gene labels and the null distributions under Q2 are generated by permuting samples.

Results: We applied the five statistics to a diabetes dataset with 143 gene classes and to a breast cancer dataset with 508 GO (Gene Ontology) terms. In each statistic, the null distributions of the gene classes under Q1 are different from those under Q2 in both datasets, and their rankings can be different too. We clarify the one-sided and two-sided hypotheses, and discuss some issues regarding the Q1 and Q2 hypotheses for gene class ranking in the GCT. Because Q1 does not deal with correlations among genes, we prefer test based on Q2.

Contact:jchen@nctr.fda.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

A common goal of microarray experiments is to identify a list of genes that express differently among experimental conditions of interest. Much initial research has focused on the development of techniques to rank each individual gene in the order of the evidence of differential expression and to determine a cutoff to divide the genes into two lists, differential expression and non-differential expression. Typically, the P-value of a test statistic is calculated for each gene, using either a parametric test or a permutation test. The genes are ranked according to the P-values, and a pre-specified significance criterion, such as the false discovery rate (FDR), is used to determine the cutoff. Small sample sizes coupled with efforts to maintain low FDRs often result in a short significant list. With this approach, the pathways with moderate effects may not be captured by the genes in the list. In addition, the interpretation of every gene in the list is challenging and laborious.

Recently, in addition to attempting to understand biological functions for each individual gene, gene class testing (GCT) or over-representation analysis (ORA) has been proposed for gene expression analysis. A gene class refers to a group of genes with related functions or a set of genes grouped together based on biologically relevant information (Pavlidis et al., 2004). For example, a class can represent a signal transduction, metabolic pathway, protein complex or GO (Gene Ontology) category. A common ORA approach involves comparing the genes in the differential expression list with the number that are expected under a particular null hypothesis to determine if a predefined functional class of genes is over-represented. The Fisher's exact test is typically used to assess the significance for an over-representation (Draghici et al., 2003). Several authors have pointed out shortcomings of this approach (e.g. Pavlidis et al., 2004; Tian et al., 2005); they proposed alternative analysis procedures by considering the distribution of genes in each gene class over the entire list. A typical GCT approach can be summarized as follow. (1) All genes are ranked by computing a test statistic (or P-value). (2) For each gene class or functional category, a class score is calculated as a summary measure of the class. (3) Re-sampling methods are used to generate the null distribution of the class score for each gene class. (4) Statistical significance is assessed by comparing the observed functional score to the percentile of the null distribution of the gene class.

Tian et al. (2005) presented two hypotheses for GCT. Hypothesis Q1 tests the relative strengths of the associations of the genes in a gene class with the experimental conditions as compared to the genes outside the gene class. That is, it tests if treatment effects in a gene class are random samples of the treatment effects observed among all genes in the array. Hypothesis Q2 tests if there are treatment effects in the gene class. In other words, hypothesis Q1 ranks the relative association of differential expressions among the gene classes, while the Q2 hypothesis ranks statistical significance of each gene class. Tian et al. (2005) proposed one statistic for testing Q1 and a different statistic for testing Q2. The null distribution of the statistic for Q1 was generated by permuting genes, and the null distribution for testing Q2 was generated by permuting samples.

O’Brien (1984) proposed an ordinary least squares (OLS) and a generalized least squares (GLS) global test statistics for the one-sided alternative (either up- or down-regulation) in the context of testing for an overall treatment effect on multiple clinical endpoints (all endpoints are favorably affected). The GLS is more powerful than the OLS. However, there are a few problems with the GLS (e.g. Pocock et al., 1987). Tang et al. (1993) recommended use of the OLS test. Similar to O’Brien's OLS, Läuter (1996) proposed a standardized weighted sum (LWS) statistic. Both OLS and LWS tests were shown to perform well in testing multiple clinical endpoints.

In this article a one-sided test statistic means that the changes of individual gene expressions in a gene class are all in one direction: either up or down. A two-sided test statistic means that the changes of individual gene expressions in the gene class can be up-regulated for some genes and down-regulated for other genes. The gene-set enrichment analysis (GSEA) (Mootha et al., 2003) was a one-sided test for a down-regulated alternative. The GSEA generated the null distribution by permuting the samples. The Tian et al. (2005) statistics were one-sided tests. Pavlidis et al. (2004) proposed a function scoring analysis using the average of the sum of (−log P)’s of individual P-values from a single univariate test; the univariate test can be either one-sided or two-sided. The function scoring analysis generated the null distribution by re-sampling (permuting) genes. The Fisher's exact test in the ORA is an analysis of permuting genes where the genes are dichotomized into differentially and non-differentially expressed groups. Goeman et al. (2004) proposed a two-sided parametric test using the generalized linear model by assuming random effects of gene expressions. Under the null hypothesis, the test statistic is asymptotic normal.

All authors, except Tian et al. (2005), did not distinguish Q1 and Q2. The P-values were computed, either by a parametric method or by a re-sampling (either permuting genes or permuting samples) method; and the P-values were used for gene class rankings as well as significance assessment. Therefore, they did not deal with the issue of different null distribution and different rankings under Q1 and Q2. On the other hand, Tian et al. (2005) applied their tests to a diabetes dataset that was presented by the GSEA analysis (Mootha et al., 2003). Their gene class rankings for the enrichment pathways are very different from the rankings by the GSEA, except the most significant pathway. Tian et al. (2005) commented that the differences were due to the fact that the null distributions of the GSEA statistic were not the same among gene classes. However, the Tian et al. (2005) gene class rankings in their Tables 1 and 2 for Q1 and Q2 were somewhat different.

Table 1.

Summary statistics of null distributions of the diabetes data: the average of the means and average of the SDs over the 143 gene classes of the five GCT statistics, Tols, Tlws, ES, |ES| and Ttwo, under Q1 and Q2, and the correlations between Q1 and Q2 for the means and SDs

Test Stat Hypothesis Q1 Hypothesis Q2 Correlation 
 Mean (SD) SD Mean (SD) SD Mean SD 
Tols 0.473 (0.28) 0.983 (0.15) −0.016 (0.01) 1.075 (0.01) −0.47 0.31 
Tlws 0.486 (0.29) 0.953 (0.14) −0.004 (0.01) 1.060 (0.01) −0.24 −0.06 
Ttwo 1.101 (0.07) 0.483 (0.07) 0.944 (0.05) 0.424 (0.11) 0.45 0.46 
ES 61.17 (3.23) 33.72 (0.99) 66.43 (11.2) 43.2 (16.20) 0.61 0.37 
|ES| 61.06 (3.11) 33.65 (0.95) 62.45 (5.10) 43.61 (6.14) 0.74 0.37 
Test Stat Hypothesis Q1 Hypothesis Q2 Correlation 
 Mean (SD) SD Mean (SD) SD Mean SD 
Tols 0.473 (0.28) 0.983 (0.15) −0.016 (0.01) 1.075 (0.01) −0.47 0.31 
Tlws 0.486 (0.29) 0.953 (0.14) −0.004 (0.01) 1.060 (0.01) −0.24 −0.06 
Ttwo 1.101 (0.07) 0.483 (0.07) 0.944 (0.05) 0.424 (0.11) 0.45 0.46 
ES 61.17 (3.23) 33.72 (0.99) 66.43 (11.2) 43.2 (16.20) 0.61 0.37 
|ES| 61.06 (3.11) 33.65 (0.95) 62.45 (5.10) 43.61 (6.14) 0.74 0.37 

The SDs of the averages are in parentheses. The results are the average over 10 000 replications.

Table 2.

Results of the one-sided GCT test for the diabetes dataset

ID Size Tols statistic ES statistic 
  Parametric Q1 Q2 Q1 Q2 
34 106 1(0.0058) 1(0.0049) 1(0.0047) 1(0.0001) 1(0.0043) 
38 84 2(0.0063) 2(0.0057) 3(0.0066) 7(0.0162) 4(0.0484) 
73* 19 3(0.0069) 3(0.0078) 2(0.0065) 114(0.910) 121(0.894) 
128 4(0.0096) 4(0.0130) 4(0.0132) 13(0.0806) 3(0.0473) 
43 53 5(0.0232) 5(0.0339) 5(0.0232) 5(0.0003) 8(0.0592) 
440 9(0.0629) 8(0.0618) 9(0.0714) 2(0.0001) 9(0.0624) 
450 12(0.0703) 10(0.0760) 11(0.0765) 3(0.0001) 10(0.0654) 
216 13(0.0735) 14(0.1045) 13(0.0832) 4(0.0002) 11(0.0671) 
ID Size Tols statistic ES statistic 
  Parametric Q1 Q2 Q1 Q2 
34 106 1(0.0058) 1(0.0049) 1(0.0047) 1(0.0001) 1(0.0043) 
38 84 2(0.0063) 2(0.0057) 3(0.0066) 7(0.0162) 4(0.0484) 
73* 19 3(0.0069) 3(0.0078) 2(0.0065) 114(0.910) 121(0.894) 
128 4(0.0096) 4(0.0130) 4(0.0132) 13(0.0806) 3(0.0473) 
43 53 5(0.0232) 5(0.0339) 5(0.0232) 5(0.0003) 8(0.0592) 
440 9(0.0629) 8(0.0618) 9(0.0714) 2(0.0001) 9(0.0624) 
450 12(0.0703) 10(0.0760) 11(0.0765) 3(0.0001) 10(0.0654) 
216 13(0.0735) 14(0.1045) 13(0.0832) 4(0.0002) 11(0.0671) 

*up-regulation.

Eight top ranked gene classes are selected from the combination of the five smallest permutation P-value (in parentheses) from the Tols and GSEA (ES) statistics under each hypothesis Q1 and Q2. The gene classes are listed according to the observed Tols statistic with the P-values computed by the normal approximation (Columns 3).

In this article, we investigate the null distributions of gene classes under Q1 and under Q2 for three one-sided and two two-sided test statistics. In one-sided test, we consider the GESA statistic and O’Brien's OLS (1984) and Läuter's LWS (1996) global statistics. In two-sided test, we consider a modified GESA statistic and a sum of the standardized differences statistic. The five statistics are applied to the diabetes dataset and a breast cancer dataset.

2 METHOD

Consider a microarray study of m genes with two experimental conditions (phenotypes) of sample sizes n1 and n2. Assume there are I gene classes of interest C1, …,CI and class Ci consists of mi genes. For gene class i, let ygjk be the normalized intensity of gene g (g = 1, …, mi) for sample j (j = 1, …, nk) in phenotype k (k = 1,2). Denote the standardized variable forumla, where yg.k is the sample mean for phenotype k and sg is the pooled sample SD for the g-th gene. Let d be an mi-dimensional vector for the mean difference of the standardized variables between two phenotypes. The O’Brien's OLS statistic is  

formula
where 1 is an mi × 1 vector of 1’s and V is the pooled sample correlation matrix of d. If d is multivariate normal, then the statistic, Tols, has an approximate t distribution with n1 + n2 − 2 degrees of freedom under the Q2 hypothesis. When sample size is small, Tols does not keep the prescribed level of significance (Läuter, 1996). Läuter (1996) proposed an exact test by standardizing the variables by the overall mean (zg. = yg.) and standard deviation (wg) for the gene class Ci, zgjk = (ygjk − zg.)/wg. He then applied the two-sample t-test to the sum of the standardized variables z.jk = Σgzgjk. For each gene class, let z.k = Σjz.jk/nk be the mean and w be the pooled SD for the standardized variables z.jk’s. Läuter's LWS test is  
formula

Under hypothesis Q2 Tlws has an exact t distribution with n1 + n2 − 2 degrees of freedom. Both Tols and Tlws can be used to test for Q1 or Q2; and both are a one-sided test either up- or down-regulation. The two statistics asymptotically have a standard normal distribution under Q2. We used the parametric normal approximation and permutation methods to compute the P-values and compared their gene rankings under Q1 and Q2. The ranking order based on permutation P-values may differ from the ranking order based on t-values if test statistics are not identically distributed (Tsai and Chen, 2004). We also consider the GESA (ES) statistic in the investigation (Mootha et al., 2003). The original ES is proposed for a down-regulated alternative based on the ordered signal-to-noise ratio (SNR). The ES statistic based on the negative of the SNR is for an up-regulated alternative, denoted by ESU. The ESU is also considered in the investigation.

For a two-sided hypothesis, we consider two statistics. For the first statistic, we consider a two-sided |ES| statistic based on the absolute value of the SNR difference metric, instead of the SNR. Thus, the |ES| has the same expression as the ES statistic (Mootha et al., 2003),  

formula

The second statistic is based on the sum of the standardized differences across genes in the gene class:  

formula
where yg.1 and yg.2 are the sample means, and sg is the sample variance. The null distribution of Ttwo is mathematically intractable, and a permutation test is used to compute the P-values.

A P-value of 0.05 has been used as the significance level for a gene class, (e.g. Goeman et al., 2004; Mooth et al.2003; Pavlidis et al., 2004). Since many gene classes are tested, Tian et al. (2005) and Barry et al. (2005) used the FDR approach (Benjamini and Hochberg, 1995) to account for multiple tests. Both used FDR at 0.10 for the significance level of a gene class. The FDR or familywise-error-rate (FWE) approach emphasizes the false positive error in determining the cutoff threshold. Since microarray experiments are exploratory in nature, some important gene classes might have larger P-values and the false negative error can also be of concern. We consider a receiver operating characteristic (ROC) approach to determine an optimal cutoff based on minimizing the total cost from false positive and false negative errors (Delongchamp et al., 2004). Let p(r) denote the r-th ordered P-value with p(1) is the smallest and p(2) is the second smallest, etc. Let CFP be a cost assigned to a false positive error and CFN be a cost assigned to a false negative error. The optimal cutoff minimizes the total expected cost:  

formula
where L0 and L1 are the numbers of truly non-differentially and differentially expressed gene classes, respectively. The optimal cutoff r can be estimated numerically. The ROC approach requires knowledge of L0 and L1; we used an estimate derived from the mean difference method (Hsueh et al., 2003) in the examples below.

A GCT analysis assigns an overall statistical significance of phenotype differences in gene expression for a gene class. It does not identify which genes in the gene class actually contribute to the phenotype difference. After identifying gene classes that show a difference, the standard univariate test can be used to identify which genes in the gene class are significant. However, in testing individual genes in a gene class, the total number of the tests in a gene class is mi, instead of m genes in the array. That is, for a given significant gene class identified, a multiple test procedure such as FDR or FWE should be only based on mi comparisons. In the follow-up analysis, we are interested in which genes are significant in the given significant gene class. A gene may be significant in one gene class, but, insignificant in another.

3 RESULTS

3.1 Null distributions of the GCT statistics

In a Monte Carlo simulation, Tian et al. (2005) showed that the null distributions of the GSEA’s ES statistic (Mootha et al., 2003) differ markedly over gene classes. We conducted a similar evaluation of the five GCT statistics using the same diabetes dataset. In addition, we also evaluated an alternative two-sided statistic based on the average of the absolute values of the t-statistics:  

formula
where tg’s are two sample t-statistics. We evaluated this statistic since its one-sided statistic (mi)1/2ΣgCi (tg/mi) has an asymptotical N(0,1) and was mentioned by Tian et al. (2005).

The diabetes dataset consisted of 143 gene classes from 10 526 genes measured on 17 subjects with normal glucose tolerance and 18 subjects with type 2 diabetes mellitus. Class sizes range from 1 to 450. We simulated the null distribution of each gene class under Q1 by permuting gene labels and simulated the null distribution of each gene class under Q2 by permuting samples. For each simulated sample, we computed the Tols, Tlws, Ttwo, Tav, ES and |ES| statistics. The simulation was repeated 10 000 times to obtain null distributions for each of the 143 gene classes for the six GCT statistics. The means and SDs over the 10 000 replicates were computed for each of the 143 gene classes. Figure 1 plots the means for the 143 gene classes versus the class sizes for the six statistics under Q1 and under Q2.

Fig. 1.

Mean (y-axis) of the null distributions among the 143 gene classes versus class size (x-axis) under Q1 and under Q2. The gene classes are ordered from small to large class sizes. The means are the averages of 10 000 replicates. The upper panel is for Q1 and the lower panel is for Q2.

Fig. 1.

Mean (y-axis) of the null distributions among the 143 gene classes versus class size (x-axis) under Q1 and under Q2. The gene classes are ordered from small to large class sizes. The means are the averages of 10 000 replicates. The upper panel is for Q1 and the lower panel is for Q2.

For hypothesis Q1, except Ttwo, the means of the null distributions increase as the class size increases. For hypothesis Q2, the means of the null distributions for Tols and Tlws are fairly constant. The means in Ttwo are somewhat similar, as are the means in |ES|. The means in ES are rather different. Also, the means for the 143 gene classes in Tols are similar to the corresponding means in Tlws under either Q1 or Q2. For the Tav statistic, the mean increases as the class sizes increases under either Q1 or Q2. The gene class of size one has the smallest means ranging 0.83–0.86, and the largest class size has the largest mean of 17.97. It is apparent that the null distributions for the 143 classes are very different. Since under Q2, the null distribution should not depend on the class size, the Tav statistic will not be considered in the further evaluation.

Table 1 shows the average of the means and the average of the SDs over the 143 gene classes of the five GCT statistics under Q1 and under Q2, and the correlation coefficients between Q1 and Q2 for the mean and the SD. Unlike Tols, Tlws and Ttwo, the averages of the SDs for ES and |ES| are rather large under either Q1 or Q2. Both statistics are a summation of quantities which depend on the class size. When the class sizes are different, the summations would result in large differences.

In each Tols, Tlws and Ttwo, the SDs of the averaged means and averaged SDs are small under either Q1 or Q2. That is, the 143 genes classes have almost the same means and same SDs. In ES and |ES|, under Q1 both the averaged means and averaged SDs have relatively small SDs. Under Q2 the means as well as standard deviations for the 143 null distributions are very different for ES. For example, the means range from 51.97 to 118.46 and the SDs range from 26.77 to 115.33 (Fig. 1). These results are very similar to the results of Tian et al. (2005). The means for |ES| are more similar; the means range from 51.57 to 82.55 and the SDs range from 29.16 to 76.77.

Under Q2, the means for Tols and Tlws are close to 0 and the SDs are slightly larger than 1. These values are in good agreements with the theoretical results that the null distributions of the two statistics have an asymptotic N(0,1). A deviation from 0, either positive or negative, indicates favoring an alternative hypothesis, either an up- or down-regulated pathway. Under Q1, the means for Tols and Tlws are about 0.48 and the SDs are slightly less than 1. The null distributions under Q1 do not appear to have an asymptotic N(0,1). For Ttwo, a large value indicates a difference, positive and/or negative. For the GESA statistics, a large value of ES favors down-regulation; also, a large value of ESU favors up-regulation. However, unlike Tols and Tlws, ES and ESU need to be computed separately. Similarly, a large |ES| indicates a difference. In the further analysis we will report only the results from the Tols since the null distributions for the two one-sided statistics Tols and Tlws are very similar.

In each of the five GCT statistics, the distribution of each gene class under Q1 appears to be different from the corresponding distribution under Q2 (Fig. 1). In other words, the null distributions of the 143 gene classes generated by permuting genes (under Q1) differ from the null distributions generated by permuting phenotypes (under Q2). The correlations between Q1 and Q2 are low for Tols, Tlws and Tols, and higher for ES and |ES|. Figure 2 shows the scatter plots of the means of null distributions for Q1 versus Q2 over the 143 gene classes. Tols and Ttwo show no clear correlation between Q1 and Q2, while ES and |ES| show positive correlations. However, these positive correlations could be from an indirect effect of class sizes.

Fig. 2.

Scatter plots of the means of null distributions between Q1 and Q2. Each point represents a gene set.

Fig. 2.

Scatter plots of the means of null distributions between Q1 and Q2. Each point represents a gene set.

In testing Q1 and Q2, different null distributions could result in different gene rankings. Under Q2, the P-values of the gene classes are calculated under no difference between two phenotypes. If the null distributions of the gene classes are identical, e.g. Tols or Tlws under Q2,, then these P-values are comparable and can be used for ranking of gene classes as well as identifying differential expression. The P-value computed under Q2 is consistent with the principle of statistical significance testing. The evaluation and analysis of the GCT tests will be primarily based on Q2.

3.2 GCT analysis of diabetes data

Table 2 lists eight gene classes from the one-sided Tols and ES tests under Q1 and under Q2, and the corresponding P-values from the observed Tols using the normal approximation. These are the eight most significant gene classes selected from the five smallest P-values in each test under each hypothesis.

For Tols, the P-values from the parametric and permutation methods are in good agreement. Despite the low correlation between Q1 and Q2 (Table 1), the rankings under Q1 and under Q2 are similar. For ES, under Q1 the four top gene classes all have large class sizes. It is worth mentioning that the P-values for gene class #73 are 0.914 and 0.898 for Q1 and Q2, respectively. This gene class is up-regulated and ranked second or third by Tols. The Tols test is designed to identify either up- or down-regulated gene class, while ES here is designed to identify only the down-regulated gene class. The P-values of ESU for an up-regulated alternative are 0.037 and 0.045 (<0.01) for Q1 and Q2, respectively.

Table 3 lists eight gene classes from the two-sided |ES| and Ttwo statistic under Q1 and under Q2, and the corresponding ranks from Tols are also shown. Seven of the eight, except for gene class #1, represent the most significant gene classes. For |ES|, the rankings under Q1 and under Q2 are similar. Like ES, the four top-ranked gene classes in |ES| all have large class sizes under Q1. For Ttwo, the rankings under Q1 and under Q2 are different. The rankings in |ES| differ from the rankings in Ttwo under either Q1 or Q2. For example, the two larges classes #450 and #440, which are ranked first and second in |ES|, are ranked 47 and 9 in Ttwo under Q1 and ranked 70 and 31 under Q2.

Table 3.

Results of the one-sided GCT test for the diabetes dataset

ID Size |ES| statistic Ttwo statistic Tols statistic 
  Q1 Q2 Q1 Q2 Q1 Q2 
450 1(0.0000) 6(0.0152) 47(0.2847) 70(0.3503) 10 11 
440 2(0.0000) 5(0.0145) 9(0.0559) 31(0.1461) 
34 106 3(0.0000) 2(0.0078) 6(0.0526) 11(0.0316) 
216 4(0.0011) 3(0.0124) 21(0.0957) 19(0.0708) 14 13 
73 19 5(0.0014) 1(0.0033) 25(0.1243) 22(0.0878) 
        
43 53 6(0.0077) 11(0.0410) 4(0.0512) 8(0.0283) 
105 7(0.0090) 4(0.0129) 7(0.0540) 10(0.0314) 120 123 
128 33(0.1253) 30(0.1162) 2(0.0258) 7(0.0252) 
ID Size |ES| statistic Ttwo statistic Tols statistic 
  Q1 Q2 Q1 Q2 Q1 Q2 
450 1(0.0000) 6(0.0152) 47(0.2847) 70(0.3503) 10 11 
440 2(0.0000) 5(0.0145) 9(0.0559) 31(0.1461) 
34 106 3(0.0000) 2(0.0078) 6(0.0526) 11(0.0316) 
216 4(0.0011) 3(0.0124) 21(0.0957) 19(0.0708) 14 13 
73 19 5(0.0014) 1(0.0033) 25(0.1243) 22(0.0878) 
        
43 53 6(0.0077) 11(0.0410) 4(0.0512) 8(0.0283) 
105 7(0.0090) 4(0.0129) 7(0.0540) 10(0.0314) 120 123 
128 33(0.1253) 30(0.1162) 2(0.0258) 7(0.0252) 

Eight gene classes are selected from the combination of the five smallest P-value (in parentheses) from |ES| and Ttwo under each hypothesis Q1 and Q2. The gene classes are listed according to P-values of |ES| under Q1. The ranks from the one-sided Tols are shown for comparison.

The top-ranked genes from the two-sided and one-sided tests are different, as expected. Among the two lists (Tables 2 and 3), there are six gene classes in common. Gene class #38, which is ranked second and third by Tols, is not in Table 3. On the other hand, gene class #105 is highly ranked in Table 3, but is not in Table 2. This is due to the fact that Tols lacks power when the gene class involves both up- and down-regulated genes. There are seven genes in the gene class #105, three are up-regulated and four are down-regulated. The remaining analysis of the diabetes data will focus on the one-sided Tols test.

Of the 143 gene classes analyzed, using the significance level of α = 0.05, the numbers of significances for the one-sided test are 7 and 6 for Q1 and Q2, respectively; and the numbers for the two-sided test are 10 and 15. Interestingly, Tian et al. (2003) discussed the three significant gene classes: (1) OXPHOS (oxidative phosphorylation) #34, (2) MAP00910 (nitrogen metabolism) #73 and (3) c22-U133 (cell growth and maintenance) #38. These three gene classes are ranked 1, 2 and 3 by the Tols (and Tlws) tests. Tian et al. (2003) also discussed that the finding of the gene class OXPHOS was supported by another gene class MAP00190 #43 which is ranked fifth by the Tols test. It is worth mentioning that the P-value of the OXPHOS gene class originally identified by the GSEA analysis (Mootha et al., 2003) is 0.029.

None of the gene classes has the FDR less than 0.10. For OXPHOS, the P-value from the Tols test is 0.0047 with the corresponding FDR = 0.323 under Q2 (note that Tian's analysis was based on 26 gene classes in their multiplicity adjustment). Alternatively, with the ROC approach, we assume the relative cost between CFP and CFN is 1 to 2, i.e. 2CFP = CFN (a two-fold higher cost on omitting potentially differentially expressed gene classes than on falsely selecting differentially expressed gene classes). Using the mean difference method (Hsueh et al., 2003), an estimate of L0 is 141. The optimal cutoff occurs at r = 2 with the corresponding p(2) = 0.0072. That is, two gene classes are selected when both false positive and false negative errors are taken into consideration. The individual genes in the significance gene class can be further investigated. For the gene class OXPHOS, it has 106 genes. Only one gene with the (smallest) P-value of 0.007 is significant at FDR = 0.1.

3.3 GCT analysis of breast cancer data

The breast cancer data was presented by van’t Veer et al. (2002). The dataset contained approximately 25 000 genes from 78 primary breast cancers (34 from patients who developed distant metastases within 5 years and 44 from patients who continued to be disease-free after a period of at least 5 years). The purpose of this example is to illustrate a GCT analysis defined by GO term. To identify the GO terms that are most related to prognostic reporters for breast cancer metastases, 231 genes identified as predictors previously by van’t Veer et al. (2002) were used to generate the GO terms. One hundred and thirty-eight genes out of the 231 having gene names and without missing values were submitted to generate GO terms. A total of 512 GO terms for biological process was generated. The four top level terms, biological process, cellular process, physiological process and cellular physiological process were not used in the analysis resulting in 508 GO terms linked with 4137 genes.

We performed a similar simulation to study the null distributions of the five GCT test statistics. The results are similar to the results from the diabetes data. (See Supplementary Material, Table S1 and Figs S1 and S2.) Under Q1, the means of the null distribution increases as the class size increases for ES, |ES| and Ttwo, Under Q2, that the null distributions of Tols and Tlws have an asymptotic N(0, 1). ES, |ES| and Ttwo show high positive correlations. Only Tols and |ES| will be considered in the remaining analysis since they show better performance than ES (or ESU) and Ttwo, respectively.

Table 4 lists the five top gene classes under Q1 and the five top gene classes under Q2, where the class sizes are at least 10. The two lists have no gene in common. The gene classes in Q1 are all down-regulated, while the gene classes in Q2 are all up-regulated. That is, the rankings between Q1 and Q2 are completely different.

Table 4.

Results of the one-sided Tols test for the breast cancer dataset

ID Size Category Pathway Q1 Q2 
30 610 GO:0045449 Regulation of transcription 1(0.0000) 39(0.0631) 
32 582 GO:0006355 Regulation of transcription, DNA-dependent 2(0.0003) 46(0.0755) 
28 641 GO:0006350 Transcription 3(0.0005) 66(0.1038) 
29 627 GO:0019219 Regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism 4(0.0010) 74(0.1185) 
31 604 GO:0006351 Transcription, DNA-dependent 5(0.0018) 78(0.1320) 
381 11 GO:0006270 DNA replication initiation 6(0.0020) 1(0.0014) 
390 10 GO:0006541 Glutamine metabolism 16(0.0092) 2(0.0017) 
362 13 GO:0007059 Chromosome segregation 17(0.0095) 3(0.0026) 
277 28 GO:0006096 Glycolysis 12(0.0053) 4(0.0034) 
389 11 GO:0043087 Regulation of GTPase activity 23(0.0161) 5(0.0044) 
ID Size Category Pathway Q1 Q2 
30 610 GO:0045449 Regulation of transcription 1(0.0000) 39(0.0631) 
32 582 GO:0006355 Regulation of transcription, DNA-dependent 2(0.0003) 46(0.0755) 
28 641 GO:0006350 Transcription 3(0.0005) 66(0.1038) 
29 627 GO:0019219 Regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism 4(0.0010) 74(0.1185) 
31 604 GO:0006351 Transcription, DNA-dependent 5(0.0018) 78(0.1320) 
381 11 GO:0006270 DNA replication initiation 6(0.0020) 1(0.0014) 
390 10 GO:0006541 Glutamine metabolism 16(0.0092) 2(0.0017) 
362 13 GO:0007059 Chromosome segregation 17(0.0095) 3(0.0026) 
277 28 GO:0006096 Glycolysis 12(0.0053) 4(0.0034) 
389 11 GO:0043087 Regulation of GTPase activity 23(0.0161) 5(0.0044) 

The top five gene classes identified under Q2 are consistent with the biological mechanism leading to rapid metastases, the poor prognosis of breast cancers. DNA replication initiation, chromosome segregation and regulation of GTPase activity are involved in cell growth and cell cycle control. It is not surprising to find that the activities of genes involved in cell cycles are enhanced because cell growth rate in rapidly growing tumors is high. High rates of glutamine metabolism are related to advanced malignant tumors because glutamine is the major respiratory fuel for tumor cells and it also provides tumors substrates for nucleotide and protein biosynthesis. The activities for glycolysis and sugar catabolism (glucose and hexose catabolism, etc.) in rapidly growing cancer cells are generally increased for ATP generation.

Using the significance level at FDR = 0.10, the numbers of significant gene classes is 10 under Q2. In this dataset, the estimated number of L0 is 470. The individual genes in the significance gene class can be further investigated. The numbers of significant genes at FDR = 0.10 for the top five gene classes, GO:0006270, GO:000541, GO:0007059, GO:0006096 and GO:0043087 (Table 4), are 4, 4, 1, 13, and 1, respectively. Note that these numbers are based on the individual gene class sizes (mi) in the FDR calculation. The numbers of significant genes for the five gene classes would be 0, 2, 0, 2 and 1 if the total 4137 genes are used in the multiple comparison adjustment.

Table 5 lists top seven gene classes from the five smallest P-values for the two-sided |ES| statistic under Q1 and under Q2. Unlike Tols, the two rankings are very similar. Figure 3 plots the null distributions of gene class GO:0045449 under Q1 and Q2.

Table 5.

Results of the two-sided |ES| test for the breast cancer dataset

ID Size Category Pathway Q1 Q2 
47 320 GO:0007049 Cell cycle 1 (0.0002) 5 (0.0049) 
174 64 GO:0007067 Mitosis 2 (0.0004) 1 (0.0011) 
168 66 GO:0000087 M phase of mitotic cell cycle 3 (0.0006) 2 (0.0014) 
126 102 GO:0000278 Mitotic cell cycle 4 (0.0007) 6 (0.0057) 
142 83 GO:0000279 M phase 5 (0.0014) 10 (0.0107) 
176 63 GO:00051301 Cell division 6 (0.0014) 4 (0.0023) 
287 26 GO:0000226 Microtubule cytoskeleton organization and biogenesis 8 (0.0015) 3 (0.0016) 
ID Size Category Pathway Q1 Q2 
47 320 GO:0007049 Cell cycle 1 (0.0002) 5 (0.0049) 
174 64 GO:0007067 Mitosis 2 (0.0004) 1 (0.0011) 
168 66 GO:0000087 M phase of mitotic cell cycle 3 (0.0006) 2 (0.0014) 
126 102 GO:0000278 Mitotic cell cycle 4 (0.0007) 6 (0.0057) 
142 83 GO:0000279 M phase 5 (0.0014) 10 (0.0107) 
176 63 GO:00051301 Cell division 6 (0.0014) 4 (0.0023) 
287 26 GO:0000226 Microtubule cytoskeleton organization and biogenesis 8 (0.0015) 3 (0.0016) 
Fig. 3.

The null distributions of gene set GO:0045449 under Q1 and Q2.

Fig. 3.

The null distributions of gene set GO:0045449 under Q1 and Q2.

In summary, for Tols, the null distributions of gene classes uner Q1 and under Q2 are different (Table 1, Table S1 and Fig. 3). The class rankings are generally consistent for the diabetes data. (Table 2) But the rankings are very different for the breast cancer data (Table 4). On the other hand, |ES| shows more consistent rankings between Q1 andQ2 (Table 3, Table 5, and Fig. 3). It is

4 DISCUSSION

Biological phenomena often occur through the interactions of multiple genes, via signaling pathways, networks or other functional relationships. With rapid developments of genomic databases and availability of more comprehensive annotations, e.g. KEGG pathway maps, GoMiner (Zeeberg et al., 2003), Onto-Express (Khatri et al., 2002), FatiGo (Al-Shahrour et al., 2004), GCT can provide powerful which is easier to interpret analysis. GCT analysis shows a list of significant gene classes that are more differentially expressed among the classes than expected by chance; the list is ranked by the P-values of a test statistic.

In the context of Q1 and Q2, the significance analysis is tested by the Q2 hypothesis, while the gene class ranking is tested by the Q1 hypothesis. For the Q1 hypothesis, the null distributions are generated by permuting genes. Permuting the genes only reassigns individual genes to different gene classes. Q1 is a conditional test, conditional on the association strength between phenotypes and gene expressions in the observed data. Under Q1, the null distributions of different gene classes could be very different when there are differentially expressed genes between the phenotypes. For the Q2 hypothesis, the null distributions are generated by permuting samples. Under Q2, the null distribution of different gene classes should be identically distributed. Q2 is an unconditional test. The Q1 and Q2 hypotheses are highly related; however, the two rankings can be very different. When the number of differentially expressed genes is small, such as the diabetes dataset, Q1 is approximately equivalent to Q2, and the rankings under Q1 and Q2 will be similar. For the breast cancer dataset, many more genes are differentially expressed; the rankings under Q1 and Q2 are very different for Tols. Nevertheless, for |ES| the two rankings are similar. We will investigate this issue in the future.

We have concerns with permutations of genes in the Q1 hypothesis on two main points. First, its null distribution is essentially generated from the empirical distribution of the observed P-values, instead of the uniform distribution under the null hypothesis that no difference in gene expression exists between two phenotypes. When many genes are differentially expressed, this (empirical) null distribution will not represent the distribution of genes which are not differentially expressed. This hypothesis does not test whether the gene class is different in two phenotypes. Rather, it tests whether the observed difference between two phenotypes in the gene class is more or less than the average differences in the study. The second point is that the validity of this test depends upon exchangeability among the genes. Correlation structures likely to occur among genes are not exchangeable and result in computed significance levels by this test that can be far too small. Given these difficulties the utility of the Q1 test which shuffles genes is unclear.

On the other hand, the Q2 hypothesis is consistent with the current use of a permutation test, such as the significance analysis of microarray (Tusher et al., 2001), to select a list of differentially expressed genes. The P-value is calculated by comparing the observed statistic to the percentile of the null distribution generated by permuting samples. The P-values are used for gene ranking and significance assessment. Permutation of samples computes the P-values based on the random (independent) samples, and the test statistics are comparable.

For testing hypothesis Q2, the null distributions of Tols and Tlws appear to have asymptotic standard normal distributions for all gene classes. These two global statistics account for class size and correlation structure. These two statistics are recommended for identifying either up- or down-regulated gene classes. A significant gene class may consist of up- or down-regulated genes or a mixture of both. The two statistics lack power when the gene class consists of both up- and down-regulated genes. We investigate |ES| and Ttwo for identifying gene classes with a mixture of both up- and down-regulated genes. The Ttwo statistic tries to account for the magnitude of differences in which a gene with a large standardized difference may lead to a relatively higher score. The two statistics could have very different rankings. (Ttwo)2 is χ2 distributed under the model of independence among genes with an approximate equal variance. These assumptions obviously do not hold. The Ttwo statistics appears to perform unsatisfactorily.

There were concerns that the null distributions of the GSEA ES statistic were affected by the class sizes (Damian and Gorfine, 2004; Tian et al., 2005). The |ES| statistics shows more consistent rankings between Q1 andQ2. However, |ES| appears to be affected by the class size. The |ES| statistic determines whether the members of a given gene class show enrichment among the most differentially expressed genes by considering the rankings of differences. We are currently investigating |ES| and other statistics including the score statistic proposed by Goeman et al. (2004).

ACKNOWLEDGEMENTS

The views presented in this article do not necessarily reflect those of the U.S. Food and Drug Administration.

Conflict of Interest: none declared.

REFERENCES

Al-Shahrour
F
, et al.  . 
FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes
Bioinformatics
 , 
2004
, vol. 
20
 (pg. 
578
-
580
)
Barry
WT
, et al.  . 
Significance analysis of functional categories in gene expression studies: a structured permutation approach
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
1943
-
1949
)
Benjamini
Y
Hochberg
Y
Controlling the false discovery rate: a practical and powerful approach to multiple testing
J. R. Stat. Soc. B
 , 
1995
, vol. 
57
 (pg. 
289
-
300
)
Damian
D
Gorfine
M
Statistical concerns about the GSEA procedure
Nat. Genetic
 , 
2004
, vol. 
36
 (pg. 
663
-
663
)
Delongchamp
RR
, et al.  . 
Multiple testing strategy for analyzing cDNA array data on gene expression
Biometrics
 , 
2004
, vol. 
60
 (pg. 
774
-
782
)
Draghici
S
, et al.  . 
Global functional profiling of gene expression
Genomics
 , 
2003
, vol. 
81
 (pg. 
98
-
104
)
Goeman
JJ
, et al.  . 
A global test for groups of genes: testing association with a clinical outcome
Bioinformatics
 , 
2004
, vol. 
20
 (pg. 
93
-
99
)
Hsueh
H
, et al.  . 
Comparison of methods for estimating number of true null hypothesis in multiplicity testing
J. Biopharm. Stat
 , 
2003
, vol. 
13
 (pg. 
675
-
689
)
Khatri
P
, et al.  . 
Profiling gene expression using onto-express
Genomics
 , 
2002
, vol. 
79
 (pg. 
1
-
5
)
Läuter
J
Exact t and F tests for analyzing studies with multiple endpoints
Biometrics
 , 
1996
, vol. 
52
 (pg. 
964
-
970
)
Mootha
VK
, et al.  . 
PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes
Nat. Genetic
 , 
2003
, vol. 
34
 (pg. 
267
-
273
)
O'Brien
PC
Procedure for comparing samples with multiple endpoints
Biometrics
 , 
1984
, vol. 
40
 (pg. 
1079
-
1087
)
Pavlidis
P
, et al.  . 
Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex
Neurochem. Res
 , 
2004
, vol. 
29
 (pg. 
1213
-
1222
)
Pocock
SJ
, et al.  . 
The analysis of multiple endpoints in clinical trials
Biometrics
 , 
1987
, vol. 
43
 (pg. 
487
-
498
)
Tang
DI
, et al.  . 
On the design and analysis of randomized clinical trials with Multiple endpoints
Biometrics
 , 
1993
, vol. 
49
 (pg. 
23
-
30
)
Tian
L
, et al.  . 
Discovering statistically significant pathways in expression profiling studies
Proc. Natl Acad. Sci. USA
 , 
2005
, vol. 
102
 (pg. 
13544
-
13549
)
Tsai
C-A
Chen
JJ
Significance analysis of ROC indices for comparing diagnostic markers: applications to gene microarray data
J. Biopharm. Stat
 , 
2004
, vol. 
14
 (pg. 
985
-
1003
)
Tusher
VG
, et al.  . 
Significance analysis of microarrays applied to the ionizing radiation response
Proc. Natl Acad. Sci. USA
 , 
2001
, vol. 
98
 (pg. 
5116
-
5121
)
van't Veer
LJ
, et al.  . 
Gene expression profiling predicts clinical outcome of breast cancer
Nature
 , 
2002
, vol. 
415
 (pg. 
530
-
536
)
Zeeberg
BR
, et al.  . 
GoMiner: a resource for biological interpretation of genomic and proteomic data
Genome Biol
 , 
2003
, vol. 
4
 pg. 
R28
 

Author notes

Associate Editor: Martin Bishop

Comments

0 Comments