## Abstract

**Motivation:** Gene class testing (GCT) is a statistical approach to determine whether some functionally predefined classes of genes express differently under two experimental conditions. GCT computes the *P*-value of each gene class based on the null distribution and the gene classes are ranked for importance in accordance with their *P*-values. Currently, two null hypotheses have been considered: the Q1 hypothesis tests the relative strength of association with the phenotypes among the gene classes, and the Q2 hypothesis assesses the statistical significance. These two hypotheses are related but not equivalent.

**Method:** We investigate three one-sided and two two-sided test statistics under Q1 and Q2. The null distributions of gene classes under Q1 are generated by permuting gene labels and the null distributions under Q2 are generated by permuting samples.

**Results:** We applied the five statistics to a diabetes dataset with 143 gene classes and to a breast cancer dataset with 508 GO (Gene Ontology) terms. In each statistic, the null distributions of the gene classes under Q1 are different from those under Q2 in both datasets, and their rankings can be different too. We clarify the one-sided and two-sided hypotheses, and discuss some issues regarding the Q1 and Q2 hypotheses for gene class ranking in the GCT. Because Q1 does not deal with correlations among genes, we prefer test based on Q2.

**Contact:**jchen@nctr.fda.gov

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

A common goal of microarray experiments is to identify a list of genes that express differently among experimental conditions of interest. Much initial research has focused on the development of techniques to rank each individual gene in the order of the evidence of differential expression and to determine a cutoff to divide the genes into two lists, differential expression and non-differential expression. Typically, the *P*-value of a test statistic is calculated for each gene, using either a parametric test or a permutation test. The genes are ranked according to the *P*-values, and a pre-specified significance criterion, such as the false discovery rate (FDR), is used to determine the cutoff. Small sample sizes coupled with efforts to maintain low FDRs often result in a short significant list. With this approach, the pathways with moderate effects may not be captured by the genes in the list. In addition, the interpretation of every gene in the list is challenging and laborious.

Recently, in addition to attempting to understand biological functions for each individual gene, gene class testing (GCT) or over-representation analysis (ORA) has been proposed for gene expression analysis. A gene class refers to a group of genes with related functions or a set of genes grouped together based on biologically relevant information (Pavlidis *et al.*, 2004). For example, a class can represent a signal transduction, metabolic pathway, protein complex or GO (Gene Ontology) category. A common ORA approach involves comparing the genes in the differential expression list with the number that are expected under a particular null hypothesis to determine if a predefined functional class of genes is over-represented. The Fisher's exact test is typically used to assess the significance for an over-representation (Draghici *et al.*, 2003). Several authors have pointed out shortcomings of this approach (e.g. Pavlidis *et al.*, 2004; Tian *et al.*, 2005); they proposed alternative analysis procedures by considering the distribution of genes in each gene class over the entire list. A typical GCT approach can be summarized as follow. (1) All genes are ranked by computing a test statistic (or *P*-value). (2) For each gene class or functional category, a class score is calculated as a summary measure of the class. (3) Re-sampling methods are used to generate the null distribution of the class score for each gene class. (4) Statistical significance is assessed by comparing the observed functional score to the percentile of the null distribution of the gene class.

Tian *et al.* (2005) presented two hypotheses for GCT. Hypothesis Q1 tests the relative strengths of the associations of the genes in a gene class with the experimental conditions as compared to the genes outside the gene class. That is, it tests if treatment effects in a gene class are random samples of the treatment effects observed among all genes in the array. Hypothesis Q2 tests if there are treatment effects in the gene class. In other words, hypothesis Q1 ranks the relative association of differential expressions among the gene classes, while the Q2 hypothesis ranks statistical significance of each gene class. Tian *et al.* (2005) proposed one statistic for testing Q1 and a different statistic for testing Q2. The null distribution of the statistic for Q1 was generated by permuting genes, and the null distribution for testing Q2 was generated by permuting samples.

O’Brien (1984) proposed an ordinary least squares (OLS) and a generalized least squares (GLS) global test statistics for the one-sided alternative (either up- or down-regulation) in the context of testing for an overall treatment effect on multiple clinical endpoints (all endpoints are favorably affected). The GLS is more powerful than the OLS. However, there are a few problems with the GLS (e.g. Pocock *et al.*, 1987). Tang *et al.* (1993) recommended use of the OLS test. Similar to O’Brien's OLS, Läuter (1996) proposed a standardized weighted sum (LWS) statistic. Both OLS and LWS tests were shown to perform well in testing multiple clinical endpoints.

In this article a one-sided test statistic means that the changes of individual gene expressions in a gene class are all in one direction: either up or down. A two-sided test statistic means that the changes of individual gene expressions in the gene class can be up-regulated for some genes and down-regulated for other genes. The gene-set enrichment analysis (GSEA) (Mootha *et al.*, 2003) was a one-sided test for a down-regulated alternative. The GSEA generated the null distribution by permuting the samples. The Tian *et al.* (2005) statistics were one-sided tests. Pavlidis *et al.* (2004) proposed a function scoring analysis using the average of the sum of (−log *P*)’s of individual *P*-values from a single univariate test; the univariate test can be either one-sided or two-sided. The function scoring analysis generated the null distribution by re-sampling (permuting) genes. The Fisher's exact test in the ORA is an analysis of permuting genes where the genes are dichotomized into differentially and non-differentially expressed groups. Goeman *et al.* (2004) proposed a two-sided parametric test using the generalized linear model by assuming random effects of gene expressions. Under the null hypothesis, the test statistic is asymptotic normal.

All authors, except Tian *et al.* (2005), did not distinguish Q1 and Q2. The *P*-values were computed, either by a parametric method or by a re-sampling (either permuting genes or permuting samples) method; and the *P*-values were used for gene class rankings as well as significance assessment. Therefore, they did not deal with the issue of different null distribution and different rankings under Q1 and Q2. On the other hand, Tian *et al.* (2005) applied their tests to a diabetes dataset that was presented by the GSEA analysis (Mootha *et al.*, 2003). Their gene class rankings for the enrichment pathways are very different from the rankings by the GSEA, except the most significant pathway. Tian *et al.* (2005) commented that the differences were due to the fact that the null distributions of the GSEA statistic were not the same among gene classes. However, the Tian *et al.* (2005) gene class rankings in their Tables 1 and 2 for Q1 and Q2 were somewhat different.

**Table 1.**

Test Stat | Hypothesis Q1 | Hypothesis Q2 | Correlation | |||
---|---|---|---|---|---|---|

Mean (SD) | SD | Mean (SD) | SD | Mean | SD | |

T_{ols} | 0.473 (0.28) | 0.983 (0.15) | −0.016 (0.01) | 1.075 (0.01) | −0.47 | 0.31 |

T_{lws} | 0.486 (0.29) | 0.953 (0.14) | −0.004 (0.01) | 1.060 (0.01) | −0.24 | −0.06 |

T_{two} | 1.101 (0.07) | 0.483 (0.07) | 0.944 (0.05) | 0.424 (0.11) | 0.45 | 0.46 |

ES | 61.17 (3.23) | 33.72 (0.99) | 66.43 (11.2) | 43.2 (16.20) | 0.61 | 0.37 |

|ES| | 61.06 (3.11) | 33.65 (0.95) | 62.45 (5.10) | 43.61 (6.14) | 0.74 | 0.37 |

Test Stat | Hypothesis Q1 | Hypothesis Q2 | Correlation | |||
---|---|---|---|---|---|---|

Mean (SD) | SD | Mean (SD) | SD | Mean | SD | |

T_{ols} | 0.473 (0.28) | 0.983 (0.15) | −0.016 (0.01) | 1.075 (0.01) | −0.47 | 0.31 |

T_{lws} | 0.486 (0.29) | 0.953 (0.14) | −0.004 (0.01) | 1.060 (0.01) | −0.24 | −0.06 |

T_{two} | 1.101 (0.07) | 0.483 (0.07) | 0.944 (0.05) | 0.424 (0.11) | 0.45 | 0.46 |

ES | 61.17 (3.23) | 33.72 (0.99) | 66.43 (11.2) | 43.2 (16.20) | 0.61 | 0.37 |

|ES| | 61.06 (3.11) | 33.65 (0.95) | 62.45 (5.10) | 43.61 (6.14) | 0.74 | 0.37 |

The SDs of the averages are in parentheses. The results are the average over 10 000 replications.

**Table 2.**

ID | Size | T_{ols} statistic | ES statistic | |||
---|---|---|---|---|---|---|

Parametric | Q1 | Q2 | Q1 | Q2 | ||

34 | 106 | 1(0.0058) | 1(0.0049) | 1(0.0047) | 1(0.0001) | 1(0.0043) |

38 | 84 | 2(0.0063) | 2(0.0057) | 3(0.0066) | 7(0.0162) | 4(0.0484) |

73* | 19 | 3(0.0069) | 3(0.0078) | 2(0.0065) | 114(0.910) | 121(0.894) |

128 | 3 | 4(0.0096) | 4(0.0130) | 4(0.0132) | 13(0.0806) | 3(0.0473) |

43 | 53 | 5(0.0232) | 5(0.0339) | 5(0.0232) | 5(0.0003) | 8(0.0592) |

2 | 440 | 9(0.0629) | 8(0.0618) | 9(0.0714) | 2(0.0001) | 9(0.0624) |

1 | 450 | 12(0.0703) | 10(0.0760) | 11(0.0765) | 3(0.0001) | 10(0.0654) |

6 | 216 | 13(0.0735) | 14(0.1045) | 13(0.0832) | 4(0.0002) | 11(0.0671) |

ID | Size | T_{ols} statistic | ES statistic | |||
---|---|---|---|---|---|---|

Parametric | Q1 | Q2 | Q1 | Q2 | ||

34 | 106 | 1(0.0058) | 1(0.0049) | 1(0.0047) | 1(0.0001) | 1(0.0043) |

38 | 84 | 2(0.0063) | 2(0.0057) | 3(0.0066) | 7(0.0162) | 4(0.0484) |

73* | 19 | 3(0.0069) | 3(0.0078) | 2(0.0065) | 114(0.910) | 121(0.894) |

128 | 3 | 4(0.0096) | 4(0.0130) | 4(0.0132) | 13(0.0806) | 3(0.0473) |

43 | 53 | 5(0.0232) | 5(0.0339) | 5(0.0232) | 5(0.0003) | 8(0.0592) |

2 | 440 | 9(0.0629) | 8(0.0618) | 9(0.0714) | 2(0.0001) | 9(0.0624) |

1 | 450 | 12(0.0703) | 10(0.0760) | 11(0.0765) | 3(0.0001) | 10(0.0654) |

6 | 216 | 13(0.0735) | 14(0.1045) | 13(0.0832) | 4(0.0002) | 11(0.0671) |

*up-regulation.

Eight top ranked gene classes are selected from the combination of the five smallest permutation *P*-value (in parentheses) from the *T*_{ols} and GSEA (ES) statistics under each hypothesis Q1 and Q2. The gene classes are listed according to the observed *T*_{ols} statistic with the *P*-values computed by the normal approximation (Columns 3).

In this article, we investigate the null distributions of gene classes under Q1 and under Q2 for three one-sided and two two-sided test statistics. In one-sided test, we consider the GESA statistic and O’Brien's OLS (1984) and Läuter's LWS (1996) global statistics. In two-sided test, we consider a modified GESA statistic and a sum of the standardized differences statistic. The five statistics are applied to the diabetes dataset and a breast cancer dataset.

## 2 METHOD

Consider a microarray study of *m* genes with two experimental conditions (phenotypes) of sample sizes *n*_{1} and *n*_{2}. Assume there are I gene classes of interest *C*_{1}, …,*C*_{I} and class *C _{i}* consists of

*m*genes. For gene class

_{i}*i*, let

*y*be the normalized intensity of gene

_{gjk}*g*(

*g*= 1, …,

*m*) for sample

_{i}*j*(

*j*= 1, …,

*n*) in phenotype

_{k}*k*(

*k*= 1,2). Denote the standardized variable , where

*y*

_{g}_{.}

*is the sample mean for phenotype*

_{k}*k*and

*s*is the pooled sample SD for the

_{g}*g*-th gene. Let

*be an*

**d***m*-dimensional vector for the mean difference of the standardized variables between two phenotypes. The O’Brien's OLS statistic is

_{i}**1**is an

*m*× 1 vector of

_{i}**1**’s and

*V*is the pooled sample correlation matrix of

*. If*

**d***is multivariate normal, then the statistic,*

**d***T*

_{ols}, has an approximate

*t*distribution with

*n*

_{1}+

*n*

_{2}− 2 degrees of freedom under the Q2 hypothesis. When sample size is small,

*T*

_{ols}does not keep the prescribed level of significance (Läuter, 1996). Läuter (1996) proposed an exact test by standardizing the variables by the overall mean (

*z*. =

_{g}*y*.) and standard deviation (

_{g}*w*) for the gene class

_{g}*C*,

_{i}*z*= (

_{gjk}*y*− z

_{gjk}_{g}.)/

*w*. He then applied the two-sample

_{g}*t*-test to the sum of the standardized variables

*z.*= Σ

_{jk}_{g}

*z*. For each gene class, let

_{gjk}*z.*= Σ

_{k}_{j}

*z.*/

_{jk}*n*be the mean and

_{k}*w*be the pooled SD for the standardized variables

*z.*’s. Läuter's LWS test is

_{jk}Under hypothesis Q2 *T*_{lws} has an exact t distribution with *n*_{1} + *n*_{2} − 2 degrees of freedom. Both *T*_{ols} and *T*_{lws} can be used to test for Q1 or Q2; and both are a one-sided test either up- or down-regulation. The two statistics asymptotically have a standard normal distribution under Q2. We used the parametric normal approximation and permutation methods to compute the *P*-values and compared their gene rankings under Q1 and Q2. The ranking order based on permutation *P*-values may differ from the ranking order based on *t*-values if test statistics are not identically distributed (Tsai and Chen, 2004). We also consider the GESA (ES) statistic in the investigation (Mootha *et al.*, 2003). The original ES is proposed for a down-regulated alternative based on the ordered signal-to-noise ratio (SNR). The ES statistic based on the negative of the SNR is for an up-regulated alternative, denoted by ES^{U}. The ES^{U} is also considered in the investigation.

For a two-sided hypothesis, we consider two statistics. For the first statistic, we consider a two-sided |ES| statistic based on the absolute value of the SNR difference metric, instead of the SNR. Thus, the |ES| has the same expression as the ES statistic (Mootha *et al.*, 2003),

The second statistic is based on the sum of the standardized differences across genes in the gene class:

where*y*.

_{g}_{1}and

*y*.

_{g}_{2}are the sample means, and

*s*is the sample variance. The null distribution of

_{g}*T*

_{two}is mathematically intractable, and a permutation test is used to compute the

*P*-values.

A *P*-value of 0.05 has been used as the significance level for a gene class, (e.g. Goeman *et al.*, 2004; Mooth *et al.*2003; Pavlidis *et al.*, 2004). Since many gene classes are tested, Tian *et al.* (2005) and Barry *et al.* (2005) used the FDR approach (Benjamini and Hochberg, 1995) to account for multiple tests. Both used FDR at 0.10 for the significance level of a gene class. The FDR or familywise-error-rate (FWE) approach emphasizes the false positive error in determining the cutoff threshold. Since microarray experiments are exploratory in nature, some important gene classes might have larger *P*-values and the false negative error can also be of concern. We consider a receiver operating characteristic (ROC) approach to determine an optimal cutoff based on minimizing the total cost from false positive and false negative errors (Delongchamp *et al.*, 2004). Let *p _{(r)}* denote the

*r*-th ordered

*P*-value with

*p*

_{(1)}is the smallest and

*p*

_{(2)}is the second smallest, etc. Let CFP be a cost assigned to a false positive error and CFN be a cost assigned to a false negative error. The optimal cutoff minimizes the total expected cost:

*L*0 and

*L*1 are the numbers of truly non-differentially and differentially expressed gene classes, respectively. The optimal cutoff

*r*can be estimated numerically. The ROC approach requires knowledge of

*L*0 and

*L*1; we used an estimate derived from the mean difference method (Hsueh

*et al.*, 2003) in the examples below.

A GCT analysis assigns an overall statistical significance of phenotype differences in gene expression for a gene class. It does not identify which genes in the gene class actually contribute to the phenotype difference. After identifying gene classes that show a difference, the standard univariate test can be used to identify which genes in the gene class are significant. However, in testing individual genes in a gene class, the total number of the tests in a gene class is *m _{i}*, instead of

*m*genes in the array. That is, for a given significant gene class identified, a multiple test procedure such as FDR or FWE should be only based on

*m*comparisons. In the follow-up analysis, we are interested in which genes are significant in the given significant gene class. A gene may be significant in one gene class, but, insignificant in another.

_{i}## 3 RESULTS

### 3.1 Null distributions of the GCT statistics

In a Monte Carlo simulation, Tian *et al.* (2005) showed that the null distributions of the GSEA’s ES statistic (Mootha *et al.*, 2003) differ markedly over gene classes. We conducted a similar evaluation of the five GCT statistics using the same diabetes dataset. In addition, we also evaluated an alternative two-sided statistic based on the average of the absolute values of the *t*-statistics:

*t*

_{g}’s are two sample

*t*-statistics. We evaluated this statistic since its one-sided statistic (

*m*)

_{i}^{1/2}Σ

_{g∈Ci}(

*t*) has an asymptotical

_{g}/m_{i}*N*(0,1) and was mentioned by Tian

*et al.*(2005).

The diabetes dataset consisted of 143 gene classes from 10 526 genes measured on 17 subjects with normal glucose tolerance and 18 subjects with type 2 diabetes mellitus. Class sizes range from 1 to 450. We simulated the null distribution of each gene class under Q1 by permuting gene labels and simulated the null distribution of each gene class under Q2 by permuting samples. For each simulated sample, we computed the *T*_{ols}, *T*_{lws}, *T*_{two}, *T*_{av}, ES and |ES| statistics. The simulation was repeated 10 000 times to obtain null distributions for each of the 143 gene classes for the six GCT statistics. The means and SDs over the 10 000 replicates were computed for each of the 143 gene classes. Figure 1 plots the means for the 143 gene classes versus the class sizes for the six statistics under Q1 and under Q2.

**Fig. 1.**

**Fig. 1.**

For hypothesis Q1, except *T*_{two}, the means of the null distributions increase as the class size increases. For hypothesis Q2, the means of the null distributions for *T*_{ols} and *T*_{lws} are fairly constant. The means in *T*_{two} are somewhat similar, as are the means in |ES|. The means in ES are rather different. Also, the means for the 143 gene classes in *T*_{ols} are similar to the corresponding means in *T*_{lws} under either Q1 or Q2. For the *T*_{av} statistic, the mean increases as the class sizes increases under either Q1 or Q2. The gene class of size one has the smallest means ranging 0.83–0.86, and the largest class size has the largest mean of 17.97. It is apparent that the null distributions for the 143 classes are very different. Since under Q2, the null distribution should not depend on the class size, the *T*_{av} statistic will not be considered in the further evaluation.

Table 1 shows the average of the means and the average of the SDs over the 143 gene classes of the five GCT statistics under Q1 and under Q2, and the correlation coefficients between Q1 and Q2 for the mean and the SD. Unlike *T*_{ols}, *T*_{lws} and *T*_{two}, the averages of the SDs for ES and |ES| are rather large under either Q1 or Q2. Both statistics are a summation of quantities which depend on the class size. When the class sizes are different, the summations would result in large differences.

In each *T*_{ols}, *T*_{lws} and *T*_{two}, the SDs of the averaged means and averaged SDs are small under either Q1 or Q2. That is, the 143 genes classes have almost the same means and same SDs. In ES and |ES|, under Q1 both the averaged means and averaged SDs have relatively small SDs. Under Q2 the means as well as standard deviations for the 143 null distributions are very different for ES. For example, the means range from 51.97 to 118.46 and the SDs range from 26.77 to 115.33 (Fig. 1). These results are very similar to the results of Tian *et al.* (2005). The means for |ES| are more similar; the means range from 51.57 to 82.55 and the SDs range from 29.16 to 76.77.

Under Q2, the means for *T*_{ols} and *T*_{lws} are close to 0 and the SDs are slightly larger than 1. These values are in good agreements with the theoretical results that the null distributions of the two statistics have an asymptotic *N*(0,1). A deviation from 0, either positive or negative, indicates favoring an alternative hypothesis, either an up- or down-regulated pathway. Under Q1, the means for *T*_{ols} and *T*_{lws} are about 0.48 and the SDs are slightly less than 1. The null distributions under Q1 do not appear to have an asymptotic *N*(0,1). For *T*_{two}, a large value indicates a difference, positive and/or negative. For the GESA statistics, a large value of ES favors down-regulation; also, a large value of ES^{U} favors up-regulation. However, unlike *T*_{ols} and *T*_{lws}, ES and ES^{U} need to be computed separately. Similarly, a large |ES| indicates a difference. In the further analysis we will report only the results from the *T*_{ols} since the null distributions for the two one-sided statistics *T*_{ols} and *T*_{lws} are very similar.

In each of the five GCT statistics, the distribution of each gene class under Q1 appears to be different from the corresponding distribution under Q2 (Fig. 1). In other words, the null distributions of the 143 gene classes generated by permuting genes (under Q1) differ from the null distributions generated by permuting phenotypes (under Q2). The correlations between Q1 and Q2 are low for *T*_{ols}, *T*_{lws} and *T*_{ols}, and higher for ES and |ES|. Figure 2 shows the scatter plots of the means of null distributions for Q1 versus Q2 over the 143 gene classes. *T*_{ols} and *T*_{two} show no clear correlation between Q1 and Q2, while ES and |ES| show positive correlations. However, these positive correlations could be from an indirect effect of class sizes.

**Fig. 2.**

**Fig. 2.**

In testing Q1 and Q2, different null distributions could result in different gene rankings. Under Q2, the *P*-values of the gene classes are calculated under no difference between two phenotypes. If the null distributions of the gene classes are identical, e.g. *T _{ols}* or

*T*under Q2,, then these

_{lws}*P*-values are comparable and can be used for ranking of gene classes as well as identifying differential expression. The

*P*-value computed under Q2 is consistent with the principle of statistical significance testing. The evaluation and analysis of the GCT tests will be primarily based on Q2.

### 3.2 GCT analysis of diabetes data

Table 2 lists eight gene classes from the one-sided *T*_{ols} and ES tests under Q1 and under Q2, and the corresponding *P*-values from the observed *T*_{ols} using the normal approximation. These are the eight most significant gene classes selected from the five smallest *P*-values in each test under each hypothesis.

For *T*_{ols}, the *P*-values from the parametric and permutation methods are in good agreement. Despite the low correlation between Q1 and Q2 (Table 1), the rankings under Q1 and under Q2 are similar. For ES, under Q1 the four top gene classes all have large class sizes. It is worth mentioning that the *P*-values for gene class #73 are 0.914 and 0.898 for Q1 and Q2, respectively. This gene class is up-regulated and ranked second or third by *T*_{ols}. The *T*_{ols} test is designed to identify either up- or down-regulated gene class, while ES here is designed to identify only the down-regulated gene class. The *P*-values of ES^{U} for an up-regulated alternative are 0.037 and 0.045 (<0.01) for Q1 and Q2, respectively.

Table 3 lists eight gene classes from the two-sided |ES| and *T _{tw}*

_{o}statistic under Q1 and under Q2, and the corresponding ranks from

*T*are also shown. Seven of the eight, except for gene class #1, represent the most significant gene classes. For |ES|, the rankings under Q1 and under Q2 are similar. Like ES, the four top-ranked gene classes in |ES| all have large class sizes under Q1. For

_{ols}*T*

_{two}, the rankings under Q1 and under Q2 are different. The rankings in |ES| differ from the rankings in

*T*

_{two}under either Q1 or Q2. For example, the two larges classes #450 and #440, which are ranked first and second in |ES|, are ranked 47 and 9 in

*T*

_{two}under Q1 and ranked 70 and 31 under Q2.

**Table 3.**

ID | Size | |ES| statistic | T_{two} statistic | T_{ols} statistic | |||
---|---|---|---|---|---|---|---|

Q1 | Q2 | Q1 | Q2 | Q1 | Q2 | ||

1 | 450 | 1(0.0000) | 6(0.0152) | 47(0.2847) | 70(0.3503) | 10 | 11 |

2 | 440 | 2(0.0000) | 5(0.0145) | 9(0.0559) | 31(0.1461) | 8 | 9 |

34 | 106 | 3(0.0000) | 2(0.0078) | 6(0.0526) | 11(0.0316) | 1 | 1 |

6 | 216 | 4(0.0011) | 3(0.0124) | 21(0.0957) | 19(0.0708) | 14 | 13 |

73 | 19 | 5(0.0014) | 1(0.0033) | 25(0.1243) | 22(0.0878) | 3 | 2 |

43 | 53 | 6(0.0077) | 11(0.0410) | 4(0.0512) | 8(0.0283) | 5 | 5 |

105 | 7 | 7(0.0090) | 4(0.0129) | 7(0.0540) | 10(0.0314) | 120 | 123 |

128 | 3 | 33(0.1253) | 30(0.1162) | 2(0.0258) | 7(0.0252) | 4 | 4 |

ID | Size | |ES| statistic | T_{two} statistic | T_{ols} statistic | |||
---|---|---|---|---|---|---|---|

Q1 | Q2 | Q1 | Q2 | Q1 | Q2 | ||

1 | 450 | 1(0.0000) | 6(0.0152) | 47(0.2847) | 70(0.3503) | 10 | 11 |

2 | 440 | 2(0.0000) | 5(0.0145) | 9(0.0559) | 31(0.1461) | 8 | 9 |

34 | 106 | 3(0.0000) | 2(0.0078) | 6(0.0526) | 11(0.0316) | 1 | 1 |

6 | 216 | 4(0.0011) | 3(0.0124) | 21(0.0957) | 19(0.0708) | 14 | 13 |

73 | 19 | 5(0.0014) | 1(0.0033) | 25(0.1243) | 22(0.0878) | 3 | 2 |

43 | 53 | 6(0.0077) | 11(0.0410) | 4(0.0512) | 8(0.0283) | 5 | 5 |

105 | 7 | 7(0.0090) | 4(0.0129) | 7(0.0540) | 10(0.0314) | 120 | 123 |

128 | 3 | 33(0.1253) | 30(0.1162) | 2(0.0258) | 7(0.0252) | 4 | 4 |

Eight gene classes are selected from the combination of the five smallest *P*-value (in parentheses) from |ES| and *T*_{two} under each hypothesis Q1 and Q2. The gene classes are listed according to *P*-values of |ES| under Q1. The ranks from the one-sided *T*_{ols} are shown for comparison.

The top-ranked genes from the two-sided and one-sided tests are different, as expected. Among the two lists (Tables 2 and 3), there are six gene classes in common. Gene class #38, which is ranked second and third by *T*_{ols}, is not in Table 3. On the other hand, gene class #105 is highly ranked in Table 3, but is not in Table 2. This is due to the fact that *T*_{ols} lacks power when the gene class involves both up- and down-regulated genes. There are seven genes in the gene class #105, three are up-regulated and four are down-regulated. The remaining analysis of the diabetes data will focus on the one-sided *T*_{ols} test.

Of the 143 gene classes analyzed, using the significance level of α = 0.05, the numbers of significances for the one-sided test are 7 and 6 for Q1 and Q2, respectively; and the numbers for the two-sided test are 10 and 15. Interestingly, Tian *et al.* (2003) discussed the three significant gene classes: (1) OXPHOS (oxidative phosphorylation) #34, (2) MAP00910 (nitrogen metabolism) #73 and (3) c22-U133 (cell growth and maintenance) #38. These three gene classes are ranked 1, 2 and 3 by the *T*_{ols} (and *T*_{lws}) tests. Tian *et al.* (2003) also discussed that the finding of the gene class OXPHOS was supported by another gene class MAP00190 #43 which is ranked fifth by the *T*_{ols} test. It is worth mentioning that the *P*-value of the OXPHOS gene class originally identified by the GSEA analysis (Mootha *et al.*, 2003) is 0.029.

None of the gene classes has the FDR less than 0.10. For OXPHOS, the *P*-value from the *T*_{ols} test is 0.0047 with the corresponding FDR = 0.323 under Q2 (note that Tian's analysis was based on 26 gene classes in their multiplicity adjustment). Alternatively, with the ROC approach, we assume the relative cost between CFP and CFN is 1 to 2, i.e. 2CFP = CFN (a two-fold higher cost on omitting potentially differentially expressed gene classes than on falsely selecting differentially expressed gene classes). Using the mean difference method (Hsueh *et al.*, 2003), an estimate of *L*0 is 141. The optimal cutoff occurs at *r* = 2 with the corresponding *p*_{(2)} = 0.0072. That is, two gene classes are selected when both false positive and false negative errors are taken into consideration. The individual genes in the significance gene class can be further investigated. For the gene class OXPHOS, it has 106 genes. Only one gene with the (smallest) *P*-value of 0.007 is significant at FDR = 0.1.

### 3.3 GCT analysis of breast cancer data

The breast cancer data was presented by van’t Veer *et al.* (2002). The dataset contained approximately 25 000 genes from 78 primary breast cancers (34 from patients who developed distant metastases within 5 years and 44 from patients who continued to be disease-free after a period of at least 5 years). The purpose of this example is to illustrate a GCT analysis defined by GO term. To identify the GO terms that are most related to prognostic reporters for breast cancer metastases, 231 genes identified as predictors previously by van’t Veer *et al.* (2002) were used to generate the GO terms. One hundred and thirty-eight genes out of the 231 having gene names and without missing values were submitted to generate GO terms. A total of 512 GO terms for biological process was generated. The four top level terms, biological process, cellular process, physiological process and cellular physiological process were not used in the analysis resulting in 508 GO terms linked with 4137 genes.

We performed a similar simulation to study the null distributions of the five GCT test statistics. The results are similar to the results from the diabetes data. (See Supplementary Material, Table S1 and Figs S1 and S2.) Under Q1, the means of the null distribution increases as the class size increases for ES, |ES| and *T*_{two}, Under Q2, that the null distributions of *T*_{ols} and *T*_{lws} have an asymptotic *N*(0, 1). ES, |ES| and *T*_{two} show high positive correlations. Only *T*_{ols} and |ES| will be considered in the remaining analysis since they show better performance than ES (or ES^{U}) and *T*_{two}, respectively.

Table 4 lists the five top gene classes under Q1 and the five top gene classes under Q2, where the class sizes are at least 10. The two lists have no gene in common. The gene classes in Q1 are all down-regulated, while the gene classes in Q2 are all up-regulated. That is, the rankings between Q1 and Q2 are completely different.

**Table 4.**

ID | Size | Category | Pathway | Q1 | Q2 |
---|---|---|---|---|---|

30 | 610 | GO:0045449 | Regulation of transcription | 1(0.0000) | 39(0.0631) |

32 | 582 | GO:0006355 | Regulation of transcription, DNA-dependent | 2(0.0003) | 46(0.0755) |

28 | 641 | GO:0006350 | Transcription | 3(0.0005) | 66(0.1038) |

29 | 627 | GO:0019219 | Regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism | 4(0.0010) | 74(0.1185) |

31 | 604 | GO:0006351 | Transcription, DNA-dependent | 5(0.0018) | 78(0.1320) |

381 | 11 | GO:0006270 | DNA replication initiation | 6(0.0020) | 1(0.0014) |

390 | 10 | GO:0006541 | Glutamine metabolism | 16(0.0092) | 2(0.0017) |

362 | 13 | GO:0007059 | Chromosome segregation | 17(0.0095) | 3(0.0026) |

277 | 28 | GO:0006096 | Glycolysis | 12(0.0053) | 4(0.0034) |

389 | 11 | GO:0043087 | Regulation of GTPase activity | 23(0.0161) | 5(0.0044) |

ID | Size | Category | Pathway | Q1 | Q2 |
---|---|---|---|---|---|

30 | 610 | GO:0045449 | Regulation of transcription | 1(0.0000) | 39(0.0631) |

32 | 582 | GO:0006355 | Regulation of transcription, DNA-dependent | 2(0.0003) | 46(0.0755) |

28 | 641 | GO:0006350 | Transcription | 3(0.0005) | 66(0.1038) |

29 | 627 | GO:0019219 | Regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism | 4(0.0010) | 74(0.1185) |

31 | 604 | GO:0006351 | Transcription, DNA-dependent | 5(0.0018) | 78(0.1320) |

381 | 11 | GO:0006270 | DNA replication initiation | 6(0.0020) | 1(0.0014) |

390 | 10 | GO:0006541 | Glutamine metabolism | 16(0.0092) | 2(0.0017) |

362 | 13 | GO:0007059 | Chromosome segregation | 17(0.0095) | 3(0.0026) |

277 | 28 | GO:0006096 | Glycolysis | 12(0.0053) | 4(0.0034) |

389 | 11 | GO:0043087 | Regulation of GTPase activity | 23(0.0161) | 5(0.0044) |

The top five gene classes identified under Q2 are consistent with the biological mechanism leading to rapid metastases, the poor prognosis of breast cancers. DNA replication initiation, chromosome segregation and regulation of GTPase activity are involved in cell growth and cell cycle control. It is not surprising to find that the activities of genes involved in cell cycles are enhanced because cell growth rate in rapidly growing tumors is high. High rates of glutamine metabolism are related to advanced malignant tumors because glutamine is the major respiratory fuel for tumor cells and it also provides tumors substrates for nucleotide and protein biosynthesis. The activities for glycolysis and sugar catabolism (glucose and hexose catabolism, etc.) in rapidly growing cancer cells are generally increased for ATP generation.

Using the significance level at FDR = 0.10, the numbers of significant gene classes is 10 under Q2. In this dataset, the estimated number of *L*0 is 470. The individual genes in the significance gene class can be further investigated. The numbers of significant genes at FDR = 0.10 for the top five gene classes, GO:0006270, GO:000541, GO:0007059, GO:0006096 and GO:0043087 (Table 4), are 4, 4, 1, 13, and 1, respectively. Note that these numbers are based on the individual gene class sizes (*m _{i}*) in the FDR calculation. The numbers of significant genes for the five gene classes would be 0, 2, 0, 2 and 1 if the total 4137 genes are used in the multiple comparison adjustment.

Table 5 lists top seven gene classes from the five smallest *P*-values for the two-sided |ES| statistic under Q1 and under Q2. Unlike *T*_{ols}, the two rankings are very similar. Figure 3 plots the null distributions of gene class GO:0045449 under Q1 and Q2.

**Table 5.**

ID | Size | Category | Pathway | Q1 | Q2 |
---|---|---|---|---|---|

47 | 320 | GO:0007049 | Cell cycle | 1 (0.0002) | 5 (0.0049) |

174 | 64 | GO:0007067 | Mitosis | 2 (0.0004) | 1 (0.0011) |

168 | 66 | GO:0000087 | M phase of mitotic cell cycle | 3 (0.0006) | 2 (0.0014) |

126 | 102 | GO:0000278 | Mitotic cell cycle | 4 (0.0007) | 6 (0.0057) |

142 | 83 | GO:0000279 | M phase | 5 (0.0014) | 10 (0.0107) |

176 | 63 | GO:00051301 | Cell division | 6 (0.0014) | 4 (0.0023) |

287 | 26 | GO:0000226 | Microtubule cytoskeleton organization and biogenesis | 8 (0.0015) | 3 (0.0016) |

ID | Size | Category | Pathway | Q1 | Q2 |
---|---|---|---|---|---|

47 | 320 | GO:0007049 | Cell cycle | 1 (0.0002) | 5 (0.0049) |

174 | 64 | GO:0007067 | Mitosis | 2 (0.0004) | 1 (0.0011) |

168 | 66 | GO:0000087 | M phase of mitotic cell cycle | 3 (0.0006) | 2 (0.0014) |

126 | 102 | GO:0000278 | Mitotic cell cycle | 4 (0.0007) | 6 (0.0057) |

142 | 83 | GO:0000279 | M phase | 5 (0.0014) | 10 (0.0107) |

176 | 63 | GO:00051301 | Cell division | 6 (0.0014) | 4 (0.0023) |

287 | 26 | GO:0000226 | Microtubule cytoskeleton organization and biogenesis | 8 (0.0015) | 3 (0.0016) |

**Fig. 3.**

**Fig. 3.**

In summary, for *T*_{ols}, the null distributions of gene classes uner Q1 and under Q2 are different (Table 1, Table S1 and Fig. 3). The class rankings are generally consistent for the diabetes data. (Table 2) But the rankings are very different for the breast cancer data (Table 4). On the other hand, |ES| shows more consistent rankings between Q1 andQ2 (Table 3, Table 5, and Fig. 3). It is

## 4 DISCUSSION

Biological phenomena often occur through the interactions of multiple genes, via signaling pathways, networks or other functional relationships. With rapid developments of genomic databases and availability of more comprehensive annotations, e.g. KEGG pathway maps, GoMiner (Zeeberg *et al.*, 2003), Onto-Express (Khatri *et al.*, 2002), FatiGo (Al-Shahrour *et al.*, 2004), GCT can provide powerful which is easier to interpret analysis. GCT analysis shows a list of significant gene classes that are more differentially expressed among the classes than expected by chance; the list is ranked by the *P*-values of a test statistic.

In the context of Q1 and Q2, the significance analysis is tested by the Q2 hypothesis, while the gene class ranking is tested by the Q1 hypothesis. For the Q1 hypothesis, the null distributions are generated by permuting genes. Permuting the genes only reassigns individual genes to different gene classes. Q1 is a conditional test, conditional on the association strength between phenotypes and gene expressions in the observed data. Under Q1, the null distributions of different gene classes could be very different when there are differentially expressed genes between the phenotypes. For the Q2 hypothesis, the null distributions are generated by permuting samples. Under Q2, the null distribution of different gene classes should be identically distributed. Q2 is an unconditional test. The Q1 and Q2 hypotheses are highly related; however, the two rankings can be very different. When the number of differentially expressed genes is small, such as the diabetes dataset, Q1 is approximately equivalent to Q2, and the rankings under Q1 and Q2 will be similar. For the breast cancer dataset, many more genes are differentially expressed; the rankings under Q1 and Q2 are very different for *T*_{ols}. Nevertheless, for |ES| the two rankings are similar. We will investigate this issue in the future.

We have concerns with permutations of genes in the Q1 hypothesis on two main points. First, its null distribution is essentially generated from the empirical distribution of the observed *P*-values, instead of the uniform distribution under the null hypothesis that no difference in gene expression exists between two phenotypes. When many genes are differentially expressed, this (empirical) null distribution will not represent the distribution of genes which are not differentially expressed. This hypothesis does not test whether the gene class is different in two phenotypes. Rather, it tests whether the observed difference between two phenotypes in the gene class is more or less than the average differences in the study. The second point is that the validity of this test depends upon exchangeability among the genes. Correlation structures likely to occur among genes are not exchangeable and result in computed significance levels by this test that can be far too small. Given these difficulties the utility of the Q1 test which shuffles genes is unclear.

On the other hand, the Q2 hypothesis is consistent with the current use of a permutation test, such as the significance analysis of microarray (Tusher *et al.*, 2001), to select a list of differentially expressed genes. The *P*-value is calculated by comparing the observed statistic to the percentile of the null distribution generated by permuting samples. The *P*-values are used for gene ranking and significance assessment. Permutation of samples computes the *P*-values based on the random (independent) samples, and the test statistics are comparable.

For testing hypothesis Q2, the null distributions of *T*_{ols} and *T*_{lws} appear to have asymptotic standard normal distributions for all gene classes. These two global statistics account for class size and correlation structure. These two statistics are recommended for identifying either up- or down-regulated gene classes. A significant gene class may consist of up- or down-regulated genes or a mixture of both. The two statistics lack power when the gene class consists of both up- and down-regulated genes. We investigate |ES| and *T*_{two} for identifying gene classes with a mixture of both up- and down-regulated genes. The *T*_{two} statistic tries to account for the magnitude of differences in which a gene with a large standardized difference may lead to a relatively higher score. The two statistics could have very different rankings. (*T*_{two})^{2} is χ^{2} distributed under the model of independence among genes with an approximate equal variance. These assumptions obviously do not hold. The *T*_{two} statistics appears to perform unsatisfactorily.

There were concerns that the null distributions of the GSEA ES statistic were affected by the class sizes (Damian and Gorfine, 2004; Tian *et al.*, 2005). The |ES| statistics shows more consistent rankings between Q1 andQ2. However, |ES| appears to be affected by the class size. The |ES| statistic determines whether the members of a given gene class show enrichment among the most differentially expressed genes by considering the rankings of differences. We are currently investigating |ES| and other statistics including the score statistic proposed by Goeman *et al.* (2004).

## ACKNOWLEDGEMENTS

The views presented in this article do not necessarily reflect those of the U.S. Food and Drug Administration.

*Conflict of Interest*: none declared.

## Comments