Abstract

Summary: The most widely used statistical methods for finding differentially expressed genes (DEGs) are essentially univariate. In this study, we present a new T2 statistic for analyzing microarray data. We implemented our method using a multiple forward search (MFS) algorithm that is designed for selecting a subset of feature vectors in high-dimensional microarray datasets. The proposed T2 statistic is a corollary to that originally developed for multivariate analyses and possesses two prominent statistical properties. First, our method takes into account multidimensional structure of microarray data. The utilization of the information hidden in gene interactions allows for finding genes whose differential expressions are not marginally detectable in univariate testing methods. Second, the statistic has a close relationship to discriminant analyses for classification of gene expression patterns. Our search algorithm sequentially maximizes gene expression difference/distance between two groups of genes. Including such a set of DEGs into initial feature variables may increase the power of classification rules. We validated our method by using a spike-in HGU95 dataset from Affymetrix. The utility of the new method was demonstrated by application to the analyses of gene expression patterns in human liver cancers and breast cancers. Extensive bioinformatics analyses and cross-validation of DEGs identified in the application datasets showed the significant advantages of our new algorithm.

Availability: The program for the method proposed in this article and supplementary material are available at http://orclinux.creighton.edu/hotelling/index.htm

Contact:  [email protected]

Supplementary information:  http://orclinux.creighton.edu/hotelling/index.htm

1 INTRODUCTION

Microarray technology is a powerful approach for genomic research, which allows the monitoring of expression profiles for tens of thousands of genes in parallel and is already producing huge amounts of data. Microarray data contain valuable information about gene functions, inter-gene dependencies and underlying biological processes, and open a new avenue for discovering gene co-regulations, gene interactions, metabolic pathways and gene–environment interactions, etc (Slonim, 2002). Microarray data are characterized with high dimensions and small sample sizes. Statistical inference from such high dimensional data structures is challenging (Mehta et al., 2004). Several data-mining methodologies, such as clustering analysis and classification techniques, have been widely used to analyze gene expression data for identifying groups of genes sharing similar expression profiles (Alon et al., 1999; Brazma and Vilo, 2000; Eisen et al., 1998; Golub et al., 1999).

While clustering and classification techniques have proven to be useful to search similar gene expression patterns, these techniques do not answer the most fundamental question in microarray experiments: which genes are responsible for biological differences between different cell types and/or states of a cell cycle? This question amounts to statistically testing the null hypothesis that there is no difference in expression under comparison. Various statistical methods have been proposed for identifying differentially expressed genes (DEGs), from using a simple fold change to using various linear models as well as Bayesian methods (Baldi and Long, 2001; Chen et al., 1997; Kerr et al., 2000; Tusher et al., 2001; Wang and Ethier, 2004; Wettenhall and Smyth, 2004); but none of them has yet gained widespread acceptance for microarray data analyses. A common characteristic of these methods is essentially of univariate nature.

It is well known that genes never act alone in a biological system—they work in a cascade of networks (Leung and Cavalieri, 2003). Expression profiles of multiple genes are often correlated and thus are more suitably modeled as mutually dependent variables in development of a statistical testing framework. However, most of the current statistical methods ignore the multidimensional structure of the expression data and fail to efficiently utilize the valuable information for gene interactions. Multivariate analyses take advantage of the correlation information and analyze the data from multiple genes jointly. As a result, multivariate statistical techniques are receiving increased attention in expression data analyses (Chilingaryan et al., 2002; Szabo et al., 2002). However, applications of well-established multivariate statistical techniques for microarray data analyses are not straightforward because of the unusual features of the microarray data, such as high dimensions and small sample sizes.

In this study, we presented a Hotelling's T2 test that utilizes multiple gene expression information to identify DEGs in two test groups. The Hotelling's T2 statistic is a natural multidimensional extension of the t-statistic that is currently a widespread approach for detecting DEGs in testing individual genes. The Hotelling's T2 method has been applied to various aspects of life sciences including genome association studies (Xiong et al., 2002), microarray process control (Model et al., 2002) and data control charts (Mason et al., 1995). We implemented our method using a multiple forward search (MFS) algorithm that is designed for selecting a subset of feature vectors in high-dimensional microarray datasets. A resampling-based technique was also developed to smooth statistical fluctuation in T2 statistic owing to large variability inherent in microarray data and to accommodate experiments with missing values for various spots on microarrays. The proposed Hotelling's T2 method was validated by using a spike-in HGU95 dataset from Affymetrix. To illustrate its utility, we applied our method to the microarray data analyses of gene expression patterns in human liver cancers (Chen et al., 2002) and breast cancers (van't Veer et al., 2002).

2 METHODS

2.1 Hotelling's T2 statistic

We consider a microarray experiment composed of nD samples from a disease group and nN samples from a normal group. Suppose that the expression levels of J genes are measured and used as variables to construct a T2 statistic. Let
\({X}_{ij}^{D}\)
be the expression level for gene j of sample i from the disease group and
\({X}_{kj}^{N}\)
be the expression level for gene j of sample k from the normal group. The expression level vectors for samples i and k from the disease and normal groups can be expressed as
\({\hbox{ X }}_{i}^{D}={({X}_{i1}^{D},\dots ,{X}_{iJ}^{D})}^{\hbox{ T }}\)
and
\({\hbox{ X }}_{k}^{N}={({X}_{k1}^{N},\dots ,{X}_{kJ}^{N})}^{\hbox{ T }}\)
, respectively. The mean expression levels of gene j in the disease and normal groups can be expressed as
respectively. The mean expression level vectors for J genes in the disease and normal groups can be expressed as
\({\overline{\hbox{ X }}}^{D}={({\overline{\hbox{ X }}}_{1}^{D},\dots ,{\overline{\hbox{ X }}}_{j}^{D})}^{\hbox{ T }}\)
and
\({\overline{\hbox{ X }}}^{N}={({\overline{\hbox{ X }}}_{1}^{N},\dots ,{\overline{\hbox{ X }}}_{j}^{N})}^{\hbox{ T }}\)
, respectively. The pooled variance–covariance matrix of expression levels of Jgenes for the disease and normal samples is then defined as,
(1)
where SD and SN are the variance–covariance matrix of expression levels for J genes in the disease and normal groups, respectively. The covariance terms in SD and SN account for the correlation and interdependence (interactions) of gene expression levels.
Hotelling's T2 statistic for gene differential expression studies is then defined as,
(2)
This statistic combines information from the mean and dispersion of all the variables (genes being tested) in microarray experiments. When we compare two groups of samples, both of which have a large sample size, see under the null hypothesis that the distributions in both groups are the same, i.e. there is no differential expression for any genes being tested in the disease and normal groups, the central limit theorem dictates that,
(3)
is asymptotically F-distributed with J degrees of freedom for the numerator and nD + nNJ − 1 for the denominator.

2.2 Multiple forward search (MFS) algorithm

There are usually a relatively small number of independent samples used in microarray experiments, while a relatively large number of genes under comparison may actually be differentially expressed. Hence, the pooled sample variance–covariance matrix S in T2 statistic may be singular and not invertible. As a remedy for this problem, we propose a MFS algorithm that sequentially maximizes expression differences between groups of genes. This algorithm allows for iteratively and exhaustively finding a set of target DEGs. The basic structure of our MFS algorithm is outlinedbelow:

  • Step 1. Calculate T2 statistics for each of all the genes that are measured in datasets, and find the gene j1 that maximizes T2, denoting

    \({T}_{{j}_{1}}^{2}\)
    ⁠.

  • Step 2. If

    \(P{\hbox{ -value }}_{\left({T}_{{j}_{1}}^{2}\right)} < \alpha \)
    (a predefined significance level), calculate T2 statistics for two genes, one is the gene j1 and the other is one of the remaining genes excepting the gene j1. Find the gene j2 that maximizes T2 combining with the gene j1, denoting
    \({T}_{{j}_{1},{j}_{2}}^{2}\)
    .

  • Step 3. If

    \(P{\hbox{ -value }}_{\left({T}_{{j}_{1},{j}_{2}}^{2}\right)} < P{\hbox{ -value }}_{\left({T}_{{j}_{1}}^{2}\right)}\)
    ⁠, repeat Step 2 by adding one more gene that maximizes T2 combining with the genes j1 and j2.

  • Step 4. Repeat Step 3 until

    \(P{\hbox{ -value }}_{\left({T}_{{j}_{1},\dots ,{j}_{n-1},{j}_{n}}^{2}\right)} > P{\hbox{ -value }}_{\left({T}_{{j}_{1},\dots ,{j}_{n-1}}^{2}\right)}\)
    or the number of genes is larger than n1 + n2 − 2. Then the selected genes j1,j2,…,jn−1 are the first subset of identified DEGs.

  • Step 5. Exclude gene j1,…,jn−1, and repeat Steps 1–4.

  • Step 6. Repeat Step 5 until the P-value of T2 statistic of the starting gene is larger than α, i.e.

    \(P{\hbox{ -value }}_{\left({T}_{{j}_{1}}^{2}\right)} > \alpha \)
    ⁠, and stop searching.

The structure of our algorithm that adds one gene to T2 statistics in each of the Steps 1–4 allows for a fast updated computation of T2 statistics by avoiding matrix inverse. For example, the pooled variance–covariance matrix for n genes can be partitioned into a block form:
where Sn−1 is the variance–covariance matrix for n − 1 genes, a is the covariance vector for the first n − 1 genes and the n-th gene, aT is the transpose of a, and b is the variance for the n-th gene. The inverse of Sn is then calculated as,
(4)
where
\(k=b-{\hbox{ a }}^{\hbox{ T }}{\hbox{ S }}_{n-1}^{-1}\hbox{ a }\)
. We only need to invert S1 for the starting gene and recursively calculate the inverse of Sn by the above formula (4). Singular value decomposition can be used to get the general inverse of covariance matrix when the iterative method fails.

2.3 Resampling

In microarray experiments, it is not unusual that some data points are missing due to poor quality. Large inherent ‘noise’ in microarray data also renders estimation of the pooled sample variance–covariance matrix unrobust and variable. We propose the following procedure to handle such incomplete multivariate data that are used in constructing Hotelling's T2 statistics used in the MFS algorithm:

  • Step 1. Resample N replicates with replacement of subjects from the original data, denoting a replicate as Rr, r = 1,…,N.

  • Step 2. Calculate T2 statistic for each Rr, denoting

    \({T}_{r}^{2},r=1,\dots ,N\)
    ⁠.

  • Step 3. Exclude 5% the lower and upper tails of

    \({T}_{r}^{2}\)
    ⁠, and obtain the mean T2 using the remaining
    \({T}_{r}^{2}\)
    .

The number of replicates N could be as large as possible but is limited by computational capability, however. This resampling strategy can alleviate statistical fluctuation and reduce sampling errors. It can help identify a consistent subset of DEGs.

3 VALIDATION OF HOTELLING'S T2 STATISTIC

3.1 Human genome U95 spike-in dataset

In the course of developing and validating the Affymetrix Microarray Suite (MAS) 5.0 algorithm, Affymetrix produced and provided data (containing 12 640 genes) from a set of 59 arrays (HGU95) organized in a Latin-square design (http://www.affymetrix.com/support/technical/sample_data/datasets.affx). This dataset consists of 14 spike-in gene groups at known concentrations in 14 experimental groups, comprising 12 groups of three replicates (A–L) and two groups of 12 replicates (group M–P and group Q–T). In our analyses, the latter two groups of 12 replicates were used to validate our method for identifying DEGs. The correlations among 14 genes, measured by their concentrations, ranged from −0.965 to 0.924. Since a single RNA source was used, any probe-set not in the list of 14 genes should be negative for differential expression. Conversely, all of the probes in the list of 14 genes should be positive for differential expression.

3.2 Identification of DEGs

The probe levels were obtained by Robust Multi-array Analysis (RMA) (Irizarry et al., 2003) and MAS 5.0, respectively. The significance level 0.001 was adopted in the analyses. The t-test resulted in 4 false negatives and 7 false positives using the expression levels from RMA, and 2 false negatives and 33 false positives using the expression levels from MAS 5.0. The Hotelling's T2 method resulted in 1 false negative and 6 false positives using the expression level from RMA, and 2 false negatives and 16 false positives using the expression levels from MAS 5.0 (Table 1).

4 APPLICATION EXAMPLES

4.1 Human liver cancers

4.1.1 Dataset

We applied our method to publicly available datasets from the study of Chen et al. (2002) who examined gene expression patterns in human liver cancers by cDNA microarrays containing 23 075 clones representing ∼17 400 genes. In their study, they profiled genomic expressions in >200 samples including 102 primary hepatocellular carcinoma (HCC) from 82 patients and 74 non-tumor liver tissues from 72 patients. Two-sample Welch t-statistic was adopted to identify DEGs between two sets of samples. Permutation procedure was used to determine P-values. Genes with permutation P-values <0.001 were considered to be differentially expressed. Data are available online at the Stanford Microarray Database (SMD; http://genome-www5.stanford.edu/).

We chose genes for our analyses using the same standard as the original study (Chen et al., 2002). Namely, all non-flagged array elements for which the fluorescent intensity in each channel was >1.5 times the local background were considered well measured. Genes for which fewer than 75% of measurements across all the samples met this standard were excluded from further analyses (http://genome-www.stanford.edu/hcc/Figures/Materials_and_Methods_v5.pdf). After primary data processing, we performed the following three comparisons using Hotelling's T2 tests:

  1. HCC versus non-tumor liver tissues. We compared gene expression in 82 HCC versus 74 non-tumor liver tissue samples. 11 386 genes with good measured data for more than 62 HCC and more than 56 normal livers were used in our analyses.

  2. HCC with negative versus positive p53 staining. To investigate the relationship between p53 mutations and gene expression programs in HCC, Chen et al. (2002) examined 59 HCC specimens by immunohistochemical staining for p53 protein. They found 23 of these HCC analyzed have positive p53 staining, which has been noted to correlate with p53 mutation or inflammation in HCC (Hsu et al., 1993). 11 744 genes with good measured data for more than 27 HCC with negative p53 staining and more than 17 HCC with positive p53 staining were used in our analyses.

  3. HCC with versus without venous invasion. To identify the role of vascular invasion in tumor spread and metastasis, 81 tumor samples were classified by histopathological evaluation as 43 negative and 38 positive for vascular invasion (Chen et al., 2002). 11 161 genes with good measured data for more than 32 HCC without venous invasion and more than 29 HCC with venous invasion were used in our analyses.

4.1.2 Identification of DEGs

Table 2 summarizes the total numbers of DEGs identified in the above three sub-datasets by the two-sample Welch t statistic in the original paper (Chen et al., 2002) and those by the Hotelling's T2 statistic we proposed. The Hotelling's T2 statistic found more DEGs than the t-test. In the comparison of the HCC versus non-tumor tissues, more than 2000 DEGs were identified by both of the two methods. In the comparison of the HCC with positive versus negative p53 staining, two-thirds of DEGs declared by the t-test were also identified by our Hotelling's T2 statistic. However, a relatively smaller proportion of the DEGs is shared by the two methods in the comparison of the HCC with versus without venous invasion.

Nine bioinformatics resources were used to evaluate the relative performance of the two statistical methods on the basis of our known knowledge about the gene function that has been obtained by various aspects of empirical studies, including biochemical, genetic, epidemiological, pharmacological and physiological. These nine bioinformatics resources include KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.ad.jp/kegg/kegg2.html), MedGENE database (http://hipseq.med.harvard.edu/MEDGENE/), OMIM (Online Mendelian Inheritance in Man™, http://www.ncbi.nlm.nih.gov/Omim/), Atlas of Genetics and Cytogenetics in Oncology and Haematology (http://www.infobiogen.fr/services/chromcancer/index_genes_gc.html), CancerGene Database (http://caroll.vjf.cnrs.fr/cancergene/), Cancer GeneWeb (http://www.cancerindex.org/geneweb//X070601.htm), MTB (Mouse Tumor Biology Database, http://tumor.informatics.jax.org/FMPro), GeneCards™ (http://bioinfo.weizmann.ac.il/cards/) and GenePool (http://www.genscript.com/cgi-bin/products/genome.cgi). In the following, we detail our bioinformatics analyses for the DEGs identified by each of the two methods in the comparison of the HCC versus non-tumor tissues.

Table 3 shows the summary of the numbers of DEGs discovered in several main pathways implicated in high correlation with HCC. The Hotelling's T2 statistic found significantly more DEGs than the t-test in the p53 signaling pathway, cell cycle, apoptosis profile, MAP kinase signaling pathway and DNA repair pathway. For some other pathways related to HCC such as the Wnt signaling pathway, DNA damage signaling pathway, metastasis, TGFb BMP signaling pathway and JAK/STAT signaling pathway, the numbers of DEGs identified by the two methods are very close to each other. Totally, the Hotelling's T2 test identified 31 more DEGs than the t-test in these pathways. The names of these DEGs in these pathways are detailed in Supplementary Table 1.

Table 4 shows grouping categories of the DEGs according to their relationships with HCC as cited in literature using the database MedGENE. We classified the DEGs identified by the two methods into several different categories (Hu et al., 2003b): (1) first-degree associations, i.e. genes that have been directly linked to this disease by gene term search; (2) first-degree associations by gene family term, i.e. genes that have been directly linked to this disease by gene family term search; and (3) second-degree associations, i.e. genes that have never been co-cited with this disease but have been linked (in the same pathway) to at least one first-degree gene. Others are genes that have not been previously associated with this disease. The two-sample Welch t-test identified 26 and 42 DEGs more than the Hotelling's T2 test in the categories of first-degree associations and first-degree associations by gene family term, respectively. However, the Hotelling's T2 test identified 166 more DEGs than the t-test in the category of second degree associations. These results indicate that the Hotelling's T2 test is sensitive to finding genes whose differential expression is not detectable marginally, in addition to the genes found by the one-dimensional criteria. The names of these DEGs are detailed in Supplementary Table 2. In Supplementary Table 2, the statistic for ranking the gene list is relative risk of disease (Hu et al., 2003b).

In our analyses, several reported novel and important cancer-associated genes that were differentially expressed between the HCC and normal tissues were identified by the Hotelling's T2 test but not by the t-test. These genes included, for example, pleiomorphic adenoma gene-like 2 (PLAGL2), budding uninhibited by benzimidazoles 1 homolog beta (BUB1B) and budding uninhibited by benzimidazoles 3 homolog (BUB3), centromere protein F (CENPF), hepatoma-derived growth factor (HDGF), interleukin 1 beta (IL1B) and catenin (cadherin-associated protein) beta 1 (CTNNB1). PLAGL2 is a zinc-finger protein that recognizes DNA and/or RNA. It displays typical biomarkers of neoplastic transformation: (1) lose cell–cell contact inhibition, (2) show anchorage-independent growth and (3) induce tumors in nude mice (Hensen et al., 2002). BUB1B and BUB3 are the mitotic checkpoint genes, which were found over expressed in gastric cancer and associated with tumor cell proliferation (Cahill et al., 1998; Grabsch et al., 2003). CENPF was identified as antigen-inducing novel antibody responses during the course of transition to malignancy (Casiano et al., 1993). RT–PCR and western blot analyses detected increased HDGF expression in malignant hepatoma cell lines (Hu et al., 2003a). Polymorphisms in the IL-1B-511 genetic locus are one of the possible determinants of progression of hepatitis C to HCC (Tanaka et al., 2003). The expression of CTNNB1 is highly correlated with tumor progression and postoperative survival in HCC (Inagawa et al., 2002). A recent study also reported a new non-canonical pathway through which Wnt-5a antagonizes the canonical Wnt pathway by promoting the degradation of beta-catenin (Topol et al., 2003).

4.1.3 Classification of human liver tissues

Each of the above three sub-datasets in human liver cancers was divided into 10 subsets, and the cross validation was repeated 10 times. Each time, one of the 10 subsets was used as a validation set and the other 9 subsets were put together to form a learning set. We applied the most significant genes identified in the learning sets by each of the two methods to classify samples in the validation sets using the following discriminant analyses. The average error rate across all 10 trials was then computed.

The discriminant analyses were based on Mahalanobis distance that accounts for ranges of acceptability (variance) between variables and compensates for interactions (covariance) between variables (Taguchi and Jugulum, 2002). The distance between two samples can be calculated as,

\({D}^{2}(T,{G}_{i})={(X-{\mu }_{i})}^{\hbox{ T }}{V}_{i}^{-1}(X-{\mu }_{i})\)
where i = 1,2 represent group 1 and 2, respectively. D2 is the generalized squared distance of the test sample T from the i-th group Gi. Vi is the within-group covariance matrix of the i-th group. μi is the vector of means of gene expression levels of the i-th group. X is the vector of gene expression levels observed in the test sample T. If D2(T,G1 ) < D2(T,G2 ), the test sample is from group 1, otherwise from group 2.

The average error rates for classification of samples for the two statistics are listed in Table 5. The average error rates from the Hotelling's T2 test are always lower than those from the t-test in all of the three sub-datasets when the numbers of genes used for classification are the same. Furthermore, the average error rates of the Hotelling's T2 test were much more stable than the t-test when different numbers of DEGs are used for classification. For example, for the HCC versus non-tumor tissues, the average error rates of the Hotelling's T2 test varied from 0.05 to 0.07 when the numbers of genes used for classification varied from 29 to 63, while the average error rates of the t-test varied from 0.17 to 0.51. When using the T2 test, the average error rates for the sub-datasets of the HCC with negative versus positive p53 staining and the HCC with versus without venous invasion were higher than that for the HCC versus non-tumor tissues. This is due to the fact that the sample sizes of learning sets in the former two datasets are much smaller than the latter one.

4.2 Human breast cancers

4.2.1 Dataset

We applied our method to another dataset from the study of van't Veer et al. (2002). This dataset contains totally 78 primary breast cancers: 34 from patients who developed distant metastases within 5 years and 44 from patients who continued to be disease-free after a period of at least 5 years. These data were collected for the purpose of finding a prognostic signature of breast cancers in their gene expression profiles.

4.2.2 Cluster analysis

van't Veer et al. (2002) found 70 optimal marker genes as ‘prognosis classifier’. This classifier predicted correctly the outcome of disease for 65 out of the 78 patients (83%), with five poor prognosis and eight good prognosis patients, respectively, assigned to the opposite category (see the Figure 2b in the study of van't Veer). We used our Hotelling's T2 test to find the signature genes and applied the top 70 and 52 genes to classify the respective samples. Hierarchical clustering was used for classifying these 78 primary breast cancers. We obtained 8 out of 78 incorrect classifications, with 4 poor prognosis and 4 good prognosis patients assigned to the opposite category when using the 70-genes classifier (Fig. 1a); while we obtained 5 out of 78 incorrect classifications with 1 poor prognosis and 4 good prognosis patients assigned to the opposite category, when using the 52-gene classifier (Fig. 1b).

5 DISCUSSION

The exponential growth of gene expression data is accompanied by an urgent need for theoretical and algorithmic advances in integrating, analyzing and processing the large amount of valuable information. The Hotelling's T2 statistic presented in this paper is a novel tool for analyzing microarray data. The proposed T2 statistic is a corollary to its original counterpart developed for multivariate analyses (Hotelling, 1947). Our statistic for microarray data analysis possesses two prominent statistical properties. First, our method takes into account multidimensional structure of microarray data. The utilization of the information for gene interactions allows for finding genes whose differential expressions are not marginally detectable in univariate testing methods. Second, the statistic has a close relationship to discriminant analyses for classification of gene expression patterns. Our search algorithm sequentially maximizes gene expression difference/distance between two comparison groups. Inclusion of such a set of DEGs into initial feature variables may increase the power of classification rules. In addition, The Hotelling's T2 test gives one P-value for groups of genes rather than a P-value for each gene. Multiple testing problems are hence much less serious than univariate testing methods, unless too many groups of genes are tested.

We first validated our method by using a spike-in HGU95 dataset from Affymetrix. This dataset consists of 14 spike-in genes at known concentrations. The Hotelling's T2 method gave fewer false positives and negatives than t-test. For example, our method produced 1 false negative and 6 false positives, while the t-test produced 4 false negatives and 7 false positives, using the expression levels from RMA. We then applied our method to the analyses of gene expression patterns in human liver cancers (Chen et al., 2002). Extensive bioinformatics analyses and cross-validation of the DEGs identified in the study illustrated several significant advantages over the univariate t-test. First, our method discovered more DEGs in pathways related to HCC. Though P-values of the t-statistics for some genes do not exceed the threshold for significance level in our study, these genes contribute significantly to the Hotelling's T2 statistic. These genes per se usually show marginal differential expressions but are correlated with other strong DEGs (Chilingaryan et al., 2002). Second, our method identified significantly more DEGs in the category of second-degree associations with HCC. Interestingly, these genes have never been co-cited with HCC in the literature but have been linked to at least one gene that has been directly linked to this disease by gene term search, using the MedGene database (Hu et al., 2003b). Their roles in HCC tumors are worthy of further examination. This may also reflect a potential bias in the literature, where either HCC was investigated on an individual gene-by-gene study or multiple-gene expression profiles were analyzed univariately. In this sense, it is not surprising that the t-test found more DEGs in the categories of first-degree associations since current available databases seem to unduly favor the univariate approaches such as the t-test here. Third, we found several novel cancer-associated genes such as PLAGL2 and BUB1B that were also highly expressed in HCC tumors. They play important roles in the process of tumor formation and development as evidenced in a considerable body of literature (Cahill et al., 1998; Casiano et al., 1993; Grabsch et al., 2003; Hu et al., 2003a; Inagawa et al., 2002; Tanaka et al., 2003; Topol et al., 2003). Fourth, we reduced the misclassification to as low as 5% using <30 genes identified by the T2 tests. This holds great promise in clinical diagnoses and classification of tumors. A large number of training samples, such as those we have examined here, are imperative to establish highly predictive classification functions. Finally, we applied our method to find signature genes in human breast cancers (van't Veer et al., 2002). These signature genes used in hierarchical clustering resulted in higher accuracy of predicting disease outcome.

Recently, several authors proposed multivariate approaches for selecting subsets of DEGs (Chilingaryan et al., 2002; Szabo et al., 2002). In their pioneering efforts, a measure of distance between vectors of gene expressions is defined for simultaneously comparing a set of genes. However, the final chosen DEGs are dependent on the choice of the predetermined cluster size, k: larger values of k typically lead to a larger list of DEGs found in a study (Szabo et al., 2002); while in our method, a variable entering into T2 statistic depends on its excess contribution to the maximization of two-group differences, thus circumventing some of the problems inherent in variable selection. This was implemented by a heuristic MFS algorithm that is designed for selecting a subset of feature vectors in high-dimensional microarray datasets. Another prominent feature of our method is that we developed a resampling-based approach to smoothen out statistical fluctuations of the T2 statistic owing to large variability inherent in microarray data which is very helpful for finding a consistent set of DEGs. The resampling technique can also be used for handling incomplete multivariate data, a common problem in microarray experiments.

More recently, Goeman et al. (2004) developed a score test for testing whether or not some pre-specified groups of genes are differentially expressed. The groups of genes could be those that are involved in a particular biochemical pathway or a genomic region of interest and should be specified before testing. This method is very valuable for testing some known pathways that affect clinical outcome in combination with groups of genes. However, the score statistic merely tests gene-expression differences in pre-specified groups of genes between two tissue types and is not intended for group wise search for DEGs. Intuitively, the Hotelling's T2 statistic can also be used for testing pre-specified group differences and its relative performance requires further investigations in comparison with the score statistic.

One of the stopping rules associated with our search algorithm is the number of genes entering into the T2 statistic is smaller than n1 + n2 − 2, where n1 and n2 are sample sizes of two different tissue types under comparison. This restriction can be released by using principle component analysis (PCA) to reduce the dimensionality of variables entered in the Hotelling's T2 statistic (Mardia et al., 1979). Within the framework of stabilized multivariate tests (Lauter et al., 1996), the Hotelling's T2 statistic can be performed on the basis of linear scores that are derived from the original variables using PCA. Incorporating the PCA technique into our method is valuable, especially for microarray experiments with small sample sizes. However, a small sample size may not warrant multivariate normal distribution of the data, in which permutation tests using the T2 statistic may be necessary.

Table 1

DEGs identified by two methods in human genome U95 spike-in dataset

t-testT  2-test
RMAMAS 5.0RMAMAS 5.0
1024_ata1024_ata37342_s_at684_ata36202_ata1047_s_at
1091_ata1091_ata37420_i_at1091_ata684_ata37777_ata
1552_i_at1552_i_at37777_ata36085_ata36085_ata1547_at
32115_r_at1708_ata38377_at36202_ata1024_ata39981_at
32283_at1991_s_at38513_at36311_ata407_ata33117_r_at
33698_at286_at38518_at40322_ata40322_ata
36085_ata31804_f_at38729_at37777_ata32660_at
36202_ata32660_at38734_at1024_ata38729_at
36311_ata32682_at38997_at38254_at35270_at
36889_ata33117_r_at39058_ata1552_i_at39733_at
38254_at33715_r_at39091_at39058_ata36311_ata
38502_at34540_at39311_at36889_ata36889_ata
38734_ata35270_at39733_at33698_at1091_ata
38953_at35339_at39939_at38734_ata38513_at
39058_ata35986_at40322_ata38953_at36986_at
40322_ata36085_ata407_ata1526_i_at1552_i_at
684_ata36181_at41285_at407_ata38997_at
36200_at41386_i_at38502_at35862_at
36202_ata644_at1708_ata39058_ata
36311_ata677_s_at39939_at
36839_at684_ata38734_ata
36889_ata707_s_at34026_at
36986_at41036_at
t-testT  2-test
RMAMAS 5.0RMAMAS 5.0
1024_ata1024_ata37342_s_at684_ata36202_ata1047_s_at
1091_ata1091_ata37420_i_at1091_ata684_ata37777_ata
1552_i_at1552_i_at37777_ata36085_ata36085_ata1547_at
32115_r_at1708_ata38377_at36202_ata1024_ata39981_at
32283_at1991_s_at38513_at36311_ata407_ata33117_r_at
33698_at286_at38518_at40322_ata40322_ata
36085_ata31804_f_at38729_at37777_ata32660_at
36202_ata32660_at38734_at1024_ata38729_at
36311_ata32682_at38997_at38254_at35270_at
36889_ata33117_r_at39058_ata1552_i_at39733_at
38254_at33715_r_at39091_at39058_ata36311_ata
38502_at34540_at39311_at36889_ata36889_ata
38734_ata35270_at39733_at33698_at1091_ata
38953_at35339_at39939_at38734_ata38513_at
39058_ata35986_at40322_ata38953_at36986_at
40322_ata36085_ata407_ata1526_i_at1552_i_at
684_ata36181_at41285_at407_ata38997_at
36200_at41386_i_at38502_at35862_at
36202_ata644_at1708_ata39058_ata
36311_ata677_s_at39939_at
36839_at684_ata38734_ata
36889_ata707_s_at34026_at
36986_at41036_at

a Spike-in genes.

Table 1

DEGs identified by two methods in human genome U95 spike-in dataset

t-testT  2-test
RMAMAS 5.0RMAMAS 5.0
1024_ata1024_ata37342_s_at684_ata36202_ata1047_s_at
1091_ata1091_ata37420_i_at1091_ata684_ata37777_ata
1552_i_at1552_i_at37777_ata36085_ata36085_ata1547_at
32115_r_at1708_ata38377_at36202_ata1024_ata39981_at
32283_at1991_s_at38513_at36311_ata407_ata33117_r_at
33698_at286_at38518_at40322_ata40322_ata
36085_ata31804_f_at38729_at37777_ata32660_at
36202_ata32660_at38734_at1024_ata38729_at
36311_ata32682_at38997_at38254_at35270_at
36889_ata33117_r_at39058_ata1552_i_at39733_at
38254_at33715_r_at39091_at39058_ata36311_ata
38502_at34540_at39311_at36889_ata36889_ata
38734_ata35270_at39733_at33698_at1091_ata
38953_at35339_at39939_at38734_ata38513_at
39058_ata35986_at40322_ata38953_at36986_at
40322_ata36085_ata407_ata1526_i_at1552_i_at
684_ata36181_at41285_at407_ata38997_at
36200_at41386_i_at38502_at35862_at
36202_ata644_at1708_ata39058_ata
36311_ata677_s_at39939_at
36839_at684_ata38734_ata
36889_ata707_s_at34026_at
36986_at41036_at
t-testT  2-test
RMAMAS 5.0RMAMAS 5.0
1024_ata1024_ata37342_s_at684_ata36202_ata1047_s_at
1091_ata1091_ata37420_i_at1091_ata684_ata37777_ata
1552_i_at1552_i_at37777_ata36085_ata36085_ata1547_at
32115_r_at1708_ata38377_at36202_ata1024_ata39981_at
32283_at1991_s_at38513_at36311_ata407_ata33117_r_at
33698_at286_at38518_at40322_ata40322_ata
36085_ata31804_f_at38729_at37777_ata32660_at
36202_ata32660_at38734_at1024_ata38729_at
36311_ata32682_at38997_at38254_at35270_at
36889_ata33117_r_at39058_ata1552_i_at39733_at
38254_at33715_r_at39091_at39058_ata36311_ata
38502_at34540_at39311_at36889_ata36889_ata
38734_ata35270_at39733_at33698_at1091_ata
38953_at35339_at39939_at38734_ata38513_at
39058_ata35986_at40322_ata38953_at36986_at
40322_ata36085_ata407_ata1526_i_at1552_i_at
684_ata36181_at41285_at407_ata38997_at
36200_at41386_i_at38502_at35862_at
36202_ata644_at1708_ata39058_ata
36311_ata677_s_at39939_at
36839_at684_ata38734_ata
36889_ata707_s_at34026_at
36986_at41036_at

a Spike-in genes.

Table 2

Comparison of the DEGs discovered by the two methods in human liver cancers

Datasetst-testT  2-testShared
HCC versus non-tumor (82/74)396445082051
HCC with negative versus positive p53 staining (36/23)12114683
HCC without versus with venous invasion (43/38)9115134
Datasetst-testT  2-testShared
HCC versus non-tumor (82/74)396445082051
HCC with negative versus positive p53 staining (36/23)12114683
HCC without versus with venous invasion (43/38)9115134
Table 2

Comparison of the DEGs discovered by the two methods in human liver cancers

Datasetst-testT  2-testShared
HCC versus non-tumor (82/74)396445082051
HCC with negative versus positive p53 staining (36/23)12114683
HCC without versus with venous invasion (43/38)9115134
Datasetst-testT  2-testShared
HCC versus non-tumor (82/74)396445082051
HCC with negative versus positive p53 staining (36/23)12114683
HCC without versus with venous invasion (43/38)9115134

Table 3

Number of DEGs discovered in various pathophysiological pathways by the two statistical methods in human liver cancers

PathwaysOnly t-testOnly T2-testBoth methods
p53 signaling pathway81615
Wnt signaling pathway211923
Cell cycle162229
Apoptosis profile122122
Angiogenesis15722
MAP kinase signaling pathway243957
Growth factors7410
Signal transduction in cancer201323
DNA damage signaling pathway131312
Stress and toxicity171221
Metastasis161322
DNA repair pathway42311
TGFb BMP signaling pathway141321
JAK/STAT signaling pathway131621
Total200231309
PathwaysOnly t-testOnly T2-testBoth methods
p53 signaling pathway81615
Wnt signaling pathway211923
Cell cycle162229
Apoptosis profile122122
Angiogenesis15722
MAP kinase signaling pathway243957
Growth factors7410
Signal transduction in cancer201323
DNA damage signaling pathway131312
Stress and toxicity171221
Metastasis161322
DNA repair pathway42311
TGFb BMP signaling pathway141321
JAK/STAT signaling pathway131621
Total200231309

Note: In Tables 3 and 4, ‘both methods’ indicates the number of DEGs discovered by both the Hotelling's T2 and t-tests. ‘Only t-test’ indicates the number of DEGs that were only found by t-test but not by the Hotelling's T2 test. While ‘Only Hotelling's T2’ indicates the number of DEGs that were only found by the Hotelling's T2 test but not by t-test.

Table 3

Number of DEGs discovered in various pathophysiological pathways by the two statistical methods in human liver cancers

PathwaysOnly t-testOnly T2-testBoth methods
p53 signaling pathway81615
Wnt signaling pathway211923
Cell cycle162229
Apoptosis profile122122
Angiogenesis15722
MAP kinase signaling pathway243957
Growth factors7410
Signal transduction in cancer201323
DNA damage signaling pathway131312
Stress and toxicity171221
Metastasis161322
DNA repair pathway42311
TGFb BMP signaling pathway141321
JAK/STAT signaling pathway131621
Total200231309
PathwaysOnly t-testOnly T2-testBoth methods
p53 signaling pathway81615
Wnt signaling pathway211923
Cell cycle162229
Apoptosis profile122122
Angiogenesis15722
MAP kinase signaling pathway243957
Growth factors7410
Signal transduction in cancer201323
DNA damage signaling pathway131312
Stress and toxicity171221
Metastasis161322
DNA repair pathway42311
TGFb BMP signaling pathway141321
JAK/STAT signaling pathway131621
Total200231309

Note: In Tables 3 and 4, ‘both methods’ indicates the number of DEGs discovered by both the Hotelling's T2 and t-tests. ‘Only t-test’ indicates the number of DEGs that were only found by t-test but not by the Hotelling's T2 test. While ‘Only Hotelling's T2’ indicates the number of DEGs that were only found by the Hotelling's T2 test but not by t-test.

Table 4

Categories of the DEGs discovered by the two methods for the comparison of HCC versus non-tumor tissues

CategoriesOnly t-testOnly T2-testBoth methods
First-degree associations165139318
First-degree associations by gene family term12785156
Second-degree associations509675571
Total8018991045
CategoriesOnly t-testOnly T2-testBoth methods
First-degree associations165139318
First-degree associations by gene family term12785156
Second-degree associations509675571
Total8018991045
Table 4

Categories of the DEGs discovered by the two methods for the comparison of HCC versus non-tumor tissues

CategoriesOnly t-testOnly T2-testBoth methods
First-degree associations165139318
First-degree associations by gene family term12785156
Second-degree associations509675571
Total8018991045
CategoriesOnly t-testOnly T2-testBoth methods
First-degree associations165139318
First-degree associations by gene family term12785156
Second-degree associations509675571
Total8018991045

Table 5

Error rates of discriminant analyses with DEGs discovered by the two methods in human liver tissues

DatasetsNumber of genes usedError rates
t-testT  2-test
HCC versus non-tumor (82/74)630.510.07
290.170.05
HCC with negative versus positive p53 staining (36/23)420.550.16
190.340.14
HCC without versus with venous invasion (43/38)560.510.23
220.430.18
DatasetsNumber of genes usedError rates
t-testT  2-test
HCC versus non-tumor (82/74)630.510.07
290.170.05
HCC with negative versus positive p53 staining (36/23)420.550.16
190.340.14
HCC without versus with venous invasion (43/38)560.510.23
220.430.18
Table 5

Error rates of discriminant analyses with DEGs discovered by the two methods in human liver tissues

DatasetsNumber of genes usedError rates
t-testT  2-test
HCC versus non-tumor (82/74)630.510.07
290.170.05
HCC with negative versus positive p53 staining (36/23)420.550.16
190.340.14
HCC without versus with venous invasion (43/38)560.510.23
220.430.18
DatasetsNumber of genes usedError rates
t-testT  2-test
HCC versus non-tumor (82/74)630.510.07
290.170.05
HCC with negative versus positive p53 staining (36/23)420.550.16
190.340.14
HCC without versus with venous invasion (43/38)560.510.23
220.430.18

Fig. 1

Hierarchical clustering analyses of 78 primary human breast tumors using the 70-gene classifier (a) and the 52-gene classifier (b). Each row represents a single gene and each column represents a single tumor. <5 represents patients developing distant metastases within 5 years; >5 represents patients continuing to be disease-free after a period of at least 5 years. The samples marked with dashes were incorrectly classified.

Investigators of this work were partially supported by grants from NIH, State of Nebraska and CNSF. The study was also benefited by grants from Huo Ying Dong Education Foundation, the Ministry of Education of China, Xi'an Jiaotong University and Hunan Normal University.

REFERENCES

Alon, U., et al.

1999
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
Proc. Natl Acad. Sci. USA
96
6745
–6750

Baldi, P. and Long, A.D.

2001
A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes.
Bioinformatics
17
509
–519

Brazma, A. and Vilo, J.

2000
Gene expression data analysis.
FEBS Lett.
480
17
–24

Cahill, D.P., et al.

1998
Mutations of mitotic checkpoint genes in human cancers.
Nature
392
300
–303

Casiano, C.A., et al.

1993
Autoantibodies to a novel cell cycle-regulated protein that accumulates in the nuclear matrix during S phase and is localized in the kinetochores and spindle midzone during mitosis.
J. Cell Sci.
106
1045
–1056

Chen, X., et al.

2002
Gene expression patterns in human liver cancers.
Mol. Biol. Cell
13
1929
–1939

Chen, Y., et al.

1997
Ratio-based dicisions and the quantitative analysis of cDNA microarray images.
J. Biomed. Opt.
2
364
–374

Chilingaryan, A., et al.

2002
Multivariate approach for selecting sets of differentially expressed genes.
Math. Biosci.
176
59
–69

Eisen, M.B., et al.

1998
Cluster analysis and display of genome-wide expression patterns.
Proc. Natl Acad. Sci. USA
95
14863
–14868

Goeman, J.J., et al.

2004
A global test for groups of genes: testing association with a clinical outcome.
Bioinformatics
20
93
–99

Golub, T.R., et al.

1999
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science
286
531
–537

Grabsch, H., et al.

2003
Overexpression of the mitotic checkpoint genes BUB1, BUBR1, and BUB3 in gastric cancer–association with tumour cell proliferation.
J. Pathol.
200
16
–22

Hensen, K., et al.

2002
The tumorigenic diversity of the three PLAG family members is associated with different DNA binding capacities.
Cancer Res.
62
1510
–1517

Hotelling, H.

1947
Multivariate Quality Control. In Eisenhart, C., Hastay, M.W., Wallis, W.A. (Eds.).
Techniques of Statistical Analysis
, NY McGraw-Hill

Hsu, H.C., et al.

1993
Expression of p53 gene in 184 unifocal hepatocellular carcinomas: association with tumor growth and invasiveness.
Cancer Res.
53
4691
–4694

Hu, T.H., et al.

2003
Expression of hepatoma-derived growth factor in hepatocellular carcinoma.
Cancer
98
1444
–1456

Hu, Y., et al.

2003
Analysis of genomic and proteomic data using advanced literature mining.
J. Proteome. Res.
2
405
–412

Inagawa, S., et al.

2002
Expression and prognostic roles of beta-catenin in hepatocellular carcinoma: correlation with tumor progression and postoperative survival.
Clin. Cancer Res.
8
450
–456

Irizarry, R.A., et al.

2003
Summaries of Affymetrix GeneChip probe level data.
Nucleic Acids Res.
31
e15

Kerr, M.K., et al.

2000
Analysis of variance for gene expression microarray data.
J. Comput. Biol.
7
819
–837

Lauter, J., et al.

1996
New multivariate tests for data with an inherent structure.
Biometrical J.
38
5
–23

Leung, Y.F. and Cavalieri, D.

2003
Fundamentals of cDNA microarray data analysis.
Trends Genet.
19
649
–659

Mardia, K.V., Kent, J.T., Bibby, J.M.

Multivariate Analysis
1979
, Duluth, London Academic Press

Mason, R.L., et al.

1995
Decomposition of T2 for Multivariate Control Chart Interpretation.
J. Qual. Technol.
27
99
–108

Mehta, T., et al.

2004
Towards sound epistemological foundations of statistical methods for high-dimensional biology.
Nat. Genet.
36
943
–947

Model, F., et al.

2002
Statistical process control for large scale microarray experiments.
Bioinformatics
18
Suppl. 1,
S155
–S163

Slonim, D.K.

2002
From patterns to pathways: gene expression data analysis comes of age.
Nat. Genet.
32
Suppl.,
502
–508

Szabo, A., et al.

2002
Variable selection and pattern recognition with gene expression data generated by the microarray technology.
Math. Biosci.
176
71
–98

Taguchi, G. and Jugulum, R.

The Mahalanobis-Taguchi Strategy
2002
, New York Wiley

Tanaka, Y., et al.

2003
Impact of interleukin-1beta genetic polymorphisms on the development of hepatitis C virus-related hepatocellular carcinoma in Japan.
J. Infect. Dis.
187
1822
–1825

Topol, L., et al.

2003
Wnt-5a inhibits the canonical Wnt pathway by promoting GSK-3-independent beta-catenin degradation.
J. Cell Biol.
162
899
–908

Tusher, V.G., et al.

2001
Significance analysis of microarrays applied to the ionizing radiation response.
Proc. Natl Acad. Sci. USA
98
5116
–5121

van't Veer, L.J., et al.

2002
Gene expression profiling predicts clinical outcome of breast cancer.
Nature
415
530
–536

Wang, S. and Ethier, S.

2004
A generalized likelihood ratio test to identify differentially expressed genes from microarray data.
Bioinformatics
20
100
–104

Wettenhall, J.M. and Smyth, G.K.

2004
limmaGUI: a graphical user interface for linear modeling of microarray data.
Bioinformatics
20
3705
–3706

Xiong, M., et al.

2002
Generalized T2 test for genome association studies.
Am. J. Hum. Genet.
70
1257
–1268