Abstract

Sample classification and class prediction are the aims of many gene expression studies. We present a web-based application, Prophet, which builds prediction rules and allows them to be used for further sample classification. Prophet automatically chooses the best classifier, along with the optimal selection of genes, using a strategy that yields unbiased cross-validated errors. Prophet is linked to different microarray data analysis modules, and includes a unique feature: the possibility of performing the functional interpretation of the molecular signature found.

Availability: Prophet can be found at the URL or within the GEPAS package at

Contact: jdopazo@cipf.es

Supplementary information:

BACKGROUND

One of the crucial factors behind the success of DNA microarray technologies has been their application to the definition of predictors of clinical outcomes (van 't Veer et al., 2002). Although not free from criticism (Simon, 2005), the practical implications of this particular goal have definitely fuelled the use of microarrays. Common errors in the early proposals of predictors, such as selection bias (Ambroise and McLachlan, 2002; Simon et al., 2003), which causes unrealistic, downward-biased error estimates, are behind the above-mentioned criticisms. Recently, proper strategies for unbiased cross-validation have been proposed. The estimation of the classification errors must take into account the gene selection step as well as any other parallel step taken, such as the optimization of the number of selected genes, the selection among various classifiers, etc. However, it is still frequent to find publications in which this important fact has not been taken into account (Ambroise and McLachlan, 2002). At the root of this widespread conceptual error is, probably, the lack of easy-to-use, accurate and freely available solutions that allow end users to carry out such analyses.

Prophet aims to meet the demand for a simple but powerful tool for prediction purposes in the microarray context. Since web-based solutions are gaining acceptance in the microarray community for data analysis purposes (see for example: ), Prophet was conceived to be accessible over the web. To our knowledge, the only other equivalent web-based tool available is M@CBETH (Pochet et al., 2005). However, this program can only handle two-class problems. Moreover, since PCA is used to reduce the dimension of the data, the identity of the genes is lost and only the principal components can be retrieved from the predictor. Finally, only support vector machines (SVMs) and Fisher's discriminant analysis can be used as classification algorithms.

BUILDING THE PREDICTOR AND PREDICTING

Prophet has two main options: ‘train’ and ‘predict’. The first (corresponding to the training step) is used to build the predictor, while the second applies the predictor found to new samples in order to predict their class membership.

Prophet builds a prediction rule based on genes. There are several options for defining the dataset of genes to be used for training the predictors. Prophet accepts user-defined selections of genes or, alternatively, it can find the optimal subset within the whole set of genes. For the second option, also known as the ‘filter approach’ in the machine learning literature, Prophet pre-selects the genes which will potentially provide more accuracy to the predictor. Two ways of ranking genes for subsequent selection have been implemented: the F-ratio (Dudoit et al., 2002) and the Wilcoxon statistic, a non-parametric test for differences between two classes. These can be used in combination with any of the class-prediction algorithms implemented in Prophet, which have been shown to perform very efficiently with microarray data (Dudoit et al., 2002; Romualdi et al., 2003; Wessels et al., 2005). The methods are: SVM (Vapnik, 1999), k-nearest neighbor (KNN), diagonal linear discriminant analysis (DLDA), SOM (Kohonen, 1997) and shrunken centroids (PAM) (Tibshirani et al., 2002).
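As an illustration of the F-ratio ranking mentioned above, the following is a minimal sketch, assuming the statistic is the ratio of between-class to within-class sums of squares computed gene by gene. The function names (f_ratio, rank_genes) and the toy data are hypothetical and are not taken from Prophet's implementation.

```python
from collections import defaultdict


def f_ratio(values, labels):
    """Between-class / within-class sum of squares for a single gene."""
    overall = sum(values) / len(values)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    bss = wss = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        bss += len(members) * (mean - overall) ** 2
        wss += sum((v - mean) ** 2 for v in members)
    if wss == 0.0:
        # No within-class spread: perfect separation, or a constant gene.
        return float("inf") if bss > 0.0 else 0.0
    return bss / wss


def rank_genes(expr, labels):
    """expr maps a gene identifier to its expression values, one per sample.

    Returns gene identifiers ordered from most to least discriminative.
    """
    return sorted(expr, key=lambda g: f_ratio(expr[g], labels), reverse=True)


# Toy data (hypothetical): 'marker' separates the classes, 'noise' does not.
labels = ["A", "A", "B", "B"]
expr = {
    "noise": [1.0, 5.0, 1.2, 5.1],
    "marker": [1.0, 1.1, 5.0, 5.2],
}
ranking = rank_genes(expr, labels)
```

The n top entries of such a ranking would then be passed to the chosen classifier.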

The ‘train’ option of the Prophet form implements the strategy for finding the best predictor with the optimal number of genes. A leave-one-out (LOO) cross-validation strategy is implemented here to return the cross-validated error rate of the complete process of building several predictors and then choosing the one with the smallest error rate. The procedure is as follows: a LOO sample is drawn from the training dataset. Genes are ranked by one of the above-mentioned methods (F-ratio or Wilcoxon statistic) and, using the n top genes (n = 2, 5, 10, 20, 35, 50, 75 and 100, by default), a predictor is built with each of the above-mentioned classification methods (KNN, DLDA, SVM, PAM, SOM or a sub-selection of them). Then, the LOO error is calculated for each method and each set of n genes. Finally, the smallest set of n genes in combination with the method that results in the smallest CV error is reported. The results include a plot of the CV error across the range of sets of n genes for all the classification methods tried, along with the corresponding confusion matrices (very useful for detecting asymmetries in the determination of classes). In addition, the prediction for each LOO sample is provided, which is quite useful for detecting outliers or anomalous misassignments. More detailed information and examples are available in the tutorial page at, . Finally, all the supplementary information was included in the tutorial.
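The key point of the procedure above, that genes are re-ranked inside every LOO fold using the training samples only, can be sketched as follows. This is a simplified illustration under stated assumptions, not Prophet's code: a nearest-centroid rule stands in for the classifiers listed (KNN, DLDA, SVM, PAM, SOM), only one classifier is tried, the grid of n values is shortened, and all function names and data are hypothetical. It also does not attempt to cross-validate the final choice of n itself.

```python
from collections import defaultdict


def f_ratio(values, labels):
    """Between-class / within-class sum of squares for a single gene."""
    overall = sum(values) / len(values)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    bss = wss = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        bss += len(members) * (mean - overall) ** 2
        wss += sum((v - mean) ** 2 for v in members)
    if wss == 0.0:
        return float("inf") if bss > 0.0 else 0.0
    return bss / wss


def nearest_centroid(train_x, train_y, sample, genes):
    """Predict the class whose mean profile over `genes` is closest to `sample`."""
    members = defaultdict(list)
    for row, label in zip(train_x, train_y):
        members[label].append(row)

    def distance(label):
        rows = members[label]
        return sum(
            (sample[g] - sum(r[g] for r in rows) / len(rows)) ** 2 for g in genes
        )

    return min(sorted(members), key=distance)


def loo_errors(x, y, sizes):
    """x: samples as lists of expression values; y: class labels.

    Returns {n: LOO error rate} with gene ranking redone in every fold.
    """
    wrong = {n: 0 for n in sizes}
    n_genes = len(x[0])
    for i in range(len(x)):
        # Leave sample i out BEFORE ranking genes, to avoid selection bias.
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        scores = [f_ratio([row[g] for row in train_x], train_y) for g in range(n_genes)]
        ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
        for n in sizes:
            if nearest_centroid(train_x, train_y, x[i], ranked[:n]) != y[i]:
                wrong[n] += 1
    return {n: wrong[n] / len(x) for n in sizes}


def smallest_best(errors):
    """Smallest n among those reaching the minimum error."""
    return min(errors, key=lambda n: (errors[n], n))


# Toy data (hypothetical): gene 0 is informative, genes 1 and 2 are not.
X = [[1.0, 5.0, 3.0], [1.2, 3.0, 5.0], [0.9, 4.0, 4.0],
     [5.0, 5.0, 3.1], [5.1, 3.0, 5.2], [4.8, 4.0, 4.1]]
Y = ["A", "A", "A", "B", "B", "B"]
errors = loo_errors(X, Y, [1, 2, 3])
best_n = smallest_best(errors)
```

Ranking the genes once on the full dataset and only then cross-validating the classifier would let the left-out sample influence the gene selection, which is the selection bias discussed in the Background.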

Once the optimal predictor (combination of a set of n genes and a classification method) has been found, it can be saved. Then, in the ‘predict’ option of the form, the predictor can be retrieved and applied to new samples and a class membership prediction will be obtained for them.

The input file format is quite simple: a tab-delimited text file with genes in rows and experiments in columns. The first column corresponds to the gene identifiers. Individual experiment identifiers as well as class identifiers can be provided in a separate file or within the main file with the corresponding labels (#NAME and #CLASS, respectively, see tutorial and Supplementary information for details).
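A reader for this kind of file might look like the sketch below. The exact layout is an assumption made for illustration: it supposes that the #NAME and #CLASS labels each start a tab-delimited row listing one value per experiment, that other rows starting with '#' can be skipped, and that there are no missing values. The tutorial remains the authoritative description of the format; the function name and sample text are hypothetical.

```python
def parse_expression_text(text):
    """Parse tab-delimited text with genes in rows and experiments in columns.

    Assumed layout (hypothetical): optional '#NAME' and '#CLASS' rows carrying
    experiment and class identifiers, then one row per gene whose first field
    is the gene identifier.
    """
    names, classes, genes = [], [], {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if fields[0] == "#NAME":
            names = fields[1:]
        elif fields[0] == "#CLASS":
            classes = fields[1:]
        elif fields[0].startswith("#"):
            continue  # any other label row is ignored in this sketch
        else:
            genes[fields[0]] = [float(v) for v in fields[1:]]
    return names, classes, genes


sample_text = (
    "#NAME\ts1\ts2\ts3\n"
    "#CLASS\ttumor\ttumor\tnormal\n"
    "geneA\t1.5\t2.0\t-0.3\n"
    "geneB\t0.1\t0.4\t2.2\n"
)
names, classes, genes = parse_expression_text(sample_text)
```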

Prophet is integrated within the GEPAS (Herrero et al., 2003; Montaner et al., 2006) environment, so a complete analysis of the microarray data, from the first steps of normalization and preprocessing, can be performed without having to switch among different programs with different input/output formats. Another unique feature is the possibility of obtaining a functional interpretation of the genes included in the predictor. This is achieved through tools such as FatiGO+ (Al-Shahrour et al., 2004) and others, included in the Babelomics package (Al-Shahrour et al., 2005, 2006), to which Prophet is also linked.

In addition to the web interface, Prophet can be invoked as a web service.

To summarize, Prophet provides an accurate, conceptually correct and easy-to-use framework for building predictors based on microarray gene expression data that can be later used to predict class membership for new samples. Moreover, this is the only web-based tool that builds predictors based on genes and allows a further functional interpretation of the results.

This work is supported by grants from NRC Canada-SEPOCT Spain, project BIO 2005-01078 from the MEC and INDIGO EU project. The Functional Genomics node (INB) is supported by Genoma España. Funding to pay the Open Access publication charges for this article was provided by Genoma España.

Conflict of Interest: none declared.

REFERENCES

Al-Shahrour F., et al. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes, Bioinformatics, 2004, vol. 20 (pg. 578-580)
Al-Shahrour F., et al. BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments, Nucleic Acids Res., 2005, vol. 33 (pg. W460-W464)
Al-Shahrour F., et al. BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments, Nucleic Acids Res., 2006, vol. 34 (pg. W472-W476)
Ambroise C., McLachlan G.J. Selection bias in gene extraction on the basis of microarray gene-expression data, Proc. Natl Acad. Sci. USA, 2002, vol. 99 (pg. 6562-6566)
Dudoit S., et al. Comparison of discrimination methods for the classification of tumors using gene expression data, J. Am. Stat. Assoc., 2002, vol. 97 (pg. 77-87)
Herrero J., et al. GEPAS: A web-based resource for microarray gene expression data analysis, Nucleic Acids Res., 2003, vol. 31 (pg. 3461-3467)
Kohonen T. Self-organizing Maps, 1997, Berlin, Springer-Verlag
Montaner D., et al. Next station in microarray data analysis: GEPAS, Nucleic Acids Res., 2006, vol. 34 (pg. W486-W491)
Pochet N.L., et al. M@CBETH: a microarray classification benchmarking tool, Bioinformatics, 2005, vol. 21 (pg. 3185-3186)
Romualdi C., et al. Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification, Hum. Mol. Genet., 2003, vol. 12 (pg. 823-836)
Simon R. Roadmap for developing and validating therapeutically relevant genomic classifiers, J. Clin. Oncol., 2005, vol. 23 (pg. 7332-7341)
Simon R., et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification, J. Natl Cancer Inst., 2003, vol. 95 (pg. 14-18)
Tibshirani R., et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proc. Natl Acad. Sci. USA, 2002, vol. 99 (pg. 6567-6572)
van 't Veer L.J., et al. Gene expression profiling predicts clinical outcome of breast cancer, Nature, 2002, vol. 415 (pg. 530-536)
Vapnik V. Statistical Learning Theory, 1999, New York, John Wiley and Sons
Wessels L.F., et al. A protocol for building and evaluating predictors of disease state based on microarray data, Bioinformatics, 2005, vol. 21 (pg. 3755-3762)

Author notes

Associate Editor: Chris Stoeckert
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
