Gene Expression Profile Analysis Suite (GEPAS) is one of the most complete and extensively used web-based packages for microarray data analysis. During its more than 5 years of activity it has been continuously updated to keep pace with the state of the art in the changing microarray data analysis arena. GEPAS offers diverse analysis options that include well-established as well as novel algorithms for normalization, gene selection, class prediction, clustering and functional profiling of the experiment. New options for time-course (or dose-response) experiments, microarray-based class prediction, new clustering methods and new tests for differential expression have been included. The new pipeliner module allows automating the execution of sequential analysis steps by means of a simple but powerful graphic interface. An extensive re-engineering of GEPAS has been carried out, which includes the use of web services and Web 2.0 technology features, a new user interface with persistent sessions and a new extended database of gene identifiers. GEPAS is currently the most cited web tool in its field. It is extensively used by researchers in many countries, and its records indicate an average usage rate of 500 experiments per day. GEPAS is available at http://www.gepas.org .
Since their introduction in the mid 1990s ( 1 ), microarrays have revolutionized the way in which the research community addresses biological problems. Their success relies on their application to classifying types of tumours ( 2 ), predicting disease outcome ( 3 ) or even the response to treatments ( 4 ). These practical applications of microarrays, although not free of criticism ( 5 ), have definitely fuelled the use of the methodology. In this scenario, the real bottleneck in the use of microarray technologies comes from the data analysis step ( 6 ). The web-based package Gene Expression Profile Analysis Suite (GEPAS) has been growing during the last 5 years ( 7–10 ), trying to keep pace with the state of the art in algorithms for high-throughput gene expression data analysis as well as responding to the demands of the microarray community.
Although originally designed to analyse microarray data, the most important modules of GEPAS are not tied to the technology or to the microarray platforms used to extract the data on gene expression. GEPAS is rather oriented to analyse high-throughput gene expression data and to test different types of genome-scale hypotheses.
GEPAS is not a web server for a single tool; it constitutes one of the largest resources for integrated microarray data analysis available over the web. GEPAS is used by researchers worldwide, as can be seen in the usage map, where all the sessions are mapped to their geographic location ( http://bioinfo.cipf.es/access_map/map.html ). By the end of 2007, an average of 500 experiments per day were being analysed in GEPAS. The recent release 4.0 presented here includes new modules, new tests in existing modules, technical improvements (GEPAS is now based on web services technology and includes Web 2.0 features) and a more powerful and intuitive interface which includes graphical tools to define workflows and persistent private sessions.
GEPAS has been designed for the analysis of high-throughput gene expression data. Today this means microarray data analysis, but this situation might change in the future and the data could come from different platforms or technologies. Although some of its modules are platform dependent, the core of GEPAS aims to analyse gene expression data and test hypotheses with them in a simple but rigorous way.
Many different biological questions can be addressed through gene-expression experiments. Nevertheless, there are usually three types of objectives in this context: ‘class comparison’, ‘class prediction’ and ‘class discovery’ ( 6 ). The first two objectives fall into the category of supervised methods and usually involve the application of tests to define differentially expressed genes, or the use of different procedures to predict class membership on the basis of the values observed for a number of ‘key’ genes. Clustering methods belong to the last category, also known as unsupervised analysis, because no previous information about the class structure of the data set is used in the study. Accordingly, GEPAS is composed of the following modules:
Normalization and pre-processing
GEPAS implements normalization facilities for both two-colour and Affymetrix arrays. Normalization in two-colour arrays is performed using print-tip loess ( 11 ) with a number of different options. Affymetrix CEL files are normalized using standard Bioconductor ( 12 ) tools, in particular the package affy ( 13 ). Besides its friendly web interface, GEPAS provides the user with the speed and, above all, the physical memory available in our server. In addition, the pre-processor ( 14 ) module performs some pre-processing of the data (log-transformations, standardizations, imputation of missing values, etc.).
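The kind of operations carried out by the pre-processor can be illustrated with a minimal sketch. It is not the GEPAS implementation: the function names, the use of `None` for missing values and the choice of the gene-wise mean for imputation are assumptions made only for this example.

```python
import math

def log2_transform(row):
    """Log2-transform positive values; missing (None) or non-positive values become None."""
    return [math.log2(v) if v is not None and v > 0 else None for v in row]

def impute_row_mean(row):
    """Replace missing values with the mean of the observed values of the same gene."""
    observed = [v for v in row if v is not None]
    if not observed:
        return list(row)  # nothing to impute from
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in row]

def standardize(row):
    """Centre a complete profile to mean 0 and scale it to unit sample standard deviation."""
    n = len(row)
    if n < 2:
        return [0.0] * n
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    if sd == 0:
        return [0.0] * n  # flat profile
    return [(v - mean) / sd for v in row]

# One gene measured in four arrays, with one missing value (hypothetical data).
raw = [4.0, 16.0, None, 64.0]
logged = log2_transform(raw)
imputed = impute_row_mean(logged)
scaled = standardize(imputed)
```

The three steps are kept as separate functions so that they can be applied in any order or omitted, mirroring the optional nature of each transformation.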
Clustering techniques are used for class discovery either in genes or in experiments. GEPAS includes the best performing clustering methods according to different independent benchmarks ( 15 , 16 ). Among the many methods available, the most extensively used for gene expression data clustering are hierarchical clustering ( 17 ), SOM ( 18 ), SOTA ( 19 ) and K-means ( 20 ). It is worth mentioning that the version of SOM implemented here can automatically find the optimal number of clusters ( 21 ). The evaluation of cluster quality, a rarely addressed issue, has been implemented here using the silhouette method ( 22 ), which performs well in noisy situations such as microarray data ( 23 ), along with some descriptive measures for each cluster partition (average profiles, standard deviation profiles, inter- and intra-cluster distances).
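As an illustration of how the silhouette method scores a partition, the following sketch computes the silhouette width of each profile as (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The Euclidean distance and the convention of assigning 0 to singleton clusters are assumptions of this example, not a description of the GEPAS code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_widths(points, labels):
    """Return one silhouette width per point for the given cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            widths.append(0.0)  # convention for singletons or a single cluster
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Two well-separated groups of profiles (hypothetical data).
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
good = silhouette_widths(profiles, [0, 0, 1, 1])
bad = silhouette_widths(profiles, [0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```

Widths close to 1 indicate profiles that sit well inside their cluster, whereas negative widths suggest a profile that is closer to another cluster, so the mean width can be used to compare alternative partitions.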
Differential gene expression
GEPAS implements tests for finding genes with significant differences in expression between two or more classes, related to a continuous experimental factor (e.g. the concentration of a metabolite) or to survival data. For two-class comparisons, GEPAS implements the popular t-test, the empirical Bayes test ( 24 ), the CLEAR-test that combines differential expression and variability ( 25 ), the data-adaptive test ( 26 ) and the SAM test ( 27 ). For comparisons involving more than two classes GEPAS uses the classical ANOVA. In order to find genes whose expression is significantly correlated to a continuous variable (e.g. the level of a metabolite), regression analysis and estimates of Pearson's and Spearman's correlation coefficients can be obtained. Finally, for finding genes whose expression is related to survival times GEPAS estimates a Cox proportional hazards regression model ( 28 ). Right-censored data are allowed, as are replicated survival times. Censoring variables should be provided by the researcher together with the survival times.
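A two-class comparison for a single gene can be sketched as follows. The statistic is the classical Student's t with pooled variance. Because the Python standard library offers no t distribution, the P value in this example is obtained from an exact permutation of the class labels, which is a substitute chosen for the sketch and not necessarily what GEPAS computes. All names and data are hypothetical.

```python
import math
from itertools import combinations

def t_statistic(x, y):
    """Student's t statistic (pooled variance) for two groups of values."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    se = math.sqrt(ss / (nx + ny - 2) * (1 / nx + 1 / ny))
    if se == 0:
        return 0.0 if mx == my else math.copysign(math.inf, mx - my)
    return (mx - my) / se

def permutation_p_value(x, y):
    """Two-sided exact permutation P value based on |t| (small samples only)."""
    pooled = list(x) + list(y)
    observed = abs(t_statistic(x, y))
    n = len(pooled)
    hits = total = 0
    for idx in combinations(range(n), len(x)):
        chosen = set(idx)
        px = [pooled[i] for i in idx]
        py = [pooled[i] for i in range(n) if i not in chosen]
        if abs(t_statistic(px, py)) >= observed - 1e-9:
            hits += 1
        total += 1
    return hits / total

# Expression of one gene in four arrays per class (hypothetical data).
class_a = [1.0, 1.2, 0.9, 1.1]
class_b = [3.0, 3.2, 2.9, 3.1]
t_value = t_statistic(class_a, class_b)
p_value = permutation_p_value(class_a, class_b)
```

With four arrays per class there are only 70 possible label assignments, so the smallest attainable two-sided P value is 2/70; this granularity is one reason why parametric or moderated tests are usually preferred for small designs.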
When appropriate, P values adjusted for multiple testing are provided. Three methodologies are implemented. One of them controls the FWER (family-wise error rate) ( 29 ) while the others control the FDR (false discovery rate) ( 30 ).
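The difference between the two kinds of error control can be seen in a short sketch. The text above does not name the three procedures, so the Bonferroni correction (FWER) and the Benjamini-Hochberg step-up procedure (FDR) are used here purely as familiar representatives of each family; they are assumptions of this example.

```python
def bonferroni(p_values):
    """FWER control: multiply every P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """FDR control: step-up adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Raw P values for four genes (hypothetical data).
raw_p = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(raw_p)
fdr_adjusted = benjamini_hochberg(raw_p)
```

For the same input the FDR-adjusted values are never larger than the Bonferroni ones, which reflects the less conservative nature of FDR control when thousands of genes are tested.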
A new module for class prediction ( 31 ) has been implemented. The module includes different classifiers, such as diagonal linear discriminant analysis (DLDA) ( 32 ), k-nearest neighbour (KNN) ( 33 ), support vector machines (SVM) ( 34 ), SOM ( 18 ) and shrunken centroids (PAM) ( 35 ) of well-known efficiency as class predictors using microarray data ( 32 ). Cross-validation error is calculated in such a way as to avoid the well-known selection bias problem ( 36 ). See ref. ( 31 ) for details. Once the model has been trained it can be used for further prediction of new samples. This implementation is unique among similar programmes.
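The selection bias problem arises when genes are chosen using all the samples and the classifier is then cross-validated on those same samples. The sketch below shows the unbiased layout, in which gene selection is repeated inside every leave-one-out fold. The nearest-centroid classifier and the ranking of genes by difference of class means are simple stand-ins chosen for brevity; they are not the classifiers or the selection criterion of the GEPAS predictor, and the sketch assumes exactly two classes.

```python
def select_genes(samples, labels, k):
    """Rank genes by absolute difference of class means, using the given samples only."""
    classes = sorted(set(labels))
    group_a = [s for s, l in zip(samples, labels) if l == classes[0]]
    group_b = [s for s, l in zip(samples, labels) if l == classes[1]]
    scores = []
    for g in range(len(samples[0])):
        mean_a = sum(s[g] for s in group_a) / len(group_a)
        mean_b = sum(s[g] for s in group_b) / len(group_b)
        scores.append(abs(mean_a - mean_b))
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]

def class_centroids(samples, labels, genes):
    """Mean profile of each class restricted to the selected genes."""
    result = {}
    for lab in set(labels):
        members = [s for s, l in zip(samples, labels) if l == lab]
        result[lab] = [sum(s[g] for s in members) / len(members) for g in genes]
    return result

def predict(sample, centroids, genes):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def distance(centroid):
        return sum((sample[g] - centroid[i]) ** 2 for i, g in enumerate(genes))
    return min(centroids, key=lambda lab: distance(centroids[lab]))

def loocv_error(samples, labels, k):
    """Leave-one-out error with gene selection redone inside each fold."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        genes = select_genes(train_x, train_y, k)  # the held-out sample is never seen
        centroids = class_centroids(train_x, train_y, genes)
        if predict(samples[i], centroids, genes) != labels[i]:
            errors += 1
    return errors / len(samples)

# Six arrays, three genes; only the first gene separates the classes (hypothetical data).
arrays = [
    [1.0, 5.0, 3.0], [1.2, 4.0, 3.1], [0.9, 6.0, 2.9],
    [3.0, 5.5, 3.0], [3.1, 4.5, 3.2], [2.9, 5.0, 2.8],
]
classes = ["A", "A", "A", "B", "B", "B"]
error_rate = loocv_error(arrays, classes, k=1)
```

The essential point is the position of the `select_genes` call: moving it outside the loop would let the held-out sample influence which genes are used and would make the estimated error optimistic.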
Time-course and dose–response gene-expression experiments
A new module for the analysis of multi-series time-course and dose–response microarray experiments has been added. In this type of experiment, the researcher aims to study gene expression changes across time or across dosages and to evaluate trend differences between the various experimental groups ( 37 ).
This module implements and extends the maSigPro statistical approach for the study of gene expression changes along time and the specific trend differences between various experimental groups ( 38 ). The method is a two-step regression approach in which individual series are identified by dummy variables. The procedure first fits a global regression model which considers all experimental series and a maximum complexity in the time/dosage-dependent response. This first step identifies differentially expressed genes at a given false positive control rate. In the second step, a variable selection method is applied to find the best model for each gene and to analyse particular significant profile differences between series. Finally, significant genes are clustered and displayed showing these trend differences.
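The role of the dummy variables in the global regression model can be made concrete with a small sketch that builds one row of a design matrix for a polynomial time response. It only illustrates the general idea (a reference series plus one 0/1 indicator per additional series, each interacting with every power of time); the column ordering, the function names and the default degree are assumptions, and no model fitting or variable selection is attempted.

```python
def design_row(time, series, series_names, degree=2):
    """Return one design-matrix row for a sample taken at `time` in `series`.

    series_names[0] is treated as the reference series; every other series
    gets a 0/1 dummy variable and a dummy-by-time interaction per power.
    """
    dummies = [1.0 if series == name else 0.0 for name in series_names[1:]]
    row = [1.0] + dummies
    for power in range(1, degree + 1):
        t = float(time) ** power
        row.append(t)
        row.extend(t * d for d in dummies)
    return row

def column_names(series_names, degree=2):
    """Human-readable names for the columns produced by design_row."""
    reference, others = series_names[0], series_names[1:]
    names = ["intercept"] + [f"{s}_vs_{reference}" for s in others]
    for power in range(1, degree + 1):
        term = "time" if power == 1 else f"time^{power}"
        names.append(term)
        names.extend(f"{term}x{s}" for s in others)
    return names

# Two series sampled at the same time point (hypothetical experiment).
series = ["control", "treated"]
row_control = design_row(2, "control", series)
row_treated = design_row(2, "treated", series)
names = column_names(series)
```

In such a layout the coefficients of the plain time columns describe the profile of the reference series, while the coefficients of the dummy and interaction columns describe how each other series departs from it, which is what the second regression step examines gene by gene.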
There are many available tools that make use of gene functional annotations to provide an interpretation for the observed global changes in gene expression in microarray experiments ( 39 ). Probably, one of the most complete packages for functional profiling analysis is the Babelomics suite ( 40 , 41 ). This suite of programs for functional annotation of genome-scale experiments has undergone a deep modification described in detail elsewhere (Al-Shahrour, submitted to this issue). Babelomics performs functional enrichment analysis, that is, comparing two lists of genes and testing simultaneously in order to find significant over-abundance of diverse biologically relevant terms that would define functional modules such as GO, KEGG pathways, Interpro motifs or regulatory modules such as Transfac® motifs, CisRed motifs, miRNA binding motifs or other types of modules such as the ones defined by relative abundance in tissues and bioentities extracted from PubMed. All the tests are further adjusted for multiple testing effects ( 42 , 43 ). Additionally, gene set enrichment analysis can be performed using different algorithms ( 44 , 45 ) using several sources of information ( 46 ). The Babelomics suite is fully integrated into GEPAS. Gene expression analyses resulting in lists of genes to be compared (different clusters, genes differentially expressed, etc.) can be submitted to Babelomics for functional enrichment analysis. Moreover, arrangements of genes according to, for example, differential expression or other criteria can be sent to Babelomics to be studied by gene set enrichment analysis. This allows discovering pathways or functional modules of genes that are coordinately activated or deactivated in the experiment studied.
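The comparison of two gene lists for over-representation of a term can be sketched with a one-sided Fisher's exact test, which reduces to the upper tail of the hypergeometric distribution. This is a generic illustration and not the Babelomics code; the function names, the gene identifiers and the term names are hypothetical, and the two lists are assumed not to overlap.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def enrichment_p_values(list1, list2, annotation):
    """One-sided P value per term for over-representation in list1 versus list2.

    `annotation` maps each term to the set of gene identifiers carrying it.
    """
    set1, set2 = set(list1), set(list2)
    universe = set1 | set2
    results = {}
    for term, genes in annotation.items():
        annotated = genes & universe
        in_list1 = len(annotated & set1)
        results[term] = hypergeom_upper_tail(
            in_list1, len(set1), len(annotated), len(universe)
        )
    return results

# 10 selected genes compared with the remaining 90 (hypothetical data).
selected = [f"g{i}" for i in range(10)]
rest = [f"g{i}" for i in range(10, 100)]
terms = {
    "termA": {f"g{i}" for i in range(8)} | {f"g{i}" for i in range(10, 22)},
    "termB": {"g8", "g9"} | {f"g{i}" for i in range(22, 40)},
}
p_by_term = enrichment_p_values(selected, rest, terms)
```

Because one such test is carried out per term, the resulting P values would still need to be adjusted for multiple testing before any term is reported as significant.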
Entry points and data formats
There are two entry points to GEPAS: platform dependent and platform independent. GEPAS accepts and normalizes different types of microarray data, which include Affymetrix CEL files and 13 different two-channel array formats including Agilent, Genepix and others. Once the files are normalized, any type of analysis can be applied. On the other hand, there is another simple format by means of which data from other platforms, other technologies (e.g. SAGE) and even data of a different nature (e.g. proteomics, ChIP-on-chip data) can be input into any of the GEPAS modules. A very simple text file can be used for this purpose: a tab-delimited matrix of numeric gene expression values in which rows correspond to genes and columns to experiments. Information on the experiments can be stored in the first rows, each starting with a # symbol. The first column contains the gene identifiers.
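A file in this platform-independent format can be read with a few lines of code. The sketch below follows only what is stated above (tab-delimited values, rows starting with # for experiment information, gene identifiers in the first column); the label `#NAMES`, the treatment of empty cells as missing values and the function name are assumptions of the example.

```python
import io

def read_expression_matrix(handle):
    """Parse a tab-delimited expression matrix.

    Returns (annotations, gene_ids, values): annotation rows are the lines
    starting with '#', split on tabs; every other non-empty line is a gene
    identifier followed by one value per experiment. Empty or non-numeric
    cells are stored as None.
    """
    annotations, gene_ids, values = [], [], []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            annotations.append(line[1:].strip().split("\t"))
            continue
        fields = line.split("\t")
        gene_ids.append(fields[0])
        row = []
        for cell in fields[1:]:
            try:
                row.append(float(cell))
            except ValueError:
                row.append(None)  # missing or unparsable value
        values.append(row)
    return annotations, gene_ids, values

# A tiny in-memory file with one annotation row and two genes (hypothetical data).
example = io.StringIO(
    "#NAMES\tcond_A\tcond_B\tcond_C\n"
    "gene1\t0.52\t-1.10\t2.03\n"
    "gene2\t1.75\t\t0.40\n"
)
annotations, gene_ids, values = read_expression_matrix(example)
```

The same function works on a real file by passing an object returned by `open(path)` instead of the `io.StringIO` buffer.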
WHAT IS NEW IN VERSION 4.0?
The novelties added to this version have been described in more detail above, in the general overview of the programme. In summary, we have implemented a number of new tests, absent from previous versions, as well as entirely new modules. Many more options for normalization have been added (support for 12 more formats). New tests for differential expression, such as an improved version of the CLEAR test ( 25 ) and the popular SAM test ( 27 ), were implemented. The module for cluster visualization has also been extensively improved. Much work has been invested in implementing an improved tool for protein and gene ID conversion which includes a large number of species and databases. The converter tool now supports more than 10 species and more than 40 gene ID references for human [including single nucleotide polymorphism (SNP) and orthologous information]. In general, almost all the modules of GEPAS have undergone improvements to some extent. We have included a new complete module that allows the analysis of multi-series time-course and dose–response microarray experiments. The module is an implementation of the maSigPro statistical approach for the study of gene expression changes along time and the specific trend differences between various experimental groups ( 38 ). Another new module is the clustering by a version of SOM ( 21 ) that automatically finds the number of clusters. Babelomics has its own catalogue of novelties, which are described in an accompanying paper.
In addition, there are technical novelties such as the re-engineering to web services, the inclusion of Web 2.0 technology features, the new interface of sessions and the pipeliner, which are described below.
All the novelties included in GEPAS are, in terms of resources invested, far beyond the work demanded by a conventional web server that offers a unique facility.
The pipeliner: a graphic module for easy implementation of workflows
Microarray data analysis consists of a series of steps that can be carried out by sequentially running different GEPAS modules (e.g. normalization + pre-processing + gene selection + functional profiling of significant genes). If some of these steps have to be repeated systematically many times (which would happen, for example, in a microarray core facility), it is easier to save the sequence of operations as a workflow and reuse it in future analyses. The possibility of saving and storing operations is also useful when a researcher uses a non-default set of parameters in the tools. The advanced ‘pipeliner’ module allows users to define workflows for repetitive tasks in a completely visual manner by choosing, dragging and dropping icons representing the different modules in the package (without the need for any scripting skills). Figure 1 shows the graphic interface that allows defining sequences of operations as well as setting the parameters used in them. The workflows defined with this Java applet can be stored in the sessions and loaded from them later.
Internal re-engineering, technological improvements and the session interface
GEPAS has been completely re-engineered and is now based on SOAP web services and on new Web 2.0 technology features such as AJAX. This has facilitated the design of a new interface that allows asynchronous use, as well as project, job and user management. Users can choose between the traditional anonymous sessions without logging in (as in previous versions) and logging into the new environment with a username and password. This new environment offers persistent sessions in which data are kept stored, as well as different facilities for tracking the operations performed. Both options are free.
GEPAS is now running on a high-end cluster with 10 dedicated Intel XEON Quad-Core CPUs at 2.0 GHz (a total of 40 cores) and a large amount of RAM (60 GB in total). In this way we can offer high computing power to end users.
An improved module for protein and gene ID conversion, including a large number of species and databases, is used behind the scenes. This module allows importing any microarray file regardless of the IDs used in the platform. More species and gene references have been added, and the converter module now supports more than 10 species and more than 40 ID references for human (including SNP and orthologous information). This module has been implemented in Java to speed up performance. Besides the web interface, a public web service Application Programming Interface is provided, allowing anyone to access the data from their own code.
Related training activities
In addition, there is a teaching programme related to GEPAS ( http://bioinfo.cipf.es/docus/courses/courses.html ) with on-line tutorials that can be freely used ( http://bioinfo.cipf.es/docus/courses/on-line.html ).
The impact on the user community has been estimated from the corresponding number of Google Scholar citations. According to the number of citations, GEPAS is by far the most popular web resource in its category, with 196 citations [252 if the citations of SOTA ( 19 ) are included]. The updated citations for the web tools with a significant presence in the scientific community can be found at: http://bioinfo.cipf.es/docus/tools-citations/microarrays . GEPAS is used by a broad research community in many countries and its records indicate an average usage rate of around 500 users per day. The geographical distribution of users can be monitored in real time at: http://bioinfo.cipf.es/access_map/map.html . The web-based pipeline for microarray gene expression data, GEPAS, is available at http://www.gepas.org .
We are working on several improvements that will be released in an upcoming version. These include normalization for one channel Agilent arrays, for exon arrays (both Agilent and Affymetrix), for tiling arrays and for Illumina arrays. New tests for differential expression will be included. A new version of the predictor with more predictor tools and new cross-validation methods will also be implemented. The ISACHG ( 47 ) for array-CGH analysis will be fully integrated in GEPAS and interfaces to databases such as ArrayExpress ( http://www.ebi.ac.uk/arrayexpress/ ) or Gene Expression Omnibus (GEO) ( http://www.ncbi.nlm.nih.gov/geo/ ) will be provided.
GEPAS is a long-term, ongoing and ambitious project that aims to provide the scientific community with an advanced set of tools for high-throughput gene expression data analysis without sacrificing easy and intuitive use. Since its official release in 2003 ( 7 ), GEPAS has been running uninterruptedly and has grown to include more tools to keep pace with the novelties in the microarray data analysis arena ( 7–9 ). GEPAS aims to be a consistent set of both state-of-the-art and widely established algorithms, rather than a simple collection of as many tools as possible. In fact, every new tool included in the package has been a response to a new or emerging requirement expressed by our users. As the Functional Genomics node of the Spanish Institute of Bioinformatics (INB; http://www.inab.org ) and being part of the Spanish Network of Cancer (RTICC; http://www.rticcc.org ) and the Network of Centres for Research in Rare Diseases (CIBERER, http://www.ciberer.es ), we are in direct contact with researchers, from whom we get much of the feedback necessary to build a useful tool. We are also integrated in the EMERALD project ( http://www.microarray-quality.org/ ), where we will provide input on data mining methodologies such as clustering, gene selection or predictors, to assess the implications of QA/QC.
GEPAS, integrated with the Babelomics suite ( 40 , 41 ), offers all the methods necessary to perform the most common analyses of microarray data. GEPAS has been designed to take full advantage of the properties of the web: connectivity, cross-platform functionality and remote usage. Its modular architecture based on web services allows easy implementation of new tools and facilitates the connectivity of GEPAS from and to other web-based tools.
It cannot be ruled out that the technologies and the platforms will change in the future. Such foreseeable changes can only affect the entry point and the technology-related part of GEPAS (that is, the normalization). The important contribution of GEPAS is its potential for analyzing high-throughput gene expression data and for testing different types of hypotheses in this context, regardless of the technology that has produced such results.
The step of functional interpretation is typically made by studying the enrichment in pre-defined modules of genes that share some interesting biological property (common function, regulation, chromosomal location, etc.) as a function of some parameter derived from the experiment. Thus, functional enrichment methods ( 39 ) are used to find gene modules significantly over-represented among the relevant genes selected in the experiment. Over-representation of a given gene module means that genes with a particular property have been activated or deactivated in the experiment. Recently, gene set enrichment methods have been superseding conventional functional enrichment methods for the functional interpretation of high-throughput gene-expression data, given their higher sensitivity ( 39 , 48 , 49 ). Both families of methods, along with several definitions of modules (functional, transcriptional, text-mining based, and phenotype and tissue based), are implemented in the Babelomics module, fully integrated in GEPAS.
GEPAS is now running on a high-end cluster that offers high computing power. This allows the use of tools (for example, normalization tools are highly RAM-consuming) that are usually beyond the capabilities of the hardware available to many end users.
Although there are many alternatives for microarray data analysis, there is no other similar resource over the web with the number of possibilities offered by GEPAS.
This work is supported by grants from the Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER) ISCIII, and projects BIO2005-01078 from the Spanish Ministry of Education and Science, EMERALD from the EU and the National Institute of Bioinformatics ( www.inab.org ), a platform of Genoma España. Funding to pay the Open Access publication charges for this article was provided by project BIO2005-01078 from the Spanish Ministry of Education and Science.
Conflict of interest statement . None declared.