We present a new version of Babelomics, a complete suite of web tools for functional analysis of genome-scale experiments, with new and improved tools. New functionally relevant terms have been included, such as CisRed motifs or bioentities obtained by text-mining procedures. Improved indexing has considerably sped up several of the modules. An improved version of the FatiScan method for studying the coordinated behaviour of groups of functionally related genes is presented, along with a similar tool, the Gene Set Enrichment Analysis. Babelomics is now more oriented towards testing hypotheses inspired by systems biology. Babelomics can be found at http://www.babelomics.org .
Genes do not operate alone in the cell, but in a sophisticated network of interactions that we have only recently started to envisage ( 1 – 3 ). It has long been recognized that co-expressed genes tend to play common roles in the cell ( 4 , 5 ), and there is recent evidence that functionally related genes map close together in the genome, even in higher eukaryotes ( 6 , 7 ). Complex traits, including diseases, are starting to be considered from a systems biology perspective ( 8 ). Because of this, there is a clear need for methods and tools that help to interpret genome-scale experiments (microarrays, proteomics and the like) from a systems biology perspective. The proper interpretation of these experiments requires functional annotation, but this annotation must be done in a systems biology context, in which the collective properties of groups of genes are taken into account. With the popularisation of DNA microarray technologies, a number of methods arose to compare the enrichment in functional terms among groups of genes defined in the experiments. Programs such as ontoexpress ( 9 ) or FatiGO ( 10 ) are representatives of a family of methods designed for this purpose ( 11 ). A problem in managing genome-scale data and then inspecting thousands of functional terms is that a large number of associations will appear simply by chance ( 12 , 13 ). The multiple testing problem ( 14 ) was addressed for the first time by FatiGO ( 10 ), although correcting for it is now standard among this type of tool ( 11 ).
The extensive availability of functional annotations of reasonable quality, especially facilitated by the universal adoption of the Gene Ontology (GO) ( 15 ) controlled vocabulary and other related initiatives such as KEGG ( 16 ), InterPro ( 17 ) and the like, has enormously improved the accuracy of the above-mentioned procedures of functional annotation. Beyond this, extensive annotation makes it possible to take conceptually different approaches to the analysis of genome-scale experiments, based more on systems biology criteria. Thus, instead of first selecting important genes (according to some criterion such as differential expression) and then analysing them in terms of their biological roles, some authors proposed to analyse directly the behaviour of blocks of functionally related genes. The Gene Set Enrichment Analysis (GSEA) ( 18 , 19 ), FatiScan ( 13 ) and the global test ( 20 ) are examples of this type of approach.
Suites such as Babelomics ( 21 ) or onto-tools ( 22 ), which gather different possibilities for functional annotation in an integrated environment, will be increasingly in demand as the need for a more detailed interpretation of genome-scale experiments becomes more obvious.
Babelomics, named after the tale ‘The Library of Babel’ ( 23 ), a masterpiece by the famous Argentinean writer Jorge Luis Borges, has been running for more than one year, and individual parts of it, such as the FatiGO tool ( 10 ), have been running for >3 years.
BIOLOGICAL INFORMATION USED FOR FUNCTIONAL ANNOTATION
Different repositories of functionally relevant biological information are available and can be used for the functional annotation of genome-scale experiments. In this new release of Babelomics we have collected information from different repositories for several model organisms ( Homo sapiens , Mus musculus , Rattus norvegicus , Drosophila melanogaster , Caenorhabditis elegans , Saccharomyces cerevisiae and Arabidopsis thaliana ), which has been cross-referenced using Ensembl ( 24 ) identifiers. The repositories used are as follows:
GO is probably the most successful of the initiatives for the standardization of the nomenclature of biological processes, molecular functions and subcellular location, its three main ontologies ( 15 ). GO represents biological knowledge as a tree (more precisely as a directed acyclic graph, DAG, in which a node can have more than one parent). Upper nodes represent more general concepts and, as the DAG is traversed towards deeper levels, the definitions become more and more precise (e.g. cell cycle > regulation of cell cycle > positive regulation of cell cycle and so on). Since genes are annotated at different levels, it is common to use the inclusive analysis ( 11 , 25 ) instead of directly using the annotation of the genes at the deepest level possible. In the inclusive analysis a level of abstraction is chosen and genes annotated at deeper levels are assigned to this level. This increases the efficiency of the test because there are fewer terms to test and more genes per term, but the selection of the level is arbitrary. We have implemented here the Nested Inclusive Analysis (NIA), in which the test is done recursively until the deepest level at which significance is obtained, and only this last level is reported. In this way both variables, the efficiency of the test and the precision of the term found, are optimized.
InterPro ( 17 ) is a database of protein families, domains and functional sites in which identifiable features (motifs) found in known proteins can be used to infer the possible function of uncharacterized protein sequences.
The SwissProt ( 26 ) database contains, for each entry, a field called keywords, which holds a controlled vocabulary of terms, many of them (although not all) with functional meaning.
KEGG pathways ( 16 ) is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes and Human Diseases ( http://www.genome.jp/kegg/kegg2.html ).
Transcription factor (TF) - binding sites predicted using Transfac ® . TFs are assigned to genes if the corresponding predicted TF-binding site (TFBS) for that TF is found in the 10 kb 5′ region of the gene. The search is carried out by the Match program ( 27 ), using only high quality matrices from the Transfac database ( 28 ) and a cut-off chosen to minimize false positives. TFBSs are only available for human and mouse.
CisRed ( 29 ) is a database for conserved regulatory elements predicted in promoter regions using multiple discovery methods applied to sequence sets that include corresponding sequence regions from vertebrates. Motif significance is estimated by comparison to randomized sequence sets that are adaptively derived from target sequence sets. In theory, all the Transfac ® predictions should be a subset of these regulatory elements, but in practice the overlap is not complete. For this reason we still keep the Transfac ® predictions. In addition, CisRed tables are only available for humans.
Gene expression in tissues : Two repositories containing information of gene expression in different tissues have been used:
SAGE Tag libraries from the Cancer Genome Anatomy Project. A total of 279 human libraries that belong to 29 different tissues and 190 mouse libraries from 26 tissues have been used. The data were taken from http://cgap.nci.nih.gov/SAGE .
Genomics Institute of the Novartis Foundation data. A total of 79 human tissues and 61 mouse tissues with normal histology were downloaded from http://wombat.gnf.org/index.html and used here.
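As an illustration of the inclusive analysis described for GO above, the following minimal sketch assigns genes annotated at deeper levels to their ancestor terms at a chosen level of the DAG. It is not the Babelomics implementation: the toy DAG, the gene names and the helper functions `term_levels`, `ancestors` and `inclusive_annotation` are hypothetical, and the level of a term is assumed here to be its shortest distance from the root, which is only one of several possible conventions.

```python
from collections import deque

# Toy GO-like DAG (hypothetical data): child term -> list of parent terms.
PARENTS = {
    "biological_process": [],
    "cell cycle": ["biological_process"],
    "regulation of cell cycle": ["cell cycle"],
    "positive regulation of cell cycle": ["regulation of cell cycle"],
    "cell death": ["biological_process"],
}

def term_levels(parents):
    """Level of each term: shortest distance from a root (roots are level 1)."""
    children = {t: [] for t in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    levels = {}
    queue = deque((t, 1) for t, ps in parents.items() if not ps)
    while queue:
        term, lvl = queue.popleft()
        if term in levels:  # already reached by a shorter path
            continue
        levels[term] = lvl
        queue.extend((c, lvl + 1) for c in children[term])
    return levels

def ancestors(term, parents):
    """All ancestors of a term in the DAG, including the term itself."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def inclusive_annotation(gene_terms, parents, level):
    """Assign each gene to its ancestor terms located at the chosen level."""
    levels = term_levels(parents)
    result = {}
    for gene, terms in gene_terms.items():
        mapped = set()
        for t in terms:
            mapped |= {a for a in ancestors(t, parents) if levels[a] == level}
        result[gene] = mapped
    return result

# Hypothetical annotations at different depths of the DAG.
genes = {
    "geneA": {"positive regulation of cell cycle"},
    "geneB": {"regulation of cell cycle"},
    "geneC": {"cell death"},
}
at_level_2 = inclusive_annotation(genes, PARENTS, level=2)
```

In this toy example geneA and geneB, annotated at different depths, are both counted under "cell cycle" at level 2, which is what gives the test more genes per term. A nested analysis in the spirit of NIA would repeat the enrichment test at successively deeper levels and keep, for each branch, the deepest term that remains significant.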
Generation of annotations from the biomedical literature
The curated repositories mentioned above contain valuable information, but a large amount of biomedical knowledge is still communicated in the old-fashioned way, through research publications. This information can only be extracted from the text with text-mining methods. Modern text-mining technology is still far from ‘understanding’ human language ( 30 ), but important advances have been made in extracting some factual information from the scientific literature with sufficient reliability to be useful.
For the analysis of the biomedical literature, precise identification of key entities of interest, such as genes, proteins, chemical compounds and disease names, is crucial to index and retrieve relevant documents. As biomedical language and vocabulary are highly complex and change constantly, the identification of entities, commonly known as named entity recognition, is a cumbersome task.
For the detection of genes, proteins and diseases, a combination of dictionaries (e.g. EntrezGene or UniProt for genes and proteins and UMLS for diseases), heuristics based on hand-crafted rules and statistical measures is used. Chemical compounds are extracted based on morphological criteria (using knowledge about chemical nomenclature) and dictionaries of common names for chemicals.
Here we use relationships between biomedical entities that are calculated from co-occurrences in sentences (a co-occurrence is when two entities appear in the same sentence). The calculation is based on how unlikely it is that the observed number of co-occurrences would arise by chance ( 31 ). The more unlikely the observed event, the stronger the relation between the entities is scored by the system. Using this approximation, gene association networks can be created that do not specify the precise relationships between the genes but organize the literature in a way that makes exploration much easier.
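The idea of scoring a relation by how improbable its co-occurrence count is under chance can be sketched as follows. The exact statistic used by the text-mining system is not given here, so this example assumes a hypergeometric null model in which the sentences mentioning each entity are drawn independently. The function name `cooccurrence_pvalue` and all the counts are hypothetical.

```python
from math import comb

def cooccurrence_pvalue(n_sentences, n_a, n_b, k):
    """Probability of seeing at least k co-occurrences by chance.

    Assumed null model (not taken from the source): entity A appears in n_a of
    n_sentences sentences, entity B in n_b of them, and the two sets of
    sentences are independent, so the overlap is hypergeometric.
    """
    total = comb(n_sentences, n_b)
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, i) * comb(n_sentences - n_a, n_b - i)
        for i in range(k, upper + 1)
    )
    return tail / total

# Hypothetical corpus: 1000 sentences, gene in 20, disease term in 30.
p_weak = cooccurrence_pvalue(1000, 20, 30, 1)     # one shared sentence
p_strong = cooccurrence_pvalue(1000, 20, 30, 10)  # ten shared sentences
```

Under this model a smaller probability corresponds to a stronger relation, so `p_strong` would rank the gene-disease pair well above a pair scored by `p_weak`.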
The data used here were taken from the almaKnowledgeServer ( http://aks.bioalma.com ).
In order to maintain this huge system of gene annotations, a universal index has been adopted. A total of 179 tables of different biological annotations and gene identifiers for seven organisms have been linked to their Ensembl IDs. Although the use of a universal cross-reference has many advantages, it is not free of problems: any gene not annotated in Ensembl will be lost in the analysis. This will obviously affect only a very small number of genes and should not affect any general functional conclusion obtained by analysing a large and significant number of genes.
STRATEGIES FOR ANNOTATION OF GENOME-SCALE EXPERIMENTS
Typical genome-scale experiments are annotated in two steps. First, genes of interest are selected (because they co-express in a cluster, or because they are significantly over- or under-expressed when two classes of experiments are compared, and so on); then the enrichment of any type of biologically relevant label in these genes is compared with the corresponding distribution of the label in the background (typically the rest of the genes). Different tools are available, such as FatiGO ( 10 ) and others ( 11 ), that use GO terms ( 15 ) or other functional labels, such as KEGG pathways, SwissProt keywords and the like, available in packages such as the Babelomics suite ( 21 ). From a systems biology perspective, this way of annotating experiments is far from efficient. This has led several groups to propose a different approach based on directly selecting blocks of functionally related genes ( 13 , 19 , 20 ). The rationale of these new approaches is that the final aim of a typical genome-scale experiment is to find a molecular explanation for a given macroscopic observation (e.g. which pathways are affected by the deprivation of glucose in a cell). In the two-step approach described previously, genes with different behaviour are first selected, usually ignoring the fact that these genes act cooperatively in the cell and that their behaviours must consequently be coupled to some extent. In this selection, very stringent thresholds are usually imposed to reduce the ratio of false positives in the results. Then, the lists so obtained are compared with the background as described above. This procedure causes a tremendous loss of information because a large number of false negatives are accepted in order to preserve a low ratio of false positives, and the noisier the data are, the worse this effect is. Systems biology oriented methods can use lists of genes arranged by any biological criterion (e.g. differential expression when comparing cases and healthy controls) and search for the distribution of blocks of functionally related genes across them. If a particular function underlies the arrangement, the genes sharing it will accumulate towards the extremes of the list. A nice example is the study of differential gene expression between diabetic cases and normal controls, where no single gene was found to be differentially expressed (because of the noise of the system), but pathways such as oxidative phosphorylation were found to be significantly repressed in the diabetic cases ( 13 , 18 ).
GENE-BY-GENE SELECTION FOLLOWED BY FUNCTIONAL ANNOTATION
Babelomics implements different procedures for the functional annotation of pre-selected sets of genes, based on any experimental measure. Since GEPAS ( 32 – 34 ) is connected to Babelomics, it is straightforward to analyse relevant genes that have been selected by differential expression, or because they are part of a class predictor, or because they co-express in clusters, and so on. As mentioned above, different biological labels have been used for testing functional enrichment when comparing the distribution of such labels between gene datasets of interest and their corresponding references or backgrounds. The following tools are available:
FatiGO+ . This tool constitutes the evolution of FatiGO ( 10 ). In addition to GO terms it can test simultaneously for KEGG pathways, InterPro motifs, SwissProt keywords, TFBSs and CisRed motifs. The distribution of any combination (or all) of the terms between two groups of genes can be simultaneously tested by means of a Fisher exact test. All the P -values are adjusted by FDR. It can also be used to test genes defined by chromosomal positions (thus integrating the functionality of the old GenomeGO module ( 34 ), which has now been discontinued). The functionality of the old modules FatiWise and TransFat ( 34 ) has been completely included here and, consequently, both modules have been discontinued. For the case of GO terms, the NIA has been implemented, so GO terms are automatically tested from level 3 to level 9 and only the deepest significant term is reported for each branch.
FatiGO . This tool has been in use for more than three years and has been described elsewhere ( 10 , 21 , 25 ). Owing to its popularity it remains an independent module, although much of its functionality is integrated in FatiGO+. FatiGO implements NIA too.
Tissues Mining Tool (TMT) . This tool compares the pre-tabulated expression values of two lists of genes in a set of tissues (see above) and reports the tissues in which the differences in expression between the genes of the two lists are most extreme, using a t -test. The resultant P -values are adjusted by FDR. For details see ( 21 ).
MARMITE (My Accurate Resource for Mining TExts). This is the equivalent of FatiGO+, using as biological information precomputed gene-bioentity co-occurrences obtained with the text-mining software almaKnowledgeServer (see above). MARMITE reports significant differences in the distribution of the gene-bioentity scores between the two lists compared, using a Kolmogorov–Smirnov test. The module uses co-occurrence data between human gene names (HUGO ids) and three bioentity categories: disease-associated words, chemical products and word roots. As in the rest of the tests in the Babelomics modules, P -values are adjusted by FDR.
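The statistical machinery shared by the tools above can be sketched in a few lines. This is a minimal illustration and not the Babelomics code: it shows a two-sided Fisher exact test for a term in two gene lists (as in FatiGO+), the two-sample Kolmogorov–Smirnov statistic for two sets of scores (as in MARMITE), and an FDR adjustment. The text only says that P -values are adjusted by FDR, so the Benjamini-Hochberg procedure is an assumption here. All function names, term names and counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    a, b: genes with / without the term in list 1
    c, d: genes with / without the term in list 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic D (maximum ECDF distance)."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts per term: (with, without) in list 1, (with, without) in list 2.
counts = {
    "term_A": (8, 2, 1, 9),
    "term_B": (5, 5, 5, 5),
    "term_C": (3, 7, 2, 8),
}
raw = {term: fisher_exact_two_sided(*c) for term, c in counts.items()}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
```

Here only the statistic D is computed for the Kolmogorov–Smirnov case; turning it into a P -value requires the Kolmogorov distribution, for which an established library routine such as `scipy.stats.ks_2samp` would normally be preferred over a hand-written approximation.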
DIRECT ANNOTATION OF BLOCKS OF FUNCTIONALLY RELATED GENES
GSEA tests the coordinated over- or under-expression of sets of genes using a Kolmogorov–Smirnov test over a weighted summation. This makes it possible to detect asymmetrical distributions of sets of genes (defined because they share some functional property) accumulated at the highest or lowest values of an arrangement of genes according to their differential expression when two experimental conditions are compared ( 18 , 19 ). Significance is obtained by permuting the dataset of gene expression values. In the implementation presented here, more biological terms than in the original distribution ( http://www.broad.mit.edu/gsea/index.html ) can be used (GO, KEGG pathways, SwissProt keywords and InterPro motifs can be tested for the seven organisms listed above, while TFBSs and CisRed motifs can be tested only for human).
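A weighted running-sum statistic of the kind described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the GSEA or Babelomics implementation: the function names are hypothetical, the weighting exponent is a free parameter, and the significance step permutes gene-set membership rather than the expression dataset, which is a cheaper stand-in for the permutation scheme described in the text.

```python
import random

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Signed maximum deviation from zero of a weighted running sum.

    ranked_genes: genes sorted by differential expression
    scores: ranking statistic of each gene, in the same order
    """
    in_set = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for w, h in zip(hit_weights, in_set):
        # Step up (weighted) on a gene in the set, step down otherwise.
        running += w / total_hit if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_pvalue(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size (simplification)."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, scores, gene_set))
    k = sum(g in gene_set for g in ranked_genes)
    count = 0
    for _ in range(n_perm):
        fake = set(rng.sample(ranked_genes, k))
        if abs(enrichment_score(ranked_genes, scores, fake)) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Hypothetical ranked list with its ranking statistic.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
stat = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(ranked, stat, {"g1", "g2"})
es_bottom = enrichment_score(ranked, stat, {"g5", "g6"})
p_top = permutation_pvalue(ranked, stat, {"g1", "g2"}, n_perm=200)
```

A set concentrated at the top of the list yields a positive score and one concentrated at the bottom a negative score, which is how coordinated over- and under-expression are distinguished.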
FatiScan implements a segmentation test which checks for asymmetrical distributions of biological labels associated with genes ranked in a list ( 13 , 21 ). Uniquely among approaches of this type, the test needs only the list of ordered genes and not the original data that generated the ordering. This means that it can be applied to study the relationship of biological labels to any type of experiment whose outcome is a sorted list of genes. Since Babelomics is linked to GEPAS, genes sorted by differential expression between two experimental conditions can be studied, but also genes correlated with a clinical variable (such as the level of a metabolite) or even with survival ( 33 , 34 ). Moreover, lists of genes ranked by any other experimental or theoretical criterion can be studied (e.g. genes arranged by physico-chemical properties, mutability, structural parameters and so on) in order to understand whether some biological feature (among the labels used) is related to the experimental parameter studied.
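A segmentation test of this kind can be sketched by cutting the ranked list at a series of points and testing, at each cut, whether a label is unevenly distributed between the upper and lower parts. The sketch below is an assumption-laden illustration rather than the FatiScan implementation: equally spaced cut points and a two-sided Fisher exact test are assumed, and the function names and toy data are hypothetical. Note that only the ordered list of genes is needed as input.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def segmentation_scan(ranked_genes, annotated, n_partitions=4):
    """P-value of the label at each cut point of the ranked list.

    Returns a list of (cut_position, p_value) pairs, one per internal cut.
    """
    n = len(ranked_genes)
    results = []
    for k in range(1, n_partitions):
        cut = round(k * n / n_partitions)
        upper, lower = ranked_genes[:cut], ranked_genes[cut:]
        a = sum(g in annotated for g in upper)  # labelled genes above the cut
        c = sum(g in annotated for g in lower)  # labelled genes below the cut
        p = fisher_two_sided(a, len(upper) - a, c, len(lower) - c)
        results.append((cut, p))
    return results

# Hypothetical ranked list in which the label sits entirely in the top half.
ranked = [f"g{i}" for i in range(1, 9)]
label = {"g1", "g2", "g3", "g4"}
scan = segmentation_scan(ranked, label, n_partitions=4)
best_cut, best_p = min(scan, key=lambda cp: cp[1])
```

In a full analysis this scan would be repeated for every label, and the resulting P -values would then be adjusted for multiple testing as in the other modules.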
Obtaining, for example, a list of genes differentially expressed between two experimental conditions is only halfway to the proper interpretation of a genome-scale experiment. The functional annotation of these genes is a key step that is often not performed simply because of the lack of an appropriate tool. Babelomics can be considered one of the largest and most complete resources for the functional annotation of genome-scale experiments. It contains tools that are unique in their functionality. Moreover, the tight connection of Babelomics to the GEPAS package ( 32 – 34 ) makes it an invaluable resource for the analysis of microarray data.
An effort has been made to renew the tools and the underlying philosophy of the package, with the aim of making it possible to address the problem of annotation from a systems biology perspective. Thus, a new tool that makes use of annotations extracted from PubMed abstracts by means of text-mining procedures (MARMITE) has been included. Moreover, in addition to modules for functional annotation of pre-selected sets of genes, such as FatiGO+, MARMITE or TMT, Babelomics includes a completely renewed version of FatiScan and the GSEA. These last modules allow blocks of functionally related genes with a coordinated behaviour to be found in a genome-scale experiment.
This work is supported by grants from Fundació La Caixa, Fundación BBVA, MEC BIO2005-01078 and NRC Canada-SEPOCT Spain. The Functional Genomics node (INB) is supported by Genoma España. Funding to pay the Open Access publication charges for this article was provided by Genoma España.
Conflict of interest statement . None declared.