Motivation: CoGenT++ is a data environment for computational research in comparative and functional genomics, designed to address issues of consistency, reproducibility, scalability and accessibility.
Description: CoGenT++ facilitates the re-distribution of all fully sequenced and published genomes, storing information about species, gene names and protein sequences. We describe our scalable implementation of ProXSim, a continually updated all-against-all similarity database, which stores pairwise relationships between all genome sequences. Based on these similarities, derived databases are generated for gene fusions—AllFuse, putative orthologs—OFAM, protein families—TRIBES, phylogenetic profiles—ProfUse and phylogenetic trees. Extensions based on the CoGenT++ environment include disease gene prediction, pattern discovery, automated domain detection, genome annotation and ancestral reconstruction.
Conclusion: CoGenT++ provides a comprehensive environment for computational genomics, accessible primarily for large-scale analyses as well as manual browsing.
Availability: The database and component downloads are accessible at http://cgg.ebi.ac.uk/cogentpp.html.
Since the publication of the first entire genome sequence for Haemophilus influenzae Rd in 1995 (Fleischmann et al., 1995), more than 220 genomes comprising >800 000 genes/proteins have been sequenced and published (Fig. 1) (Janssen et al., 2003a). This wealth of sequence information provides immense opportunities for genome-wide mining and exploration, including aspects of comparative and evolutionary studies, functional genomics, protein interactions and protein family discovery. Yet, most of the existing databases are designed to provide access to genomic information via integrated browsers that are gene-centric (Maglott et al., 2005), and are not always suitable for large-scale studies. To address this problem in our own research, we first developed the Complete Genome Tracking (CoGenT) database (Janssen et al., 2003b), enabling both large-scale analyses and unified linking of various projects. We have now extended the original sequence database to include other facets of computational genomics with pre-computed entities, such as phylogenetic profiles and protein families.
The CoGenT++ environment
The CoGenT++ system is designed to provide a comprehensive, robust, flexible and useful environment guided by research issues in computational genomics. The original design principle of the CoGenT++ environment is to capture various aspects of genomic information, including a multitude of pre-computed entities, in a uniform manner. We opted for a relational database schema, currently implemented in MySQL (mysql.org), and made both the schema and contents of the databases available. Figure 2 shows the conceptual schema of the system, replicated from the web site, where it is available as a clickable site map. The CoGenT++ environment has been designed as a three-tier system:
the database tier: this includes the primary input data and results from a self-comparison;
the application tier: this includes all results computed for specific purposes via a range of applications and further maintained in secondary databases, either directly derived (cyan in Fig. 2) or linked (gray in Fig. 2) to other imported resources (blue in Fig. 2) and
the client tier: this includes external access for both pre-computed data and interactive computation available to users through the World Wide Web (WWW).
The first tier consists of two layers. The first layer corresponds to the original CoGenT database, storing information about genomes and the corresponding proteins (Janssen et al., 2003b). The second layer corresponds to a pairwise similarity database for protein sequences, called ProXSim, which provides the basis for the production of derived databases for the next tier (Fig. 2).
The second tier consists of directly derived databases, containing protein families, groups of orthologs, phylogenetic profiles and fusion events (Fig. 2). These databases are computed by a range of algorithms that interact with the database, for instance GeneRAGE (Enright and Ouzounis, 2000), TRIBE-MCL (Enright et al., 2002) and GeneTrace (Kunin and Ouzounis, 2003). Other applications may require the import of other resources for the creation of secondary databases, for example, Disease Gene Prediction (DGP) (Lopez-Bigas and Ouzounis, 2004) or the Genome Phylogeny Server (GPS) (Kunin et al., 2005) (Fig. 2).
The third tier facilitates access to pre-computed data via the WWW, as well as interactive querying and processing, for example, CAST masking (Promponas et al., 2000) and BLAST searching (Altschul et al., 1997) (Fig. 2). The amount of data made available for research purposes is substantial: the current CoGenT++ environment provides access to >65 GB of data (Table 1), almost 12 times the size of the current UniProt release (Bairoch et al., 2005). The details of database components are described below.
The core of the system is the CoGenT database (Janssen et al., 2003b), which provides information about genomes and proteins from all published and fully sequenced genomes, the two criteria that define admittance of a genome into the database. We admit only genomes for which sufficient coverage has been achieved so that protein sequence information has been made available either at the original sequencing center site or in a major molecular biology database (Janssen et al., 2003b), linked to a publication. The recent rate of genome inclusion has reached one genome per week, on average (Janssen et al., 2005).
The CoGenT database consists of two tables (database tier): the Genomes table and the Proteins table. The Genomes table stores information regarding each fully sequenced genome: full name of the species and strain, its taxonomic classification, genome size, number of genes, date of publication, additional curator information and a simple versioning mechanism (Janssen et al., 2003b). The Proteins table stores information about proteins from these genomes, including the full sequence, original annotation, the originally submitted protein identifier and further information (Janssen et al., 2003b). Additional CoGenT identifiers are generated for both genomes and protein sequences in a consistent fashion (Janssen et al., 2003b), to facilitate easy recognition of sequences by users and programs alike.
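As a minimal sketch of the two-table layout described above, the following Python snippet builds an in-memory SQLite stand-in for the MySQL schema. The column names, the helper functions make_db and proteins_of, and the use of SQLite are assumptions made for illustration only; the actual schema is the one distributed with CoGenT++.

```python
import sqlite3

# Hypothetical column names; only the two-table split and the listed
# kinds of information come from the text.
SCHEMA = """
CREATE TABLE genomes (
    genome_id   TEXT PRIMARY KEY,   -- CoGenT-style genome identifier
    species     TEXT NOT NULL,      -- full species name
    strain      TEXT,
    taxonomy    TEXT,               -- taxonomic classification
    genome_size INTEGER,
    gene_count  INTEGER,
    published   TEXT,               -- date of publication
    version     INTEGER             -- simple versioning mechanism
);
CREATE TABLE proteins (
    protein_id  TEXT PRIMARY KEY,   -- CoGenT-style protein identifier
    genome_id   TEXT NOT NULL REFERENCES genomes(genome_id),
    original_id TEXT,               -- originally submitted identifier
    annotation  TEXT,               -- original annotation
    sequence    TEXT NOT NULL       -- full protein sequence
);
"""


def make_db():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def proteins_of(conn, genome_id):
    """Return the sorted protein identifiers belonging to one genome."""
    rows = conn.execute(
        "SELECT protein_id FROM proteins WHERE genome_id = ? ORDER BY protein_id",
        (genome_id,),
    )
    return [row[0] for row in rows]
```

Keeping the genome identifier as a foreign key in the protein table is what would let derived tables (similarities, families, profiles) refer to proteins and genomes through a single identifier space.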
One of the most important recent developments is the linking of the CoGenT identifier space with ‘official’ identifier conventions. Thus, the CoGenT++ environment is now fully cross-linked with 5 million links to major molecular biology databases, namely UniProt (Bairoch et al., 2005), EMBL (Kanz et al., 2005), GenBank (Benson et al., 2005), RefSeq (Pruitt et al., 2005) and PDB (Deshpande et al., 2005). This is achieved through the MagicMatch algorithm (Smith et al., submitted for publication), accessible at http://cgg.ebi.ac.uk/services/magicmatch/.
The CoGenT++ schema allows the user to include additional protein databases and link them to the system. In the current set-up, we have included the Swiss-Prot database (Boeckmann et al., 2003), in order to link to a high-quality annotation resource.
ProXSim: a similarity database
The similarity database contains all-against-all pairwise similarities of proteins computed using BlastP (Altschul et al., 1997), and filtered for compositionally biased regions using CAST (Promponas et al., 2000). This database is built from all protein sequence data in CoGenT, plus other imported sequence collections, currently Swiss-Prot, release 42.11 (Boeckmann et al., 2003). The principal use of this component (database tier) is the storage of pre-computed similarities and their subsequent use by different applications (application tier), depending on the question at hand, for example phylogenetic profiles or protein families (see below). In this way, a single large-scale computation provides the basis upon which other systems or resources are automatically built: for example, the entry of each new genome sequence triggers its comparison against CoGenT sequences (see below). Thus, we reduce the need for recurrent similarity searches for specific genome subsets within a finite protein universe, which are performed daily and consume significant computational resources. We hope that other colleagues will find the pre-computed set of similarities useful in their own research. Due to the large size of the database (Table 1), users can access the similarity information interactively only by providing a protein identifier, for which the ‘Phylogen’ server (Fig. 2) returns the phylogenetic profile. Interactive sequence searches against genomes are also supported at this level via BlastP (Altschul et al., 1997). Both Phylogen and BlastP support ad hoc analyses.
Incremental update mechanism
All similarities are computed using the BlastP program (Altschul et al., 1997). However, the main estimate of similarity significance in Blast is the E-value, which depends on the database size. As the database grows, previously computed E-values would need to be recomputed. We aimed to reduce the computational load on the system and to avoid recalculating the complete database every time a new genome is included. We use the BlastP bit-score, which is independent of the database size, as an estimate of sequence similarity, and set the database size to a constant value (see below). The E-value can be computed from bit-scores and query-dependent database sizes (Altschul et al., 1997).
Therefore, we use the BlastP bit-score $b$ as an estimate of sequence similarity because it only depends on the alignment and the substitution matrix. Then, the E-value is calculated in a simplified yet uniform manner as follows: $E_{\mathrm{simpl}} = L_{\mathrm{eff}} \cdot S_{\mathrm{eff}} \cdot 2^{-b}$, where the effective database size $S_{\mathrm{eff}}$ is set to $10^{8}$ residues and $L_{\mathrm{eff}}$ is the effective protein length (number of amino acid residues not masked by CAST). In order to perform a consistent and incremental update of the similarity database each time a new genome is released and processed, while keeping the previously computed values in the database, we use an E-value cut-off of $10^{-5}$ on $E_{\mathrm{simpl}}$. Note that for very small peptides this cut-off is too stringent to accept even full-length alignments. Hence, we calculate the cut-off that would accept alignments covering 40% of the query protein length and we actually use the most permissive cut-off between this alternative cut-off and $10^{-5}$.
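The simplified E-value and the choice between the two cut-offs can be sketched in a few lines of Python. The text does not state how the bit-score of an alignment covering 40% of the query is estimated, so the bits_per_residue parameter below (and its default value) is a purely hypothetical assumption, as are the function names; only the formula, the constant database size and the two cut-offs come from the description above.

```python
S_EFF = 1e8          # effective database size in residues (from the text)
BASE_CUTOFF = 1e-5   # default cut-off on the simplified E-value (from the text)


def e_simpl(l_eff, bit_score, s_eff=S_EFF):
    """Simplified E-value: L_eff * S_eff * 2^(-b)."""
    return l_eff * s_eff * 2.0 ** (-bit_score)


def cutoff(l_eff, bits_per_residue=2.0, coverage=0.4):
    """Return the most permissive (largest) of the two cut-offs.

    The alternative cut-off is the simplified E-value of an alignment
    covering `coverage` of the query. Its bit-score is approximated as
    bits_per_residue * aligned length, which is an assumption of this
    sketch and not a value given in the text.
    """
    alt_bits = bits_per_residue * coverage * l_eff
    return max(BASE_CUTOFF, e_simpl(l_eff, alt_bits))


def accept(l_eff, bit_score):
    """Decide whether a similarity is stored under the chosen cut-off."""
    return e_simpl(l_eff, bit_score) <= cutoff(l_eff)
```

Because neither function refers to the current number of sequences in the database, a similarity accepted today would still be accepted after further genomes are added, which is the property the incremental update relies on.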
This set-up allows a consistent and incremental update of the database each time a new genome is released, while keeping the previously computed values. Because the database tier is designed to update genome similarities incrementally, CoGenT++ provides a scalable and automatic update mechanism for the pairwise comparison of all genomes, at least for the foreseeable future. With the growth of genome sequence information (Fig. 1), however, the size of the similarity database increases quadratically. This will eventually lead to a data size explosion that could challenge methods of storage and distribution for end users, so other solutions must be sought, e.g. distributed databases across the GRID (Stevens et al., 2003; Teo et al., 2004).
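The saving from incremental updates, and the quadratic growth of the comparison space, can be illustrated with a short calculation. The sketch below counts ordered protein-versus-protein comparisons; the function names and the example figure of 4000 proteins for a new genome are illustrative assumptions, not measurements from CoGenT++.

```python
def full_comparisons(n_total):
    """Ordered comparisons in a complete all-against-all run."""
    return n_total * n_total


def incremental_comparisons(n_existing, n_new):
    """Comparisons added by n_new proteins: new-vs-old, old-vs-new, new-vs-new."""
    return full_comparisons(n_existing + n_new) - full_comparisons(n_existing)


# Example: 800 000 existing proteins plus a hypothetical genome of 4000 proteins.
extra = incremental_comparisons(800_000, 4_000)
total = full_comparisons(804_000)
print(f"incremental: {extra}, full recomputation: {total}")
```

Under these assumed numbers, the incremental step needs roughly 1% of the comparisons of a full recomputation, while the total comparison space still grows with the square of the number of proteins.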
TRIBES and OFAM: protein families and putative orthologs
Protein family classification is a key step in many computational genomics projects. CoGenT++ provides protein family information in two forms: the TRIBES protein family database and the OFAM ortholog family database. The TRIBES database (Enright et al., 2003) is derived from the complete set of pairwise similarities using the TRIBE-MCL algorithm (Enright et al., 2002), a method suited to the rapid and accurate detection of protein families on a large scale. OFAM is a database of protein ortholog families derived using best bidirectional hits—instead of pairwise similarities—and clustered as in TRIBES. Reciprocal best hits have been proposed as an operational definition of orthologs (Overbeek et al., 1999). The OFAM database provides higher granularity, i.e. specific clusters, while TRIBES provides wider protein families.
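The best bidirectional hit criterion used as input for OFAM can be sketched as follows. This is a minimal illustration and not the OFAM pipeline itself: proteins are represented here as hypothetical (genome, protein) tuples, hits as (query, subject, bit_score) triples, and the subsequent clustering step is omitted.

```python
def reciprocal_best_hits(hits):
    """Return the set of protein pairs that are each other's best hit.

    hits: iterable of (query, subject, bit_score), where query and subject
    are (genome, protein) tuples. Hits within one genome are ignored.
    """
    best = {}  # (query, subject genome) -> (best subject, its score)
    for query, subject, score in hits:
        if query[0] == subject[0]:
            continue
        key = (query, subject[0])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)

    pairs = set()
    for (query, _), (subject, _) in best.items():
        # Look up the best hit of the subject back in the query's genome.
        back = best.get((subject, query[0]))
        if back is not None and back[0] == query:
            pairs.add(frozenset((query, subject)))
    return pairs
```

Restricting the input to such reciprocal pairs discards many of the paralogous similarities that a family-level clustering would keep, which is consistent with OFAM yielding more specific clusters than TRIBES.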
AllFuse: protein fusions
The detection of gene fusion events across genomes can be used for predicting functional associations of proteins (Marcotte et al., 1999), including physical interaction and complex formation (Enright et al., 1999). CoGenT++ incorporates data on gene fusion events computed with an updated automatic protocol called Diffuse-2 (Enright et al., 1999); these data are expected to supersede the previous AllFuse collection (Table 1). Each genome is used as a query against the remaining complete genomes to detect gene fusions. Pairs of proteins in the query genome that are not similar to each other and are found to be similar to a single, composite protein from the reference genomes, across non-overlapping segments, are predicted to be functionally associated (Enright et al., 1999).
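The fusion criterion described above (two dissimilar query proteins matching non-overlapping segments of one composite protein) can be written as a small filter. This is a simplified sketch under stated assumptions and does not reproduce Diffuse-2: the input layout, the function name and the strict non-overlap test are all hypothetical choices made for illustration.

```python
from itertools import combinations


def fusion_candidates(hits, similar_pairs):
    """Find query protein pairs linked by a composite reference protein.

    hits: iterable of (query, reference, ref_start, ref_end), giving the
          segment of the reference protein covered by each similarity.
    similar_pairs: set of frozenset({q1, q2}) for query proteins that are
          similar to each other and must therefore be excluded.
    """
    by_reference = {}
    for query, reference, start, end in hits:
        by_reference.setdefault(reference, []).append((query, start, end))

    found = set()
    for reference, segments in by_reference.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(segments, 2):
            if q1 == q2 or frozenset((q1, q2)) in similar_pairs:
                continue
            if e1 < s2 or e2 < s1:  # segments do not overlap
                found.add((frozenset((q1, q2)), reference))
    return found
```

A production protocol would presumably also tolerate small overlaps and require a minimum segment length; those thresholds are not given in the text and are left out here.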
ProfUse: phylogenetic profiles
A phylogenetic profile is defined as a string that encodes the presence or absence of a protein in every known genome (Pellegrini et al., 1999). Proteins with similar phylogenetic profiles have been shown to be involved in related cellular processes. Phylogenetic profiles in CoGenT++ are generated from the ProXSim similarity database data and are represented as binary vectors, where each bit represents an individual protein's hit to a genome. Given a query protein, the ‘Phylogen’ server returns its pre-computed phylogenetic profile (see above). The reverse operation is achieved through the ‘ProfUse’ server: given an arbitrary phylogenetic profile from a list of species, it returns all proteins consistent with the specified profile. Note that this set of proteins might contain homologous sequences exhibiting identical profiles due to evolutionary relationships or non-homologous sequences exhibiting identical profiles due to functional relationships.
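Both directions of the profile lookup can be sketched with integer bit vectors. The snippet below is an illustration only: the input format (protein, genome pairs derived from similarity hits), the function names and the present/absent query form are assumptions, not the interfaces of the Phylogen or ProfUse servers.

```python
def build_profiles(hits, genomes):
    """Encode each protein's presence across genomes as an integer bit vector.

    hits: iterable of (protein, genome) pairs, one per genome in which the
          protein has at least one significant similarity.
    genomes: ordered list of genome names; position i maps to bit i.
    """
    index = {genome: i for i, genome in enumerate(genomes)}
    profiles = {}
    for protein, genome in hits:
        profiles[protein] = profiles.get(protein, 0) | (1 << index[genome])
    return profiles


def proteins_matching(profiles, genomes, present, absent=()):
    """Return proteins found in all `present` genomes and in no `absent` genome."""
    index = {genome: i for i, genome in enumerate(genomes)}
    want = sum(1 << index[g] for g in present)
    avoid = sum(1 << index[g] for g in absent)
    return sorted(
        protein
        for protein, bits in profiles.items()
        if (bits & want) == want and (bits & avoid) == 0
    )
```

The first function corresponds to the protein-to-profile direction and the second to the reverse query; genomes named in neither list are treated as unconstrained.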
CoGenT++ can easily be linked to other systems (Fig. 2). Currently it is loosely coupled to GeneQuiz, an expert system for automated genome annotation (Andrade et al., 1999); GeneRAGE, a sequence clustering algorithm (Enright and Ouzounis, 2000); disease gene prediction (DGP), a database of human genes with their probability of being involved in a hereditary disease (Lopez-Bigas and Ouzounis, 2004); and the Genome Phylogeny Server (GPS), an approach for the computation of phylogenetic trees, using estimation of species distances from genome-wide sequence similarity (Kunin et al., 2005). Other systems are in the process of being linked to CoGenT++. Current work in this direction includes the reconstruction of ancestral states using the GeneTrace algorithm (Kunin and Ouzounis, 2003), pattern discovery methods applied to protein families using Teiresias (Rigoutsos and Floratos, 1998) and automated metabolic reconstructions with the Pathway Tools software suite (Karp et al., 2002) (‘in progress’, Fig. 2). We hope that other groups who find this resource useful in their research projects will both link and provide pertinent open-access data following the CoGenT++ architecture.
The potential adoption of the CoGenT schema by other groups and the provision of additional datasets that could easily be linked to the original CoGenT tables, for instance gene expression or protein interaction information, is expected to contribute towards highly consistent results that are readily accessible and reproducible.
Comparison to other systems
Similar systems that deliver genome information have been developed before. Notable cases are the NCBI genomes database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome) and the EBI Genome database (http://www.ebi.ac.uk/integr8/), serving a varied community of end-users, ranging from novice to sophisticated. However, these systems do not provide integrated information in the form of protein families or interactions, nor any tightly coupled methods that generate secondary information. The COG database maintains a list of orthologs for various species but requires manual intervention, and no database dumps are readily available (Tatusov et al., 2003). The STRING database contains information about both genomes and gene context (such as gene fusions), and derives orthology directly from the COG database (von Mering et al., 2005). Analogous databases providing information for gene context are ProLinks (Bowers et al., 2004) and Predictome (Mellor et al., 2002). Both of these databases provide flat-file downloads and links to other databases, but no database dumps or link mechanisms. Finally, the Comprehensive Microbial Resource provides detailed information about gene and protein function, and various computed characteristics of sequences and genomes (Peterson et al., 2001), yet it does not provide tightly coupled systems for analysis and inference.
Unique features of CoGenT/CoGenT++ include its transparency and the tracking of genome information in precise chronological order (Janssen et al., 2003b). In this manner, we are able to trace the time of discovery of novel protein families and the sampling of sequence or phylogenetic space (Kunin et al., 2003). Moreover, the two strict criteria of both publication and availability of genomes allow us to incorporate genome data in an objective manner. Other unique features include the naming scheme, which facilitates both computer and human interaction, the full availability of the entire resource as both flat files and MySQL dumps, and finally the simplicity of design and implementation. All these assets should make the CoGenT++ environment useful for a wider community and facilitate research in computational genomics.
DATA ACCESS AND PLATFORM REQUIREMENTS
CoGenT++ is available via an interactive website, MySQL or flat files. The URL http://cgg.ebi.ac.uk/cogentpp.html is the main entry point to the CoGenT++ environment.
The MySQL database and the Apache HTTP server run on a Sun Microsystems Enterprise E450 server with 4 CPUs and 4 GB of shared memory with access to a 140 GB disk partition. All data are downloadable as MySQL dumps and flat files, where applicable. MySQL is available for multiple platforms.
Similarity searches are performed on a 200-CPU cluster, kindly provided by IBM to the Research Programme of the European Bioinformatics Institute.
CAO would like to acknowledge further support by IBM Research and the UK Medical Research Council.
Conflict of Interest: none declared.
|Database|Genomes|Entries|Size (MB)|
|ProXSim (a)|221|435 505 934|43 000.0|
|AllFuse (b)|184|2 192 019|20 700.0|
Entries: protein sequences in CoGenT, pairwise similarities in ProXSim, phylogenetic profiles in ProfUse, putative ortholog clusters in OFAM, protein families in TRIBES and fusion events in AllFuse.
(b) In the process of being updated.