Abstract

Motivation: CoGenT++ is a data environment for computational research in comparative and functional genomics, designed to address issues of consistency, reproducibility, scalability and accessibility.

Description: CoGenT++ facilitates the re-distribution of all fully sequenced and published genomes, storing information about species, gene names and protein sequences. We describe our scalable implementation of ProXSim, a continually updated all-against-all similarity database, which stores pairwise relationships between all genome sequences. Based on these similarities, derived databases are generated for gene fusions—AllFuse, putative orthologs—OFAM, protein families—TRIBES, phylogenetic profiles—ProfUse and phylogenetic trees. Extensions based on the CoGenT++ environment include disease gene prediction, pattern discovery, automated domain detection, genome annotation and ancestral reconstruction.

Conclusion: CoGenT++ provides a comprehensive environment for computational genomics, designed primarily for large-scale analyses while also supporting manual browsing.

Availability: The database and component downloads are accessible at http://cgg.ebi.ac.uk/cogentpp.html.

Contact: ouzounis@ebi.ac.uk

INTRODUCTION

Since the publication of the first entire genome sequence for Haemophilus influenzae Rd in 1995 (Fleischmann et al., 1995), more than 220 genomes comprising >800 000 genes/proteins have been sequenced and published (Fig. 1) (Janssen et al., 2003a). This wealth of sequence information provides immense opportunities for genome-wide mining and exploration, including aspects of comparative and evolutionary studies, functional genomics, protein interactions and protein family discovery. Yet, most of the existing databases are designed to provide access to genomic information via integrated browsers that are gene-centric (Maglott et al., 2005), and are not always suitable for large-scale studies. To address this problem in our own research, we first developed the Complete Genome Tracking (CoGenT) database (Janssen et al., 2003b), enabling both large-scale analyses and unified linking of various projects. We have now extended the original sequence database to include other facets of computational genomics with pre-computed entities, such as phylogenetic profiles and protein families.

The CoGenT++ environment

The CoGenT++ system is designed to provide a comprehensive, robust, flexible and useful environment guided by research issues in computational genomics. The original design principle of the CoGenT++ environment is to capture various aspects of genomic information, including a multitude of pre-computed entities, in a uniform manner. We opted for a relational database schema, currently implemented in MySQL (mysql.org), and made both the schema and contents of the databases available. Figure 2 shows the conceptual schema of the system, replicated from the web site, where it is available as a clickable site map. The CoGenT++ environment has been designed as a three-tier system:

  1. the database tier: this includes the primary input data and results from a self-comparison;

  2. the application tier: this includes all results computed for specific purposes via a range of applications and further maintained in secondary databases, either directly derived (cyan in Fig. 2) or linked (gray in Fig. 2) to other imported resources (blue in Fig. 2) and

  3. the client tier: this includes external access for both pre-computed data and interactive computation available to users through the World Wide Web (WWW).

The first tier consists of two layers. The first layer corresponds to the original CoGenT database, storing information about genomes and the corresponding proteins (Janssen et al., 2003b). The second layer corresponds to a pairwise similarity database for protein sequences, called ProXSim, which provides the basis for the production of derived databases for the next tier (Fig. 2).

The second tier consists of directly derived databases, containing protein families, groups of orthologs, phylogenetic profiles and fusion events (Fig. 2). These databases are computed by a range of algorithms that interact with the database, for instance GeneRage (Enright and Ouzounis, 2000), TRIBE-MCL (Enright et al., 2002) and GeneTrace (Kunin and Ouzounis, 2003). Other applications may require the importing of other resources for the creation of secondary databases, for example, the Disease Gene Prediction (DGP) (Lopez-Bigas and Ouzounis, 2004) or the Genome Phylogeny Server (GPS) (Kunin et al., 2005) (Fig. 2).

The third tier facilitates access to a substantial amount of pre-computed data via the WWW, as well as interactive querying and processing, for example, CAST masking (Promponas et al., 2000) and BLAST searching (Altschul et al., 1997) (Fig. 2). The amount of data made available for research purposes is substantial: the current CoGenT++ environment provides access to >65 GB of data (Table 1), almost 12 times the size of the current UniProt release (Bairoch et al., 2005). The details of database components are described below.

System backbone

The core of the system is the CoGenT database (Janssen et al., 2003b), which provides information about genomes and proteins from all published and fully sequenced genomes, the two criteria that define admittance of a genome into the database. We admit only genomes for which sufficient coverage has been achieved so that protein sequence information has been made available either at the original sequencing center site or in a major molecular biology database (Janssen et al., 2003b), linked to a publication. The recent rate of genome inclusion has reached one genome per week, on average (Janssen et al., 2005).

The CoGenT database consists of two tables (database tier): the Genomes table and the Proteins table. The Genomes table stores information regarding each fully sequenced genome: full name of the species and strain, its taxonomic classification, genome size, number of genes, date of publication, additional curator information and a simple versioning mechanism (Janssen et al., 2003b). The Proteins table stores information about proteins from these genomes, including the full sequence, original annotation, the originally submitted protein identifier and further information (Janssen et al., 2003b). Additional CoGenT identifiers are generated for both genomes and protein sequences in a consistent fashion (Janssen et al., 2003b), to facilitate easy recognition of sequences by users and programs alike.
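The two-table layout described above can be illustrated with a minimal sketch. All table and column names, the identifier format and the toy sequences below are assumptions made for illustration only; the actual schema is the one distributed with the CoGenT++ downloads, and it runs on MySQL rather than the in-memory SQLite engine used here to keep the example self-contained.

```python
import sqlite3

# Hypothetical, simplified layout: names and types are illustrative and do
# not reproduce the published CoGenT schema.
SCHEMA = """
CREATE TABLE genomes (
    genome_id   TEXT PRIMARY KEY,          -- CoGenT-style identifier (format assumed)
    species     TEXT NOT NULL,
    strain      TEXT,
    taxonomy    TEXT,
    genome_size INTEGER,
    gene_count  INTEGER,
    published   TEXT,                      -- date of publication
    version     INTEGER NOT NULL DEFAULT 1 -- simple versioning
);
CREATE TABLE proteins (
    protein_id  TEXT PRIMARY KEY,          -- CoGenT-style identifier (format assumed)
    genome_id   TEXT NOT NULL REFERENCES genomes(genome_id),
    original_id TEXT,                      -- identifier as originally submitted
    annotation  TEXT,                      -- original annotation
    sequence    TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding one genome and two toy proteins."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO genomes (genome_id, species, strain, published) "
        "VALUES (?, ?, ?, ?)",
        ("G001", "Haemophilus influenzae", "Rd", "1995"),
    )
    conn.executemany(
        "INSERT INTO proteins (protein_id, genome_id, original_id, annotation, sequence) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            # Placeholder records, not real database entries.
            ("G001_P1", "G001", "orig-1", "toy protein 1", "MKTAYIAK"),
            ("G001_P2", "G001", "orig-2", "toy protein 2", "MSLLTEVE"),
        ],
    )
    return conn


def proteins_of(conn, genome_id):
    """Return the protein identifiers recorded for one genome, sorted."""
    rows = conn.execute(
        "SELECT protein_id FROM proteins WHERE genome_id = ? ORDER BY protein_id",
        (genome_id,),
    )
    return [row[0] for row in rows]
```

Keeping the genome identifier as a foreign key in the protein table is what would let derived tables (similarities, profiles, families) refer to proteins and still be grouped by genome with a single join.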

One of the most important recent developments is the linking of the CoGenT identifier space with ‘official’ identifier conventions. Thus, the CoGenT++ environment is now fully cross-linked with 5 million links to major molecular biology databases, namely UniProt (Bairoch et al., 2005), EMBL (Kanz et al., 2005), GenBank (Benson et al., 2005), RefSeq (Pruitt et al., 2005) and PDB (Deshpande et al., 2005). This is achieved through the MagicMatch algorithm (Smith et al., 2005), accessible at http://cgg.ebi.ac.uk/services/magicmatch/.

The CoGenT++ schema allows the user to include additional protein databases and link them to the system. In the current set-up, we have included the Swiss-Prot database (Boeckmann et al., 2003), in order to link to a high-quality annotation resource.

ProXSim: a similarity database

The similarity database contains all-against-all pairwise similarities of proteins computed using BlastP (Altschul et al., 1997), and filtered for compositionally biased regions using CAST (Promponas et al., 2000). This database is built from all protein sequence data in CoGenT, plus other imported sequence collections, currently Swiss-Prot, release 42.11 (Boeckmann et al., 2003). The principal use of this component (database tier) is the storage of pre-computed similarities and their subsequent use by different applications (application tier), depending on the question at hand, for example phylogenetic profiles or protein families (see below). In this way, a single large-scale computation provides the basis upon which other systems or resources are automatically built: for example, the entry of each new genome sequence triggers its comparison against CoGenT sequences (see below). This reduces the need for the recurrent similarity searches over specific genome subsets of a finite protein universe that are otherwise performed daily and consume significant computational resources. We hope that colleagues will find the pre-computed set of similarities useful in their own research. Due to the large size of the database (Table 1), interactive access to the similarity information is limited to queries by protein identifier, for which the ‘Phylogen’ server (Fig. 2) returns the phylogenetic profile of the query protein. Interactive sequence searches against genomes are also supported at this level via BlastP (Altschul et al., 1997). Both Phylogen and BlastP support ad hoc analyses.

Incremental update mechanism

All similarities are computed using the BlastP program (Altschul et al., 1997). However, the main estimate of similarity significance reported by Blast is the E-value, which depends on the database size. As the database grows, previously computed E-values become outdated and would need to be recomputed. We aimed to reduce the computational load on the system and to avoid recalculating the complete database every time a new genome is included. We therefore rely on the BlastP bit-score, which is independent of the database size, and set the database size to a constant value (see below). The E-value can then be computed from bit-scores and query-dependent database sizes (Altschul et al., 1997).

Therefore, we use the BlastP bit-score $b$ as an estimate of sequence similarity because it only depends on the alignment and the substitution matrix. Then, the E-value is calculated in a simplified yet uniform manner as follows: $E_{\mathrm{simpl}} = L_{\mathrm{eff}} \cdot S_{\mathrm{eff}} \cdot 2^{-b}$, where the effective database size $S_{\mathrm{eff}}$ is set to $10^{8}$ residues and $L_{\mathrm{eff}}$ is the effective protein length (number of amino acid residues not masked by CAST). In order to perform a consistent and incremental update of the similarity database each time a new genome is released and processed, while keeping the previously computed values in the database, we use an E-value cut-off of $10^{-5}$ on $E_{\mathrm{simpl}}$. Note that for very small peptides this cut-off is too stringent to accept even full length alignments. Hence, we calculate the cut-off that would accept alignments covering 40% of the query protein length and we actually use the most permissive cut-off between this alternative cut-off and $10^{-5}$.
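The simplified E-value and the choice between the two cut-offs can be sketched as follows. The constants are taken from the text; how the alternative cut-off for short peptides is derived is not specified here, so the sketch assumes a fixed bit-score yield per aligned residue (the hypothetical parameter `bits_per_residue`), which should be read as an illustration rather than as the rule used in ProXSim.

```python
S_EFF = 1e8          # effective database size in residues (from the text)
BASE_CUTOFF = 1e-5   # default cut-off on the simplified E-value (from the text)
MIN_COVERAGE = 0.4   # fraction of the query an accepted alignment should cover


def e_simpl(l_eff, bit_score, s_eff=S_EFF):
    """Simplified E-value: L_eff * S_eff * 2^(-b)."""
    return l_eff * s_eff * 2.0 ** (-bit_score)


def cutoff_for(l_eff, bits_per_residue=2.0):
    """Return the more permissive of the default and the coverage-based cut-off.

    ASSUMPTION: an alignment covering 40% of the query is taken to score
    bits_per_residue * 0.4 * l_eff bits. This conversion is hypothetical.
    """
    alt_bits = bits_per_residue * MIN_COVERAGE * l_eff
    alt_cutoff = e_simpl(l_eff, alt_bits)
    # The larger E-value threshold is the more permissive one.
    return max(BASE_CUTOFF, alt_cutoff)


def accept(l_eff, bit_score, bits_per_residue=2.0):
    """Decide whether a hit is kept, given the query's unmasked length."""
    return e_simpl(l_eff, bit_score) <= cutoff_for(l_eff, bits_per_residue)
```

Because neither `l_eff` nor `bit_score` changes when further genomes are added, a value computed once with this formula remains valid, which is the property the incremental update relies on. Under the assumed conversion, a 10-residue peptide could never reach the default cut-off even with a full-length alignment, so only the coverage-based threshold lets it contribute hits.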

This setup keeps every previously computed value valid when a new genome is added. By designing the database tier to update genome similarities incrementally, CoGenT++ provides a scalable and automatic update mechanism for the pairwise comparison of all genomes, at least for the foreseeable future. With the growth of genome sequence information (Fig. 1), the size of the similarity database increases quadratically. This will eventually lead to a data size explosion that could challenge methods of storage and distribution for end users, and other solutions must be sought, e.g. distributed databases across the GRID (Stevens et al., 2003; Teo et al., 2004).
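The saving from incremental updates, and the quadratic growth of the comparison space, can be made concrete with a short count of sequence comparisons. The sketch assumes that comparisons are counted as ordered (query, target) pairs including self-comparisons, and that a new genome is searched in both directions against the existing collection; both are simplifying assumptions, and the size of the incoming genome in the usage note is hypothetical.

```python
def full_recompute(n_old, n_new):
    """Ordered pairs compared if the whole collection is searched again."""
    total = n_old + n_new
    return total * total


def incremental_update(n_old, n_new):
    """Ordered pairs compared if only pairs involving a new protein are searched."""
    # new-vs-old and old-vs-new, plus new-vs-new
    return 2 * n_old * n_new + n_new * n_new


def saved_comparisons(n_old, n_new):
    """Comparisons avoided by keeping the previously computed values."""
    return full_recompute(n_old, n_new) - incremental_update(n_old, n_new)
```

For the 822 115 proteins listed in Table 1 and a hypothetical incoming genome of 2000 proteins, `saved_comparisons(822115, 2000)` equals `822115 ** 2`: the entire old-versus-old block is reused, and only the terms involving the new genome are computed.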

TRIBES and OFAM: protein families and putative orthologs

Protein family classification is a key step in many computational genomics projects. CoGenT++ provides protein family information in two forms: the TRIBES protein family database and the OFAM ortholog family database. The TRIBES database (Enright et al., 2003) is derived from the complete set of pairwise similarities using the TRIBE-MCL algorithm (Enright et al., 2002), a method suited to the rapid and accurate detection of protein families on a large scale. OFAM is a database of protein ortholog families derived using best bidirectional hits—instead of pairwise similarities—and clustered as in TRIBES. Reciprocal best hits have been proposed as an operational definition of orthologs (Overbeek et al., 1999). The OFAM database provides higher granularity, i.e. specific clusters, while TRIBES provides wider protein families.
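The best bidirectional hit criterion used as input for OFAM can be sketched in a few lines. The tuple layout of the similarity records, the function names and the tie-breaking rule (the first hit seen wins) are assumptions of this illustration, and the subsequent clustering step is not shown.

```python
def best_hits(similarities):
    """Map (query protein, target genome) to its best-scoring (target, score).

    similarities: iterable of (query, query_genome, target, target_genome, bit_score).
    """
    best = {}
    for query, q_genome, target, t_genome, score in similarities:
        if q_genome == t_genome:
            continue  # candidate orthologs are sought across genomes only
        key = (query, t_genome)
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return best


def reciprocal_best_hits(similarities):
    """Return the set of protein pairs that are each other's best hit."""
    sims = list(similarities)
    genome_of = {}
    for query, q_genome, target, t_genome, _ in sims:
        genome_of[query] = q_genome
        genome_of[target] = t_genome
    best = best_hits(sims)
    pairs = set()
    for (query, _), (target, _) in best.items():
        # Look up the target's best hit back in the query's genome.
        back = best.get((target, genome_of[query]))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, target))))
    return pairs
```

Restricting the input to such pairs before clustering is what would yield the narrower, more specific clusters attributed to OFAM, whereas clustering all significant similarities produces the wider families of TRIBES.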

AllFuse: protein fusions

The detection of gene fusion events across genomes can be used for predicting functional associations of proteins (Marcotte et al., 1999), including physical interaction and complex formation (Enright et al., 1999). CoGenT++ incorporates data on gene fusion events computed with an updated automatic protocol called Diffuse-2 (Enright et al., 1999); this dataset is expected to supersede the previous AllFuse collection (Table 1). Each genome is used as a query against the remaining complete genomes to detect gene fusions. Pairs of proteins in the query genome that are not similar to each other and are found to be similar to a single, composite protein from the reference genomes, across non-overlapping segments, are predicted to be functionally associated (Enright et al., 1999).
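The fusion criterion stated above (two mutually dissimilar query proteins matching non-overlapping segments of one composite protein) can be expressed as a small filter. This is a sketch of the criterion only, not of the Diffuse-2 protocol; the input layout and function names are assumptions.

```python
from itertools import combinations


def overlaps(seg_a, seg_b):
    """True if two inclusive (start, end) segments share at least one position."""
    return seg_a[0] <= seg_b[1] and seg_b[0] <= seg_a[1]


def find_fusions(hits, similar_pairs):
    """Predict functionally associated query proteins from composite matches.

    hits: iterable of (query_protein, composite_protein, start, end), with
          coordinates given on the composite protein.
    similar_pairs: set of frozenset({p, q}) for query proteins that are
          similar to each other and must therefore be excluded.
    Returns a set of (protein_a, protein_b, composite) tuples.
    """
    by_composite = {}
    for query, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((query, (start, end)))

    predictions = set()
    for composite, members in by_composite.items():
        for (q1, seg1), (q2, seg2) in combinations(members, 2):
            if q1 == q2:
                continue  # two hits of the same protein
            if frozenset((q1, q2)) in similar_pairs:
                continue  # components must not be similar to each other
            if overlaps(seg1, seg2):
                continue  # components must map to distinct segments
            a, b = sorted((q1, q2))
            predictions.add((a, b, composite))
    return predictions
```

The mutual-dissimilarity test is what separates a genuine two-component fusion from a pair of paralogs that both happen to match the same multi-domain protein.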

ProfUse: phylogenetic profiles

A phylogenetic profile is defined as a string that encodes the presence or absence of a protein in every known genome (Pellegrini et al., 1999). Proteins with similar phylogenetic profiles have been shown to be involved in related cellular processes. Phylogenetic profiles in CoGenT++ are generated from the ProXSim similarity database data and are represented as binary vectors, where each bit represents an individual protein's hit to a genome. Given a query protein, the ‘Phylogen’ server returns its pre-computed phylogenetic profile (see above). The reverse operation is achieved through the ‘ProfUse’ server: given an arbitrary phylogenetic profile from a list of species, it returns all proteins consistent with the specified profile. Note that this set of proteins might contain homologous sequences exhibiting identical profiles due to evolutionary relationships or non-homologous sequences exhibiting identical profiles due to functional relationships.
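The binary-vector representation and the two complementary lookups can be sketched as follows. The functions `phylogen` and `profuse` merely mimic the behaviour described for the two servers; their names, signatures and the integer bit-mask encoding are assumptions of this illustration, not the servers' actual interfaces.

```python
def build_profiles(hits, genomes):
    """Encode, per protein, the genomes in which it has a hit as an integer bit mask.

    hits: iterable of (protein, genome) pairs derived from pairwise similarities.
    genomes: ordered list of genome identifiers; position i maps to bit i.
    """
    index = {genome: i for i, genome in enumerate(genomes)}
    profiles = {}
    for protein, genome in hits:
        profiles[protein] = profiles.get(protein, 0) | (1 << index[genome])
    return profiles


def profile_string(bits, n_genomes):
    """Render a bit mask as a 0/1 string, one character per genome."""
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(n_genomes))


def phylogen(profiles, protein):
    """Forward lookup: protein identifier to its profile (0 if unknown)."""
    return profiles.get(protein, 0)


def profuse(profiles, genomes, present, absent):
    """Reverse lookup: all proteins consistent with a partial profile."""
    index = {genome: i for i, genome in enumerate(genomes)}
    must = 0
    for genome in present:
        must |= 1 << index[genome]
    must_not = 0
    for genome in absent:
        must_not |= 1 << index[genome]
    return sorted(
        protein
        for protein, bits in profiles.items()
        if (bits & must) == must and (bits & must_not) == 0
    )
```

Genomes named in neither `present` nor `absent` are left unconstrained, which corresponds to specifying a profile over a list of species rather than over every genome in the collection.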

CoGenT++ extensions

CoGenT++ can easily be linked to other systems (Fig. 2). Currently it is loosely coupled to GeneQuiz, an expert system for automated genome annotation (Andrade et al., 1999); GeneRAGE, a sequence clustering algorithm (Enright and Ouzounis, 2000); disease gene prediction (DGP), a database of human genes with their probability of being involved in a hereditary disease (Lopez-Bigas and Ouzounis, 2004); and the Genome Phylogeny Server (GPS), an approach for the computation of phylogenetic trees using estimation of species distances from genome-wide sequence similarity (Kunin et al., 2005). Other systems are in the process of being linked to CoGenT++. Current work in this direction includes the reconstruction of ancestral states using the GeneTrace algorithm (Kunin and Ouzounis, 2003), pattern discovery methods applied to protein families using Teiresias (Rigoutsos and Floratos, 1998) and automated metabolic reconstructions with the Pathway Tools software suite (Karp et al., 2002) (‘in progress’, Fig. 2). We hope that other groups who find this resource useful in their research projects will both link and provide pertinent open-access data following the CoGenT++ architecture.

The potential adoption of the CoGenT schema by other groups and the provision of additional datasets that could easily be linked to the original CoGenT tables, for instance gene expression or protein interaction information, is expected to contribute towards highly consistent results that are readily accessible and reproducible.

Comparison to other systems

Similar systems that deliver genome information have been developed before. Notable cases are the NCBI genomes database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome) and the EBI Genome database (http://www.ebi.ac.uk/integr8/), serving a varied community of end-users, ranging from novice to sophisticated. Neither integrates further information in the form of protein families or interactions, nor any tightly coupled methods that generate secondary information. The COG database maintains a list of orthologs for various species but requires manual intervention, and no database dumps are readily available (Tatusov et al., 2003). The STRING database contains information about both genomes and gene context (such as gene fusions), and derives orthology directly from the COG database (von Mering et al., 2005). Analogous databases providing information for gene context are ProLinks (Bowers et al., 2004) and Predictome (Mellor et al., 2002). Both of these databases provide flat-file downloads and links to other databases, but no database dumps or link mechanisms. Finally, the Comprehensive Microbial Resource provides detailed information about gene and protein function, and various computed characteristics of sequences and genomes (Peterson et al., 2001), yet it does not provide tightly coupled systems for analysis and inference.

Unique features of CoGenT/CoGenT++ include its transparency and the tracking of genome information in precise chronological order (Janssen et al., 2003b). In this manner, we are able to trace the time of discovery of novel protein families and the sampling of sequence or phylogenetic space (Kunin et al., 2003). Moreover, the two strict criteria of both publication and availability of genomes allow us to incorporate genome data in an objective manner. Other unique features include the naming scheme that facilitates both computer and human interaction, the full availability of the entire resource as both flat files and MySQL dumps, and finally the simplicity of design and implementation. All these assets should make the CoGenT++ environment useful for a wider community and facilitate research in computational genomics.

DATA ACCESS AND PLATFORM REQUIREMENTS

CoGenT++ is available via an interactive website, MySQL or flat files. The URL http://cgg.ebi.ac.uk/cogentpp.html is the main entry point to the CoGenT++ environment.

The MySQL database and the Apache HTTP server run on a Sun Microsystems Enterprise E450 server with 4 CPUs and 4 GB of shared memory with access to a 140 GB disk partition. All data are downloadable as MySQL dumps and flat files, where applicable. MySQL is available for multiple platforms.

Similarity searches are performed on a 200-CPU cluster, kindly provided by IBM to the Research Programme of the European Bioinformatics Institute.

CAO would like to acknowledge further support by IBM Research and the UK Medical Research Council.

Conflict of Interest: none declared.

Fig. 1

Exponential growth of the CoGenT genome collection as number of protein sequence entries. The x-axis corresponds to the date of publication and the y-axis to the number of protein entries in CoGenT. Despite the fact that the CoGenT project was initiated in 2000, we record the original date of publication for each genome since 1995 (Janssen et al., 2003b).

Fig. 2

Conceptual design of the CoGenT++ environment. This representation is the entry point of the web site as a clickable map. The database tier is composed of the CoGenT and ProXSim databases, the application tier is composed of derived (cyan) or linked (gray) databases to other imported resources (blue). The client tier includes external access to data and interactive servers via the WWW. Users can query the system either by identifier using precomputed database cross-references (via MagicMatch) or sequence (Similarity calculations, via BLAST).

Table 1

Contents of the principal databases in the CoGenT++ environment

Database     Genomes   Entries        Size (Mb)
CoGenT       221       822 115        365.0
ProXSim (a)  221       435 505 934    43 000.0
ProfUse      221       822 115        870.0
OFAM         221       308 594        67.6
TRIBES       221       209 021        65.4
AllFuse (b)  184       2 192 019      20 700.0
Total                                 65 068.0

Entries: protein sequences in CoGenT, pairwise similarities in ProXSim, phylogenetic profiles in ProfUse, putative ortholog clusters in OFAM, protein families in TRIBES and fusion events in AllFuse.

(a) Includes Swiss-Prot.

(b) In the process of being updated.

REFERENCES

Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Andrade, M.A., et al. (1999) Automated genome sequence analysis and annotation. Bioinformatics, 15, 391–412.
Bairoch, A., et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.
Benson, D.A., et al. (2005) GenBank. Nucleic Acids Res., 33, D34–D38.
Boeckmann, B., et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Bowers, P.M., et al. (2004) Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol., 5, R35.
Deshpande, N., et al. (2005) The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res., 33, D233–D237.
Enright, A.J. and Ouzounis, C.A. (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451–457.
Enright, A.J., et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.
Enright, A.J., et al. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res., 30, 1575–1584.
Enright, A.J., et al. (2003) Protein families and TRIBES in genome sequence space. Nucleic Acids Res., 31, 4632–4638.
Fleischmann, R.D., et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496–512.
Janssen, P., et al. (2003a) Beyond 100 genomes. Genome Biol., 4, 402.
Janssen, P., et al. (2003b) COmplete GENome Tracking (COGENT): a flexible data environment for computational genomics. Bioinformatics, 19, 1451–1452.
Janssen, P., et al. (2005) Genome coverage, literally speaking. The challenge of annotating 200 genomes with 4 million publications. EMBO Rep., 6, 397–399.
Kanz, C., et al. (2005) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 33, D29–D33.
Karp, P.D., et al. (2002) The Pathway Tools software. Bioinformatics, 18 (Suppl. 1), S225–S232.
Kunin, V. and Ouzounis, C.A. (2003) GeneTRACE-reconstruction of gene content of ancestral species. Bioinformatics, 19, 1412–1416.
Kunin, V., et al. (2003) Myriads of protein families, and still counting. Genome Biol., 4, 401.
Kunin, V., et al. (2005) Measuring genome conservation across taxa: divided strains and united kingdoms. Nucleic Acids Res., 33, 616–621.
Lopez-Bigas, N. and Ouzounis, C.A. (2004) Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res., 32, 3108–3114.
Maglott, D., et al. (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res., 33, D54–D58.
Marcotte, E.M., et al. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.
Mellor, J.C., et al. (2002) Predictome: a database of putative functional links between proteins. Nucleic Acids Res., 30, 306–309.
Overbeek, R., et al. (1999) The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA, 96, 2896–2901.
Pellegrini, M., et al. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288.
Peterson, J.D., et al. (2001) The Comprehensive Microbial Resource. Nucleic Acids Res., 29, 123–125.
Promponas, V.J., et al. (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics, 16, 915–922.
Pruitt, K.D., et al. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.
Rigoutsos, I. and Floratos, A. (1998) Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics, 14, 55–67.
Smith, M., Kunin, V., Goldovsky, L., Enright, A.J. and Ouzounis, C.A. (2005) Magicmatch – cross-referencing sequence identifiers across databases. Bioinformatics, 21, 3429–3430.
Stevens, R.D., et al. (2003) myGrid: personalised bioinformatics on the information grid. Bioinformatics, 19 (Suppl. 1), i302–i304.
Tatusov, R.L., et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
Teo, Y.M., et al. (2004) GLAD: a system for developing and deploying large-scale bioinformatics grid. Bioinformatics, 21, 794–802.
von Mering, C., et al. (2005) STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res., 33, D433–D437.
