Functional links between proteins can often be inferred from genomic associations between the genes that encode them: groups of genes that are required for the same function tend to show similar species coverage, are often located in close proximity on the genome (in prokaryotes), and tend to be involved in gene-fusion events. The database STRING is a precomputed global resource for the exploration and analysis of these associations. Since the three types of evidence differ conceptually, and the number of predicted interactions is very large, it is essential to be able to assess and compare the significance of individual predictions. Thus, STRING contains a unique scoring-framework based on benchmarks of the different types of associations against a common reference set, integrated in a single confidence score per prediction. The graphical representation of the network of inferred, weighted protein interactions provides a high-level view of functional linkage, facilitating the analysis of modularity in biological processes. STRING is updated continuously, and currently contains 261 033 orthologs in 89 fully sequenced genomes. The database predicts functional interactions at an expected level of accuracy of at least 80% for more than half of the genes; it is online at http://www.bork.embl-heidelberg.de/STRING/ .
Received August 14, 2002; Accepted September 11, 2002
Protein–protein interactions are not limited to direct physical binding. Proteins may also interact indirectly—by sharing a substrate in a metabolic pathway, by regulating each other transcriptionally, or by participating in larger multi-protein assemblies. For predicting such functional associations (including direct binding), the current growth in completed genomes offers unique opportunities through so-called ‘genomic context’ or ‘nonhomology-based’ inference methods ( 1 – 3 ).
These methods are based on the fact that functionally associated proteins are encoded by genes that share similar selection pressures—the genes need to be maintained together, and regulated together, such that the encoded proteins can interact at the same time and place in the cell. This leaves signals in the genome, which become detectable above the noise of random genomic events when analyzing multiple species. For example, the need for maintaining functionally associated genes together can become visible as an agreement in occurrence-patterns across several genomes ( 4 , 5 ): the genes tend to be either present together, or absent together—they have the same ‘phylogenetic profile’. This is particularly informative when the profile is not in agreement with organismal phylogeny, as is the case when horizontal transfers or gene losses are involved ( 6 , 7 ). Likewise, the need for similar regulation is often reflected in a tendency of functionally associated genes to be close neighbors in prokaryotic genomes ( 8 , 9 ), where they generally have the same transcriptional orientation and little or no sequence between them. This suggests that they are single transcription units (operons), recurring in similar but not identical composition across several genomes ( 10 ). Finally, genes whose protein products need to interact closely in the cell have a noticeable tendency to be fused into a single gene, encoding a combined polypeptide ( 11 , 12 ) in which the proteins have a higher chance of interacting productively.
Optimal, user-friendly exploitation of genomic context for the prediction of functional interactions requires: (i) a benchmarked scoring scheme that integrates the three types of context and gives a confidence value for each prediction, (ii) automatic implementation and orthology assignment of the genes in newly published genomes, and (iii) easy navigation between various displays so that not only the pairwise interactions, but also the network of interactions and the presence of potential (sub)modules in the network become visible. Previous genomic context databases such as Indigo ( 13 ), the first version of STRING ( 14 ), the Clusters of Orthologous Groups (COG) database ( 15 ), Predictome ( 16 ), and SNAPper ( 17 ) rely largely on a single form of genomic context. Where they do include multiple forms (Predictome and COG), these are not integrated; nor do any of the databases indicate the reliability of the predictions. This indication of reliability is necessary: with the ever-increasing number of genomes, the number of predictions can become quite large and, depending on the parameters, may include many false positives. We took the opportunity of a complete redesign of STRING to introduce such a scoring scheme, derived by integrating all three types of genomic context. Additionally, STRING is now continuously updated and the predictions are fully precomputed. Particular emphasis has been placed on fast and easy navigation, coupled to integrated visual outputs (see Fig. 1 for an example output of STRING).
Users enter the database via a protein of interest, for which functional associations are to be predicted. This protein can be identified by its accession number or identifier. Alternatively, the raw amino acid sequence of the protein can be supplied (in this case, checksum lookups and similarity searches are done to identify the corresponding entry in the database). The user is then presented with a summary of the predicted functional links for the protein, ranked by estimated confidence. Further pages are accessible which summarize and explain the evidence that leads to the predictions. Additionally, a fully interactive network display is available—allowing navigation through the combined functional associations. The network display also allows iteration—zooming out of a particular module and visualizing its connections to other modules. For independent computational analysis, the entire set of predictions contained in STRING is available as computer-readable flat-files through the website.
The concepts behind the individual algorithms for the prediction of functional associations have all been published and validated previously; for STRING, only minor modifications were made. The requirements for the detection of gene fusions are stricter than those published previously ( 11 , 12 ); fused proteins are not recognized by homology, but rather by orthology of the fused parts to other, non-fused proteins ( 18 , 19 ).
For neighborhood evidence, a repeatedly occurring neighborhood is required, in species that are sufficiently remote to uncover functional constraints on gene order.
For the analysis of gene co-occurrence, STRING does not require perfect agreement between the occurrence of two genes, but uses a measure from information theory, mutual information ( 20 , 21 ), which quantifies how much the knowledge that one gene is present tells us about the presence of another gene in the same genome. The specific algorithm used here corrects for biases in the number of genomes sequenced for a particular branch of phylogeny, by collapsing into a single node those taxa in which the presence or absence of a specific gene pair is in agreement in all the species.
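As an illustration of the co-occurrence measure, the following minimal sketch computes the mutual information between two presence/absence profiles, one entry per genome. The function name and the input encoding (1 for present, 0 for absent) are assumptions made for this example, and the sketch deliberately omits the collapsing of phylogenetically redundant taxa described above.

```python
import math
from collections import Counter


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two presence/absence profiles.

    Each profile is a sequence with one entry per genome (1 = gene present,
    0 = gene absent). Hypothetical helper, not the STRING implementation.
    """
    if len(profile_a) != len(profile_b) or len(profile_a) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))  # counts of (a, b) pairs
    count_a = Counter(profile_a)                # marginal counts for gene A
    count_b = Counter(profile_b)                # marginal counts for gene B
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b) rewritten in terms of raw counts
        mi += p_ab * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Two genes that are always present or absent together share 1 bit;
# two genes whose occurrence is unrelated share 0 bits.
same = mutual_information([1, 1, 0, 0], [1, 1, 0, 0])
unrelated = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
```

Note that mutual information is symmetric in presence and absence, so two genes with perfectly complementary profiles would also score highly; any treatment of that case is outside the scope of this sketch.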
SCORING-FRAMEWORK AND BENCHMARKING
The three types of genomic association each contain quantitative information (e.g. the number of times two genes occur together in an operon). Additionally, there is a positive correlation between the genomic associations and the likelihood and strength of interactions ( 9 , 21 ); this allows the derivation of a scoring system.
We benchmarked the various genomic associations separately (Fig. 2 ), based on the co-occurrence of proteins on metabolic maps in the KEGG database ( 22 ); proteins that occur on the same metabolic KEGG map are presumed to be functionally interacting, those that occur on different maps are not. For both fusion and conserved gene order, we find that the simple counting of events is insufficient; it is outperformed by a score that includes normalization by the number of species covered by the genes involved (Fig. 3 ).
The comparison of the different types of genomic association to the same benchmark helps to establish which scores in each method are equivalent. For example, at a fusion frequency of 0.04, 50% of the predicted pairs are on the same KEGG map, while this is only reached at a conserved gene order frequency of 0.10 (Fig. 2 ). This equivalency can be formalized by finding a function that describes the relation between the score and the observed accuracy. The correlations of the genomic association counts with the fraction of proteins on the same KEGG map are sigmoidal, and we therefore fitted them to Hill equations (Fig. 2 ).
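The equivalency mapping can be sketched with a generic Hill equation and its inverse. The fitted parameters are not given in the text, so the values below are illustrative assumptions: the maximum accuracy `top` is set to 1.0, the Hill coefficient `n` to 2.0, and the half-maximal points `k` are chosen only so that the curves cross 50% at the frequencies quoted above (0.04 for fusion, 0.10 for conserved gene order).

```python
def hill(x, top, k, n):
    """Hill equation: sigmoidal accuracy as a function of a raw score x >= 0.

    top = plateau accuracy, k = score at half of top, n = Hill coefficient.
    """
    if x <= 0:
        return 0.0
    return top * x**n / (k**n + x**n)


def inverse_hill(accuracy, top, k, n):
    """Raw score at which the Hill curve reaches the given accuracy."""
    if not 0.0 < accuracy < top:
        raise ValueError("accuracy must lie strictly between 0 and top")
    return k * (accuracy / (top - accuracy)) ** (1.0 / n)


# Hypothetical parameter sets (NOT the fitted values from Fig. 2).
FUSION = dict(top=1.0, k=0.04, n=2.0)
GENE_ORDER = dict(top=1.0, k=0.10, n=2.0)

# Accuracy implied by a fusion frequency of 0.04 under these assumptions,
# and the gene-order frequency that would be equivalent to it.
fusion_accuracy = hill(0.04, **FUSION)
equivalent_gene_order = inverse_hill(fusion_accuracy, **GENE_ORDER)
```

Mapping every raw score onto the common accuracy axis in this way is what makes scores from different evidence types directly comparable.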
The equivalency mapping makes it possible to combine the three Hill equations into a single score. We integrate the scores by multiplying, across the evidence types, the probabilities that an association does not predict a functional interaction; the complement of this product is the combined score. In this way, multiple scores can be combined to form a single score that expresses a higher confidence (Fig. 3 ). Combining the separate scores leads to a higher coverage at a given accuracy, specifically for the genes that score sub-optimally for all the individual genomic associations (Fig. 3 ). Remarkably, gene-order conservation remains clearly the most powerful method of the three ( 21 ).
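The integration step can be illustrated with a short sketch. It assumes that each individual score has already been converted to a probability between 0 and 1, and it treats the evidence types as independent, which is what multiplying the "no interaction" probabilities implies. The function name is an assumption made for this example.

```python
def combine_scores(scores):
    """Combine per-evidence confidence scores into a single confidence.

    Multiplies the probabilities that each association does NOT predict a
    functional interaction, then takes the complement of that product.
    """
    p_no_interaction = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("each score must be a probability in [0, 1]")
        p_no_interaction *= 1.0 - s
    return 1.0 - p_no_interaction


# Three sub-optimal individual scores (illustrative numbers only)
# yield a combined score higher than any one of them.
combined = combine_scores([0.6, 0.5, 0.2])  # 1 - 0.4 * 0.5 * 0.8 = 0.84
```

Because every factor is at most 1, the combined score is never lower than the best individual score, which is why pairs with weak evidence of several types gain the most.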
DATA SOURCES, ORTHOLOGY
For information on genomes, genes, and encoded proteins, STRING relies on the annotated proteomes maintained by SWISS-PROT ( 23 ). Assignment of functional equivalence of genes across these genomes is essential for the predictions, and this information is derived from the manually curated orthology database, COGs ( 15 ). For any genomes not yet present in the COG database, orthology assignments are made by an automatic method resembling the COG procedure. This results not only in the addition of new genes to COGs, which are presently based on 43 genomes, but also in the creation of a number of additional orthologous groups (NOGs, non-supervised orthologous groups); see http://www.bork.embl-heidelberg.de/STRING/ for details on the orthology assignment procedure. Essentially, assignments are based on triangles of reciprocal best matches between species in all-against-all Smith–Waterman searches, allowing for recent duplications within the genome, and including a clean-up step to join remaining genes by simple bidirectional hits.
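The core idea of triangles of reciprocal best matches can be sketched as follows. This is a simplified illustration, not the STRING procedure: the data layout (`best_hit`, `species_of`) and the function names are assumptions, the best matches are taken as given rather than computed from Smith–Waterman searches, and the handling of recent duplications and the clean-up step are omitted.

```python
from itertools import combinations


def is_reciprocal(best_hit, species_of, a, b):
    """True if a and b are each other's best match across their species."""
    return (best_hit.get((a, species_of[b])) == b
            and best_hit.get((b, species_of[a])) == a)


def rbh_triangles(best_hit, species_of):
    """Find gene triples from three species that are pairwise reciprocal best hits.

    best_hit maps (gene, target_species) -> best-matching gene in that species.
    species_of maps gene -> species. Brute-force search, for illustration only.
    """
    triangles = []
    for a, b, c in combinations(sorted(species_of), 3):
        if len({species_of[a], species_of[b], species_of[c]}) < 3:
            continue  # a triangle needs three distinct species
        if (is_reciprocal(best_hit, species_of, a, b)
                and is_reciprocal(best_hit, species_of, a, c)
                and is_reciprocal(best_hit, species_of, b, c)):
            triangles.append((a, b, c))
    return triangles


# Toy data: C2 hits A1 and B1, but neither of them hits C2 back.
species_of = {"A1": "spA", "B1": "spB", "C1": "spC", "C2": "spC"}
best_hit = {
    ("A1", "spB"): "B1", ("B1", "spA"): "A1",
    ("A1", "spC"): "C1", ("C1", "spA"): "A1",
    ("B1", "spC"): "C1", ("C1", "spB"): "B1",
    ("C2", "spA"): "A1", ("C2", "spB"): "B1",
}
triangles = rbh_triangles(best_hit, species_of)
```

In this toy data only A1, B1 and C1 form a triangle; a gene such as C2, with one-directional hits only, is the kind of case that the duplication handling and clean-up step mentioned above would have to resolve.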
STRING uses a relational database system (PostgreSQL, http://www.postgresql.org ) to store primary data, such as genes and genomic locations. Periodically, complete all-against-all runs of the prediction algorithms are performed, and the resulting functional associations are stored in the database system as well. Precomputed results are stored at several levels of detail, allowing for very fast navigation through the predictions.
This work was supported in part by grants from the Netherlands Organization for Scientific Research (NWO), from the Deutsche Forschungsgemeinschaft, and from the Bundesministerium für Forschung und Bildung, Germany, through its contribution to the Helmholtz Network for Bioinformatics.
1 European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
2 Max-Delbrück-Centre for Molecular Medicine, Robert-Rössle-Strasse 10, 13092 Berlin, Germany
3 Nijmegen Centre for Molecular Life Sciences, p/a Centre of Molecular and Biomolecular Informatics, Toernooiveld 1, 6525 ED Nijmegen, The Netherlands