Abstract

Motivation: High-throughput experiments are being performed at an ever-increasing rate to systematically elucidate protein–protein interaction (PPI) networks for model organisms, while the complexities of higher eukaryotes have prevented these experiments for humans.

Results: The Online Predicted Human Interaction Database (OPHID) is a web-based database of predicted interactions between human proteins. It combines the literature-derived human PPI from BIND, HPRD and MINT, with predictions made from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Mus musculus. The 23 889 predicted interactions currently listed in OPHID are evaluated using protein domains, gene co-expression and Gene Ontology terms. OPHID can be queried using single or multiple IDs and results can be visualized using our custom graph visualization program.

Availability: Freely available to academic users at http://ophid.utoronto.ca, both in tab-delimited and PSI-MI formats. Commercial users, please contact I.J.

Contact: juris@ai.utoronto.ca

Supplementary information: http://ophid.utoronto.ca/supplInfo.pdf

INTRODUCTION

The network of protein–protein interactions (PPIs), referred to as the interactome, forms a backbone of signaling pathways, metabolic pathways and cellular processes required for normal cell function. Complete knowledge of these pathways will help in the understanding of the normal processes in the cell, as well as how diseases such as cancer develop from mutation of individual pathway components. It has been the central aim of many high-throughput (HTP) experiments to elucidate the PPI networks in model organisms such as Saccharomyces cerevisiae (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Uetz et al., 2000), Caenorhabditis elegans (Li et al., 2004), Drosophila melanogaster (Giot et al., 2003) and Mus musculus (Suzuki et al., 2003). While few studies have been performed in humans (Colland et al., 2004; Lehner et al., 2004), we have used the HTP model organism interactions to infer some of the millions of potential human PPIs.

Many databases are devoted to the human interactome, with a substantial number of them appearing in recent months [DIP, HPID, HPRD, MINT, PINdb (Han et al., 2004; Luc and Tempst, 2004; Peri et al., 2003; Xenarios et al., 2000; Zanzoni et al., 2002)]. However, the majority of these databases are derived from hand-curated, literature-based interactions. Although highly useful in providing ready access to the known human interactions, they do little to expand the knowledge of the interactome. Several databases have also been published that make predictions about the functional relationships between proteins based on a variety of in silico methods (Predictome, STRING, Prolinks, POINT) (Bowers et al., 2004; Huang et al., 2004; Mellor et al., 2002; von Mering et al., 2003).

The Online Predicted Human Interaction Database (OPHID) was designed to extend the human interactome using model organism data and to provide a repository for already known, experimentally derived human PPIs. While these predictions should be thought of as hypotheses until experimentally validated, there is increasing evidence that PPIs are conserved through evolution (Pagel et al., 2004; Wuchty et al., 2003). OPHID catalogs 16 034 known human PPIs obtained from BIND, MINT and HPRD, and makes predictions for 23 889 additional interactions.

Multiple types of evidence have been used in the literature both to support experimentally derived PPIs and to predict interactions in silico. Examples include domain–domain co-occurrence (Deng et al., 2002; Sprinzak and Margalit, 2001), gene co-expression (Bader et al., 2004; Deane et al., 2002; Deng et al., 2003) and Gene Ontology (GO) terms (Bader et al., 2004; Sprinzak et al., 2003). Using the combination of the three types of evidence allows us to support a broader range of PPIs than any single method.

We have applied all three evidence types to OPHID, providing support for 5483 (23%) of our predicted PPIs. We believe that OPHID will be a useful resource for researchers concerned with the human interactome, especially when integrated with additional HTP datasets that are likely to be available in the future.

SYSTEM AND METHODS

OPHID generation

OPHID was constructed by mapping model organism PPIs to human protein orthologs using BLASTP and the reciprocal best-hit approach. Briefly, a database of model organism-to-human orthologs was constructed by BLASTing each model organism protein against the Swiss-Prot database filtered for human proteins. Each top BLAST hit with an E-value <10^−5 was BLASTed back against the set of all model organism protein sequences. If the top hit in the reverse direction (with E-value <10^−5) matched the original query protein, the matching human protein was selected as a potential ortholog. These were filtered to remove any hits that covered <50% of the query sequence length, to avoid interactions that may involve a single protein domain.
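As a rough illustration of the reciprocal best-hit filter described above, the following Python sketch works on BLAST results that have already been parsed. The `Hit` record, the function names and the toy identifiers are hypothetical; only the E-value cutoff (10^−5) and the 50% coverage filter come from the text, and applying the coverage filter to the forward (model organism to human) hit is an assumption.

```python
from typing import Dict, List, NamedTuple, Optional

E_VALUE_CUTOFF = 1e-5   # cutoff stated in the text
MIN_COVERAGE = 0.5      # hit must cover at least 50% of the query length


class Hit(NamedTuple):
    subject: str        # identifier of the matched sequence
    e_value: float
    aligned_len: int    # query residues covered by the alignment
    query_len: int


def best_hit(hits: List[Hit]) -> Optional[Hit]:
    """Return the lowest E-value hit that passes the cutoff, or None."""
    passing = [h for h in hits if h.e_value < E_VALUE_CUTOFF]
    return min(passing, key=lambda h: h.e_value) if passing else None


def reciprocal_best_hits(forward: Dict[str, List[Hit]],
                         reverse: Dict[str, List[Hit]]) -> Dict[str, str]:
    """forward: model protein -> hits in human; reverse: human protein -> hits in model."""
    orthologs = {}
    for model_id, hits in forward.items():
        fwd = best_hit(hits)
        if fwd is None:
            continue
        # Coverage filter (assumed here to apply to the forward hit).
        if fwd.aligned_len / fwd.query_len < MIN_COVERAGE:
            continue
        rev = best_hit(reverse.get(fwd.subject, []))
        if rev is not None and rev.subject == model_id:
            orthologs[model_id] = fwd.subject
    return orthologs


# Toy data: YFG1 and P1 are reciprocal best hits; YFG2 is not the best
# reverse hit of P1; YFG3 fails the coverage filter.
forward = {
    "YFG1": [Hit("P1", 1e-40, 300, 400), Hit("P2", 1e-10, 350, 400)],
    "YFG2": [Hit("P1", 1e-20, 200, 250)],
    "YFG3": [Hit("P3", 1e-30, 50, 400)],
}
reverse = {
    "P1": [Hit("YFG1", 1e-42, 310, 420), Hit("YFG2", 1e-18, 200, 420)],
    "P3": [Hit("YFG3", 1e-30, 50, 100)],
}
orthologs = reciprocal_best_hits(forward, reverse)
```

In this toy case only the YFG1/P1 pair survives, which shows how the reciprocal requirement trades coverage for stringency.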

Each model organism protein was translated to its human ortholog and a predicted human interaction was added if both proteins in the model organism interaction were conserved in humans. Model organism PPIs were added from S.cerevisiae, C.elegans, D.melanogaster and M.musculus using this technique. For a complete listing of data sources and references, refer to Table 1.
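The mapping step can be sketched as follows, assuming an ortholog table such as the one produced by a reciprocal best-hit search. The function name and identifiers are hypothetical. Storing each predicted pair as an unordered `frozenset` is a design choice of this sketch, so that the same human pair predicted from several organisms is counted once.

```python
def map_interactions(model_ppis, orthologs):
    """Translate model organism PPIs to predicted human PPIs.

    A prediction is made only if both partners have a human ortholog.
    """
    predicted = set()
    for a, b in model_ppis:
        if a in orthologs and b in orthologs:
            predicted.add(frozenset((orthologs[a], orthologs[b])))
    return predicted


# Toy ortholog table covering two hypothetical model organisms (Y*, W*).
ortholog_table = {"Y1": "H1", "Y2": "H2", "Y3": "H3", "W1": "H1", "W2": "H2"}
model_ppis = [
    ("Y1", "Y2"),   # maps to H1-H2
    ("W2", "W1"),   # maps to the same human pair, counted once
    ("Y1", "Y4"),   # Y4 has no human ortholog, so no prediction
    ("Y3", "Y1"),   # maps to H3-H1
]
predicted_ppis = map_interactions(model_ppis, ortholog_table)
```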

Domain co-occurrence dataset generation

The literature-derived PPIs from BIND, DIP, HPRD and MINT were used to create a domain–domain co-occurrence network using the InterPro domains obtained from Swiss-Prot. For every interacting protein pair, each domain from protein A was connected to the domains in protein B. The frequency of these domain pairs was determined for all interacting protein pairs (n = 16 107), as well as all non-interacting pairs (i.e. all proteins not reported to interact in BIND, DIP, MINT or HPRD; n = 1.8 × 10^7). The hypergeometric distribution was used to determine which domain pairs were enriched in interacting protein pairs compared to the non-interacting pairs. After applying the Bonferroni correction to account for repeated sampling, 4182 domain–domain pairs were identified with P < 9.2 × 10^−7 between 1164 domains.
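The enrichment test can be illustrated with a small Python sketch using only the standard library. The counts, the number of tests and the significance level below are toy values chosen for illustration, not figures from this study, and the function names are hypothetical.

```python
from math import comb


def domain_pairs(domains_a, domains_b):
    """Connect every domain of protein A to every domain of protein B (unordered)."""
    return {tuple(sorted((da, db))) for da in domains_a for db in domains_b}


def hypergeom_tail(k, K, n, N):
    """P(X >= k): chance of seeing at least k interacting pairs that carry a
    domain pair, when n of N protein pairs interact and K of the N pairs
    carry the domain pair."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Toy numbers: 1000 protein pairs, 50 of them interacting; 20 pairs carry
# the domain pair, and 10 of those 20 are interacting pairs.
p_value = hypergeom_tail(k=10, K=20, n=50, N=1000)

# Bonferroni correction: divide the significance level by the number of
# domain pairs tested (both values are assumptions of this example).
alpha, n_tests = 0.05, 200
bonferroni_threshold = alpha / n_tests
is_enriched = p_value < bonferroni_threshold
```

With only one interacting pair expected by chance (50 × 20 / 1000), observing ten is strongly enriched even after correction.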

Co-expression dataset

Human gene expression data was obtained from the GeneAtlas Affymetrix dataset, which includes expression data for 44 775 human genes from 79 normal human tissues (Su et al., 2004). Gene co-expression was determined using the Pearson correlation coefficient between gene vectors for each protein in the interaction.
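For reference, the Pearson correlation between two expression vectors can be computed as in the minimal sketch below. The vectors are toy values with five tissues rather than 79, and returning 0.0 for a gene with constant expression is an assumption of this sketch, since the correlation is undefined in that case.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined for constant vectors; treated as no co-expression
    return sxy / sqrt(sxx * syy)


# Toy expression profiles across five tissues.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0]   # perfectly co-expressed with gene_a
gene_c = [5.0, 3.0, 4.0, 1.0, 2.0]    # anti-correlated with gene_a

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```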

GO term similarity measure

We used a modification of the semantic similarity measure (Lord et al., 2003) to determine the relatedness of each interacting protein pair. The semantic similarity method examines the frequency with which each GO term appears in Swiss-Prot for human proteins and assigns a higher score to terms that appear less frequently (i.e. have greater ‘information content’). For example, non-specific terms such as the top-level ‘molecular_function’ (GO:0003674) provide little information about the relatedness of two proteins, reflected in the P-value = 1.0. In contrast, more descriptive terms such as ‘translation regulator activity’ (P = 0.0048) or ‘chaperone activity’ (P = 0.0052) have greater information content, as they are used less frequently to describe human proteins and are potentially more specific for function. The GO similarity was determined by calculating the maximum semantic similarity from the set of all GO term pairs between interacting proteins. See Supplementary information for a complete example.
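The following sketch illustrates the general idea of information-content-based similarity in the spirit of Lord et al. (2003): the score of two terms is the information content, ln(1/p), of their most informative shared ancestor, and the score of a protein pair is the maximum over all term pairs. The toy GO graph, term names and probabilities are hypothetical, and the specific modification used for OPHID is described in the Supplementary information and is not reproduced here.

```python
from math import log

# Toy GO graph (child -> parents) and term usage probabilities.
parents = {"A1": ["A"], "A2": ["A"], "A": ["root"], "B": ["root"], "root": []}
prob = {"root": 1.0, "A": 0.05, "B": 0.10, "A1": 0.0048, "A2": 0.01}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_similarity(t1, t2):
    """Information content of the most informative shared ancestor."""
    shared = ancestors(t1) & ancestors(t2)
    return max((log(1.0 / prob[t]) for t in shared), default=0.0)


def go_similarity(terms_a, terms_b):
    """Maximum term similarity over all GO term pairs of two proteins."""
    return max((term_similarity(a, b) for a in terms_a for b in terms_b),
               default=0.0)


sim_siblings = go_similarity({"A1"}, {"A2"})    # share the ancestor A
sim_unrelated = go_similarity({"A1"}, {"B"})    # share only the root
sim_identical = go_similarity({"A1", "B"}, {"A1"})
```

Terms that share only the root score zero, which mirrors the point that top-level terms with P = 1.0 carry no information about relatedness.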

Background distributions

Statistically significant cutoffs for domain co-occurrence, gene co-expression and GO term similarity were determined by estimating the background distributions using a bootstrap approach. Briefly, all OPHID PPIs (known and predicted) were randomized 1000 times to produce equivalent-sized random networks. The mean of the 95th percentiles was chosen as a cutoff. The thresholds for each metric are: domain co-occurrence (one significant domain pair); gene co-expression (Pearson = 0.607; see Supplementary information); GO similarity (GOSim = 3.14).
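A minimal sketch of this kind of background estimate is given below. The randomization scheme (shuffling one partner column so that the random network keeps the same size and the same proteins), the nearest-rank percentile and all names are assumptions of this example; the text does not specify these details.

```python
import math
import random
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile for 0 < q <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def background_cutoff(ppis, score, rounds=1000, q=95, seed=0):
    """Mean of the q-th percentiles of `score` over randomized networks."""
    rng = random.Random(seed)
    left = [a for a, _ in ppis]
    right = [b for _, b in ppis]
    cutoffs = []
    for _ in range(rounds):
        rng.shuffle(right)  # re-pair partners; network size is unchanged
        scores = [score(a, b) for a, b in zip(left, right)]
        cutoffs.append(percentile(scores, q))
    return mean(cutoffs)


# Toy network of 20 proteins with a hypothetical pairwise score.
values = {f"P{i}": i for i in range(20)}
toy_ppis = [(f"P{i}", f"P{(i + 1) % 20}") for i in range(20)]


def toy_score(a, b):
    return abs(values[a] - values[b])


cutoff = background_cutoff(toy_ppis, toy_score, rounds=200)
```

Fixing the random seed makes the estimate reproducible, which is convenient when the cutoff is later used to label interactions as supported or unsupported.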

IMPLEMENTATION

Databases and software

Known (literature-derived, LIT) human PPIs were acquired from BIND, DIP, HPRD and MINT (see Supplementary information). The data and sequences from Swiss-Prot (v. 45.0) were loaded into our IBM DB2 database (v. 8.1.1.16). Protein sequences for each organism were obtained from the following sources: S.cerevisiae, Yeast Protein Databank (YPD); C.elegans, WormPep; D.melanogaster, FlyBase; M.musculus, Swiss-Prot (see Supplementary information for full versions). A local NCBI BLAST server (v. 2.2.4) was run through IBM's Information Integrator (v. 8.1.1) using the default BLAST settings. GO terms and InterPro domains were gathered from Swiss-Prot. The OPHID web interface and query engine was implemented on an IBM WebSphere web server (v. 5.0.0). All additional processing software was written in Java.

RESULTS

Protein interaction network

OPHID was generated from a total of 108 867 model organism PPIs mapped to human proteins through orthology. Orthologs were identified using the reciprocal best-hit approach (see System and Methods section). In total, 31.9% of the S.cerevisiae proteins had orthologs in humans, while 39.7 and 21.2% of the D.melanogaster and C.elegans proteins had orthologs, respectively. Through this orthology database, 23 889 model organism PPIs were mapped to human proteins, providing predictions for interactions that may occur in the human interactome, including 929 that are confirmed human interactions. Seventy-two of the predicted interactions were from more than one model organism.

The predicted PPI dataset from OPHID (referred to as the OPHID set hereafter) contains 4552 proteins, 1872 of which are not in the LIT set (6144 proteins). Thus, OPHID extends the human interactome by hundreds of proteins that have not yet been included in the literature-derived databases.

Importantly, the two datasets cover markedly different types of proteins. Figure 1 shows the distributions of the functional categories represented in the LIT dataset, compared to the interactions in OPHID. The proteins in the LIT dataset are primarily involved in ‘cellular fate and organization’ pathways (29.3%), such as apoptosis, cell cycle regulation and cytoskeletal remodeling, followed by ‘transcription’ (9.8%) and ‘transport and sensing’ (9.0%). Only 19.9% of the proteins in this set are ‘uncharacterized’, meaning that they lack GO terms in the Swiss-Prot database. In contrast, 29.1% of the proteins involved in OPHID are ‘uncharacterized’. OPHID is enriched for proteins involved in ‘energy production’ (2.3% versus 0.9%) and ‘other metabolism’ (6.0% versus 2.8%) compared to the LIT set, while the LIT set is enriched for proteins involved in ‘stress and defense’. These data suggest that the known and predicted interactions complement each other in many GO categories. In addition, the linking of the uncharacterized proteins, which make up ∼30% of OPHID, to known interactions will help provide functional information for these unannotated proteins.

The use of HTP experiments from model organisms has the potential to include false positive interactions. For example, Sprinzak et al. (2003) suggested that only 50% of yeast Y2H interactions are reliable. Producing a predicted PPI network may compound this problem by including those false positives, as well as potentially creating new ones through the ortholog mapping. In order to help filter out noisy interactions, we chose to look for additional supporting evidence in the form of protein domains, gene co-expression and GO terms (see Systems and Methods section). In essence, this additional evidence provides in silico validation of the OPHID interactions and will help rank the predicted interactions for future experimental confirmation.

Support through domain co-occurrence

The presence of domain pairs has been used extensively to predict de novo protein interactions (Deng et al., 2002; Wojcik and Schachter, 2001), as well as for the validation of reported interactions (Sprinzak and Margalit, 2001). Here, we have used more than 16 000 human PPIs from the LIT dataset to produce a domain co-occurrence network and selected those domain pairs that are significantly enriched in the interacting proteins compared to the non-interacting pairs (Systems and Methods section). While 93.0% of the LIT PPIs have at least one domain for each of the proteins in the pair, 44.3% of those have ≥ 2 statistically significant domain pairs (Fig. 2). This is in contrast to the OPHID dataset, where 92.1% of the PPIs have domain information, with 5.6% of these containing significant domains.

This difference in domain support is likely due to two factors: (1) the domain network was derived from the LIT dataset, which should lead to higher support for this dataset; and (2) differences in the functions of the proteins in the LIT dataset will also be reflected in the types of domains that are present in this network. The predicted network likely utilizes somewhat different domains than the LIT set. This is in line with the findings of Betel et al. (2004), who recently assessed domain–domain networks in S.cerevisiae and found that there are fundamental differences in the topology of these networks arising from the various yeast HTP datasets. These findings, combined with the data from Figure 1, suggest that at least some of the reduced support for the predicted interactions may be due to the differences in functional categories of the respective interaction networks, as well as the purification techniques that may bias towards transient or stable complexes. In addition, greater annotation of the human proteins will lead to increased support for the predicted interactions. For instance, between Build 44.0 and 45.0 of Swiss-Prot, support for the predicted interactions through domains increased from 3.1 to 5.6%.

Gene co-expression

Several studies have suggested that gene co-expression provides evidence for protein interactions (Deane et al., 2002; Ge et al., 2001; Kemmeren et al., 2002). We used the human GeneAtlas data (Su et al., 2004), derived from 79 normal human tissues, to provide evidence of PPIs through gene co-expression. The cutoff for significance of co-expression was found to correspond to a Pearson correlation = 0.607. GeneAtlas contains gene-expression data for both proteins in the interaction for 85.0% of LIT PPIs, with 9.0% significantly coexpressed. This compares with 86.2% of the OPHID interactions that have expression data, of which 17.3% are statistically significant. The most highly coexpressed protein pairs in the OPHID set involve ribosomal and proteasomal subunits, which show Pearson correlations >0.90. This finding indicates not only the presence of known stable complexes, but also that the gene co-expression of these complexes is conserved from yeast to humans (Jansen et al., 2002).

GO terms

Traditional approaches using GO terms to validate PPIs have employed the Jaccard similarity metric, which looks for cooccurring terms (Bader et al., 2004). This approach works well for highly annotated proteins, such as those found in yeast; however, human proteins do not share this level of annotation. Further, this method fails to take into account the depth within the GO tree of the overlapping terms, where deeper terms infer greater specificity (weight). We therefore used a modified semantic similarity measure described in Lord et al. (2003) (see Systems and Methods section).

The LIT set had a semantic similarity score for 76.9% of the PPIs, with 19.6% of these being significant (Fig. 2). The OPHID set, with a larger fraction of hypothetical and unannotated proteins, had a semantic similarity score for 58.2% of the PPIs, with 12.0% of these being significant. As the annotation of human proteins increases, we expect that support from GO similarity will increase, as was observed for domain support.

Measuring reliability by combined evidence

For the LIT interactions, 99.2% have at least one piece of evidence present (i.e. at least one of domains, expression data or GO terms for both proteins). Of these, 42.5% have evidence that is statistically significant. If the same number of interactions is chosen at random from the same set of proteins (to maintain similar levels of annotation), 10.1% of the randomized interactions are significant. For LIT interactions that have two or more pieces of evidence (92.9%), 15.9% are significant, indicating that ∼16% of the known human PPIs are supported by at least two of these evidence types. This compares favorably to the 0.7% that are significant in the randomized network. While it would not be expected that all interactions would be supported by all evidence types, 16% is likely a lower limit on the number that may be supported in future. More than 23% of the known interactions still lack related GO terms, and many others have few terms present.

In the OPHID dataset, 23.0% of the predicted interactions have at least one significant piece of supporting evidence and 5.7% have ≥ 2 statistically significant pieces of evidence. This compares with 9.3 and 0.6% for the matching randomized non-interacting set (P < 0.05). Since there are 23 889 predicted PPIs, 5483 PPIs have some evidence (one type) and 1364 have ≥2 pieces of supporting evidence.
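Counting supporting evidence per interaction can be sketched as follows. The Pearson and GOSim cutoffs are those reported in this paper, and one significant domain pair is taken as the domain criterion; the function name, the use of `None` for missing evidence and the toy rows are assumptions of this example. The last two lines simply check the reported counts against the reported percentages.

```python
PEARSON_CUTOFF = 0.607     # gene co-expression cutoff
GOSIM_CUTOFF = 3.14        # GO semantic similarity cutoff
MIN_DOMAIN_PAIRS = 1       # significant domain pairs required


def significant_types(n_domain_pairs=None, pearson_r=None, go_sim=None):
    """Return the evidence types that reach significance (None = no data)."""
    supported = []
    if n_domain_pairs is not None and n_domain_pairs >= MIN_DOMAIN_PAIRS:
        supported.append("domains")
    if pearson_r is not None and pearson_r >= PEARSON_CUTOFF:
        supported.append("co-expression")
    if go_sim is not None and go_sim >= GOSIM_CUTOFF:
        supported.append("GO")
    return supported


# Toy interactions with hypothetical evidence values.
toy_rows = [
    dict(n_domain_pairs=2, pearson_r=0.91, go_sim=4.2),
    dict(n_domain_pairs=0, pearson_r=0.65, go_sim=None),
    dict(n_domain_pairs=None, pearson_r=0.10, go_sim=1.0),
    dict(n_domain_pairs=0, pearson_r=0.30, go_sim=3.5),
]
counts = [len(significant_types(**row)) for row in toy_rows]
at_least_one = sum(c >= 1 for c in counts)
at_least_two = sum(c >= 2 for c in counts)

# Consistency check on the reported figures for 23 889 predicted PPIs.
pct_one = round(100 * 5483 / 23889, 1)   # share with >= 1 significant type
pct_two = round(100 * 1364 / 23889, 1)   # share with >= 2 significant types
```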

Evaluating the model organism source datasets

To examine the reliability of the model organism data, we have broken down the support for the interactions according to the source of the prediction. Figure 3A shows the breakdown of the percentage of original interactions that were supported by at least two types of evidence. Not surprisingly, two of the Riken (M.musculus) datasets (Suzuki et al., 2001; 2003) showed the highest support, since they are LIT interactions mapped from mouse to humans. This was also expected, as mice are closer evolutionarily to humans than S.cerevisiae, C.elegans or D.melanogaster, with 99% of the mouse genes having a human homolog and 80% having 1 : 1 human orthologs (Waterston et al., 2002). The next most reliable dataset is the INTEROLOG subset mapped from C.elegans. This subset includes interactions that were mapped from S.cerevisiae to C.elegans and then to humans, and thus likely represents a group of highly conserved protein interactions. The C.elegans LITERATURE set is similar to the Riken data, in that it was derived from small-scale published experiments and is therefore of higher quality. The MIPS, high and medium confidence datasets are derived from yeast, but represent the highest quality interactions in yeast, elucidated by multiple experiments. Finally, the remaining C.elegans (CORE_1, CORE_2, NON_CORE) and D.melanogaster (FlyHigh, FlyLow) Y2H experiments appear to be the least reliable source, which is not surprising given the inherent inaccuracy of Y2H (Sprinzak et al., 2003).

Figure 3B shows the number of interactions that have two or more types of supporting evidence, albeit not statistically significant. These graphs are not reciprocals, as interactions having only one supporting evidence type are not included. However, Figure 3B shows trends similar to those in Figure 3A; e.g. the C.elegans CORE and D.melanogaster datasets appear to be the least accurate.

OPHID web interface

OPHID has been designed not only to aid the prediction of novel PPIs, but also to provide a regularly updated and expanded dataset that is easily accessible and can support both small-scale experiments and large-scale bioinformatics efforts. Thus, OPHID has been made available as a web-accessible database, where queries can be entered using a single identifier or as large batch queries using a variety of ID types (GenBank, Swiss-Prot, UniGene, LocusLink, etc.). The entire dataset can be downloaded as a tab-delimited text file or in the PSI-compliant XML format (Hermjakob et al., 2004). The OPHID interface contains a Java-based viewer to display the resulting PPI networks, which allows for the expansion of the search based on selected nodes in the graph or saving the visualized networks as either JPEG or SVG files.

DISCUSSION

One goal of the many proteomics projects published to date has been to map the PPI networks that exist in the respective organisms and thus determine the interactions that govern normal cell function. OPHID was designed to utilize this model organism interaction data in order to rapidly extend our knowledge of the human interactome. Only recently have LIT databases of human interactions begun to catch up with those devoted to model organisms, but while these are highly useful resources that improve access to the human interactome, these databases only recapitulate the known interactions published in the literature. Although HTP experiments are being performed on increasingly complex organisms, to date, few have been performed on mouse or humans.

Given the combinatorial explosion in the mouse and human interactomes that will surely emanate from the 20 to 25 000 genes in the genomes (International Human Genome Sequencing Consortium, 2004) (compared to 6000 in S.cerevisiae, 22 000 in C.elegans and 13 500 in D.melanogaster), it is unlikely that the higher eukaryote interactomes will be fully covered by experimental means in the near future. Thus, model organism interactomes must be used to gain insight into the human interaction networks and to begin using the resulting network to explore normal and disease processes in the near term. Further, this provides an opportunity for functional annotation of human and mouse proteins (currently 27 939 human proteins lack GO terms in Swiss-Prot Build 45.0) and provides a means for studying evolutionary conservation of important subnetworks in PPI datasets.
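To give a sense of scale, the number of candidate protein pairs grows roughly with the square of the gene count. The short calculation below uses the gene numbers quoted above and assumes, purely for illustration, one protein per gene and no self-interactions; it gives an upper bound on the search space, not an estimate of the number of real interactions.

```python
from math import comb


def possible_pairs(n_genes):
    """Number of distinct unordered pairs, n(n - 1) / 2, excluding self-pairs."""
    return comb(n_genes, 2)


gene_counts = {
    "S.cerevisiae": 6000,
    "D.melanogaster": 13500,
    "C.elegans": 22000,
    "human (upper estimate)": 25000,
}
candidate_pairs = {name: possible_pairs(n) for name, n in gene_counts.items()}
```

Under these assumptions the human search space is about 3.1 × 10^8 pairs, compared with about 1.8 × 10^7 for yeast.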

OPHID provides predictions of ∼24 000 PPIs, many of which we have supported with additional evidence. The database can be used in several ways. First, as a model of the human interactome, it can be used to explore known pathways, add new proteins to existing pathways or develop novel pathways altogether. Second, OPHID may be used as an aid in designing new PPI experiments by indicating whether orthologous proteins have been reported to interact in other organisms. Third, the data within OPHID can be integrated with additional datasets (e.g. expression data from disease profiles, OMIM data on disease-related proteins) to reveal new protein interactions and pathways that may be involved in human disease (Barrios-Rodiles et al., 2005). As new PPI datasets become available, they are being incorporated into OPHID; thus, OPHID will continue to represent an up-to-date, valuable resource for experiment planning.

Homology-based approaches to predicting PPIs may contain some inaccuracies (Deane et al., 2002; Matthews et al., 2001) depending on the filtering criteria used. For example, in mapping S.cerevisiae interactions to C.elegans, Matthews et al. (2001) were only able to reproduce 16–31% of the predicted interactions in a Y2H system. In this experiment, the method of mapping interactions was to consider only the best matching C.elegans homolog for each S.cerevisiae protein. The reciprocal best match approach that we have used (System and Methods section) provides a more stringent mapping between orthologous proteins. While providing a lower coverage of the potential interactome, this method provides better accuracy in the predicted interactions (Yu et al., 2004).

Other groups have used InParanoid to predict human PPIs (Lehner and Fraser, 2004) rather than the reciprocal best-hit approach. Using our semantic similarity measure, only 13.7% of interactions in the Lehner dataset are supported, while OPHID has 20.6% supported interactions (considering only those PPIs with GO terms). The reciprocal best-hit approach thus has more in silico support, which suggests greater accuracy than the InParanoid-based predictions.

Our additional evidence currently supports 23% of the predicted PPIs. This is influenced by limitations in the domain network and sparse GO annotation of the human proteins, and therefore likely represents a lower limit to the interaction support. Further, it has been suggested that only 66% of previously known PPIs may show co-expression at the mRNA level (Kemmeren et al., 2002). Therefore, a lack of in silico validation does not necessarily indicate that the interaction is less reliable, but may simply be due to the lower level of annotation of human proteins to date. Despite these challenges, OPHID provides a sizable number of novel PPIs supported by in silico evidence.

In building OPHID, we chose to include the entire von Mering dataset (von Mering et al., 2002), which consists of high, medium and low confidence subsets. The protein complexes in this dataset were connected in an all-to-all (matrix) fashion. While the matrix model has been shown to be less accurate than the spoke model (Bader and Hogue, 2002), the decision to include this data in its entirety was based on providing the largest possible coverage of the human interactome and then filtering at a later time by using supporting evidence. Although the low confidence subset contains fewer supportable interactions relative to the high and medium subsets (Fig. 3B), it is important to note that the results are comparable to the most reliable experimental C.elegans interactions (CORE_1, CORE_2) or the D.melanogaster Y2H interactions.
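The difference between the two complex models can be made concrete with a short sketch. The function names and the toy complex are hypothetical: the spoke model links only the bait to each co-purified protein, whereas the matrix model links every member of the complex to every other member.

```python
from itertools import combinations


def spoke(bait, preys):
    """Spoke model: the bait is paired with each prey."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix(members):
    """Matrix model: all members of the complex are paired all-to-all."""
    return {frozenset(pair) for pair in combinations(sorted(set(members)), 2)}


# Toy complex: bait A purified together with B, C and D.
spoke_pairs = spoke("A", ["B", "C", "D"])
matrix_pairs = matrix(["A", "B", "C", "D"])
```

For a complex of n proteins the spoke model yields n − 1 pairs and the matrix model n(n − 1)/2, so the matrix model gives broader coverage at the cost of more potential false positives, which is the trade-off discussed above.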

OPHID users can easily filter out less reliable interactions and include only the highest quality interaction data in their subsequent analysis, bearing in mind that reducing the false-positive rate increases the false-negative rate. We believe that there are numerous reliable (supportable) interactions to be gained by including the low quality data from each of these subsets (yeast low, NON_CORE and FlyLow) and we have indeed found many mapped interactions from these subsets that appear to be reliable human interactions.

FUTURE DIRECTIONS

OPHID will continue to grow as new interaction datasets become available and additional evidence will continue to be sought. We expect the in silico evidence for the OPHID interactions to improve in parallel with the annotation of human proteins. Additionally, including metrics such as coevolution can help reinforce the relatedness of the individual predicted interactions (Tan et al., 2004). Ultimately, a machine classifier will be developed to provide a unified confidence score for the OPHID interactions that will allow users an additional means of filtering the predicted protein interactions.

Note: DIP is only used internally for analysis. It is not reproduced on the OPHID website due to copyright restrictions.

Table 1

Model organism protein–protein interactions in OPHID

Model organism   PPI      Source                                                       Mapped PPI
S.cerevisiae     78 390   von Mering et al. (2002)                                     17 757
S.cerevisiae     7 554    MIPS (http://mips.gsf.de/)                                   914
C.elegans        5 444    Li et al. (2004)                                             1 238
D.melanogaster   20 405   Giot et al. (2003)                                           3 394
M.musculus       145      Suzuki et al. (2001)                                         66
M.musculus       1 570    Suzuki et al. (2003)                                         1 544
M.musculus       442      http://www.signaling-gateway.org/data/Y2H/cgi-bin/y2h.cgi    442
Total                                                                                  25 173
                                                                                       (23 889 unique)

Fig. 1

Distribution of functional categories of proteins within OPHID. The GO terms obtained from Swiss-Prot for each protein within the interaction network were mapped to one of 11 broad categories on a first-matched basis using a custom keyword dictionary. The distributions of protein function are shown for the ‘Known’ PPIs (LIT/BIND, DIP, HPRD and MINT data; 6144 proteins), for the ‘Predicted’ interactions that were mapped from model organisms (4552 proteins) and for all known human proteins within Swiss-Prot (‘HUMAN’; 57 400 proteins). Proteins in the ‘Not matched’ category did not match against the keyword dictionary, while the ‘Uncharacterized’ category represents proteins that lacked any descriptive GO terms.

Fig. 2

Supporting evidence for known and predicted PPIs. Evidence was gathered to support each of the LIT interactions (BIND, DIP, HPRD and MINT) or OPHID predictions. The evidence was gathered in the form of domain–domain co-occurrence (‘Domains’), gene co-expression (‘Express’) and GO term semantic similarity (‘GO Terms’). For each dataset, the fraction of total PPIs with each evidence type is shown by the white bars. The fraction of total PPIs with significant evidence (≥2 domains, r ≥ 0.607 and GOsim ≥ 3.14) is indicated in black (n_known = 16 107 PPIs, n_predicted = 23 889 PPIs).

Fig. 3

Reliability of predicted interactions in OPHID. We have examined which source datasets (after mapping to human proteins) had the most supporting evidence. (A) The proportion of interactions from each dataset with ≥ 2 pieces of supporting evidence (domains, co-expression or GO similarity). (B) The proportion of interactions with ≥2 evidence types, but which are not supported by that evidence. For instance, of the 914 interactions predicted from MIPS, 626 have ≥ 2 evidence types. Of those, 103 (11.3%) are supported, while 523 (57.2%) are not. Another 205 (22.4%) are supported by only one piece of evidence.

The authors thank R. Lu and D. Otasek for software development. We acknowledge the hardware and software support from IBM Life Sciences through a Shared University Research Grant and support from the National Science and Engineering Research Council (RGPIN 203833-02), the Institute for Robotics and Intelligent Systems, Precarn Inc, National Institutes of Health (#P50-GM62413), Fashion Show and Younger Foundations.

REFERENCES

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature, 431, 931–945.
Bader, G.D. and Hogue, C.W. (2002) Analyzing yeast protein–protein interaction data obtained from different sources. Nat. Biotechnol., 20, 991–997.
Bader, J.S., Chaudhuri, A., Rothberg, J.M., Chant, J. (2004) Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 22, 78–85.
Barrios-Rodiles, M., Brown, K.R., Ozdamar, B., Liu, Z., Donovan, R.S., Shinjo, F., Liu, Y., Bose, R., Dembowy, J.R. (2005) High-throughput mapping of a dynamic signalling network in mammalian cells. Science, in press.
Betel, D., Isserlin, R., Hogue, C.W. (2004) Analysis of domain correlations in yeast protein complexes. Bioinformatics, 20 (Suppl. 1), SI55–SI62.
Bowers, P.M., Pellegrini, M., Thompson, M.J., Fierro, J., Yeates, T.O., Eisenberg, D. (2004) Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol., 5, R35.
Colland, F., Jacq, X., Trouplin, V., Mougin, C., Groizeleau, C., Hamburger, A., Meil, A., Wojcik, J., Legrain, P., Gauthier, J.M. (2004) Functional proteomics mapping of a human signaling pathway. Genome Res., 14, 1324–1332.
Deane, C.M., Salwinski, L., Xenarios, I., Eisenberg, D. (2002) Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol. Cell. Proteomics, 1, 349–356.
Deng, M., Mehta, S., Sun, F., Chen, T. (2002) Inferring domain–domain interactions from protein–protein interactions. Genome Res., 12, 1540–1548.
Deng, M., Sun, F., Chen, T. (2003) Assessment of the reliability of protein–protein interactions and protein function prediction. Pac. Symp. Biocomput., 140–151.
Gavin, A.-C., Bösche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J., Michon, A.-M., Cruciat, C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.
Ge, H., Liu, Z., Church, G.M., Vidal, M. (2001) Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet., 29, 482–486.
Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y.L., Ooi, C.E., Godwin, B., Vitols, E., et al. (2003) A protein interaction map of Drosophila melanogaster. Science, 302, 1727–1736.
Han, K., Park, B., Kim, H., Hong, J., Park, J. (2004) HPID: the human protein interaction database. Bioinformatics, 20, 2466–2470.
Hermjakob, H., Montecchi-Palazzi, L., Bader, G., Wojcik, J., Salwinski, L., Ceol, A., Moore, S., Orchard, S., Sarkans, U., von Mering, C., et al. (2004) The HUPO PSI's molecular interaction format—a community standard for the representation of protein interaction data. Nat. Biotechnol., 22, 177–183.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183.
Huang, T.-W., Tien, A.-C., Huang, W.-S., Lee, Y.C.G., Peng, C.-L., Tseng, H.-H., Kao, C.-Y., Huang, C.-Y.F. (2004) POINT: a database for the prediction of protein–protein interactions based on the orthologous interactome. Bioinformatics, 20, 3273–3276.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574.
Jansen, R., Greenbaum, D., Gerstein, M. (2002) Relating whole-genome expression data with protein–protein interactions. Genome Res., 12, 37–46.
Kemmeren, P., van Berkum, N.L., Vilo, J., Bijma, T., Donders, R., Brazma, A., Holstege, F.C.P. (2002) Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell, 9, 1133–1143.
Lehner, B. and Fraser, A.G. (2004) A first-draft human protein-interaction map. Genome Biol., 5, R63.61–R63.69.
Lehner, B., Semple, J.I., Brown, S.E., Counsell, D., Campbell, R.D., Sanderson, C.M. (2004) Analysis of a high-throughput yeast two-hybrid system and its use to predict the function of intracellular proteins encoded within the human MHC class III region. Genomics, 83, 153–167.
Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P.O., Han, J.D., Chesneau, A., Hao, T., et al. (2004) A map of the interactome network of the metazoan C. elegans. Science, 303, 540–543.
Lord, P.W., Stevens, R.D., Brass, A., Goble, C.A. (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 19, 1275–1283.
Luc, P.V. and Tempst, P. (2004) PINdb: a database of nuclear protein complexes from human and yeast. Bioinformatics, 20, 1413–1415.
Matthews, L.R., Vaglio, P., Reboul, J., Ge, H., Davis, B.P., Garrels, J., Vincent, S., Vidal, M. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein–protein interactions or ‘interologs’. Genome Res., 11, 2120–2126.
Mellor, J.C., Yanai, I., Clodfelter, K.H., Mintseris, J., DeLisi, C. (2002) Predictome: a database of putative functional links between proteins. Nucleic Acids Res., 30, 306–309.
Pagel, P., Mewes, H.W., Frishman, D. (2004) Conservation of protein–protein interactions—lessons from ascomycota. Trends Genet., 20, 72–76.
Peri, S., Navarro, J.D., Amanchy, R., Kristiansen, T.Z., Jonnalagadda, C.K., Surendranath, V., Niranjan, V., Muthusamy, B., Gandhi, T.K.B., Gronborg, M., et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res., 13, 2363–2371.
Sprinzak, E. and Margalit, H. (2001) Correlated sequence-signatures as markers of protein–protein interaction. J. Mol. Biol., 311, 681–692.
Sprinzak, E., Sattath, S., Margalit, H. (2003) How reliable are experimental protein–protein interaction data? J. Mol. Biol., 327, 919–923.
Su, A.I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K.A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., et al. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl Acad. Sci. USA, 101, 6062–6067.
Suzuki, H., Fukunishi, Y., Kagawa, I., Saito, R., Oda, H., Endo, T., Kondo, S., Bono, H., Okazaki, Y., Hayashizaki, Y. (2001) Protein–protein interaction panel using mouse full-length cDNAs. Genome Res., 11, 1758–1765.
Suzuki, H., Saito, R., Kanamori, M., Kai, C., Schonbach, C., Nagashima, T., Hosaka, J., Hayashizaki, Y. (2003) The mammalian protein–protein interaction database and its viewing system that is linked to the main FANTOM2 viewer. Genome Res., 13, 1534–1541.
Tan, S.H., Zhang, Z., Ng, S.K. (2004) ADVICE: automated detection and validation of interaction by co-evolution. Nucleic Acids Res., 32, W69–W72.
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627.
von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., Snel, B. (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res., 31, 258–261.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., Bork, P. (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417, 399–403.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexanderson, M., An, P., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.
Wojcik, J. and Schachter, V. (2001) Protein–protein interaction map inference using interacting domain profile pairs. Bioinformatics, 17, S296–S305.
Wuchty, S., Oltvai, Z.N., Barabasi, A.L. (2003) Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat. Genet., 35, 176–179.
Xenarios, I., Rice, D.W., Salwinski, L., Baron, M.K., Marcotte, E.M., Eisenberg, D. (2000) DIP: the database of interacting proteins. Nucleic Acids Res., 28, 289–291.
Yu, H., Luscombe, N.M., Lu, H.X., Zhu, X., Xia, Y., Han, J.D., Bertin, N., Chung, S., Vidal, M. (2004) Annotation transfer between genomes: protein–protein interologs and protein–DNA regulogs. Genome Res., 14, 1107–1118.
Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M., Cesareni, G. (2002) MINT: a Molecular INTeraction database. FEBS Lett., 513, 135–140.
