Abstract

Motivation: Protein–protein interactions have proved to be a valuable starting point for understanding the inner workings of the cell. Computational methodologies have been built which both predict interactions and use interaction datasets in order to predict other protein features. Such methods require gold standard positive (GSP) and negative (GSN) interaction sets. Here we examine and demonstrate the usefulness of homologous interactions in predicting good quality positive and negative interaction datasets.

Results: We generate GSP interaction sets as subsets from experimental data using only interaction and sequence information. We can therefore produce sets for several species (many of which at present have no identified GSPs). Comprehensive error rate testing demonstrates the power of the method. We also show how the use of our datasets significantly improves the predictive power of algorithms for interaction prediction and function prediction.

Furthermore, we generate GSN interaction sets for yeast and examine the use of homology along with other protein properties such as localization, expression and function. Using a novel method to assess the accuracy of a negative interaction set, we find that the best single selector for negative interactions is a lack of co-function. However, an integrated method using all the characteristics shows significant improvement over any current method for identifying GSN interactions. The nature of homologous interactions is also examined and we demonstrate that interologs are found more commonly within species than across species.

Conclusion: GSP sets built using our homologous verification method are demonstrably better than standard sets in terms of predictive ability. We can build such GSP sets for several species. When generating GSNs we show a combination of protein features and lack of homologous interactions gives the highest quality interaction sets.

Availability: GSP and GSN datasets for all the studied species can be downloaded from http://www.stats.ox.ac.uk/~deane/HPIV

Contact: saeed@stats.ox.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Protein–protein interaction (PPI) networks for several model organisms have been generated with varying degrees of coverage and integrity (Piehler, 2005). Interaction information can be obtained from experimental methods (Pellegrini et al., 2004) and computational approaches (Yu and Fotouhi, 2006). High-throughput methods such as yeast two hybrid and tandem affinity purification have currently elucidated interactions for several species including Saccharomyces cerevisiae (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Uetz et al., 2000), Drosophila melanogaster (Formstecher et al., 2005; Giot et al., 2003) and Caenorhabditis elegans (Li et al., 2004; Walhout et al., 2000) and more are forthcoming. Saccharomyces cerevisiae is the most widely studied organism yet despite this its interactome is far from complete (Bork et al., 2004). As well as this lack of coverage the accuracy of interaction data has also been challenged, with many studies indicating a large fraction of false positives in current interaction networks (Deane et al., 2002; Hart et al., 2006; von Mering et al., 2003).

To address the issue of reliability in experimentally acquired interactions several studies have proposed computational methods for calculating the accuracy of interaction datasets (Suthram et al., 2006). Such confidence measures indicate the accuracy of an interaction set by providing us with a true positive rate. To make such a calculation the methods require a set of interactions that are assumed to be reliable, so-called gold standard positive (GSP) interaction sets. Alongside experimentally obtained interactions, computational methods have also provided the community with large datasets of information (von Mering et al., 2005). A common feature of all computational methods is that they rely on a GSP set of interactions as a training set. A more reliable set of interactions should improve results obtained from such methods.

Currently GSP sets are produced using a number of methodologies each with varying degrees of accuracy and bias. The simplest method is to gather interactions that have been observed in more than one experimental study; this method only generates a useful set for yeast. Other methods include selecting only those interactions elucidated solely from small-scale experiments and interactions gathered from curated databases or interactions witnessed in protein complexes (Mewes et al., 2004; Saeed and Deane, 2006).

Reliable interaction sets can be generated computationally by using information such as a protein's function and cellular localization (Wu et al., 2006). As such methods require annotation knowledge, their applicability is limited to those species for which this information exists. It is also possible to create GSP sets using high-scoring interactions returned from interaction confidence assignment schemes; however, there is a degree of circularity, as these schemes depend upon other GSPs to make confidence assignments. Alternatively, a range of methods exist that require only topological information to ascertain a GSP (Chen et al., 2006; Saito et al., 2002, 2003). The most successful of these is IRAP*, or Interaction Reliability by Alternative Path (Chen et al., 2006). IRAP* measures the reliability of an interaction based on the observation that a similar biological function is performed by proteins in a highly interconnected network of interactions.

Protein interactions can also be verified and indeed predicted using homology information. Walhout et al. (2000) introduced the notion of an ‘interolog’, orthologous pairs of interacting proteins in different organisms. The idea can be extended to include paralogous interactions, an idea that was exploited by Deane et al. (2002) in order to assess the reliability of a protein interaction. Unlike topological methods which owe their shortcomings to the incomplete and error-prone nature of the initial network or function-based methods, which have an inherent bias towards well-investigated proteins, homology-based methods do not have such biases and can also use all available interaction information.

Orthologous interactions, protein pairs with interacting orthologues, have since been used to construct entire interactomes for organisms where experimental interactions are sparse (Huang et al., 2007; Jonsson and Bates, 2006). Both studies proposed that the orthologous approaches adopted returned reliable interactions by analysing their confidence scores at varying thresholds. Patil and Nakamura (2005b), on the other hand, created sets of homologous interactions for a number of species. Their work was followed by the creation of a GSP set, which was achieved by using a combination of structural, Gene Ontology and interaction homology information (Patil and Nakamura, 2005a). Verification of this set was limited to checking for the degree of overlap with another GSP set.

Gold standard negative interaction sets are also required for the prediction and confidence assessment of interaction sets. Unlike positive interactions, negative interactions are unfortunately not reported by experimentalists. They are instead inferred computationally by uniformly selecting two proteins and pairing them off as non-interactors (Qi et al., 2005; Zhang et al., 2004) or by selecting protein pairs that do not share similar cellular compartments (Jansen et al., 2003). Recent work suggests that using only cellular compartment information to create negative interactions can lead to biased estimates of prediction accuracy and that this bias may affect a number of predictive methods (Ben-Hur and Noble, 2006). Wu et al. (2006) went on to predict negative interactions using a semantic similarity measure based upon functional, compartmental and process Gene Ontology annotations. In their analysis they did not identify which of the three ontologies contributed most to the identification of negative interactions.

In this article, we investigate the usefulness of homologous interactions, both paralogous and orthologous and examine how well they identify true positive interactions in seven species. Unlike other methods used for generating reliable sets this homologous protein interaction verification (HPIV) method does not require a previously established GSP set nor does it require knowledge of protein characteristics such as expression, function or localization, allowing as large and as varied a set as possible to be analysed.

We demonstrate the improved quality of the verified interaction set by conducting extensive error rate analysis on the yeast subset. We illustrate how the increased quality observed in the HPIV yeast set is also reflected in other species by conducting a similar error rate analysis on HPIV D.melanogaster.

To demonstrate how this increase in quality can be useful, we conducted functional prediction on the set using the majority vote system and compared that to a standard dataset's performance (Schwikowski et al., 2000). Similarly, we use the yeast HPIV set as a training set to predict interactions using a domain–domain interaction prediction method (Deng et al., 2003). We find that not only does HPIV have improved accuracy over standard interaction sets but it also returns better results for functional prediction and improves the quality of predicted interactions when used as a training set.

We present a novel integrated method for generating GSN sets that utilizes functional, localization, expression and homology-based data. An interaction between two proteins is declared a negative interaction only if the two interacting proteins have no overlap in any of these areas. Currently no methods that measure the accuracy of a negative interaction set exist in the literature. To estimate the accuracy of our sets we calculate the overlap between multiple subsets and a true interaction set and find that function is by far the best individual feature that selects for negative interaction. Whilst function is the strongest individual factor we demonstrate that the integrated method returns the best GSN interaction set.

Finally, we analysed the effect of paralogous and orthologous interactions on the interaction verification process. Our results confirm that interactions are more conserved within species than across species (Mika and Rost, 2006). Interaction sets in different species come in varying sizes and in our test we ensure that this size effect is not responsible for the difference we see in paralogous and orthologous interactions. Our results suggest that homology-based interaction prediction methods may yield better results if paralogous interactions were considered as well as orthologous ones.

2 METHODS

2.1 Interaction verification

In order to verify protein interactions, our method has two simple prerequisites: (i) protein interaction information and (ii) protein sequence information.

Protein interactions were downloaded from the Database of Interacting Proteins (DIP release date: 7 January 2007) for the following species: D.melanogaster, S.cerevisiae, Escherichia coli, C.elegans, Helicobacter pylori, Homo sapiens, Mus musculus (Salwinski et al., 2004). Sequences for interacting proteins from all species were also obtained from DIP. Proteins were then grouped into families by running a single iteration of BLAST (Altschul et al., 1997) with an e-value threshold of 10^−4.

Every protein interaction was assessed. An interaction is verified if another interaction exists between any member of the homologous family of one partner in the interaction and a member of the homologous family of the other partner. For example, an interaction between protein p1, belonging to a family of homologues f1, and protein p2, belonging to a family of homologues f2, will be verified if we observe at least one other interaction between a protein in f1 and a protein in f2. This binary scheme was selected based on a ROC curve analysis (see Supplementary Material).
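The verification rule can be illustrated with a minimal Python sketch. It assumes that every protein has already been assigned to exactly one family; the mapping `family_of` is a hypothetical stand-in for the BLAST-derived families, and the function names are illustrative rather than taken from the published implementation.

```python
from collections import Counter


def family_pair(a, b, family_of):
    """Unordered pair of family labels for an interaction between a and b."""
    return tuple(sorted((family_of[a], family_of[b])))


def verify_interactions(interactions, family_of):
    """Return the interactions supported by at least one other homologous interaction.

    interactions: iterable of (protein_a, protein_b) pairs, treated as undirected.
    family_of: dict mapping protein -> family label (assumed to be a partition).
    """
    # Collapse (a, b) and (b, a) into a single undirected edge.
    edges = {tuple(sorted(pair)) for pair in interactions}
    # Count the distinct interactions observed between each pair of families.
    counts = Counter(family_pair(a, b, family_of) for a, b in edges)
    # An interaction is verified when its family pair holds a second interaction.
    return {(a, b) for a, b in edges if counts[family_pair(a, b, family_of)] >= 2}


family_of = {"p1": "f1", "q1": "f1", "p2": "f2", "q2": "f2", "r1": "f3", "s1": "f4"}
interactions = [("p1", "p2"), ("q1", "q2"), ("r1", "s1")]
verified = verify_interactions(interactions, family_of)
```

In this toy example the interactions p1–p2 and q1–q2 verify each other because both link families f1 and f2, whereas r1–s1 has no homologous counterpart and is discarded.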

2.2 Error rates

We used several measures of error in order to analyse the level of accuracy in a dataset. Error rate measures were calculated for the HPIV dataset as well as for other standard sets: the standard DIP dataset (DIP_ORIGINAL); physical interactions taken from the MIPS database [MIPS_PHYSICAL; Mewes et al. (2004), release date: 18 May 2006]; and IRAP, a GSP interaction set generated by the IRAP* method (Chen et al., 2006).

DIP_MULTI is our reference set made up of only those interactions that were observed in more than one experimental study.

Localization annotation was taken from MIPS (Mewes et al., 2004) and functional information was obtained from the Gene Ontology Database (Ashburner et al., 2000) (19 April 2007). The GO molecular function ontology is made up of 18 groups, each of which is broken down into many more categories. In order to calculate an overlap, each protein's GO molecular function assignment was traced back to a parent group, and the overlap between groups was measured.
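The overlap calculation can be sketched as follows. This is only an illustration under stated assumptions: the ontology is represented by a hypothetical `parents` mapping from each term to its direct parents, `top_level` holds the labels of the parent groups, the term names are toy values, and interactions involving unannotated proteins are counted as not co-functional.

```python
def top_level_groups(term, parents, top_level):
    """Collect every top-level group reachable from `term` by walking up the ontology."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in top_level:
            found.add(current)
        else:
            stack.extend(parents.get(current, ()))
    return found


def protein_groups(protein, annotations, parents, top_level):
    """Union of the top-level groups of all terms annotated to a protein."""
    groups = set()
    for term in annotations.get(protein, ()):
        groups |= top_level_groups(term, parents, top_level)
    return groups


def percent_co_functional(interactions, annotations, parents, top_level):
    """Percentage of interactions whose partners share at least one top-level group."""
    interactions = list(interactions)
    if not interactions:
        return 0.0
    shared = sum(
        1
        for a, b in interactions
        if protein_groups(a, annotations, parents, top_level)
        & protein_groups(b, annotations, parents, top_level)
    )
    return 100.0 * shared / len(interactions)


# Toy ontology and annotations (illustrative values only).
parents = {
    "kinase activity": ["catalytic activity"],
    "ATP binding": ["binding"],
    "DNA binding": ["binding"],
}
top_level = {"catalytic activity", "binding"}
annotations = {"p1": ["kinase activity"], "p2": ["ATP binding"], "p3": ["DNA binding"]}
score = percent_co_functional([("p2", "p3"), ("p1", "p2")], annotations, parents, top_level)
```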

In the case of yeast, two further measures based on expression profiles were calculated. The first was the expression profile reliability (EPR) index (Deane et al., 2002). An expression-based distance score is calculated for all interacting protein pairs in a set. The resulting distribution of distance scores is compared with the distance score distributions of standard interacting and non-interacting sets. The comparison yields the approximate percentage of true interactions in the set. The second method, DENG, is a maximum-likelihood estimation method that also utilizes the principle of co-expression (Deng et al., 2003). A recent survey of confidence assessment schemes labelled this method to be the most reliable measure of accuracy (Suthram et al., 2006).

2.3 Functional prediction

In order to exhibit the value and increased accuracy of our gold standard interaction datasets, we used the yeast HPIV set to make functional predictions. The accuracy rate of the predictions was compared to standard subsets. For each set, 18 MIPS functional categories (Mewes et al., 2004) were assigned to all proteins, and in a leave one out approach the functions of each protein were predicted in turn using the simple majority vote method (Schwikowski et al., 2000). Predictions were made for only those proteins with more than one interacting partner and a prediction was deemed accurate if the correct function was predicted within any of the top three annotations.
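A minimal sketch of the leave-one-out majority vote procedure is given below. It assumes the network is supplied as a hypothetical `neighbours` dictionary (protein to set of interaction partners) and the annotations as a `functions` dictionary (protein to set of categories); how ties between equally frequent categories are broken is not stated in the text and is left to `Counter.most_common` here.

```python
from collections import Counter


def majority_vote(protein, neighbours, functions, top_n=3):
    """Rank the categories of a protein's partners by frequency and keep the top_n."""
    votes = Counter()
    for partner in neighbours[protein]:
        if partner != protein:  # the protein's own annotation is withheld
            votes.update(functions.get(partner, ()))
    return [category for category, _ in votes.most_common(top_n)]


def leave_one_out_accuracy(neighbours, functions, top_n=3):
    """Fraction of eligible proteins with a correct category among the top_n predictions."""
    hits = total = 0
    for protein, partners in neighbours.items():
        # Only proteins with more than one partner and a known annotation are scored.
        if len(partners) <= 1 or protein not in functions:
            continue
        total += 1
        predicted = majority_vote(protein, neighbours, functions, top_n)
        if set(predicted) & set(functions[protein]):
            hits += 1
    return hits / total if total else 0.0


neighbours = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
functions = {"a": {"metabolism"}, "b": {"metabolism"}, "c": {"transport"}}
accuracy = leave_one_out_accuracy(neighbours, functions)
```

Only protein a has more than one partner in this toy network, so it is the only protein scored.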

2.4 Predicting interactions

To predict protein–protein interactions, we used a method proposed by Deng et al. (2002) where domain–domain interaction probabilities are estimated using a training set. The estimated domain interactions are then used to infer protein interactions.
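The second step, inferring a protein interaction from estimated domain interaction probabilities, can be sketched as below. The sketch assumes the commonly stated form of the model, in which domain pairs act independently and two proteins interact if at least one of their domain pairs interacts. The estimation of the probabilities from the training set is not shown, and the names `lam`, `D1`, `D2` and `D3` are hypothetical.

```python
from itertools import product


def interaction_probability(domains_a, domains_b, lam):
    """P(proteins interact) = 1 - product over domain pairs of (1 - lam[pair]).

    domains_a, domains_b: iterables of domain identifiers for the two proteins.
    lam: dict mapping a sorted (domain_m, domain_n) tuple to its estimated
         interaction probability; pairs absent from lam are treated as 0.
    """
    p_no_pair_interacts = 1.0
    for m, n in product(domains_a, domains_b):
        key = tuple(sorted((m, n)))
        p_no_pair_interacts *= 1.0 - lam.get(key, 0.0)
    return 1.0 - p_no_pair_interacts


# Toy probabilities (illustrative values, not estimates from any dataset).
lam = {("D1", "D2"): 0.5, ("D1", "D3"): 0.2}
p = interaction_probability(["D1"], ["D2", "D3"], lam)
```

Here the probability that neither domain pair interacts is 0.5 × 0.8 = 0.4, so the protein pair is assigned an interaction probability of 0.6.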

The script and accompanying data can be downloaded from the following URL:

. The 22 November 2006 release of Pfam was used (Finn et al., 2006). ROC analysis was conducted on the datasets generated as a result of using different training sets (HPIV, DIP and MIPS). HPIV_MULTI was used as the true positive interaction set and true negative interactions were sourced by randomly pairing proteins.

2.5 Negative interactions

Negative interactions were predicted for S.cerevisiae using protein features such as function, expression, cellular localization and homology data. Functional annotations for the yeast proteome were obtained from the Gene Ontology Database (Ashburner et al., 2000). Localization annotation was taken from MIPS (Mewes et al., 2004) and expression information was taken from the Young Lab (Holstege et al., 1998). In total there were 5058 yeast proteins [downloaded from Güldener et al. (2005)] for which we possess all such annotation.

Negative interactions were predicted by selecting a random pair of proteins that did not share any common protein characteristics. When using expression, we selected only those pairs that had an expression correlation coefficient between 0.3 and −0.3 (see Supplementary Material). In the case of homology, we selected only those protein pairs that had no homologous interactions.

Our integrated method incorporated all the individual factors as well as homologous interaction data. Here two proteins were randomly selected and then classified as negatively interacting if they fulfilled the following: (i) They have no common functional processes, (ii) they have no overlap in cellular localization, (iii) they have a co-expression correlation between 0.3 and −0.3 and (iv) they have no homologous interactions in an experimental dataset.

To analyse the influence of each protein feature on negative interactions we generated 100 subsets of interactions for every individual protein characteristic and our integrated method. The accuracy of each method was garnered by measuring the average overlap between DIP_ORIGINAL and 100 subsets of each GSN set. Each negative interaction subset contained 17 474 interactions, similar in size to our reference set (DIP_ORIGINAL).
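The overlap measure can be sketched as follows, assuming each negative subset and the reference set are given as collections of protein pairs. The use of the sample standard deviation is an assumption, as the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def normalise(pairs):
    """Treat interactions as undirected by sorting the two partners."""
    return {tuple(sorted(pair)) for pair in pairs}


def overlap_summary(negative_sets, reference):
    """Mean and standard deviation of the overlap between each subset and the reference."""
    ref = normalise(reference)
    overlaps = [len(normalise(subset) & ref) for subset in negative_sets]
    spread = stdev(overlaps) if len(overlaps) > 1 else 0.0
    return mean(overlaps), spread


# Toy data: three negative subsets compared against a two-interaction reference.
reference = [("a", "b"), ("c", "d")]
negative_sets = [
    [("b", "a"), ("x", "y")],
    [("x", "y")],
    [("a", "b"), ("d", "c")],
]
average, spread = overlap_summary(negative_sets, reference)
```

A lower average overlap indicates that fewer reported interactions were mistakenly labelled as negative.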

3 RESULTS

3.1 True positive interaction verification

Gold standard positive interaction sets are created by taking protein interactions from multiple species and returning a subset of interactions that have been verified based on the existence of a homologous interaction. Table 1 details the number of proteins and interactions for each species before and after verification. The H.sapiens and M.musculus datasets contain interactions primarily from small-scale experiments. These are widely considered to be more accurate than high-throughput methods and, perhaps as a result, we observe that fewer interactions are lost during the verification process. All the remaining datasets, with the exception of H.pylori, maintain around 20–30% of their original interactions.

Table 1.

Attrition rates

Dataset          Number of proteins             Number of interactions
                 Original   HPIV (fraction)     Original   HPIV (fraction)
D.melanogaster   7451       2437 (0.33)         22 819     4628 (0.20)
S.cerevisiae     4959       2092 (0.42)         17 511     4894 (0.28)
E.coli           1840       728 (0.40)          6966       1729 (0.25)
C.elegans        2638       745 (0.28)          4030       911 (0.23)
H.pylori         710        151 (0.21)          1420       169 (0.12)
H.sapiens        1085       761 (0.70)          1397       977 (0.70)
M.musculus       335        278 (0.83)          290        232 (0.80)

The number of proteins and interactions that were present in the original species-specific datasets and the final verified datasets. The values in parentheses are the fraction of original information retained.

As the HPIV method depends entirely on the interactions in an original set, we tested how well it can cope with a set of entirely spurious interactions. We generated several sets of 1000 entirely random interactions. Using our method to verify the interactions in these sets, we found that over 96% of the interactions were removed. Such a high level of attrition illustrates that the HPIV method is highly effective at removing false interactions, even from a set containing no true interactions at all. Thus, although the input data are experimental interaction sets that may contain a high level of error (100% in this test), this does not necessarily lead to a high level of error in the final verified datasets.

In order to assess whether the verified datasets do in fact contain a larger fraction of true interactions, we conducted five error rate checks on the HPIV dataset for yeast (S.cerevisiae) and the fruit-fly (D.melanogaster). Table 3 shows the results of this analysis on the HPIV yeast set compared with results for the entire DIP dataset; the MIPS dataset of physical interactions; DIP_MULTI, a dataset containing interactions from DIP that were observed in multiple experiments; and IRAP, a dataset obtained by using the topology of the interaction network. A consensus measure of the error rate was obtained by ranking the datasets on each error measure and calculating the mean rank of each dataset. It can be seen that the HPIV dataset outperforms all the interaction sets with the exception of DIP_MULTI. This is unsurprising, as DIP_MULTI contains interactions that have been observed in multiple independent experiments. However, DIP_MULTI is relatively small, containing 2209 interactions compared with the 17 474 interactions in the full DIP set and 4590 in the HPIV set.

Table 2.

Drosophila error rates

Drosophila   Number of interactions   Co-localization (%)   Co-functional (%)
DIP          22 822                   85.16                 66.27
HPIV         4551                     87.79                 78.20

Basic tests conducted on the standard DIP and HPIV sets to show that an increased accuracy is observed in verified datasets other than yeast. Using GO annotation, the co-localization and co-functional tests show the percentage of interactions in which both partners had a similar cellular localization and function, respectively.

In order to demonstrate the increased quality of the HPIV datasets, we used the yeast subset to conduct functional prediction. Using majority vote, the recall rate for the HPIV set was 83.44% compared with a recall rate of 63.68% for DIP. We also used the interaction sets as a learning set for a domain–domain interaction-based protein interaction prediction scheme (Deng et al., 2002). The performance of each learning set was tested using ROC analysis, where the area under the ROC curve signifies the accuracy of the test. We found that HPIV performs best with an area of 0.91, compared with areas of 0.89 for DIP and 0.77 for MIPS.

3.2 True negative interactions

GSN interaction sets are just as important as GSP sets when making predictions and using confidence assignment schemes. We investigated the different ways of generating false interaction sets in S.cerevisiae by using a number of protein characteristics. Furthermore, we put forward a novel method to generate gold standard negative interaction sets by combining all available protein information such as function, expression and localization. Here we select only those proteins that had no functional similarity, no localization similarity, low co-expression and no homologous interactions. When randomly generating interactions for 5058 yeast proteins, random interaction space is given by 12 789 153 interactions. Our method, using all four determinants reduces negative interaction space to 4 193 169 interactions.
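The size of the random interaction space quoted above corresponds to the number of unordered pairs of distinct proteins, which can be checked with a short calculation. The variable names are illustrative.

```python
from math import comb

n_proteins = 5058

# Unordered pairs of distinct proteins: n * (n - 1) / 2.
random_space = comb(n_proteins, 2)

# Size of the negative interaction space reported for the integrated method.
negative_space = 4_193_169

# Fraction of the random space that survives all four filters.
retained_fraction = negative_space / random_space

print(random_space)                 # 12789153
print(round(retained_fraction, 3))  # 0.328
```

The integrated method therefore retains roughly one-third of all possible protein pairs as candidate negative interactions.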

No one has previously investigated the quality of negative interaction sets and no standard methods exist to do this. We assessed the accuracy of the sets by measuring the level of overlap between a negative set and a true positive set, in this case the DIP yeast interaction set. The full DIP dataset is more useful for this check than a gold standard set such as DIP_MULTI, which is relatively small with only 2209 interactions. Because our negative yeast interaction space is so vast (4 193 169 interactions), no meaningful overlap would be detected with such a small set.

Figure 1 shows the average number of overlapping interactions between DIP and negative interaction sets generated using different methods. A total of 100 negative interaction sets of identical size to the DIP_FULL dataset (17 474 interactions) were generated using six different conditional methods as well as our combined method. These methods were: (1) selecting interactions randomly, (2) selecting random interactors that did not have any paralogous pairs, (3) random interactors that had no homologous (i.e. orthologous and paralogous) pairs, (4) random pairs with no compartmental overlap, (5) no functional overlap and (6) co-expression correlation between −0.3 and 0.3.

Fig. 1.

Average overlap of DIP_FULL against 100 negative interaction sets containing 17 474 interactions generated using different methods. Random: proteins are paired as interactors randomly. No paralogues: random pairs which do not have any interacting paralogous pairs. No homologues: random pairs which do not have any homologous interactions. No co-localization: random pairs that do not have common localization. No co-expression: random pairs that have a co-expression correlation between −0.3 and 0.3. No co-function: random pairs that do not have similar functions. Set 7 uses a combination of all six other factors and gives substantially lower overlaps. The error bars represent the standard deviation.

Our integrated method for generating negative interaction sets performs the best, with an average overlap of 7.6 interactions. Currently used methods for generating false interactions show higher levels of overlap. In fact, of the currently used methods, only non-co-function shows a significant improvement over random choice. False sets obtained by randomly selecting proteins have an average overlap of 20.8, and sets that select random pairs with no co-function have an overlap of 11.65 (Fig. 1). In terms of co-expression, the selection of different correlation thresholds does not have a significant impact on the overlap measure (see Supplementary Material).

4 DISCUSSION

4.1 Quality assessment

Using homologous interactions it is possible to generate verified interaction sets for any species. Our assessment is primarily limited to the yeast interaction set, due to its well-studied interactome. We conducted several tests on the HPIV interaction set and compared the results to tests on other commonly used interaction sets (Table 3). Our HPIV set outperforms all datasets with the exception of DIP_MULTI, the smaller experimentally confirmed set which contains interactions that have been observed in more than one experimental study. The DENG reliability measure and the Reference Index are calibrated using the DIP_MULTI set, so DIP_MULTI would be expected to score well on these measures. In the remaining tests, the overall performance of DIP_MULTI gives weight to the widely held view that interactions observed in more than one study are true. However, this is an expensive technique, as different experimental studies have very little overlap (von Mering et al., 2003) and the number of studies is very limited in species other than yeast.

As all the verified datasets were obtained using the same method, we believe that the accuracy observed within yeast will be reflected in the HPIV datasets of other species. We illustrate this by conducting basic accuracy checks on fruit-fly interactions (Table 2). Once again the HPIV dataset performs better than the standard set, confirming that HPIV sets can fill the gap in species where GSP sets do not exist. For species which have existing GSP sets, interactions obtained through this method could be added to them. In the case of yeast, the HPIV set has 4894 interactions, of which only 1087 exist in DIP_MULTI. Hence an extra 3807 interactions could be added to the DIP_MULTI set, more than doubling its size without compromising on quality. It is worth noting that a possible shortcoming of a homology-based interaction verification method is that some genuine interactions will be lost; however, the interactions that remain are likely to be correct. We also found that evolutionarily closer species verify greater numbers of interactions than more distant species; however, it was difficult to conduct any meaningful analysis as the number of interactions verified by orthologous interactions was very low (see Supplementary Material).

Table 3.

Yeast error rates

Dataset Number of interactions EPR index (%) Reference Index (%) Co-localization (%) Co-functional (%) DENG Mean rank 
DIP_MULTI 2209 92.2 100 58.13 58.87 1.00 
HPIV 4590 48.5 22.79 46 49.42 0.7123 2.2 
DIP_ORIGINAL 17474 41.8 13.42 38.67 45.47 0.4187 3.4 
MIPS_PHYSICAL 7458 55.3 15.59 31.86 25.03 0.4019 3.4 
IRAP 1383 18.5 1.73 30.18 00.78 0.0078 

The number of interactions and error rates found within five yeast datasets. The EPR index is a measure of true positives in the set. The reference index is the percentage of proteins from the set of interest, found to be in the reference set, in this case the DIP_MULTI set. The co-localization and co-functional columns show the percentage of interactions where both interacting partners share the same cellular localization or function, respectively. The DENG reliability measure is an expression-based reliability measure. The mean rank reflects the accuracy of the dataset by averaging the rank for each individual error rate.
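The mean rank calculation can be reproduced with the short sketch below. Two rows of the table (DIP_MULTI and IRAP) list one value fewer than there are columns, so the assignment of their values to columns is an assumption made for this illustration: the trailing 1.00 for DIP_MULTI is treated as its DENG score, and the IRAP values are read left to right as the five error measures. The IRAP mean rank computed here is not listed in the table.

```python
# Columns: EPR index, Reference Index, co-localization, co-functional, DENG.
# Higher values indicate a more reliable dataset for every measure.
scores = {
    "DIP_MULTI": [92.2, 100.0, 58.13, 58.87, 1.00],   # DENG value assumed
    "HPIV": [48.5, 22.79, 46.0, 49.42, 0.7123],
    "DIP_ORIGINAL": [41.8, 13.42, 38.67, 45.47, 0.4187],
    "MIPS_PHYSICAL": [55.3, 15.59, 31.86, 25.03, 0.4019],
    "IRAP": [18.5, 1.73, 30.18, 0.78, 0.0078],        # column assignment assumed
}


def mean_ranks(scores):
    """Rank the datasets on each measure (1 = best) and average the ranks."""
    names = list(scores)
    n_measures = len(next(iter(scores.values())))
    totals = dict.fromkeys(names, 0)
    for k in range(n_measures):
        ordered = sorted(names, key=lambda name: scores[name][k], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return {name: totals[name] / n_measures for name in names}


ranks = mean_ranks(scores)
print(ranks["HPIV"], ranks["DIP_ORIGINAL"], ranks["MIPS_PHYSICAL"])  # 2.2 3.4 3.4
```

Under these assumptions the sketch reproduces the mean ranks reported for HPIV, DIP_ORIGINAL and MIPS_PHYSICAL. No ties occur in these data, so no tie-breaking rule is needed.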

We also conducted majority vote functional prediction on the HPIV yeast dataset in order to demonstrate the uses of better GSP sets. We found that the verified dataset performed much better than its standard counterpart. This does not conclusively prove that the HPIV set is of greater quality as it is possible that the HPIV dataset positively selects for proteins that have predictable functions.

In order to investigate how well the HPIV set performed when using it to predict protein–protein interactions, we also used a domain-based interaction prediction approach. Results from the resulting ROC analysis once again highlighted the applicability of verified sets. The HPIV set performed better than the standard DIP set and the MIPS physical set in predicting interactions.

Collectively this analysis demonstrates that HPIV sets do contain better quality interactions and their use in existing methodologies will lead to some increase in value. As including homologous interactions within datasets leads to an increase in quality, we considered how the exclusion of homologous interactions would affect negative interaction sets. Considering only yeast, we generated negative interactions based upon certain factors and applied quality assessment to the sets. The assessment of the negative interaction sets was conducted by comparing the overlap with a reference true positive set (Fig. 1). None of the other tests used to measure the accuracy of positive interactions could be used. This was because the negative interactions were generated using protein characteristics, and judging the merit of such a set by relying on similar information would lead to a large degree of circularity.

We found that the poorest negative set was created by pairing random proteins. When excluding only homologous interactions from such a random process, we found that little or no value was added. The most likely reason for this is that negative interaction space is so large that the effect of homologous interactions is negligible. Contrary to popular usage, we also found that restricting by cellular compartments did not yield very good negative interaction sets. Of all the individual selection criteria, we observed that restriction by function gave the best GSN set. The set could, however, be considerably improved by the combination of all the characteristics.

Our suggestion that non-co-localization is not the optimal way to select negative interactions, and in fact is no better than random selection with the added problem of introducing bias (Ben-Hur and Noble, 2006), may have a profound impact on several studies where compartmentalization was used to generate GSN sets (Gomez et al., 2003; Jonsson and Bates, 2006; Zhang et al., 2004).

4.2 Paralogues versus orthologues

Homologous interactions can be further broken down into two subcategories: paralogous and orthologous interactions. For simplicity, we define paralogous interactions here as those that exist between homologous proteins within the same organism, and orthologous interactions as those that exist between homologous proteins from different organisms. Having used homologous interactions to create ‘gold standard’ interaction sets, we investigated the effect of paralogues and orthologues.
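
Under the working definition above, the evidence behind a verified interaction can be labelled by comparing species. The sketch below is hypothetical: the function name and the input layout (a list of the species in which supporting homologous interactions were observed) are assumptions for illustration.

```python
def verification_source(target_species, supporting_species):
    """Classify the homologous evidence for one interaction.

    target_species:     species of the interaction being verified.
    supporting_species: species of each homologous interaction that supports it.
    Returns 'paralogous', 'orthologous', 'both' or 'unverified'.
    """
    same = any(s == target_species for s in supporting_species)
    other = any(s != target_species for s in supporting_species)
    if same and other:
        return "both"
    if same:
        return "paralogous"
    if other:
        return "orthologous"
    return "unverified"


print(verification_source("S. cerevisiae", ["S. cerevisiae", "D. melanogaster"]))
```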

When verifying interactions using homologous interacting protein pairs, a total of 10 151 interactions were authenticated across all species. Breaking this number down by the source of verification, we found that 78% (7946) were verified using just paralogous interactions, 7% (677) were verified using just orthologous interactions and 15% were verified using both. Paralogous interactions therefore have a much larger effect on the dataset.
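
The percentages quoted above can be checked with a few lines of arithmetic. The count for the "both" category is not given in the text; it is derived here on the assumption that the three categories partition the 10 151 verified interactions.

```python
total = 10151           # interactions verified across all species
paralogous_only = 7946  # verified by paralogous interactions alone
orthologous_only = 677  # verified by orthologous interactions alone

# Derived, assuming the three categories are exhaustive and non-overlapping.
both = total - paralogous_only - orthologous_only

shares = {
    "paralogous only": paralogous_only / total,
    "orthologous only": orthologous_only / total,
    "both": both / total,
}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```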

When comparing orthologous interactions to paralogous interactions, it is important to keep in mind that not all species have been investigated to the same extent. As Table 1 indicates, D. melanogaster and S. cerevisiae have been studied extensively, whereas humans and mice have not. Considering that there are so few interactions in human yet so many in yeast, it is possible that most of the interactions verified in humans are due to orthologous interactions. With such a discrepancy in numbers, it may be that the percentage of interactions verified by paralogous interactions (78%) reflects only those verified from yeast or other well-studied organisms. To investigate this possible bias, we examined the size of each species' original interaction set and compared it with the number of interactions verified by virtue of orthologous or paralogous proteins. Figure 2 shows the fraction of interactions verified by paralogous and orthologous interactions alongside how well a species has been studied, represented by the percentage each organism contributes to the DIP database. Consider E. coli: its original unverified interaction set constitutes only 12.8% of total interactions, indicating that it has a small known interactome. If a bias were present, we would expect the number of orthologous verifications to outweigh the number of paralogous verifications. We consistently find that this is not the case. On the contrary, regardless of the size of the original species dataset, paralogous verification remains prominent in all species, with the exception of mouse, where paralogous and orthologous verifications contribute equally. A possible explanation for this is the extremely small number of interactions currently available for this organism. However, we would predict that as more interactions are elucidated, the ratio of paralogous to orthologous interactions will increase.

Fig. 2.

The graph depicts the size of the original species-specific interaction dataset as a percentage of DIP and the fraction of verified interactions that were authenticated due to paralogous and orthologous verification. Regardless of the size, paralogous verifications have a greater influence on the number of authenticated interactions apart from the first dataset where they are equivalent.

The influence of paralogous interactions is also apparent if we analyse the effect of paralogous and orthologous interactions in the creation of the negative interaction sets. We observe that orthologous interactions do not add much more information than paralogous interactions alone. When generating false interactions using just paralogous protein interactions (i.e. randomly selecting proteins that do not have any paralogous protein interactions) we find that the overlap is 20.19 (Fig. 1). Negative interaction sets created using both orthologous and paralogous interactions have an average overlap of 20.3.

It is clear that HPIV datasets are influenced more by paralogous interactions than by orthologous interactions. The finding that more paralogous interactions exist also confirms previous work suggesting that interactions are more conserved within species than between species (Mika and Rost, 2006). This calls into question whether it is currently viable to predict new interactions using interactions solely from other species. At the very least, predicting putative protein interactions using paralogous interactions as well would yield more confident results than relying solely on orthologous interactions.

5 CONCLUSION

Current protein interaction networks are not only incomplete but are strewn with false positives. In yeast this weakness is sometimes circumvented by the use of ‘gold standard’ sets. In other organisms no such sets exist. We employ a homology-based method, Homologous Protein Interaction Verification (HPIV), to verify interactions and create such ‘gold standard’ sets for several species. A strength of this method is that it requires only protein interaction and sequence information and is able to produce a ‘gold standard’ set for any species. Because of its simple prerequisites, the method is able to utilize all available interaction information and is not as expensive as other techniques used to create reliable interaction sets.

To assess the value and reliability of HPIV sets, we performed extensive error rate analysis on the HPIV yeast and Drosophila subsets. We found that they outperform standard sets in all measures of error. To demonstrate the utility of the new ‘gold standard’ sets, we employed the HPIV yeast subset in protein function prediction and protein interaction prediction. More precise functional predictions were made and protein interactions were predicted with greater accuracy. It is our belief that this higher level of accuracy will be reflected in the HPIV datasets of other species.

We also created a set of ‘gold standard’ negative interactions in yeast. Currently most studies generate negative interactions by randomly pairing proteins or by selecting only those proteins that share no common cellular localization. Using functional, localization, expression and homologous interaction information we created a set of negative interactions.

To assess the accuracy of the negative interaction datasets, we generated interactions for each method and calculated an average overlap with a positive interaction set. We found that our integrated method performed best and, more surprisingly, that the best single criterion for generating false interactions is function. This contradicts several studies where negative interactions are generated by randomly selecting two proteins or by selecting proteins that do not share cellular compartments. We also demonstrate for the first time which of these factors is important for the generation of negative interactions, which should allow work such as that by Wu et al. (2006) to be further improved.

The homology-based method that we utilize makes use of both interactions conserved within a species and those conserved between species. We show that when breaking down homology into paralogues and orthologues, paralogous interactions are more prevalent and influence our methods far more than orthologues. This indicates that methods which predict putative protein interactions based on conservation in other species would yield more data if their analysis included paralogous interactions.

ACKNOWLEDGEMENT

R.S. is grateful for financial support from the EPSRC.

Conflict of Interest: none declared.

REFERENCES

Altschul SF, et al. Gapped blast and psi-blast: a new generation of protein database search programs, Nucleic Acids Res, 1997
Ashburner M, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium, Nat. Genet, 2000, vol. 25 (pg. 25-29)
Ben-Hur A, Noble WS. Choosing negative examples for the prediction of protein–protein interactions, BMC Bioinformatics, 2006, vol. 7 Suppl 1
Bork P, et al. Protein interaction networks from yeast to human, Curr. Opin. Struct. Biol, 2004
Chen J, et al. Increasing confidence of protein interactomes using network topological metrics, Bioinformatics, 2006, vol. 22 (pg. 1998-2004)
Deane C, et al. Protein interactions: two methods for assessment of the reliability of high throughput observations, Mol. Cell. Proteomics, 2002, vol. 1 (pg. 349-356)
Deng M, et al. Inferring domain–domain interactions from protein–protein interactions, Genome Res, 2002, vol. 12 (pg. 1540-1548)
Deng M, et al. Assessment of the reliability of protein–protein interactions and protein function prediction, Pac. Symp. Biocomput, 2003 (pg. 140-151)
Finn RD, et al. Pfam: clans, web tools and services, Nucleic Acids Res, 2006, vol. 34 Database issue (pg. 247-251)
Formstecher E, et al. Protein interaction mapping: a Drosophila case study, Genome Res, 2005, vol. 15 (pg. 376-384)
Gavin A, et al. Functional organisation of the yeast proteome by systematic analysis of protein complexes, Nature, 2002, vol. 415 (pg. 141-147)
Giot L, et al. A protein interaction map of Drosophila melanogaster, Science, 2003, vol. 302 (pg. 1727-1736)
Gomez SM, et al. Learning to predict protein–protein interactions from protein sequences, Bioinformatics, 2003, vol. 19 (pg. 1875-1881)
Güldener U, et al. CYGD: the Comprehensive Yeast Genome Database, Nucleic Acids Res, 2005, vol. 33 Database issue (pg. 364-368)
Hart GT, et al. How complete are current yeast and human protein-interaction networks, Genome Biol, 2006, vol. 7 (pg. 120)
Holstege F, et al. Dissecting the regulatory circuitry of a eukaryotic genome, Cell, 1998, vol. 95 (pg. 717-728)
Ho Y, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry, Nature, 2002, vol. 415 (pg. 180-183)
Huang TW, et al. Reconstruction of human protein interolog network using evolutionary conserved network, BMC Bioinformatics, 2007, vol. 8 (pg. 152)
Ito T, et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome, Proc. Natl Acad. Sci. USA, 2001, vol. 98 (pg. 4569-4574)
Jansen R, et al. A bayesian networks approach for predicting protein–protein interactions from genomic data, Science, 2003, vol. 302 (pg. 449-453)
Jonsson PF, Bates PA. Global topological features of cancer proteins in the human interactome, Bioinformatics, 2006, vol. 22 (pg. 2291-2297)
Li S, et al. A map of the interactome network of the metazoan C. elegans, Science, 2004, vol. 303 (pg. 540-543)
Mewes H, et al. MIPS: analysis and annotation of proteins from whole genomes, Nucleic Acids Res, 2004, vol. 32 Database issue (pg. 41-44)
Mika S, Rost B. Protein-protein interactions more conserved within species than across species, PLoS Comput. Biol, 2006, vol. 2
Patil A, Nakamura H. Filtering high-throughput protein–protein interaction data using a combination of genomic features, BMC Bioinformatics, 2005, vol. 6 (pg. 100)
Patil A, Nakamura H. HINT – a database of annotated protein–protein interactions and their homologs, Biophysics, 2005, vol. 1 (pg. 21-24)
Pellegrini M, et al. Protein interaction networks, Expert Rev. Proteomics, 2004, vol. 1 (pg. 239-249)
Piehler J. New methodologies for measuring protein interactions in vivo and in vitro, Curr. Opin. Struct. Biol, 2005, vol. 15 (pg. 4-14)
Qi Y, et al. Random forest similarity for protein–protein interaction prediction from multiple sources, Proc. Pac. Symp. Biocomput, 2005, vol. 19 (pg. 531-542)
Saeed R, Deane CM. Protein protein interactions, evolutionary rate, abundance and age, BMC Bioinformatics, 2006, vol. 7 (pg. 128)
Saito R, et al. Interaction generality, a measurement to assess the reliability of a protein–protein interaction, Nucleic Acids Res, 2002, vol. 30 (pg. 1163-1168)
Saito R, et al. Construction of reliable protein–protein interaction networks with a new interaction generality measure, Bioinformatics, 2003, vol. 19 (pg. 756-763)
Salwinski L, et al. The database of interacting proteins: 2004 update, Nucleic Acids Res, 2004, vol. 32 (pg. D449-D451)
Schwikowski B, et al. A network of protein–protein interactions in yeast, Nat Biotechnol, 2000, vol. 18 (pg. 1257-1261)
Suthram S, et al. A direct comparison of protein interaction confidence assignment schemes, BMC Bioinformatics, 2006, vol. 7 (pg. 360)
Uetz P, et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae, Nature, 2000, vol. 403 (pg. 623-627)
von Mering C, et al. Comparative assessment of large-scale data sets of protein–protein interactions, Nature, 2003
von Mering C, et al. STRING: known and predicted protein–protein associations, integrated and transferred across organisms, Nucleic Acids Res, 2005, vol. 33 Database issue (pg. 433-437)
Walhout AJ, et al. Protein interaction mapping in C. elegans using proteins involved in vulval development, Science, 2000, vol. 287 (pg. 116-122)
Wu X, et al. Prediction of yeast protein–protein interaction network: insights from the gene ontology and annotations, Nucleic Acids Res, 2006, vol. 34 (pg. 2137-2150)
Yu J, Fotouhi F. Computational approaches for predicting protein–protein interactions: a survey, J. Med. Syst, 2006, vol. 30 (pg. 39-44)
Zhang LV, et al. Predicting co-complexed protein pairs using genomic and proteomic data integration, BMC Bioinformatics, 2004, vol. 5 (pg. 38)

Author notes

Associate Editor: Limsoon Wong
