## Abstract

Motivation: Identifying protein–protein interactions is critical for understanding cellular processes. Because protein domains represent binding modules and are responsible for the interactions between proteins, computational approaches have been proposed to predict protein interactions at the domain level. The fact that protein domains are likely evolutionarily conserved allows us to pool information from data across multiple organisms for the inference of domain–domain and protein–protein interaction probabilities.

Results: We use a likelihood approach to estimating domain–domain interaction probabilities by integrating large-scale protein interaction data from three organisms, Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster. The estimated domain–domain interaction probabilities are then used to predict protein–protein interactions in S.cerevisiae. Based on a thorough comparison of sensitivity and specificity, Gene Ontology term enrichment and gene expression profiles, we have demonstrated that it may be far more informative to predict protein–protein interactions from diverse organisms than from a single organism.

Availability: The program for computing the protein–protein interaction probabilities and supplementary material are available at http://bioinformatics.med.yale.edu/interaction

Contact: hongyu.zhao@yale.edu

## INTRODUCTION

Protein–protein interactions play critical roles in the control of most cellular processes. Many proteins involved in signal transduction, gene regulation, cell–cell contact and cell cycle control require interaction with other proteins or cofactors to activate those processes (Papin et al., 2004; Tucker et al., 2001; Wang, 2002). Recently, systematic identifications of protein interactions in Saccharomyces cerevisiae have been conducted using high-throughput techniques such as yeast two-hybrid screening methods (Ito et al., 2001; Uetz et al., 2000) or affinity purification coupled with mass spectroscopy (Gavin et al., 2002; Ho et al., 2002). Although these experimental approaches have generated enormous amounts of data and valuable resources for studying protein interactions, these methods suffer from high false positive and false negative rates owing to their limitations (Mrowka et al., 2001; von Mering et al., 2002). For example, the false negative rate of the yeast two-hybrid assay used to construct S.cerevisiae interaction maps has been estimated to be >70% (Deng et al., 2002). Therefore, there is a great need to develop complementary computational methods capable of accurately predicting interactions between proteins through integrated analysis of data from multiple sources.

A number of computational approaches have been proposed to predict protein–protein interactions, including those based on genomic information (Enright et al., 1999; Tsoka et al., 2000), three-dimensional structural information (Lu et al., 2003; Aloy et al., 2004), integration of multiple genomic datasets (Jansen et al., 2003; Lin et al., 2004; Iossifov et al., 2004) and literature mining (Marcotte et al., 2001). Protein–protein interactions can also be predicted on the basis of evolutionary relationship. It has been shown that interacting proteins often exhibit coordinated evolution, so that proteins with similar phylogenetic trees are more likely to interact with each other (Pazos et al., 2001; Goh et al., 2002; Ramani et al., 2003). In addition, the concept of ‘interologs’ has been proposed based on the idea that a pair of interacting proteins are coevolving so that their respective orthologs in other organisms tend to interact as well (Walhout et al., 2000).

Several methods have been proposed to predict protein interactions in S.cerevisiae on the basis of another important principle, namely, domain–domain interactions. The protein domain as a unit of structure, function and evolution also serves as a unit for protein–protein interactions. Therefore, it is important to take into account domain–domain interactions when we infer plausible interacting protein pairs. In these methods, proteins are characterized by one or more domains and each domain is responsible for a specific interaction with another domain. Sprinzak and Margalit (2001) identified the domain pairs that are highly correlated with interacting protein pairs using protein–protein interaction data from S.cerevisiae as training data. The information was further used to predict interacting protein pairs that contain an interacting domain pair. Similarly, Gomez et al. (2001, 2003) and Deng et al. (2002) estimated the probabilities of domain–domain interactions using protein–protein interaction data from S.cerevisiae as training data; the estimated domain–domain interaction probabilities can be used to infer protein–protein interaction probabilities. These methods depend heavily on the accuracy of the training data and have mostly been applied to protein–protein interaction data from a single organism, which may make them inferior to methods that can incorporate more information in estimating domain–domain interaction probabilities.

Because domains are likely evolutionarily conserved, information from multiple organisms may be integrated together to improve the estimation of domain–domain interaction probabilities. In our study, we incorporate information from three organisms, S.cerevisiae, Caenorhabditis elegans and Drosophila melanogaster, to effectively utilize the domain information as the evolutionary connection among these model organisms. The protein–domain relationship can be extracted from relevant databases such as PFAM and SMART (Bateman et al., 2004; Letunic et al., 2004). By integrating large-scale protein–protein interaction data from these three organisms, we have extended a likelihood approach proposed by Deng et al. (2002) to estimate the probabilities of domain–domain interactions based on information from all three organisms. Considering each protein as a collection of domains, we can then estimate the probabilities of protein–protein interactions in S.cerevisiae based on the inferred domain–domain interaction probabilities. The protein pairs with interaction probabilities above a certain threshold can then be predicted to interact with each other. In order to assess the performance of our method, we first apply it to the interaction data from S.cerevisiae only and compare its performance with that of three other methods that predict protein interactions based on the domain composition of proteins in the cross-validation measurement, and we demonstrate that our method provides comparable performance to the others. Then, we compare our prediction results based on all three organisms with those based on S.cerevisiae alone. We find that the integrated analysis provides more reliable inference of protein–protein interactions than the analysis from a single organism based on the analysis of sensitivity and specificity, Gene Ontology term enrichment and gene expression profiles.

## METHODS

### Data sources

In our study, the high-throughput yeast two-hybrid data from three organisms, S.cerevisiae, C.elegans and D.melanogaster, are used to infer domain–domain interaction probabilities. For S.cerevisiae, we use a combined dataset from two independent studies (Ito et al., 2000; Uetz et al., 2000), which includes a total of 5295 interactions. For C.elegans, 4714 interactions were reported from yeast two-hybrid experiments (Li et al., 2004). For D.melanogaster, results from two-hybrid experiments yielded a total of 20 349 interaction pairs (Giot et al., 2003). The protein–domain relationships for each protein in S.cerevisiae, C.elegans and D.melanogaster are extracted from PFAM (Bateman et al., 2004) and SMART (Letunic et al., 2004).

### Maximum likelihood estimation of domain–domain and protein–protein interaction probabilities

We estimate the probabilities of domain–domain interactions through the extension of a likelihood approach proposed by Deng et al. (2002) so that it can incorporate information from all three organisms. In this model, we make the following assumptions: (1) domain–domain interactions are independent, so whether two domains interact or not does not depend on the interactions among other domains; (2) the probability that two domains m and n interact is the same in all three organisms; (3) two proteins i and j interact if and only if at least one pair of domains from the two proteins interact.

With these assumptions, we have

$$Pr({P}_{ijk}=1)=1-{\prod }_{({D}_{mn}\in {P}_{ijk})}(1-{\lambda }_{mn}),$$
where $P_{ijk}$ represents the protein pair i and j in species k; $P_{ijk}=1$ if protein i and protein j in species k interact with each other, and $P_{ijk}=0$ otherwise. Here, k = 1, 2, 3 represents species S.cerevisiae, C.elegans and D.melanogaster, respectively, $\lambda_{mn}$ represents the probability that domain m interacts with domain n and the notation $D_{mn}\in P_{ijk}$ denotes all pairs of domains from protein pair i and j in species k. The probability that proteins i and j in species k are observed to be interacting in the experiments is $Pr(O_{ijk}=1)=Pr(P_{ijk}=1)(1-fn)+[1-Pr(P_{ijk}=1)]fp$, where $O_{ijk}=1$ if interaction between protein i and j is observed in species k, and $O_{ijk}=0$ otherwise. Here, fn and fp represent the false negative rate and false positive rate of the protein interaction data. It has been estimated that the total number of interactions between all yeast proteins is ∼20 000–30 000 (Bader et al., 2004). Therefore, for S.cerevisiae, we have
$\begin{array}{c}fn=Pr({O}_{ijk}=0|{P}_{ijk}=1)=1.0-\frac{Pr({O}_{ijk}=1,{P}_{ijk}=1)}{Pr({P}_{ijk}=1)}\\ \ge 1.0-\frac{Pr({O}_{ijk}=1)}{Pr({P}_{ijk}=1)}=1.0-\frac{\hbox{ Number\; of\; observed\; interacting\; pairs }}{\hbox{ Number\; of\; real\; interacting\; pairs }}\\ \ge 1.0-\frac{5295}{20000}\ge 0.74.\end{array}$
We obtained a total of 5717 proteins from SWISS-PROT and TrEMBL; therefore,
$\begin{array}{l}fp=Pr({O}_{ijk}=1|{P}_{ijk}=0)=\frac{Pr({O}_{ijk}=1,{P}_{ijk}=0)}{Pr({P}_{ijk}=0)}\\ \le \frac{Pr({O}_{ijk}=1)}{Pr({P}_{ijk}=0)}=\frac{\hbox{ Number\; of\; observed\; interacting\; pairs }}{\hbox{ Total\; protein\; pairs\; \mbox{--}\; Number\; of\; real\; interacting\; pairs }}\\ \le \frac{5295}{5717\times (5717+1)/2-30000}\le 3.3\times {10}^{-4}.\end{array}$
Similarly, for C.elegans, fn is ∼0.90 by mapping the observed interactions to a benchmark data set (Li et al., 2004) and we estimate fp to be $<3\times 10^{-5}$. For D.melanogaster, fn is ∼0.80 (Giot et al., 2003) and we estimate fp to be $<3.6\times 10^{-4}$.
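
The model and the error-rate bounds above can be checked numerically. The following is a minimal Python sketch, not the authors' program: the function names and the three λ values are hypothetical and chosen only for illustration, while the counts 5295, 5717, 20 000 and 30 000 and the rates fn = 0.85 and fp = 3 × 10⁻⁴ are taken from the text.

```python
from math import prod

def protein_interaction_prob(domain_pair_probs):
    """Pr(P_ijk = 1) = 1 - product of (1 - lambda_mn) over the domain pairs
    formed by the two proteins (assumptions 1 and 3 of the model)."""
    return 1.0 - prod(1.0 - lam for lam in domain_pair_probs)

def observed_prob(p_interact, fn, fp):
    """Pr(O_ijk = 1) = Pr(P_ijk = 1)(1 - fn) + [1 - Pr(P_ijk = 1)] fp."""
    return p_interact * (1.0 - fn) + (1.0 - p_interact) * fp

# Hypothetical lambda values for the three domain pairs of one protein pair.
lambdas = [0.2, 0.05, 0.0]
p = protein_interaction_prob(lambdas)      # 1 - 0.8 * 0.95 * 1.0
o = observed_prob(p, fn=0.85, fp=3e-4)

# Bounds on the error rates for S.cerevisiae, using the counts in the text.
fn_lower = 1.0 - 5295 / 20000              # lower bound on fn
total_pairs = 5717 * (5717 + 1) // 2       # protein pairs as counted in the text
fp_upper = 5295 / (total_pairs - 30000)    # upper bound on fp

print(f"Pr(P=1) = {p:.4f}, Pr(O=1) = {o:.6f}")
print(f"fn >= {fn_lower:.4f}, fp <= {fp_upper:.2e}")
```

Because fn is large, even a protein pair with a substantial interaction probability is expected to be observed only rarely, which is why the unobserved pairs cannot simply be treated as non-interacting.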

The likelihood function that characterizes the probability of the observed protein interaction data across all three organisms is $L=\prod Pr(O_{ijk}=1)^{O_{ijk}}[1-Pr(O_{ijk}=1)]^{1-O_{ijk}}$, where the product is taken over all protein pairs in the three organisms. The likelihood function L is a function of the parameters $\lambda_{mn}$ once we specify fixed values for fn and fp. To obtain the maximum likelihood estimates (MLEs) of the parameters, we propose to use the EM algorithm (Dempster et al., 1977), which consists of the expectation (E) step and the maximization (M) step. In the E-step, we need to calculate the expectations of the complete data given the observed data. Here, the complete data include all the domain–domain interactions for each protein–protein pair i and j of each of the three organisms, denoted by $D_{mn}^{(ij)}$. We have
$E\left({D}_{mn}^{\left(ij\right)}\right|{O}_{ijk}={o}_{ijk},{\lambda }_{mn})=\frac{{\lambda }_{mn}^{(t-1)}{(1-fn)}^{{o}_{ijk}}f{n}^{1-{o}_{ijk}}}{Pr({O}_{ijk}={o}_{ijk}|{\lambda }_{mn}^{(t-1)})}.$
With the expectations of the complete data, in the M-step, we update the $\lambda_{mn}$ by
${\lambda }_{mn}^{\left(t\right)}=\frac{{\lambda }_{mn}^{(t-1)}}{{N}_{mn}}\sum \frac{{(1-fn)}^{{o}_{ijk}}f{n}^{1-{o}_{ijk}}}{Pr({O}_{ijk}={o}_{ijk}|{\lambda }_{mn}^{(t-1)})},$
where $N_{mn}$ is the total number of protein pairs containing the domain pair (m, n) across the three organisms, and the summation is over all these protein pairs.

We update the parameter estimates of the $\lambda_{mn}$ by iterating between the E-step and the M-step until convergence to obtain the MLEs of the $\lambda_{mn}$ for all the domain pairs. The estimated values of the $\lambda_{mn}$ allow us to compute the protein interaction probabilities so that two proteins with an interaction probability greater than a certain threshold can be predicted to be interacting partners.
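
The E-step and M-step above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released program: the function name `em_domain_interactions`, the input layout (a list of `(domain_pairs, observed)` tuples pooled over the three organisms), the starting value, the stopping rule and the toy data are all hypothetical.

```python
from collections import defaultdict
from math import prod

def em_domain_interactions(pairs, fn, fp, n_iter=500, init=0.01, tol=1e-8):
    """Estimate lambda_mn by iterating the E-step and M-step.

    pairs: list of (domain_pairs, observed), where domain_pairs lists the
    domain-pair keys of one protein pair and observed is 0 or 1.
    Assumes 0 < fn < 1 and fp > 0 so that no division by zero occurs.
    """
    lam = {dp: init for dps, _ in pairs for dp in dps}
    n_mn = defaultdict(int)                 # N_mn: protein pairs containing (m, n)
    for dps, _ in pairs:
        for dp in dps:
            n_mn[dp] += 1

    for _ in range(n_iter):
        sums = defaultdict(float)
        for dps, o in pairs:
            p_int = 1.0 - prod(1.0 - lam[dp] for dp in dps)      # Pr(P = 1)
            p_obs1 = p_int * (1.0 - fn) + (1.0 - p_int) * fp     # Pr(O = 1)
            p_o = p_obs1 if o == 1 else 1.0 - p_obs1             # Pr(O = o)
            # (1 - fn)^o * fn^(1 - o) / Pr(O = o | lambda^(t-1))
            weight = ((1.0 - fn) if o == 1 else fn) / p_o
            for dp in dps:
                sums[dp] += weight
        # M-step: lambda^(t) = lambda^(t-1) / N_mn * sum of the weights
        new_lam = {dp: lam[dp] * sums[dp] / n_mn[dp] for dp in lam}
        delta = max(abs(new_lam[dp] - lam[dp]) for dp in lam)
        lam = new_lam
        if delta < tol:
            break
    return lam

# Toy data: domain pair (A, B) occurs in 10 protein pairs, 3 of them observed
# to interact; domain pair (C, D) occurs in 10 pairs, none observed.
toy = ([([("A", "B")], 1)] * 3 + [([("A", "B")], 0)] * 7
       + [([("C", "D")], 0)] * 10)
estimates = em_domain_interactions(toy, fn=0.5, fp=0.01)
print(estimates)
```

In this toy case each protein pair carries a single domain pair, so the MLE for (A, B) has the closed form (0.3 − fp)/(1 − fn − fp) = 0.29/0.49, which gives a simple check on the iteration; the estimate for (C, D) shrinks toward zero.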

### Cross-validated comparison and receiver operating characteristic analysis

To compare our likelihood approach with other similar methods that predict protein interactions based on protein domain information, we measure the performance of each prediction using 5-fold cross-validation. As all the other methods predicting protein interaction pairs are applied to the interaction data from S.cerevisiae only, we define the training interaction data for the cross-validation as follows: we consider the 3543 yeast physical interaction pairs in MIPS as positive examples (Mewes et al., 2004) and the other possible protein pairs, 6 895 215 pairs in total, as negative examples. At each iteration of the cross-validation experiments we reserve one-fifth of both positives and negatives for testing and use the remaining data for training. The training–test procedure is repeated five times.

The prediction accuracy is measured using the receiver operating characteristic (ROC) curve, which demonstrates the trade-offs between sensitivity and specificity. It is a plot of the true positive rate (sensitivity) against the false positive rate (1 − specificity) for different thresholds. Here, the true positive rate, denoted as TPF, is calculated as the number of predicted protein pairs that are included in the positive examples divided by 3543, the total number of positives; the false positive rate, denoted as FPF, is calculated as the number of predicted protein pairs that are included in the negative examples divided by 6 895 215, the total number of negatives. The ROC score, calculated as the area under the ROC curve, is a measurement of prediction accuracy. The closer the ROC score is to 1.0, the better the prediction. In our study, we repeat the entire cross-validation procedure three times in order to estimate the variance of the ROC score.
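
As an illustration of how TPF, FPF and the ROC score relate, the following Python sketch computes ROC points over a set of thresholds and integrates them with the trapezoidal rule. The function names, the toy scores and the use of `>=` for the decision rule are assumptions made for the example; the text does not specify how the area was computed.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPF, TPF) point per threshold.

    scores: predicted interaction probabilities; labels: 1 for a positive
    example, 0 for a negative example. A pair is predicted to interact when
    its score is at least the threshold (an assumption of this sketch).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def roc_score(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy example: two positives and two negatives.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = roc_score(roc_points(scores, labels, [0.0, 0.2, 0.5, 0.85, 1.0]))
print(auc)
```

For this toy input, three of the four positive–negative pairs are ranked correctly, so the area is 0.75.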

### Gene Ontology analysis

We determine whether the two genes encoding the predicted interacting protein pair have any GO annotation enriched in the biological process ontology by using the Saccharomyces Genome Database (SGD) GO TermFinder (http://search.cpan.org/dist/GO-TermFinder/). The probability that two genes share the same biological process by chance is calculated through the hypergeometric distribution. The P-value is calculated using the following equation:

$P\hbox{ -value }={\displaystyle \sum _{j=x}^{n}}\frac{\left(\begin{array}{c}M\\ j\end{array}\right)\left(\begin{array}{c}N-M\\ n-j\end{array}\right)}{\left(\begin{array}{c}N\\ n\end{array}\right)},$
where N and M represent the total number of genes in the population and the number of genes that have a particular biological process category annotation, respectively, and n and x represent the number of genes in the set and the number of genes in the set annotated with the particular biological process, respectively. Because each gene set we investigate is a pair of genes, both n and x are equal to 2. The P-value is corrected for multiple testing using Bonferroni correction and a protein pair is considered as GO term enriched if the corrected P-value is <0.05.
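
For a gene pair the hypergeometric tail reduces to a single term, C(M, 2)/C(N, 2). The Python sketch below evaluates the tail sum directly and applies a Bonferroni correction; the function names and the example values of N, M and the number of tests are hypothetical and are not taken from the GO TermFinder output.

```python
from math import comb

def hypergeom_tail(N, M, n, x):
    """Probability that at least x of n sampled genes carry an annotation
    held by M of the N genes in the population."""
    upper = min(n, M)
    total = sum(comb(M, j) * comb(N - M, n - j) for j in range(x, upper + 1))
    return total / comb(N, n)

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical numbers: 6000 genes in the population, 50 annotated with the
# biological process, and a gene pair (n = x = 2) sharing that annotation.
p_raw = hypergeom_tail(N=6000, M=50, n=2, x=2)   # equals C(50, 2) / C(6000, 2)
p_corrected = bonferroni(p_raw, n_tests=100)     # 100 tested terms is illustrative
print(p_raw, p_corrected, p_corrected < 0.05)
```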

To assess the overall statistical significance of the observed GO term enrichment, we generate randomized protein–domain associations by randomly permuting the domain labels of all proteins while leaving the number of domains associated with each protein untouched. We then run the same prediction procedure on the permuted domain information. This process is repeated 100 times and the number of predicted protein pairs having GO term enrichment is recorded for each permutation. The empirical P-value for the observed GO term enrichment is calculated as the fraction of the permutations having a larger number of GO term enriched protein pairs than that based on the observed data.
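
One way to read the permutation scheme is sketched below in Python: all domain labels are pooled, shuffled and dealt back so that each protein keeps its original number of domains. This is an assumed interpretation (for example, it allows a protein to receive the same label twice), and the names `permute_domains` and `empirical_p_value` as well as the toy inputs are hypothetical.

```python
import random

def permute_domains(protein_domains, rng):
    """Shuffle domain labels across proteins, preserving each protein's
    number of domains."""
    proteins = list(protein_domains)
    pool = [d for p in proteins for d in protein_domains[p]]
    rng.shuffle(pool)
    shuffled, start = {}, 0
    for p in proteins:
        k = len(protein_domains[p])
        shuffled[p] = pool[start:start + k]
        start += k
    return shuffled

def empirical_p_value(observed_count, permuted_counts):
    """Fraction of permutations with strictly more GO-enriched pairs than
    the observed data."""
    larger = sum(1 for c in permuted_counts if c > observed_count)
    return larger / len(permuted_counts)

# Toy protein-domain table with hypothetical identifiers.
table = {"P1": ["d1", "d2"], "P2": ["d3"], "P3": ["d4", "d5", "d6"]}
permuted = permute_domains(table, random.Random(0))

# Hypothetical counts of enriched pairs from four permutations.
p_emp = empirical_p_value(203, [10, 250, 30, 203])
print(permuted, p_emp)
```

In the full procedure each permuted table would be passed through the same estimation and prediction steps before the GO-enriched pairs are counted.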

## RESULTS

The protein–domain relationships are extracted from PFAM and SMART, and there are a total of 3317 domains associated with the proteins of the three organisms (S.cerevisiae, C.elegans and D.melanogaster). The distribution of these domains across the three organisms is shown in a Venn diagram in Figure 1.

### Sensitivity and specificity

In this study, we have extended a likelihood approach by Deng et al. (2002) to integrate information from diverse organisms to infer protein–protein interaction probabilities. We compare the performance of the likelihood approach with three other methods that have also been used for protein interaction prediction: the sequence-signature method proposed by Sprinzak and Margalit (2001), the attraction-only model (Gomez et al., 2001) and the attraction–repulsion model (Gomez et al., 2003). All four methods use the experimental protein interaction data to assign a probability or score to each protein pair, and make predictions of interacting protein pairs based on a selected decision threshold. To compare the performance of each prediction method, we apply these methods to the same training interaction data obtained from a single organism (S.cerevisiae only) and measure the performance of each method using 5-fold cross-validation. For different thresholds, the sensitivity and specificity of each prediction method are calculated and the ROC scores that measure the accuracy of prediction for each method are obtained (see Methods). The results in Figure 2 demonstrate that, with only the information from a single organism, the prediction performance of the likelihood approach, with a ROC score of 0.628 ± 0.005, is comparable to that of the attraction–repulsion model, and is significantly better than those of the attraction-only model and the sequence-signature method.

The advantage of our extended likelihood approach is that it allows us to incorporate the large-scale protein–protein interaction data from diverse organisms. In order to assess the benefit of simultaneous analysis of multiple organisms, we investigate the information gain from the joint analysis of all three organisms compared with the analysis based solely on S.cerevisiae. Because information from C.elegans and D.melanogaster can affect (and hopefully improve) the estimated domain–domain interaction probabilities in S.cerevisiae, the predicted protein–protein interactions differ between the two methods. Taking the 3543 protein–protein physical interactions recorded in MIPS as true positives, we estimate the sensitivity and specificity for each threshold of the two methods either based on information from all three organisms or based on information from S.cerevisiae alone. The results are summarized in the ROC curves in Figure 3. The improvement based on the joint analysis of three organisms can be easily seen from this figure.

### Evaluation of GO term enrichment

In order to evaluate the quality of our predicted protein interactions, we investigate whether two genes encoding a predicted interacting protein pair are functionally related. Because genes more likely share the same biological process if they are functionally related (Vazquez et al., 2003), we determine whether these two genes have any GO annotation enrichment in the biological process ontology compared with what would be expected by chance from a random pair of genes. We observe that, out of the top 1000 predicted interacting protein pairs based on the information from all three organisms, 203 pairs have at least one GO term enriched, whereas only 91 pairs out of the top 1000 predicted pairs based on the information from yeast alone have a GO term enriched. To assess the statistical significance of these results, we compare these predictions with those based on randomized protein–domain associations (see Methods). We find that the 203 observed GO term enriched pairs based on the information from all three species are statistically significant (empirical P-value is 0), whereas the observed 91 GO term enriched pairs based on S.cerevisiae alone are not statistically significant (empirical P-value is 0.06).

### Gene expression profiles

Interacting proteins are more likely to be coexpressed than a random pair of genes and this fact has been used for experimental validation of the predicted protein–protein interactions (Ge et al., 2001; Kemmeren et al., 2002). In our study, we test whether there is statistical evidence suggesting that gene expression profiles are more similar between the predicted protein pairs, where the similarity is defined by the Pearson correlation coefficient between the gene expression profiles of these two genes. For gene expression profiles, we use publicly available gene expression data, including a time-course study during the yeast cell cycle (Spellman et al., 1998) and the Rosetta ‘compendium’ set, which is composed of 300 diverse mutations and chemical treatments (Hughes et al., 2000).

To test whether the correlation coefficients of gene expressions for the predicted interacting protein pairs are significantly higher than those for random gene pairs, we compare the distribution of the correlation coefficients between the predicted interacting protein pairs with a probability threshold of 0.1, the physical interaction protein pairs from MIPS, the predicted interacting pairs excluding those pairs from MIPS, and random pairs. We find that the distribution of the correlation coefficients of the predicted protein pairs is similar to that of the annotated interacting protein pairs in MIPS, which are verified interacting proteins. Compared with random protein pairs, the predicted protein pairs have a higher mean correlation coefficient (Supplementary Data). In addition, we compare the mean expression correlation coefficient for the predicted interacting protein pairs based on information from all three organisms and that based on information from S.cerevisiae alone. For this comparison, we first identify the top N predicted interacting pairs based on either method, where N takes values of 100, 500, 1000, 2000, 5000 and 10 000. We then calculate the average correlation coefficient for the predicted interacting pairs in the set for each method. As shown in Table 1, as N increases, the mean correlation coefficient decreases owing to the inclusion of a larger proportion of false positives in the data set. More importantly, for any given N, the mean correlation coefficient for the predicted interacting protein pairs based on the information from all three organisms is significantly higher than that for protein pairs predicted using the information from S.cerevisiae alone. In addition, the distributions of the correlation coefficients for the top 1000 predicted protein pairs based on two different sources are shown in Figure 4. 
As can be seen from this figure, there is a general shift of the distribution to higher correlation coefficient values for protein pairs predicted based on the information from all three organisms compared with those predicted based on S.cerevisiae alone, indicating that predictions based on the information from all three organisms are likely to be more reliable.
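
The top-N comparison can be sketched as follows in Python. The code assumes that the predicted pairs are already sorted by decreasing interaction probability and that expression profiles are available as equal-length lists keyed by gene name; the function names and the toy profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_top_n_correlation(ranked_pairs, expression, n):
    """Mean expression correlation over the top n predicted pairs."""
    top = ranked_pairs[:n]
    return sum(pearson(expression[a], expression[b]) for a, b in top) / len(top)

# Toy expression profiles for three hypothetical genes.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
ranked = [("g1", "g2"), ("g1", "g3")]   # sorted by decreasing probability
top1 = mean_top_n_correlation(ranked, expression, 1)
top2 = mean_top_n_correlation(ranked, expression, 2)
print(top1, top2)
```

Running the same function on the rankings from the two information sources, for each value of N, yields the kind of comparison reported in Table 1.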

### Biological significance of the predictions

In this section, we discuss the biological relevance of the predicted interacting protein pairs. Although many of the predicted pairs are in the MIPS database, some of the top ones are not. Table 2 summarizes the top 10 predictions that are not in the MIPS database, and all these predictions have estimated interaction probabilities equal to 1. Table 2 also provides the functional annotation of these genes. Some of our predicted protein pairs include subunits of the same protein complex; for example, MCD1 and IRR1 are subunits of the yeast cohesin complex. Some other predictions involve interactions between proteins belonging to the same family, such as OCA1 and SNZ1, or between members of two different families, such as the VAC and ECM families. The interactions between VAC8, a phosphorylated vacuole membrane protein that is required for protein targeting from cytoplasm to vacuole (Scott et al., 2000), and the members of the ECM family, such as ECM15, may indicate that the ECM proteins are required for vacuole formation in three-dimensional extracellular matrices.

Some of our predictions may be biologically important. For example, it has been shown that the lack of Srp1 export might impair cNLS-dependent nuclear protein import in yeast (Stade et al., 2002). Because the ubiquitin-like modification of some proteins, such as RanGAP1, is required for protein nucleocytoplasmic trafficking (Matunis et al., 1998), the ubiquitin ligase may be involved in nuclear protein import. Therefore, it may be reasonable to consider that Srp1 and BUL2, a component of the ubiquitin ligase complex, interact with each other and together play a role in the nuclear protein import process. The interaction between CUP2 and THI4 may indicate that genes activated by the transcription factor CUP2 are involved in the process of thiamine biosynthesis, in which THI4 plays an important role. Another example is the protein pair DCS1–NTH2. NTH2 is a neutral trehalase, and it has been proposed that the phosphorylation of DCS1 by CaM kinase II would lead to its dissociation from the neutral trehalase, and thus that the activity of the neutral trehalase would be upregulated (Souza et al., 2002). Therefore, the lack of CaM kinase II would downregulate the neutral trehalase activity as a result of the interaction between DCS1 and NTH2. In addition, we may predict the functions of some unknown proteins based on their interacting partners. For example, YMR009W is predicted to interact with FUN34, a transmembrane protein that is involved in ammonia production; therefore, we can predict that YMR009W may also be involved in this process.

## DISCUSSION AND CONCLUSIONS

In this article, we propose estimating the probabilities of interactions between domain pairs by pooling information from three organisms—S.cerevisiae, C.elegans and D.melanogaster—based on large-scale protein interaction data. Using the estimated domain–domain interaction probabilities, we can then estimate the probabilities of interactions between each protein pair in a given organism. We focus our attention on predicting the protein interactions in S.cerevisiae, and we have found that, even based on the information from S.cerevisiae only, the likelihood approach is among the best-performing methods considered in our comparisons. Because of the experimental errors of large-scale two-hybrid assays, the domain interactions inferred from one organism may not be reliable, and the incorporation of data from other organisms can indeed improve the estimated domain–domain and protein–protein interactions. The extension of the likelihood approach allows the incorporation of the information from all three organisms, and the prediction results were found to be better than those obtained based on the information from S.cerevisiae alone through the examinations of ROC curves, GO term enrichments and expression profiles. Therefore, we conclude that the approach proposed in this study outperforms those used for comparison, providing more informative inference of protein interactions.

The results from our approach can be further improved when the domain information is further and more reliably annotated in the future. Currently, only about two-thirds of the S.cerevisiae proteins have a defined domain composition, and we have considered possible interactions only between those proteins with annotated domain information. As a result, the predictions based on domain–domain interactions will be able to capture only a portion of all interactions, the number of which is estimated to be ∼20 000–30 000 in S.cerevisiae. Our predicted interacting pairs depend on the threshold value used for the estimated interaction probabilities, and the number of predicted pairs increases as we reduce the threshold. Owing to the unknown number of truly interacting protein pairs as well as the incompleteness of the annotated domain information, it is difficult to set a threshold value to match the expected number of interacting pairs. When we set the threshold at 0.1, 20 088 protein pairs are predicted to interact with each other. At this level, using MIPS physical interaction data as the gold standard, we estimate the sensitivity and specificity to be 38.6 and 99.7%, respectively. (The list of all the predicted interactions is provided as supplementary information.) As the interacting protein pairs included in MIPS are far from complete, these values calculated based on the MIPS data could be different from the actual values.

It is well known that two-hybrid assays contain many errors, and the exact error rates are hard to assess because the actual protein–protein interactions are not yet known. Based on the number of interactions in our training data, we have estimated the ranges of the false positive and false negative rates (see Methods). The estimated value of fn agrees with the literature in which the dataset is published, and the estimated value of fp differs from those established in the literature by an order of magnitude because a different definition of false positive is used (the number of incorrect interactions observed in experiments divided by the total number of observed interactions). We fix the fn and fp rates in our analysis as this approach has been shown to be robust with respect to a range of experimental error rates (Supplementary Data). In our study, we set the error rates to be $fp=3\times 10^{-4}$ and fn = 0.85 for the interaction data for all three organisms to ease the computation; the resulting predictions are used for the GO term enrichment and gene expression analyses. In addition, we have applied our approach to a core interaction dataset including 1374 interactions from S.cerevisiae (Ito et al., 2000; Uetz et al., 2000), 2135 interactions from C.elegans (Li et al., 2004) and 4625 interactions from D.melanogaster (Giot et al., 2003). We set the error rates to be fp = 0 and fn = 0.95 because the dataset contains only high-confidence interactions. However, the analysis yields a smaller number of predicted interactions, and measured by sensitivity and specificity, the overall performance of the core dataset is not comparable to that of the dataset including all the interactions (Supplementary Data). Given that the core dataset contains only ∼8000 interactions for all three organisms, which is much smaller than the number of expected interactions, the information included in the core dataset may be further from being complete than the complete dataset, even though it has a smaller false positive rate, thus limiting the prediction power of our approach.

We predict protein–protein interactions through the annotated protein domains, which are responsible for protein interactions through direct physical interactions. Therefore, our goal, precisely defined, is to predict whether two proteins have direct physical interactions, not whether proteins are in the same complex. In this study, we have focused on the integration of two-hybrid data from different organisms. The prediction reveals potential protein physical interactions, but some of these may not be biologically relevant in a physiological condition. In principle, other types of data can be integrated into the approach; for example, the integration of data from high-throughput mass spectrometry protein complex purification along with the correlated mRNA expression profiles are expected to extend our prediction, yielding functionally related protein pairs.

The basic principle of our approach is the fact that domain–domain interactions are likely conserved across different organisms, therefore allowing us to borrow information from diverse organisms to improve the predictions of protein–protein interactions in a given organism. Although our current approach has indeed led to improved predictions, it can be further refined to generate more accurate predictions. For example, we may first improve the predictions of protein–protein interactions within the same organism through integrating diverse data sources from that organism (e.g. Jansen et al., 2003; Lin et al., 2004) and then perform joint analysis across different organisms based on the results from these integrated analyses. The current approach estimates the domain–domain interaction probabilities for each domain–domain pair separately, and these estimated probabilities may be more accurately estimated by pooling information from domains with similar structures or functions. Finally, a Bayesian approach may be adopted here both to incorporate prior information on domain–domain interactions and to better infer domain–domain interaction probabilities.

Fig. 1

The distribution of the domains in S.cerevisiae, C.elegans and D.melanogaster.

Fig. 2

ROC score summary. Error bars indicate the standard deviation over three cross-validation experiments.

Fig. 3

ROC curves of the prediction results based on different information sources.

Table 1

Comparison of the mean correlation coefficient for the selected predicted protein pairs based on two different information sources

| Pairs | Cell cycle: mean (sdc) | Cell cycle: mean (S.cerevisiae) | Cell cycle: empirical P-value | Rosetta: mean (sdc) | Rosetta: mean (S.cerevisiae) | Rosetta: empirical P-value |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.28 | 0.10 | | 0.20 | 0.05 | |
| 500 | 0.21 | 0.09 | | 0.17 | 0.04 | |
| 1000 | 0.15 | 0.09 | | 0.13 | 0.03 | 0.0001 |
| 2000 | 0.12 | 0.08 | 0.0055 | 0.08 | 0.03 | 0.0129 |
| 5000 | 0.10 | 0.08 | 0.0255 | 0.06 | 0.02 | 0.0250 |
| 10 000 | 0.09 | 0.07 | 0.0264 | 0.05 | 0.01 | 0.0258 |

sdc, prediction based on the information from three organisms, S.cerevisiae, D.melanogaster and C.elegans.

Fig. 4

Comparisons of the distributions of the Pearson correlation coefficients for the top 1000 predicted interacting protein pairs based on different information sources. sdc, prediction based on the information from three organisms S.cerevisiae, D.melanogaster and C.elegans.

Table 2

The top 10 predicted interacting protein pairs that are not included in the MIPS physical interaction dataset

| Protein I | Function | Protein II | Function |
| --- | --- | --- | --- |
| MCD1 | Mitotic chromosome determinant | IRR1 | Nuclear cohesin protein |
| ECM31 | Involved in cell wall biogenesis and architecture | VPS9 | Required for Golgi to vacuole trafficking |
| CUP2 | Copper-dependent transcription factor | THI4 | Involved in thiamine biosynthesis and DNA repair |
| BUL2 | Ubiquitin-mediated protein degradation | SRP1 | Karyopherin-alpha or importin |
| DCS1 | Scavenger mRNA decapping enzyme | NTH2 | Neutral trehalase |
| SNZ1 | Member of the stationary phase-induced gene family, involved in response to cell stress | SNZ1 | Member of the stationary phase-induced gene family, involved in response to cell stress |
| YMR009W | Unknown function localized to cytoplasm and nucleus | FUN34 | Integral membrane protein, involved in ammonia production |
| OCA1 | Putative protein tyrosine phosphatase | OCA1 | Putative protein tyrosine phosphatase |
| ECM15 | Involved in cell wall biogenesis and architecture | VAC8 | Required for vacuole inheritance and protein targeting from the cytoplasm to vacuole |
| SPC2 | Signal peptidase 18 KD subunit | URA3 | Orotidine-5′-phosphate decarboxylase |

All these pairs have estimated interaction probability equal to 1. Each row represents an interacting protein pair with their corresponding annotated functions. The protein function annotations are obtained from CYGD (the Comprehensive Yeast Genome Database).

This research was supported in part by National Science Foundation grant DMS-0241160 and Y.L. was supported by the NIH Institutional Training Grants for Informatics Research.

Conflict of Interest: none declared.

## REFERENCES

Aloy, P., et al. (2004) Structure-based assembly of protein complexes in yeast. Science, 303, 2026–2029.

Bader, J.S., et al. (2004) Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 22, 78–85.

Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

Dempster, A.P., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J.R. Statist. Soc. B, 39, 1–38.

Deng, M., et al. (2002) Inferring domain-domain interactions from protein-protein interactions. Genome Res., 12, 1540–1548.

Enright, A.J., et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.

Gavin, A.C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.

Ge, H., et al. (2001) Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet., 29, 482–486.

Giot, L., et al. (2003) A protein interaction map of Drosophila melanogaster. Science, 302, 1727–1736.

Goh, C.S. and Cohen, F.E. (2002) Co-evolutionary analysis reveals insights into protein-protein interactions. J. Mol. Biol., 324, 177–192.

Gomez, S.M., et al. (2001) Probabilistic prediction of unknown metabolic and signal-transduction networks. Genetics, 159, 1291–1298.

Gomez, S.M., et al. (2003) Learning to predict protein-protein interactions from protein sequences. Bioinformatics, 19, 1875–1881.

Ho, Y., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183.

Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.

Iossifov, I., et al. (2004) Probabilistic inference of molecular networks from noisy data sources. Bioinformatics, 20, 1205–1213.

Ito, T., et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574.

Jansen, R., et al. (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302, 449–453.

Kemmeren, R., et al. (2002) Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell., 9, 1133–1143.

Letunic, I., et al. (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32, D142–D144.

Li, S., et al. (2004) A map of the interactome network of the metazoan C.elegans. Science, 303, 540–543.

Lin, N., et al. (2004) Information assessment on predicting protein-protein interactions. BMC Bioinformatics, 5, 154.

Lu, L., et al. (2003) Multimeric threading-based prediction of protein-protein interactions on a genomic scale: application to the Saccharomyces cerevisiae proteome. Genome Res., 13, 1146–1154.

Marcotte, E.M., et al. (2001) Mining literature for protein-protein interactions. Bioinformatics, 17, 359–363.

Matunis, M.J., et al. (1998) SUMO-1 modification and its role in targeting the Ran GTPase-activating protein, RanGAP1, to the nuclear pore complex. J. Cell Biol., 140, 499–509.

Mewes, H.W., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32, D41–D44.

Mrowka, R., et al. (2001) Is there a bias in proteome research? Genome Res., 11, 1971–1973.

Papin, J. and Subramaniam, S. (2004) Bioinformatics and cellular signaling. Curr. Opin. Biotechnol., 15, 78–81.

Pazos, F. and Valencia, A. (2001) Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng., 14, 609–614.

Ramani, A.K. and Marcotte, E.M. (2003) Exploiting the co-evolution of interacting proteins to discover interaction specificity. J. Mol. Biol., 327, 273–284.

Scott, S.V., et al. (2000) Apg13p and Vac8p are part of a complex of phosphoproteins that are required for cytoplasm to vacuole targeting. J. Biol. Chem., 275, 25840–25849.

Souza, A.C., et al. (2002) Evidence for a modulation of neutral trehalase activity by Ca2+ and cAMP signaling pathways in Saccharomyces cerevisiae. Braz. J. Med. Biol. Res., 35, 11–16.

Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell., 9, 3273–3297.

Sprinzak, E. and Margalit, H. (2001) Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol., 311, 681–692.

Stade, K., et al. (2002) A lack of SUMO conjugation affects cNLS-dependent nuclear protein import in yeast. J. Biol. Chem., 277, 49554–49561.

Tsoka, S. and Ouzounis, C.A. (2000) Prediction of protein interactions: metabolic enzymes are frequently involved in gene fusion. Nat. Genet., 26, 141–142.

Tucker, C.L., et al. (2001) Towards an understanding of complex protein networks. Trends Cell Biol., 11, 102–106.

Uetz, P., et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627.

Vazquez, A., et al. (2003) Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol., 21, 697–700.

von Mering, C., et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417, 399–403.

Walhout, A.J., et al. (2000) Protein interaction mapping in C.elegans using proteins involved in vulval development. Science, 287, 116–122.

Wang, J. (2002) Protein recognition by cell surface receptors: physiological receptors versus virus interactions. Trends Biochem. Sci., 27, 122–126.