Abstract

Motivation: Due to the limitations in experimental methods for determining binary interactions and structure determination of protein complexes, the need exists for computational models to fill the increasing gap between genome sequence information and protein annotation. Here we describe a novel method that uses structural models to reduce a large number of in silico predictions to a high confidence subset that is amenable to experimental validation.

Results: A two-stage evaluation procedure was developed. First, a sequence-based method assessed the conservation of protein interface patches used in the original in silico prediction method, both in terms of position within the primary sequence and in terms of sequence conservation. When applying the most stringent conditions, it was found that 20.5% of the data set being assessed passed this test. Second, a high-throughput structure-based docking evaluation procedure assessed the soundness of three-dimensional models produced for the putative interactions. Of the data set being assessed, 8264 interactions (over 70%) could be modelled in this way, and 27% of these can be considered ‘valid’ by the applied criteria. In all, 6.9% of the interactions passed both tests and can be considered to be a high confidence set of predicted interactions, several of which are described.

Availability: http://bioinformatics.leeds.ac.uk/~bmb4sjc

Contact: r.m.jackson@leeds.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Comprehensive lists of macromolecules present in a number of organisms are now available thanks to the completion or near-completion of various genome sequencing projects (Adams et al., 2000; Lander et al., 2002). However, the gap between genome sequence information and protein annotation is increasing, and high-throughput functional genomics techniques, encompassing proteomics, bioinformatics, transcriptomics and other molecular biology approaches are required to bridge this gap between raw sequence information and relevant, functional protein annotation. Protein–protein interactions are of particular interest in this regard. They are ubiquitous in nature and it is widely accepted that they are involved in governing the majority of biological processes (von Mering et al., 2002). Transient associations between proteins form the basis of processes including signal transduction, hormone–receptor binding, antigen recognition by antibodies and enzyme inhibition. More permanent associations are required for those proteins where stability or function is determined by a multimeric state, for instance, oligomeric enzymes and structural protein assemblies.

A variety of techniques are now available to experimental biologists for discovering protein–protein interactions. Those used in a high-throughput manner for proteome-scale studies are the yeast two-hybrid system, affinity purification and protein microarrays. All of these high-throughput systems were established by studies on the eukaryotic budding yeast Saccharomyces cerevisiae (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Uetz, 2000; Zhu et al., 2001). It is suggested that the yeast interactome should involve some 30 000 protein–protein interactions (Kumar and Snyder, 2002; von Mering et al., 2002). High-throughput studies of interactions in yeast have identified, at most, ∼11 000 individual binary interactions, though the true number is likely to be lower owing to the potential for overlaps between the data sets (Kumar and Snyder, 2002; von Mering et al., 2002). Consequently, it is estimated that around two-thirds of the binary protein–protein interactions possible in a yeast cell have yet to be identified.

The structure of an interaction is required for an accurate mechanistic description of the biological process in which it is involved. A number of experimental methodologies are available for solving the three-dimensional (3D) structure of protein interactions: X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, electron microscopy and electron tomography. However, only a fraction of known protein–protein interactions currently have a known 3D structure. In the same way as all proteins are suggested to adopt one of around 1000 fold types in nature (Chothia, 1992), the structures of protein–protein interactions are suggested to conform to one of around 10 000 types (Aloy and Russell, 2004). The 3did database stores protein interactions of known structure (Stein et al., 2005). The ∼58 000 interactions (both intra- and inter-molecular interactions) currently represented in this database define around 2000 of the 10 000 predicted interaction types (Aloy and Russell, 2004).

Due to the limitations of experimental methods of binary interaction and complex determination, the need exists for computational models to fill the increasing gap between genome sequence information and protein annotation. The first in silico methods for predicting interactions between proteins such as gene fusion (Enright et al., 1999), gene neighbourhood (Overbeek et al., 1999) and phylogenetic profiles (Pellegrini et al., 1999) predict functional associations between proteins. The gene fusion approach, for instance, suggests that if one gene product in one organism appears to be expressed as two separate gene products in a second organism, a functional association between the two gene products is implied in that second organism. Although this can suggest a physical interaction between the proteins concerned, links between molecules may be less direct.

Recent advances in the prediction of new physical associations between proteins have adopted a threading-based approach to predict protein–protein interactions (MULTIPROSPECTOR, Lu et al., 2002; Jones et al., 2005). Threading is a well-established procedure for the prediction of the tertiary structure of proteins. A traditional threading approach [PROSPECTOR (Skolnick and Kihara, 2001)] was adapted by threading two proteins onto a protein complex template of known quaternary structure and incorporating a scoring function based on protein–protein interfacial energy. When the method was used on a set of known biological monomers and homodimers, it could successfully distinguish between the two. A later study applied the method on a genomic scale, considering every possible pairwise interaction in the yeast genome (Lu et al., 2003). The method predicted 7321 interactions, based on 304 templates. A study of yeast protein–protein interactions stored in the Munich Information Centre for Protein Sequences (MIPS) showed that certain sequence signatures were over-represented among the data (Sprinzak and Margalit, 2001). It was suggested that these sequence motifs could be used to predict other interactions. Other recent prediction attempts have been made using various machine learning techniques including decision trees (Zhang et al., 2004), Bayesian network models (Deng et al., 2002; Gomez et al., 2003) and kernels (Ben-Hur and Noble, 2005).

The conservation of an interaction is limited by the fact that the interface between the interacting partners must be conserved. Given this, and the finding that protein interaction prediction studies which have incorporated structural data, homology and interface locations have demonstrated improved results over previous studies, Espadaler et al. (2005) exploited the conservation of interfaces to produce a novel method for predicting interactions, called sequence search of interface patterns.

1.1 Sequence search of interface patterns (SSIP)

Using a non-redundant database of protein complexes of known 3D structure, Espadaler et al. (2005) isolated a set of pairs of non-identical interacting proteins (the ‘seeding set’). For a given protein complex, a set of characteristic interface ‘patches’ was defined. Any residue closer to the opposing protein than a cut-off is considered to be within an interface patch. The cut-off was chosen such that it produced at least two separate patches of residues in each interacting protein. For each patch, an artificial 100-sequence multiple alignment was then built by sequence substitution based on a set of rules relating to the physicochemical properties of the side chains of the residues. These alignments were then used to generate a profile Hidden Markov Model (HMM) for each interface patch [using the HMMER program (Eddy, 1998)]. The set of profile HMMs produced for a protein of a complex was used for searching sequences from SwissProt to identify new proteins that match at least two of the profile HMMs from the seeding protein under a P-value threshold of 0.05 (therefore the P-value of finding a protein with at least two patches was <0.025). The pairs formed by A′ and B′, where A′ was found using the profile HMMs of A, and B′ was found using the profile HMMs of B, form a set of putatively interacting protein pairs derived from the seeding interaction between A and B. Only pairs formed by proteins from the same species were considered further. For each putative interaction identified, the above process was repeated, with new alignments and profile HMMs being generated from the sequence that matches the original HMM and new searches of SwissProt being performed, until no new interactions could be added to the set of putative interactions generated by the original seeding pair. This method generated a set of 132 627 interactions involving 12 225 proteins from the original seeding set of 421 interactions.
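The distance cut-off rule for interface residues described above can be illustrated with a minimal sketch. It is not the implementation of Espadaler et al. (2005): the function name `interface_residues`, the dictionary layout and the toy coordinates are assumptions made purely for illustration, and the grouping of residues into separate patches is not shown.

```python
from math import dist

def interface_residues(chain_a, chain_b, cutoff):
    """Return ids of residues in chain_a with any atom closer than `cutoff`
    (in Angstroms) to any atom of chain_b.

    chain_a, chain_b: dict mapping residue id -> list of (x, y, z) atom
    coordinates (a hypothetical layout chosen for this sketch).
    """
    hits = []
    for res_id, atoms in chain_a.items():
        close = any(
            dist(atom_a, atom_b) < cutoff
            for atom_a in atoms
            for atoms_b in chain_b.values()
            for atom_b in atoms_b
        )
        if close:
            hits.append(res_id)
    return hits

# Toy coordinates, invented for illustration (not taken from any PDB entry).
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)], 3: [(0.0, 3.0, 0.0)]}
chain_b = {101: [(0.0, 0.0, 4.0)]}

print(interface_residues(chain_a, chain_b, cutoff=5.0))
```

In this toy case only residue 1 lies strictly within 5.0 Å of the partner chain; raising the cut-off brings residue 3 into the interface as well, which mirrors how the cut-off can be tuned until at least two patches appear.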

1.2 Structure relationship (SR)

If a pair of proteins is orthologous to a pair of interacting proteins from a different organism, it is assumed that this interaction is also conserved, and the orthologous interacting protein pairs are termed interologs (Matthews et al., 2001). Espadaler et al. (2005) extended this assumption to consider all possible relatives of the two interacting proteins (where relatives are defined as those proteins shown to share similar fold and function). The Structural Classification of Proteins (SCOP) database (Andreeva et al., 2004) was used to assign a fold to as many sequences in the Database of Interacting Proteins (DIP) as possible. In total, 4324 proteins in DIP could be assigned to a SCOP fold family, which covers around one-sixth of the proteins in DIP. Subsequently, each pair of proteins in DIP was expanded based on the assigned fold of at least one of the proteins of the pair. If neither protein had an assigned fold then the interaction was not expanded. If both proteins were of known fold then the resulting set of interactions was built from two sets of proteins [i.e. the set assigned to the fold family of A (m proteins) and the set assigned to the fold family of B (n proteins)]. It is assumed that the relatives of A and B, A′ and B′, can interact with one another, but also that A′ can interact with B, and that B′ can interact with A. By extending this observation to all relatives of A and B, m × n interactions can be predicted. If only one of the proteins of a DIP interaction is of known fold then the predicted set for that interaction is the size of the set assigned to that fold family (i.e. the interaction of each of that set with the other partner involved in the interaction).
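The m × n expansion described above can be sketched as a Cartesian product over fold-family members. This is an illustrative reading of the rule, not the code of Espadaler et al. (2005); the names `expand_interaction`, `fold_of` and `members_of`, and the toy identifiers, are hypothetical.

```python
from itertools import product

def expand_interaction(a, b, fold_of, members_of):
    """Expand one known interaction (a, b) to relatives sharing a fold family.

    fold_of: dict protein -> fold family (absent if no fold was assigned).
    members_of: dict fold family -> list of proteins, including the seed itself.
    """
    fold_a, fold_b = fold_of.get(a), fold_of.get(b)
    if fold_a is None and fold_b is None:
        return set()  # neither partner has an assigned fold: not expanded
    # A partner without an assigned fold stays as itself.
    relatives_a = members_of[fold_a] if fold_a is not None else [a]
    relatives_b = members_of[fold_b] if fold_b is not None else [b]
    return set(product(relatives_a, relatives_b))

# Toy data, invented for illustration.
fold_of = {"A": "fold1", "B": "fold2"}
members_of = {"fold1": ["A", "A1", "A2"], "fold2": ["B", "B1"]}

both_known = expand_interaction("A", "B", fold_of, members_of)
one_known = expand_interaction("A", "C", fold_of, members_of)
print(len(both_known), len(one_known))
```

With m = 3 and n = 2 the first call yields 6 predicted pairs, including the cross pairs (A′, B) and (A, B′); when only one partner has a fold, the set has the size of that single fold family.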

Due to the large size of the sets of predicted interactions produced by both the methods of Espadaler et al. (2005) outlined earlier, and the potential for false positives within them, the only predictions taken forward for further analysis were the intersection of the two sets of potential interactions defined by SSIP and SR.

1.3 Working hypothesis

The work carried out in this study was undertaken with the aim of evaluating the putative interactions produced in the work of Espadaler et al. (2005). To this end, two evaluation procedures were carried out: (1) sequence-based evaluation and (2) structure-based evaluation.

In the sequence-based evaluation, proteins of similar interface sequence are predicted to have a similar 3D structure in the interface. It therefore follows that in homologous or analogous proteins, sequences found at the same location in the primary sequence should be found at a similar point in the tertiary structure of the two proteins (Supplementary data: Fig. Ia). A sequence-based evaluation of the interactions involving putative interactors that are analogous to the seeding interactors was carried out. This involves performing a sequence alignment of the seeding interactor with the predicted interactor, informed by the location of the interface patch sequences, and assessing the resulting alignment with regard to the relative positions of those patches in the seeding sequence and the putative sequence (Supplementary data: Fig. Ib).

Structure-based evaluation procedures have already shown some success (Aloy and Russell, 2003; Lu et al., 2002). In all predicted cases from the study by Espadaler et al. (2005) the seeding interface sequence is preserved. The SSIP method suggests that if these predicted interactions actually occur in nature, then the interface of the predicted interaction should be the same as that of the seeding interaction. Therefore, its orientation in 3D space should also be maintained. By superimposing the putative interface residues on the seeding interface residues, the 3D structure of the putative interaction can be predicted (Supplementary data: Fig. II), in cases where suitable 3D models of the tertiary structure of the putative interactors are available. We have called this ‘Comparative Docking’ since only the interaction interfaces are assumed to be conserved and not necessarily the interacting partners and the resulting model produces a docked 3D protein–protein complex. These models can be further assessed using structure-based criteria.

2 METHODS

2.1 Databases

A number of databases were used for the analysis presented in this study. These were as follows: (1) the SwissProt database of protein sequences [now part of the Universal Protein Resource (UniProt)]; (2) the Protein Data Bank [PDB (Berman et al., 2000)]; (3) ModBase, a database of homology models of protein tertiary structure produced by the Modeller package for fold assignment (Fiser and Sali, 2003; Pieper et al., 2004); (4) the SwissModel Repository, a database of homology models produced by SwissModel (Kopp and Schwede, 2004; Schwede et al., 2003).

2.2 Protein–protein interaction data sets

Two distinct sets of predicted protein–protein interactions were used for the analyses, both defined by the work of Espadaler et al. (2005). The I1 data set consisted of interactions whose association is confirmed by the presence of an entry in the experimentally determined DIP. The I2 data set consisted of interacting pairs where both partners share at least one domain of the same family with one of the proteins from a pair of interacting proteins in DIP. I1 contained 42 interacting protein pairs; I2 was considerably larger, consisting of 11 759 interacting pairs (involving 3049 proteins).

2.3 Assessment of putative interactions by sequence-based modelling

Interface patches on the seeding pairs were defined as sets of five or more residues present up to a set distance away from the interacting partner (2–5 Å, set to produce at least two separate patches in each interaction partner). Systematic single residue substitutions in these patch sequences produced artificial multiple alignments, which were in turn used to build profile HMMs, using HMMER (Eddy, 1998). In this study, these patch positions were mapped to the primary sequence of the seed proteins, and the profile HMMs used to map the equivalent patches to the sequence of the putative interactors (using HMMalign, part of the HMMER package). This process resulted in two profiles for each partner of a putative interaction: (1) the seed sequence and its patch sequences and (2) the putative interactor and its patch sequences. The MUSCLE protein sequence alignment program (Edgar, 2004) was then used in profile mode to align these two profiles to one another. The resulting multiple alignment could then be scored for overlap between the patch residues in the seeding interactor and the predicted interactor. Two distinct scoring functions were used.

2.3.1 The patch scoring function

Seed patch positions in the profile alignment were scored for whether a patch residue was also present at that position of the alignment in the putative sequence. Binary scores were used: 1 if a patch residue was present in the putative sequence at the position of a seed patch residue, and 0 if no equivalent patch residue was present. Dividing the resultant tally by the total number of patch residues in the seed produces a score out of 1, the patch overlap score (Supplementary data: Fig. IIIa).
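The scoring rule above can be written down in a few lines. The sketch below assumes, for illustration only, that each aligned sequence is represented as a list of 0/1 flags per alignment column marking patch residues; the function name `patch_overlap_score` and the toy flags are not from the original work.

```python
def patch_overlap_score(seed_patch, putative_patch):
    """Fraction of seed patch columns that also hold a putative patch residue.

    seed_patch, putative_patch: equal-length sequences of 0/1 flags, one per
    alignment column (1 = the residue in that column belongs to a patch).
    """
    seed_total = sum(seed_patch)
    if seed_total == 0:
        raise ValueError("seed sequence has no patch residues")
    matched = sum(1 for s, p in zip(seed_patch, putative_patch) if s and p)
    return matched / seed_total

# Toy alignment columns, invented for illustration.
seed_flags = [0, 1, 1, 1, 0, 0, 1, 1]
putative_flags = [0, 1, 1, 0, 0, 1, 1, 1]

print(patch_overlap_score(seed_flags, putative_flags))
```

Here four of the five seed patch columns are matched, giving 0.8. Putative patch residues that fall outside the seed patch (column 6 in the toy data) do not affect the score, because the tally is normalised by the seed patch size only.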

2.3.2 The patch sequence conservation scoring function

The patch sequence conservation score was produced in a similar way to the patch overlap score. Seed patch positions were considered for the presence of a putative patch residue; however, the simple presence of a patch residue at the same position is not sufficient to score. The residue present must conform to the rules of substitution used when producing the artificial alignments for profile HMM production. If the rules are obeyed then the residue scores 1; otherwise it scores 0. The score is calculated in the same way as the patch overlap score, producing a score out of 1, which will always be equal to or lower than the patch overlap score (Supplementary data: Fig. IIIb).
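A minimal sketch of this stricter score follows. The substitution rules used in the original work are not reproduced in this text, so the residue groups in `GROUPS` are a hypothetical stand-in based on broad physicochemical classes; the function names and toy sequences are likewise assumptions.

```python
# Hypothetical substitution groups (NOT the published rules).
GROUPS = [set("AVLIMC"), set("FWY"), set("ST"), set("NQ"),
          set("DE"), set("KRH"), set("G"), set("P")]

def substitution_allowed(seed_res, putative_res):
    """True if the residues are identical or share a hypothetical group."""
    if seed_res == putative_res:
        return True
    return any(seed_res in g and putative_res in g for g in GROUPS)

def patch_conservation_score(seed_seq, putative_seq, seed_patch, putative_patch):
    """Fraction of seed patch columns matched by an allowed putative patch residue.

    Sequences are aligned strings of equal length; patch arguments are 0/1
    flags per alignment column.
    """
    seed_total = sum(seed_patch)
    if seed_total == 0:
        raise ValueError("seed sequence has no patch residues")
    columns = zip(seed_seq, putative_seq, seed_patch, putative_patch)
    conserved = sum(
        1 for s, p, in_seed, in_putative in columns
        if in_seed and in_putative and substitution_allowed(s, p)
    )
    return conserved / seed_total

# Toy aligned sequences and patch flags, invented for illustration.
seed_seq, putative_seq = "GDKLSTFA", "GEKDATYA"
seed_flags = [0, 1, 1, 1, 0, 0, 1, 1]
putative_flags = [0, 1, 1, 1, 0, 0, 1, 0]

print(patch_conservation_score(seed_seq, putative_seq, seed_flags, putative_flags))
```

In the toy data four seed patch columns overlap a putative patch residue (overlap 0.8), but the L to D substitution is rejected by the assumed groups, so only three columns count and the conservation score is lower than the overlap score, as the text requires.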

2.4 Assessment of putative interactions by structure-based modelling

2.4.1 Homology modelling

In order to construct 3D models of the putative interactions predicted by Espadaler et al. (2005), 3D models of the individual proteins are required. The database of homology models, ModBase (Pieper et al., 2004), was interrogated for the presence of homology models for the proteins found in the I1 and I2 data sets. All models found, regardless of patch coverage, were retrieved from ModBase directly. If no model was found, the SwissModel Repository (Kopp and Schwede, 2004) was then checked. A total of 18 207 homology models were downloaded for the 3393 proteins able to be modelled.

2.4.2 Interface structure superposition

It is assumed that the putative interactors associate using the same interface patches as the seed interactors, so the geometry of this conserved interface should also be conserved. Using the residues involved in the seed interface and the corresponding residues in the putative interactor, the latter can therefore be transformed so that its ‘interface’ is aligned in 3D with that of the seed, using the least-squares fitting method of McLachlan (1982). By performing this process with both putative interactors, the 3D models of the putative interactions were produced. Only patches found to be overlapping in the sequence alignments were considered for least-squares fitting. An assessment was then made as to which of the stored homology models of the putative interactor was most appropriate given the patch positions. A model providing complete coverage of the interface patches was selected if available; otherwise, a model covering as many patches as possible was used. Patches that contained gaps in the putative sequence were corrected so that the corresponding patches in the seed sequence contained an equal number of atoms. This process was repeated for the second chain of the interaction and the results concatenated to give the 3D model of the putative protein–protein interaction. The validity of the putative interaction models was assessed by considering the proximity of the alpha-carbon (Cα) atoms on opposing chains of the interaction to one another. If two Cα atoms were within 3.5 Å of one another, they were considered a ‘steric clash’. The total number of steric clashes in a given model was recorded.
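The clash criterion at the end of this procedure can be sketched as follows. The sketch assumes that each clashing pair of Cα atoms is counted once, which is one plausible reading of the text; the function name `count_steric_clashes` and the toy coordinates are hypothetical, and the superposition step itself is not shown.

```python
from math import dist

def count_steric_clashes(ca_chain_1, ca_chain_2, threshold=3.5):
    """Count inter-chain C-alpha pairs lying within `threshold` Angstroms.

    ca_chain_1, ca_chain_2: lists of (x, y, z) C-alpha coordinates, one list
    per chain of the docked model. Each close pair counts as one clash
    (an assumption of this sketch).
    """
    return sum(
        1
        for atom_1 in ca_chain_1
        for atom_2 in ca_chain_2
        if dist(atom_1, atom_2) <= threshold
    )

# Toy C-alpha traces, invented for illustration.
chain_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
chain_2 = [(0.0, 3.0, 0.0), (3.8, 6.0, 0.0), (20.0, 0.0, 0.0)]

print(count_steric_clashes(chain_1, chain_2))
```

The all-against-all loop is quadratic in chain length, which is acceptable for a sketch; a spatial grid or k-d tree would be the usual choice when many thousands of models have to be scored.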

3 RESULTS

Putative interactors may either be homologues of their seeding predictor (member of the same Pfam family) or non-homologues. This means that the predicted interactions can fall into one of three classes. HH interactions are those putative interactions that consist of two homologues of the seeding interaction (i.e. one homologue of each chain). HX interactions are those putative interactions that involve one homologous interactor and one non-homologous interactor. Finally, in XX interactions, neither putative interactor is a homologue of the respective seeding chain.
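The three-way classification can be expressed compactly. In the sketch below, homology is reduced to sharing at least one Pfam family label, as defined above; the function names and the family labels are placeholders rather than real Pfam accessions.

```python
def is_homologue(seed_families, putative_families):
    """True if the two proteins share at least one Pfam family label."""
    return bool(set(seed_families) & set(putative_families))

def interaction_class(seed_a, putative_a, seed_b, putative_b):
    """Classify a predicted interaction as 'HH', 'HX' or 'XX'.

    Each argument is a collection of Pfam family labels for one protein;
    putative_a was predicted from seed chain a, putative_b from seed chain b.
    """
    n_homologous = int(is_homologue(seed_a, putative_a)) + int(
        is_homologue(seed_b, putative_b)
    )
    return ("XX", "HX", "HH")[n_homologous]

# Placeholder family labels, invented for illustration.
protease, inhibitor, other = {"fam_protease"}, {"fam_inhibitor"}, {"fam_other"}

print(interaction_class(protease, protease | other, inhibitor, inhibitor))
print(interaction_class(protease, protease, inhibitor, other))
print(interaction_class(protease, other, inhibitor, other))
```

The three calls give one example of each class in turn: both partners homologous, one homologous, and neither homologous to its seeding chain.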

The distribution of the I1 and I2 data sets within these three categories shows that the majority of the 42 interactions in the I1 data set are HH interactions (88%), with few HX (7%) and XX (5%) interactions, whereas the majority of the 11 759 interactions in the I2 data set are HX interactions (53.8%), with proportionately fewer HH (34%) and more XX (12%) interactions. It is expected that the most reliable predictions will be the HH set, so it follows that the high confidence I1 data set should be made up largely of HH interactions.

3.1 Sequence-based evaluation of predicted interactions

The sequences of the seeding interactions and the putative interactors that they predict were aligned (see Methods). The resulting alignments were then scored by the mean Patch Overlap Score and the mean Patch Sequence Conservation Score for the two interactors. The distributions of the scores for the I1 and I2 data sets are shown in Figure 1. Of the I1 data set, 78.6% have Patch Overlap Scores >0.9, and 90.9% of those (71.4% of the data set) maintain a Patch Sequence Conservation Score of >0.9 (Fig. 1A). In line with the lower confidence of the I2 data set, only 20.5% have a Patch Overlap Score over 0.9 and only 5.9% of those interactions (1.2% of the data set) have a Patch Sequence Conservation Score of >0.9 (Fig. 1B).

Fig. 1.

Distribution of Patch Overlap (blue) and Patch Sequence Conservation (red) Scores. (A) Distribution of scores in the I1 data set. (B) Distribution of scores in the I2 data set. (A colour version of this figure is available as supplementary material.)

3.1.1 Distribution of the three interaction classes within the sequence-based results

The results of the I2 data set can be further analysed in terms of how the Patch Overlap and Patch Sequence Conservation Scores are distributed within the three interaction classes (Supplementary data: Fig. IV). What is clear is that there is a bias towards HH interactions in the groups with a high Patch Overlap Score (χ²-test, P-value = 9.18 × 10⁻⁶⁴, Supplementary data: Fig. IVa). This bias becomes even more pronounced in the interactions that also have a high conservation score (Supplementary data: Fig. IVb), to the point where 86.7% (1187/1369) of the I2 interactions having a Patch Sequence Conservation Score greater than 0.7 are HH interactions. Only 12 interactions (0.88%) of this group are XX interactions (0.86% of the XX interactions of the I2 data set).

Are the scores for HH interactions under this analysis high because the proteins involved are homologues of one another, or are they high because the interfaces are genuinely preserved? The answer to this question lies in the fact that even members of the same Pfam family can be quite divergent in terms of sequence. For instance, chain C of the seeding complex 1hja_CI (α-chymotrypsin (C) and ovomucoid (I) inhibitor) predicts the human Complement C2 protein (P06681) to be a member of the 6 HH interactions of the I2 data set. P06681 contains a trypsin domain, meaning that this and the α-chymotrypsin are members of the same Pfam family and are therefore classed as homologues. The alignment of these two proteins gives a Patch Overlap Score of 1, and a Patch Sequence Conservation Score of 0.69. However, the sequence identity between 1hjaC and P06681 is only 23%. Given this background of relative divergence, the overlap of the patches and indeed the level of conservation between them are extremely unlikely to have occurred by chance.

In another example, one of the predicted ‘inhibitors’ of the human Complement C2 protein, the Serine Protease Inhibitor Kazal-type-4 (O60575), has only 31.37% identity with ovomucoid (1hjaI); however, this is a member of the same Pfam family and the interaction between P06681 and O60575 seems perfectly valid under the tests performed during this study (Patch Overlap Score = 1, Clash Score = 16). This shows that a high Patch Overlap Score does not simply result from high levels of sequence identity, but can also arise from the genuine preservation of patches in more distantly related sequences.

We performed analysis of the Patch Overlap Scores versus the percentage sequence identity between the seed interactor and the putative interactor it predicts for individual proteins involved in HH interactions of the I2 data set (Supplementary data: Fig. Va). There are many cases where a low sequence identity does not necessarily correlate with a low Patch Overlap Score. A similar analysis was performed with Patch Sequence Conservation versus sequence identity (Supplementary data: Fig. Vb). Again, and using the point P06681-1hjaC as an example, although Patch Sequence Conservation scores and sequence identity do seem to be more highly correlated than the Patch Overlap Scores, there are many cases where a high Patch Sequence Conservation Score does not necessarily correspond with a high sequence identity.

3.2 Structure-based evaluation of predicted interactions

3.2.1 Homology modelling of putative interactions

It is suggested that up to 60% of the sequences in SwissProt are now either of known structure or are of sufficient homology with a protein of known structure to permit a model of varying accuracy to be constructed (Pieper et al., 2004). Any sequence identity >30% between a protein to be modelled and a protein of known structure is sufficient for a reasonable prediction of tertiary structure (Tramontano and Morea, 2003). The Structure Relationship (SR) method of predicting protein–protein interactions developed by Espadaler et al. (2005) means that the predictions of the I1 and I2 data sets are related to proteins in the DIP database that can be assigned to a SCOP fold family, and therefore are biased towards proteins that are able to be modelled. In line with this expectation, at least one model is stored in the ModBase homology model database (Pieper et al., 2004) for 3019 of the 3058 proteins involved in the interactions of the I1 and I2 data sets (98.7%). The SwissModel Repository (Kopp and Schwede, 2004) contains models for a further five proteins, meaning that only 34 proteins have no homology model.

The 34 proteins that cannot be modelled are involved in 176 interactions of the I2 data set. Homology models that do not cover the interface patches and alignments with no overlapping patches mean that interaction models could not be produced for two interactions of the I1 data set, as well as a further 3319 interactions of the I2 data set. Nevertheless, 8264 interactions (70.3%) of the I2 data set were modelled.

3.2.2 Steric compatibility

The interaction models that could be produced were scored for steric clashes (see Methods). The distribution of the scores between 0 and 200 clashes is shown in Figure 2. Thirty-two of the 40 modelled I1 interactions (80%) score less than 10 steric clashes, again reflecting the high confidence placed in this data set. The highest clash score of any interaction in the I1 data set is 110 (Fig. 2A). Although there are some significantly higher scores among the modelled interactions of the I2 data set, the bias in the scores is clearly towards lower scores, with the largest group (2480 interactions, 30.0%) being the 0–10 clashes group (Fig. 2B). A further 244 interactions of the I2 data set were found to have models with more than 200 steric clashes (data not shown).

Fig. 2.

Distribution of Clash Scores. (A) Distribution of scores in the I1 data set. (B) Distribution of scores in the I2 data set, also showing the distribution of the clash scores within the three interaction classes. (A colour version of this figure is available as supplementary material.)

3.2.3 Distribution of the three interaction classes within the structure-based results

The interaction class distribution of the I2 interactions that could be modelled (8264 of 11 759) is different to the distribution of the I2 data set as a whole (χ²-test, P-value = 6.27 × 10⁻⁶⁴): more HH interactions are represented in the modelled set (40.8%, compared to 34.3% of the whole set). Of the modelled interactions, 50.75% are HX and 8.46% are XX.

The putative interactions with the lowest clash scores also show a strong bias toward HH interactions (χ²-test, P-value = 2.83 × 10⁻¹²³, Fig. 2B). Of the interactions that score less than 50 clashes in the I2 data set (5348 interactions), more than half (56.15%) are HH interactions (compared to 40.8% of all the modelled interactions). The bias in the best scoring group (0–10 clashes) is even more pronounced, with 73.29% of this group being HH interactions. The biased distribution reflects the higher confidence that can be placed in predicted interactions that involve relatives of the seeding interaction.

3.3 Intersection of sequence-based and structure-based evaluation procedures

Comparison of the distribution of the clash scores for the interactions of the I2 data set with Patch Overlap scores of more than 0.9 with that for I2 as a whole (Supplementary data: Fig. VI) shows the >0.9 Patch Overlap Score group of interactions has a greater proportion of members scoring 0–10 clashes than the data set as a whole (51.1% versus 30.0%). It also shows a higher proportion of interactions scoring 0–50 clashes (80.3% versus 64.7%).

The distributions of the Patch Overlap and Patch Sequence Conservation Scores for the interactions of the I2 data set with low clash scores also confirm the likely greater validity of models with a low clash score (data not shown). The Patch Overlap Scores in the <50 clashes set are generally higher, with a greater proportion of 0.9–1.0 scores than the data set as a whole (33.2% versus 20.6%). The Patch Sequence Conservation Scores are also generally higher in the <50 clashes set than in the data set as a whole.

The tendency of interactions that perform well in one test (Patch Overlap Score or Clash Score) to also score better than average in the other provides a measure of confidence that both sequence and structural methods are genuinely identifying a set of interactions that are more likely to be of biological significance. It is also the case that the highest confidence interactions of the I2 data set should be those that perform best in both sequence-based and structure-based evaluation procedures.

3.4 Cross-validation, a high confidence subset of the I2 data set

By taking those interactions with a Patch Overlap Score of >0.9 and a clash score of <20, a set of high confidence interactions can be defined from the I2 data set. This set consists of 1297 interactions, of which 1100 (84.8%) are HH interactions, 165 (12.7%) are HX interactions and only 32 (2.5%) are XX interactions. This distribution of the interactions among the three interaction classes is closer to that of the I1 data set than it is to that of the I2 (Fig. 3). It is also clear from the distributions shown in Figure 3 that the clash score provides better discrimination between ‘good’ and ‘bad’ interactions than the Patch Overlap Score.
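The selection rule stated above (Patch Overlap Score >0.9 and clash score <20) amounts to a simple filter followed by a tally of interaction classes. The record layout, field names and toy values below are assumptions made for illustration and are not taken from the study's data.

```python
from collections import Counter

def high_confidence(records, min_overlap=0.9, max_clashes=20):
    """Keep interactions with Patch Overlap Score > min_overlap and
    clash score < max_clashes (both thresholds are strict)."""
    return [
        r for r in records
        if r["patch_overlap"] > min_overlap and r["clashes"] < max_clashes
    ]

# Toy records, invented for illustration.
records = [
    {"pair": ("P1", "P2"), "cls": "HH", "patch_overlap": 1.00, "clashes": 1},
    {"pair": ("P3", "P4"), "cls": "HX", "patch_overlap": 0.95, "clashes": 35},
    {"pair": ("P5", "P6"), "cls": "XX", "patch_overlap": 0.92, "clashes": 10},
    {"pair": ("P7", "P8"), "cls": "HH", "patch_overlap": 0.90, "clashes": 0},
]

selected = high_confidence(records)
class_counts = Counter(r["cls"] for r in selected)
print(class_counts)
```

Only the first and third toy records pass both thresholds: the second has too many clashes, and the fourth sits exactly on the overlap threshold and is excluded because the comparison is strict.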

Fig. 3.

Distribution of interaction class within sets of varying confidence. The more stringent the conditions applied to the I2 data set, in terms of the two tests, the closer the distribution of interaction types resembles the distribution of the experimentally confirmed I1 data set. (A colour version of this figure is available as supplementary material.)

It is the 32 high confidence XX interactions that are potentially the most interesting of this set, since they are the most likely to be novel. This set of interactions was further examined for interesting cases.

3.4.1 Dihydrolipoyl dehydrogenase homodimer

The seeding interaction 1aer_AB (Pseudomonas aeruginosa exotoxin A, chains A and B) predicts a homodimer of P18925 (dihydrolipoyl dehydrogenase of Azotobacter vinelandii). P18925 shares 35.9% identity with 1aer chain A and 34.51% identity with 1aer chain B, though it is not a member of the same Pfam family as either chain of the seeding interaction. Lessard et al. (1998) purified dihydrolipoyl dehydrogenase from Bacillus stearothermophilus as a homodimer. This study provides experimental evidence that a homologue of the protein concerned exists as a homodimer in nature, lending credence to the prediction. When examined with the assessment methods described in this study, the P18925 homodimer had a Patch Overlap Score of 1.00 and the 3D model of the interaction had one steric clash (Fig. 4A).

Fig. 4.

Structures of high confidence predicted interactions. (A) Homodimer of dihydrolipoyl dehydrogenase from Azotobacter vinelandii (P18925), the model has a clash score of 1. (B) Interaction between the insulin receptor (P06213, red chain) and insulin receptor substrate 1 (IRS1, P35568, blue chain). The model has a clash score of 10. (C) Interaction between the insulin receptor (P06213, red chain) and epidermal growth factor receptor substrate 15 (EPS15, blue chain). (A colour version of this figure is available as supplementary material.)

3.4.2 Insulin receptor—insulin receptor substrate 1

As its name suggests, the insulin receptor substrate 1 (IRS1) protein is phosphorylated by the insulin receptor and is involved in mediating the control of various cellular processes by insulin (Tanti et al., 1994). The interaction between the two human proteins IRS1 (P35568) and insulin receptor (P06213) is predicted by the seeding complex 1hja_CI, which is an interaction between α-chymotrypsin and its inhibitor ovomucoid. The structural model of the interaction has a clash score of 10 (Fig. 4B), and the Patch Overlap Score for the interaction is 1.00.

Interestingly, a large number of interactions within this high confidence set are predicted by 1hja_CI (147) and involve the insulin receptor, or the insulin-like growth factor receptor (12). Other interactions predicted by 1hja_CI involve transmembrane receptor tyrosine kinases, possibly hinting at a general mechanism of interaction between such molecules and their substrates. Some members of this group of putative interactions may prove to be particularly interesting. Although it does not quite fall into the highest confidence set, the interaction between the human insulin receptor (P06213) and epidermal growth factor receptor substrate 15 (EPS15, P42566) is just such an interaction. EPS15 is involved in the internalization of the epidermal growth factor receptor (Chen et al., 1998). The prediction of this particular interaction (Patch Overlap Score 0.92, clash score 23; Fig. 4C) suggests either that EPS15 may be involved more generally in the internalization of ligand-activated receptor tyrosine kinases, or that EPS15 itself is not directly involved but a similar, as yet unidentified, protein performs an equivalent function for the insulin receptor.

3.5 Co-localization index

If both protein partners of a given interaction are found in the same cellular compartment, that interaction is more likely to occur. The co-localization index (Lu et al., 2003) is the ratio of the number of protein pairs in which both partners have the same gene ontology (GO) subcellular localization annotation (Nsame) over the number of protein pairs where both partners have any subcellular annotation (Nany), i.e.

\[ C_i = \frac{N_{\mathrm{same}}}{N_{\mathrm{any}}} \]

The Ci of the high confidence set of 1297 interactions is 0.60 (455/764). This is better than the value of 0.56 achieved by the multimeric threading prediction method assessed in this way by Lu et al. (2003), and is also higher than a number of other prediction methods assessed in the same way in the same study. The I1 data set has a Ci of 0.83.
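As an illustration of how such an index might be computed, the following Python sketch counts annotated pairs and pairs sharing a localization. The function name, the toy protein identifiers and the rule that two partners count as having the "same" localization when they share at least one annotation term are all assumptions made for this example. They are not taken from Lu et al. (2003) or from the implementation used in this study.

```python
def colocalization_index(pairs, localization):
    """Return N_same / N_any for a list of protein pairs.

    pairs: iterable of (protein_a, protein_b) identifiers.
    localization: dict mapping a protein to a set of subcellular annotation terms.
    Pairs in which either partner lacks annotation are ignored.
    """
    n_same = 0
    n_any = 0
    for a, b in pairs:
        loc_a = localization.get(a)
        loc_b = localization.get(b)
        if not loc_a or not loc_b:
            continue  # at least one partner has no subcellular annotation
        n_any += 1
        if loc_a & loc_b:  # assumption: one shared term counts as "same"
            n_same += 1
    if n_any == 0:
        raise ValueError("no pair has annotation for both partners")
    return n_same / n_any


# Hypothetical toy data, for illustration only.
toy_pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "E")]
toy_localization = {
    "A": {"nucleus"},
    "B": {"nucleus", "cytoplasm"},
    "C": {"membrane"},
    "D": {"cytoplasm"},
}
print(colocalization_index(toy_pairs, toy_localization))

# The value reported for the high confidence set: 455 of 764 annotated pairs.
print(round(455 / 764, 2))
```

In the toy data, the pair ("C", "E") is skipped because "E" has no annotation, so only three pairs contribute to the denominator.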

3.6 Assessment by InterPreTS

All the interactions of the I2 data set were assessed using the InterPreTS (Interaction Prediction through Tertiary Structure) methodology (Aloy and Russell, 2002, 2003). The Z-scores produced were considered for their significance. A Z-score ≥2.3 indicates a prediction that the interaction occurs in a similar way to the known complex structure (i.e. the seeding interaction), with a confidence of 99%. A Z-score ≥1.3 indicates a significance of 90% and a Z-score <1.3 indicates that the predicted interaction does not occur in the same manner as the known complex (Aloy and Russell, 2003). A total of 4283 interactions of the I2 data set have a Z-score ≥1.3 (36.4%), and 51.8% of these interactions (2218) have a Z-score ≥2.3. The 90% confidence group of the high confidence subset of interactions is proportionally larger, with 57.3% of the 1297 interactions having a Z-score ≥1.3 (747 interactions). Of this group, 68.4% fall within the 99% confidence limit (511 interactions). There is therefore more than a 2-fold enrichment in the number of complexes with a confidence of 99% for the high confidence subset with respect to the I2 data set.
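The 2-fold enrichment can be checked from the figures quoted above. The following Python sketch is a worked calculation only. The variable names are illustrative, and the share of I2 at the 99% level is derived here from the reported 36.4% rather than from the total size of I2, which is not restated in this section.

```python
# Figures reported in the text.
hc_total = 1297       # interactions in the high confidence subset
hc_z99 = 511          # of these, interactions with Z-score >= 2.3
i2_z90 = 4283         # I2 interactions with Z-score >= 1.3
i2_z99 = 2218         # I2 interactions with Z-score >= 2.3
i2_z90_share = 0.364  # reported share of I2 with Z-score >= 1.3

# Share of each set that reaches the 99% confidence level.
hc_z99_share = hc_z99 / hc_total
i2_z99_share = i2_z90_share * i2_z99 / i2_z90

enrichment = hc_z99_share / i2_z99_share
print(f"high confidence subset: {hc_z99_share:.1%}")
print(f"I2 data set:            {i2_z99_share:.1%}")
print(f"enrichment:             {enrichment:.2f}-fold")
```

Under these figures, roughly 39% of the high confidence subset reaches the 99% level compared with roughly 19% of I2 as a whole, which is consistent with the enrichment stated above.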

3.7 Online availability

The I1 data set and the high confidence subset of the I2 data set, along with the results of the evaluation methods presented in this study, have been made available online at http://bioinformatics.leeds.ac.uk/~bmb4sjc. The data can be browsed by data set and seeding interaction, or searched by SwissProt accession number and name for interactions involving a protein of interest. For each interaction, the results of both evaluation stages and the alignments for the sequence-based evaluation can be viewed, and the 3D quaternary structure model of the complex can be viewed or downloaded (if available).

ACKNOWLEDGEMENTS

We would like to thank Patrick Aloy for both his suggestions and his assistance with InterPreTS. This work was supported by the BBSRC in the form of a studentship for S.J.C. B.O. acknowledges support from the Spanish Ministerio de Educación y Ciencia (MEC, BIO02005-00533) and EU grant (IST-507585).

Conflict of Interest: none declared.

REFERENCES

Adams MD, et al. The genome sequence of Drosophila melanogaster. Science, 2000, vol. 287 (pg. 2185-2195)
Aloy P, Russell RB. Interrogating protein interaction networks through structural biology. Proc. Natl. Acad. Sci. USA, 2002, vol. 99 (pg. 5896-5901)
Aloy P, Russell RB. InterPreTS: protein interaction prediction through tertiary structure. Bioinformatics, 2003, vol. 19 (pg. 161-162)
Aloy P, Russell RB. Ten thousand interactions for the molecular biologist. Nat. Biotechnol., 2004, vol. 22 (pg. 1317-1321)
Andreeva A, et al. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 2004, vol. 32 (pg. D226-D229)
Ben-Hur A, Noble WS. Kernel methods for predicting protein-protein interactions. Bioinformatics, 2005, vol. 21 Suppl. 1 (pg. i38-i46)
Berman HM, et al. The Protein Data Bank. Nucleic Acids Res., 2000, vol. 28 (pg. 235-242)
Chen H, et al. Epsin is an EH-domain-binding protein implicated in clathrin-mediated endocytosis. Nature, 1998, vol. 394 (pg. 793-797)
Chothia C. Proteins. One thousand families for the molecular biologist. Nature, 1992, vol. 357 (pg. 543-544)
Deng M, et al. Inferring domain-domain interactions from protein-protein interactions. Genome Res., 2002, vol. 12 (pg. 1540-1548)
Eddy SR. Profile hidden Markov models. Bioinformatics, 1998, vol. 14 (pg. 755-763)
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 2004, vol. 32 (pg. 1792-1797)
Enright AJ, et al. Protein interaction maps for complete genomes based on gene fusion events. Nature, 1999, vol. 402 (pg. 86-90)
Espadaler J, et al. Prediction of protein-protein interactions using distant conservation of sequence patterns and structure relationships. Bioinformatics, 2005, vol. 21 (pg. 3360-3368)
Fiser A, Sali A. Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol., 2003, vol. 374 (pg. 461-491)
Gavin AC, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 2002, vol. 415 (pg. 141-147)
Gomez SM, et al. Learning to predict protein-protein interactions from protein sequences. Bioinformatics, 2003, vol. 19 (pg. 1875-1881)
Ho Y, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 2002, vol. 415 (pg. 180-183)
Ito T, et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA, 2001, vol. 98 (pg. 4569-4574)
Jones DT, et al. Prediction of novel and analogous folds using fragment assembly and fold recognition. Proteins, 2005, vol. 61 Suppl. 7 (pg. 143-151)
Kopp J, Schwede T. The SWISS-MODEL Repository of annotated three-dimensional protein structure homology models. Nucleic Acids Res., 2004, vol. 32 (pg. D230-D234)
Kumar A, Snyder M. Protein complexes take the bait. Nature, 2002, vol. 415 (pg. 123-124)
Lander ES, et al. Initial sequencing and analysis of the human genome. Nature, 2001, vol. 409 (pg. 860-921)
Lessard IA, et al. Expression of genes encoding the E2 and E3 components of the Bacillus stearothermophilus pyruvate dehydrogenase complex and the stoichiometry of subunit interaction in assembly in vitro. Eur. J. Biochem., 1998, vol. 258 (pg. 491-501)
Lu L, et al. MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins, 2002, vol. 49 (pg. 350-364)
Lu L, et al. Multimeric threading-based prediction of protein-protein interactions on a genomic scale: application to the Saccharomyces cerevisiae proteome. Genome Res., 2003, vol. 13 (pg. 1146-1154)
Matthews LR, et al. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Res., 2001, vol. 11 (pg. 2120-2126)
McLachlan AD. Rapid comparison of protein structures. Acta Cryst., 1982, vol. A38 (pg. 871-873)
Overbeek R, et al. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA, 1999, vol. 96 (pg. 2896-2901)
Pellegrini M, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA, 1999, vol. 96 (pg. 4285-4288)
Pieper U, et al. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res., 2004, vol. 32 (pg. D217-D222)
Schwede T, et al. SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res., 2003, vol. 31 (pg. 3381-3385)
Skolnick J, Kihara D. Defrosting the frozen approximation: PROSPECTOR–a new approach to threading. Proteins, 2001, vol. 42 (pg. 319-331)
Sprinzak E, Margalit H. Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol., 2001, vol. 311 (pg. 681-692)
Stein A, et al. 3did: interacting protein domains of known three-dimensional structure. Nucleic Acids Res., 2005, vol. 33 (pg. D413-D417)
Tanti JF, et al. Serine/threonine phosphorylation of insulin receptor substrate 1 modulates insulin receptor signaling. J. Biol. Chem., 1994, vol. 269 (pg. 6051-6057)
Tramontano A, Morea V. Assessment of homology-based predictions in CASP5. Proteins, 2003, vol. 53 Suppl. 6 (pg. 352-368)
Uetz P, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 2000, vol. 403 (pg. 623-627)
von Mering C, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 2002, vol. 417 (pg. 399-403)
Zhang LV, et al. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics, 2004, vol. 5 (pg. 38)
Zhu H, et al. Global analysis of protein activities using proteome chips. Science, 2001, vol. 293 (pg. 2101-2105)

Associate Editor: Dmitrij Frishman
