Abstract

Motivation

Protein–protein interactions (PPI) are essential for the function of the cellular machinery. The rapid growth of protein–protein complexes with known 3D structures offers a unique opportunity to study PPI to gain crucial insights into protein function and the causes of many diseases. In particular, it would be extremely useful to compare interaction surfaces of monomers, as this would enable the pinpointing of potential interaction surfaces based solely on the monomer structure, without the need to predict the complete complex structure. While there are many structural alignment algorithms for individual proteins, very few have been developed for protein interfaces, and none that can align only the interface residues to other interfaces or surfaces of interacting monomer subunits in a topology independent (non-sequential) manner.

Results

We present InterComp, a method for topology and sequence-order independent structural comparisons. The method is general and can be applied to various structural comparison applications. By representing residues as independent points in space rather than as a sequence of residues, InterComp can be applied to a wide range of problems including interface–surface comparisons and interface–interface comparisons. We demonstrate a use-case by applying InterComp to find similar protein interfaces on the surface of proteins. We show that InterComp pinpoints the correct interface for almost half of the targets (283 of 586) when considering the top 10 hits, and for 24% of the top 1, even when no templates can be found with regular sequence-order dependent structural alignment methods.

Availability and implementation

The source code and the datasets are available at: http://wallnerlab.org/InterComp.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Proteins are involved in almost all processes in cells and have evolved to interact with a range of other molecules, such as proteins, DNA, RNA or small molecules. The study of how proteins interact with these molecules offers important insights into the function of proteins, the way they operate, and possible causes of disease (Alberts, 1998; Jeong et al., 2001; Li et al., 2004).

Proteins interact with other molecules by making direct physical contact through specific residues on the protein surface. These residues constitute the interface of a protein. A variety of interfaces have been experimentally identified and have been found to vary both in shape and residue composition (Davis and Sali, 2005). Interfaces can be stable, as for the multiple chains of the ribosome or transient, as for many proteins involved in signalling pathways. The same interface can interact with multiple molecules and an interaction can also require multiple interfaces (Bomsztyk et al., 2004; Cohen, 2002; Han et al., 2004).

To predict how proteins interact with other molecules it is of fundamental importance to know where the interfacial residues are located on their surface. For example, in the case of protein docking, it has been shown that it is relatively easy to dock proteins using template-based docking techniques if a similar interaction has been experimentally determined (Kundrotas et al., 2012; Mirabello and Wallner, 2017; Tuncbag et al., 2012; Zhang et al., 2013).

A number of template-based docking methods have been developed in the last few years. Some methods use sequence- or profile-based alignments to match two target protein sequences to the sequences of two protein chains that are part of an experimentally solved quaternary structure (Chen and Skolnick, 2008; Mukherjee and Zhang, 2011). When a match is found, the structure and mutual position of the two protein chains can be used as templates to model the interaction of the targets. Unfortunately, this approach inherits the same drawbacks as template-based modelling of protein monomer structures, and when the pairwise sequence identity drops below 30% it is difficult to obtain a reliable prediction (Aloy et al., 2003).

Other methods are based on structural templates. The first step in these methods is usually to build the monomer structure of each molecule in the complex separately. Then, the structure of each monomer is aligned to libraries of known complexes, and cases where the two target monomers are structurally similar to monomers of a known complex can be used as a template for the interaction. This improves the coverage of modellable multimers. Since the structure of proteins is more conserved than the sequence, more distantly related homologs can be found using structure (Aloy and Russell, 2002; Mirabello and Wallner, 2017; Zhang et al., 2013). However, a drawback of the methods based on structure is that both the target monomers need to be overall structurally similar to their templates. At the same time, it has been shown that an interaction is only specific to the shape and chemical composition of the patches of residues directly involved in the interaction, i.e. the interfacial residues and not the overall structural scaffold. In fact, some proteins interact through the same type of interface while differing substantially in their overall structural similarity (Keskin and Nussinov, 2007).

In addition, the number of different spatial arrangements of residues in protein interfaces seems to be lower than the number of different protein folds, and studies have shown that the space of interface structures is already covered to a large degree in the known structures in the Protein Data Bank (Gao and Skolnick, 2010b; Kundrotas et al., 2012). Thus, in principle it should be possible to find template interfaces for most unknown interactions from the structure of single interacting monomers, particularly if the alignment is limited to the interfacial regions of proteins. But aligning only interface residues is not trivial and methods adopt a mixed approach to this problem, where full structural alignments are used to find templates, and the quality of the alignment at the interfacial region of the template is used to improve the quality of the prediction (Guerler et al., 2013; Hosur et al., 2011; Mirabello and Wallner, 2017; Zhang et al., 2013). Other methods focus on structurally aligning a subset of residues of the target structures corresponding to the interfacial regions of templates as a more flexible way of finding interaction templates whenever the evolutionary relationship between targets and templates is unclear or non-existent (Gao and Skolnick, 2010a; Günther et al., 2007; Konc and Janežič, 2010; Tuncbag et al., 2011).

However, thus far the full potential and the characteristics of interfaces have not been explored, and it is still unclear whether aligning interfaces, rather than full monomers, represents a real improvement in the search for interaction templates (Sinha et al., 2010). In the latest CASP12/CAPRI experiment (2016) the most successful groups were still using templates gathered from alignments of sequences or full structures (Lensink et al., 2017). A possible reason for why the full potential of interfaces has not yet been exploited can be found by analyzing how current interface alignment methods are implemented. For example, PRISM (Tuncbag et al., 2011), one of the leading methods based on interface alignments, does not restrict its search for templates to only interfacial residues, but also includes neighboring residues that are closer than 6 Å to other interfacial residues, even if these are buried in the protein core. Although such an approach helps the alignment procedure by reducing the fragmentation of the interface, it also restricts the alignment to cases where basically the secondary structure elements at the interface level match, and thus it might be less effective at finding evolutionary unrelated, yet compatible, templates.

Another approach to find templates is to rely on using structural alignment programs such as TM-align (Zhang and Skolnick, 2005) and restricting the alignment to only interfacial residues (Guerler et al., 2013; Kundrotas and Vakser, 2013). The ability to insert multiple gaps in structural alignments can help to address the issue of fragmentation at the interface. However, depending on the level of fragmentation, TM-align might still fail in retrieving the correct alignment, simply because it is not designed for dealing with heavily fragmented coordinate sets. A further limitation of TM-align is that the sequential order of the residues must be maintained for the algorithm to work correctly, and as such it would not be possible to align interfaces with different chain topologies.

Similarly, other structural alignment methods that were not developed explicitly for the alignment of fragmented molecules or molecules that are radically different in topology, such as MICAN (Minami et al., 2013) and DaliLite (Holm and Park, 2000), suffer from these same limitations.

To address the topological issue, iAlign was developed (Gao and Skolnick, 2010a). It is a protein–protein interface comparison method based on an extension of the Kabsch algorithm (Kabsch, 1976), also used in TM-align, that will optionally allow for sequence-order independent comparisons for interfaces. However, it utilizes a definition of protein interface, where the interfacial residues are collected across both the protein chains involved in the interaction. This means that the mutual position of the patches of interfacial residues must be known before any comparison can be performed against a template. Thus, iAlign can only be used if the complete interface is known, e.g. for comparing known interfaces, and not for searching for interfacial residues on a monomer structure.

In this study, we present InterComp, designed to perform sequence-order independent structural comparisons and alignments. Since the algorithm works on a disjointed set of points in space rather than on a sequence of residues, the method is general and can be used for various structural comparisons applications, including interface–surface and interface–interface comparisons. The main difference in these applications would be a few size-dependent parameters and the statistics, i.e. P-value calculation. Here, we demonstrate a case when we apply InterComp to find protein interfaces on monomer structures using interface–surface comparisons. We show that InterComp can pinpoint the interface location on the surface of proteins, even when structural homologs found by regular structural alignment methods (e.g. TM-align) have been removed.

2 Materials and methods

2.1 Algorithm

The aim of this study is to build a software that is capable of comparing molecules by treating them as a set of independent points on a surface in a 3D space. This means that the points have no inherent ordering, which differs from regular structural alignments, where the atoms follow a fixed order according to the protein sequence.

We use a modified version of a stochastic method for molecular structure matching (Barakat and Dean, 1991; Kirkpatrick et al., 1983) and simulated annealing to solve an optimization problem that maximizes the structural superposition score of two molecules independently of the chain topology of their covalently bonded network. The objective function is calculated by comparing the Cα distance maps of the two molecules. This simplifies the problem, since distance maps are invariant to rotations and translation there is no need to apply spatial transformations to superimpose molecules. Instead, the optimal matching between two molecules is found by permutating the rows and columns in the distance matrix while maximizing the similarity (see below for details).

2.2 Objective function

In this study, the similarity measure for a given trial alignment between two molecules a and b with length La = N and Lb = M (NM), represented as internal distance matrices Da and Db, is a variant of the Levitt-Gerstein score (Levitt and Gerstein, 1998) adapted to internal distances:
strdist(Da,Db)=1N2x=1Ny=1N11+(δxy/d0)2
(1)
where N is number the of residues in a, and d0 is a parameter that monitors the slope of the function (optimized d0 = 0.5 Å, see Supplementary Material), δxy is the absolute element (x, y) in the matrix difference between Da and the first N columns and rows of Db after it has been permutated to form a trial alignment, see diff(D1,D2) in Figure 1a. The M − N residues (for x > N and y > N) from b are not included in the alignment and are thus excluded from the similarity score.
To also consider the chemical compatibility of the molecules (i.e. the similarity of the aligned residues), a second scoring component based on amino acid similarity is used, analogous to the method used for the structural similarity above:
seqdist(Sa,Sb)=1Nz=1NBLOSUM(sza,szb)
(2)
where N is the number of residues in Sa, Sa and Sb are the aligned amino acid residues from targets a and b, respectively, and the BLOSUM62 substitution matrix (Henikoff and Henikoff, 1992) is used to score the similarity for the matching position.
Fig. 1.

(a) The similarity score in InterComp is calculated by adding up the elements in the matrix of deltas diff(Da,Db). The deltas are the absolute difference, calculated element by element, between the first N rows/columns in the distance matrix Da and the matrix Db. (b) A trial alignment is obtained by permutating Db by swapping any two random rows/columns forming the Dbp

To represent both the structural and sequence similarity, the two scores are combined using a weighted sum to form the complete objective function in the simulated annealing procedure:
opt(p)=arg maxpP(M,N)Wstr strdist(Da,Dbp)   +(1Wstr) seqdist(Sa,Sbp)
(3)
where Wstr[0,1] is the weight for the structural similarity score, Dbp is a permutation of the rows/columns in the distance matrix Db and Sbp the corresponding amino acids. The default weight of the structural component of the scoring function, Wstr, has been optimized to 0.5 by trying different values for Wstr in the 0.25–1.75 range (see Supplementary Material).

When the optimal mapping between two molecules has been found, a structural superposition can be performed by minimizing the RMSD for the mapping and outputting the two structurally aligned molecules in PDB format.

2.3 Optimization protocol

The optimization procedure keeps the distance map Da of the smallest molecule fixed, while trial configurations for the largest map Db are generated by swapping a random pair of rows/columns. To account for the difference in size, a number of rows/columns in Db, equal to the size difference, are always ignored when calculating the final score. These are initialized randomly, and then sampled naturally by the swapping of rows/columns in Db.Figure 1b shows how a trial configuration is generated to obtain Dbp from Db. In addition, to model mismatches in Da a proportion of the rows/columns from the smallest molecule, Da can optionally be ignored [null correspondences as defined in Barakat and Dean (1991)]. However, in our tests this option did not yield better results, most likely because the structural similarity score [Equation (1)] already down-weight residues with large deviations (data not shown).

The simulated annealing procedure is used to find the optimal sorting for Db. At each iteration, two columns in Db are randomly swapped to form a trial configuration, the score [Equation (3)] is calculated and the configuration accepted if it improves on the current best score. Otherwise, the configuration is accepted with probability: P=exp(Δscore/T), where the Δscore is the difference between the trial score and the last accepted score, and T is the annealing temperature, which is gradually lowered with the number of iterations.

With regard to the annealing procedure discussed in Barakat and Dean (1991), the main differences in the current method are the larger number and length of Markov chains used during the search. This is necessary since it is difficult to compare molecules containing many atoms and a large imbalance in the number of atoms between targets. The longer Markov chains allow the state of the system to settle as the temperature is decreased in the annealing procedure. The procedure is said to have converged whenever the acceptance rate, i.e. the frequency at which a new trial configuration is accepted, drops below 0.3%. The acceptance rate is reset at the beginning of each Markov chain, and the minimum size of the Markov chain is set so that the acceptance rate can always drop below the stopping criteria.

2.4 Datasets

To test and optimize the method, a data set of hetero- or homo-dimeric protein–protein complexes was constructed. The set is composed of protein chains extracted from a 20% redundancy-reduced version of PDB compiled by PISCES (Wang and Dunbrack, 2003). The reduced PDB contained 2952 protein chains (July 13, 2016) with resolution 1.6 Å and R-factor 0.25 or better. From this set, any protein chain involved in one or more dimer interactions was selected, resulting in 668 protein chains involved in dimer interactions. To avoid including targets with very small interfaces, any monomer with an interface composed of <20 residues was removed. In addition, targets with an interface covering >50% of their surface were also removed, since these targets were deemed too easy and even a random predictor would score well in our tests. The final set called T568 consisted of 568 protein monomers. From this, a monomer shells set (S568) was constructed, consisting of the Cα atoms from residues on the surface of the protein chains defined by Residue Solvent Accessibility (RSA) for the side-chain >15% calculated using Naccess (Hubbard and Thornton, 1993).

In addition, a library of template protein interfaces was also extracted from PDB (May 19, 2016) using the defined biological units to prevent non-native interfaces from crystal packing (Carugo and Argos, 1997). If a biological unit is also a multimer, an interface is defined as residues within 5.0 Å (all-atoms) between the two monomers. Each resulting interface is characterized by the monomer from which its residues were extracted and the monomer containing the counterparts in the interaction. For example, if the PDB 1xyz contains two chains, A and B, two separate interfaces will be extracted: 1xyz_AB, containing the Cα atoms of residues from A that were interacting with any residue in B, and 1xyz_BA, containing the Cα atoms of residues from B that were interacting with any residue in A. The resulting template library, called I400k, contained approximately 400 000 interfaces.

Figure 2a shows an example of a protein monomer (PDB id: 3vsv, chain A, colored in purple) where the shell atoms are highlighted as green spheres and the interface atoms (interface id: 3vsv_AD) are highlighted in red. It is important to note that all the interface atoms are also shell atoms. A stylized version of the same concept is shown in Figure 2b. In this work, we use InterComp to align the interface to shells (Fig. 2c). We also remove obvious structural homologs by applying TM-align to align interfaces to monomers (Fig. 2d) and monomers to monomers (Fig. 2e).

Fig. 2.

(a) Protein monomer 3vsv, chain A (purple). The shell atoms are Cα atoms from residues whose RSA is over 15% (green). The interface atoms are a subset of the shell atoms whose residues are also closer than 5 Å to any residue in another protein chain (red). (b) Stylized version of a protein monomer where the shell and interface residues are represented with a dashed contour line. (c) Example of an interface–shell alignment, where an interface and a shell extracted from two monomers are aligned. This is the case with all InterComp alignments in this work. (d) Example of an interface–monomer alignment, where the interface atoms are aligned against all Cα atoms in a monomer. (e) Example of a monomer–monomer alignment, where all Cα atoms from two protein monomers are aligned. The alignments shown in d and e are performed with TM-align in this work

In order to estimate P-values for the structural score [Equation (1)], a set of approximately 2 million random alignments between target shells and biological interfaces was constructed. This set was constructed by aligning 1790 monomer shells previously described in Gao and Skolnick (2010a) against the library of biological interfaces (I400k) using InterComp to calculate the structural score. To avoid including non-random hits, pairs that showed a significant similarity by TM-align (TM-score <0.35) were filtered out. It is important to note that this filtering will only remove any high scoring interface–shell alignment that can be found by TM-align (sequence-order dependent). It is still possible that the set of random alignments may include non-random high scoring matches, interfaces that are only found when compared in a sequence-order independent manner using InterComp. Still, these examples will be few compared to the whole random set and should not influence the P-value calculation too much.

Since the structural score depends heavily on the size of the interface and shell, the random structural scores were fitted to an extreme value distribution for different interface and shell size bins. In Figure 3, the random distribution for different size bins along with the fitted distributions are shown. We used the fit to calculate the P-values for each structural score. By analyzing the first row of plots in Figure 3, it is clear that whenever interfaces contain <10 residues, independently of the size of the shell, random scores tend to be high. Thus, it will be hard to find significant hits for small interfaces, which makes sense, since smaller interfaces can easily align with most targets (e.g. single helix to helix alignment).

Fig. 3.

Probability density distributions of InterComp scores from random alignments for different interface (rows) and shell (column) sizes. The empirical probability density is shown in blue and the fitted extreme value distribution in red

Since our tests consist of multiple comparisons against many templates (400 k), using the P-value alone to decide whether the null hypothesis can be rejected or not would be impossible without performing the necessary multiple testing correction. In our tests, we thus adopt the False Discovery Rate (FDR) controlling procedure (Benjamini and Hochberg, 1995) and derive an FDR adjusted P-value (q-value) for each P-value (Yekutieli and Benjamini, 1999). The q-values are then used to decide if a given InterComp score is significant.

To test the hypothesis that InterComp is more sensitive than a sequence-order dependent algorithm, two additional interface sets of varying difficulty were constructed. The first ‘medium’ set was built by removing all template hits that would also be matched by TM-align (TM-score >0.5) for a given target from the T568 set, using Monomer–Interface alignments, see Figure 2d. The second ‘hard’ set was built by removing all template hits that would also be matched by TM-align (TM-score >0.5) for a given target from the T568 set, using Monomer–Monomer alignments, see Figure 2e.

In Figure 4a, schematic representation of all three subsets of template interfaces used to test InterComp is shown. It is important to note that while the I400k is the same for every target in the T568 set, the ‘medium’ and ‘hard’ sets will vary from target to target, depending on the structural similarity found using interface–monomer and monomer–monomer alignment with TM-align.

Fig. 4.

Schematic overview of the different template interface sets that were used to test InterComp. The full I400k interface set, also called the ‘easy’ set, is shown in red. The ‘medium’ subset of I400k for a given target is obtained by removing any template interface that will align with the TM-align, TM-score of the Interface–Monomer alignment 0.5. The ‘hard’ subset of I400k is obtained by removing from the ‘medium’ subset any template interface whose parent monomer is in the same fold as the target monomer, TM-score of the Monomer–Monomer alignment 0.5

2.5 Performance measures

To assess the performance of InterComp, the Positive Predicted Value (PPV) of interfacial residues is used:
PPV=TPTP+FP
(4)
where TP is the number of True Positives, i.e. the number of residues correctly predicted as part of an interface, and FP is the number of residues incorrectly predicted as part of an interface. The sum of TP and FP is also the total number of residues predicted as part of an interface.

3 Results and discussion

3.1 Benchmark: aligning interfaces to surfaces

InterComp was applied to the problem of predicting the interface residues on the surface of protein chains involved in a dimeric interaction. By restricting the search to the surface of proteins (shells) rather than to full monomers, the complexity of the search was reduced. This was necessary since the time required for the algorithm to converge is related to the length of the Markov chains which increases quadratically with the size of the largest target. The accuracy should not suffer from this choice, since interfaces are expected to be on the surface of proteins.

We hypothesize that top-ranking interface templates by q-value (see c) should correspond to residues involved in interactions with partner proteins. The location of the interface on each monomer was found by searching for areas on the surface that are compatible in shape and chemical composition to interfaces in the library, as illustrated in Figure 2c.

In detail, InterComp was used to align each monomer target shell from the S568 set against the I400k interface library (‘easy’ test). The top 1 and top 10 best matches by q-value are selected and the performance is assessed by the fraction of correctly predicted interface residues from the target shell, PPV [Equation (4)], see Figures 5 and 6 right panel. To account for the fact that the ‘easy’ test database contains several simple cases that any structural alignment method would find. The test was made more difficult by removing any template that would be found using TM-align, a commonly used sequence-order dependent structural alignment method. Two difficulty levels were tested: for the first (‘medium’ test) any template with an interface–monomer TM-score >0.5 was removed (Figs 5 and 6, middle panel); for the second (‘hard’ test) any interface whose parent monomers had a TM-score >0.5 with the target was removed, effectively removing any templates that overall were structurally similar to the target (Figs 5 and 6, right panel). In this way, it is possible to assess if InterComp is capable of finding correct template interfaces even when sequence-order dependent structural alignment methods like TM-align fail at various levels.

Fig. 5.

Distribution of fraction of correctly predicted interface residues (PPV) in relation to the negative logarithm of the q-value calculated from the top 10 InterComp hits using the ‘easy’ (green), ‘medium’ (orange) and ‘hard’ (purple) interface sets. The ‘random’ box corresponds to the fraction of interfacial residues on each target shell and estimates the performance of a predictor that would pick random shell residues to be part of the interface. The infinity () sign on the x-axis corresponds to q-value = 0

Fig. 6.

Scatter plots of the PPV in relation to the negative logarithm of the q-value calculated from the InterComp score for the Top 1 and Top 10 hits for each target shell using the ‘easy’ (green), ‘medium’ (orange) and ‘hard’ (purple) interface sets. The infinity () sign on the x-axis corresponds to q-value = 0

There is a clear correlation between the structural score from InterComp and the PPV for all three tests (Figs 5 and 6). Although the correlation between InterComp score and PPV for the ‘hard’ test, is less pronounced (Spearman’s rank correlation 0.20–0.24), InterComp could still find at least the location of an interface with PPV0.5 for more than half (287/568) of targets for the top 10 hits and for 24% (139/568) of the targets for the top 1. Moreover, the chances of finding the right interface increases as the q-value derived from the InterComp improves (Fig. 5, purple boxes). For the ‘medium’ test, InterComp could correctly identify the correct interface for 70% (401/568) of targets for the top 10, and 60% (341/568) for the top 1. On the full I400k template set (excluding self-hits, ‘easy’ test) InterComp founds at least one interface for 80% of targets (450/568) and 66% (373/568) for the top 10 and top 1 hits, respectively. In all cases, including the ‘hard’ set the number of correctly identified interfaces is significantly better than would be expected by a random predictor.

3.2 Alternative interfaces

In some cases, even at very significant q-values the PPV can be low or close to zero (Fig. 6). The reason for this could of course be that InterComp predicts completely wrong interfaces in all of these cases. However, a more likely explanation is some of these very significant hits are alternative, yet unknown interfaces or interfaces not included among the test interfaces for the target. An example of a potential correctly predicted interface that is classified as incorrect in the benchmark is shown in Figure 7a. The top 1 interface predicted by InterComp (q-value = 0) for target 4b1y (medium difficulty) is incorrect, since it does not match the interface in its biological unit. However, the monomer from which the template interface was extracted superimposes almost perfectly with the target, as shown in Figure 7b. This highlights a problem with multiple correct interfaces, and although some can be identified, not all of them are included in the relevant biological unit.

Fig. 7.

An example of a significant hit that is not part of the target set. (a) The protein Phactr1 RPEL-3 (4b1y) from organism Oryctolagus cuniculus bound to G-actin (cyan). The target interface is the binding site for G-actin (short peptide shown in cyan), but InterComp identified another high-confidence interface with template 2p9l1_BF (white spheres). However, the template 2p9l (Arp2/3 complex from organism Bos taurus) superimpose almost perfectly to the target (green/grey chains aligned on right side, RMSD: 0.65), with the interface 2p9l1_BF highlighted in red and the relevant partner for the template interaction shown in white. In this case, it is likely that the inferface pinpointed by InterComp is actually correct, even though it was not part of the target set

3.3 Successful examples

A few examples of successful cases from the ‘hard’ tests are shown in Figure 8. For all of these examples, InterComp is able to find with high accuracy (PPV >0.5) the location of the target interface. This would not have been possible using regular sequence-order dependent structural alignment methods such as TM-align. In this case, in fact, the TM-score is <0.5 for aligning the template interface to the target (0.23–0.35) and for aligning the template interface parent monomer to the target (0.26–0.42). It is easy to spot the location of some false positives, where the predicted interface residues (the red spheres) are quite far away from the interaction surface, e.g. on the rear side of 1t92A in Figure 8a and 3lagA in Figure 8b. In these cases, a simple clustering technique could potentially be used to filter out spurious positives that are unlikely to be part of the interface. But that is not the focus of this study.

Fig. 8.

Successful predictions from the ‘hard’ category with the corresponding InterComp score, q-value, M-M TM-score (Monomer–Monomer TM-score), M-I TM-score (Monomer–Interface TM-score) and PPV. The target chains are shown in yellow with the predicted interfacial residues highlighted by red spheres, template chains are shown in cyan with the aligned interface residues highlighted by white spheres. The interacting target partners are showed in transparent orange and template partners are showed in transparent magenta. (a) Target 1t92A, Pili subunit structure of N-terminal truncated pseudopilin PulG from Klebsiella oxytoca. Template 4uxz1_BC, crystal Structure of a Membrane Diacylglycerol Kinase from organism Escherichia coli. (b) Target 3lagA, Double-stranded beta-helix, crystal structure of a functionally unknown protein RPA4178 from Rhodopseudomonas palustris. Template 3kvp1_AD, Beta-propeller-like, crystal structure of uncharacterized protein ymzC precursor from Bacillus subtilis. (c) Target 4dt5A, Single-stranded left-handed beta-helix, crystal structure of antifreeze protein from Rhagium inquisitor. Template 3cgl1_DC, GFP-like protein dsFP483, cyan fluorescent protein from organism Discosoma striata. (d) Target 5b08A, Alpha-beta plaits, polyketide cyclase OAC from organism Cannabis sativa. Template 3oay2_LM Immunoglobulin-like beta-sandwich, HIV glycan shield from Homo sapiens

3.4 Computational cost

To give an idea of the computational cost of running InterComp, we timed the running times using a few typical sizes for target shells against the full I400k set of template interfaces using a 28 core 2.6 GHz computer with 128GB RAM Linux node. The median number of residues on the surface of a monomer (shell size) for set S568 is 129. This corresponds to a monomer of about 200 residues (e.g. target 2zcmA, 3cxnA) with a runtime of approximately 120 min on 28 cores (56 core hours in total). A run on a target shell twice as large (250 residues and 400 residues in total in the monomer) will need approximately 13 h to be completed (364 core hours in total). Overall, a full test on all 568 targets in the S568 set against I400k takes approximately 100 000 core hours, i.e. 176 core hours per target on average.

4 Conclusions

We have presented InterComp, a topology and sequence-order independent structural alignment method. We have shown that InterComp is capable of performing protein surface to interface alignment and can be used to pinpoint potential interaction points on the surface of proteins, even when regular structural alignment methods that are dependent on the sequence order fail. The fact that InterComp can align monomer structures to one side of a complete interface is extremely useful, and should leverage the use of structural information in protein–protein docking by providing novel templates with similar interfaces but no overall structural similarities. However, the interface–surface alignment case demonstrated here is only one of many potential use-cases for InterComp. For instance, we are currently recalculating the statistics to apply the method to interface–interface, and small molecule comparisons, which will enable clustering of interfaces and improvements to virtual screening of small molecules.

Acknowledgements

The computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at the National Supercomputer Centre (NSC) in Linköping. We also thank Isak Johansson-Åkhe for helpful discussions.

Funding

This work was supported by Swedish Research Council grants 2012-5270, 2016-05369, The Swedish e-Science Research Center, and the Foundation Blanceflor Boncompagni Ludovisi, nee Bildt.

Conflict of Interest: none declared.

References

Alberts
 
B.
(
1998
)
The cell as a collection of protein machines: preparing the next generation of molecular biologists
.
Cell
,
92
,
291
294
.

Aloy
 
P.
,
Russell
R.B.
(
2002
)
Interrogating protein interaction networks through structural biology
.
Proc. Natl. Acad. Sci. USA
,
99
,
5896
5901
.

Aloy
 
P.
 et al.  . (
2003
)
The relationship between sequence and interaction divergence in proteins
.
J. Mol. Biol.
,
332
,
989
998
.

Barakat
 
M.T.
,
Dean
P.M.
(
1991
)
Molecular structure matching by simulated annealing. III. The incorporation of null correspondences into the matching problem
.
J. Comput.-Aided Mol. Des.
,
5
,
107
117
.

Benjamini
 
Y.
,
Hochberg
Y.
(
1995
)
Controlling the false discovery rate: a practical and powerful approach to multiple testing
.
J. R. Stat. Soc. B
,
57
,
289
300
.

Bomsztyk
 
K.
 et al.  . (
2004
)
hnRNP K: one protein multiple processes
.
Bioessays
,
26
,
629
638
.

Carugo
 
O.
,
Argos
P.
(
1997
)
Protein–protein crystal-packing contacts
.
Protein Sci.
,
6
,
2261
2263
.

Chen
 
H.
,
Skolnick
J.
(
2008
)
M-TASSER: an algorithm for protein quaternary structure prediction
.
Biophys. J.
,
94
,
918
928
.

Cohen
 
P.T.
(
2002
)
Protein phosphatase 1-targeted in many directions
.
J. Cell Sci.
,
115
,
241
256
.

Davis
 
F.P.
,
Sali
A.
(
2005
)
PIBASE: a comprehensive database of structurally defined protein interfaces
.
Bioinformatics
,
21
,
1901
1907
.

Gao
 
M.
,
Skolnick
J.
(
2010
)
iAlign: a method for the structural comparison of protein–protein interfaces
.
Bioinformatics
,
26
,
2259
2265
.

Gao
 
M.
,
Skolnick
J.
(
2010
)
Structural space of protein–protein interfaces is degenerate, close to complete, and highly connected
.
Proc. Natl. Acad. Sci. USA
,
107
,
22517
22522
.

Guerler
 
A.
 et al.  . (
2013
)
Mapping monomeric threading to protein–protein structure prediction
.
J. Chem. Inf. Model.
,
53
,
717
725
.

Günther
 
S.
 et al.  . (
2007
)
Docking without docking: iSEARCH-prediction of interactions using known interfaces
.
Proteins
,
69
,
839
844
.

Han
 
J.-D.J.
 et al.  . (
2004
)
Evidence for dynamically organized modularity in the yeast protein–protein interaction network
.
Nature
,
430
,
88
.

Henikoff
 
S.
,
Henikoff
J.G.
(
1992
)
Amino acid substitution matrices from protein blocks
.
Proc. Natl. Acad. Sci. USA
,
89
,
10915
10919
.

Holm
 
L.
,
Park
J.
(
2000
)
DaliLite workbench for protein structure comparison
.
Bioinformatics
,
16
,
566
567
.

Hosur
 
R.
 et al.  . (
2011
)
iWRAP: an interface threading approach with application to prediction of cancer-related protein–protein interactions
.
J. Mol. Biol.
,
405
,
1295
1310
.

Hubbard
 
S.J.
,
Thornton
J.M.
(
1993
)
‘Naccess’
.
Computer Program, Department of Biochemistry and Molecular Biology, University College
,
London
.

Jeong
 
H.
 et al.  . (
2001
)
Lethality and centrality in protein networks
.
Nature
,
411
,
41
42
.

Kabsch
 
W.
(
1976
)
A solution for the best rotation to relate two sets of vectors
.
Acta Crystallogr. A
,
32
,
922
923
.

Keskin
 
O.
,
Nussinov
R.
(
2007
)
Similar binding sites and different partners: implications to shared proteins in cellular pathways
.
Structure
,
15
,
341
354
.

Kirkpatrick
 
S.
 et al.  . (
1983
)
Optimization by simulated annealing
.
Science
,
220
,
671
680
.

Konc
 
J.
,
Janežič
D.
(
2010
)
ProBiS algorithm for detection of structurally similar protein binding sites by local structural alignment
.
Bioinformatics
,
26
,
1160
1168
.

Kundrotas
 
P.J.
,
Vakser
I.A.
(
2013
)
Global and local structural similarity in protein–protein complexes: implications for template-based docking
.
Proteins
,
81
,
2137
2142
.

Kundrotas
 
P.J.
 et al.  . (
2012
)
Templates are available to model nearly all complexes of structurally characterized proteins
.
Proc. Natl. Acad. Sci. USA
,
109
,
9438
9441
.

Lensink
 
M.F.
 et al.  . (
2017
)
The challenge of modeling protein assemblies: the CASP12-CAPRI experiment
.
Proteins
,
86
,
257
273
.

Levitt
 
M.
,
Gerstein
M.
(
1998
)
A unified statistical framework for sequence comparison and structure comparison
.
Proc. Natl. Acad. Sci. USA
,
95
,
5913
5920
.

Li
 
S.
 et al.  . (
2004
)
A map of the interactome network of the metazoan C. elegans
.
Science
,
303
,
540
543
.

Minami
 
S.
 et al.  . (
2013
)
MICAN: a protein structure alignment algorithm that can handle multiple-chains, inverse alignments, C(α) only models, alternative alignments, and non-sequential alignments
.
BMC Bioinformatics
,
14
,
24
.

Mirabello
 
C.
,
Wallner
B.
(
2017
)
Interpred: a pipeline to identify and model protein–protein interactions
.
Proteins
,
85
,
1159
1170
.

Mukherjee
 
S.
,
Zhang
Y.
(
2011
)
Protein–protein complex structure predictions by multimeric threading and template recombination
.
Structure
,
19
,
955
966
.

Sinha
 
R.
 et al.  . (
2010
)
Docking by structural similarity at protein–protein interfaces
.
Proteins
,
78
,
3235
3241
.

Tuncbag
 
N.
 et al.  . (
2011
)
Predicting protein–protein interactions on a proteome scale by matching evolutionary and structural similarities at interfaces using PRISM
.
Nat. Protoc.
,
6
,
1341
1354
.

Tuncbag
 
N.
 et al.  . (
2012
)
Fast and accurate modeling of protein–protein interactions by combining template-interface-based docking with flexible refinement
.
Proteins
,
80
,
1239
1249
.

Wang
 
G.
,
Dunbrack
R.L.
(
2003
)
PISCES: a protein sequence culling server
.
Bioinformatics
,
19
,
1589
1591
.

Yekutieli
 
D.
,
Benjamini
Y.
(
1999
)
Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics
.
J. Stat. Plan. Infer.
,
82
,
171
196
.

Zhang
 
Q.C.
 et al.  . (
2013
)
PrePPI: a structure-informed database of protein–protein interactions
.
Nucleic Acids Res.
,
41
,
D828
D833
.

Zhang
 
Y.
,
Skolnick
J.
(
2005
)
TM-align: a protein structure alignment algorithm based on the TM-score
.
Nucleic Acids Res.
,
33
,
2302
2309
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Supplementary data