Accurate prediction of inter-protein residue–residue contacts for homo-oligomeric protein complexes

Abstract Protein–protein interactions play a fundamental role in all cellular processes. Therefore, determining the structure of protein–protein complexes is crucial for understanding their molecular mechanisms and for developing drugs that target protein–protein interactions. Recently, deep learning has led to a breakthrough in intra-protein contact prediction, achieving unusually high accuracy in recent Critical Assessment of protein Structure Prediction (CASP) challenges. However, due to the limited number of known homologous protein–protein interactions and the difficulty of generating joint multiple sequence alignments of two interacting proteins, advances in inter-protein contact prediction remain limited. Here, we propose a deep learning model, named DeepHomo, to predict inter-protein residue–residue contacts across homo-oligomeric protein interfaces. Unlike previous deep learning approaches, we integrated the intra-protein distance map and the inter-protein docking pattern, in addition to evolutionary coupling, sequence conservation, and physico-chemical information of monomers. DeepHomo was extensively tested on both experimentally determined structures and realistic CASP-Critical Assessment of Predicted Interaction (CAPRI) targets. DeepHomo achieved a high precision of >60% for the top predicted contact and outperformed state-of-the-art direct-coupling analysis and machine learning-based approaches. Integrating predicted inter-chain contacts into protein–protein docking significantly improved the docking accuracy on the benchmark dataset of realistic homo-dimeric targets from CASP-CAPRI experiments. DeepHomo is available at http://huanglab.phys.hust.edu.cn/DeepHomo/


Overall Performance of DeepHomo with contact threshold of 6Å
To investigate the effect of the distance threshold for inter-protein contacts on the performance, another threshold of 6Å used by BIPSPI [1] was also adopted to define the residue-residue contact.
The DeepHomo model was then re-trained on the new labels with the same hyper-parameters and network architecture. The new model was also evaluated on the PDB set of 300 experimental homo-dimeric complexes and the CASP-CAPRI set of 28 realistic targets. The two DCA-based methods and two machine learning-based methods were also re-evaluated for comparison. The results for the PDB set are listed in Tables S3 and S4 and the results for the CASP-CAPRI set are listed in Tables S5 and S6. Figure S1 shows the comparison between our new DeepHomo model and the other four methods in terms of the precision and accuracy rate. It can be seen from the tables and figure that our new DeepHomo model still outperformed the other methods on both test sets under almost all the metrics. For example, DeepHomo achieved a top L/2 precision of 25.9% on the PDB set, while the second best method, BIPSPI struc, obtained a precision of only 10.3%. It should be noted that all five methods showed a decline in performance when the smaller threshold was used. This is expected because a smaller threshold reduces the number of defined true contacts and thus makes the prediction more challenging.
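The two metrics used in this comparison can be illustrated with a minimal Python sketch. It follows the definitions given in the caption of Figure S1 (precision as the number of true positive predictions divided by the number of predictions, and accuracy rate as the fraction of targets with at least one successfully predicted contact). The function names, the dictionary-based data layout, and the reading of L as the sequence length are assumptions made for illustration and are not taken from the DeepHomo code.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]  # (residue index in chain A, residue index in chain B)


def top_k_precision(scores: Dict[Pair, float], true_contacts: Set[Pair], k: int) -> float:
    """Number of true contacts among the k highest-scoring pairs, divided by
    the number of pairs actually selected (hypothetical helper)."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


def accuracy_rate(targets: List[Tuple[Dict[Pair, float], Set[Pair]]], k: int) -> float:
    """Fraction of targets with at least one true contact among the top k
    predictions (hypothetical helper)."""
    if not targets:
        return 0.0
    successes = sum(
        1 for scores, true_contacts in targets
        if top_k_precision(scores, true_contacts, k) > 0.0
    )
    return successes / len(targets)


if __name__ == "__main__":
    # Toy target: three scored residue pairs, two of which are true contacts.
    scores = {(0, 5): 0.9, (1, 4): 0.8, (2, 3): 0.2}
    true_contacts = {(0, 5), (2, 3)}
    seq_len = 4  # assumed sequence length L, so top L/2 means k = 2
    print(top_k_precision(scores, true_contacts, seq_len // 2))
```

Under this layout, the set of true contacts changes with the distance threshold (6Å or 8Å) while the ranking of predicted pairs is evaluated in the same way, which is why a smaller threshold can only shrink the set of pairs counted as hits.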

Comparison between contact-based model and distance-based model
Recently, distance-based folding has shown great power in monomer structure prediction [2][3][4].
Thus, we also tried a similar strategy and trained a distance-based deep learning model to predict the distance distribution of inter-protein residue pairs for homo-oligomers. The minimal heavy-atom distance between two residues was discretized into 15 bins: <4.0Å, 4.0 to 6.0Å, 6.0 to 8.0Å, ..., 28.0 to 30.0Å, >30.0Å. The prediction of the distance distribution was thereby cast as a multi-class classification problem. The architecture of the neural network was almost the same as that of the contact-based model, except that the activation function of the output layer was changed from sigmoid to softmax. The distance-based model was also tested on the PDB set and the CASP-CAPRI set. Figure S2 shows the precision as a function of the number of predicted contacts for the two contact thresholds (6Å and 8Å). For comparison, the results of the contact-based models trained with the respective contact thresholds are also shown. For the threshold of 6Å, the contact probability of a residue pair predicted by the distance-based model is the sum of the predicted probabilities of the first two bins; for the threshold of 8Å, it is the sum of the first three bins. As shown in Figure S2, the distance-based model does not outperform the contact-based models at either threshold. One likely reason is that the residue pairs in contact are only a small fraction of all possible residue pairs; when the distance is discretized into multiple bins, the residue pairs in each bin become even fewer, which makes the prediction more difficult. In addition, conformational changes in the monomer structures would also lead to some fluctuations in the distances between inter-protein residues. More sophisticated distance-based models and network architectures may be needed to handle such challenging situations in the future.
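The binning scheme and the conversion from a predicted distance distribution to a contact probability can be sketched in Python as follows. This is only an illustration of the procedure described above, not the DeepHomo implementation: the function names are hypothetical, and the handling of distances that fall exactly on a bin boundary (assigned here to the upper bin) is an assumption, since the text does not specify it.

```python
import math
from typing import List, Sequence

N_BINS = 15  # <4Å, thirteen 2Å-wide bins from 4Å to 30Å, and >30Å


def bin_index(distance: float) -> int:
    """Map a minimal heavy-atom distance (in Å) to its bin index (0..14).
    Assumption: a distance exactly on a boundary goes to the upper bin."""
    if distance < 4.0:
        return 0
    if distance >= 30.0:
        return N_BINS - 1
    return 1 + int((distance - 4.0) // 2.0)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax, as used for a multi-class output layer."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contact_probability(bin_probs: Sequence[float], threshold: float) -> float:
    """Sum the first two bins for a 6Å threshold or the first three bins
    for an 8Å threshold."""
    n_contact_bins = {6.0: 2, 8.0: 3}[threshold]
    return sum(bin_probs[:n_contact_bins])


if __name__ == "__main__":
    probs = softmax([0.0] * N_BINS)  # uniform distribution over the 15 bins
    print(bin_index(5.2), contact_probability(probs, 6.0), contact_probability(probs, 8.0))
```

Because the contact probability is obtained by summing only two or three of the 15 output classes, the few residue pairs that are in contact are spread over even sparser classes during training, which is consistent with the explanation given above.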
Application to homo-oligomers with more than two chains
Besides C2 symmetry, homo-oligomers may adopt other symmetry types, such as C3 or D2, with more than two chains. We have also evaluated the performance of our DeepHomo model on homo-oligomers consisting of more than two chains using our symmetric protein docking benchmark (SDB) [5], which is a comprehensive and non-redundant benchmark for symmetric protein docking and consists of targets with different types of symmetry. The targets with C2 symmetry were removed from the data set because we focused on homo-oligomers with more than two chains. The targets with tetrahedral and helical symmetry were also removed because HSYMDOCK [6, 7] only supports Cn and Dn symmetry so far. The 69 retained targets, covering C3-C7 and D2-D6 symmetry, were used to evaluate the performance of our DeepHomo model and the other four methods. The input features of these targets were produced by the same process as that for C2 symmetry, except for the docking map feature, which was produced with the restriction of the respective symmetry type of the target. The ground truth label for each residue pair was defined according to the minimal distance of the pair over the multiple interfaces. From the results shown in Tables S7-S8 and Figure S3, we can see that our DeepHomo model shows the best performance on the metrics of precision, accuracy rate and accuracy order when compared with the other four methods. Although our DeepHomo model was trained on the data set with C2 symmetry, it also achieved good performance on the targets with other symmetry types. However, the performance is lower than that on the test set consisting of targets with only C2 symmetry. This can be understood as follows. When training the DeepHomo model, the docking map was produced with the constraint of C2 symmetry, so the information of C2 symmetry was already included in the DeepHomo model. As the oligomerization state can be predicted from the sequence [8,9], the information of C2 symmetry may also be implicit in the sequential features and in the trained model. Therefore, DeepHomo is expected to perform better on C2 targets than on others. As there are not enough data for homo-oligomers with other symmetry types, transfer learning [10,11] might be used in the future to improve the performance of our model on homo-oligomers with more than two chains.
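The labelling rule for targets with more than two chains, in which a residue pair is judged by its minimal distance over the multiple interfaces, can be sketched in Python as below. This is a minimal illustration under stated assumptions: each interface is represented by a hypothetical L x L matrix of minimal heavy-atom distances (in Å), the function names are invented for this example, and a pair is labelled as a contact when the merged distance is strictly below the threshold, with 8Å taken as the assumed default.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def min_distance_over_interfaces(interface_maps: Sequence[Matrix]) -> Matrix:
    """Element-wise minimum of the inter-chain distance maps of all interfaces.
    Assumption: every map has the same L x L shape (identical chains)."""
    if not interface_maps:
        raise ValueError("at least one interface distance map is required")
    n_rows = len(interface_maps[0])
    n_cols = len(interface_maps[0][0])
    return [
        [min(m[i][j] for m in interface_maps) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def contact_labels(interface_maps: Sequence[Matrix], threshold: float = 8.0) -> List[List[int]]:
    """Label a residue pair 1 if its minimal distance over all interfaces is
    below the threshold, and 0 otherwise (strict '<' is an assumption)."""
    merged = min_distance_over_interfaces(interface_maps)
    return [[1 if d < threshold else 0 for d in row] for row in merged]


if __name__ == "__main__":
    # Two toy interfaces of a hypothetical trimer with two-residue chains.
    interface_ab = [[10.0, 7.5], [7.5, 12.0]]
    interface_ac = [[5.0, 9.0], [9.0, 20.0]]
    print(contact_labels([interface_ab, interface_ac]))
```

With this rule, a residue pair that is in contact in any one interface receives a positive label, so a single label map summarizes all interfaces of a Cn or Dn assembly.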
(2) Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A. 2020;117 (3)

Figure S1: Performance of DeepHomo and the other four approaches with the threshold of 6Å for contacts. The precision, number of true positive (TP) predictions divided by number of predictions, as a function of the number of predicted contacts on the PDB test set (a) and the CASP-CAPRI set (c). The accuracy rate, number of the targets with at least one successfully predicted contact divided by the total number of targets in a test set, as a function of the number of predicted contacts on the PDB test set (b) and the CASP-CAPRI set (d).

Figure S2: Performance of the contact-based model and the distance-based model on the PDB test set of 300 experimental homo-dimeric structures (a,b) and on the CASP-CAPRI test set of 28 realistic targets (c,d) with contact threshold of 6Å (a,c) and 8Å (b,d).

Figure S3: Performance of DeepHomo and the other four approaches on the SDB test set of 69 targets with more than two chains. (a) The precision, number of TP predictions divided by number of predictions, as a function of the number of predicted contacts. (b) The accuracy rate, number of the targets with at least one successfully predicted contact divided by the total number of targets in a test set, as a function of the number of predicted contacts.