DeepFold: enhancing protein structure prediction through optimized loss functions, improved template features, and re-optimized energy function

Abstract

Motivation: Predicting protein structures with high accuracy is a critical challenge for the broad community of life sciences and industry. Despite the progress made by deep neural networks like AlphaFold2, there is a need for further improvements in the quality of detailed structures, such as side-chains, along with protein backbone structures.

Results: Building upon the successes of AlphaFold2, the modifications we made include changing the losses of side-chain torsion angles and frame-aligned point error, adding loss functions for side-chain confidence and secondary structure prediction, and replacing template feature generation with a new alignment method based on conditional random fields. We also performed re-optimization by conformational space annealing using a molecular mechanics energy function that integrates the potential energies obtained from the distogram and side-chain predictions. In the CASP15 blind test for single protein and domain modeling (109 domains), DeepFold ranked fourth among 132 groups, with improvements in the details of the structure in terms of backbone, side-chain, and MolProbity. In terms of protein backbone accuracy, DeepFold achieved a median GDT-TS score of 88.64, compared with 85.88 for AlphaFold2. For TBM-easy/hard targets, DeepFold ranked at the top based on Z-scores for GDT-TS. This shows its practical value to the structural biology community, which demands highly accurate structures. In addition, a thorough analysis of 55 domains from 39 targets with publicly available structures indicates that DeepFold shows superior side-chain accuracy and MolProbity scores among the top-performing groups.

Availability and implementation: DeepFold tools are open-source software available at https://github.com/newtonjoo/deepfold.


Dataset, training, and validation
We used the latest PDB database (February 2022) for training. We clustered the sequences of the PDB using CD-HIT at 40% sequence identity, which resulted in 31,911 protein chains. We further filtered 23,366 high-resolution chains (resolution < 2.5 Å) for a fine-tuning dataset. The obtained sequences were cropped to 256 and 384 residues, as in AF2, for training. Five DeepFold models were selected from training runs with various training schedules. For all the trained models, we employed the Uni-Fold (a trainable version of AF2) training system, where training started from the AF2 parameters and was then further optimized in the style of transfer learning. Table S1 shows the training details of a representative model, for which a validation result is shown below. Details of the five DeepFold models are described in Supplementary Section 6. For validation, we tested the trained model on 102 targets of CASP13/14. In order to measure the performance on the side-chain torsion angles precisely, we defined new accuracy measures Aχ1 for χ1 and Aχ2 for χ2.
Aχ1 is defined as the proportion of predicted χ1 angles that differ by 10 degrees or less from the ground-truth χ1 angles. For χ2 accuracy, Aχ2 is defined as the proportion of residues for which both the χ1 and χ2 angles differ by 10 degrees or less from the ground-truth χ1 and χ2, respectively. Figure S1 shows a comparison of DeepFold predictions with AF2 in backbone, side-chain, and secondary structure accuracies. Blue indicates that DeepFold outperformed AF2, whereas yellow denotes the opposite. From Figure S1 (a), we can see that DeepFold predicts quite different protein structures from those of AF2 for about 13 targets. The average TM-score of DeepFold predictions is 0.8648, while that of AF2 is 0.8592, a modest improvement of 0.56% in favor of DeepFold. Figure S1 (b) and (c) show Aχ1 and Aχ2 of DeepFold and AF2, respectively. As can be seen from the plots, the average side-chain accuracies for both χ1 and χ2 are higher in DeepFold, with mean scores improved by about 3 percentage points over AF2 in both accuracy measures. More importantly, the majority of targets are improved consistently in side-chain angles. This implies that the modified loss functions were effective in improving the side-chain angle predictions. Figure S1 (d) compares the accuracies of secondary structure prediction on CASP13/14 targets by DeepFold vs.
AF2, which shows that the average secondary structure accuracy of DeepFold (mean accuracy of 0.8606) was slightly higher than that of AF2 (mean accuracy of 0.8567). Considering that the typical classification accuracy of eight-state secondary structure in the literature ranges from 0.70 to 0.80 (Spencer et al., 2015; Wang et al., 2016; Zhang et al., 2018), the accuracy of 0.8606 is significantly higher than the typical average. In addition, we found (data not shown here) that both DeepFold and AF2 showed better or comparable accuracies in all eight states, with both models performing well in predicting 'H' (alpha helix) and 'E' (strand), while demonstrating low accuracies in predicting 'S' (bend). Figure S2 (a) compares the predicted average side-chain confidence score ŝ for each target with its ground-truth value s. As can be seen from the linear fit, a reasonable correlation between the prediction ŝ and the true confidence score s was achieved, with a correlation coefficient of r = 0.7. Considering that the side-chain confidence s reflects the side-chain angle differences by definition, as shown in Eq. (9), the high correlation implies that the predicted ŝ can be used as a good measure of the confidence level for side-chain accuracy.
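The accuracy measures Aχ1 and Aχ2 defined above can be computed as in the following sketch (function and variable names are our own, not from the DeepFold code base). Angle differences are taken on the circle, so that, for example, 359° and 2° differ by 3°, not 357°.

```python
import numpy as np

def ang_diff_deg(a, b):
    """Smallest absolute difference between angles in degrees (periodic)."""
    d = np.abs(np.asarray(a, float) - np.asarray(b, float)) % 360.0
    return np.minimum(d, 360.0 - d)

def chi_accuracies(chi1_pred, chi1_true, chi2_pred, chi2_true, tol=10.0):
    """A_chi1: fraction of residues with |chi1 error| <= tol degrees.
    A_chi2: fraction with BOTH chi1 and chi2 errors <= tol degrees."""
    ok1 = ang_diff_deg(chi1_pred, chi1_true) <= tol
    ok2 = ang_diff_deg(chi2_pred, chi2_true) <= tol
    return ok1.mean(), (ok1 & ok2).mean()
```

For residues lacking a χ2 angle, one would restrict the second measure to residues that actually have both angles; the sketch assumes all inputs are defined.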

Re-optimization by conformational space annealing
Once 3D structures are inferred from the DeepFold networks, we perform global optimization using conformational space annealing (CSA) with the full-atom force field, distance restraints, and side-chain torsion restraints generated by the networks. The energy function used for CSA is implemented with the PyCSA (Joung et al., 2018) global optimization library and OpenMM (Eastman et al., 2017), a molecular dynamics simulation toolkit. The energy function is defined as

E = E_MM + E_disto + E_tor + E_stap,

where E_MM is the molecular mechanics force field composed of the AMBER14SB (Maier et al., 2015) energy and the Generalized Born implicit solvation energy (Onufriev et al., 2004). The distogram potential energy E_disto is defined by

E_disto = − Σ_<i,j> log( P_ij(d_ij) / P_ij(d_0) ),

where P_ij indicates the distogram probability between residues i and j for d_ij < 17.94 Å, with constant extrapolation for farther distances (Senior et al., 2020), and P_ij(d_0) is the distogram probability at the cut-off distance d_0. Each term of E_disto is an interpolated fit to a cubic spline. Since distogram predictions for larger distances are not expected to be accurate enough, we limited the summation over <i,j> to all pairs for which the distance of maximum probability is less than 16.06 Å.
The torsion restraint energy E_tor for the set of all side-chain torsion angles {χ_k} is a flat-bottomed Lorentzian-type potential energy (Joo et al., 2018) defined as

E_tor = Σ_k w_k Δ_k² / (Δ_k² + γ²), with Δ_k = max(0, |χ_k − χ̂_k| − δ),

where χ̂_k is the k-th predicted torsion angle obtained from DeepFold and δ is a tolerance angle (5° in this work), which results in a flat bottom of width 10° with a Lorentzian width of γ = 15°. For the weights w_k, we choose {3.0, 2.5, 2.0, 1.5} for the four types (χ1, ..., χ4) of side-chain angles, respectively. For multiple predicted models, we generalize the above formula by clustering the predicted angles with a threshold of 30° and taking the new χ̂_k as the average of all the predicted angles in each cluster. A new γ is then taken as γ = max(15°, Δ_max + 10°), where Δ_max is the maximum angle difference within the cluster. E_stap is a statistical potential for pairs of torsion angles including backbone and side-chains (Yang et al., 2012). The CSA method aims to find the lowest-energy structure by exploring the conformational space. In its early stages, CSA searches the entire conformational space, gradually narrowing the search to smaller regions with lower energy. As a result, CSA provides optimized structures that satisfy the distogram and side-chain restraints obtained from DeepFold, while also balancing the molecular mechanics force field.
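The flat-bottomed Lorentzian restraint for a single angle can be sketched as follows. This is a simplified scalar version with illustrative names; the exact functional form and normalization in the PyCSA implementation may differ.

```python
CHI_WEIGHTS = (3.0, 2.5, 2.0, 1.5)  # w_k for chi1..chi4, as given in the text

def ang_diff_deg(a, b):
    """Smallest absolute difference between two angles in degrees (periodic)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def torsion_restraint(chi, chi_hat, k, delta=5.0, gamma=15.0):
    """Flat-bottomed Lorentzian restraint for the k-th side-chain angle (k = 1..4).

    Zero inside the flat bottom |chi - chi_hat| <= delta (total width 2*delta = 10 deg);
    outside, it rises toward the weight w_k with Lorentzian width gamma.
    """
    excess = max(0.0, ang_diff_deg(chi, chi_hat) - delta)
    w = CHI_WEIGHTS[k - 1]
    return w * excess**2 / (excess**2 + gamma**2)
```

Because the penalty saturates at w_k for large deviations, a badly predicted angle cannot dominate the total energy, which is the usual motivation for Lorentzian-type restraints over harmonic ones.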

Ablation study for DeepFold
We retrained the DeepFold model to investigate the effect of each modified loss component. The training database was built using the PDB structures deposited before 28 August 2019, the same as that of AlphaFold2 (AF2) (Jumper et al., 2021). It was also filtered at 40% sequence similarity in the same manner as AF2. We also applied the same sequence-similarity filter to remove PDB chains similar to CASP13/14 targets. The dataset for the ablation study contains 25,777 chains. For the template database provided by AF2, we filtered template structures based on the start dates of CASP13 and CASP14, respectively.

Figure S3: Comparison of ablation models with the AlphaFold2 model. We validated the trained models on 102 CASP13/14 targets. The first row shows the results for the W-FAPE loss, while the second and third rows show the results for the SC-torsion loss and the full model, respectively.
Figure S3 presents the outcomes of the ablation study for the three aforementioned models. Even with just the W-FAPE or SC-torsion loss, there are improvements in side-chain accuracy. When these two losses are combined in the full model, the side-chain accuracy is further improved. Detailed numerical results are provided in Table S2. In conclusion, the DeepFold architecture elevates side-chain accuracy while the backbone accuracy remains stable. Note that the MolProbity score gets worse, from ~1.06 to ~2.3 (smaller is better). However, it was improved through the later CSA re-optimization (see the MolProbity comparison between DFolding-server and DFolding in Figure 3 of the main manuscript).

Figure S4 (a) illustrates a TM-score comparison between the original AF2 templates and the new templates (and their alignments) obtained using the CRFalign method (Lee et al., 2022). Structures were generated using the AF2 pipeline. For the 102 CASP13/14 targets, there is an average TM-score improvement of about 0.01, while certain targets display substantial improvement in head-to-head comparison. Specifically, for T1064, the TM-score improvement is 0.39. Figure S4 (b, c) shows that the side-chain prediction remains largely unchanged despite the template change. As depicted in Figure S5, the backbone improvement of the structure predicted by DeepFold through CRFalign on target T1064 amounts to a TM-score difference of about 0.39 over that of AF2 (from 0.4049 to 0.7937). For this target, the Neff score of the MSA was relatively low, around 1.85, and the average TM-score difference among the top 4 templates was 0.11 (0.31 vs. 0.42), a considerable difference but not a huge one. Still, there was a considerable TM-score difference of approximately 0.39 in the final result. When comparing the tertiary structures, notable differences can be observed in the beta-sheet arrangement of the intermediate regions (cyan to yellow colors) relative to the native structure. The impact of templates on protein structure prediction accuracy was studied in a recent work (Wu et al., 2023), which concluded that templates are especially beneficial for targets with similar templates.

Using the domains for which PDB structures are publicly available, we compared the results of CSA re-optimization with those of AF2 run with several different numbers of recycling iterations. The result is shown in the following table, which indicates that the CSA method produces better average scores on side-chain torsion angles and MolProbity. This demonstrates that CSA re-optimization is an appropriate tool for advancing the details of protein structures, as we suggested in the main paper. Increasing the number of recycling iterations beyond four does not lead to statistically significant changes in the AF2 results, as the AF2 paper reports (Jumper et al., 2021). However, another group (ColabFold; Mirdita et al., 2022) suggests that increasing the number of recycles can be beneficial for larger proteins or complex structures.

Training details
We trained each DeepFold model on 8 A100 GPUs with a batch size of 64. Because the batch size is larger than the number of GPUs, we used gradient accumulation during training. The training time for each model is about 3-4 days. We employed a fine-tuning strategy for models 0-3, freezing the Evoformer module so that the gradients affect only the structure module. The differences in hyperparameters between the DeepFold models are outlined in Table S3. We trained the initial model with our recent PDB dataset; this initial training did not include our proposed modifications to the loss functions. We then further trained our model with bigger crop sizes together with our proposed loss functions. We used the Adam optimizer with exponential decay, with decay_rate = 0.95 and decay_steps = 500.

Figure 7 in the main manuscript illustrates the effectiveness of our methodology. For example, domain T1123-D1, shown in red, showed significant improvement despite having a Neff value of 1.9, indicating relatively low MSA quality. In such cases, our data suggest that DeepFold's updated template information can help improve prediction results. However, Figure 7 (a) in the main manuscript shows that there is no improvement for the other three domains marked in red. These three domains all had Neff values greater than 6.0, making them high-quality targets in terms of MSA. It is worth mentioning that when the quality of the MSA information is high, the impact of templates on the predicted structure may not be significant, as highlighted in the AF2 paper (Jumper et al., 2021).
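The optimizer schedule and the effective batch arithmetic described above can be made concrete as follows (a minimal sketch; the helper names and the per-GPU micro-batch size of 1 are our assumptions, not values stated in the text):

```python
def decayed_lr(base_lr, step, decay_rate=0.95, decay_steps=500):
    """Standard exponential decay: lr = base_lr * decay_rate ** (step / decay_steps)."""
    return base_lr * decay_rate ** (step / decay_steps)

def accumulation_factor(batch_size=64, n_gpus=8, per_gpu_batch=1):
    """Forward/backward passes accumulated per GPU before each optimizer step,
    so that n_gpus * per_gpu_batch * accumulation_factor == batch_size."""
    return batch_size // (n_gpus * per_gpu_batch)
```

With the stated settings, the learning rate is multiplied by 0.95 every 500 steps, and each GPU accumulates 8 micro-batches before the shared optimizer update, yielding the effective batch size of 64 on 8 GPUs.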

Figure S1 :
Figure S1: Comparison of DeepFold with AF2 for 102 targets of CASP13/14 in terms of (a) TM-scores, (b) & (c) side-chain accuracies Aχ1 and Aχ2, and (d) eight-state secondary structure accuracies (calculated by DSSP). All numbers within the figures are average values over the 102 targets.

Figure S1 (b) and (c) show Aχ1 and Aχ2 of DeepFold and AF2, respectively. Figure S2 (b) & (c) compare the pLDDT and LDDT scores of all the targets for DeepFold and AF2, respectively. DeepFold shows a slightly better correlation coefficient of 0.80 over the 0.78 of AF2.

Figure S2 :
Figure S2: Validation of the trained model on 102 targets of CASP13/14. (a) compares the true average side-chain confidence s and its prediction ŝ for each target, and (b) & (c) compare (normalized) pLDDT vs. LDDT for all the targets by DeepFold and AF2, respectively.

Figure S5 :
Figure S5: Structure change by new templates and alignment for T1064.

Table S1 :
Training settings and the weights of the loss terms for the training of a DeepFold model (Jumper et al., 2021)

Table S2 :
Validation on CASP13/14 targets with various metrics. We observe that the modified side-chain loss is effective in increasing the χ1 and χ2 accuracies. We also find that combining the two losses gives better results in side-chain accuracies.

Table S3 :
Note that AF_# refers to AF2 run with # recycling iterations.