ReFOLD: a server for the refinement of 3D protein models guided by accurate quality estimates

Abstract ReFOLD is a novel hybrid refinement server with integrated high performance global and local Accuracy Self Estimates (ASEs). The server attempts to identify and to fix likely errors in user supplied 3D models of proteins via successive rounds of refinement. The server is unique in providing output for multiple alternative refined models in a way that allows users to quickly visualize the key residue locations, which are likely to have been improved. This is important, as global refinement of a full chain model may not always be possible, whereas local regions, or individual domains, can often be much improved. Thus, users may easily compare the specific regions of the alternative refined models in which they are most interested e.g. key interaction sites or domains. ReFOLD was used to generate hundreds of alternative refined models for the CASP12 experiment, boosting our group's performance in the main tertiary structure prediction category. Our successful refinement of initial server models combined with our built-in ASEs were instrumental to our second place ranking on Template Based Modeling (TBM) and Free Modeling (FM)/TBM targets. The ReFOLD server is freely available at: http://www.reading.ac.uk/bioinf/ReFOLD/.


INTRODUCTION
The refinement of 3D models of proteins can be thought of as the 'end game' of protein structure prediction (1). Subtle improvements in predictive models are often required to move them the 'last mile', beyond the template or fragments upon which they are based, so they more accurately reflect the time averaged observed structures. Comparative protein modelling is now routinely used across the life sciences. However, many biological applications of 3D models are critically dependent on high model accuracy, particularly in key regions, such as binding sites for drug discovery (2). Improving the accuracy of comparative models, beyond the information derived from the template, therefore continues to be one of the pressing problems in structural bioinformatics. However, it has proved difficult to develop reliable and practically useful refinement methods, highlighted by the relatively slow progress seen in CASP (Critical Assessment of Techniques for Protein Structure Prediction). Indeed, the CASP assessors have stated that 'the endgame problem is going to be at least several orders of magnitude harder than the TBM (Template Based Modeling) problem' (1).
Refinement was introduced as a new category in CASP7 to encourage development. While quality assessment (QA) methods, such as ModFOLD (3), can accurately identify the magnitude of errors in models and locate where those errors are, refinement methods aim to fix the errors in models. Once an atomic model has been obtained, it may be tweaked to idealize bond geometry and to remove unfavorable contacts that may have been introduced by the initial modelling process. Ironically, in the early years of refinement, often most methods made models worse, rather than improving upon them. In recent years, the situation has been improved, however, it is hard to quantify if improvements between CASPs are being made in refinement, as maintaining the level of 'target difficulty' is problematic (1).
Refinement methods can be loosely categorized into the automated server-based programs and non-server-based highly CPU intensive programs, referred to as the 'human' refinement groups in the CASP experiments as they allow a large degree of manual intervention in their pipelines. The freely available automated server GalaxyRefine, focuses on rebuilding and repacking of the side chains, before performing overall structure relaxation via molecular dynamic simulation (MDS) (4). Additionally, the KoBaMIN server, relies on minimization of a knowledge-based potential of mean force (5). Another successful fully automated method is the 3Drefine method (6,7), which works in a two-step process involving optimization of the hydrogen bonding network and composite physics and knowledge-based force fields to give atomic-level energy minimization using the MESHI molecular modelling framework (8). The advantage of such automated approaches is that they are easily distributed, accessible and user-friendly. However, successes in improving protein models via automated approaches remains relatively low, with the most successful servers making relatively conservative changes. In contrast, the less well automated approaches, which often make more bold changes to models, have arguably achieved better results.
The human refinement methods commonly use physics based approaches that select structures from Molecular Dynamics (MD)-based ensembles followed by structural averaging, which has led to a high-level of refinement success (9,10).The most successful human refinement group in both CASP10 and CASP11 was the Feig group (1,11). Despite the relatively high performance of such strategies, using highly intensive MDS approaches for refinement has some drawbacks. The refinement procedure used by the Feig group consumed 75,000 core hours (12 days on 265 cores) on the multicore Intel Xenon CPU machines of the day, just to refine a single 3D model for a single protein target (12). Given that each protein structure prediction method will often produce hundreds or thousands of alternative models for each target, using similarly intensive MDS refinement procedures would quickly become problematic to manage in terms of CPU/GPU loads, if it were to be performed on multiple targets and models. This means that MDS refinement protocols, such as those developed by the Feig group, would be less practical to be used routinely for large scale fully automated structure prediction pipelines. Thus, faster, more practical and, ideally, as accurate and consistent, fully automated refinement servers are required. While many automated servers exists, few servers make use of the power of MDS and few adequately evaluate the models and present users with accurate reports of the likely improvement in errors. User feedback on when and, more specifically, where a 3D model has been improved is important to get right and it is often neglected.
We have a good track record in building 3D models (13)(14)(15) and estimating the likely errors they may have, recently termed Accuracy Self Estimates or (ASE) (16)(17)(18). We have had some success at using quality assessment guided multiple template based modelling to improve upon errors in single template models (19), but this approach requires sequence-structure alignments and template data. The Re-FOLD server is our first successful attempt at developing a method for directly fixing likely errors in any user supplied 3D models, using global quality assessment guided refinement. The ReFOLD server also equips users with the ability to identify specific domains or regions in a protein that are likely to be correctly refined, via its accurate per-residue error estimates.

MATERIALS AND METHODS
ReFOLD is the server implementation of the refinement method we initially developed for the CASP12 experiment. The ReFOLD method works using a unique hybrid approach consisting of rapid iterative refinement with i3Drefine (7) and molecular dynamics simulations with NAMD (20), combined with the latest version of our leading model quality estimation method, ModFOLD (3) (http://www.reading.ac.uk/bioinf/ModFOLD/, manuscript for version 6 in preparation). Input models were refined and evaluated over a number of successive stages. This iterative filtering process led to the generation of hundreds of alternative refined models, which were then ranked by quality.
The ReFOLD pipeline consisted of three protocols outlined in the flowchart shown in Supplementary Figure S1. The first protocol simply used a rapid iterative strategy for refinement of starting models, with 20 refinement cycles (iterations) of i3Drefine (7). Although the authors recommend not to run i3Drefine for more than 10 iterations, an optimal number of iterations is not specified, and often models will improve further beyond 10 iterations. Simply, all i3Drefine requires is a starting model, in PDB format, and a given number iterations as input parameters.
The second protocol employed a more complex and CPU/GPU intensive molecular dynamic simulation strategy, using NAMD (20) to refine each starting model. The NAMD protocol that we implemented was inspired by that of Feig and Mirjalili (10), utilizing all atom MD sampling in explicit solvent. Simulations were conducted at 298k under neutralized pH conditions with 1 bar of atmospheric pressure to resemble normal cellular conditions. Weak harmonic restraints with a spring constant of 0.05 kcal/mol/A 2 on all C-alpha atoms were added to conserve aspects of the starting model. The CHARMM22/27 (21) force field (for protein systems, the two are equivalent) was used as the parameter file with default TIP3P water model (22). As some proteins are sensitive to ionic conditions in the solvent and PME (Particle Mesh Ewald) (23) was being used, the system was neutralized by adding either Na+ or Cl-ions. Only non-bonded interactions were calculated with bonded interactions excluded; the exclusion parameter was set to exclude up to four pairs of bonded atoms. Electrostatic and van der Waal interactions for these atom pairs were instead calculated using the parameter file (CHARMM27). A 12 A cut-off for calculating non-bonded interactions (mostly van der Waal's) was applied, as this is the official standard for CHARMM force fields, with the smoothing function switching distance of 10Å to avoid discontinuity in energy and forces. The pairlistdist function was set to a distance of 14Å between atom pairs for inclusion in pair lists. All hydrogen bonding was rigidified, using the rigidBonds functions, allowing the time step to be increased to 2 fs. PME was used to calculate full electrostatics and the temperature in the system was controlled through Langevin dynamics (24), which balances random noise with friction to push atoms to the target temperature (298 K). The langevinDamping function was set to 1/ps to give some temperature control without dampening the system to the extent that effects were not seen. Periodic boundary conditions were used to achieve maintenance of conditions such as pressure, density and water box and also enable the use of PME, making the simulation more biologically realistic. The first stage of the MD simulation was 1000 steps of minimization, to lower the potential energy of the system reducing bad initial contacts, high force and temperature regions. This process zeros velocity, so that the reinitvels function returns the system to the desired temperature (298 K). For CASP12, the second protocol was run using multiple short trajectories in place of a single long trajectory; four parallel simulations were The 164 refined models generated from the second protocol were assessed and ranked using ModFOLD6 rank method. The third protocol was a combination of the first two approaches, where the top ranked model from the second protocol was further refined using 20 iterations of i3Drefine. Subsequently, all of the refined models generated by each of these protocols and the starting model were pooled and re-ranked again using ModFOLD6 rank and the final top models were identified. Finally, each refined model was evaluated and compared with the original starting model in terms of local and global model quality scores. Static and dynamic graphical outputs were generated using the raw QA scores in order to display the top refined models and estimated improvements in a user friendly manner.

Server inputs and outputs
The only required inputs to the ReFOLD server are the amino acid sequence for the target protein and a single 3D model (in PDB format) for refinement. Users may optionally provide a name for their protein sequence and their email address. The ReFOLD server results page provides users with an accurate estimate of the likely percentage improvement in their global quality score based on the top refined model ( Figure 1A). In addition, the server is unique in providing output for multiple alternative refined mod-els in a way that allows users to quickly visualize the key residue locations, which are likely to have been improved upon compared to their original model. The results page provides users with a series of per-residue error plots, which demonstrate the reduction in local errors in the refined models compared with the uploaded original ( Figure 1B). This is important, as global refinement of a full chain model may not always occur, whereas local regions, or individual domains, may often be much improved. Presenting results to users in this way also gives them the choice to easily compare alternative refined models, allowing them to focus their attention to key interacting residues or specific domains. Users can also click through the images in the table in order to compare the refined and original 3D models interactively in using the JSmol/HTML5 framework ( Figure 1C). No plugins are required and, conveniently, interactive results may also be viewed on mobile devices.

Independent benchmarking
Our ReFOLD refinement approach was independently tested by the assessor team in the recent CASP12 experiment, and it was a key factor contributing to our success. During the CASP12 prediction season (May-August 2016), we used ReFOLD to build hundreds of alternative refined models, for both the main tertiary structure prediction and refinement categories. ReFOLD gave us a significant performance boost in the main tertiary structure prediction category, where it enabled us to further improve the quality of some of the very best initial server models. As a result of our high performance, we were invited to speak at the meeting in Gaeta about our template based modelling (TBM) strategy. Our group ranked in 2nd position overall on both the TBM and TBM/FM (Free Modelling) targets according to the assessors' formula (Supplementary Tables S1 and S2), and we ranked 11th overall on FM targets. Our group also improved the global and local quality scores for many of the starting models provided in the refinement category itself, where we ranked 14th overall (http://predictioncenter.org/casp12/).
The results in Table 1 summarize our CASP12 results for regular targets, where the top server models, selected by ModFOLD rank, were then further refined with Re-FOLD. Arguably, this benchmark represents a more realistic user test case, where each of the starting models have been selected in a fully automated manner and have been generated for full length protein chains. The results in Table 1 show that the majority of models were either improved upon or unchanged according to the GDT TS (25) and MolProbity (26) scores. There is an average overall improvement in scores ( GDT TS = 8.15, MolProbity = -10.16), which are greater than the standard errors, and there is a strong correlation between the ModFOLD6 rank global scores and GDT TS scores (Pearson's r = 0.8094). The GDT TS score measures the global positioning of the backbone C-alpha atoms based on multiple superpositions of the predicted and experimental structure. It is promising to see that ReFOLD improves backbone quality, but these changes are relatively small, so significant improvements in GDT TS are hard to detect on the CASP data set. While it is encouraging that the improvement in cumula-tive GDT TS scores is greater than the error, the pairwise t-test on the data in Table 1 does not provide evidence that the improvement in the backbone is statistically significant (P = 0.1196). However, the improvement in the MolProbity full atom scores is statistically significant at the 95% confidence level, according to the pairwise t-test on the data in Table 1 (P = 0.0301). Whereas the GDT TS score focuses only on the C-alpha atoms, the MolProbity score is an all-atom composite score, which means that it also measures the finer detail of differences in the local errors of the side chains. The MolProbity score denotes the expected resolution with respect to experimental structures, therefore models with lower MolProbity scores are more physically realistic. Figure 2 shows examples where ReFOLD has successfully refined starting models of varying target difficulty (FM, TBM, TBM/FM targets and a refinement target), with improved GDT TS scores for each.
It is clear that the success of refinement is related to the quality of the starting model, when targets are subdivided into domains (Supplementary Tables S3-S5). Dividing targets into domains allows us to pinpoint where the method performance is strongest. The results indicate that the method works better overall on harder FM targets, which have on average lower initial GDT TS scores. The method is less successful on the assessor selected and edited refinement targets and the TBM domain models (where starting models have, on average, much higher GDT TS scores) than either the full chain models or FM domain models. Clearly, it is harder to refine models where there is less room for improvement, and likewise it is harder to detect a smaller GDT TS. In addition, for many of the starting server models, the developers also attempted refinement, which clearly produces a problem of diminishing returns for further refinement. Nevertheless, considering full chain models across regular targets, on average the automatically selected initial models are successfully improved upon by the ReFOLD pipeline.
The time taken to refine a model is dependent on the sequence length. The average time per CASP12 target was ∼10 h for each NAMD run using a quad core desktop (Intel Xeon E5-2407v2) with a standard graphics card (Nvidia GeForce GTX 970). Smaller models were quicker to refine (e.g. 310 min for a 60 residue model) and larger models took longer (e.g. 909 min for a 257 residue model). The other components of the method, including the quality assessment, are run in parallel and will usually take no more than a few extra hours. The ReFOLD pipeline is presently running on a dedicated 56 thread server (Intel Xeon E5-2695 v3) with an Nvidia Tesla K40M GPU card, so most users should expect their results well within 24 h, once they are running.

CONCLUSIONS
The ReFOLD server allows users to attempt to fix errors in their 3D models of proteins and identify improvements with built-in Accuracy Self Estimates. The user friendly, dynamic results pages let users visualise potential improvements for over 200 alternative refined models, at both a global and local level. Providing users with visual comparisons of estimated local improvement allows them to quickly identify W426 Nucleic Acids Research, 2017, Vol. 45, Web Server issue  those models, which are likely to have been improved upon in a specific region of interest. In addition, the server provides users with a compressed archive all of the generated refined models, which they may rank using their own alternative quality assessment protocols. In the recent CASP12 experiment, the ReFOLD pipeline gave our group a performance boost increasing our cumulative GDT TS scores and thereby contributing to our high overall rankings.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.