PSICA: a fast and accurate web service for protein model quality analysis

Abstract This paper presents a new fast and accurate web service for protein model quality analysis, called PSICA (Protein Structural Information Conformity Analysis). It is designed to evaluate how much a tertiary model of a given protein primary sequence conforms to the known protein structures of similar protein sequences, and to evaluate the quality of predicted protein models. PSICA implements the MUfoldQA_S method, an efficient state-of-the-art protein model quality assessment (QA) method. In CASP12, MUfoldQA_S ranked No. 1 in the protein model QA select-20 category in terms of the difference between the predicted and true GDT-TS value of each model. For a given predicted 3D model, PSICA generates (i) predicted global GDT-TS value; (ii) interactive comparison between the model and other known protein structures; (iii) visualization of the predicted local quality of the model; and (iv) JSmol rendering of the model. Additionally, PSICA implements MUfoldQA_C, a new consensus method based on MUfoldQA_S. In CASP12, MUfoldQA_C ranked No. 1 in top 1 model GDT-TS loss on the select-20 QA category and No. 2 in the average difference between the predicted and true GDT-TS value of each model for both select-20 and best-150 QA categories. The PSICA server is freely available at http://qas.wangwb.com/∼wwr34/mufoldqa/index.html.


INTRODUCTION
The three-dimensional (3D) structure of a protein is essential in studying its functions (1). Computational 3D protein structure prediction is important because experimental methods, including X-ray crystallography, electron microscopy and nuclear magnetic resonance (NMR), are all costly and time consuming (2). Predicting 3D structures using computational methods can be much faster and cheaper. However, the accuracy of predicted models can vary greatly for different targets and different prediction methods. Therefore, it is vital to find a reliable method to evaluate the quality of predicted models (3).
Over the past 20 years, many protein model quality assessment (QA) methods have been proposed (4,5). There are two basic approaches: single-model QA methods (6-10), which are able to evaluate a single model's quality, and multi-model QA methods (11,12), which require a pool of models to evaluate the quality of one or more models in the pool. Multi-model QA methods have outperformed single-model QA methods in recent CASPs (4,5,13,14). However, the size and quality of the model pool used by multi-model methods have a great impact on their QA results (15). Single-model QA methods use potential functions and/or machine learning. Potential functions include physics-based potential functions and knowledge-based potentials (16). Machine learning has been used to aggregate various potential functions to achieve improved results.
Quasi-single-model QA methods (17) try to combine the advantages of both types of methods. They employ the 'consensus' idea from multi-model QA methods, but do not require a pool of models as input. Instead, they generate their own reference models. Quasi-single-model QA methods have achieved good results in recent CASPs, comparable to multi-model QA methods. However, generating a good pool of reference models and making the best use of multiple fragments of known protein structures remain a challenge.
After years of development, state-of-the-art QA methods are becoming complex and difficult for users to implement. The PSICA (Protein Structural Information Conformity Analysis) web service is intended to make some of the best existing QA methods available to the public. It has an intuitive user interface that runs on all mainstream web browsers. Users do not need to run any local software, plugin, Java Applets or ActiveX.
PSICA is designed to evaluate how much a tertiary model of a given protein primary sequence conforms to the known protein structures of similar protein sequences, and to assess the quality of predicted protein models. PSICA implements the MUfoldQA_S method (18), an efficient state-of-the-art quasi-single-model QA method. In CASP12, MUfoldQA_S ranked No. 1 in the protein model QA select-20 category in terms of the difference between the predicted and true GDT-TS value of each model. MUfoldQA_S calculates quality scores based on templates, which are protein fragments with known 3D structures and sequences similar to the sequence of the predicted model. For a given predicted 3D model, PSICA generates (i) predicted global GDT-TS value; (ii) interactive comparison between the model and other related known protein structures; (iii) visualization of the predicted local quality of the model; and (iv) JSmol rendering of the model. The GDT-TS value is a popular indicator of similarity between two protein 3D structures (in our case, predicted models and observed structures). It is calculated by computing the percentage of corresponding C-alpha atom pairs whose distance falls within the cut-off values of 1, 2, 4 or 8 Å after superimposing the two protein structures, and then averaging those four percentage values.
Additionally, PSICA implements MUfoldQA_C (18), a multi-model QA method based on MUfoldQA_S. MUfoldQA_C uses MUfoldQA_S results as weights in a consensus approach and lets better models contribute more to the final QA result. In CASP12, MUfoldQA_C ranked No. 1 in top 1 model GDT-TS loss in the select-20 QA category and No. 2 in the average difference between the predicted and true GDT-TS value of each model for both select-20 and best-150 QA categories.
In the rest of this paper, after a brief overview of the algorithms, the implementation of the web service is presented. Then, a detailed description of inputs and results is given. Finally, experimental results on benchmark datasets are presented to show its advantages over existing servers.
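The GDT-TS definition above can be illustrated with a minimal Python sketch. It assumes the two structures have already been superimposed and that the per-residue C-alpha distances are given as a list; the function name `gdt_ts` and the example distances are illustrative only. Full GDT-TS programs typically search over many superpositions to maximize each percentage, which this sketch does not attempt.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Return GDT-TS in [0, 1] from per-residue C-alpha distances (in angstroms).

    Assumption: the two structures are already superimposed and the
    distances are listed for corresponding residue pairs.
    """
    if not distances:
        raise ValueError("at least one residue pair is required")
    n = len(distances)
    # Fraction of residue pairs within each cut-off value.
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    # GDT-TS is the average of the four fractions.
    return sum(fractions) / len(cutoffs)


# Illustrative distances for a five-residue fragment (made-up values).
example_distances = [0.5, 1.5, 3.0, 6.0, 10.0]
example_score = gdt_ts(example_distances)
print(round(example_score, 3))
```

With these made-up distances, one, two, three and four of the five pairs fall within 1, 2, 4 and 8 Å respectively, so the four fractions are 0.2, 0.4, 0.6 and 0.8.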

Algorithm overview
MUfoldQA_S is a quasi-single-model QA method that predicts the GDT-TS value between a protein model and its native structure. This method calculates the model QA score based on fragments of other known protein structures with similar primary sequences, without building full protein models.
The input to MUfoldQA_S is the amino acid sequence of the target protein (TargetSeq_0) and a predicted model (Decoy_0). Its main steps are as follows:
(i) Use Blast (19) to query our in-house protein database (20) to find a set of proteins with similar sequences. Blast first uses the NR sequence database to generate a checkpoint file and then uses this file as input to search the PDB sequence database for similar sequences. We call the set of similar sequences found by Blast Seq_Blast.
(ii) Sort Seq_Blast according to the heuristic score $T = (3 - \log_{10} E) \cdot I \cdot C$, where $I$ is the sequence identity, i.e. the extent to which two amino acid sequences have the same residues at the same positions in an alignment, expressed as a percentage; $C$ is the ratio of the template length to the target sequence length; and $E$ is the E-value. The formula combines the three factors despite their different scales. The term $(3 - \log_{10} E)$ is designed to return a positive value for all $E < 1000$, which is a good trade-off between covering the majority of cases and maintaining good performance.
(iii) Select the top 10 similar sequences (referred to as Seq_Blast_T10) with the highest T-scores.
(iv) Repeat steps (i)-(iii) using HHsearch (21) in place of Blast to get another set of similar sequences, referred to as Seq_HH_T10. HHsearch searches a database of profile HMMs derived from the PDB sequences.
(v) Merge Seq_Blast_T10 and Seq_HH_T10 to get Seq_T20, without removing any of these sequences even if some sequences are from the same protein.
(vi) For each C-alpha position $j$ on template $k$ from the sequence pool Seq_T20, compare its residue with position $j$ on TargetSeq_0 from the input and calculate a BLOSUM-based (22) weight $W_{kj}$.
(vii) Retrieve the corresponding 3D coordinates of all sequences in Seq_T20 from the PDB database, referred to as Structure_T20.
(viii) Compare the 3D structure of the predicted model (Decoy_0) with each structure in Structure_T20 to calculate a GDT-TS value $S_k$.
(ix) The MUfoldQA_S local score $H_j$ at position $j$ is the average of $S_k$ weighted by $W_{kj}$.
(x) The MUfoldQA_S global score is the simple average of all $H_j$.
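The ranking heuristic of step (ii) and the score aggregation of steps (ix) and (x) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MUfoldQA_S implementation: the dictionary keys, the function names, the flooring of an E-value of zero, and the handling of positions with zero total weight are all hypothetical choices, and the example numbers are made up.

```python
import math


def t_score(e_value, identity, coverage):
    """Heuristic T = (3 - log10 E) * I * C from step (ii)."""
    # Assumption: an E-value of exactly 0 is floored to avoid log10(0).
    e_value = max(e_value, 1e-300)
    return (3.0 - math.log10(e_value)) * identity * coverage


def top_templates(hits, k=10):
    """Keep the k hits with the highest T-scores, as in step (iii)."""
    ranked = sorted(
        hits,
        key=lambda h: t_score(h["e_value"], h["identity"], h["coverage"]),
        reverse=True,
    )
    return ranked[:k]


def local_scores(s, w):
    """H_j = sum_k W_kj * S_k / sum_k W_kj, as in step (ix).

    s is a list of template GDT-TS values S_k; w[k][j] is the weight W_kj.
    """
    scores = []
    for j in range(len(w[0])):
        total_weight = sum(w[k][j] for k in range(len(s)))
        if total_weight == 0:
            scores.append(0.0)  # Assumption: no template covers position j.
        else:
            weighted = sum(w[k][j] * s[k] for k in range(len(s)))
            scores.append(weighted / total_weight)
    return scores


def global_score(h):
    """Simple average of all local scores H_j, as in step (x)."""
    return sum(h) / len(h)


# Made-up search hits: identity as a percentage, coverage as a ratio.
hits = [
    {"name": "a", "e_value": 1e-10, "identity": 40.0, "coverage": 0.9},
    {"name": "b", "e_value": 1e-3, "identity": 90.0, "coverage": 0.5},
]
best = top_templates(hits, k=1)

# Two templates, two positions (made-up S_k and W_kj values).
h = local_scores([0.8, 0.4], [[1.0, 3.0], [1.0, 1.0]])
g = global_score(h)
```

In this toy input, hit "a" outranks hit "b" because its much smaller E-value outweighs its lower identity, which shows how the three factors trade off in the product.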
MUfoldQA_C uses the information that MUfoldQA_S generates to improve the consensus result. MUfoldQA_C takes in a group of predicted models and runs MUfoldQA_S on each one to generate its local and global scores. Then, a certain number of top models are selected as the reference models according to their global scores. The final QA score for each predicted model is the weighted average of the pairwise GDT-TS values between the model and each reference model, weighted by the local score of MUfoldQA_S. For more details about the database and parameters, please refer to our paper (18).
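The consensus idea can be sketched as follows. This is a simplified illustration, not the published algorithm: it weights each reference model by its global score rather than by local scores, it excludes a model from its own reference set, and the number of reference models (`n_refs`) is an arbitrary placeholder. All names and numbers are hypothetical.

```python
def consensus_scores(global_scores, pairwise_gdt, n_refs=2):
    """Weighted-consensus sketch.

    global_scores: dict mapping model name to its single-model global score.
    pairwise_gdt:  dict of dicts with the GDT-TS between two models.
    Assumption: reference weights are global scores (a simplification).
    """
    ranked = sorted(global_scores, key=global_scores.get, reverse=True)
    references = ranked[:n_refs]
    result = {}
    for model in global_scores:
        # Assumption: a model is not compared with itself.
        used = [r for r in references if r != model]
        total_weight = sum(global_scores[r] for r in used)
        if total_weight == 0:
            result[model] = 0.0
            continue
        weighted = sum(global_scores[r] * pairwise_gdt[model][r] for r in used)
        result[model] = weighted / total_weight
    return result


# Made-up scores for three models of one target.
single_model = {"A": 0.8, "B": 0.6, "C": 0.2}
pairs = {
    "A": {"B": 0.7, "C": 0.3},
    "B": {"A": 0.7, "C": 0.4},
    "C": {"A": 0.3, "B": 0.4},
}
consensus = consensus_scores(single_model, pairs, n_refs=2)
```

Because the weights come from the single-model scores, a reference model judged to be better pulls the consensus score more strongly toward its own similarity to the model being evaluated.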

Implementation
On the PSICA web page, after the user submits a task either by filling out a form on the web page or through an API, the task data are stored in a queue managed by the server. When a task is executed, the server runs Blast and HHsearch to find similar sequences using the target sequence. When a compressed protein model file is submitted, the server unpacks the compressed file to get a set of predicted models. Then, for each model, MUfoldQA_S is run to calculate a global QA score and local QA scores. These results and the original task information are stored in functionality expansion APIs for other programs to use. If the MUfoldQA_C add-on is enabled by the user, MUfoldQA_C scores are computed using MUfoldQA_S data from the APIs. Finally, all results are presented and visualized in a user-friendly manner.
The server backend is written in PHP. All frontend user interface and interactive visualization components are implemented in HTML5/JavaScript. No browser plug-ins, Java Applets or ActiveX are required. The task scheduler of the server is written in Go without using any SQL statements, to eliminate SQL injection. MUfoldQA_S is mainly written in Octave/MATLAB. In addition to the third-party software Blast and HHsearch, the server also uses TMscore (23,24) to calculate GDT-TS values.

Web interface
PSICA requires the input of a protein sequence and one predicted model file in PDB format. Optionally, multiple predicted models can be submitted together in a *.tar.gz file as used in CASP. For users who want to generate their own tarball files, command line instructions for Linux and MacOS and executable files for Windows are provided to simplify the process. An optional MUfoldQA_C add-on is available, which is useful for QA on a set of predicted models for the same target sequence. The other optional inputs include the Target Name field, which helps the user to keep track of different submitted tasks, and the Email field, which allows the user to receive a notification email after the task is finished. The notification email includes a text version of the result and a link to the full report. If the user chooses not to provide an email address, the user can still retrieve the results via either the task status page or the task receipt file generated (Figure 1) during the submission process. Users who are on a public computer, or who simply do not wish to receive the receipt file, can select 'Disable receipt file for this task' when submitting the task. For users who need a quick demonstration of the server, a sample input with instructions is provided. The sample comes from the CASP12 target T0865 stage 1 (select-20) QA task. T0865 is the C-terminal coiled-coil domain of CIN85 (PDB code: 2N64). At 75 amino acids, it is one of the shortest targets in CASP12 and thus minimizes the download, upload and computation time needed for the demonstration.
When a user clicks the 'Submit This Job' button, the input is first verified on the frontend and then validated on the backend. If the submission is successful, a task receipt file is generated to make a local copy of the basic information of the task and the URL to retrieve the result. Then the webpage automatically redirects to the result page. If the task is still running, this page shows the basic task information (Figure 2). In addition to the information submitted by the user, this page also includes fields indicating whether the job is running or waiting in the queue ('CurrentStatus'), the time at which the user submitted this task ('SubmitTime'), and the time at which this task left the waiting queue and started running ('StartTime'). When the task is running, the 'TaskProgress' section shows a list of subtasks and their statuses.
When the task is finished, the result page displays a table showing a summary of the results. From left to right, the columns are the visualization of the local quality, the name of the predicted model, the global score (ranging from 0 to 1, the higher the better), and the link to the full report of this predicted model. The table is sorted by global score by default but can be changed to be sorted by decoy name. If the user has enabled the MUfoldQA_C add-on, an additional column of MUfoldQA_C results (ranging from 0 to 1, the higher the better) is added (Figure 3), and the results are sorted by MUfoldQA_C scores. A user can export the scores to a CSV file by clicking on '[Export to CSV]' or simply print the page with '[Print This Page]'. Below the summary table is the task information table, which lists information such as when the task started and finished, what the sequence is, and the maximum, mean and minimum sequence identities of the templates.
In the full report, the title shows the model name and its global quality, followed by the interactive JSmol view (Figure 4A) of the model. The JSmol view uses the mouse to control rotation and includes widgets to change the way the model is colored and rendered, and to export the model file or take a snapshot of the model.
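For users who prefer scripting the tarball step, a *.tar.gz archive of model files can be produced with the Python standard library, as in the sketch below. The archive layout used here (a single top-level folder named after the model directory) is an assumption for illustration; the layout PSICA actually expects should be taken from the instructions provided on the server. The function name `pack_models` and the paths in the commented call are hypothetical.

```python
import tarfile
from pathlib import Path


def pack_models(model_dir, archive_path):
    """Pack every regular file in model_dir into a gzip-compressed tarball.

    Assumption: files are stored under one top-level folder named after
    model_dir; adjust this if the server expects a different layout.
    Returns the list of packed file names.
    """
    model_dir = Path(model_dir)
    # Materialize the file list first so the archive itself is never included.
    files = sorted(p for p in model_dir.iterdir() if p.is_file())
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            tar.add(path, arcname=f"{model_dir.name}/{path.name}")
    return [p.name for p in files]


# Example call (hypothetical paths):
# pack_models("T0865", "T0865.tar.gz")
```

The equivalent on a Linux or MacOS command line would normally be a single `tar` invocation with gzip compression.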
Next to the JSmol view is the interactive visualization for comparing models with templates (Figure 4B). The visualization consists of a pair of 3D C-alpha backbone models superimposed on each other. The blue one represents the predicted model and the red one represents the template. The drop-down menu allows users to select from a pool of ...
W448 Nucleic Acids Research, 2019, Vol. 47, Web Server issue
... (Figure 4C). It is an interactive 3D model of the C-alpha backbone in different colors that represent different local qualities. The local quality is calculated for each C-alpha position. This graph shows which sections might be ill-folded and helps the user gain insight into how to improve the predicted structure. Finally, a visualization of the distance matrix (Figure 4D) is provided. Contact maps are of great importance in protein structure prediction. A contact map is a 2D binary matrix in which each element indicates whether the distance between two amino acids falls within a cut-off threshold. A contact map can be derived from the distance matrix based on a cut-off threshold. Instead of plotting the multiple contact maps that a user might need, the 3D map allows a user to interactively explore how the contour separating contact and non-contact regions changes with different cut-off values by simply moving the mouse.
For developers who want to integrate PSICA into their own software, a script is provided to demonstrate how to interact with the PSICA server, including submitting a task, checking the task status and retrieving the final results, in Python and Octave/MATLAB. Since the API for PSICA is simple to use, the script can also be easily adapted to most programming languages that support system calls. In fact, we have already used PSICA in some of our QA methods in CASP13.
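The relationship between a distance matrix and a contact map can be made concrete with a short Python sketch. The coordinates and the default cut-off of 8 Å below are illustrative assumptions, and the function names are hypothetical; the sketch only shows how one distance matrix yields different contact maps as the cut-off changes.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between C-alpha coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(dist, cutoff=8.0):
    """Binary matrix: 1 where the distance is within the cut-off, else 0."""
    return [[1 if d <= cutoff else 0 for d in row] for row in dist]


# Three made-up C-alpha positions (angstroms).
ca_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
dist = distance_matrix(ca_coords)

# The same distance matrix gives different contact maps at different cut-offs.
contacts_8 = contact_map(dist, cutoff=8.0)
contacts_12 = contact_map(dist, cutoff=12.0)
```

Here the first and third positions are 12 Å apart, so they are not in contact at the 8 Å cut-off but are at 12 Å, which is the kind of change the interactive 3D map lets a user explore continuously.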

Benchmark results
To gain an unbiased evaluation of the performance, we participated in CASP12 under the group names MUfoldQA_S and MUfoldQA_C. The CASP (Critical Assessment of Techniques for Protein Structure Prediction) experiment is a world-wide competition held every two years since 1994. It is designed to provide an objective evaluation of the state-of-the-art methods for protein structure prediction. In recent CASPs, the QA task consists of two stages. In stage 1, each QA group was given up to 20 selected predicted models (Select-20) ranging from good to bad. In stage 2, each QA group was given up to 150 top models (Best-150) selected by the naïve consensus algorithm.
Accuracy. When blindly tested in CASP12, MUfoldQA_S (scores shown in the 'Global Score' column in the result summary page in Figure 3) and MUfoldQA_C (scores shown in the 'Add-on: MUfoldQA C' column in the result summary page in Figure 3) achieved good results. Table 1 shows the performance comparison between them and other methods in terms of GDT-TS differences between predicted and true values, averaged over up to 70 targets. All data are from the CASP official website.
For the select-20 QA category, which is similar to the practical protein structure prediction situation in which a small number of predictions are generated and their qualities vary greatly, MUfoldQA_S and MUfoldQA_C performed significantly better than other methods in terms of average GDT-TS differences, ranking No. 1 and No. 2, respectively, and outperforming the third place by 35.8% and 32.0%. For the Best-150 QA category, MUfoldQA_C ranked No. 2, outperforming the next best, ModFOLD6_cor, by 18.3%. MUfoldQA_S ranked third among all single-model and quasi-single-model QA methods. Furthermore, PSICA was tested using the recently released CASP13 dataset. Only 20 targets (out of a total of 79 valid targets) can be used for QA performance evaluation, because only for these targets are both the true GDT-TS of each decoy model and the performance of each group available. PSICA is compared with other publicly available servers/tools and the result is shown in Table 2. The scores of other methods are from the CASP official website. The result shows that MUfoldQA_C outperforms the latest version of ModFOLD7 by 31% in the select-20 QA category and 28% in the Best-150 QA category. MUfoldQA_S also outperforms the MULTICOM series and the ProQ series in both categories.
Speed. PSICA is much faster than other QA methods, as shown by a comparison of the execution times of PSICA and other methods on the CASP12 best-150 dataset of 70 targets. The execution times of other methods were obtained from the CASP official website and were calculated as the duration between the time the server received the task and the time the server submitted the result. Figure 5 shows a comparison of execution time between MUfoldQA_S (PSICA with the MUfoldQA_C add-on disabled), MUfoldQA_C (PSICA with the MUfoldQA_C add-on enabled) and other QA groups. MUfoldQA_S, on average, uses 2,765 s to evaluate a target with 150 decoys. MUfoldQA_C and ModFOLD6 (15) are slower, using 8,694 and 9,961 s, respectively. QASproCL and iFold 2 are much slower, using 23,518 s (7.5 times slower than MUfoldQA_S) and 148,476 s (52.7 times slower than MUfoldQA_S), respectively.
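The two kinds of accuracy measures referred to in this paper, the difference between predicted and true GDT-TS and the top 1 model GDT-TS loss, can be sketched for a single target as below. The exact formulas are assumptions based on common usage: the difference is taken as a mean absolute difference over the models of a target, and the loss as the gap between the true GDT-TS of the best model in the pool and that of the model ranked first by the QA method. All names and numbers are made up.

```python
def mean_abs_difference(predicted, true):
    """Mean absolute difference between predicted and true GDT-TS for one target."""
    return sum(abs(predicted[m] - true[m]) for m in predicted) / len(predicted)


def top1_loss(predicted, true):
    """True GDT-TS of the best model minus that of the top-ranked model."""
    picked = max(predicted, key=predicted.get)
    return max(true.values()) - true[picked]


# Made-up scores for three models of one target.
predicted_scores = {"m1": 0.70, "m2": 0.55, "m3": 0.30}
true_scores = {"m1": 0.60, "m2": 0.65, "m3": 0.35}

diff = mean_abs_difference(predicted_scores, true_scores)
loss = top1_loss(predicted_scores, true_scores)
```

In this toy case the QA method ranks m1 first although m2 is truly the best model, so the loss is positive; a per-target value like this would then be averaged over all targets.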
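The "times slower" figures quoted above appear to correspond to the ratio of execution times minus one; this reading is an assumption, checked in the short sketch below using only the average times reported in the text.

```python
# Average execution times in seconds, as reported in the text.
times = {
    "MUfoldQA_S": 2765,
    "MUfoldQA_C": 8694,
    "ModFOLD6": 9961,
    "QASproCL": 23518,
    "iFold 2": 148476,
}

baseline = times["MUfoldQA_S"]

# Assumption: "x times slower" means (time / baseline) - 1.
times_slower = {name: t / baseline - 1.0 for name, t in times.items()}

for name, value in times_slower.items():
    print(f"{name}: {value:.1f} times slower than MUfoldQA_S")
```

Under this reading, the values for QASproCL and iFold 2 round to the 7.5 and 52.7 quoted in the text.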

CONCLUSION
In this paper, we have presented PSICA, a new web service that evaluates predicted protein models by analyzing their conformity to known protein structures. The service is based on a top quasi-single-model QA method and a top multi-model QA method in CASP12. It runs faster than other existing servers.
For developers of protein tertiary structure prediction methods, PSICA could be easily integrated into their prediction pipeline. Furthermore, PSICA provides interactive GUIs to visualize various aspects of the predicted protein model, including interactive comparison between the predicted model and other known protein structures, visualization of the local quality of the predicted model, visualization of its distance matrix and JSmol rendering of the model.