GraphQA: protein model quality assessment using graph convolutional networks

Abstract

Motivation: Proteins are ubiquitous molecules whose function in biological processes is determined by their 3D structure. Experimental identification of a protein's structure can be time-consuming, prohibitively expensive and not always possible. Alternatively, protein folding can be modeled using computational methods, which however are not guaranteed to always produce optimal results. GraphQA is a graph-based method that estimates the quality of protein models and possesses favorable properties such as representation learning, explicit modeling of both sequential and 3D structure, geometric invariance and computational efficiency.

Results: GraphQA performs similarly to state-of-the-art methods despite using a relatively low number of input features. In addition, the graph network structure provides an improvement over the architecture used in ProQ4 operating on the same input features. Finally, the individual contributions of GraphQA components are carefully evaluated.

Availability and implementation: PyTorch implementation, datasets, experiments and a link to an evaluation server are available through this GitHub repository: github.com/baldassarreFe/graphqa.

Supplementary information: Supplementary data are available at Bioinformatics online.

Local Distance Difference Test (LDDT)
The Local Distance Difference Test (LDDT) is a residue-level score that does not require alignment of the structures; instead, it compares the local neighborhood of every residue in the decoy and in the native structure. If we define the neighborhood of a residue as the set of its contacts, i.e. the set of other residues that lie within a certain distance from it, we can express the quality of that residue as the percentage of contacts that it shares with the corresponding residue in the native structure.

Fig. S2: Example of LDDT scoring for residue 7: the residues within a radius R1 are {6, 8, 10} for the native structure (left) and {6, 8} for the decoy (right); at a radius R2 we have {3, 6, 8, 9, 10, 11} for the native structure (left) and {3, 6, 8, 9, 10} for the decoy (right).
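The contact-preservation idea above can be sketched in a few lines of code. This is an illustrative simplification, not the official LDDT implementation: the contact sets are taken directly from the Fig. S2 example, and the radii R1 and R2 are placeholders for the actual distance thresholds.

```python
# Toy sketch of per-residue LDDT-style scoring: the score of a residue is the
# fraction of its native contacts that are preserved in the decoy, averaged
# over several cutoff radii. Contact sets are taken from the Fig. S2 example.

def residue_score(native_contacts, decoy_contacts):
    """Fraction of native contacts preserved in the decoy, averaged over radii."""
    fractions = []
    for radius in native_contacts:
        native = native_contacts[radius]
        decoy = decoy_contacts[radius]
        fractions.append(len(native & decoy) / len(native))
    return sum(fractions) / len(fractions)

# Residue 7 from Fig. S2, at two radii:
native = {"R1": {6, 8, 10}, "R2": {3, 6, 8, 9, 10, 11}}
decoy = {"R1": {6, 8}, "R2": {3, 6, 8, 9, 10}}
score = residue_score(native, decoy)  # (2/3 + 5/6) / 2 = 0.75
```

For residue 7, the two radii preserve 2 of 3 and 5 of 6 native contacts respectively, giving a score of 0.75.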

S2.1 Datasets and software
We consider all decoys of all targets included in CASP 9-13, excluding targets that have been canceled by the organizers (table S2). For each target, we consider the native models available on the CASP website as the ground-truth tertiary structure. The following auxiliary tools and programs are used to preprocess the data downloaded from CASP:
• We compare each decoy to its corresponding native structure and obtain the ground-truth scores that we use to train and evaluate our method:
  • TM-score (https://zhanglab.ccmb.med.umich.edu/TM-score) to evaluate TM-score, GDT_TS and GDT_HA,
  • lDDT (https://swissmodel.expasy.org/lddt) to evaluate global and local LDDT scores,
  • Voronota (https://bitbucket.org/kliment/voronota) to evaluate global and local CAD scores.
• For each target sequence, we run the Jackhmmer (http://hmmer.org/documentation.html) tool for multiple-sequence alignment against the Uniref50 dataset.

S2.2 Protein graphs statistics
The cutoff value dmax determines which edges are included in the graph and, consequently, its connectivity. A low cutoff implies a sparsely connected graph, with few edges and long paths between nodes. A higher cutoff yields a denser graph with more edges and shorter paths.
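The effect of dmax on connectivity can be illustrated with a minimal sketch (not the paper's implementation): residues are nodes, and an edge connects two residues whenever their 3D distance is below the cutoff. The toy coordinates below are made up for illustration.

```python
# Illustrative sketch of how the cutoff d_max controls graph connectivity:
# an edge connects residues i and j whenever their Euclidean distance < d_max.
import numpy as np

def contact_edges(coords, d_max):
    """Return the set of undirected edges (i, j), i < j, within d_max."""
    n = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return {(i, j) for i in range(n) for j in range(i + 1, n) if dist[i, j] < d_max}

# Toy chain of 5 "residues" spaced 4 Angstrom apart along a line:
coords = np.array([[0.0, 0, 0], [4, 0, 0], [8, 0, 0], [12, 0, 0], [16, 0, 0]])
sparse = contact_edges(coords, d_max=5.0)   # only consecutive residues: 4 edges
dense = contact_edges(coords, d_max=10.0)   # also (i, i+2) pairs: 7 edges
```

Doubling the cutoff from 5 to 10 Angstrom nearly doubles the edge count on this toy chain, which is the density effect described above.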

S3 GraphQA architecture
In this section, we illustrate in more detail the structure of the GraphQA architecture, as well as the hyperparameter space that was explored to optimize performance on the validation set.

S3.1 Message-passing layers
Within GraphQA, a protein structure is represented as a graph whose nodes correspond to residues and whose edges connect interacting pairs of amino acids. At the input, the features of the i-th residue are encoded in a node feature vector v i . Similarly, the features of the pairwise interaction between residues i and j are encoded in an edge feature vector e i,j . A global bias term u is also added to represent information that is not localized to any specific node/edge of the graph. With this graph representation, one layer of message passing performs the following updates.
1. For every edge i → j, the edge feature vector is updated using a function φ^e of the adjacent nodes v_i and v_j, of the edge itself e_i,j and of the global attribute u:

   e'_i,j = φ^e(e_i,j, v_i, v_j, u)

2. For every node i, features from incident edges {e'_j,i} are aggregated using a pooling function ρ^(e→v):

   ē'_i = ρ^(e→v)({e'_j,i})

3. For every node i, the node feature vector is updated using a function φ^v of the aggregated incident edges ē'_i, of the node itself v_i and of the global attribute u:

   v'_i = φ^v(ē'_i, v_i, u)

4. All edges are aggregated using a pooling function ρ^(e→u), and all nodes using a pooling function ρ^(v→u):

   ē' = ρ^(e→u)({e'_i,j}),  v̄' = ρ^(v→u)({v'_i})

5. The global feature vector is updated using a function φ^u of the aggregated edges ē', of the aggregated nodes v̄' and of the global attribute u:

   u' = φ^u(ē', v̄', u)

In GraphQA, all intermediate updates are implemented as Linear-Dropout-ReLU functions, and all aggregation functions use average pooling. The encoder and readout layers do not make use of message passing, effectively processing every node/edge in isolation. Message passing is instead enabled for the core layers of the network and enables GraphQA to process information within progressively expanding neighborhoods.
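The five updates above can be traced with a toy example. This is a deterministic sketch with scalar features, not the real GraphQA layer: each update function φ is a stand-in (a plain sum) for the learned Linear-Dropout-ReLU blocks, while the aggregation uses average pooling as in GraphQA.

```python
import numpy as np

# Toy sketch of one message-passing step with scalar node/edge/global features,
# mirroring the five updates described in the text. Each phi is a fixed toy
# function (sum of its arguments) so the example is fully deterministic.

def message_passing_step(v, e, u):
    """v: node features {i: float}; e: edge features {(i, j): float}; u: float."""
    phi = lambda *args: float(sum(args))  # stand-in for a learned update
    # 1. edge update: e'_ij = phi_e(e_ij, v_i, v_j, u)
    e_new = {(i, j): phi(e[(i, j)], v[i], v[j], u) for (i, j) in e}
    # 2. aggregate incoming edges per node with average pooling
    agg = {i: np.mean([e_new[(j, k)] for (j, k) in e_new if k == i] or [0.0])
           for i in v}
    # 3. node update: v'_i = phi_v(agg_i, v_i, u)
    v_new = {i: phi(agg[i], v[i], u) for i in v}
    # 4. aggregate all edges and all nodes
    e_bar = np.mean(list(e_new.values()))
    v_bar = np.mean(list(v_new.values()))
    # 5. global update: u' = phi_u(e_bar, v_bar, u)
    u_new = phi(e_bar, v_bar, u)
    return v_new, e_new, u_new

# A 3-node path graph 0 -> 1 -> 2 with unit features everywhere:
v, e, u = {0: 1.0, 1: 1.0, 2: 1.0}, {(0, 1): 1.0, (1, 2): 1.0}, 1.0
v2, e2, u2 = message_passing_step(v, e, u)
```

Stacking several such steps lets information flow along progressively longer paths, which is the "expanding neighborhoods" behavior of the core layers.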
The number of neurons in the core message-passing layers decreases from the input to the output. Specifically, it follows a linear interpolation between the input and output numbers reported below, rounded to the closest power of two. In preliminary experiments, we noticed that a progressive increase of the number of neurons across layers results in convergence issues, in contrast to the common practice of increasing the number of channels in Convolutional Neural Networks.

S3.2 Hyperparameter optimization
GraphQA has a low memory footprint (we can process mini-batches of 200 graphs on a 12GB GPU card) and is computationally efficient (training takes 2 hours using a single GPU), which allows for extensive hyperparameter search and ablation studies. Rather than performing a full-range grid search over the whole space, we perform a "guided" parameter search. First, we perform a few exploratory runs; then we focus on the most promising combinations of parameters and skip the others. For example, in some initial experiments we tried max pooling as an aggregation function ρ for the message-passing layers, but it performed poorly on the validation set and we decided to exclude it. Repeating this process of trial and elimination allows us to prune unpromising regions of the search space.
The final model is chosen as the one with the highest Rtarget on the validation set. The following considerations were made: • The range of values for dmax is chosen according to the average distance between the alpha carbons of two consecutive residues, which is approximately 5 Å, and the distance of approximately 10 Å beyond which residue-residue interactions become negligible. • The values for σ are chosen so that the RBF encoding of the edge length is approximately linear around 5 Å.
• The values for L are chosen to approximately match the average length of the shortest paths in the protein graphs at different cutoffs.
• In addition to what is described in section 2.3, we also tested an architecture with BatchNorm layers between the Dropout and ReLU operations, but apart from a significant slowdown we did not notice any improvement.

Hyperparameter      Searched values     Final value
Learning rate       10^-2, 10^-3        10^-3
Weight decay        10^-4, 10^-5        10^-5
Local weight λ_l    1, 5, 10            1
Global weight λ_g   1, 5, 10            1

For some of the final hyperparameters, the best value happens to lie at the boundary of the search space, e.g. the sizes of the message-passing layers. In those cases, we decided to stop the search and not extend the range further, for practical reasons: i) those parameters control the number of weights in the network, and the larger the network, the higher the risk of overfitting; ii) increasing some of those values to 768 or 1024 would have slowed down training and increased the memory footprint too much; iii) observing the trend from smaller to larger values, we noticed diminishing returns and assumed that no significant improvement could be achieved by searching further in that direction.

S4 Additional Studies
To complement the analysis reported in the main text, we perform additional studies on the effect of feature representation and on the generalization ability of the trained model.

S4.1 Distance and separation encoding
The feature vectors associated with the edges of the graph represent two types of distances between residues, namely spatial distance and separation in the sequence. In this study we evaluate the effect of different representations on validation performance. Spatial distance is the physical distance between amino acids, measured as the Euclidean distance between their β carbon atoms. We consider three possible encodings for this distance: • Absent: spatial distance is not provided as input; • Scalar: spatial distance is provided as a raw scalar value (in Ångstrom); • RBF: spatial distance is encoded using 32 RBF kernels, with unit variance, equally spaced between 0 and 20. Figure S4 reports the aggregated performances on CASP 11 of ten runs for each of the above. The rich representation of the RBF kernels seems to improve both LDDT and GDT_TS scoring performance, even though the effect is rather limited.
Fig. S4: Spatial distance: absent, encoded as a scalar, encoded using RBF kernels.
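The RBF encoding described above can be sketched as follows; this is an illustrative implementation of "32 kernels with unit variance, equally spaced between 0 and 20", not code from the GraphQA repository.

```python
import numpy as np

# Sketch of the RBF distance encoding: the scalar distance (in Angstrom) is
# expanded into 32 Gaussian radial basis functions with unit variance, whose
# centers are equally spaced between 0 and 20 Angstrom.
def rbf_encode(distance, num_kernels=32, d_min=0.0, d_max=20.0):
    centers = np.linspace(d_min, d_max, num_kernels)
    return np.exp(-0.5 * (distance - centers) ** 2)  # unit variance

features = rbf_encode(5.0)  # 32-dimensional feature vector
```

Compared with a raw scalar, this soft one-hot representation lets the network respond to different distance ranges with different weights, which may explain the small improvement reported in Figure S4.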
Separation is the number of residues between two amino acids in the sequence. We consider three possible encodings: • Absent: sequential separation is not provided as input; • Scalar: sequential separation is provided as a raw scalar value (positive integer); • Categorical: sequential separation is encoded as a one-hot categorical variable, according to the classes {0, 1, 2, 3, 4, 5:10, >10}, which are based on typical interaction patterns within a peptidic chain. Figure S5 reports the aggregated performances on CASP 11 of ten runs for each of the above. For local scoring, the choice of encoding makes little difference as long as separation is present in the input. On the other hand, the choice of categorical encoding over scalar encoding results in higher global scoring performance.
Fig. S5: Sequential separation: absent, encoded as a scalar, encoded as a categorical variable.
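The categorical encoding above can be sketched as a small bucketing function; the class boundaries follow the {0, 1, 2, 3, 4, 5:10, >10} scheme from the text, while the function name and layout are illustrative.

```python
import numpy as np

# Sketch of the categorical separation encoding: the sequence separation |i - j|
# is bucketed into the classes {0, 1, 2, 3, 4, 5:10, >10} and one-hot encoded.
CLASSES = ["0", "1", "2", "3", "4", "5:10", ">10"]

def encode_separation(i, j):
    sep = abs(i - j)
    if sep <= 4:
        idx = sep       # separations 0-4 get their own class
    elif sep <= 10:
        idx = 5         # bucket "5:10"
    else:
        idx = 6         # bucket ">10"
    one_hot = np.zeros(len(CLASSES))
    one_hot[idx] = 1.0
    return one_hot

# Residues 3 and 10 are 7 apart, falling in the "5:10" bucket:
vec = encode_separation(3, 10)
```

The buckets roughly track backbone interaction patterns (e.g. separations of 3-4 within an alpha helix), which may be why the categorical variant outperforms the raw scalar for global scoring.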

S4.2 Transmembrane vs. soluble proteins
In this study, we evaluate how the natural environment of a protein affects the predictive performance of our method. Targets from CASP 11 and 12 are classified as transmembrane or soluble according to Peters et al. (2015) and scored separately using GraphQA. Transmembrane proteins behave differently from soluble proteins as a consequence of the environment they are placed in. The former expose non-polar residues to the cellular membrane that surrounds their structure. On the contrary, the latter tend to present polar amino acids to the surrounding water-based solvent. Since this information is not explicitly provided to the model, we can compare the predictive performance between the two sets to check whether it has actually learned a flexible protein representation. The outcome of this evaluation is shown in table S4. While it is evident that GraphQA performs better on soluble proteins, which are more numerous in the training set, it also scores transmembrane proteins to an acceptable degree.

S5 Additional Metrics
The Quality Assessment literature is rich in metrics to measure the performance of scoring methods. In the main text we tried to keep the exposition uncluttered by only reporting figures for the most important metrics. Here we present a more extensive set of metrics that further describe our method and can serve as a future benchmark.
In the following, we use the Root Mean Squared Error (RMSE) between true and predicted scores. For local scores, the mean is taken over all residues; for global scores, it is the square root of the mean squared difference taken over all decoys.

Correlation coefficients (global). We compute the Pearson (R), Spearman (ρ) and Kendall (τ) correlation coefficients between the true and predicted scores across all targets and decoys, for each local score (LDDT, CAD) and for each global score (GDT_TS, GDT_HA, TM-score, LDDT, CAD). Since this evaluation does not distinguish between different decoys or between different targets, a high value for these metrics can be misleading. Thus, their per-model and per-target versions should also be checked.

Correlation coefficients per-model. For every decoy of every target, we compute the Pearson (R_model), Spearman (ρ_model) and Kendall (τ_model) correlation coefficients between true and predicted residue-level scores (LDDT, CAD). We then report the average correlation coefficients across all decoys of all targets. The per-model correlation coefficients estimate the ability of the network to rank individual residues by their quality and to distinguish correctly vs. incorrectly folded segments. Per-model correlation coefficients are computed only for local scores.

"main" -2020/8/3 -23:30 -page S8 -#15
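The per-model averaging scheme can be sketched as follows. This is an illustrative numpy-only implementation: Pearson is computed directly, Spearman as Pearson on ranks (valid in the absence of ties), Kendall is omitted for brevity, and the residue-level scores are made-up toy values.

```python
import numpy as np

# Sketch of per-model correlation averaging: for each decoy, correlate true vs.
# predicted residue-level scores, then average the coefficients over decoys.

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def spearman(x, y):
    # Pearson on ranks; exact only when there are no ties in x or y.
    rank = lambda a: np.argsort(np.argsort(a))
    return pearson(rank(x), rank(y))

decoys = [
    ([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]),  # (true, predicted) local scores
    ([0.3, 0.2, 0.7], [0.1, 0.3, 0.9]),
]
r_model = np.mean([pearson(t, p) for t, p in decoys])
rho_model = np.mean([spearman(t, p) for t, p in decoys])
```

The same average-over-groups pattern applies to the per-target coefficients, with decoys playing the role of residues and targets the role of decoys.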
Correlation coefficients per-target. For every target, we compute the Pearson (R_target), Spearman (ρ_target) and Kendall (τ_target) correlation coefficients between true and predicted decoy-level scores (GDT_TS, GDT_HA, TM-score, LDDT, CAD). We then report the average correlation coefficients across all targets. With reference to the funnel plots, this would be the correlation between the markers in every plot, averaged across all plots. The per-target correlation coefficients estimate the ability of the network to rank the decoys of a target by their quality and to select the ones with the highest global quality. Per-target correlation coefficients are computed only for global scores.

First Rank Loss (FRL) and First Rank Loss top-5 (FRL_5). For every target, we compute the difference in ground-truth scores between the true best decoy and the best decoy according to the predicted scores. We then report the average FRL across all targets. This represents the loss in (true) quality we would suffer if we were to choose a decoy according to our ranking. In the funnel plots (figures S6, S7 and S8), FRL can be visualized as the gap between the two vertical lines indicating the true best (green) and predicted best (red). FRL is computed only for global scores; for every target t:

   FRL_t = max_d s_t,d - s_t,d*,  where d* = argmax_d ŝ_t,d

with s_t,d the ground-truth score of decoy d of target t and ŝ_t,d its predicted score. FRL measures the ability to select a single best decoy for a given target. In our experiments, however, we noticed that FRL is extremely subject to noise, as it only considers top-1 decoys. Thus, we also compute FRL_5, which represents the minimum loss across the 5 top-scoring decoys of a target.
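FRL and FRL_5 for a single target can be sketched as follows; the score values are made up for illustration.

```python
import numpy as np

# Sketch of First Rank Loss (FRL) and FRL_5 for one target: FRL is the
# true-score gap between the truly best decoy and the decoy ranked first by
# the predictor; FRL_5 takes the smallest gap over the top-5 ranked decoys.
def first_rank_loss(true_scores, pred_scores, top_k=1):
    true_scores = np.asarray(true_scores, dtype=float)
    pred_scores = np.asarray(pred_scores, dtype=float)
    best_true = true_scores.max()
    top = np.argsort(-pred_scores)[:top_k]  # decoys ranked best by the predictor
    return float(best_true - true_scores[top].max())

true_gdt = [0.80, 0.75, 0.90, 0.60, 0.85]
pred_gdt = [0.70, 0.95, 0.65, 0.50, 0.90]
frl = first_rank_loss(true_gdt, pred_gdt)            # predictor picks decoy 1
frl5 = first_rank_loss(true_gdt, pred_gdt, top_k=5)  # true best is in top-5
```

Here the predictor's top-1 pick costs 0.90 - 0.75 = 0.15 in true GDT_TS, but since the truly best decoy appears among the top-5 predictions, FRL_5 is zero, illustrating why FRL_5 is less noisy than FRL.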
z-score (z). For every target, we consider the ground-truth distribution of global scores and compute the z-score of each decoy relative to the mean and variance of that distribution. For example, for GDT_TS:

   z_t,d = (GDT_TS_t,d - μ_t) / σ_t

where μ_t and σ_t are the mean and standard deviation of the ground-truth GDT_TS scores of the decoys of target t. Then, we use GraphQA's predictions to identify the best-scoring decoy of each target and average their z-scores. Notably, we do not remove outliers and do not set negative z-scores to zero, as commonly done in CASP. The reason is twofold: • Outlier removal was introduced in CASP4 when the quality of submitted decoys was relatively low. Therefore, a cleanup step was required to ignore the low-quality models that would otherwise skew this metric.
• Negative z-scores are only relevant if some QA method selects a worse-than-average decoy as the best model for a given target, otherwise they are never included in the mean. Since we only compare with top-performing QA methods this step is not necessary.
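The z-score evaluation described above can be sketched as follows; the per-target score lists are made up for illustration, and (per the text) no outlier removal or clipping of negative z-scores is applied.

```python
import numpy as np

# Sketch of the per-target z-score evaluation: each decoy's ground-truth score
# is standardized against its target's score distribution, and we report the
# z-score of the decoy the predictor ranks first, averaged over targets.
def zscore_of_predicted_best(true_scores, pred_scores):
    true_scores = np.asarray(true_scores, dtype=float)
    z = (true_scores - true_scores.mean()) / true_scores.std()
    best = int(np.argmax(pred_scores))  # decoy selected by the predictor
    return float(z[best])

targets = [
    ([0.2, 0.4, 0.6, 0.8], [0.1, 0.3, 0.9, 0.7]),  # predictor picks decoy 2
    ([0.5, 0.9, 0.7, 0.3], [0.6, 0.8, 0.2, 0.1]),  # predictor picks decoy 1
]
mean_z = np.mean([zscore_of_predicted_best(t, p) for t, p in targets])
```

A perfect selector would always pick the decoy with the largest z-score in each target, so higher mean z indicates better decoy selection.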

S6 CASP13 Additional Results
CASP13 is the most recent edition of CASP whose decoy structures are available at the time of writing. Specifically, the inputs of our model, i.e. target sequences and predicted tertiary structures, can be downloaded from the website. Since native structures have also been released, it is possible to compare our predictions with the ground-truth quality scores, obtained with respect to the target conformations. Furthermore, QA predictions submitted by other participants are available, enabling us to compare results. In this section we present additional results for methods and metrics that are excluded from the main text for the sake of brevity.

S6.1 Evaluation of GraphQA
In table S5 we report all metrics computed using our global quality predictions. As indicated, each metric is either: i) computed across all decoys of all targets, or ii) first computed on the decoys of each target and then averaged over targets. Similarly, in table S6 we report all metrics computed using our local quality predictions. These local metrics are either: i) computed across all residues, or ii) computed on the residues of each decoy and then averaged over all decoys of all targets. In table S8 and table S9 we report the z-score and first rank loss of the predicted-best decoy w.r.t. the true best for each target in CASP13, as ranked by the five types of global scores we consider.

S6.2 Comparing GraphQA to other quality assessment methods
Where possible, we compute evaluation metrics for all groups who participated in the quality assessment track in CASP13. The metrics for global quality predictions are shown in table S10, and metrics for local quality predictions are shown in table S11. All evaluation metrics are computed in-house using the official CASP13 submissions from the participating groups. All tables are relative to stage 2 predictions, according to the official categorization. The evaluation code is publicly available.

Decoy-level global quality predictions
For decoy-level predictions, we download the publicly available GDT_TS submissions made by every group for all decoys from CASP13. These predictions are directly comparable with the ground-truth values for GDT_TS computed using open-source tools (appendix S2.1). Furthermore, they correspond to one of the scores that GraphQA is trained to predict at the global level. Therefore, in table S10 we report the evaluation of all metrics described above for GDT_TS predictions. Finally, we produce funnel plots for all targets, where the true and predicted GDT_TS scores are drawn (figures S6, S7, and S8).

Residue-level local quality predictions
For residue-level predictions, comparing GraphQA with other methods is not as straightforward, as GraphQA is trained to predict LDDT and CAD scores, whereas submissions to CASP require estimating the distance, in Ångstrom, between a residue in the decoy and its correct position in the native structure under a GDT_TS superposition. Therefore, we convert our score predictions to "distances" using the following heuristic: Once the local quality predictions of all groups are expressed as distances, we compare them with ground-truth distances downloaded from the CASP portal. In table S11, the entries GraphQA-LDDT and GraphQA-CAD represent respectively the LDDT and CAD predictions of GraphQA converted into distance-like values. Notably, some groups are characterized by negative correlation coefficients, which we suspect is due to a submission mistake. In fact, those methods might predict a score in the range [0, 1] where 1 represents high quality and might have skipped the conversion to a distance-like value. However, we chose to treat their predictions as submitted and did not apply an extra conversion.

Table S10. Global quality prediction (GDT_TS): performance comparison between all quality assessment groups in CASP13, both single-model and consensus-based. All metrics are computed with respect to ground-truth GDT_TS scores, using the two strategies described in the text: global across targets and average of per-target metrics. For GraphQA, we use the decoy-level output corresponding to GDT_TS. For other QA groups, we use the publicly available submissions from the CASP portal and also report the group ID used in CASP. We consider predictions submitted for stage 2 decoys. Single-model methods are highlighted in bold, to separate them from consensus methods. Rows are sorted by R across all targets.
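For illustration only: a score-to-distance conversion common in the QA literature inverts the S-score relation, d = d0 * sqrt(1/S - 1). The exact heuristic used by GraphQA is the one described in the text and may differ from this sketch; the constant d0 = 5 Å is an assumption, not a value taken from the paper.

```python
import numpy as np

# Illustrative score-to-distance conversion (NOT necessarily GraphQA's exact
# heuristic): inverting the standard S-score, d = d0 * sqrt(1/S - 1), with an
# assumed d0 = 5 Angstrom. Predicted quality near 1 maps to a small distance,
# low quality maps to a large distance; clipping avoids division by zero.
def score_to_distance(s, d0=5.0, eps=1e-6):
    s = np.clip(np.asarray(s, dtype=float), eps, 1.0 - eps)
    return d0 * np.sqrt(1.0 / s - 1.0)

distances = score_to_distance([0.9, 0.5, 0.2])
```

With this mapping, a quality of 0.5 corresponds to exactly d0 (5 Å), and the conversion is monotonically decreasing in the score, preserving rank-based metrics such as Spearman's ρ up to sign.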

Table S11. Local quality prediction (residue-residue distance): performance comparison of all quality assessment groups in CASP13, both single-model and consensus-based. All metrics are computed with respect to ground-truth local distances as computed in CASP. Metrics are averaged using the two strategies described in the text: globally across all decoys and as an average of per-decoy metrics. Since GraphQA predicts LDDT and CAD scores, we transform these scores into distances using the heuristic formula described in the text. The QA methods at the bottom of the table that achieve negative correlation coefficients have most likely uploaded score-like values rather than distance-like values in their official submission to CASP. We consider predictions submitted for stage 2 decoys. For each group, we report the group ID used in CASP. Rows are sorted by ρ across all decoys; if ρ is negative, we consider its absolute value, since it better reflects the predictive quality of the corresponding method.