Unbiased curriculum learning enhanced global-local graph neural network for protein thermodynamic stability prediction

Abstract

Motivation: Proteins play crucial roles in biological processes, with their functions being closely tied to thermodynamic stability. However, measuring stability changes upon point mutations of amino acid residues using physical methods can be time-consuming. In recent years, several computational methods for protein thermodynamic stability prediction (PTSP) based on deep learning have emerged. Nevertheless, these approaches either overlook the natural topology of protein structures or neglect the inherent noisy samples resulting from theoretical calculation or experimental errors.

Results: We propose a novel Global-Local Graph Neural Network powered by Unbiased Curriculum Learning for the PTSP task. Our method first builds a Siamese graph neural network to extract protein features before and after mutation. Since the graph's topological changes stem from local node mutations, we design a local feature transformation module to make the model focus on the mutated site. To address model bias caused by noisy samples, which represent unavoidable errors from physical experiments, we introduce an unbiased curriculum learning method that effectively identifies and re-weights noisy samples during training. Extensive experiments demonstrate that our proposed method outperforms advanced protein stability prediction methods and surpasses state-of-the-art learning methods for regression prediction tasks.

Availability and implementation: All code and data are available at https://github.com/haifangong/UCL-GLGNN.


1 Introduction
Proteins play an essential role in most biological processes, and their functions are realized through their dynamic structures (Frauenfelder et al. 1988, Li et al. 2022). Gaining insights into protein functions through the dynamic changes of their attributes (e.g. three-dimensional structure, thermodynamic stability) can help us better understand the fundamentals of life (Park et al. 2004). For example, certain diseases result from a single amino acid residue alteration that leads to a significant difference in protein thermodynamic stability closely related to the disease's molecular mechanism (Hartl 2017). Recent years have witnessed the great success of protein 3D structure prediction based on deep learning (Jumper et al. 2021), which accelerates the traditional folding-structure prediction task from months to hours. By contrast, estimating the change of protein thermodynamic stability upon amino acid mutations using conventional physical approaches (Marabotti et al. 2021) remains time-consuming and laborious. Therefore, accurate computational approaches for protein thermodynamic stability prediction (PTSP) are needed, which will contribute to research on mutation-induced diseases and precision medicine.
PTSP aims to quantitatively predict the change in protein thermodynamic stability, denoted ΔΔG (Stefl et al. 2013, Pancotti et al. 2022), which is the difference between Gibbs free energy changes (ΔG). The Gibbs free energy change ΔG estimates the stability change of a protein from the unfolded state to the folded state. When a mutation occurs in an amino acid, it disrupts the interaction network of amino acid residues, leading to changes in thermodynamic stability. We refer to the folded protein without mutation as the wild-type, with Gibbs free energy change ΔG_w. Conversely, the folded protein carrying the amino acid mutation is called the mutant structure, with Gibbs free energy change ΔG_m. Thus, the difference between the Gibbs free energy changes is ΔΔG = ΔG_m − ΔG_w.
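A one-line worked example makes the sign convention concrete (the ΔG values below are hypothetical, chosen only for the arithmetic):

```python
# Hypothetical Gibbs free energy changes (kcal/mol); the values are
# illustrative only, not taken from any dataset used in this work.
dg_wild = -5.0    # folding free energy change of the wild-type, dG_w
dg_mutant = -3.5  # folding free energy change of the mutant, dG_m

# ddG = dG_m - dG_w: a positive value indicates the mutation
# makes folding less favorable, i.e. destabilizes the protein.
ddg = dg_mutant - dg_wild
print(ddg)  # 1.5 -> destabilizing mutation
```

A negative ΔΔG would, conversely, indicate a stabilizing mutation.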
Researchers have made great efforts (Pancotti et al. 2022) in the field of thermodynamic stability and developed several deep learning-based approaches for ΔΔG prediction (Pandurangan et al. 2017, Pucci et al. 2018, Montanucci et al. 2019, Li et al. 2020, Benevenuta et al. 2021). Unlike computationally demanding methods based on biophysical modeling, such as molecular dynamics simulation, deep learning-based methods that extract features from protein sequences and structures have entered the mainstream. However, two crucial problems remain unsolved on the way to more reliable predictions of thermodynamic stability upon point mutations: (i) The above-mentioned works either ignore the natural topology of proteins or neglect the importance of the mutated site, which is the essential cause of topological changes in mutant proteins. (ii) Noisy samples are unavoidable in the PTSP task, as experimentally measured ΔΔG values can be affected by the environment and human operation. Previous works have neglected these noisy samples, which harms model generalization and robustness.
To address the above-mentioned challenges, we propose a Global-Local Graph Neural Network enhanced with Unbiased Curriculum Learning (GLGNN-UCL) to predict changes in protein thermodynamic stability. GLGNN-UCL represents proteins as graphs, with amino acids as nodes and residue interactions as edges. We first construct a Siamese graph attention network (GAT) (Veličković et al. 2018) to predict ΔΔG based on the global feature; its satisfactory accuracy shows that geometric information is quite important. Still, the information of the single point-mutation site, which is responsible for alterations in residue interactions (i.e. graph topology), is not well considered. To address this issue, we devise a local feature transformation flow to enhance the model's ability to represent the features of the local mutated site.
More importantly, we propose a novel unbiased curriculum learning method to handle the inherent noisy samples in the PTSP task. We develop a simple yet effective hard-sample selection function that automatically identifies noisy samples and adjusts their weights, preventing the model from being influenced by noise. Our approach demonstrates state-of-the-art performance on common benchmarks compared to other methods. The contributions of this work are:

• We propose a framework named GLGNN-UCL to predict the change of protein thermodynamic stability upon point mutation. GLGNN-UCL exploits a Siamese graph neural network to represent the structure of the protein before and after the mutation. Following the logic of the nature of amino acid mutations, we use the local node feature to enhance the global feature representation and boost performance.

• We elaborate an unbiased curriculum learning approach to handle the intrinsic noisy samples in the thermodynamic stability prediction task, which effectively distinguishes and re-weights the noisy samples, thus preventing the model from being affected by noise.

• Extensive experiments on our benchmark demonstrate that GLGNN-UCL not only significantly exceeds the previous state-of-the-art methods for thermodynamic stability prediction but also outperforms methods designed to handle noisy samples in regression tasks.
2 Related work

Curriculum learning and regression
Curriculum learning (CL) (Bengio et al. 2009) stems from the idea that learning from easy to hard can improve the generalization ability of a model. Various works have demonstrated the merit of CL in computer vision (Gong et al. 2022) and natural language processing (Platanios et al. 2019).
Recently, Wang et al. (2021) proposed CurGraph, which addresses the graph classification task by estimating the complexity of each graph's topology. However, only a few works focus on training models for regression tasks with a curriculum, as the regression task differs from the classification task according to Yang et al. (2021). Castells et al. (2020) embedded CL into regression tasks by proposing the SuperLoss, which automatically decreases the contribution of samples with a large loss. Regression on imbalanced data is a common issue in the real world, especially in bioinformatics, yet most efforts are based on SMOTE (Torgo et al. 2013). Yang et al. (2021) proposed a deep imbalanced regression (DIR) framework that takes both label and feature distribution calibration into account. Nevertheless, DIR is mainly designed for tasks in computer vision and natural language processing and does not take the distance between regression targets into account.

Graph neural networks
Numerous studies have applied graph neural networks (GNNs) to biological problems, including protein design (Ingraham et al. 2019), feature representation learning (Jing et al. 2021), expression referring (Yang et al. 2020a,b), relationship prediction (Satorras et al. 2021), survival gene-path analysis (Liang et al. 2022), disease diagnosis (Xing et al. 2022), medical image analysis (Huang et al. 2022), and human action analysis (Yan et al. 2023). However, none of these works focus on point mutations. Our paper introduces a global-local GNN based on GAT for superior representation and transformation of local mutation-site features.

3 Methodology
Missense genetic mutations (i.e. a mistake in the DNA that results in the wrong amino acid) alter the corresponding amino acid residues in the protein sequence. The variation in physicochemical properties such as charge and hydrophobicity of the residue is very likely to affect the residue-interaction network. All residues in the neighborhood of the mutation site are forced to leave their original coordinates to accommodate the modified side-chain and form another stable conformation. We used the Rosetta (Alford et al. 2017) FastRelax protocol to obtain the initial protein structures before and after mutation. The aim of the protein thermodynamic stability prediction task is to quantify ΔΔG by learning efficient biophysical features from both the wild-type and mutant structures.

Global-local graph neural network
Considering that proteins exhibit a natural graph-like structure and that input protein structures are inherently paired, we develop a Siamese graph neural network to extract richly structured protein features. As protein mutations arise from point mutations, the graph neural network should be capable of concentrating on the mutated site and its surrounding regions. Thus, we take the graph attention network (GAT) (Veličković et al. 2018) as the backbone of the Siamese graph neural network to extract the initial graph representations of the proteins. To learn the common knowledge of the non-mutated protein nodes, the upper and lower branches of the Siamese graph network share the same weights. Formally, given a set of N protein node features h, each of dimension F, we first apply a shared attention mechanism to calculate the similarity ratios between a node and its neighboring nodes. For two nodes with indices i and j, the importance α_ij of node j's feature to that of node i is formulated as

α_ij = softmax_j( LeakyReLU( aᵀ [W h_i ∥ W h_j] ) ),

where W ∈ ℝ^{F′×F} is a shared transformation matrix, a ∈ ℝ^{2F′} is a scoring weight vector, ∥ denotes concatenation, and the softmax normalizes over the one-hop neighborhood N_i of node i. Then we aggregate each node feature with its one-hop neighbors based on their similarity ratios, and adopt a multi-head concatenation operation to stabilize the training process. The aggregated feature is

h′_i = ∥_{k=1}^{K} σ( Σ_{j∈N_i} α^k_ij W^k h_j ),

where K is the number of attention heads, α^k_ij denotes the attention coefficients of the kth attention head, and W^k denotes the transformation matrix of the kth head. Finally, we use average pooling after the GAT layers to generate global protein structure representations. Although the above Siamese graph attention network represents the structural mutation process of proteins more effectively than previous methods, it still lacks attention to the local mutated node, which is the root cause of changes in protein topology and thermodynamic stability. Thus, we further propose a novel, lightweight module named Local Feature Transformation Flow (LFTF) to enhance the model's ability to capture the local mutated node. Let x_a be the local feature vector ahead of the GAT layer with shape 1 × a, and x_b be the local feature vector behind the GAT layer with shape 1 × b; the refined local feature vector

Algorithm 1. Unbiased Curriculum Learning Algorithm.
Require: P = {p_1, …, p_B}, Y = {y_1, …, y_B}
1: P denotes the predictions of the samples in the mini-batch
2: Y denotes the labels of the samples in the mini-batch
3: B denotes the number of samples in the mini-batch; D denotes the queue that stores the samples' difficulties
4: for i = 1 to B do
5: Calculate the sample's difficulty h_i with Eq. (5) and store the value in queue D
6: end for
7: Calculate the threshold T to filter the hard samples with Eq. (6)
8: Define the loss L of the current mini-batch and the schedule S of the current mini-batch with Eq. (9)
9: for i = 1 to B do
10: Calculate the loss l_i of the sample
11: if h_i > T then

after the Local Feature Transformation (LFT) module f(·) is represented by y with shape 1 × b. This process can be written as

y = f(x_a) + x_b,

where f(·) denotes a fully connected layer with a input channels and b output channels. After that, we update the current node feature with y and feed this feature vector into the next GAT layer and LFT layer, repeating until the last layer of the graph neural network. By taking advantage of the LFTF module, the error during the training process can propagate more directly to the local node, so our model achieves better performance.
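To make the two components concrete, the following framework-agnostic numpy sketch implements one GAT attention head and the LFT step. The shapes, the random weights, and the residual combination `x_a @ W_f + x_b` inside `lft` are illustrative assumptions, not the trained GLGNN parameters.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(h, adj, W, a):
    """One GAT attention head. h: (N, F) node features; adj: (N, N) 0/1
    adjacency with self-loops; W: (F, Fp) transformation; a: (2*Fp,) scores."""
    z = h @ W
    fp = z.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]) splits into a left and right part
    e = leaky_relu((z @ a[:fp])[:, None] + (z @ a[fp:])[None, :])
    e = np.where(adj > 0, e, -np.inf)          # softmax only over one-hop neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha                    # aggregated features, attention weights

def lft(x_a, W_f, x_b):
    """Local Feature Transformation: y = f(x_a) + x_b. The residual addition
    is an assumption; the text only states that f maps a channels to b."""
    return x_a @ W_f + x_b

rng = np.random.default_rng(0)
N, F, Fp = 4, 6, 3
h = rng.normal(size=(N, F))
# path graph over 4 residues, plus self-loops
adj = np.eye(N) + np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
h_new, alpha = gat_head(h, adj, rng.normal(size=(F, Fp)), rng.normal(size=(2 * Fp,)))
y = lft(rng.normal(size=(1, F)), rng.normal(size=(F, Fp)), h_new[:1])
```

Each row of `alpha` is a probability distribution over that node's one-hop neighbors, and non-adjacent pairs receive exactly zero attention.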

Unbiased curriculum learning
To handle the unavoidable error in measured thermodynamic stability changes, a common phenomenon in both experiments and chemical calculations, we propose a novel unbiased curriculum learning (UCL) method to train the model end to end, as shown in Algorithm 1. In the following three sections, we elaborate on the key components of UCL: the difficulty metric function, the curriculum scheduler, and the loss function (Fig. 1).

Difficulty metric function
The difficulty metric function is crucial in curriculum learning. In previous works, researchers typically use the training loss of a sample as the difficulty metric. However, in graph regression tasks, a larger loss does not necessarily imply that the sample is harder than others, as the larger loss may result from model initialization or sample scarcity. For example, a sample with a larger ΔΔG might have a larger loss simply due to model initialization. To fairly select difficult samples from the current mini-batch, which contains samples with both large and small ΔΔG changes, we propose an unbiased hardness function that eliminates the influence of the ground-truth magnitude. Furthermore, to address the issue that the hardness value grows without bound when the ground truth (i.e. the denominator of Equation 5) is close to 0, we use the piece-wise function

H(x) = |x_pred − x_gt| / |x_gt|, if |x_gt| ≥ K;  H(x) = |x_pred − x_gt| / K, otherwise,   (5)

where the sample is denoted by x with ground-truth label x_gt, the predicted value is x_pred, the hardness of sample x is H(x), and K is the piece coefficient. Given the prior knowledge that the ground-truth ΔΔG values for amino acid mutations are approximately normally distributed (see Fig. 2), we set K to 1. Furthermore, as illustrated in Fig. 2, the model tends to overfit samples with ground-truth values around 0, as they constitute a large portion of the dataset. In this situation, samples with larger ground-truth values are likely to be assigned a larger loss. However, the difficulty of a sample may also be influenced by factors such as its intrinsic topological structure and node features, in addition to the ground-truth value; our unbiased design addresses this issue from another perspective. Based on the above unbiased measurement function, we design an adaptive threshold to assess each sample's difficulty according to the mean and deviation of sample difficulties in a mini-batch (Fig. 3):

T = h_avg + α · h_std,   (6)

where α is a hyper-parameter for hard-sample mining, set to 1 by default; the higher α is, the fewer samples are regarded as hard. The terms h_avg and h_std denote the mean and standard deviation of the difficulties in the current batch:

h_avg = (1/B) Σ_{i=1}^{B} h_i,  h_std = sqrt( (1/B) Σ_{i=1}^{B} (h_i − h_avg)² ),   (7)

where B is the number of samples in the current mini-batch.
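A minimal numpy sketch of the hardness metric and the adaptive threshold, assuming the piece-wise Eq. (5) is equivalent to flooring the denominator at K (the toy predictions and labels are made up):

```python
import numpy as np

def hardness(pred, gt, k=1.0):
    """Unbiased hardness H(x): a relative error whose denominator is
    floored at the piece coefficient k, so ground-truth values near
    zero cannot explode the score."""
    denom = np.maximum(np.abs(gt), k)
    return np.abs(pred - gt) / denom

def hard_threshold(h, alpha=1.0):
    """Adaptive threshold T = h_avg + alpha * h_std within a mini-batch."""
    return h.mean() + alpha * h.std()

pred = np.array([0.2, -1.5, 3.0, 0.1])   # toy predicted ddG values
gt = np.array([0.1, -1.0, 0.5, 0.2])     # toy ground-truth ddG values
h = hardness(pred, gt)
T = hard_threshold(h)
hard_mask = h > T                        # samples flagged as (potentially noisy) hard
```

On these toy values, only the third sample (relative error 2.5) exceeds the threshold and would be flagged.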

Scheduler design
After identifying difficult samples using the aforementioned methods, it is necessary to design a scheduler function that adaptively adjusts the importance of noisy samples. The question arises: should we learn from easy to hard (i.e. weight the hard samples from 0 to 1) or from hard to easy (i.e. weight the hard samples from 1 to 0)? Training the model from easy to hard may lead to overfitting on noisy samples, even though the model is not affected by difficult samples initially. On the other hand, learning from hard to easy, an anti-curriculum paradigm, offers several advantages: (i) The model can extract more topological information from hard samples on the amino acid graph in the early stages, thereby avoiding overfitting on noisy outliers in the final stage. (ii) During the initial training rounds, the model may not accurately identify difficult samples. This uncertainty could cause the model to misclassify easy samples as hard samples. As a result, learning

from easy to hard may lead to the inadvertent discarding of misjudged "hard samples" (which are actually easy samples) at the beginning, adversely affecting the model's generalization ability. Therefore, we choose to learn from hard to easy. Let E be the number of training epochs and e the current epoch; the anti-curriculum schedule is defined as

S = 1 − e / E,   (8)

so the weight assigned to hard samples decays from 1 to 0 over training.
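A sketch of the anti-curriculum schedule; the linear form S = 1 − e/E is reconstructed as the mirror of the easy-to-hard variant S = e/E mentioned in the ablation study:

```python
def anti_curriculum_schedule(e, E):
    """Hard-to-easy schedule S: hard samples start at full weight (S = 1)
    and are linearly down-weighted to 0 by the final epoch."""
    return 1.0 - e / E

# weight applied to hard-sample losses at the start, middle, and end of training
weights = [anti_curriculum_schedule(e, 50) for e in (0, 25, 50)]
print(weights)  # [1.0, 0.5, 0.0]
```

Easy samples are unaffected by S; only the samples flagged by the threshold T have their losses scaled.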

Loss function
By taking the above curriculum paradigm into account, let h_i be the difficulty of the ith sample in a mini-batch, obtained by Equation 5, and let x_i and y_i denote the predicted and ground-truth ΔΔG values of sample i, with per-sample regression loss l_i (e.g. the squared error (x_i − y_i)²). The re-weighted loss l′_i is defined as

l′_i = S · l_i, if h_i > T;  l′_i = l_i, otherwise.   (9)

The final loss with respect to a mini-batch is defined as

L = (1/B) Σ_{i=1}^{B} l′_i.   (10)

4 Experiments

Dataset
The training dataset in this study is derived from FireProtDB (Stourac et al. 2021), which contains 2518 single-point mutation samples after removing replicated mutations and proteins homologous to the test data (BLAST P-value < 0.001) (Li et al. 2020). We perform 5-fold cross-validation on the training set, using 2014 samples for training and selecting the best-performing model on the remaining 504 validation samples. The Ssym dataset (Pucci et al. 2018), which contains 684 mutated samples, is used for testing. PDBrenum (Faezov and Dunbrack 2021) was used to convert the mutation positions in the database to those in the PDB structures. The procedure to represent proteins as graphs is summarized as follows. If the distance between the Cα atoms of two amino acids is less than 5 Å, we add a connecting edge between the two amino acid nodes. The features of the nodes (i.e. amino acid residues) in the graph are obtained from the following three categories: (i) Amino acid encoding, including a 5D representation from a skip-gram model (Lv et al. 2021), a 7D one-hot vector according to the amino acid classification, and an 8D vector summarizing several basic biophysical properties of a single residue.
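The edge-construction rule can be sketched in a few lines of numpy (the Cα coordinates below are fabricated for illustration):

```python
import numpy as np

def build_edges(ca_coords, cutoff=5.0):
    """Residue graph edges: connect two residues whose C-alpha atoms are
    closer than `cutoff` angstroms. ca_coords: (N, 3) array of positions.
    Returns undirected edges (i, j) with i < j."""
    n = len(ca_coords)
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dists[i, j] < cutoff]

# four residues along a straight chain, spaced by a typical
# consecutive-residue C-alpha distance of ~3.8 A (made-up coordinates)
ca = np.array([[0.0, 0.0, 0.0],
               [3.8, 0.0, 0.0],
               [7.6, 0.0, 0.0],
               [11.4, 0.0, 0.0]])
edges = build_edges(ca)
print(edges)  # [(0, 1), (1, 2), (2, 3)]
```

Only consecutive residues fall within the 5 Å cutoff here; residues two or more positions apart (7.6 Å or further) are not connected.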

Comparison with state-of-the-art methods
We evaluate our GLGNN model in three ways: (i) We compare it with previous top-performing models to demonstrate its effectiveness. (ii) We test our Siamese graph network with three popular graph neural network backbones and our GLGNN backbone. (iii) We conduct experiments on GLGNN-UCL and other advanced methods addressing noisy samples in regression tasks. The test set is divided into "Direct" (mutations applied to the natural protein) and "Reverse" (the inverse mutations, from the mutant back to the natural protein) to highlight the proposed methods' impact (Table 1).
The results comparing various network structures are shown in the upper part of Tables 2 and 3. By utilizing the Siamese graph representation, most graph-based methods surpass previous neural network-based methods. GAT outperforms the other backbones because it can focus on the mutated site. More importantly, our GLGNN, which uses a tailored local feature transformation flow, learns local features better. As a result, GLGNN not only outperforms previous neural network-based methods but also significantly surpasses other graph representation methods.
Regarding learning methods aimed at handling noisy samples in regression tasks, we compare the proposed GLGNN-UCL with two advanced learning methods in the LIR (learning with imbalanced regression) part of Tables 2 and 3. We carefully adjust their hyperparameters to ensure optimal performance. On the Ssym benchmark, "SL" and "DIR" slightly improve the model's performance. However, on the P53 benchmark, "DIR" and "SL" fail to enhance the model's performance. In contrast, the proposed "UCL" boosts performance on both the Ssym and P53 benchmarks. The reason might be that "DIR" mainly focuses on the imbalance issue in regression tasks and does not sufficiently consider noisy samples; its rebalancing strategy further increases the noise in the training data. As for "SL", it overlooks the intrinsic label-value bias. Consequently, the proposed "UCL", which explicitly identifies and down-weights noisy samples, achieves consistent improvements on both benchmarks.

Ablation study
Ablation on the model structure. The results are reported in Table 4.

Ablation on the unbiased anti-curriculum method. In Table 5, we assess three curriculum approaches. M1 is a biased method that uses the loss as its difficulty measure. M2 uses the proposed unbiased difficulty metric with an easy-to-hard scheduler (i.e. S = e/E). M3 is the unbiased anti-curriculum method with a hard-to-easy schedule. From Table 5, both unbiased variants (M2 and M3) outperform the biased M1, indicating that the "unbiased" operation effectively distinguishes hard samples. The comparison between M2 and M3 supports the assumption in the "Scheduler design" section (Section 3.2.2). All the results surpass the baseline, demonstrating that down-weighting noisy samples is effective.
Sensitivity analysis on the hard sample metric function.For noisy sample mining, we provide a sensitivity analysis of the

hyperparameter α in Equation 6, as shown in Table 6. The results demonstrate that all tested α values significantly outperform the baseline method, indicating the effectiveness of our noisy-sample detection algorithm.

Discussion and conclusion
The analysis of feature importance is shown in Fig. 2. The GLGNN-UCL model with all features included surpasses the other models, providing superior direct and reversed PCC and lower RMSE values, indicating a strong correlation with low error rates. Models lacking the amino acid or energy encoding show slightly reduced PCC and slightly increased RMSE, indicating minor losses in accuracy. The worst-performing model lacks the evolutionary encoding, having the lowest PCC and highest RMSE values, emphasizing the vital role of evolutionary information. We think the reason might be that evolutionary encoding aids in understanding the intricate links between protein sequences, structures, and functions, which are key to stability prediction. Moreover, if a mutation occurs in an evolutionarily conserved region, it is likely to have a significant impact on protein stability, which might also explain why the evolutionary feature boosts performance. Overall, these results highlight that each feature encoding uniquely contributes to GLGNN's performance, with the evolutionary encoding being particularly crucial. This aligns with the idea that protein behavior, a complex phenomenon, is influenced by a mix of factors, necessitating a diverse feature set in machine learning models that predict protein behavior.
Our study focuses on single-point mutations due to their significant impact, using the GLGNN model. Although the model could hypothetically predict the effects of multiple mutations by treating each as an individual single-point mutation, we discourage this due to the potentially complex, non-linear interactions between mutation sites. We are developing a new model to accurately predict both single- and multiple-point mutations for a more comprehensive mutation-impact prediction tool. For further wet-lab experiments, we outline a plan in the appendix.
In this study, we present GLGNN-UCL, a graph regression method incorporating curriculum learning to address the problem of protein thermodynamic stability prediction. We first introduce a custom-designed global-local graph network to predict the thermodynamic change in proteins upon amino acid mutation. Subsequently, we propose an unbiased curriculum learning paradigm that handles noisy samples during training by controlling their weights. Comprehensive experimental results on a widely used benchmark confirm the superior performance of our approach: it not only outperforms advanced protein stability prediction methods based on neural networks or graph neural networks but also demonstrates superiority among state-of-the-art learning methods for regression prediction tasks. Notably, our local feature transformation module requires only 0.05 MB of parameters but boosts performance by approximately 4%. More interestingly, the custom-designed UCL module enhances performance by 3% without any increase in parameters.
In addition, our work not only addresses the gap in protein thermodynamic stability prediction but also pioneers a way to handle noisy samples in the field of graph regression. Furthermore, we contribute a benchmark for evaluating graph neural networks on the PTSP task. Future work will involve delving deeper into the curriculum paradigm by exploring tailor-designed schedulers and validating the performance of our algorithm through wet-laboratory experiments.

Figure 1 .
Figure 1. The pipeline of the proposed GLGNN-UCL: an unbiased curriculum learning-powered global-local graph neural network to predict the thermodynamic stability change upon a point amino acid mutation. Given an input structure and an amino acid mutation, we use Rosetta to obtain the wild-type and the mutant protein structures, shown in the purple and yellow parts of the figure. The mutated amino acid is shown in a lighter color on the graph. The protein structure graphs are processed by a Siamese graph neural network with the Local Feature Transformation Flow (LFTF) module to obtain transformed feature representations. After that, we use the tailor-designed unbiased curriculum loss to train the model end to end.

Figure 2 .
Figure 2. The distribution of the ΔΔG values in the training set.
Graph neural networks (GNNs) (Hamilton et al. 2017) are powerful tools for modeling non-Euclidean data. Inspired by the convolution operation on imaging data, the Graph Convolutional Network (GCN) (Kipf and Welling 2017) was proposed to handle graph data. GraphSAGE (Hamilton et al. 2017) extends the GCN with the idea of inductive learning. The Graph Attention Network (GAT) (Veličković et al. 2018) learns a graph feature transformation with a masked self-attention mechanism. The Graph Isomorphism Network (GIN) (Xu et al. 2019) provides a theoretical foundation for the expressive power of GNNs and the design of powerful GNN architectures.
12: Re-weight the loss with the schedule: l_i ← S · l_i
13: else
14: Keep l_i unchanged
15: end if
16: end for
17: Update the model parameters with the loss L.
The framework is implemented in PyTorch 1.10.1 and PyTorch Geometric 2.0.2 (Fey and Lenssen 2019) with CUDA. AdamW is applied to optimize the model. We train the models with a learning rate of 0.002, a batch size of 256, 50 training epochs, and a weight decay of 0.001. It is worth noting that, as PyTorch Geometric does not guarantee reproducibility, the results of the SiamGNN methods and the LIR-based methods are obtained by averaging the results of five independent experiments with 5-fold cross-validation. We follow previous works (Li et al. 2020, Benevenuta et al. 2021, Pancotti et al. 2022) in using the Root Mean Square Error (RMSE), Pearson Correlation Coefficient (PCC), and anti-symmetry score as metrics for model evaluation.
(ii) Energy encoding, including both physics-based (Van der Waals interactions, solvation, hydrogen bonds) and knowledge-based energy terms (protein backbone, side-chains, torsions). (iii) Evolutionary encoding, a 20D representation derived from multiple sequence alignment against the Uniclust30 database (Mirdita et al. 2017) by HHblits (Remmert et al. 2011). To sum up, we obtain a 60D feature to encode each node in the graph. The edges are obtained from the Cα distance criterion described above.
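The assembly of the 60D node feature can be sketched as follows; the 20D size assumed for the energy encoding is inferred from the stated total (5 + 7 + 8 + 20 + 20 = 60), and all values are random placeholders rather than real encodings:

```python
import numpy as np

rng = np.random.default_rng(0)

# (i) amino acid encoding: skip-gram embedding, class one-hot, biophysics
aa_skipgram = rng.normal(size=5)          # 5D skip-gram amino acid embedding
aa_class = np.zeros(7); aa_class[2] = 1.0 # 7D one-hot amino acid class vector
aa_biophys = rng.normal(size=8)           # 8D basic biophysical properties
# (ii) energy encoding (assumed 20D): physics- and knowledge-based terms
energy = rng.normal(size=20)
# (iii) evolutionary encoding: 20D MSA-derived profile
evolution = rng.normal(size=20)

node_feature = np.concatenate([aa_skipgram, aa_class, aa_biophys, energy, evolution])
print(node_feature.shape)  # (60,)
```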

Table 1 .
Details of dataset used in our experiments, including dataset names, types, and number of mutated proteins.

Table 2 .
Comparison with state-of-the-art methods on the Ssym benchmark.
a In SiamGNN, we compare different backbones based on the SiamGNN framework; in LIR (learning with imbalanced regression), we compare the proposed UCL with SL (NeurIPS'20) and DIR (ICML'21) based on our GLGNN.
b Separately averaging the results of the five folds. Unique best results and our methods are marked in bold.

Table 3 .
Comparison with state-of-the-art methods on the P53 benchmark.
a In SiamGNN, we compare different backbones based on the SiamGNN framework; in LIR (learning with imbalanced regression), we compare the proposed UCL with SL (NeurIPS'20) and DIR (ICML'21) based on our GLGNN.
b Separately averaging the results of the five folds. Unique best results and our methods are marked in bold.

Table 4 .
Ablation study of the model structure based on the Ssym test set.a

Table 5 .
Ablation study of the proposed unbiased curriculum learning method.

Table 6 .
Sensitivity analysis on the hard-sample selection function.