Struct2GO: protein function prediction based on graph pooling algorithm and AlphaFold2 structure information

Abstract

Motivation: In recent years, there has been a breakthrough in protein structure prediction: the AlphaFold2 model of the DeepMind team has improved the accuracy of protein structure prediction to the atomic level. Currently, deep learning-based protein function prediction models usually extract features from protein sequences and combine them with protein–protein interaction (PPI) networks to achieve good results. However, for newly sequenced proteins that are not in the PPI network, such models cannot make effective predictions. To address this, this article proposes the Struct2GO model, which combines protein structure and sequence data to enhance both the precision of protein function prediction and the generality of the model.

Results: We obtain amino acid residue embeddings from the protein structure through graph representation learning, utilize a graph pooling algorithm based on a self-attention mechanism to obtain whole-graph structure features, and fuse them with sequence features obtained from a protein language model. The results demonstrate that, compared with traditional protein sequence-based function prediction models, the Struct2GO model achieves better results.

Availability and implementation: The data underlying this article are available at https://github.com/lyjps/Struct2GO.


Introduction
As the expression products of genes and as macromolecules in organisms, proteins are the main material basis of life activities. They exist widely in various cells, providing functions such as catalysis, cell signaling, and structural support, and play a key role in life activities and functional execution. At the same time, the study of proteins allows a better grasp of life activities at the molecular level, which has important practical significance for the management of diseases, the creation of new medications, and the improvement of crops. Because of advancing high-throughput sequencing technology, protein sequence data are increasing exponentially. At present, more than 100 000 proteins in the Universal Protein (UniProt) (UniProt Consortium 2018) database have standard functional annotations obtained by biological experiments. This accounts for only 0.1% of the proteins in the UniProt database. However, verifying protein functions through biological experiments is time-consuming and labor-intensive and has strict requirements on equipment and funding, which cannot meet the increasing annotation demand, so it is necessary to design an efficient protein function prediction method.
The protein function prediction problem can be viewed as a multi-label binary classification problem: the features of a given protein are extracted and mapped to the space of protein function labels. A variety of data sources can be tapped for protein function prediction features, such as protein sequences, protein structures, protein families, and protein–protein interaction (PPI) networks; the most commonly used sources are the protein sequence and the interaction network. Protein function labels can be standardized through the Gene Ontology (The Gene Ontology Consortium 2017), a database established by the Gene Ontology Consortium to define and describe genes and their products. According to functional scope, the Gene Ontology includes three independent branches: Cellular Component, Molecular Function, and Biological Process.
Generally, the study of protein function prediction can be separated into three stages. The first stage comprises classic sequence-based methods such as BLAST (Altschul et al. 1990), which calculates the similarity between protein sequences and transfers annotations between proteins whose similarity scores exceed a certain threshold. This approach has great limitations when predicting the functions of proteins without sequence similarity. The second stage comprises machine learning methods based on decision trees and support vector machines, of which a representative is the multi-source k-nearest-neighbors algorithm (Lan et al. 2013), which integrates multiple similarity measures to find the k nearest neighbors of the current protein and determines its annotation by calculating the weighted average of the functions of its neighboring proteins. In 2018, the DeepGO model proposed by Kulmanov et al. (Kulmanov et al. 2018) was the first application of deep learning to protein function prediction: it learns features from the protein sequence matrix through convolutional neural networks and combines them with the embedding vectors of protein nodes in the PPI network, marking the start of the third stage, that of deep learning models. The following year, the same team proposed the DeepGOPlus model (Kulmanov and Hoehndorf 2020), which does not rely on the embedding vectors of protein nodes in the PPI network but instead captures sequence similarity information through the DIAMOND sequence alignment tool (Buchfink et al. 2015) and combines it with CNN-extracted sequence features to improve prediction performance. DeepGraphGO (You et al. 2021) leverages the family and domain information of the sequence to provide nodes with initial features and then utilizes graph convolutional networks to acquire the structural information of the PPI network. Building on this, PSPGO (Wu et al. 2022) proposed a multi-species label and feature propagation model based on a protein sequence similarity network and a PPI network.
All of the above methods use the protein sequence as the information source to predict GO terms; however, sequence information alone cannot reveal the correlation between protein functions. Moreover, models that obtain homologous sequence features based on the PSSM method exhibit lower sensitivity to single amino acid substitutions (Arya et al. 2022). That structure determines function is a universal rule in nature (Dawson et al. 2017, Mitchell et al. 2019). Hence, despite disparate sequences, two proteins with analogous structures may possess analogous functions (Brenner et al. 1996, Holm and Sander 1996, Krissinel 2007, Sebastian and Contreras-Moreira 2013). It is imperative to create techniques that utilize protein structural data to predict functions, to compensate for the disparity between protein sequence and function. DeepFRI (Gligorijević et al. 2021) has demonstrated encouraging outcomes in the annotation of protein functions through the utilization of experimentally determined protein structure databases. Although only a limited number of proteins have experimental structures, AlphaFold2 (Jumper et al. 2021) has achieved a remarkable advancement in protein structure prediction, attaining an unprecedented, atomic-level accuracy that in most cases is comparable to experiments, and has publicly released 214 million predicted protein structures, including human ones, which will further promote the development of methods for predicting protein functions from structure.
In this article, Struct2GO, a protein function prediction model that leverages multi-source data fusion, is proposed, as shown in Fig. 1. Specifically, the model takes protein sequence information and protein structure information as inputs, extracts sequence features through the SeqVec pre-trained model, and extracts structural features through a hierarchical graph pooling model based on the self-attention mechanism. To maximize the utilization of the protein structure information provided by AlphaFold2, residue-level embeddings are pre-trained on the protein structure network via Node2vec and then employed as the initial node features of the pooling model. Numerous experiments have demonstrated that a protein function prediction model combining structure and sequence can significantly enhance prediction accuracy. At the same time, the model eliminates the restriction of feature extraction to the PPI network, thereby significantly improving the model's generalizability.

Datasets
In this experiment, we obtained human protein structure data predicted by AlphaFold2 from the EMBL-EBI database, comprising 23 391 protein structures. More than 560 000 entries were screened from the gene ontology annotation labels corresponding to human proteins, and the annotations obtained by experiments, that is, those with the evidence codes "IDA," "IPI," "EXP," "IGI," "IMP," "IEP," "IC," or "TAS," were extracted; the resulting human dataset included 20 395 entries. Concurrently, we downloaded and parsed the most recent gene ontology data released by the official Gene Ontology website, constructed the directed acyclic graph of the gene ontology according to the parsed terms of the BPO, CCO, and MFO branches, and completed the labels according to the true path rule. It should be noted that most functional terms do not appear in the dataset or annotate only a few proteins, so for each branch this article filters out gene ontology terms with a frequency lower than a certain threshold to reduce label sparsity. After completion, the numbers of BPO, MFO, and CCO labels are 809, 273, and 298, respectively (see Supplementary Table S3).

The construction of protein contact map
Protein structure and function are closely related. To better infer protein functions from protein structure information, we transform the three-dimensional protein structure into a two-dimensional protein contact map, construct a protein structure network to aggregate the information of adjacent residues, and finally obtain the protein structure features.
In terms of specific implementation, we obtain the three-dimensional atomic coordinates of the protein structure through AlphaFold2 and then calculate the relative distances between amino acid residues. If the distance between the Cα atoms of two residues is less than 10 Å, we consider there to be an edge directly connecting the two residues. We employed two distinct methods for generating contact maps, ANY-ANY and NBR; see Supplementary Fig. S1 for the relevant experimental results.
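As a minimal sketch of this step (assuming the Cα coordinates have already been parsed from an AlphaFold2 model file; the function name and array layout are illustrative, not the authors' code):

```python
import numpy as np

def contact_map(ca_coords, threshold=10.0):
    """Build a binary contact map from an (N, 3) array of Cα coordinates.

    Two residues are connected if their Cα–Cα distance is below `threshold`
    (10 Å, as in the text). Self-contacts on the diagonal are zeroed out.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # (N, N, 3) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))              # (N, N) Euclidean distances
    cmap = (dist < threshold).astype(np.int8)
    np.fill_diagonal(cmap, 0)
    return cmap

# Toy example: three residues on a line, 8 Å apart; only consecutive
# residues fall within the 10 Å cutoff.
coords = np.array([[0.0, 0.0, 0.0],
                   [8.0, 0.0, 0.0],
                   [16.0, 0.0, 0.0]])
cmap = contact_map(coords)
```

The resulting matrix serves as the adjacency matrix of the protein structure network used in the following sections.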

Obtaining amino acid residue level features
In the protein structure network, each node is an amino acid residue. The most intuitive way to obtain node features is one-hot encoding of the 20 different amino acids, but this method cannot capture the positional information of the same amino acid in different protein networks. Therefore, we utilize graph representation learning to acquire the structural information of each node in the protein network. Among current algorithms, DeepWalk (Perozzi et al. 2014) is one of the most representative; it extends the word2vec (Mikolov et al. 2013) idea, assuming that neighboring nodes have similar embedding vectors. Node2vec (Grover and Leskovec 2016) optimizes this through a biased random walk when selecting the successive vertices, i.e.
given the current vertex v, the probability of visiting the next vertex x is

P(c_i = x | c_{i-1} = v) = π_vx / Z if (v, x) ∈ E, and 0 otherwise,

where π_vx is the unnormalized transition probability between vertex v and vertex x, and Z is the normalization constant. Node2vec introduces two hyperparameters p and q to regulate the random walk strategy. Assuming that the random walk has just traversed edge (t, v) to reach vertex v, let π_vx = α_pq(t, x) · w_vx, where w_vx is the edge weight between vertices v and x and

α_pq(t, x) = 1/p if d_tx = 0, 1 if d_tx = 1, 1/q if d_tx = 2,

where d_tx is the shortest path distance between vertex t and vertex x.
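The biased transition rule above can be sketched as follows (a simplified illustration on a weighted adjacency dictionary, not the Spark-On-Angel implementation the authors used):

```python
def alpha(p, q, d_tx):
    """Node2vec search bias α_pq(t, x) given the shortest-path distance
    d_tx between the previous vertex t and the candidate vertex x."""
    if d_tx == 0:       # returning to the previous vertex t
        return 1.0 / p
    if d_tx == 1:       # x is also a neighbor of t (BFS-like step)
        return 1.0
    return 1.0 / q      # x is two hops from t (DFS-like step)

def transition_probs(graph, t, v, p, q):
    """Normalized transition probabilities from v, given the walk came from t.

    `graph` is assumed to be a dict: node -> {neighbor: edge weight}.
    Computes π_vx = α_pq(t, x) · w_vx for each neighbor x of v, then divides
    by the normalization constant Z.
    """
    pis = {}
    for x, w_vx in graph[v].items():
        if x == t:
            d_tx = 0
        elif x in graph[t]:
            d_tx = 1
        else:
            d_tx = 2
        pis[x] = alpha(p, q, d_tx) * w_vx
    Z = sum(pis.values())
    return {x: pi / Z for x, pi in pis.items()}
```

With the paper's settings p = 0.8 and q = 1.2, the walk is slightly biased toward revisiting and exploring locally, since 1/p > 1 > 1/q.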
In terms of specific implementation, this experiment is based on Tencent's open-source distributed machine learning platform Spark-On-Angel and uses the efficient data storage, update, and sharing services provided by Spark to implement the node2vec algorithm for graph computing. In the proteins we input, the number of amino acid residues is below 1500, so we choose a walk length of 30 in node2vec, with p = 0.8 and q = 1.2; combined with the one-hot encoding of each residue, we finally generate a 1 × 50 dimensional feature vector for each residue in the protein.

Extraction of protein sequence features
In the natural language domain, pre-trained models such as BERT (Devlin et al. 2018) and XLNet (Yang et al. 2019) have advanced rapidly in recent years, and many researchers have extended models from the NLP field to the bio-sequence field, proposing a variety of pre-trained models for obtaining distributed representations of protein sequences; among them, the SeqVec model (Heinzinger et al. 2019) is widely employed. The SeqVec pre-trained model can extract function-related semantic information from the sequence and has achieved good results in tasks such as protein subcellular localization, secondary structure prediction, and function prediction. Specifically, the SeqVec model uses the CharCNN (Zhang et al. 2015) algorithm to acquire local characteristics of amino acids and then uses the BiLSTM algorithm to construct the language model. The feature of a single amino acid is obtained by combining the CharCNN features with the outputs of the language model: for the kth amino acid, the forward and backward LSTMs of layer j each output a 512-dimensional vector, and these two vectors are concatenated to form a 1024-dimensional feature h^LM_{k,j} as the output of the jth BiLSTM layer. Finally, the SeqVec model concatenates the residue-level features into a 1024 × N matrix and reduces the dimensionality of the matrix through principal component analysis or average aggregation to generate a 1 × 1024 vector.
In terms of specific implementation, this experiment uses the SeqVec model, which is first pre-trained on about 33M sequences from the UniRef50 database. The human protein sequences are then taken as input. For each protein sequence, we obtain a feature vector as the protein sequence feature, which is combined with its structural features in the subsequent model for downstream protein function prediction.
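The average-aggregation branch of the dimensionality reduction described above is straightforward; a minimal sketch (the array shapes follow the text, with per-residue SeqVec embeddings assumed to be precomputed):

```python
import numpy as np

def sequence_feature(residue_embeddings):
    """Collapse an (N, 1024) matrix of per-residue SeqVec embeddings
    into a single 1024-dimensional protein-level feature vector by
    averaging over the N residues."""
    return residue_embeddings.mean(axis=0)

# Hypothetical protein of 120 residues.
emb = np.random.rand(120, 1024)
feat = sequence_feature(emb)   # shape (1024,)
```

This protein-level vector is the sequence feature fused with the structural features later in the model.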

Model and implementation
Since the same protein may have multiple functions, the model essentially performs a multi-label classification task. In this article, an attention-based graph pooling mechanism is adopted, which takes the protein contact graph and amino acid residue features obtained above as input, extracts protein structural features through graph convolution and hierarchical pooling, and integrates the sequence features described above as the input of the downstream multi-label protein function classifier. At the same time, the network layer and post-processing layer in the classifier ensure the hierarchical relationship between GO labels.

Convolution layer
In this stage, we take the protein contact graph as the adjacency matrix and the amino acid residue features as the node features, and propagate features between structurally adjacent residues through graph convolution. We explored several widely used graph convolution functions, including the Kipf and Welling graph convolutional layer (GraphConv) (Kipf and Welling 2016), Chebyshev spectral graph convolutions (ChebConv) (Defferrard et al. 2016), SAmple and aggreGatE convolutions (SAGEConv) (Hamilton et al. 2017), and Graph Attention (GAT) (Veličković et al. 2017). We compared the effects of the different graph convolution methods on the results, and the experiments revealed that a two-layer GraphConv model performed best. In each layer, a new hidden representation is obtained through neighbor message propagation and aggregation:

H^{(l+1)} = σ(D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} W^{(l)}),

where Ã = A + I_N ∈ R^{N×N} is the adjacency matrix with self-connections, D̃ is its degree matrix, H^{(l)} is the node representation at layer l, W^{(l)} is a trainable weight matrix, and σ is a nonlinear activation.

Self-attention graph hierarchical pooling layer
In recent years, the self-attention mechanism has been extensively employed in deep learning models, yielding noteworthy outcomes by allowing the model to focus on the most significant features. Lee et al. (2019) introduced the self-attention method to the graph pooling model, obtaining an importance score for each node by stacking convolutional layers and projecting the output features to one dimension.
Then we adopted the node selection algorithm proposed by Cangea et al. (2018), which retains some nodes and edges of the input graph and generates a new subgraph as the input of the next layer. The pooling ratio k determines the number of nodes that will be retained: we select ⌈kN⌉ nodes according to the importance scores obtained from the self-attention convolutional layer. In the application of the model, we use a two-head attention mechanism to obtain two importance scores for each node and take their mean as the final score. In the experiments, this method effectively improves the performance of the model.
The pooling operation of each layer is

X_out = X' ⊙ Z_mask, A_out = A_{idx, idx},

where X' is the original features of the retained nodes, X_out is the output features of the retained nodes, Z_mask is the importance scores of the retained nodes, and A_out is the adjacency matrix of the subgraph induced by the retained nodes. Xu et al. (2018) proved that, in the field of graph classification, sum-pooling outperforms mean-pooling and max-pooling: by summing all node features in the graph, it can draw on every node and extract more information. In our hierarchical pooling model, we extract the graph features of each layer by concatenating sum-pooling and max-pooling, and finally sum the graph features of the multiple layers as the structural features of the protein. The readout of each layer is as follows:

Readout layer
r = (Σ_{i=1}^{N} X_i) || (max_{i=1}^{N} X_i),

where N is the number of nodes in this layer, X_i represents the feature of the ith node, and || represents feature concatenation.
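The selection, masking, and readout steps above can be sketched together as follows (a simplified SAGPool-style illustration with precomputed attention scores; names and shapes are assumptions, not the authors' code):

```python
import numpy as np

def sag_pool(X, A, Z, k=0.75):
    """Self-attention pooling sketch: keep the ⌈kN⌉ highest-scoring nodes.

    X: (N, F) node features, A: (N, N) adjacency, Z: (N,) importance
    scores (e.g. the mean of two attention heads). Returns the gated
    features and induced adjacency of the retained subgraph, plus the
    sum || max readout of this layer.
    """
    N = X.shape[0]
    n_keep = int(np.ceil(k * N))
    idx = np.argsort(-Z)[:n_keep]          # top-rank node selection
    X_out = X[idx] * Z[idx, None]          # gate retained features by score
    A_out = A[np.ix_(idx, idx)]            # induced subgraph adjacency
    readout = np.concatenate([X_out.sum(axis=0), X_out.max(axis=0)])
    return X_out, A_out, readout
```

In the full model, two such layers are stacked and their readouts are summed to give the protein's structural feature vector.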
Experiment and results

Experiment
To validate the effectiveness of Struct2GO, we divided the human protein dataset into training, validation, and test sets in an 8:1:1 ratio and conducted experiments with three different prediction models. We compared the predictions on the test set with those of current mainstream models, including Naïve, BLAST, DeepGO, DeepGOA, DeepFRI, and GAT-GO. The Naïve algorithm annotates GO terms according to their frequency, and BLAST is a protein sequence alignment technique that utilizes sequence similarity and dynamic programming to predict gene labels.
DeepGO leverages both protein sequence information and PPI network data to infer gene ontology tags. DeepGOA innovatively introduces a GCN to obtain knowledge from the GO hierarchy to guide prediction; DeepFRI transforms the protein three-dimensional structure into a contact map and uses a GCN to extract structural features for protein function prediction; and GAT-GO replaces the GCN aggregation function of DeepFRI with GAT and verifies this change through experiments.
In this article, AUC, AUPR, and Fmax are selected as metrics to evaluate the accuracy of protein function prediction from different perspectives.
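As an illustrative sketch of the protein-centric Fmax metric (assuming the common CAFA-style definition; the exact formulas used here are in the Supplementary Data):

```python
import numpy as np

def fmax(y_true, y_scores, thresholds=np.linspace(0.01, 0.99, 99)):
    """Protein-centric Fmax: sweep a score threshold t, compute precision
    over proteins with at least one prediction and recall over all
    proteins, and return the maximum F1 across thresholds.

    y_true, y_scores: (n_proteins, n_terms) arrays of labels and scores.
    """
    best = 0.0
    for t in thresholds:
        pred = y_scores >= t
        has_pred = pred.any(axis=1)          # proteins with ≥1 prediction
        if not has_pred.any():
            continue
        tp = (pred & (y_true == 1)).sum(axis=1)
        prec = (tp[has_pred] / pred[has_pred].sum(axis=1)).mean()
        rec = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```

Unlike AUC, Fmax rewards a single well-chosen decision threshold, which matches how the predictions would be consumed by annotators.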
(The definitions of the specific parameters and formulas can be found in the Supplementary Data.) From Table 1, it is observable that our model achieves a considerable enhancement across multiple metrics compared with other prevalent models, which can be attributed to our processing of the protein dataset and our model design. We fully mined the protein structure information provided by AlphaFold2 and combined it with the sequence features to achieve good results in protein function prediction. At the same time, we also see that, among all branches, the MFO branch has good prediction results, while the BPO branch has lower accuracy, which may be related to the number of labels in the different branches. For a fair evaluation of the model, metrics for all GO labels are provided and a histogram is plotted, as shown in Supplementary Fig. S2. The metrics of the training and test sets can be seen in Supplementary Table S4.

Ablation study
Then, we perform ablation experiments to assess the impact of each component in the Struct2GO model on the enhancement of performance, as shown in Table 2.
The experiments involved extracting protein semantic features from individual sequences using the SeqVec pre-trained model, obtaining contact maps from AlphaFold2's atomic-level three-dimensional coordinates, and extracting protein structural features through hierarchical graph pooling. The experimental data show that removing any component leads to a loss of model performance, which demonstrates that all components of our model are effective. The ablation experiments also reveal that, compared to protein semantic features obtained solely from single sequences, protein structural features have a significant impact on downstream function prediction tasks. Analogous to the findings of Arya et al. (2022), structure-based features are more effective at capturing amino acid mutations; our ablation results thus indirectly support their conclusions.

Model analysis
We compare different variants of each component in the model and verify through experiments that our chosen configuration achieves the best results among the variants, as shown in Table 3. When extracting structural features from the protein contact graph, we used four different aggregation functions: GraphConv, ChebConv, GATConv, and SAGEConv. The experimental data reveal that the various aggregation functions have a minimal influence on model performance, but GraphConv achieves better results on most data. Next, we compared the effects of SumPool, AvgPool, and MaxPool on model performance when reading out graphs. As Xu et al. (2018) stated, the SumPool method can accumulate more features and often achieves better results in tasks that distinguish graph structures. The Struct2GO model demonstrates that hierarchical graph pooling is more effective than global graph pooling, likely owing to its capacity to efficiently extract pertinent information from protein contact graphs with a large number of nodes. Finally, we contrasted the outcomes of single-head and two-head self-attention layers when utilizing hierarchical pooling. The experimental results show that multi-head attention layers can learn more effective information and often perform better.

Parameter sensitivity analysis
We then examined the effects of parameters such as the dropout rate, learning rate, pooling ratio, and number of convolution layers on the model. We employed the control variable method, varying a single parameter at a time across multiple comparison experiments, and evaluated each parameter's actual impact by observing the model's performance after training, to identify the optimal value. The ranges of the hyperparameter comparison experiments are presented in Table 4.

The utilization of dropout in deep neural networks can mitigate overfitting and enhance generalization capacity. From the experiments in Fig. 2, it is evident that varying the dropout has a negligible effect on model performance, with the model achieving slightly better performance when the dropout is 0.3. Accordingly, and also to expedite the convergence rate of the model, we opted for a dropout value of 0.3.
From the experimental data depicted in Supplementary Fig. S4, we observed that the model's classification performance was weakest when the learning rate was 0.01, indicating that an excessively high learning rate causes the loss function to fluctuate. As the learning rate decreased, the model's convergence gradually improved, but this also necessitated more training cycles to reach the optimum. Ultimately, weighing the number of training cycles against model performance, we set the learning rate to 0.0001.
From the experimental data depicted in Supplementary Fig. S5, a convolution number of 1 means that each node can only learn the features of its direct neighbors. As the convolution number increases, the nodes in the graph can learn features of indirect neighbors, but this can also lead to overfitting. The experimental data reveal that performance differs only slightly across convolution numbers, with the model achieving optimal performance when the convolution number is 2.
The pooling ratio is the ratio of the number of nodes in the subgraph generated at the next layer to the number in the original graph. In Supplementary Fig. S6, a pooling ratio of 1 degenerates to global pooling. Comparing the experimental results in the figure, we find that model performance is best when the pooling ratio is 0.75. When the pooling ratio is reduced to 0.25, model performance decreases significantly, which may be because reducing the number of nodes in the subgraph impairs the generalization ability of the model; we therefore set the pooling ratio to 0.75.

Conclusion
In this article, we propose a powerful end-to-end graph deep learning model, Struct2GO, which can effectively and quickly annotate protein functions based on protein structure and sequence. Specifically, we adopt a graph pooling model to acquire structural features from the three-dimensional protein structures predicted by AlphaFold2 and integrate the sequence features extracted by SeqVec to train the protein function classifier. The three-dimensional protein structure data predicted by AlphaFold2 provide strong support for our function prediction, enabling us to abandon the PPI network constraints of previous works and effectively improve the generality of the model. At the same time, compared with previous methods that predict protein function from experimentally determined structures, AlphaFold2 provides abundant high-resolution structural information, which enables our model to perceive more homologous information and effectively improve prediction accuracy. The comparative experiments demonstrate that Struct2GO attains state-of-the-art performance, confirming that structural information effectively supports protein function prediction.
In our future work, we will continue to investigate novel methods to enhance the generality and precision of the Struct2GO model. In addition, the AlphaFold2 website provides protein structure datasets covering 217 million proteins across multiple species, which can be used in future research to attempt large-scale cross-species protein function model training and further improve the generality of the model.
At the same time, to focus more on the influence of subtle structural changes on protein function prediction, future work can explore new approaches to protein structure feature extraction. For instance, we can investigate embedding the amino acid features extracted by sequence models into protein structure networks and explore novel random walk models to more comprehensively unearth valuable information within protein structures. In addition, we can build a protein network based on structural similarity, with each protein as a node, and use the effective information of homologous proteins during network propagation to improve prediction accuracy.

Figure 1 .
Figure 1. The Struct2GO model diagram. The model's input comprises the protein structure and protein sequence. In the preprocessing stage, the three-dimensional protein structure is transformed into a protein contact graph, and amino acid-level embeddings are generated through Node2vec. At the same time, protein sequence features are extracted based on SeqVec and reduced to 1 × 1024 dimensions. Then, protein structure features are extracted through two layers of the self-attention graph pooling model, in which a GCN aggregates neighbor information and generates node weights, a top-rank algorithm selects nodes according to the weight values, node features are updated to generate subgraphs, and the feature values of the two readout layers are summed as the output protein structural features. Finally, the sequence and structure features of the protein are fused as the input of the classifier.

Figure 2 .
Figure 2. PR curves of Struct2GO with different dropout values. Curves of different colors represent the influence of different dropout values on the performance of the model. The PR curves show that the model has the best performance and stability when the dropout value is 0.3.

Table 1 .
Experimental results on human protein data.

Table 2 .
Ablation experiment results on human protein data.

Table 4 .
Range of hyperparameter comparison experiments.

Table 3 .
Model comparison experiment results on human protein dataset.