Graph2GO: a multi-modal attributed network embedding method for inferring protein functions

Abstract

Background: Identifying protein functions is important for many biological applications. Since experimental functional characterization of proteins is time-consuming and costly, accurate and efficient computational methods for predicting protein functions are in great demand for generating testable hypotheses to guide large-scale experiments.

Results: Here we propose Graph2GO, a multi-modal graph-based representation learning model that integrates heterogeneous information, including multiple types of interaction networks (a sequence similarity network and a protein-protein interaction network) and protein features (amino acid sequence, subcellular location, and protein domains), to predict protein functions in terms of Gene Ontology. Comparing Graph2GO to BLAST, as a baseline model, and to two popular protein function prediction methods (Mashup and deepNF), we demonstrate that our model achieves state-of-the-art performance. We show the robustness of our model by testing on multiple species. We also provide a web server supporting function query and downstream analysis on the fly.

Conclusions: Graph2GO is the first model to apply attributed network representation learning to both interaction networks and protein features for predicting protein functions, and it achieves promising performance. Our model can easily be extended to include more protein features to further improve performance. Moreover, Graph2GO is applicable to other scenarios involving biological networks, and the learned latent representations can be used as feature inputs for machine learning tasks in various downstream analyses.

or micro-area under receiver operating characteristic (ROC) curves. However, given that the dataset is skewed (more negative samples than positive samples), the precision-recall curve gives a more informative picture of an algorithm's performance than ROC curves, as pointed out by Davis and Goadrich (reference 42 in the manuscript). Therefore, we adopt micro-AUPR (m-AUPR) and macro-AUPR (M-AUPR) as two evaluation metrics. m-AUPR is computed by first vectorizing all labels (considering them as one label) and then computing the AUPR, while M-AUPR is computed by first calculating the AUPR for each label individually and then taking the unweighted mean of all AUPRs. Both metrics properly represent the overall performance of a multi-label classifier.
>In Table 2, the authors show that the overall performance of Graph2GO is good, when compared with a baseline of BLAST. It does seem however, that the BLAST performance is rather high. CAFA typically reports Fmax (not F1) values between 0.4 and 0.5. I am wondering how such a high BLAST F1 score was achieved. One possible issue may be the random split of training and testing sets as reported in P 6 "Comparison with BLAST". A random split does not ensure that there are no homologs for the training set in the test set, which may mean that the authors are testing on part of their training set and test set. The authors should ensure that any sequence in the test set is not too similar to the sequences in the training set. E.g. https://www.biorxiv.org/content/10.1101/626507v4.full

Authors' answer: Thanks for pointing out this question. We have added a new comparison condition where all training samples sharing more than 50% identity with one of the test samples are removed. Therefore, we ended up with two datasets: (1) a full dataset that uses all randomly split training samples; (2) a partial dataset that removes potential homologs. We can see that the results on the partial dataset are now similar to the results from CAFA. From the updated Table 2, we can see that Graph2GO is more robust than BLAST, given that Graph2GO shows a smaller drop than BLAST from evaluation under the full dataset condition to the partial dataset condition. As a result, Graph2GO outperforms BLAST even more under the partial dataset condition across all ontologies. The detailed results and analysis are in section "Comparison with BLAST" and Table 2.
>Minor comments:
>
>Figure 3 is too small to be legible.

Authors' answer: Thanks for pointing out the problem. We have updated Figure 3 for the updated evaluation metrics and made the figure clearer this time.

>Table 3: seemingly low F1 performances, per organism.
Authors' answer: The results for each organism are consistent with the results on human, even for the low F1 scores. As we answered in the first point, the low F1 score is due to the definition of our previously used F1 (based only on the top three predictions). We have changed this type of F1 score to F-max, as defined in the section "Evaluation metrics". The updated results are still consistent with the results on human.

>Reviewer #2's comments:
>
>This paper presents an approach to predicting protein function that leverages heterogeneous sources of knowledge about proteins, including relational information derived from protein interactions and sequence similarity relationships. The authors have clearly presented their computational architecture, which is interesting and as far as I know represents a novel contribution. However, I have some concerns. My main concerns with this manuscript are (a) a number of (apparent) variations in experimental settings that impeded comparability of results and might suggest some cherry-picking, and (b) the lack of comparison with the tasks and data sets used in the community evaluation CAFA (Critical Assessment of Function Annotation). Together, it means it is difficult to fully judge the contributions of the work.
>
>Major point 1:
>
>-Few details of the DNN are provided; this should be elaborated.
Authors' answer: Thanks for pointing out this problem. We have added more details about the DNN in section "Deep neural network classifier" to make it clearer.
>-Is the CNN used in the 'Feature transfer' experiments the same as the DNN of the original experiments? If not, what are the differences, and why is the same algorithm not used for comparability? Is retraining needed in any case; can't the comparison of the sparsity vs non-sparsity group be done as an analysis over the results data? I don't feel that this is explained well.
Authors' answer: Thanks for raising this question. The CNN model used in the "Feature transfer" experiment is not the same as the DNN we used in the second part of our Graph2GO model. We have added a detailed description of this model in the supplementary material, in the section named "Convolutional Neural Networks (CNNs)".
The hypothesis we want to test here is that Graph2GO, a model making use of feature transfer within the network, is better at dealing with proteins lacking sufficient features due to annotation bias. In other words, we want to show that the performance on the sparsity group is not much worse than that on the non-sparsity group when using Graph2GO. That is why we use the ratio between the performance on the sparsity group and the non-sparsity group as a measure. We want to compare with a method that does not use the concept of feature transfer. We actually have many choices, for example, SVM, logistic regression, DNN or CNN. The reason we chose CNN is that it performs better than DNN or SVM in this case, since it is good at extracting features. But we could also choose other methods to compare with, regardless of how good the method itself is, since what we care about is the ratio of performance between the sparsity group and the non-sparsity group.
Re-training is not needed for Graph2GO in this analysis, since we simply analyzed the results data by dividing the test results into two groups and calculating the corresponding metrics.
>-In the comparison with deepNF and Mashup, the authors do not explain how the embeddings produced by these methods are used. The authors have explained that these methods produce embeddings that are analogous to the embeddings derived by the authors, but then do not use the same approach in the subsequent stages of the classification process; they use a multi-layer NN for Graph2GO and an SVM for the other two methods. For direct comparability, the overall architecture for learning would be kept the same, and only the input embeddings would be changed. How else can we know whether the key improvement comes from the embedding approach, or the classification model? The current comparison does not appear to be like-for-like.
Authors' answer: Thanks for pointing out this issue. We previously touched briefly on these two methods in the fourth paragraph of the "Introduction". We have added a few more descriptions of these two methods.
We previously used SVM as the classification model for these methods because they used SVM in their original implementations. In order to make the comparison fairer, we have changed their classification model to DNN as well, with the same architecture as Graph2GO. We have updated Figure 3 accordingly with new comparison results, and we show that our method outperforms the other two methods in the CC and MF ontologies, and achieves comparable results in the BP ontology.

>-In the comparison across species, the authors do not explain their experimental framework sufficiently. Do they control for overlap (very high sequence similarity) with the human training data they have used in developing their model? Do they 'hide'/hold back annotations [all, or just some?] for the proteins in these data sets? Furthermore, they claim "decent" performance (compared to what?) and yet the performance on these data sets is far lower than on the other test data they present. This does not at all 'prove' the robustness; it needs to be explained.

Authors' answer: Thanks for pointing out this problem. In the comparison across species, we actually perform experiments in each species individually. That is to say, for each species, we download their data and then perform cross-validation within this species, just as we did for the human dataset. We did not work in a transfer learning manner, training on human data and testing on other species.
By conducting experiments on multiple species, we want to see whether the performance is consistent across all species, to show the robustness of our method. We have also included the results on the human dataset in Table 3, and we can see that the performance on these six species is consistently decent, except that the results on S. cerevisiae and Bacillus subtilis in the BP ontology are much better than those of other species.

>Major point 2:
>
>There have been a number of studies over several years that explore methods for automatic assignment of protein function (CAFA4 has just started). The authors cite some of this work (Reference 1) but then go on to claim that homology-based methods are still the norm. That was true 10 years ago, but I am not convinced that it is still true today. Importantly, data sets have been prepared for CAFA, as well as many methods, and the authors clearly need to discuss how their work relates to that work. Indeed, I would suggest that without a direct comparison to the data sets used in CAFA, it is not possible for the authors to claim state-of-the-art performance. BLAST is still a meaningful baseline method to compare with, but it is certainly not representative of the state-of-the-art. This comparison may not be entirely feasible given that CAFA test sets are large and only a fraction of them are used for the final evaluation, but IMO it is important to explore this and at least attempt to come close to a similar experimental framework to test the authors' approach. At a minimum, the authors should be using the same metrics for evaluation as CAFA -- perhaps in addition to their own metrics -- to enable meaningful comparison, and of course discussing results on that task in relation to theirs. A comparison that only considers embedding methods is incomplete, in particular given that there are several methods used in CAFA that explicitly support integration of multiple sources of information, including relational information (see [1] for instance).
Authors' answer: Thanks for the great suggestions. For the evaluation metrics, we now use metrics similar to CAFA's: F-max, macro-AUPR and micro-AUPR. We have added the section "Evaluation metrics" under "Results" to describe the metrics in detail. AUROC is used for term-centric evaluation in CAFA. Here we adopt AUPR instead since, as suggested by Davis and Goadrich (reference 42 in the manuscript), when the dataset is highly skewed, which is the case in this protein function prediction task, AUPR is a more informative metric than AUROC.
Mashup and deepNF are both state-of-the-art network-based embedding methods for predicting protein functions by integrating multiple networks. By comparing with these two methods, we have shown that Graph2GO is among the state-of-the-art network-based embedding methods for protein function prediction. We agree that it would be ideal to compare with a larger pool of methods, including the integrative methods in CAFA that are not based on embeddings. However, given the scope of the study, we would like to focus on several methods that are recently published and well received in the community. We would like to continue the development and comparison along the lines of CAFA in the future, which may require our own implementations of CAFA methods and thus more time. We think participating in a large-scale competition like CAFA in the future would be beneficial to our method comparison and development.
In summary, our method not only outperforms Mashup and deepNF in performance, but also exhibits some improvements in the method design. For example, Graph2GO can consider both heterogeneous network structural information and protein attributes (including sequences, locations and protein domains), while the other two methods only consider network features. In addition, the concept of feature transfer adopted by Graph2GO is advantageous for addressing annotation bias, which is a novel contribution.
>On metrics, the use of the groupings of GO terms based on sparsity of annotations is not well justified; I do not understand what this represents, in particular given that the number of proteins annotated to a given GO term is not reflective of any biological reality -- it simply reflects areas of study and our current level of biological understanding. GO is organised hierarchically along semantic/ontological lines, and the authors do not appear to recognise this characteristic. Furthermore, there exists a grouping of GO terms called GO Slims which already seeks to reduce the number of classes that should be considered [2] (now called subsets: http://geneontology.org/docs/go-subset-guide/).
Authors' answer: Thanks for raising this question. We group GO terms based on the sparsity of annotations to assess the impact of annotation bias on function prediction, and to show the performance of our method in more detail. Typically, the more data we have (meaning the annotations are less sparse), the better a function prediction model can perform. By dividing the GO terms into groups with different numbers of annotated proteins and analyzing the results separately, we can show how our model works when the data is sparse and when the data is abundant.
As for GO Slims, we found that they are grouped mainly by species. Since we already analyze the results for each species individually, we feel it is better to stay with the original GO groupings. Thanks for the suggestions.

>In addition, why are only the top three predictions considered for evaluation purposes, in calculating F1-score? Having an absolute number, rather than basing it on confidence/probability etc., requires justification.
Authors' answer: Thanks for pointing out this problem. We have changed this metric to F-max as suggested. The definition is provided in section "Evaluation metrics". Previously we used an F1-score based on the top three predictions since it was used by Mashup and deepNF, and we adopted the same F1-score for the comparison. Now we use F-max to compare all these methods, which is more reasonable and more widely used.
>In the methods, the authors do not adequately explain their output representation -- is it framed as a multi-label classification problem, or as a multi-class problem (i.e. can a single input have multiple labels in the output)? Is it just a flat set of GO terms, so that the semantic interdependencies between the terms are not considered? (I later found this detail in the Results; it should come earlier.) It is actually a hierarchical classification problem [3]. This hierarchical nature also interacts with metrics and evaluation; please see [4] and [5]. The authors should consider/discuss their evaluation protocols in this context.

Authors' answer: Thanks for raising this issue. Our task is a multi-label problem, as we mentioned in section "Deep neural network classifier" under "Methods". Our output representation is a flat set of GO terms. In our current implementation, we did not consider the semantic interdependencies between the terms, which might cause inconsistency between the predictions of leaf GO terms and the corresponding parent GO terms. However, when we generate the dataset for model training and testing, we do assign the parent GO terms to a protein when a child GO term is present. We think this partially addresses the issue of incomplete GO term annotations in the dataset. In addition, since we report a probability in the range [0, 1] for each GO term, users can judge the reliability of each GO term (whether leaf or parent) by its probability.

We have also added some discussion under the "Discussion" section about how we can treat this as a hierarchical classification problem in the future. One way is to add hierarchical constraints in the output layer to force the predictions of leaf nodes to be consistent with the predictions of the corresponding parent nodes. Another potential method is to explicitly add the GO hierarchy into the model, along with our other network components, to inform the training process, since we can regard the GO hierarchy as a network. We will leave this integration to future work.

>In the section "Graph2GO pipeline" the authors mention two alternative architectures. Ideally, they would test both architectures (possibly moving the detail to Supplementary) and indicate which worked better.
Authors' answer: Thanks for pointing out this problem. We have added the comparison in the supplementary material. From the comparison, we see that the model that trains an independent VGAE for each network and combines their embeddings is better than the model that first combines the networks and trains one VGAE to obtain the overall embeddings. This result supports our Graph2GO architecture.

>On language; the writing is in general very good. There are a few colloquialisms that could be replaced: "plenty of" -> "substantial"; "really good" -> "very good" or "excellent". "noises" should just be "noise". "well suitable" should be "well suited". I found at least one typo "nueral".
Authors' answer: Thanks for pointing these out. We have corrected the language problems.

Background
Knowledge of protein functions is of great importance for understanding life at a molecular level, studying disease mechanisms and exploring novel therapeutic targets. However, experimental identification of protein functions is time-consuming and expensive, and is thus not suitable for large-scale applications. Therefore, high-throughput computational methods are required to discover protein functions with reasonable quality and accuracy [1] and to provide testable hypotheses for targeted experimental validation.
Most existing computational algorithms exploit homology inference to infer protein functions [2,3,4], based on the assumption that proteins with similar sequences frequently carry out similar functions. A standard approach is simply to transfer annotations from the best-annotated BLAST hit [5]. Some approaches involve the use of protein-protein interaction (PPI) networks, based on the fact that proteins closer in the network have a greater chance of sharing similar functions [6,7]. Other methods utilize domain content in the sequence to assign functions [8,9]. Under the assumption that domains are protein structural architecture modules, it makes sense that protein function should be closely related to, or determined by, the composite domains. There are also methods that make use of other sources of information for protein function prediction, such as protein subcellular localization [10,11], post-translational modifications [11] and literature [12].
Given the limited prediction capacity of a single source of information, many methods have been proposed to combine several kinds of information and take advantage of the power of machine learning techniques. For instance, INGA [13] performs sequence similarity and domain architecture searches, and combines them with enrichment analysis on interaction networks to derive a consensus prediction. Similarly, COFACTOR [14] consists of three individual pipelines for sequence-, structure- and PPI-based predictions, querying the UniProt-GOA [15], BioLip and STRING [16] databases respectively, and generates a consensus based on the three confidence scores obtained from the three pipelines. DeepGO [17] uses representation learning methods to learn features from sequences and interaction networks respectively, and then combines them to predict function using a deep neuro-symbolic model. Although these methods achieve reasonable prediction accuracy, one limitation is that they treat the multiple kinds of features separately and do not consider the relationships between proteins for each feature, which might result in the loss of information contained within the interactions. One possible solution that takes these relationships into account is to use protein-protein interaction networks to transfer features between proteins.
Graphs, such as those representing social networks, molecular graph structures [18] and biological protein-protein interaction networks, occur naturally in various real-world scenarios and have important applications in modern machine learning. Modeling the interactions between entities as graphs has enabled researchers to understand network systems in a systematic manner. For example, in a social network application one might wish to predict the role of a person or recommend new friends to a user [19]; in a clinical application, researchers want to predict new therapeutic applications of drug molecules [20]; and basic scientists are also interested in classifying the roles of a protein in a biological interaction graph [21]. Mashup [22] is a framework for scalable and robust multiple network integration that generates low-dimensional vectors for nodes by characterizing their topological context in heterogeneous PPI networks based on the random walk with restart method. The key idea of Mashup is the use of random walk with restart to analyze the network structure; its authors also proposed novel methods for dimension reduction and integration of heterogeneous networks by solving optimization problems. DeepNF [23] is a network fusion method for extracting protein features from multiple heterogeneous interaction networks based on multimodal deep auto-encoders. DeepNF also utilizes random walk with restart to extract information about the network structure of individual PPI networks, but proposes a different integration method based on auto-encoders. Both methods adopt a two-stage model: first generating informative embeddings based on network structures in an unsupervised manner, then building a supervised classification model to predict GO terms with the embeddings as input. Although both Mashup and DeepNF can generate embeddings for downstream protein function prediction, they only consider the topological information contained in multiple PPI networks while ignoring informative protein attributes such as protein sequences or protein domains, and thus might miss important information.
A crucial challenge in machine learning on graphs is finding a way to incorporate multiple types of information about the structure and attributes of the graph into the machine learning model. Traditional approaches usually use summary graph statistics [24], kernel functions [25] or handcrafted features to represent local neighborhood structures. Nowadays, researchers seek to learn representations that encode structural information in a data-driven way. The idea behind these representation learning methods is to learn a function that maps nodes in the graph to points in a low-dimensional vector space. By optimizing this mapping, the geometric relationships in the learned space can reflect the original structure of the graph, and the learned representations can be used as feature inputs for downstream machine learning applications [26]. These network representation learning methods can also make use of node attributes to generate more informative embeddings, such as user profiles in a social network or protein signatures in a protein interaction network [21].
In this paper, we propose Graph2GO, a multi-modal graph-based architecture that can make use of several kinds of data sources in a unified way to predict protein function. Unlike the consensus methods mentioned earlier, we first use protein-protein interactions and sequence similarities to build two graphs separately, and use protein sequence information, protein subcellular location, protein domains or any other available information as node attributes in both graphs. Given the attributed graphs, we use an attributed network representation learning algorithm to obtain informative embeddings for each node in each graph, i.e., each protein. In this way, we learn representations that model both node attributes and network structures comprehensively and simultaneously. We then combine these two embeddings and use them to predict protein functions with a feedforward neural network model.
To the best of our knowledge, we are the first to use attributed network representation learning methods to model both an interaction network and a sequence similarity network with node attributes to predict protein functions, and we achieve state-of-the-art performance on the benchmark dataset. Moreover, our model is extensible and can incorporate other function-related information to further improve performance. Graph2GO does not rely on any manually crafted features and is entirely data-driven. Graph2GO is also applicable to other similar scenarios, such as predicting new therapeutic applications of existing drugs, since the learned embeddings can be used as feature inputs for various downstream machine learning tasks.

Methods
Graph2GO consists of two parts: the first part is an unsupervised graph-based representation model that utilizes both network information (protein-protein interactions, sequence similarities) and node attributes (protein sequence, subcellular location and protein domains) to generate a unique embedding vector for each protein; the second part is a fully-connected deep neural network classifier that uses the embeddings as features and Gene Ontology (GO) terms [27] as function labels. Gene Ontology defines concepts that describe functions and classifies them into three aspects: molecular function (MF), cellular component (CC) and biological process (BP) [27]. We first describe how we obtain and encode the different kinds of information, and then describe the detailed model specification. In Figure 1, we use the PTEN gene as an example to detail how we encode this information. The model architecture is shown in Figure 2.

Dataset
We use reviewed and manually annotated proteins from SwissProt (release 2018_11) [15]. The dataset contains 20412 human proteins and provides GO annotations along with their experimental evidence codes. We also obtained protein sequences, subcellular locations and protein domains (Pfam) from SwissProt. We downloaded protein-protein interaction networks from the STRING database (v10.5) [16], filtered by the proteins we obtained from SwissProt.

Protein-protein interaction network
From the network data in STRING, we use the "combined score" provided by STRING as the confidence score. As shown in Figure S2, we performed cross-validation to choose the threshold of the combined score for building the PPI network, and we only use interactions whose combined score is greater than 300 to construct the adjacency matrix representing the network. For interactions with low confidence and protein pairs with no interaction recorded in STRING, we set the corresponding element of the adjacency matrix to zero.
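The construction above amounts to thresholding a weighted edge list into a binary adjacency matrix. A minimal sketch of this step is shown below; the file name "string_links.txt" and the mapping of STRING identifiers to our SwissProt protein set are assumptions for illustration.

```python
# Sketch: build a binary PPI adjacency matrix from a STRING edge list,
# keeping only interactions with combined score above the chosen threshold.
import numpy as np
import pandas as pd

# hypothetical file with columns: protein1 protein2 combined_score
links = pd.read_csv("string_links.txt", sep=" ")
links = links[links["combined_score"] > 300]  # threshold chosen by cross-validation

# in practice this would be the SwissProt-filtered protein list
proteins = sorted(set(links["protein1"]) | set(links["protein2"]))
idx = {p: i for i, p in enumerate(proteins)}

A = np.zeros((len(proteins), len(proteins)), dtype=np.float32)
for p1, p2 in zip(links["protein1"], links["protein2"]):
    A[idx[p1], idx[p2]] = A[idx[p2], idx[p1]] = 1.0  # undirected, unweighted edge
```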

Sequence similarity network
To build the sequence similarity network, we use the BLAST program to search for similar sequences for each protein in our dataset [28]. We only select similarity pairs whose e-value is smaller than 1e-4 as candidate edges for the similarity network. Detailed cross-validation results on selecting this threshold are shown in Figure S1.

Protein sequence
We encode amino acid sequences following the conjoint triad (CT) method [29], which has been widely used to represent sequences in related fields [30,31,32,33]. The twenty kinds of amino acids are first clustered into seven classes according to the dipoles and volumes of their side chains, since amino acids within the same class likely involve synonymous mutations; all amino acids in the same class are then considered identical. The mapping between classes and amino acids is shown in Table S1. We then consider every three consecutive amino acids as a unit and count the triad frequencies by calculating their numbers of occurrences within the protein sequence. Thus, the dimension of the CT encoding is 7 × 7 × 7 = 343. Using the CT method, we convert amino acid sequences of arbitrary length into a fixed-dimension representation.
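A minimal sketch of the CT encoding follows. The seven-class grouping below is the one commonly attributed to Shen et al. [29]; the exact mapping used in the paper is given in Table S1.

```python
# Sketch of conjoint triad (CT) encoding: map each amino acid to one of seven
# classes, then count the frequency of every length-3 window of classes (7^3 = 343).
import numpy as np

CLASSES = {"AGV": 0, "ILFP": 1, "YMTS": 2, "HNQW": 3, "RK": 4, "DE": 5, "C": 6}
AA2CLASS = {aa: c for group, c in CLASSES.items() for aa in group}

def ct_encode(seq):
    """Return the 343-dimensional conjoint triad frequency vector of seq."""
    vec = np.zeros(343)
    classes = [AA2CLASS[aa] for aa in seq]      # per-residue class labels
    for i in range(len(classes) - 2):           # slide a window of length 3
        a, b, c = classes[i], classes[i + 1], classes[i + 2]
        vec[a * 49 + b * 7 + c] += 1            # index the triad in base 7
    return vec

print(ct_encode("MTAIIKEIVSRN")[:10])           # toy sequence, illustrative
```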

Subcellular location
We obtained subcellular location information from the SwissProt dataset. There are in total 359 different subcellular locations. We use bag-of-words encoding to represent this information; that is, the location is encoded as a binary vector of length 359, with each element indicating whether the protein is annotated with that location. A protein without any subcellular location annotation is therefore represented as a vector of all zeros.

Protein domains
In the dataset we obtained from SwissProt, there are in total 5817 unique protein domains. To avoid the curse of dimensionality and to decrease complexity, we only use protein domain terms that appear more than 5 times in our dataset, leaving 655 terms. The protein domain information is also encoded with bag-of-words encoding.
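Both the subcellular location and protein domain attributes use the same binary bag-of-words scheme; a small sketch is shown below, with an illustrative three-term vocabulary (in the paper, the vocabulary would be the 359 locations or the 655 retained domain terms).

```python
# Sketch: bag-of-words encoding of subcellular locations or Pfam domains.
import numpy as np

def bag_of_words(annotations, vocab):
    """Binary vector: element j is 1 iff vocab[j] annotates the protein."""
    index = {term: j for j, term in enumerate(vocab)}
    vec = np.zeros(len(vocab), dtype=np.float32)
    for term in annotations:
        if term in index:  # filtered-out terms (e.g., rare domains) are skipped
            vec[index[term]] = 1.0
    return vec

vocab = ["Nucleus", "Cytoplasm", "Membrane"]        # illustrative vocabulary
print(bag_of_words({"Nucleus", "Membrane"}, vocab)) # -> [1. 0. 1.]
```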

VGAE model
In the first part of Graph2GO, the core is a Variational Graph Auto-encoder (VGAE) [34], which can generate latent representations based on both the network structure and the node features, as shown in Figure 2a. We discuss this model by first formulating the problem, followed by its inference part (encoder) and generative part (decoder). The purpose of the VGAE is to learn an interpretable embedding for each protein by training the encoder and decoder at the same time.

Problem formulation
We are given an undirected graph G = (V, E) with N = |V| nodes. Here N is the number of proteins; each vertex of G represents one protein, while each edge is an interaction in the PPI network or a similarity pair in the sequence similarity network. The adjacency matrix A of G and its degree matrix D_A are derived from the known network information. We enforce self-loops in the graph by simply adding the identity matrix to A. The input features of the vertices are collected in an N × R matrix X, which is the concatenation of the protein sequence features S, the subcellular location features L and the protein domain features D, as shown in Figure 2a. Here R is the sum of the feature dimensions of the matrices S, L and D.
The model depicted in Figure 2a is essentially an encoder-decoder model. First, the encoder maps each node v_i in the graph to a low-dimensional latent variable z_i, based on the node's position in the graph, its local neighborhood structure, and its attributes. Next, the decoder reconstructs the adjacency matrix entry A_ij corresponding to the pair of nodes v_i and v_j. By jointly optimizing the encoder and decoder, the model learns to compress the graph structure information and the original node attributes into the low-dimensional latent space. In principle, the model learns to propagate features across proteins based on the network structure, and the encoder-decoder architecture ensures that the learned representations in the latent space are meaningful and informative.

Encoder
The inference module is a graph convolutional network (GCN) encoder [35], a function that maps the original features X to the latent variables Z given the network information A. More specifically, we want to learn a probability model q(Z|X, A), and here we use a graph convolutional network g to model this probability:

$$q(Z \mid X, A) = \prod_{i=1}^{N} q(z_i \mid X, A), \qquad q(z_i \mid X, A) = \mathcal{N}\left(z_i \mid \mu_i, \operatorname{diag}(\sigma_i^2)\right), \qquad (\mu, \log \sigma) = g_{\phi}(X, A),$$

where q is a function that encodes proteins into the latent variables Z based on the network information A and the node attributes X, φ is the parameter of the graph convolutional network g, and I is an identity matrix. µ and σ are the mean and standard deviation of the Gaussian distribution corresponding to the latent variable Z, estimated directly from the data using the network g. Then Z can be sampled from q(Z|X, A). According to the reparameterization trick [36], z_i is obtained by

$$z_i = \mu_i + \sigma_i \odot \epsilon_i,$$

where ⊙ is element-wise multiplication and ε_i ∼ N(0, I).
The intuition behind the GCN is as follows. Multiplying the feature matrix X by the adjacency matrix A means that, for every node, we sum up the feature vectors of all its neighboring nodes and itself to update its feature representation. In this way, node attributes are propagated across the graph and the information associated with each protein is augmented, especially for proteins with little annotation. This graph convolution operation is similar to Laplacian smoothing, which makes the features of nodes in the same cluster similar [37]. To avoid changing the scale of the feature vectors, A is first symmetrically normalized, giving Ã; multiplying X by Ã then amounts to taking a weighted average of the neighboring nodes' features. In this way, the GCN can effectively learn embeddings by integrating neighboring graph features.
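The propagation step above can be written compactly as H' = ReLU(Ã X W) with Ã = D^{-1/2}(A + I)D^{-1/2}. A minimal numpy sketch of one such layer follows; the toy graph and random weights are purely illustrative.

```python
# Sketch of one graph convolution with symmetric normalization,
# as used in the VGAE encoder.
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops added."""
    A_hat = A + np.eye(A.shape[0])                 # enforce self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^{-1/2} diagonal entries
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, X, W):
    """Average neighbor features, transform, and apply ReLU."""
    return np.maximum(A_norm @ X @ W, 0.0)

A = np.array([[0, 1], [1, 0]], dtype=float)        # toy 2-node graph
X = np.random.randn(2, 4)                          # node attributes
W = np.random.randn(4, 3)                          # layer weights
H = gcn_layer(normalize_adj(A), X, W)
```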

Decoder
As our latent embeddings already contain both the node attributes and the structure information, the generative module we define here is a simple inner-product decoder that aims to reconstruct the adjacency matrix A from the learned latent variables z_i:

$$p(A_{ij} = 1 \mid z_i, z_j) = \hat{A}_{ij} = \sigma\left(z_i^{\top} z_j\right),$$

where σ(·) is the logistic function. We use the logistic-transformed inner product of z_i and z_j as the probability of these two proteins interacting. As indicated in Figure 2a, the output of the decoder, Â, is an approximation of the adjacency matrix A, and we optimize the model to make them as close as possible.

Cost function
Similar to the VAE [36], the cost function is the reconstruction error with a regularizer:

$$\mathcal{L} = -\mathbb{E}_{q(Z \mid X, A)}\left[\log p(A \mid Z)\right] + \mathrm{KL}\left[q(Z \mid X, A) \,\|\, p(Z)\right],$$

where KL[q(·) ‖ p(·)] is the Kullback-Leibler divergence between q(·) and p(·). The first term minimizes the reconstruction error of the adjacency matrix A; the second term minimizes the difference between q(Z|X, A) and the prior p(Z). The cost function thus trades off how accurately the model can reconstruct the input network against how closely the latent variables match p(Z). As specified in the VAE, we assume p(Z) ∼ N(0, I). We train the VGAE using stochastic gradient descent to optimize the cost function with respect to the parameters of the encoder [36].
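To make the objective concrete, here is a minimal numpy sketch of one forward pass of sampling, decoding and loss evaluation. It is a simplification for clarity (e.g., it omits the edge reweighting typically used for sparse graphs); the paper's implementation is in TensorFlow.

```python
# Sketch of the VGAE objective: reparameterized sampling, inner-product
# decoding, and reconstruction + KL loss.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vgae_loss(A, mu, log_sigma):
    """Cost to minimize for one sample of the latent variables."""
    eps = np.random.randn(*mu.shape)
    Z = mu + np.exp(log_sigma) * eps                  # reparameterization trick
    A_hat = sigmoid(Z @ Z.T)                          # inner-product decoder
    recon = -np.mean(A * np.log(A_hat + 1e-10)
                     + (1 - A) * np.log(1 - A_hat + 1e-10))
    # KL divergence between N(mu, sigma^2) and the standard normal prior
    kl = -0.5 * np.mean(1 + 2 * log_sigma - mu**2 - np.exp(2 * log_sigma))
    return recon + kl
```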

Deep neural network classifier
In the second part of Graph2GO, as shown in Figure 2b, we take the embeddings µ_i contained in the matrix µ and train a fully-connected deep neural network as the final supervised classifier. The deep neural network (DNN) has become one of the most popular and powerful techniques for supervised machine learning [38]. It consists of three types of layers: an input layer, hidden layers and an output layer. The inputs to the classifier are the embedding vectors of each protein, while the output layer represents the GO terms that we aim to predict. Hidden layers are stacked between the input layer and the output layer, aiming to learn meaningful abstract features for the task. To go from one layer to the next, a set of units (neurons) computes a weighted sum of its inputs from the previous layer and passes the result through a non-linear function (activation function). By stacking multiple hidden layers, a DNN can learn an extremely intricate non-linear mapping from the input to the output, which means it works best when the task is inherently non-linear. Gradient descent is commonly used for training neural networks, and binary cross-entropy is used as the cost function in our multi-label classification task.
We train three classifiers, for molecular function (MF), biological process (BP) and cellular component (CC), respectively. Each classifier is a multi-class, multi-label model, and the dimension of the output space is the number of GO terms within the corresponding ontology. Each protein may be predicted with multiple GO terms simultaneously; the classifier predicts the probability of the protein having each GO term annotation. The performance of the classifier can be excellent without a very complex neural network structure, since the embeddings already contain enough information and are highly representative in the learned low-dimensional vector space.

Graph2GO pipeline
As shown in Figure 2b, the Graph2GO pipeline consists of two VGAE models, for the protein-protein interaction (PPI) network and the sequence similarity network (SSN) respectively, and a final deep neural network (DNN) classifier for predicting protein functions. Instead of first combining the two networks and training one VGAE to obtain overall embeddings, we train an independent VGAE model for each network and combine their embeddings, which avoids introducing noise and keeps as much information as possible. We compared the performance of these two ways of integrating the two networks; the detailed results are shown in Table S2.

Experimental setup
Graph2GO was implemented using TensorFlow in Python and takes advantage of the computing capacity of GPUs. All simulations were carried out on the Owens cluster provided by the Ohio Supercomputer Center (OSC) [39] with 27 processors and 127 GB of memory. The GPU we used was an NVIDIA Tesla P100 with 16 GB of memory. Our source code is available at https://github.com/yanzhanglab/Graph2GO.
For our experiments, we use SwissProt's reviewed and manually annotated human proteins (release 2018_11). We only use proteins that also exist in the STRING database (v10.5), where the corresponding interaction information is available. In order to use the CT method to encode amino acid sequences, we remove proteins whose sequences contain ambiguous or non-standard amino acids, including B, O, J, U, X and Z. After filtering out 5279 proteins, our final dataset contains 15133 proteins, along with 1,713,652 protein-protein interactions and 843,212 edges in the sequence similarity network. For GO annotations, we only consider the experimental evidence codes EXP, IDA, IPI, IMP, IGI, IEP, TAS and IC [27]. If a protein is annotated with a GO term, we additionally annotate it with all of that term's ancestor terms.
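The ancestor propagation step is a transitive closure over the GO hierarchy. A small sketch follows; the `parents` dictionary mapping each term to its direct parents is a hypothetical input (e.g., parsed from the GO ontology file).

```python
# Sketch: propagate GO annotations to all ancestor terms in the hierarchy.
def propagate(annotations, parents):
    """Close a protein's GO term set under the parent relationship."""
    closed = set(annotations)
    stack = list(annotations)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, ()):  # walk up the DAG
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return closed

parents = {"GO:0000002": ["GO:0000001"]}      # toy hierarchy
print(propagate({"GO:0000002"}, parents))     # -> child and its ancestor
```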
In the first part of our architecture, the encoder is a two-layer neural network with 400 and 800 neurons, respectively. Using two layers of graph convolutional operations is recommended in [35,37], and we confirm the choice of two layers by cross-validation, as shown in Figure S1. We initialize weights as described in [40] and train the model for 200 iterations using the Adam algorithm with a learning rate of 0.001 [41]. The final prediction classifier is a three-layer fully-connected neural network with 1024, 512 and 256 neurons, respectively. A dropout layer and a batch normalization layer are used between every two dense layers to avoid over-fitting, and the dropout rate is set to 0.3. Adam is used to train the model for 100 iterations with a learning rate of 0.001.
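A minimal Keras sketch of the classifier with the hyperparameters stated above follows. The exact ordering of batch normalization and dropout relative to each dense layer is our assumption, as are the function and variable names.

```python
# Sketch of the final DNN classifier: three dense layers (1024, 512, 256)
# with batch normalization and dropout 0.3, sigmoid outputs for multi-label
# GO prediction, binary cross-entropy, Adam with learning rate 0.001.
import tensorflow as tf

def build_classifier(embedding_dim, n_go_terms):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu",
                              input_shape=(embedding_dim,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_go_terms, activation="sigmoid"),  # multi-label output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy")
    return model
```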

Evaluation metric
We use metrics similar to those adopted by the CAFA (Critical Assessment of protein Function Annotation algorithms) challenge to evaluate our model and compare with others [1]. The first metric we use is the maximum F-measure over all possible probability thresholds t:

$$pr(t) = \frac{1}{m(t)} \sum_{i=1}^{m(t)} \frac{\sum_{f} I\left(f \in P_i(t) \wedge f \in T_i\right)}{|P_i(t)|}, \qquad rc(t) = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{f} I\left(f \in P_i(t) \wedge f \in T_i\right)}{|T_i|},$$

$$F_{\max} = \max_{t}\left\{\frac{2 \cdot pr(t) \cdot rc(t)}{pr(t) + rc(t)}\right\},$$

where pr denotes precision, rc denotes recall, I is the indicator function, f is a GO term, P_i(t) is the set of GO terms predicted for protein i at threshold t, T_i is the set of annotated GO terms for protein i, m(t) is the number of proteins with at least one prediction at threshold t, and n is the total number of proteins.
We also use term-centric metrics to evaluate the model, as suggested by CAFA. However, instead of using receiver operating characteristic (ROC) curves, we consider the precision-recall (PR) curve and calculate the area under it. As pointed out by Davis and Goadrich [42], when dealing with highly imbalanced datasets, precision-recall curves give a more informative picture of an algorithm's performance than ROC curves. Since this is a multi-label task, we adopt two averaged measures of the area under the precision-recall curve (AUPR) over all terms. For each GO term f, a precision-recall curve is traced out by the term-wise precision pr_f and recall rc_f over all thresholds, and AUPR_f is the area under this curve. The macro-averaged AUPR is then

$$\text{M-AUPR} = \frac{1}{N_f} \sum_{f} \text{AUPR}_f,$$

where N_f is the number of GO terms used for evaluation, i.e., the unweighted mean of the AUPRs over all labels. The micro-averaged AUPR (m-AUPR) is calculated globally by considering each element of the label indicator matrix as a label.
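The sketch below shows how the three metrics can be computed; it assumes binary label matrix Y_true and probability matrix Y_prob of shape (proteins, GO terms), and that every protein carries at least one annotation. This is an illustrative implementation, not the paper's exact code.

```python
# Sketch of the three evaluation metrics: F-max over thresholds, micro-AUPR
# over the flattened label matrix, and macro-AUPR averaged per GO term.
import numpy as np
from sklearn.metrics import average_precision_score

def fmax(Y_true, Y_prob, thresholds=np.linspace(0.01, 0.99, 99)):
    best = 0.0
    for t in thresholds:
        P = Y_prob >= t
        hits = (P & (Y_true == 1)).sum(axis=1)
        npred = P.sum(axis=1)
        mask = npred > 0                            # proteins with predictions at t
        if not mask.any():
            continue
        pr = np.mean(hits[mask] / npred[mask])      # averaged over predicted proteins
        rc = np.mean(hits / Y_true.sum(axis=1))     # averaged over all proteins
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best

def micro_aupr(Y_true, Y_prob):
    return average_precision_score(Y_true.ravel(), Y_prob.ravel())

def macro_aupr(Y_true, Y_prob):
    return np.mean([average_precision_score(Y_true[:, j], Y_prob[:, j])
                    for j in range(Y_true.shape[1]) if Y_true[:, j].any()])
```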

Comparison between different types of features
In this section, we compare the results of using different network information and node attributes. The purpose is to show which type of feature is most informative and how we can further improve the performance by combining multiple types of features. In Table 1, we show the results of using different network types and node attributes. For network types, we test using only the PPI network, only the sequence similarity network, and the combined network. For node attributes, we compare the performance of the individual attributes (sequence, protein domains and subcellular location) with that of ALL attributes (all three kinds of attributes together). We do not use sequence information as a node attribute when the sequence similarity network is used to train the model, because of the redundancy.
Looking at the performance of the different network types, the combined network is always better than either single network type across all three ontologies, which indicates that the PPI network and the sequence similarity network provide complementary information for function prediction. Comparing the PPI network with the sequence similarity network, the PPI network shows better performance except on molecular function (MF). We conclude that the PPI network provides more function-related information than the sequence similarity network, and that the MF ontology has a strong correlation with sequence information.
From the analysis of the effects of different node attributes, we conclude that the integration of all node attributes is better than any single one of them across all three ontologies, regardless of the network type used. Using the sequence attribute alone achieves reasonable performance for predicting functions in all three ontologies when combined with the PPI network, confirming the importance of sequence information. The protein domain attribute is the most informative feature for predicting MF functions, while the subcellular location attribute is important for predicting CC functions. Overall, each type of node attribute contributes to function prediction, and the combination of all features improves the performance and provides more robust predictions. Based on the results shown in Table 1, our final model uses both the sequence similarity network and the PPI network as network information. For node attributes, we include the subcellular location and protein domain features as node attributes in the sequence similarity network, and all three types of features as node attributes in the PPI network.

Comparison with BLAST
We first compare our method with a baseline model, BLAST, a widely used homology-based method for annotating protein functions [5]. We randomly split the dataset into an 80% training set and a 20% test set. We compare our method with BLAST under two conditions: (1) full dataset: using all of the 80% training set; (2) partial dataset: removing training samples with more than 50% identity to one of the test samples (removing potential homologs). The training set is used to construct the BLAST database; all GO term annotations of the best hit in the database are then assigned to the query protein in the test set as the predicted functions. Since BLAST cannot assign a probability to each prediction, we use precision, recall and F1-score as the evaluation metrics instead of the M-AUPR, m-AUPR and F-max used in the previous section. For Graph2GO, we use the same training and test sets and use 0.5 as the threshold to assign predicted labels.
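For reference, this baseline can be reproduced with the standard BLAST+ command-line tools, along the lines of the sketch below. The FASTA file names and the GO-annotation mapping are hypothetical placeholders, not the paper's actual files.

```python
# Sketch of the BLAST baseline: annotate each test protein with the GO terms
# of its best-scoring hit in the training database. Requires BLAST+ installed.
import subprocess

subprocess.run(["makeblastdb", "-in", "train.fasta", "-dbtype", "prot"],
               check=True)
out = subprocess.run(
    ["blastp", "-query", "test.fasta", "-db", "train.fasta",
     "-outfmt", "6 qseqid sseqid bitscore", "-evalue", "1e-4"],
    check=True, capture_output=True, text=True).stdout

best_hit = {}
for line in out.splitlines():                 # keep the highest-scoring hit per query
    q, s, score = line.split("\t")
    if q not in best_hit or float(score) > best_hit[q][1]:
        best_hit[q] = (s, float(score))

train_go = {}  # hypothetical mapping: training protein ID -> set of GO terms
predictions = {q: train_go.get(s, set()) for q, (s, _) in best_hit.items()}
```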
As shown in Table 2, Graph2GO outperforms BLAST by a large margin in terms of all metrics, across all three ontologies and under both comparison conditions, especially for CC and BP, proving the superiority of our model. In particular, Graph2GO achieves significantly higher precision than BLAST under the full dataset condition. BLAST produces many false positives and low precision because it predicts functions based only on amino acid sequence information while ignoring other informative features. At the same time, BLAST obtains reasonable performance in the MF ontology, which again demonstrates that amino acid sequence information is highly related to MF functions. After removing sequences highly similar to the test set from the training set, the performance of both methods drops accordingly, but Graph2GO shows a smaller drop and thus outperforms BLAST even more under the partial dataset condition. For BLAST, the drop in F1-score from the full dataset condition to the partial dataset condition is 27%, 32% and 37% for CC, MF and BP, respectively; for Graph2GO, the F1-score drops by only 9%, 25% and 17%, exhibiting more robust performance.

Table 2: Performance comparison between BLAST and Graph2GO in terms of precision, recall and F1-score under two conditions (full dataset, and partial dataset after removing homologs). We use 0.5 as the threshold to assign predicted labels for Graph2GO. Note that the F1 score here differs from the F-max used previously, since the threshold is pre-specified for Graph2GO and there is no threshold for BLAST.

Comparison with other network-based methods
To evaluate the performance of our method, we compare it with two popular protein function prediction methods, Mashup and deepNF, which are also based on the concept of embedding. In their original implementations, Mashup and deepNF use an SVM as the classification model. To make the comparison fairer, we instead use the same DNN that Graph2GO uses as their classification model. Each method is evaluated using 5-fold cross-validation, repeated 10 times. We first compare the performance using all GO terms, as shown in Figure 3. We also group the GO terms of each ontology into three categories based on annotation sparsity, containing GO terms annotated to 11-30, 31-100 and 101-300 proteins, respectively, and show the detailed comparison in Figures S2, S3 and S4.
For both the CC and MF ontologies, our model outperforms deepNF and Mashup in terms of m-AUPR and F-max, and achieves comparable results in terms of M-AUPR. In the BP ontology, our model is comparable to Mashup and outperforms deepNF in terms of m-AUPR and F-max, and is slightly worse than Mashup in M-AUPR. Looking at the detailed comparison across the three sparsity levels shown in Figure S2, our model achieves the best performance consistently for the MF ontology, across all sparsity levels. As discussed above, the MF ontology has a strong correlation with sequence information. Unlike Mashup and deepNF, which only consider PPI networks, Graph2GO explicitly includes sequence information in the model (as the sequence similarity network and as a node attribute), which improves its performance. For the CC ontology, Graph2GO achieves the best performance for the most general categories (i.e., terms annotating between 101 and 300 proteins); for GO terms annotating between 11 and 100 proteins, our method is not as good as Mashup and deepNF.
For the BP ontology, as the number of annotations per GO term increases, the performance of Graph2GO gradually improves and becomes comparable with the other two methods.
We observe that Graph2GO works best for GO terms with enough annotated proteins, and is not as good as Mashup for very specific GO terms with few annotations in the CC and BP ontologies. In summary, our model achieves state-of-the-art performance in both the CC and MF ontologies, especially as the number of annotated proteins increases, and obtains comparable results in the BP ontology. Comparing with Mashup and deepNF, we conclude that both the inclusion of discriminating node attributes and the power of attribute propagation across the two types of networks enable Graph2GO to obtain competitive results.

Performance on other species
To demonstrate the power and robustness of Graph2GO, we downloaded data from the SwissProt and STRING databases for five other species (fruit fly, mouse, rat, S. cerevisiae and Bacillus subtilis) and tested the performance of Graph2GO on them. For each species, we followed the same procedure as for the human dataset, and the performance was evaluated on each species individually. Table 3 shows the results in terms of macro-AUPR, micro-AUPR and F-max. We observe that Graph2GO achieves consistently decent performance on all five species in terms of these evaluation metrics, which demonstrates the robustness of our method. It should be noted that the performance on S. cerevisiae and Bacillus subtilis is considerably better than on the other species.

Graph2GO web server
We also developed a web server, based on the Shiny app framework in R, that supports protein function queries on the fly: https://integrativeomics.shinyapps.io/graph2go/. Currently, the web server supports the 15133 human proteins on which our model was trained. Network information (the sequence similarity network and the PPI network) and node attributes (subcellular location and protein domains) are displayed in the web server. The function prediction results are ranked by probability score for each ontology separately, and users can specify the threshold to control the confidence of the results.
One promising application of our model is to use the latent representations for downstream analysis, since the generated representations already summarize various kinds of informative features. We support downstream clustering analysis for the query protein, where similarities are measured based on the latent representations. The most similar proteins can be detected for the query protein to constitute a network module, and gene ontology enrichment analysis can then be performed on all proteins in the module to analyze its functions. In the future, we are going to extend the server to include more proteins from other species.

Feature transfer for solving annotation bias
One of the challenges in protein function prediction is the lack of annotated features for proteins, which are crucial for machine learning models. This is due to bias in the experimental annotation of proteins and has impeded the development of biomedical research to some extent [43,44]. In our experiment on the human dataset, among the 15133 proteins, 5716 lack annotation of protein domains and 2364 lack annotation of subcellular location. Even with many missing values, when only using protein domains or subcellular location as node attributes, Graph2GO still achieves reasonable results, as shown in Table 1. To demonstrate the power of Graph2GO to predict for these sparsely annotated proteins, we divide the test set into two groups: the sparsity group (without any subcellular location or protein domain annotation) and the non-sparsity group. We compare Graph2GO with a convolutional neural network (CNN) model that uses the same node attributes as Graph2GO but does not utilize networks to propagate protein attributes, evaluating both methods on the two groups separately. The detailed description of this CNN model is in the supplementary material, and the comparison results are shown in Table S3. We anticipate that the performance on the sparsity group will be worse than on the non-sparsity group due to the lack of features. We therefore calculate the ratio between the performance on the sparsity group and the non-sparsity group to measure how well Graph2GO and the CNN handle the lack of features. The averaged ratio in terms of F-max for the CNN is 74%, 44% and 73% for CC, MF and BP, respectively, while for Graph2GO the averaged ratio is 101%, 84% and 111%. We can see that when features are lacking, the performance of the CNN drops significantly relative to the non-sparsity group, whereas the performance of Graph2GO is much more stable; in some cases, the performance on the sparsity group is even better than on the non-sparsity group. This is due to the feature transfer of Graph2GO, where node attributes are transferred between nodes in the graph to augment the information of each protein. Since proteins are connected in the interaction network, nodes without any attributes can still update their representations based on the attributes of neighboring nodes, which alleviates the problem of missing values to some extent.
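The group split and ratio computation above are straightforward to express in code; a minimal sketch follows, reusing the `fmax` metric from the earlier evaluation sketch. The array names are hypothetical test outputs.

```python
# Sketch of the sparsity-group analysis: split test proteins by whether any
# location/domain attribute is present, then compare F-max between groups.
import numpy as np

def sparsity_ratio(Y_true, Y_prob, attr_matrix):
    """Ratio of F-max on the sparsity group vs. the non-sparsity group."""
    sparse = attr_matrix.sum(axis=1) == 0        # no location/domain annotation
    f_sparse = fmax(Y_true[sparse], Y_prob[sparse])
    f_dense = fmax(Y_true[~sparse], Y_prob[~sparse])
    return f_sparse / f_dense                    # closer to 1 means more robust
```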
The encoding of the original features is of high importance to Graph2GO, not only for preserving as much information as possible, but also for feature transfer. We represent protein domains and subcellular locations using bag-of-words encoding, which best preserves the original information and supports the additive combination of features that feature transfer relies on. As for the protein sequence, the CT encoding loses sequential information and is not well suited for the addition operation. To address this, we also use the similarity network to represent the sequence information, and we find that including both types of sequence representation is better than either one alone, which might indicate that they provide complementary information.

Extensibility and future work
Compared to other methods that also use multiple protein features [17,13,14], one of the advantages of our method is that it is convenient to incorporate other function-related information. Within the network architecture, all features are regarded as node attributes and can be treated equally. In this paper, we found that the integration of all these node features and the network information provides the best prediction performance. The model can easily be extended to take into account additional information such as post-translational modifications, protein structure or any other protein features. It is also worth incorporating more networks and finding better ways to integrate them. For example, we may use the seven channels in the STRING database separately, instead of using the combined PPI network, as adopted by Mashup.
The architecture we propose in this paper can not only be used to predict protein functions, but can also be applied to other tasks that involve biological networks, since the learned embeddings are informative and can be used as feature inputs for various downstream machine learning tasks. For example, one can predict the interactions between any two proteins, predict new therapeutic applications of existing drugs, or run clustering algorithms to group genes into modules based on the learned embeddings.
In principle, predicting GO terms is a hierarchical classification problem, since GO terms are organized in a hierarchy and interact with each other [45]. Currently, Graph2GO assumes the independence of all GO terms, which might cause inconsistency between the predictions for leaf GO terms and those for their parent GO terms. In the future, we will consider adding constraints on the output layer of our classification model to force the predictions of leaf nodes and parent nodes to be consistent. We will also try to incorporate the GO hierarchy into the model as one of the network structures to inform the training process.
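One possible post-hoc variant of such a constraint is sketched below under the assumption that the GO hierarchy is available as a child-to-parent mapping; this is our illustration, not the authors' implemented method. It simply propagates each term's score upward so that a parent is scored at least as high as any of its children.

```python
import numpy as np

# scores: (n_proteins x n_terms) predicted probabilities.
# parents: dict mapping each term index to its parent term indices.
# order: term indices in topological order from leaves toward the root,
# so every child is processed before its parents.
def enforce_consistency(scores, parents, order):
    scores = scores.copy()
    for t in order:
        for p in parents.get(t, []):
            scores[:, p] = np.maximum(scores[:, p], scores[:, t])
    return scores
```

The same effect could instead be built into the output layer during training, which is the direction discussed above.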

Conclusion
In this work, we present Graph2GO, a graph-based deep learning model that makes use of heterogeneous information in a unified way to predict protein functions. By organizing proteins into an attributed network, our representation learning method takes both network structure and protein attributes into account when making predictions. Graph2GO achieves state-of-the-art performance on the benchmark dataset and can easily be adapted to other tasks involving biological networks, such as link prediction, node classification and sub-network discovery.

Response letter
We thank the editors and reviewers for their insightful comments and suggestions and believe that after addressing each comment the manuscript is stronger. We have highlighted all our answers in red.

Editor's comments:
Both reviewers point out several issues with the evaluation and comparison procedures, which may have introduced bias. Before making a decision on a revised manuscript we will seek further advice from our reviewers, and I urge you to make an effort to fully address their comments.
Authors' answer: We appreciate the feedback. We have worked carefully to address the concerns about evaluation and comparison procedures that might introduce bias. We have added a new evaluation metric, F-max, which is widely used by the community, to replace our previous F1-score that was based only on the top three predictions, making the evaluation more reliable. We include detailed definitions of these evaluation metrics so that general readers can understand them. We also elaborate more on our DNN classification model and the models we compare with, as suggested by the reviewers. For the experiments, we have added more details to clarify any confusion. Besides, we made changes to remove any variations that might impede the comparability of the results. We also followed the other suggestions carefully and explain how we addressed them in the following point-by-point response.

Reviewer #1's comments:
This work describes a new method that uses attributed network representations of both interaction networks and protein features to predict protein function, where functions are represented as GO terms. This method of integrating heterogeneous information is, to my knowledge, novel and interesting. The methodology and results are mostly well described (but see comments below).

Major comments:
My main concern is the rather low F1 scores when compared with the apparently better performance of the AUPR scores, as shown in Table 1. (Incidentally, instead of showing F1 scores, why not show average precision and recall so that the reader can establish the contribution of the precision and recall to the F1 score?) The F1 scores are quite low (0.29 for CC and MF, and 0.07 for BP). I believe the authors could explain that, especially in light of the overall good performance in Table 2. Again, Figure 3 shows a low F1 performance (0.25-0.35). How does that work with the F1 of 0.717 reported in Table 2?
Authors' answer: Thanks for pointing out this problem. We apologize for the inconsistent use of the F1-score. In Table 1, Table 3 and Figure 3, the F1-score we used was based only on the top three predictions, as used by Mashup and deepNF, which we want to compare with. In Table 2, when comparing with BLAST, we used the common F1-score, since an F1-score based on the top three predictions cannot be calculated for BLAST. The F1-score based on the top three predictions tends to be small across all models compared to other metrics. To resolve the inconsistency, we have changed the evaluation metric to F-max, the maximum F1-score over all possible thresholds. This evaluation metric is also adopted by CAFA (Critical Assessment of protein Function Annotation algorithms). The detailed definition of this metric, along with the others, is introduced in the section "Evaluation metrics" under "Results". Looking at the results based on F-max, the performance is more consistent with the other metrics, and our model is still better than the other methods under this metric.
I could not find definitions for m-AUPR and M-AUPR. It is my understanding that the micro-averaged area under the precision-recall curve (m-AUPR) is computed by first vectorizing and then computing the AUPR; macro-AUPR (M-AUPR) is computed by first computing the AUPR for each function separately, and then averaging these values across all functions. I would like to know how this was implemented here, and how the differences in performance are explained. While these performance metrics may be well understood in the machine learning community, I think that the authors should elaborate a bit more on the choice of these metrics and their meaning, as GigaScience aims at a broader audience than those well versed in machine learning.
Authors' answer: Thanks for pointing out this issue. We have added a new section "Evaluation metrics" under "Results" to elaborate on the metrics we used. Since protein function prediction is a multi-label task, we need to evaluate the performance on all labels (GO terms). One of the most popular metrics in this setting is the macro- or micro-averaged area under the receiver operating characteristic (ROC) curve. However, given that the dataset is skewed (more negative samples than positive samples), the precision-recall curve gives a more informative picture of an algorithm's performance than ROC curves, as pointed out by Davis and Goadrich (reference 42 in the manuscript). Therefore, we adopt micro-AUPR (m-AUPR) and macro-AUPR (M-AUPR) as two evaluation metrics. m-AUPR is computed by first vectorizing all labels (treating them as one label) and then computing the AUPR, while M-AUPR is computed by first calculating the AUPR for each label individually and then taking the unweighted mean of all AUPRs. Both metrics properly represent the overall performance of a multi-label classifier.
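As a minimal sketch of how these two quantities can be computed, the snippet below uses scikit-learn's average_precision_score (a standard approximation of AUPR) on toy data; whether the authors used this exact routine is an assumption.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy illustration: rows are proteins, columns are GO terms.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.7, 0.2]])

m_aupr = average_precision_score(y_true, y_prob, average="micro")  # labels vectorized as one
M_aupr = average_precision_score(y_true, y_prob, average="macro")  # unweighted per-term mean
```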
In Table 2, the authors show that the overall performance of Graph2GO is good when compared with a baseline of BLAST. It does seem, however, that the BLAST performance is rather high. CAFA typically reports Fmax (not F1) values between 0.4 and 0.5. I am wondering how such a high BLAST F1 score was achieved. One possible issue may be the random split of training and testing sets as reported on p. 6, "Comparison with BLAST". A random split does not ensure that there are no homologs of the training set in the test set, which may mean that the authors are testing on part of their training set. The authors should ensure that any sequence in the test set is not too similar to the sequences in the training set. E.g. https://www.biorxiv.org/content/10.1101/626507v4.full
Authors' answer: Thanks for pointing out this question. We have added a new comparison setting where all training samples sharing more than 50% identity with one of the test samples are removed. Therefore, we ended up with two datasets: (1) a full dataset that uses all randomly split training samples; (2) a partial dataset that removes potential homologs. We can see that the results on the partial dataset are now similar to the results from CAFA. From the updated Table 2, we can see that Graph2GO is more robust than BLAST, given that Graph2GO shows a smaller drop than BLAST from the full dataset condition to the partial dataset condition. As a result, Graph2GO outperforms BLAST even more under the partial dataset condition across all ontologies. The detailed results and analysis are in the section "Comparison with BLAST" and Table 2.
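A sketch of the described homology filter, assuming a precomputed pairwise identity matrix between training and test sequences (e.g. from BLAST alignments); the exact filtering code used is not shown in the manuscript.

```python
import numpy as np

# identity[i, j]: fraction sequence identity between training sequence i
# and test sequence j (assumed precomputed). Training samples with more
# than 50% identity to any test sample are removed.
def remove_homologs(train_idx, identity, threshold=0.5):
    keep = identity.max(axis=1) <= threshold   # max identity to any test seq
    return train_idx[keep]
```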

Minor comments:
Figure 3 is too small to be legible.
Authors' answer: Thanks for pointing out the problem. We have updated Figure 3 for the updated evaluation metrics and made the figure clearer this time.
Authors' answer: The results for each organism are consistent with the results on human, even for the low F1 score. As we answered in the first point, the low F1 score is due to the definition of our previously used F1 (based only on the top three predictions). We have changed this type of F1 score to F-max, as defined in the section "Evaluation metrics". The updated results remain consistent with the results on human.

Reviewer #2's comments:
This paper presents an approach to predicting protein function that leverages heterogeneous sources of knowledge about proteins, including relational information derived from protein interactions and sequence similarity relationships. The authors have clearly presented their computational architecture, which is interesting and, as far as I know, represents a novel contribution. However, I have some concerns. My main concerns with this manuscript are (a) a number of (apparent) variations in experimental settings that impeded comparability of results and might suggest some cherry-picking, and (b) the lack of comparison with the tasks and data sets used in the community evaluation CAFA (Critical Assessment of Function Annotation). Together, this means it is difficult to fully judge the contributions of the work.

Major point 1:
-Few details of the DNN are provided; this should be elaborated.
Authors' answer: Thanks for pointing out this problem. We have added more details about the DNN in the section "Deep neural network classifier" to make it clearer.
-Is the CNN used in the 'Feature transfer' experiments the same as the DNN of the original experiments? If not, what are the differences, and why is the same algorithm not used for comparability? Is retraining needed in any case; can't the comparison of the sparsity vs non-sparsity group be done as an analysis over the results data? I don't feel that this is explained well.
Authors' answer: Thanks for raising this question. The CNN model used in the "Feature transfer" experiment is not the same as the DNN we used in the second part of our Graph2GO model. We have added a detailed description of this model in the supplementary material, in the section named "Convolutional Neural Networks (CNNs)".
The hypothesis we want to test here is that Graph2GO, a model making use of feature transfer within the network, is better at dealing with proteins that lack sufficient features due to annotation bias. In other words, we want to show that the performance on the sparsity group is not much worse than that on the non-sparsity group when using Graph2GO. That is why we use the ratio between the performance on the sparsity group and the non-sparsity group as a measure. We want to compare with a method that does not use the concept of feature transfer. We actually have many choices, for example SVM, logistic regression, DNN or CNN. The reason we chose the CNN is that it performs better than the DNN or SVM in this case, since it is good at extracting features. But we could also compare with other methods, regardless of how strong the method itself is, since what we care about is the ratio of performance between the sparsity group and the non-sparsity group.
The re-training is not needed for Graph2GO in this analysis, since we simply analyzed the results data by dividing the test results into two groups and calculating the corresponding metrics.
-In the comparison with deepNF and Mashup, the authors do not explain how the embeddings produced by these methods are used. The authors have explained that these methods produce embeddings that are analogous to the embeddings derived by the authors, but then do not use the same approach in the subsequent stages of the classification process; they use a multi-layer NN for Graph2GO and an SVM for the other two methods. For direct comparability, the overall architecture for learning would be kept the same, and only the input embeddings would be changed. How else can we know whether the key improvement comes from the embedding approach or the classification model? The current comparison does not appear to be like-for-like.
Authors' answer: Thanks for pointing out this issue. We previously touched briefly on these two methods in the fourth paragraph of the "Introduction". We have added a few more descriptions of these two methods.
We previously used an SVM as the classification model for these methods because they used an SVM in their original implementations. To make the comparison fairer, we have changed their classification model to a DNN as well, with the same architecture as in Graph2GO. We have updated Figure 3 accordingly with the new comparison results, showing that our method outperforms the other two methods in the CC and MF ontologies and achieves comparable results in the BP ontology.
-In the comparison across species, the authors do not explain their experimental framework sufficiently. Do they control for overlap (very high sequence similarity) with the human training data they have used in developing their model? Do they 'hide'/hold back annotations [all, or just some?] for the proteins in these data sets? Furthermore, they claim "decent" performance (compared to what?) and yet the performance on these data sets is far lower than on the other test data they present. This does not at all 'prove' the robustness; it needs to be explained.
Authors' answer: Thanks for pointing out this problem. In the comparison across species, we actually perform the experiments in each species individually. That is to say, for each species we download its data and then perform cross-validation within this species, just as we did for the human dataset. We did not work in a transfer-learning manner, training on human data and testing on other species.
By conducting experiments in multiple species, we want to see whether the performance is consistent among all species, to show the robustness of our method. We have also included the results on the human dataset in Table 3, and we can see that the performance of these six species is consistently decent, except that the results on S. cerevisiae and Bacillus subtilis in the BP ontology are a lot better than those of the other species.

Major point 2:
There have been a number of studies over several years that explore methods for the automatic assignment of protein function (CAFA4 has just started). The authors cite some of this work (Reference 1) but then go on to claim that homology-based methods are still the norm. That was true 10 years ago, but I am not convinced that it is still true today. Importantly, data sets have been prepared for CAFA, as well as many methods, and the authors clearly need to discuss how their work relates to that work. Indeed, I would suggest that without a direct comparison on the data sets used in CAFA, it is not possible for the authors to claim state-of-the-art performance. BLAST is still a meaningful baseline method to compare with, but it is certainly not representative of the state of the art. This comparison may not be entirely feasible given that CAFA test sets are large and only a fraction of them are used for the final evaluation, but IMO it is important to explore this and at least attempt to come close to a similar experimental framework to test the authors' approach. At a minimum, the authors should be using the same metrics for evaluation as CAFA -- perhaps in addition to their own metrics -- to enable meaningful comparison, and of course discussing results on that task in relation to theirs. A comparison that only considers embedding methods is incomplete, in particular given that there are several methods used in CAFA that explicitly support the integration of multiple sources of information, including relational information (see [1] for instance).
Authors' answer: Thanks for the great suggestions. For the evaluation metrics, we use metrics similar to CAFA's: F-max, macro-AUPR and micro-AUPR. We have added the section "Evaluation metrics" under "Results" to describe the metrics in detail. AUROC is used for term-centric evaluation in CAFA. Here we adopt AUPR instead since, as suggested by Davis and Goadrich (reference 42 in the manuscript), when the dataset is highly skewed, which is the case in this protein function prediction task, AUPR is a more informative metric than AUROC.
Mashup and deepNF are both state-of-the-art network-based embedding methods that predict protein functions by integrating information from multiple networks. By comparing with these two methods, we have shown that Graph2GO is among the state-of-the-art network-based embedding methods for protein function prediction. We agree that it would be ideal to compare with a larger pool of methods, including the integrative methods in CAFA that are not based on embeddings. However, given the scope of the study, we would like to focus on several methods that are recently published and well received in the community. We would like to continue development and comparison along the lines of CAFA in the future, which may require our own implementations of CAFA methods and hence considerably more time. We think participating in a large-scale competition like CAFA in the future might be beneficial to our method comparison and development.
In summary, our method not only outperforms Mashup and deepNF in performance, but also exhibits some improvements in method design. For example, Graph2GO can consider both heterogeneous network structural information and protein attributes (including sequences, locations and protein domains), while the other two methods consider only network features. Besides, the concept of feature transfer adopted by Graph2GO is advantageous for addressing annotation bias, which is a novel contribution.
On metrics, the use of groupings of GO terms based on sparsity of annotations is not well justified; I do not understand what this represents, in particular given that the number of proteins annotated to a given GO term is not reflective of any biological reality -- it simply reflects areas of study and our current level of biological understanding. GO is organised hierarchically along semantic/ontological lines, and the authors do not appear to recognise this characteristic. Furthermore, there exists a grouping of GO terms called GO Slims which already seeks to reduce the number of classes that should be considered [2] (now called subsets: http://geneontology.org/docs/go-subset-guide/).
Authors' answer: Thanks for raising this question. We group GO terms based on the sparsity of annotations to assess the impact of annotation bias on function prediction, and to show the performance of our method in more detail. Typically, the more data we have (meaning the annotations are less sparse), the better a function prediction model can perform. By dividing the GO terms into groups with different numbers of annotated proteins and analyzing the results separately, we can show how our model works when the data are sparse and when the data are abundant.
As for GO Slims, we found that they are grouped mainly by species. Since we already analyze the results for each species individually, we feel it is better to stay with the original GO groupings. Thanks for the suggestion.
In addition, why are only the top three predictions considered for evaluation purposes in calculating the F1-score? Using an absolute number, rather than one based on confidence/probability etc., requires justification.
Authors' answer: Thanks for pointing out this problem. We have changed this metric to F-max as suggested; the definition is provided in the section "Evaluation metrics". We previously used the F1-score based on the top three predictions because it was used by Mashup and deepNF, and we adopted it for the comparison. Now we use F-max to compare all of these methods, which is more reasonable and more widely used.
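For readers unfamiliar with the metric, the following is a minimal sketch of the protein-centric F-max computation in the CAFA style; it is our illustrative version, not necessarily the exact implementation used in the manuscript.

```python
import numpy as np

# F-max: the maximum F1 over all decision thresholds, with precision
# averaged over proteins that receive at least one prediction and
# recall averaged over all proteins (CAFA convention).
# y_true, y_prob: (n_proteins x n_terms) arrays.
def fmax(y_true, y_prob, thresholds=np.linspace(0.01, 0.99, 99)):
    best = 0.0
    for t in thresholds:
        pred = y_prob >= t
        covered = pred.sum(axis=1) > 0          # proteins with >=1 prediction
        if not covered.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)  # true positives per protein
        prec = (tp[covered] / pred[covered].sum(axis=1)).mean()
        rec = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```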
In the methods, the authors do not adequately explain their output representation -- is it framed as a multi-label classification problem, or as a multi-class problem (i.e. can a single input have multiple labels in the output)? Is it just a flat set of GO terms, so that the semantic interdependencies between the terms are not considered? (I later found this detail in the Results; it should come earlier.) It is actually a hierarchical classification problem [3]. This hierarchical nature also interacts with metrics and evaluation; please see [4] and [5]. The authors should consider/discuss their evaluation protocols in this context.
Authors' answer: Thanks for raising this issue. Our task is a multi-label problem; we mention this in the section "Deep neural network classifier" under "Methods". Our output representation is a flat set of GO terms. In our current implementation, we did not consider the semantic interdependencies between the terms, which might cause inconsistency between the predictions for leaf GO terms and those for their parent GO terms. However, when we generate the dataset for model training and testing, we do assign the parent GO term to the protein when the child GO term is present. We think this partially addresses the issue of incomplete GO term annotations in the dataset. Besides, since we report a probability in the range [0, 1] for each GO term, users can judge the reliability of each GO term (whether leaf or parent) by its probability.
We have also added some discussion under the "Discussion" section about how we can treat this as a hierarchical classification problem in the future. One option is to add hierarchical constraints on the output layer to force the predictions of leaf nodes to be consistent with the predictions of the corresponding parent nodes. Another potential method is to explicitly add the GO hierarchy into the model, along with our other network components, to inform the training process, since the GO hierarchy can be regarded as a network. We leave this integration to future work.
In the section "Graph2GO pipeline" the authors mention two alternative architectures. Ideally, they would test both architectures (possibly moving the detail to the Supplementary) and indicate which worked better.
Authors' answer: Thanks for pointing out this problem. We have added the comparison in the supplementary material. From the comparison, we see that the model that trains an independent VGAE for each network and combines their embeddings is better than the model that first combines the networks and then trains one VGAE to obtain the overall embeddings. This result supports our Graph2GO architecture.
On language: the writing is in general very good. There are a few colloquialisms that could be replaced: "plenty of" -> "substantial"; "really good" -> "very good" or "excellent". "noises" should just be "noise". "well suitable" should be "well suited". I found at least one typo: "nueral".
Authors' answer: Thanks for pointing these out. We have corrected the language problems.

Figure 1.
Here we use PTEN as an example to show how we encode the sequence, subcellular location and protein domain features. For sequence encoding, we use the conjoint triad (CT) method. For subcellular location and protein domains, we use bag-of-words encoding.

Figure 2.
(a) Architecture of the variational graph autoencoder (VGAE), the first part of Graph2GO. The inputs to the VGAE are an adjacency matrix A, representing a protein-related network, and a node attribute matrix. The VGAE is an encoder-decoder approach: the encoder is a two-layer graph convolutional network and the decoder is a dot-product decoder. The mean vector µ_i is the output embedding used for classification. (b) The Graph2GO pipeline, consisting of two VGAE models for the protein-protein interaction (PPI) network and the sequence similarity network (SSN), respectively, and the final deep neural network (DNN) classifier. The two embeddings generated from the PPI network and the SSN are concatenated as the input to the DNN classifier, which outputs the probability of the protein having each GO term annotation.
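For readers who want a concrete picture of panel (a), a minimal PyTorch-style sketch of such a two-layer VGAE with a dot-product decoder is given below; the layer sizes are assumptions, not the reported hyper-parameters.

```python
import torch
import torch.nn as nn

# Minimal VGAE sketch. A_hat: normalized adjacency (n x n) as a dense
# tensor; X: node attribute matrix (n x d). Hidden/latent sizes are
# illustrative assumptions.
class VGAE(nn.Module):
    def __init__(self, d, hidden=256, latent=128):
        super().__init__()
        self.w0 = nn.Linear(d, hidden)          # first GCN layer weights
        self.w_mu = nn.Linear(hidden, latent)   # mean branch of layer 2
        self.w_logvar = nn.Linear(hidden, latent)  # variance branch

    def encode(self, A_hat, X):
        h = torch.relu(A_hat @ self.w0(X))      # propagate + transform
        return A_hat @ self.w_mu(h), A_hat @ self.w_logvar(h)

    def forward(self, A_hat, X):
        mu, logvar = self.encode(A_hat, X)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        A_rec = torch.sigmoid(z @ z.t())        # dot-product decoder
        return A_rec, mu, logvar                # mu serves as the embedding
```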

Figure 3.
Performance comparison with other network-based methods. Graph2GO is compared with Mashup and deepNF in terms of three metrics: micro-AUPR, macro-AUPR and F-max. Panels (a)-(c) show the comparison results on the CC, MF and BP ontologies, respectively. Each method is evaluated using 5-fold cross-validation, repeated 10 times to calculate the confidence intervals.

Table 1.
Performance comparison among different network types (sequence similarity network, PPI network, and both) and node attributes (sequence, subcellular location, protein domains, and ALL). We use M-AUPR, m-AUPR and F-max as the evaluation metrics. ALL means using the combination of all features as node attributes; we do not include the sequence as a node attribute when the sequence similarity network is used to train the model.

Table 3.
Performance on the other five species in terms of the macro-AUPR, micro-AUPR and F-max metrics.