Partial order relation–based gene ontology embedding improves protein function prediction

Abstract Protein annotation has long been a challenging task in computational biology. Gene Ontology (GO) has become one of the most popular frameworks to describe protein functions and their relationships. Prediction of a protein annotation with proper GO terms demands high-quality GO term representation learning, which aims to learn a low-dimensional dense vector representation with accompanying semantic meaning for each functional label, also known as embedding. However, existing GO term embedding methods, which mainly take into account ancestral co-occurrence information, have yet to capture the full topological information in the GO-directed acyclic graph (DAG). In this study, we propose a novel GO term representation learning method, PO2Vec, to utilize the partial order relationships to improve the GO term representations. Extensive evaluations show that PO2Vec achieves better outcomes than existing embedding methods in a variety of downstream biological tasks. Based on PO2Vec, we further developed a new protein function prediction method PO2GO, which demonstrates superior performance measured in multiple metrics and annotation specificity as well as few-shot prediction capability in the benchmarks. These results suggest that the high-quality representation of GO structure is critical for diverse biological tasks including computational protein annotation.


INTRODUCTION
Proteins are the main bearers of life activities, and understanding their functions is important for unlocking the biological code. Experimental protein function annotation is time-consuming and extremely expensive. Next-generation sequencing technologies have produced a large number of protein sequences to be annotated, leading to an urgent need for low-cost and high-efficiency protein function annotation methods [1]. Deep learning [2] has shown promising results in predicting protein structure and protein function [3][4][5][6][7] over traditional methods based on sequence similarity and homology search. However, existing deep learning-based protein function prediction methods still confront significant challenges when dealing with a large and complex hierarchical functional label space. The first, on the protein side, is how to extract effective feature representations from amino acid (AA) sequences. Benefiting from the development of natural language processing (NLP) techniques [8], self-supervised pre-trained protein language models based on the transformer [9] have been shown to effectively encode embedding representations of AA sequences [4,10,11]. The second, on the label side, is how to accurately project the low-dimensional embedding of a protein into a large-scale, hierarchical and extremely unbalanced functional label space for the protein function prediction task [12]. Early deep learning approaches [5] directly use a flat classifier for protein function prediction and ignore the relationships between functional labels. Only recently have several studies [6,7] started to learn a low-dimensional geometric representation (e.g. a low-dimensional dense vector) for each functional label by exploring their inter-relationships and integrating this representation learning into downstream protein annotation. For example, Zhou et al. [6] use a two-layer graph convolutional network (GCN) [13] while Cao and Shen [7] use simple matrix multiplication to learn the vector representations of function labels. However, the complexity of functional labels still warrants more sophisticated modeling to capture their full relationships and semantic meanings to improve protein function prediction.
Gene Ontology (GO) [14] has become one of the most popular functional label systems to characterize proteins and their relationships, covering molecular functions (molecular function ontology, MFO), subcellular locations (cellular component ontology, CCO) and the biological processes in which proteins are involved (biological process ontology, BPO). The GO terms are organized in a hierarchical DAG, with shallow terms representing broad, abstract semantics and deep terms representing concrete, precise semantics. In the protein function annotation task, each protein is usually annotated with more than one GO term, constituting a multi-label classification problem in machine learning.
In this work, we propose a novel GO term representation learning method, PO2Vec, to learn embedding representations for GO terms. In contrast to existing methods [15][16][17][18], which typically rely on ancestral co-occurrence, PO2Vec learns topological information by exploring shortest reachable path-based partial order relationships. In addition, we apply the pre-trained GO embedding to the protein function classification task and propose a new protein function prediction method, PO2GO. Extensive evaluations demonstrate that PO2Vec outperforms existing methods, both IC-based [19] and deep learning-based [15,17,18,20], in learning GO embedding representations across a range of biological tasks. Additionally, PO2GO outperforms alternative methods [6,7] for protein annotation that also utilize the topological structure information of GO. The novelties of this approach are twofold. First, PO2Vec is the first method that utilizes the important partial order relationships in GO for learning better GO term embeddings. Second, we propose a contrastive learning method to model the partial order relationships, and the experimental results show that PO2Vec can better capture the topological and biological information of GO terms. Benefiting from PO2Vec's effective representation learning, PO2GO outperforms existing protein annotation methods in terms of information content and when dealing with insufficient training samples. These results encourage further extension of our method to exploit other types of protein-related features or to learn the representations of other ontology graphs.

METHODS & MATERIALS
In this paper, we propose a novel protein function prediction method, Partial Order to Gene Ontology (PO2GO), as shown in Figure 2. The architecture of the new method consists of three main components: (i) a protein feature extractor encodes a protein sequence into a vector; (ii) a GO term encoder obtains an embedding for each GO term; and (iii) a joint modeling predictor performs protein function prediction by searching the GO term embedding database. We introduce these modules in the following subsections.

Protein feature extractor
In this paper, we used ESM-1b [4] to obtain protein embeddings from AA sequences, since ESM-1b offers a moderate model size and strong feature representation ability [4,11]. ESM-1b is a pre-trained protein language model (PLM) using the RoBERTa [21] architecture, trained on over 250 million protein sequences from the UniRef database [22]. Compared with RoBERTa, the AA representations extracted by ESM-1b contain information about biological properties that can be directly applied to downstream protein tasks. Specifically, given a protein p with L AAs, we use ESM-1b to obtain the embedding of each AA and form a matrix B ∈ R^{d×L}, where d is the AA embedding dimension. Following the study [11], we use the same mean pooling strategy to aggregate AA features and obtain the protein embedding of p as f(p) = mean(B) ∈ R^{d×1}.
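As a minimal sketch of this pooling step (using NumPy and a toy matrix in place of real ESM-1b features):

```python
import numpy as np

def pool_protein_embedding(B: np.ndarray) -> np.ndarray:
    """Mean-pool a d x L matrix of per-residue features into a single
    (d, 1) protein embedding, i.e. f(p) = mean(B)."""
    return B.mean(axis=1, keepdims=True)

# toy example: d = 4 embedding dimensions, L = 3 residues
B = np.arange(12, dtype=float).reshape(4, 3)
f_p = pool_protein_embedding(B)
```

In practice B would be the per-residue output of ESM-1b; the pooling itself is dimension-agnostic.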

PO2Vec for GO term embedding
In the protein function prediction task, our aim is to map each protein to a set of GO terms. Given a set of m GO terms T = {t_1, t_2, ..., t_m}, we aim to learn an embedding function e(·) to obtain the embedding e(t_i) ∈ R^{d'×1} (where d' is the term embedding dimension) for a term t_i. There are several types of relationships between GO terms, the most common of which are is_a and part_of. Specifically, t_1 is_a t_2 means that t_1 is a subclass of t_2; t_3 part_of t_4 indicates that whenever t_3 is present, it is always a part of t_4. The relationships between GO terms can be transitive. For example, as shown in Figure 1A, if membrane-bounded organelle (GO:0043227) is_a organelle (GO:0043226) and organelle (GO:0043226) is_a cellular anatomical entity (GO:0110165), then membrane-bounded organelle (GO:0043227) is_a cellular anatomical entity (GO:0110165). This transitivity forms a hierarchy between terms, in which shallower terms represent more abstract semantics and deeper terms represent more concrete semantics. The hierarchy is crucial for protein function annotation. For example, if a protein is annotated as membrane-bounded organelle (GO:0043227), then it should also be annotated as organelle (GO:0043226) and cellular anatomical entity (GO:0110165); this transmission rule is known as the true path rule [23] in protein function annotation. Although a number of relationships exist between GO terms, only is_a and part_of, which account for 88% of relationships (Supplementary Table S1 available online at http://bib.oxfordjournals.org/), can safely transfer annotations [14,24]. Since is_a and part_of constitute the majority of the relationships, we only consider these two in our modeling for the sake of simplicity. In the following, we first define the partial order constraints to guide the embedding learning and then propose a contrastive learning method to learn the GO term embeddings.

Partial order constraint
In GO representation learning, the majority of methods [15][16][17][18] that rely on the GO DAG structure assume that the similarity of two terms is primarily determined by the topological structure of the GO DAG, despite the fact that the similarity may also be influenced by other factors, such as the number of annotations of terms. Intuitively, sim(e(t_i), e(t_j)), the similarity between two terms t_i and t_j, should be related to the length of the shortest path between t_i and t_j. For example, both organelle (GO:0043226) and cellular anatomical entity (GO:0110165) are ancestors of membrane-bounded organelle (GO:0043227) (Figure 1A). However, the semantic similarity between organelle (GO:0043226) and membrane-bounded organelle (GO:0043227) should be greater than the semantic similarity between cellular anatomical entity (GO:0110165) and membrane-bounded organelle (GO:0043227). Although most existing GO term embedding methods [15][16][17][18][25] capture the co-occurrence relationships between a term and its ancestors, few take the path between two terms into consideration for GO term embedding learning. To solve this problem, we propose a new method for learning GO term embeddings that takes the path between terms into consideration. Given two terms t_i and t_j within a GO DAG, we define the shortest reachable path (SRP), srp(t_i, t_j), based on the following three cases:
1. Direct reachability: t_j is directly reachable from t_i if there exists a directed path that starts at t_i and ends at t_j. This typically applies to node pairs with ancestral relationships. srp(t_i, t_j) is defined as the path with the minimum number of edges connecting t_i and t_j, and len(srp(t_i, t_j)) returns that number of edges; for example, len(srp(GO:0043227, GO:0110165)) = 2.
2. Indirect reachability: t_j is indirectly reachable from t_i if t_j is not directly reachable from t_i but there exists a term t_k that is directly reachable from both t_i and t_j. This is often the case for sibling or sibling-like term pairs. srp(t_i, t_j) is defined as the path with the minimum number of edges connecting t_i and t_j among all indirectly reachable paths, and the length of the SRP is calculated by adding 0.5 to the number of edges; for example, len(srp(GO:0043227, GO:1990900)) = 3.5.
3. Unreachability: t_j is unreachable from t_i if t_j is neither directly nor indirectly reachable from t_i. This category encompasses all other cases, typically where t_i and t_j come from different domains. The length of the SRP between t_i and t_j is defined as infinity; for example, len(srp(GO:0043227, GO:0044238)) = +∞.
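The three SRP cases can be sketched with networkx, assuming a toy DAG whose edges point from child to parent; the node names below are placeholders, not real GO terms:

```python
import networkx as nx

def srp_length(G: nx.DiGraph, t_i: str, t_j: str) -> float:
    """SRP length between two terms in a DAG G with child -> parent edges."""
    # Case 1: direct reachability (ancestral relationship, either direction)
    for a, b in ((t_i, t_j), (t_j, t_i)):
        if nx.has_path(G, a, b):
            return float(nx.shortest_path_length(G, a, b))
    # Case 2: indirect reachability via a term t_k reachable from both
    common = nx.descendants(G, t_i) & nx.descendants(G, t_j)
    if common:
        edges = min(nx.shortest_path_length(G, t_i, k) +
                    nx.shortest_path_length(G, t_j, k) for k in common)
        return edges + 0.5        # number of edges plus 0.5
    # Case 3: unreachable (typically terms from different GO domains)
    return float("inf")

# toy hierarchy: mbo -> org -> cae, sib -> org; "other" is a separate domain
G = nx.DiGraph([("mbo", "org"), ("org", "cae"), ("sib", "org")])
G.add_node("other")
```

Here `srp_length(G, "mbo", "cae")` is 2.0 (direct), `srp_length(G, "mbo", "sib")` is 2.5 (siblings via `org`) and `srp_length(G, "mbo", "other")` is infinite.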

PO2Vec
We adopt a contrastive learning strategy to learn distinctive embeddings of GO terms. Contrastive learning is an unsupervised representation learning method that pulls positive pairs of 'similar' inputs together and pushes negative pairs of 'dissimilar' inputs apart in the representation space. The commonly used approach InfoNCE [26] defines the following loss function:

L_InfoNCE = − Σ_{x∈X} log [ exp(s(x, x⁺)) / ( exp(s(x, x⁺)) + Σ_{x_j∈N(x)} exp(s(x, x_j)) ) ],

where X contains the training examples in a batch, x⁺ is the positive sample of x, N(x) is the set of negative samples of x and s(x, x_j) is the similarity between x and x_j.
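A minimal NumPy sketch of the InfoNCE loss for a single anchor, assuming raw similarity scores as inputs (the function name and temperature default are ours, not the paper's):

```python
import numpy as np

def info_nce(sim_pos: float, sim_negs: np.ndarray, tau: float = 1.0) -> float:
    """InfoNCE loss for one anchor x: sim_pos = s(x, x+),
    sim_negs = [s(x, x_j) for x_j in N(x)].  A lower loss means the
    positive sample is scored well above the negatives."""
    logits = np.concatenate(([sim_pos], sim_negs)) / tau
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))
```

With equal similarities (e.g. all zero) and two negatives the loss is log 3, the uniform baseline; it decreases as the positive similarity grows relative to the negatives.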
In contrastive learning, one of the most important tasks is to sample the positive sample t_i⁺ and the negative sample set N(t_i) for the term t_i. In the implementation of PO2Vec, we exclusively choose positive samples from the terms that are directly reachable with an SRP length of 1 (i.e. parent or child terms) or indirectly reachable with an SRP length of 2.5 (i.e. sibling terms), for two main reasons. First, the parent-child and sibling relationships are the closest biological relationships between GO terms, so choosing positive samples based on these relationships aligns with biological intuition. Second, the directly and indirectly reachable terms are numerous in total; adding the above constraints during the selection of positive samples helps address sampling bias and enhances the robustness of the algorithm.
Usually, the number of indirectly reachable and unreachable terms is much larger than the number of directly reachable terms. If we simply drew negative samples from all terms, we would obtain few directly reachable terms, which are important in GO term embedding learning. Therefore, we propose a new negative term sampling method to obtain better negative samples, illustrated in Figure 1B. The new method consists of three phases: indexing, sampling and contrastive learning.

Indexing
The indexing phase generates three lists for each term for sampling positive and negative terms. Given a term t_i, we construct the three lists Q_dr(t_i), Q_ir(t_i) and Q_ur(t_i): Q_dr(t_i) consists of the directly reachable terms of t_i, Q_ir(t_i) consists of the indirectly reachable terms of t_i and Q_ur(t_i) consists of the unreachable terms in the domains different from that of t_i. Note that the terms in Q_dr(t_i) and Q_ir(t_i) must be sorted in ascending order of their SRP length to t_i.

Stratified sampling
In the sampling phase, we sample one positive term and k negative terms from the three lists Q_dr(t_i), Q_ir(t_i) and Q_ur(t_i). Specifically, given a term t_i, we randomly sample a directly reachable term with an SRP length of 1 from Q_dr(t_i) or an indirectly reachable term with an SRP length of 2.5 as t_i⁺. We then take a total of k negative samples to form the negative sample set N(t_i). Here, we need to set two hyperparameters k and u, where k determines the total number of negative terms and u determines the number of negative samples in N_dr(t_i). For detailed information on the experimental settings of the hyperparameters k and u, please refer to the Supplementary Materials.
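A sketch of the stratified sampling step; note that how the k − u remaining slots are divided between Q_ir and Q_ur is our assumption (an even split), since the exact scheme is given in the Supplementary Materials:

```python
import random

def sample_negatives(Q_dr, Q_ir, Q_ur, k, u, rng=None):
    """Stratified negative sampling sketch: u negatives from the directly
    reachable list Q_dr, with the remaining k - u slots filled from the
    indirectly reachable (Q_ir) and unreachable (Q_ur) lists.  The even
    split of the remainder is an illustrative assumption."""
    rng = rng or random.Random(0)
    n_dr = rng.sample(Q_dr, min(u, len(Q_dr)))
    rest = k - len(n_dr)
    n_ir = rng.sample(Q_ir, min(rest // 2, len(Q_ir)))
    n_ur = rng.sample(Q_ur, min(rest - len(n_ir), len(Q_ur)))
    return n_dr + n_ir + n_ur

negs = sample_negatives(list("abcde"), list("fghij"), list("klmno"), k=6, u=2)
```

This guarantees that directly reachable terms, which would otherwise be rarely drawn, are represented among the negatives.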

Contrastive learning
Finally, we define the following balanced InfoNCE loss for the GO term embedding learning:

L = − Σ_{t_i∈T} log [ exp(s(e(t_i), e(t_i⁺))/τ) / ( exp(s(e(t_i), e(t_i⁺))/τ) + Σ_{t_j∈N(t_i)} exp(s(e(t_i), e(t_j))/τ) ) ],

where N(t_i) is the stratified negative sample set and τ is a temperature hyperparameter that scales the output of the similarity function.

Joint modeling predictor
In our method, we first derive the GO term embeddings using PO2Vec. It is crucial to emphasize that, during the inference stage, these GO term embeddings remain consistent across different proteins; essentially, the role of the joint modeling predictor can be interpreted as embedding database searching. We also acquire the protein embedding from the pre-trained ESM-1b. Subsequently, these two kinds of embeddings are used to jointly model protein function prediction. To bridge the gap in the semantic space between GO terms and proteins, we introduce two multi-layer perceptrons (MLPs) to project GO term embeddings and protein embeddings into the same space (Figure 2). Specifically, given a protein p_i, we obtain the protein embedding f(p_i) from ESM-1b; then, an MLP is applied to obtain f_proj(p_i) = MLP(f(p_i)). Likewise, given a GO term t_j, we obtain its projected embedding e_proj(t_j). We compute the similarity between the protein p_i and the GO terms as s ∈ R^{m×1}, where m is the number of GO terms and the j-th element of s is calculated as s_j = e_proj(t_j)^T · f_proj(p_i). Unlike previous works that directly use this similarity vector as the final prediction result, we introduce an MLP layer to obtain the protein function prediction result ŷ ∈ R^{m×1}, trained with a multi-label binary cross-entropy loss:

L_BCE = − Σ_{j=1}^{m} [ y_j log ŷ_j + (1 − y_j) log(1 − ŷ_j) ],

where y ∈ {0, 1}^{m×1} and y_j denotes whether the GO term t_j is annotated to the protein or not.
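The predictor's forward pass can be sketched with NumPy, replacing the three learned MLPs with single random linear layers for illustration (the dimensions and weight scales below are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_go, h, m = 8, 6, 16, 5     # protein dim, GO-term dim, shared dim, #terms

# toy stand-ins for the three learned MLPs (small random linear layers)
W_prot = 0.1 * rng.normal(size=(h, d))
W_go = 0.1 * rng.normal(size=(h, d_go))
W_out = 0.1 * rng.normal(size=(m, m))

def predict(f_p, E_go):
    """f_p: protein embedding, shape (d,); E_go: GO term embeddings,
    shape (m, d_go).  Returns per-term probabilities y_hat in (0, 1)."""
    f_proj = W_prot @ f_p           # project protein into the shared space
    E_proj = E_go @ W_go.T          # project every GO term likewise, (m, h)
    s = E_proj @ f_proj             # similarities, s_j = e_proj(t_j)^T f_proj
    logits = W_out @ s              # extra MLP layer over the similarities
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid for multi-label output

y_hat = predict(rng.normal(size=d), rng.normal(size=(m, d_go)))
```

In training, the sigmoid outputs feed the multi-label binary cross-entropy loss, and only the projection and output layers are updated while the protein encoder stays fixed.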
In the training stage, we fix the protein feature extractor (thus the protein representations are fixed) and learn the parameters of the three MLPs by minimizing the loss function in Equation (5) with back propagation. Note that the GO term representations will be tuned in the training stage.

Benchmark and evaluation
The datasets, metrics and experimental settings for benchmark are detailed in Supplementary Material.

Performance evaluation of GO representation learning
We carried out five tasks to thoroughly analyze the learned GO term embeddings in capturing the hierarchical structure of GO: depth prediction, distinguishability between ancestor and non-ancestor terms, GO domain information encoded by the embeddings, correlation with sim_PFAM and correlation with sim_PPI.

Depth prediction
The aim of this task is to measure the learned GO term embeddings in distinguishing the depth of GO terms [18].In this task, we took the GO term embeddings as input and trained a simple MLP to predict the longest path from a given GO term to its root.We randomly split the GO dataset into an 80% training set and a 20% test set.
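The depth labels for this task (longest path from a term up to its root) can be computed on a toy DAG with networkx; this is our sketch, not the paper's code, and it assumes every term has a path to the root:

```python
import networkx as nx

def term_depths(G: nx.DiGraph, root: str) -> dict:
    """Depth of each term: the longest path from the term to the root,
    in a DAG with child -> parent edges."""
    depth = {root: 0}
    # reversed topological order processes parents before their children
    for t in reversed(list(nx.topological_sort(G))):
        if t != root:
            depth[t] = 1 + max(depth[p] for p in G.successors(t))
    return depth

# toy hierarchy: mbo -> org -> cae plus a shortcut mbo -> cae
G = nx.DiGraph([("mbo", "org"), ("org", "cae"), ("mbo", "cae")])
depths = term_depths(G, "cae")
```

Note the longest path is used, so `mbo` gets depth 2 via `org` even though a direct edge to the root exists.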
The depth prediction results of PO2Vec and the baselines are shown in Figure 3A, which indicates that PO2Vec significantly outperforms the baselines in the depth prediction task. This result demonstrates that PO2Vec captures the depth information of a term better than the baselines.
We conducted an ablation experiment to analyze the benefit of balanced InfoNCE over traditional InfoNCE in capturing depth information. The experimental results are shown in Figure 3B. With the default InfoNCE and no stratified sampling, the model's performance in predicting hierarchical depth decreases significantly. This is because directly reachable terms are rarely sampled, given their small number compared to indirectly reachable and unreachable terms, so the model is undertrained with negative samples dominated by the latter two categories.

Distinguishability between ancestor terms and non-ancestor terms
In this task, we calculate the cosine similarity of each GO term to its ancestor terms and non-ancestor terms (Figure 4). Intuitively, a GO term is semantically closer to its ancestors than to its non-ancestor terms. The 1-Wasserstein distances [27] computed between the two similarity distributions are larger for PO2Vec and TransH than for Anc2Vec and OPA2Vec (as shown at the top of Figure 4), suggesting that PO2Vec and TransH distinguish between ancestor and non-ancestor terms better than the other methods. The similarity distributions computed from OPA2Vec embeddings show that a given GO term tends to be more similar to its non-ancestor terms than to its ancestors, indicating that OPA2Vec may not model these two types of GO term relationships very well. The ancestor similarity distribution is more dispersed for PO2Vec than for TransH and Anc2Vec, which implies that PO2Vec is better at resolving ancestors of varying distances. Taken together, PO2Vec is capable of modeling GO data with hierarchical relationships.
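The 1-Wasserstein distance between two empirical similarity distributions can be computed with SciPy; the Gaussian samples below are stand-ins for the actual cosine similarities of ancestor and non-ancestor pairs:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
# stand-ins for cosine-similarity samples; the real values would come
# from the learned GO term embeddings
anc_sims = rng.normal(loc=0.6, scale=0.1, size=1000)
non_anc_sims = rng.normal(loc=0.1, scale=0.1, size=1000)

d = wasserstein_distance(anc_sims, non_anc_sims)
```

For two equally shaped distributions, the 1-Wasserstein distance reduces to the shift between them (about 0.5 here), which is why a larger distance means the two similarity populations are better separated.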

Domain information encoded by GO term embeddings
To evaluate whether the GO domain information is captured in the embedding of each term, we project the embeddings onto a two-dimensional space with the dimension reduction method Uniform Manifold Approximation and Projection (UMAP) [28] (Figure 5). The result shows a clear separation of three clusters, indicating that PO2Vec is capable of encoding GO terms into proper sub-spaces.

Correlation with sim PFAM
The quality of GO term representations can be assessed by comparing protein similarity to the semantic similarity computed from GO term embeddings. The idea is that the semantic similarity between the annotated GO terms of two proteins should reflect the similarity of the proteins; consequently, the correlation between them implicitly indicates the quality of the GO representation. Thus, we use the correlation of semantic similarity with sim_PFAM to evaluate the performance of PO2Vec and other methods. The semantic similarity between the GO annotations of each pair of proteins was computed with the best-match average (BMA) method. Then, the Pearson correlation coefficient and Spearman's rank correlation coefficient were calculated to evaluate the correlation between the semantic similarity and sim_PFAM. The results demonstrate that PO2Vec outperforms the other embedding methods in the PFAM-1 and PFAM-3 datasets as well as the species-divided datasets (Table 1) in most situations. PO2Vec obtained the best performance in all groups except EC and SC, where TransH, a strong baseline method for knowledge graph embedding (KGE), yields better results. The performance of PO2Vec in this analysis suggests that its learned GO embeddings may represent proteins better than the other embedding methods.
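A sketch of the BMA computation, assuming cosine similarity between GO term embeddings as the term-level similarity (the exact similarity used in the benchmark is detailed in the Supplementary Material):

```python
import numpy as np

def bma_similarity(E1: np.ndarray, E2: np.ndarray) -> float:
    """Best-match average (BMA) semantic similarity between two proteins,
    given the embeddings of their annotated GO terms (one row per term)."""
    A = E1 / np.linalg.norm(E1, axis=1, keepdims=True)
    B = E2 / np.linalg.norm(E2, axis=1, keepdims=True)
    S = A @ B.T                             # pairwise cosine similarities
    best12 = S.max(axis=1).sum()            # best match for each term of p1
    best21 = S.max(axis=0).sum()            # best match for each term of p2
    return float((best12 + best21) / (S.shape[0] + S.shape[1]))
```

Two proteins with identical annotation sets score 1.0, while mutually orthogonal annotations score 0, so the BMA value directly tracks how well-matched the two annotation sets are.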

Correlation with sim PPI
It has been suggested that interacting proteins are likely to share similar subcellular locations or biological processes. As a result, the semantic similarity of two proteins is also a putative predictor of their interaction: interacting protein pairs should have a higher semantic similarity score than non-interacting pairs. Therefore, we evaluate the quality of GO term embeddings by their capability to discriminate proteins as interacting or non-interacting.
Similar to sim_PFAM, we computed the semantic similarity and its Pearson correlation and Spearman's rank correlation with sim_PPI (Table 1). The results demonstrate that PO2Vec obtained the highest correlation coefficients on the combined PPI datasets. On the species-divided protein datasets, the performance of PO2Vec is the best among the GO term embedding methods (except for Spearman's correlation in EC), closely followed by Anc2Vec.
For a better comparison, we plotted the distributions of the semantic similarity for each protein pair, interacting or non-interacting, and computed their 1-Wasserstein distances for each embedding method (Figure 6). The 1-Wasserstein distances obtained from PO2Vec are larger than the others, suggesting that PO2Vec embeddings better distinguish interacting proteins from non-interacting ones. Taken together, these results indicate the superior performance of the GO embeddings obtained by PO2Vec in distinguishing whether protein pairs interact.

Performance evaluation of protein function prediction
Based on PO2Vec, we developed PO2GO to annotate protein functions with GO and compared its performance to state-of-the-art methods, including TALE, DeepGOA and DeepGOPlus. Similar to PO2GO, both TALE and DeepGOA take advantage of hierarchical information in the GO framework. Note that we replaced the original protein feature extractors of TALE and DeepGOA with ESM-1b for a fair comparison. DeepGOPlus, a classical strong baseline method based on a one-dimensional convolutional neural network, was also included in the benchmark. The performance of the traditional sequence similarity-based method, Diamond, was listed for comparison too. As shown in Table 2, PO2GO outperforms the other methods in all three GO domains. The superior performance of PO2GO over TALE and DeepGOA indicates the effectiveness of our PO2Vec embedding method.

Ablation study of PO2GO
To further analyze the contribution of PO2Vec to protein function prediction, we conducted an ablation experiment. When we removed PO2Vec or replaced it with a naive multi-hot embedding that captures the information about GO terms and their ancestors, the performance of PO2GO degraded (Table 4). These ablation results demonstrate that the GO information captured by PO2Vec helps enhance the predictive performance of the model for protein annotation. We also found that the prediction performance of PO2GO did not decrease significantly when we fixed the pre-trained PO2Vec embeddings during PO2GO training. This shows that the embedding pre-training procedure of PO2Vec learns good GO term representations suitable for downstream tasks.

The specificity of protein annotation
To explore the characteristics of protein annotation by PO2GO, we compare PO2GO with competing methods in terms of specificity as defined by information content (IC) [19,29]. A term with high IC tends to be more specialized and rarer in occurrence, while a term with low IC tends to describe a broad function and appears more commonly in protein annotations. We calculate the average IC of the true-positive predictions of a protein for the four methods at their respective best thresholds (determined by F_max). For a detailed description, please refer to the Supplementary Material. In BPO and MFO, PO2GO obtains the best average IC and significantly outperforms the other competing methods (Table 3). In CCO, TALE slightly outperforms PO2GO and achieves the best average IC. Different from MFO and BPO, the average depth of GO terms is shallower in CCO: 6.2 in CAFA3, compared to 6.8 and 8.3 in MFO and BPO, respectively. Therefore, our PO2Vec embedding approach probably models deeper GO hierarchical structure better, which results in better performance in MFO and BPO than in CCO. In summary, PO2GO tends to generate predictions with higher or comparable specificity for protein annotation.

Few-shot prediction
Some protein families have few known sequences; thus, it is important to assess how protein annotation models perform when the training examples are insufficient. For this purpose, we group GO terms according to their numbers of annotated proteins and calculate the GO term-centric F1-score within each group as an evaluation metric at the best threshold (determined by F_max). As shown in Figure 7, PO2GO in CCO (Figure 7C) is not consistently better than the other methods as it is in BPO and MFO, probably due to CCO's shallow hierarchical structure, a phenomenon also observed in the annotation specificity analysis. Overall, PO2GO has better predictive performance in the majority of scenarios.

CONCLUSION AND DISCUSSION
A novel model named PO2Vec is presented for GO term representation learning. Under the SRP-based partial order constraint, the ancestral or non-ancestral properties of terms were captured by contrastive learning, resembling the learning process in previous co-occurrence-based methods. We included terms across GO domains for similarity computation so that all the terms from the three domains can be mapped to the same embedding space, although this may have limited direct biological implications. In this process, we assumed that the similarity between terms from different domains is smaller than that of any term pair with an ancestral relationship. PO2Vec further refines the learned ancestral information to distinguish the proximities of terms with distant ancestors and close ancestors. Overall, PO2Vec captures the GO topological information more comprehensively under the SRP-based partial order constraint. The effectiveness of PO2Vec was demonstrated in experimental analyses from five aspects on both the GO and protein levels. In the GO-level experiments, we demonstrate that the embeddings learned by PO2Vec can reconstruct term depth well, differentiate ancestor and non-ancestor GO terms and identify their own GO domains. In addition, on the protein level, we find that the PO2Vec embeddings reflect the functional domains and interactions of proteins, indicating that our embeddings are suitable for a wide range of gene-related bioinformatics analyses.
A novel protein function annotation model, PO2GO, is proposed to integrate PO2Vec and the pre-trained protein language model ESM-1b. The effectiveness of PO2GO is demonstrated in benchmarks against DeepGOA, TALE and DeepGOPlus as well as Diamond. In comparison to simpler approaches such as graph neural networks and graph-based regularization techniques, PO2Vec provides a deeper semantic representation by effectively distinguishing between varying degrees of term proximity through a contrastive learning framework. This suggests that high-quality GO term embedding helps improve computational protein annotation. Consequently, GO term embedding not only improves the overall performance of protein function prediction as measured by F_max, S_min and AUPR but also enhances the specificity of the predictions and the few-shot ability of the model. The enhancement of prediction specificity facilitates annotating a protein reliably with specific GO terms rather than generic, shallow-level ones. The improvement of the model's few-shot ability provides a feasible solution for biological tasks, including protein function prediction, that are hindered by the limited availability of labeled data. The ESM-1b adopted in the PO2GO architecture can be replaced by other well-trained protein encoders, since our formulation is generic enough that any pre-trained protein language model may be used. Moreover, our embedding approach may also provide insight for the representation learning of other knowledge graphs such as the Human Phenotype Ontology [30], Disease Ontology [31] and SNOMED CT [32].
One limitation of this study is that the SRP-based partial order constraint takes into account only the is_a and part_of relationships in the GO system. Although these two types of relationships make up over 88% of all GO term relationships, the embedding could potentially be further improved if the remaining relationship types were included in the representation learning. Moreover, the textual description of each GO term is another source of information that could be processed by NLP models in addition to the relationship information in the GO DAGs, which may also enhance the representation learning of GO terms.

Key Points
• A novel model named PO2Vec is presented for Gene Ontology (GO) term representation learning.
• Compared with existing methods based on GO-directed acyclic graph structure, PO2Vec captures the topological information of GO more comprehensively under the shortest reachable path-based partial order constraint.
• The effectiveness of PO2Vec was demonstrated in experimental analyses from five aspects.
• A novel protein function annotation prediction model, named PO2GO, is proposed, which is jointly constructed from PO2Vec and the pre-trained protein language model ESM-1b. The superior performance of PO2GO is demonstrated with comparative benchmarks.

Figure 1 .
Figure 1. Illustration of the partial order constraint used in the PO2Vec embedding algorithm. (A) Snippets of the GO hierarchy demonstrating the shortest reachable path-based partial order constraints. A given GO term should be semantically more similar to a closely reachable term than to a distantly reachable one; reachability falls into three categories: directly reachable, indirectly reachable and unreachable. (B) The flowchart of the PO2Vec algorithm incorporating the partial order constraint. With contrastive learning, semantically close terms are pulled together while distant terms are pushed apart in the embedding space.

Figure 2 .
Figure 2. The network framework of PO2GO. The architecture consists of three modules: (i) the protein feature extractor encodes a protein into a d-dimensional embedding; (ii) the GO term encoder encodes each GO term (m GO terms in total) into a d-dimensional embedding; and (iii) the joint modeling predictor identifies the mapped GO terms for each protein among the m GO terms.

Figure 3 .
Figure 3. Depth prediction results: (A) Comparison with baseline methods and (B) PO2Vec with InfoNCE and Balanced InfoNCE.

Figure 4 .
Figure 4. The distributions of cosine similarity of each GO term to its ancestor terms and non-ancestor terms. The 1-Wasserstein distance between the two distributions is shown at the top for each GO embedding method. The negative sign for OPA2Vec indicates that the distance between the ancestor and non-ancestor distributions is in the reverse direction to those of PO2Vec, TransH and Anc2Vec. The dashed lines indicate the median of each distribution.

Figure 5 .
Figure 5. The clustering of GO term embeddings in the three GO domains. Each point represents a GO term.

Figure 6 .
Figure 6. Semantic similarity distributions of the sets of GO terms on the PPI datasets. The 1-Wasserstein distances are shown at the top.

Figure 7 .
Figure 7. The term-centric F1-score evaluation grouped by annotation size on the CAFA3 dataset. (A) BPO evaluation; (B) MFO evaluation; and (C) CCO evaluation.

Table 2 :
The performance evaluation of different methods for protein function prediction on the CAFA3 and SwissProt datasets

Table 3 :
Average information content of the predictions on the CAFA3 dataset

Table 4 :
AUPR of the predictions on the CAFA3 dataset in the ablation analysis