Data and Text Mining

Measuring Gene Functional Similarity Based on Group-wise Comparison of GO Terms

Motivation: Compared with sequence and structure similarity, functional similarity is more informative for understanding the biological roles and functions of genes. Many important applications in computational molecular biology require functional similarity, such as gene clustering, protein function prediction, protein interaction evaluation and disease gene prioritization. Gene Ontology (GO) is now widely used as the basis for measuring gene functional similarity. Some existing methods combine semantic similarity scores of single term pairs to estimate gene functional similarity, whereas others compare terms in groups to measure it. However, these methods may make error-prone judgments about gene functional similarity. Measuring gene functional similarity reliably remains a challenge.
Results: We propose a novel method called SORA to measure gene functional similarity in the GO context. First, SORA computes the information content (IC) of a term using its semantic specificity and coverage. Second, SORA measures the IC of a term set by combining the inherited and extended IC of the terms based on the structure of GO. Finally, SORA estimates gene functional similarity using the IC overlap ratio of term sets. SORA is evaluated against five state-of-the-art methods in the field on the public platform for collaborative evaluation of GO-based semantic similarity measures. Careful comparisons show that SORA is superior to the other methods in general. Further analysis suggests that it primarily benefits from the structure of GO, which implies expressive information about gene function. SORA offers an effective and reliable way to compare gene function.


INTRODUCTION
In recent years, gene functional similarity has become a major focus of biological research because it is important for a variety of applications, such as gene clustering (Brameier and Wiuf, 2007; Cho et al., 2009; Qu and Xu, 2004; Yang et al., 2008), protein interaction prediction and evaluation (Li et al., 2008; Jain and Bader, 2010; Schlicker et al., 2007), gene function prediction (Chen and Xu, 2004; Jensen et al., 2003; Nariai et al., 2007) and disease gene prioritization (Chen et al., 2009; Mathur and Dinakarpandian, 2011; Ortutay and Vihinen, 2009; Schlicker et al., 2010; Yilmaz et al., 2009). Moreover, compared with sequence and structure similarity, functional similarity is more informative for understanding the biological roles and functions of genes.
Gene Ontology (GO) is a controlled vocabulary of terms for describing the behavior of genes and their products (GO-Consortium, 2004), which makes it valuable for measuring gene functional similarity. Genes and their products, collectively called genes in this article for simplicity, are usually annotated with multiple terms. Functional similarity between genes can be inferred from the semantic relationships of their terms: two genes are considered similar in function if their terms are similar in semantics. Accordingly, many methods based on semantic similarity have been put forward to estimate gene functional similarity. These methods can be generally classified into two categories: pairwise and group-wise (Pesquita et al., 2009a).
Pairwise methods measure gene functional similarity in two steps. The first step measures the semantic similarity scores of term pairs using term comparison techniques. The most typical term comparison techniques used by these methods are Resnik's (1999), Lin's (1998) and Jiang and Conrath's (1998). The second step computes gene functional similarity from the semantic similarity scores calculated in the first step, using a combining rule such as the average rule (AVG), the maximum rule (MAX) or the best-match average rule (BMA). Methods based on AVG take the average of the semantic similarity scores of all term pairs as gene functional similarity. Methods based on MAX take the maximal semantic similarity score over all term pairs as gene functional similarity. Methods based on BMA find all the best matches between the term sets and take the average of the semantic similarity scores of these best matches as gene functional similarity. Since Lord et al. (2003) used GO and AVG to estimate gene functional similarity, great efforts have been made in this field. Sevilla et al. (2005) and Azuaje et al. (2005) introduced methods like Lord's, but they used MAX and BMA rather than AVG. Meanwhile, many variants of the aforementioned typical term comparison techniques, such as GraSM (Couto et al., 2005), Wang's (Wang et al., 2007) and Pozo's (Pozo et al., 2008), were proposed. Recently, Couto et al. (2011) exploited DiShIn to update GraSM, and Yang et al. (2012) improved the semantic similarity between two terms by considering their common ancestors and descendants.
Although pairwise methods are widely used for measuring gene functional similarity, they suffer from limitations of the combining rules. Methods based on AVG tend to underestimate gene functional similarity. For instance, if two genes are both annotated with the same two terms, and these terms are unrelated to each other, their functional similarity is 0.5 under AVG. In fact, the genes are exactly matched, and their functional similarity should be 1. Methods based on MAX tend to overestimate gene functional similarity. For example, the functional similarity between two genes that share any common term is 1, regardless of the terms in which they differ. Methods based on BMA strike a balance between the two. Nevertheless, all pairwise methods depend on how well the semantic similarity of a single term pair is measured. Detailed discussions of these methods can be found in several reviews (Pesquita et al., 2009a; Guzzi et al., 2011).
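The behavior of the three combining rules can be illustrated with a minimal Python sketch. The function names and the toy term similarity (1 for identical terms, 0 otherwise) are assumptions made for illustration, and the BMA variant shown pools the best matches from both directions; other BMA formulations exist.

```python
def avg_rule(sim, terms_a, terms_b):
    """AVG: mean similarity over all cross pairs of terms."""
    scores = [sim(a, b) for a in terms_a for b in terms_b]
    return sum(scores) / len(scores)


def max_rule(sim, terms_a, terms_b):
    """MAX: highest similarity over all cross pairs of terms."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def bma_rule(sim, terms_a, terms_b):
    """BMA: mean of the best match of every term in either set (pooled variant)."""
    best_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


def toy_sim(a, b):
    """Hypothetical term similarity: unrelated terms score 0, identical terms 1."""
    return 1.0 if a == b else 0.0


# Two genes annotated with the same two mutually unrelated terms.
gene_1 = ["t1", "t2"]
gene_2 = ["t1", "t2"]
# A gene sharing only one term with gene_1.
gene_3 = ["t1", "t3"]

identical_avg = avg_rule(toy_sim, gene_1, gene_2)   # 0.5, underestimates
identical_bma = bma_rule(toy_sim, gene_1, gene_2)   # 1.0
partial_max = max_rule(toy_sim, gene_1, gene_3)     # 1.0, overestimates
partial_bma = bma_rule(toy_sim, gene_1, gene_3)     # 0.5
```

With this toy similarity, AVG gives 0.5 for two identically annotated genes, while MAX gives 1 for genes that share only one of their two terms, matching the two failure cases described in the text.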
Group-wise methods estimate gene functional similarity by comparing the terms in groups. These methods are categorized as follows: set-based, graph-based and vector-based. Set-based methods (Batet et al., 2011; Gentleman et al., 2005; Lee et al., 2004; Martin et al., 2004; Mistry and Pavlidis, 2008; Pesquita et al., 2008) first put the terms and their ancestors into a term set that represents the gene. Then, they compute the semantic similarity score between the term sets using Tversky's ratio model (Tversky, 1977). Finally, the semantic similarity score between the term sets is taken as gene functional similarity. Graph-based methods use a GO sub-graph to describe a gene, in which nodes are terms and arcs represent relationships between terms. These methods estimate gene functional similarity by means of graph matching (Alvarez and Yan, 2011; Cho et al., 2007; Gentleman et al., 2005; Lin et al., 2004; Sheehan et al., 2008; Ye et al., 2005; Yu et al., 2007). Vector-based methods represent each gene as a vector in which each dimension corresponds to a term, with 1 meaning that the term occurs and 0 otherwise. They measure gene functional similarity by calculating the cosine similarity of the vectors (Huang et al., 2007) or the probability of co-occurrence of the terms (Chabalier et al., 2007).
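As a rough illustration of the vector-based representation, the following Python sketch encodes two genes as binary term vectors and compares them by cosine similarity. The vocabulary, the term names and the helper names are hypothetical; real methods may weight the dimensions rather than use plain 0/1 values.

```python
import math


def binary_vector(annotated_terms, vocabulary):
    """1 where the gene is annotated with the term, 0 otherwise."""
    return [1 if term in annotated_terms else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a gene without annotations is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and annotations.
vocabulary = ["t1", "t2", "t3", "t4"]
vec_a = binary_vector({"t1", "t2"}, vocabulary)  # [1, 1, 0, 0]
vec_b = binary_vector({"t2", "t3"}, vocabulary)  # [0, 1, 1, 0]
similarity = cosine(vec_a, vec_b)                # about 0.5
```

Because every dimension is treated independently, this representation does not see that two different terms may be closely related in the GO hierarchy.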
To our knowledge, the group-wise methods also have some shortcomings. The set-based and vector-based methods ignore valuable information implicit in the semantics and relationships of terms. The graph-based methods are limited by the complexity of graph matching.
In general, existing methods may produce error-prone judgments about gene functional similarity. In our view, this primarily results from inappropriate computation of the information content (IC) of terms and unreasonable conversion from semantic similarity into functional similarity. For effective comparison of gene function, we design a novel method based on the Semantic Overlap Ratio of Annotations, namely SORA. Section 2 describes the details of our method, and the experimental results are shown and discussed in Section 3. Finally, Section 4 presents some concluding remarks.

METHODS
The process of measuring gene functional similarity by SORA is displayed in Figure 1. First, to quantify the semantics of the terms, SORA infers the IC of the terms from their location in the GO hierarchy, and the inherited and extended IC values of the terms are computed separately. Next, to capture the semantics of a term set, SORA calculates the IC of the term set by combining the inherited and extended IC values of its members. Finally, the functional similarity between two genes is computed from the IC values of their term sets by a simple reciprocal average method.
2.1 Measure the inherited and extended IC of terms
2.1.1 Related works
There are two approaches to computing the IC of a term: corpus-based and structure-based. Under the corpus-based approach, the IC of the term $t_i$ is defined as
$$IC(t_i) = -\log p(t_i) \qquad (1)$$
In Equation (1), $p(t_i)$ is the occurrence probability of $t_i$ and its descendants in the specified GO annotation (GOA) corpus. Consider a GOA corpus that includes 50 distinct annotated genes, of which 15 are annotated with term $t_i$ or its descendants: the IC of $t_i$ is $-\log_{10}(15/50) \approx 0.5229$. However, it becomes 0.3802 when annotation information about 10 additional genes annotated with $t_i$ is added to the GOA corpus. The IC of the same term therefore depends on the number of genes annotated with it. As argued by Guzzi et al. (2011), the semantics of GO terms should be independent of the annotation distribution. This approach suffers from corpus bias and may not reflect the semantics of the term objectively.
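The corpus dependence described above can be reproduced with a few lines of Python. This is a minimal sketch; the function name is an assumption, and the base-10 logarithm is chosen because it reproduces the two values quoted in the text.

```python
import math


def corpus_ic(n_annotated, n_genes):
    """Corpus-based IC: negative log10 of the annotation probability of a term."""
    return -math.log10(n_annotated / n_genes)


# 15 of 50 genes are annotated with the term or its descendants.
ic_before = corpus_ic(15, 50)
# Ten more genes annotated with the term are added: 25 of 60.
ic_after = corpus_ic(25, 60)

print(round(ic_before, 4), round(ic_after, 4))
```

The term itself has not changed between the two calls; only the corpus has, which is the bias the text refers to.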
Alternatively, the IC of a term can be computed from the number of its descendants in the GO structure (Seco et al., 2004). We refer to this approach as the structural IC approach. Under this approach, the IC of the term $t_i$ is defined as
$$IC(t_i) = 1 - \frac{\log(desc(t_i) + 1)}{\log(total\_terms)} \qquad (2)$$
where $desc(t_i)$ is the number of descendants of term $t_i$, and $total\_terms$ is the number of terms in GO. This measure produces a consistent IC for a term across different annotation corpora, which seems more reasonable than the corpus-based approach. However, a new problem is that the IC of terms with the same number of descendants is identical. Actually, the IC of these terms may not be entirely the same. Hence, Equation (2) is also unreasonable for measuring the IC of terms. Besides, some measures (Gentleman et al., 2005; Ye et al., 2005) consider the IC of a term to be proportional to its depth in the hierarchy, on the premise that the semantics of terms become finer as one descends the hierarchy. However, these approaches cannot distinguish between terms that are at the same level but differ in the number of descendants. Meanwhile, we noticed that some works focusing on the semantic distances of terms achieved their goals by exploiting the information contained in the GO hierarchy. For example, to measure the distance between linked terms, Jiang and Conrath (1998) weighted the edges along the shortest path linking the terms based on link density, term depth and the difference of their IC. Inspired by these works, we consider that the semantics of a term may be tightly related to its location in the GO hierarchy, which can be characterized by term depth (specificity) and the number of descendants (coverage). Accordingly, a novel approach is proposed to overcome the limitations of the aforementioned measures.
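The structural approach can be sketched as follows, assuming the intrinsic IC formulation of Seco et al. (2004), in which the IC is 1 minus the ratio of log(descendants + 1) to log(total terms). The function name and the ontology size used below are hypothetical.

```python
import math


def structural_ic(n_descendants, total_terms):
    """Intrinsic IC from the descendant count (assumed Seco-style formula)."""
    return 1.0 - math.log(n_descendants + 1) / math.log(total_terms)


TOTAL_TERMS = 1000  # hypothetical ontology size

# A leaf has no descendants and gets the maximum IC, whatever its depth.
ic_shallow_leaf = structural_ic(0, TOTAL_TERMS)
ic_deep_leaf = structural_ic(0, TOTAL_TERMS)

# The root subsumes every other term and gets an IC of about 0.
ic_root = structural_ic(TOTAL_TERMS - 1, TOTAL_TERMS)

# More descendants means a more generic term and a lower IC.
ic_generic = structural_ic(200, TOTAL_TERMS)
ic_specific = structural_ic(5, TOTAL_TERMS)
```

The two leaf values are equal by construction, which shows the limitation discussed in the text: depth plays no role in this measure, so terms with equal descendant counts cannot be told apart.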
2.1.2 Inherited and extended IC of the term
It is assumed that the IC of a term is proportional to its depth and inversely related to the number of its descendants, because the more descendants a term has, the less specific its semantics are. Therefore, the IC of the term is computed by Equation (3).
In Equation (3), the semantic specificity of term $t_i$, $Specificity(t_i)$, is computed from its depth in the GO hierarchy. When a term can be reached by several paths, its maximum depth is taken as its depth. The semantic coverage of term $t_i$, $Coverage(t_i)$, is measured by the number of its descendants in GO, as in Equation (2). Under this approach, terms at lower levels are more specific and have a larger IC, whereas terms with more descendants are more generic and have a smaller IC.
According to the true path rule of GO, if a gene is annotated with a term, it is also annotated with the ancestors of the term. That is to say, the semantics of an ancestor term are generalized from those of its descendants, and the latter are extended from the former. In light of this, the semantics of a term are divided into two parts: the inherited semantics, which are the same as the semantics of its ancestors, and the extended semantics, which are specific to the term itself. To measure the IC of a term set, the inherited IC and extended IC of each term, which represent the inherited and the extended semantics of the term, respectively, are computed. Suppose that the term $t_j$ is an ancestor of the term $t_i$. The inherited IC of $t_i$ from $t_j$ is equal to the IC of $t_j$, $IC(t_j)$. The extended IC of $t_i$ from $t_j$ is defined as
$$IC_{extended}(t_j \rightarrow t_i) = IC(t_i) - IC(t_j) \qquad (4)$$
Likewise, given the ancestor set of the term $t_i$, $AS(t_i)$, the inherited IC of $t_i$ from $AS(t_i)$ equals the IC of $AS(t_i)$, $IC(AS(t_i))$. The extended IC of $t_i$ from $AS(t_i)$, $IC_{extended}(AS(t_i) \rightarrow t_i)$, is
$$IC_{extended}(AS(t_i) \rightarrow t_i) = IC(t_i) - IC(AS(t_i)) \qquad (5)$$

MEASURE THE IC OF TERM SET BY COMBINING THE INHERITED AND THE EXTENDED IC OF ITS MEMBERS
A simple way to obtain the IC of a term set is to sum the IC of the terms in the set. For example, the IC of a term set $ts$ that contains just two terms, $t_1$ and $t_2$, would be the sum of $IC(t_1)$ and $IC(t_2)$. However, as discussed by Couto et al. (2005), terms may share IC because of the inheritance nature of GO. Taking the term set $ts$ again, if the term $t_c$ is a common ancestor of $t_1$ and $t_2$, they share the inherited IC from $t_c$, $IC(t_c)$, but differ in the extended IC from $t_c$. Accordingly, $IC(ts)$ should be less than $IC(t_1) + IC(t_2)$, because the IC shared by terms should not contribute cumulatively to the IC of the set. The more shared IC exists, the more the plain sum overstates the IC of the set. To overcome this limitation, it is necessary to remove the shared IC between the terms, which is otherwise summed repeatedly.
In fact, the calculation of the shared IC has already been proposed by GraSM (Couto et al., 2005) and DiShIn (Couto et al., 2011). These works focused on dealing with the shared IC when measuring semantic similarity between terms. GraSM defined the shared IC between terms as the average over their common disjunctive ancestors, whereas DiShIn redefined it as the average over all of their disjunctive ancestors. Both were shown to improve the performance of semantic similarity measures. However, in our opinion, the shared IC between terms can alternatively be measured by the IC of their common ancestor set. Similarly, the shared IC between two term sets can be measured by the IC of their intersection.
Subsequently, we put forward an algorithm for computing the IC of a term set, as illustrated in Figure 2, which combines the inherited and extended IC values of its members according to the structure of GO. To simplify the description of the algorithm, the following notation is used. Given a term set $X$, $CET(X)$ consists of the terms without descendants in $X$; $t_{extend}$ is the term used to extend the term set $X$ in each round and is selected from $CET(X)$; $ES_{extend}$ consists of $t_{extend}$ and its ancestors; $ES_i(X)$ is the extended term set of $X$ after the $i$th round of extension, and $IC_i(X)$ is the IC of $ES_i(X)$; $OTS_i$ is the overlapped term set between $ES_{extend}$ and $ES_i(X)$; $ES(X)$ is the final term set of $X$ after all extensions, and $IC(X)$ is the IC of the term set $X$.
The process of measuring the IC of a term set is demonstrated by the example shown in Figure 3. Gene Q9BPW9 is annotated with the manually assigned term set $X_g$ = {GO:0004022, GO:0004745, GO:0047035, GO:0016854} in the molecular function sub-ontology. The initial $CET(X_g)$ is {GO:0004022, GO:0004745, GO:0047035, GO:0016854}. The process of computing the IC of the term set $X_g$ includes several rounds, and each round consists of four main steps: (1) select $t_{extend}$ to extend $ES_i(X_g)$; (2) generate $ES_{extend}$ and $OTS_i$; (3) calculate $IC_{extended}(OTS_i \rightarrow t_{extend})$ and $IC_i(X_g)$; (4) update $CET(X_g)$ and $ES_i(X_g)$.
As displayed in Figure 3, each term is represented by an oval with a GO identifier and an IC value. In each round, the term $t_{extend}$ is denoted by an oval with an octagon. The terms of $ES_{extend}$ are marked by ovals with asterisks. The terms of $ES_i(X_g)$ are labeled with symbols such as $t_j$, $j \in N$, in circles. The overlapped terms between $ES_i(X_g)$ and $ES_{extend}$ are shown by ovals with both circles and asterisks.
In the first round, as shown in Figure 3a, GO:0047035 is selected as $t_{extend}$ to extend $ES_1(X_g)$. Because the initial $ES(X_g)$ is empty, $OTS_1$ is empty and $IC_1(X_g) = IC(t_{extend}) = 0.42857$ according to Equation (4). According to the true path rule, $X_g$ can also be annotated with the ancestors of the term $t_{extend}$. Therefore, the term $t_{extend}$ and all of its ancestors are added to $ES(X_g)$. Then GO:0047035 is removed from $CET(X_g)$. At the end of the round, $ES_1(X_g)$ and $CET(X_g)$ become {$t_1$, $t_2$, $t_3$, $t_4$, $t_5$, $t_6$, $t_7$} and {GO:0004745, GO:0004022, GO:0016854}, respectively.
In the second round, as illustrated in Figure 3b, GO:0004745 is selected as $t_{extend}$ to extend $ES_2(X_g)$. The overlapped terms between $ES_{extend}$ and $ES_1(X_g)$ are $t_1$, $t_2$, $t_3$, $t_4$ and $t_5$. To measure $IC_{extended}(OTS_2 \rightarrow t_{extend})$, it is necessary to measure $IC(OTS_2)$. Because $t_5$ is the only member of $CET(OTS_2)$, $IC(OTS_2) = IC(t_5)$, i.e. 0.10474. According to Equation (5), $IC_2(X_g)$ becomes 0.68097. Then, the terms of $ES_{extend}$ are added to $ES_1(X_g)$ and GO:0004745 is removed from $CET(X_g)$. At the end of the second round, $ES_2(X_g)$ and $CET(X_g)$ are {$t_1$, $t_2$, $t_3$, $t_4$, $t_5$, $t_6$, $t_7$, $t_8$} and {GO:0004022, GO:0016854}, respectively.
In the fourth round, as seen in Figure 3d, GO:0016854 is selected as $t_{extend}$ to extend $ES_4(X_g)$. The overlapped terms between $ES_{extend}$ and $ES_3(X_g)$ are $t_1$ and $t_2$. Because $t_2$ is a child of $t_1$, $IC(OTS_4) = IC(t_2)$, i.e. 0.00316. Thus, $IC_{extended}(OTS_4 \rightarrow t_{extend}) = 0.1152$ and $IC_4(X_g) = 0.98492$. Next, the terms of $ES_{extend}$ are added to $ES_3(X_g)$, and GO:0016854 is removed from $CET(X_g)$. At this point $CET(X_g)$ is empty; thus, the iteration is finished.
After the iteration is finished, $IC_4(X_g)$ and $ES_4(X_g)$ are returned as $IC(X_g)$ and $ES(X_g)$, respectively. As shown in Figure 3e, $IC(X_g)$ is 0.98492. The final $ES(X_g)$ is {$t_1$, $t_2$, $t_3$, $t_4$, $t_5$, $t_6$, $t_7$, $t_8$, $t_9$, $t_{10}$, $t_{11}$, $t_{12}$}, which is consistent with the true path rule of GO.
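The round-by-round procedure walked through above can be sketched in Python. This is an illustrative reading of the worked example, not the authors' implementation: the function names, the toy DAG and its IC values are invented, and the recursive evaluation of the overlap set is an assumption drawn from the way $IC(OTS_2)$ and $IC(OTS_4)$ are obtained in the example.

```python
def ancestors(term, parents):
    """All proper ancestors of `term`; `parents` maps each term to its parent set."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, ()))
    return seen


def term_set_ic(terms, parents, ic):
    """IC of a term set: add each term's IC minus the IC it shares with earlier rounds."""
    terms = set(terms)
    if not terms:
        return 0.0
    # CET: members of the set that have no descendant inside the set.
    covered = set()
    for term in terms:
        covered |= ancestors(term, parents) & terms
    cet = sorted(terms - covered)

    extended_set = set()  # ES_i: terms and ancestors collected so far
    total_ic = 0.0        # IC_i
    for t_extend in cet:
        es_extend = ancestors(t_extend, parents) | {t_extend}
        overlap = es_extend & extended_set  # OTS_i
        # Extended IC of t_extend beyond what is already covered.
        total_ic += ic[t_extend] - term_set_ic(overlap, parents, ic)
        extended_set |= es_extend
    return total_ic


# Hypothetical toy DAG: "e" has two parents, "b" and "c".
toy_parents = {
    "root": set(), "a": {"root"}, "b": {"a"},
    "c": {"a"}, "d": {"b"}, "e": {"b", "c"},
}
toy_ic = {"root": 0.0, "a": 0.1, "b": 0.3, "c": 0.25, "d": 0.6, "e": 0.7}

ic_d_and_e = term_set_ic({"d", "e"}, toy_parents, toy_ic)  # 0.6 + (0.7 - 0.3)
ic_b_and_d = term_set_ic({"b", "d"}, toy_parents, toy_ic)  # "b" is implied by "d"
```

In the toy example, terms "d" and "e" share the ancestor "b", so the set IC is about 1.0 rather than the plain sum of 1.3. The recursion terminates because every overlap set contains only proper ancestors of the term being added.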
Besides, we find that the key terms whose IC represents the shared IC in our strategy, such as $t_2$ and $t_5$, coincide with the common disjunctive ancestors of the terms in the set, such as $t_8$, $t_9$, $t_{10}$ and $t_{12}$ in Figure 3. From this point of view, the IC of a term set can alternatively be obtained by summing the IC of the terms and removing the repeatedly summed IC of their common disjunctive ancestors.

MEASURE THE FUNCTIONAL SIMILARITY BETWEEN GENES
To compute gene functional similarity, set-based methods usually make use of Tversky's ratio model or its variants. Assuming that genes $G_A$ and $G_B$ are annotated with term sets $T_A = \{t_1, t_2, \ldots, t_m\}$ and $T_B = \{t_1, t_2, \ldots, t_n\}$, respectively, simUI (Gentleman et al., 2005) defines the functional similarity between $G_A$ and $G_B$ as follows:
$$simUI(G_A, G_B) = \frac{|T_A \cap T_B|}{|T_A \cup T_B|} \qquad (6)$$
where $|\cdot|$ is the number of terms in the specified set. This method neglects the differences between terms; simGIC (Pesquita et al., 2008) improves on simUI by using the IC of the terms. In simGIC, the functional similarity between $G_A$ and $G_B$ is
$$simGIC(G_A, G_B) = \frac{\sum_{t \in T_A \cap T_B} f(t)}{\sum_{t \in T_A \cup T_B} f(t)} \qquad (7)$$
where $f(\cdot)$ is the IC of the term. However, the shared IC of the terms is also summed repeatedly under this method. In fact, repeated summing of the shared IC is common in set-based methods and may also result in misjudgments of gene functional similarity.
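The two set-based measures discussed above can be sketched in Python as a plain Jaccard index and an IC-weighted Jaccard index. The term names and IC values are hypothetical, and the term sets are assumed to already contain the ancestors of the annotated terms.

```python
def sim_ui(terms_a, terms_b):
    """simUI: number of shared terms divided by number of terms in the union."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_gic(terms_a, terms_b, ic):
    """simGIC: summed IC of shared terms divided by summed IC of the union."""
    a, b = set(terms_a), set(terms_b)
    union_ic = sum(ic[t] for t in a | b)
    return sum(ic[t] for t in a & b) / union_ic if union_ic else 0.0


# Hypothetical annotations: the genes share only two generic terms.
toy_ic = {"root": 0.0, "x": 0.2, "y": 0.8, "z": 0.8}
set_a = {"root", "x", "y"}
set_b = {"root", "x", "z"}

ui_score = sim_ui(set_a, set_b)            # 2 / 4
gic_score = sim_gic(set_a, set_b, toy_ic)  # 0.2 / 1.8
```

simUI treats the generic shared terms as equally important as the specific unshared ones, whereas simGIC down-weights them. Both simply sum over set members, so IC inherited along shared ancestors is counted once per term that carries it.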
Inspired by Chen et al. (2012), the functional similarity between two genes is defined as the IC overlap ratio (ICOR) between their term sets, as given in Equation (8). The GOAs of genes are known to be incomplete at present and to suffer from a large research bias (Wang et al., 2010; Yang et al., 2012).
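A minimal Python sketch of the overlap ratio follows. It assumes that the reciprocal average means the arithmetic mean of two ratios, the shared IC over the IC of each gene's term set; the function name and the input values are hypothetical, and the three IC values are assumed to have been computed beforehand for the two term sets and their intersection.

```python
def icor(ic_a, ic_b, ic_shared):
    """Mean of the shared-IC proportions seen from each gene (assumed form)."""
    if ic_a == 0 or ic_b == 0:
        return 0.0  # a term set without information cannot overlap with anything
    return 0.5 * (ic_shared / ic_a + ic_shared / ic_b)


# A richly annotated gene (IC 1.0) against a shallowly annotated one (IC 0.5),
# sharing an IC of 0.25.
score = icor(1.0, 0.5, 0.25)      # 0.5 * (0.25 + 0.5)

# Identical term sets share all of their IC.
identical = icor(0.8, 0.8, 0.8)
```

Averaging the two proportions lets the shallowly annotated gene, for which the shared IC is a large fraction of its total, pull the score up relative to a ratio taken over the union.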
To reduce the effects of annotation bias and incompleteness, a simple reciprocal average method is used to strike a balance between shallowly annotated and well annotated genes. In Equation (8), the first term on the right-hand side reflects the proportion of the shared IC between $T_A$ and $T_B$ to the IC of $T_A$, and the second term reflects the proportion of the shared IC between $T_A$ and $T_B$ to the IC of $T_B$. The shared IC between the term sets is measured by the IC of their intersection, $IC(T_A \cap T_B)$. To avoid repeated summing of shared IC, the IC values of the term sets $T_A$, $T_B$ and $T_A \cap T_B$ are computed by the algorithm described in Figure 2.

RESULTS AND DISCUSSION
To validate the performance of our method, SORA was implemented, and its web service is available at http://nclab.hit.edu.cn/SORA/. SORA is evaluated on a widely used platform for Collaborative Evaluation of GO-based Semantic Similarity Measures (CESSM) (Pesquita et al., 2009b). The task is to measure the functional similarity of 13 430 protein pairs, involving 1039 proteins, using the GO database and GOA released in August 2008. According to their sources, terms in the GOA are classified as electronically assigned terms (E-terms) and manually assigned terms (M-terms). E-terms are inferred from electronic annotations, whereas M-terms are inferred from experiments, computational analysis, author statements and curatorial statements. Because GO aspects and electronic annotations may influence the performance of the methods, validation experiments are conducted on six GOAs: AMF, ABP, ACC, MMF, MBP and MCC. The details of the six experimental GOAs are listed in Table 1.
As performance criteria, CESSM provides the Pearson correlations with sequence similarity (Seq), protein family similarity (Pfam) and enzyme commission classification similarity (ECC), as well as resolution (Res), to evaluate measures. The sequence similarity of two proteins is computed by dividing the sum of their reciprocal BLAST bit scores by the sum of their self-BLAST bit scores. The Pfam similarity between two proteins is the ratio between the number of domains they share and the total number of domains they have. ECC similarity is measured by the number of digits of the enzyme commission number shared by the proteins. Larger Pearson correlations with these similarities suggest that the semantic similarities better reflect the functional closeness of proteins. Resolution is the relative intensity with which values on the sequence similarity scale are translated into semantic similarity (Pesquita et al., 2008). A higher resolution indicates that the method is more sensitive to differences in annotations. It is noteworthy that, as reported by Pesquita et al. (2008), the relationship between semantic and sequence similarity is not linear, and resolution was found to be more appropriate than correlation for depicting the intrinsic relationship between them.
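Two of the reference similarities described above are simple ratios and can be sketched in Python. The function names, bit scores and domain labels are hypothetical; the sketch follows the verbal definitions in the text rather than the CESSM source code.

```python
def sequence_similarity(bits_ab, bits_ba, bits_aa, bits_bb):
    """Sum of reciprocal BLAST bit scores over the sum of self-BLAST bit scores."""
    return (bits_ab + bits_ba) / (bits_aa + bits_bb)


def pfam_similarity(domains_a, domains_b):
    """Shared domains over all domains found in either protein."""
    a, b = set(domains_a), set(domains_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical bit scores: A vs B, B vs A, A vs A, B vs B.
seq_score = sequence_similarity(50.0, 50.0, 100.0, 100.0)

# Hypothetical domain labels.
pfam_score = pfam_similarity({"dom1", "dom2"}, {"dom2", "dom3"})
```

Normalizing by the self-BLAST scores keeps the sequence similarity between 0 and 1 for typical inputs, so it can be correlated directly with a semantic similarity on the same scale.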
To evaluate the impact of the term IC, we measure the functional similarities of the protein pairs specified by CESSM using a method based on the structural IC and a method based on the term IC computed by our strategy (called SORA IC). The two approaches are evaluated on CESSM, and the results are displayed in Table 2. The method based on SORA IC performs consistently better than the other with respect to Seq, Pfam and ECC in the experiments. However, its performance on Res is not as good as that of the method based on structural IC in some cases. This suggests that the differences among SORA IC values may not be as pronounced as those among structural IC values, but that the former reflect reality better than the latter in terms of the other metrics. On the whole, SORA IC has a more positive impact on the functional comparison of proteins.
To validate the effects of the converting strategy, we convert semantic similarity into functional similarity using Jaccard and ICOR, respectively. The functional similarity scores measured with the two converting strategies are compared on CESSM.
As listed in Table 3, the method with ICOR achieves higher Res and ECC, whereas it is comparable with the other on Pfam in most experiments. On all of the experimental datasets, the scores computed by ICOR show lower correlation with sequence similarity. This may indicate that the distribution of the scores converted by Jaccard matches that of sequence similarity better than the distribution of the scores converted by ICOR. According to Res, the scores derived by Jaccard are less able to capture the differences in the annotations of the proteins than those derived by our strategy. Overall, the results indicate that ICOR is more discriminating for gene functional comparison.
To evaluate the effectiveness of our method, SORA is run on the six experimental GOAs separately. After every experiment, the functional similarities of the 13 430 protein pairs computed by SORA are compared with those of other methods on CESSM. CESSM enables the comparison of new methods against 11 pairwise and group-wise functional similarity methods. SORA is compared against typical methods among them, including simUI and simGIC as well as Resnik's (RB), Lin's (LB) and Jiang and Conrath's (JB) measures combined with BMA. Table 4 shows the Seq, Res, Pfam and ECC values obtained by the different methods, the average over the methods and the improvement of each method over the respective average level. The negative values, marked with '#' in Table 4, indicate that the method is below the average level with respect to the specific metric.
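One plausible way to compute the improvement over the average level is the relative deviation of each method's score from the mean over all compared methods, expressed in percent. The following Python sketch uses that assumption; the function name, the method labels and the scores are hypothetical, and the exact definition used for Table 4 is not stated in the text.

```python
def improvement_over_average(scores):
    """Percent deviation of each method's score from the mean over all methods."""
    mean = sum(scores.values()) / len(scores)
    return {method: (value - mean) / mean * 100.0
            for method, value in scores.items()}


# Hypothetical scores of three methods on one metric in one experiment.
toy_scores = {"method_a": 0.5, "method_b": 0.4, "method_c": 0.6}
improvements = improvement_over_average(toy_scores)

# Methods with a negative value fall below the average level.
below_average = sorted(m for m, v in improvements.items() if v < 0)
```

Under this reading, a negative value corresponds to the entries marked with '#' in Table 4.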
As for Seq, simGIC shows consistently better performance than the others on the six experimental datasets, whereas SORA is
To evaluate SORA against each metric, the average improvements over the six experiments are calculated and shown in Table 5. Regarding Seq, simGIC is the best by 12%, and SORA has a positive effect, whereas some of the other methods have a negative effect. As for Res and ECC, SORA shows the best performance, with improvements of 15.37 and 8.82% over the average level, respectively. In terms of Pfam, SORA achieves a significant improvement and performs comparably with the best method, simGIC. These results show that SORA improves the performance of gene functional comparison.
Furthermore, to provide an intuitive measure of relative performance, we summarize the comparison results by ranking the performance of the methods concerned in the six experiments. We define the ranking of a given method $m_i$ with respect to a performance metric $p_j$ in a specific experiment $E$ as $rank(m_i, p_j, E)$. As these methods are compared on the same task, the comprehensive ranking of $m_i$, $RS(m_i)$, is measured by
$$RS(m_i) = \sum_{E} \sum_{j} rank(m_i, p_j, E) \qquad (9)$$
Sorting $RS(m_i)$ in increasing order gives the final ranking of the methods concerned. The rankings of the different methods are listed in Table 6. SORA is at the top of the list, with the smallest comprehensive ranking of 54. The second is simGIC, and RB is the third. Overall, SORA obtains better results than the other methods. The structure of GO contributes greatly to its success, as it implies expressive information about gene function. Further analysis indicates that the group-wise methods show better overall performance than the pairwise methods. This may be related to the way semantic similarity is converted into gene functional similarity. The pairwise methods combine the semantic similarity of terms into gene functional similarity with the help of BMA. The group-wise methods take the semantic similarity between the term sets as gene functional similarity in a single step. The conversion in the latter may be closer to reality than that in the former.
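The comprehensive ranking can be sketched in Python under the assumption that it is the sum of a method's per-metric ranks over all experiments, with rank 1 given to the highest score. The function name, the nested dictionary layout and the scores are hypothetical, and ties are not treated specially.

```python
def comprehensive_ranking(results):
    """Sum of ranks per method; results[experiment][metric][method] = score."""
    totals = {}
    for metrics in results.values():
        for scores in metrics.values():
            # Rank 1 goes to the highest score on this metric.
            ordered = sorted(scores, key=scores.get, reverse=True)
            for position, method in enumerate(ordered, start=1):
                totals[method] = totals.get(method, 0) + position
    return totals


# Hypothetical scores for two methods, two experiments and up to two metrics.
toy_results = {
    "E1": {"Seq": {"A": 0.9, "B": 0.5}, "Res": {"A": 0.4, "B": 0.8}},
    "E2": {"Seq": {"A": 0.7, "B": 0.6}},
}
totals = comprehensive_ranking(toy_results)

# The method with the smallest total comes first in the final ranking.
final_order = sorted(totals, key=totals.get)
```

With six experiments and four metrics, a method that ranked first everywhere would have a total of 24, which gives a sense of scale for the value of 54 reported for SORA.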

CONCLUSION
In this article, we put forward a novel method, namely SORA, to measure gene functional similarity. It was evaluated against typical pairwise and group-wise methods on CESSM. According to the experimental results, SORA is a more effective and reliable way to estimate gene functional similarity than the other tested methods. The success of SORA may be related to the following characteristics.
First, SORA makes use of semantic specificity and coverage to measure the IC of a term. The term IC is determined by the location of the term in the GO hierarchy rather than by the number of proteins annotated with it. Thus, it overcomes the limitation of GOA corpus bias, which heavily affects the corpus-based approach. With the help of both semantic specificity and coverage, our strategy can reflect the differences in the semantics of terms more objectively than the structural IC.
Second, SORA computes the IC of an annotating term set by combining the inherited and extended IC of the terms based on the structure of GO. It effectively avoids repeated summing of the shared IC of terms, which is the key to estimating the IC of a term set correctly.
Third, SORA uses a simple reciprocal ICOR between the term sets as gene functional similarity, which is an appropriate description of the functional relationship between genes. As discussed before, SORA measures semantic similarity in a single step, regardless of the number of annotations per protein, which pairwise approaches must take into account when combining the similarities of term pairs. This strategy has a positive impact on gene function comparison.
Moreover, according to the results of our experiments, all of the methods performed better with E-terms than without. We consider that E-terms may sometimes provide new knowledge about protein function that has not yet been confirmed by manual means. High-quality computational inference of annotations would promote gene function comparison, which is one of our future interests.
The best results are in bold.

Fig. 2. Algorithm for measuring the IC of the term set

Fig. 3. The process of measuring the IC of the term set. Each term is represented by an oval node with a GO identifier and an IC value. In each round, the term $t_{extend}$ is denoted by an oval with an octagon. The terms of $ES_{extend}$ are marked by ovals with asterisks. The terms of $ES_i(X_g)$ are labeled with symbols such as $t_j$, $j \in N$, in yellow circles. Overlapped terms between $ES_i(X_g)$ and $ES_{extend}$ are shown by ovals with both circles and asterisks. The process includes four rounds corresponding to (a-d), respectively. The final $ES(X_g)$ and $IC(X_g)$ are shown in (e).

Table 2. The impacts of the term IC

Table 1. Descriptions of the six experimental GOAs

the average level. Regarding Res and ECC, SORA outperforms the others in most cases. When run on AMF, SORA is the best, with improvements over the average level of Res and ECC of 25.47 and 8%, respectively. When applied to MMF, SORA shows significant improvements over the average level of Res and ECC, by 13.85 and 16.98%, respectively. Relative to the average levels of Res and ECC, SORA improves them by 14.04 and 10.86% on ABP and by 11.81 and 5.14% on MBP. SORA running on the terms of the CC sub-ontology is the best. Regarding Pfam, SORA is comparable with the best method and shows significant improvements over the average level of Pfam. Moreover, SORA outperforms the average level of these methods in terms of almost all of the metrics in the experiments. These results show that SORA stands out among the compared methods in measuring gene functional similarity.

Table 4. The performances of different methods in six experiments. Original values show the Seq, Res, Pfam and ECC provided by CESSM. Average values present the average level on each metric. Improvements in the average level (%) display the improvement over the average level with respect to each metric. The symbol '#' denotes that the method is below the average level in terms of the specific metric. The best levels of each metric are in bold.

Table 3. The effects of the converting strategies

Table 6. The rankings of the concerned methods

Table 5. Performances of different methods in terms of the metrics