Monitoring changes in the Gene Ontology and their impact on genomic data analysis

Abstract Background The Gene Ontology (GO) is one of the most widely used resources in molecular and cellular biology, largely through the use of "enrichment analysis." To facilitate informed use of GO, we present GOtrack (https://gotrack.msl.ubc.ca), which provides access to historical records and trends in the GO and GO annotations. Findings GOtrack gives users access to gene- and term-level information on annotations for nine model organisms as well as an interactive tool that measures the stability of enrichment results over time for user-provided "hit lists" of genes. To document the effects of GO evolution on enrichment, we analyzed more than 2,500 published hit lists of human genes (most more than 9 years old); 53% of hit lists were considered to yield significantly stable enrichment results. Conclusions Because stability is far from assured for any individual hit list, GOtrack can lead to more informed and cautious application of GO to genomics research.

Reviewer #2: The use of Gene Ontology enrichment is widespread across biological sciences to investigate function or processes associated with particular conditions. This manuscript describes a resource called GOTrack that tracks changes in the Gene Ontology between releases over a number of years and enables users to investigate how the results of their enrichment analysis alter depending on the dataset that is used. This appears to be a resource that will be useful to many, including those working regularly with GO/GOA (e.g. those developing methods including for protein function prediction) and also more broadly researchers who want to identify the changes that occur in their system when they have different settings/conditions. Response: We thank the reviewer for the positive comments.
Reviewer: The manuscript shows that annotations can vary considerably over time and further that the results of enrichment analyses can also change as a result. This appears to be interesting work but I also wonder if some of it is simply obvious. It is well established that Gene Ontology annotations are far from complete; I think less than 1% of proteins in UniProt have been assigned annotations with experimental evidence codes. Clearly there is much that is not known about protein function and therefore the Gene Ontology would be expected to develop over time. Some of this comes across in the discussion but I feel it would be better if the scene was set in the introduction.
Response: While it is obvious to us and the reviewers that GO is incomplete and changes, and we cited the previous literature on the topic, the reviewer might be surprised at how many biologists don't know, and Reviewer 1 (a GO insider) concurs that this needs wider attention. But our main point isn't that GO changes over time, but the difficulty in evaluating the changes and their impact. We provide a means to do so as well as a new evaluation of the impact of changes. To help make this clearer we have slightly reorganized and rewritten the first paragraphs of the introduction.
Reviewer: The manuscript shows that the GOTrack tool can be used to assess if enrichment analyses are stable over time; surely what researchers are interested in is whether the enrichment analysis that they perform is correct with the latest data set (which would be expected to contain the latest annotations)? It strikes me that if many new annotations are added in a particular release, use of previous release data may give different results, and I would want to know if I can be confident based on the data that has been used rather than older data. It also seems that it is one of those problems where there will never be a correct answer unless all genes are fully annotated, and this is unlikely to happen.
Response: The reviewer is right, that there will never be a correct answer, and we believe our work will help others realize this and use enrichment accordingly as an exploratory method.
Reviewer: The manuscript states that it is difficult to predict which GO terms will remain enriched in future datasets, and this seems to me to be the most relevant or useful potential feature; looking to older datasets is ok but how useful is this?
Response: We concede that it remains to be seen how valuable the GOtrack tool proves to be, but we don't believe having predictions of future changes in GO/GOA are necessary for the enrichment tracking tool to have utility (and the reviewer's opening remarks seem to concur). Use cases (described in the manuscript) include checking the validity of past results, or to bolster confidence in new results (that they aren't merely a fluke of today's GO annotations).
Reviewer: In one of the examples (for ACTC1) the manuscript refers to IEA annotation (Inferred from Electronic Annotation). My understanding is that these are low confidence and variable; should these really be used?
Response: We agree that it would be useful to be able to better separate annotations by evidence code, and this is high on the list of features for a future release. In any case, other work has indicated that IEA annotations are not necessarily less reliable (Skunca et al., https://www.ncbi.nlm.nih.gov/pubmed/22693439). Whether they should be used or not may be a matter of opinion, but the fact is that many people do use IEA annotations.

Background
The Gene Ontology (GO) has been widely adopted by computational and experimental biologists and Gene Ontology annotation (GOA) of genes is one of the most prominent descriptive features of major genome databases. The original paper describing GO [1] is among the most cited papers in the biomedical literature (over 14,000 citations, Clarivate Analytics Web of Science, accessed January 2018). The popularity of GO is in large part due to the challenge of interpreting data generated from high-throughput technologies such as gene expression profiling.
In a typical simple setting, researchers contrast a genome-wide feature (e.g., gene expression levels or genetic association) between two experimental conditions and generate a list of genes, either ranked across the whole genome or in the form of a "hit list" of selected candidates. Another way such lists can be generated is by clustering. Whatever their origin, it is now standard practice to interpret such lists using GO annotations in an "enrichment" framework.
The widespread use of these methods suggests it is important that users understand their underpinnings.
However, despite the importance of GO, many users likely have little understanding of how it is developed, despite some effort on the part of the GO Consortium (GOC) to disseminate such information [2][3][4]. An important feature of GO is that it changes over time, as curation is performed. This has potentially major implications for the utility and interpretation of GO/GOA, but there is currently no means for users of GO to easily see this for themselves. Our goal in this paper is to help fill this gap and provide some insight into the actual impact of changes on data analysis.
The structure, content and curation of GO/GOA is the essential backdrop for the work we present so we review it briefly. It is important to distinguish the GO itself (the ontology) from the annotations (GOA), which connect genes to terms in GO. GO is organized into three sub-ontologies, representing Biological Processes, Molecular Functions and Cellular Components. Collectively these currently encompass over 40,000 concepts, arranged in a directed acyclic graph (like a tree, but with the potential for multiple paths from any leaf to the root).
Curation is managed through the Gene Ontology Consortium, in which member organizations such as model organism database curation teams provide annotations to a central repository. Genes may be associated with terms in the ontology using either manual curation (associated with a specific reference to the literature or based on a computational analysis reviewed by a curator) or "automatic" annotations that are not reviewed by curators.
The different types of associations are represented by evidence codes, for example the automatic annotations receive the code "IEA" ("Inferred from electronic annotation").
Annotations created by the curation process are referred to as "direct annotations" because they explicitly associate a GO term with a gene. Genes are also associated with terms indirectly via the graph structure of GO, referred to as inference. Thus, a gene that is directly annotated with the term "protein tyrosine kinase" is also implicitly annotated with the term "protein kinase" because that term is a parent term of "protein tyrosine kinase".
When the operation of propagating direct annotations through the GO hierarchies is completed ("transitive closure" in graph theory terminology), the number of annotations available greatly increases, albeit at a range of granularities. These "indirect annotations" (also referred to as "inferred" or "propagated") are as valid as direct annotations because GO enforces a "true path" rule [5]. In most analyses, it is important to use propagated annotations (the combination of direct and inferred annotations) [6].
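The propagation step described above can be sketched as a transitive closure over the parent relation. The following is a minimal illustration, using term names in place of GO identifiers; the toy ontology and the gene symbol are invented for the example:

```python
def propagate_annotations(direct, parents):
    """Propagate direct GO annotations up the ontology graph.

    direct:  dict mapping gene -> set of directly annotated terms
    parents: dict mapping term -> set of parent terms (the DAG edges)
    Returns a dict mapping gene -> direct plus inferred terms.
    Assumes the graph is acyclic, as GO is.
    """
    ancestors = {}  # memoized transitive closure of the parent relation

    def term_ancestors(term):
        if term not in ancestors:
            found = set()
            for p in parents.get(term, ()):
                found.add(p)
                found |= term_ancestors(p)
            ancestors[term] = found
        return ancestors[term]

    return {
        gene: set(terms).union(*(term_ancestors(t) for t in terms))
        for gene, terms in direct.items()
    }

# Toy ontology: "protein tyrosine kinase" is_a "protein kinase" is_a "kinase".
parents = {
    "protein tyrosine kinase": {"protein kinase"},
    "protein kinase": {"kinase"},
}
direct = {"SRC": {"protein tyrosine kinase"}}
propagated = propagate_annotations(direct, parents)
```

Because inference is deterministic given the ontology, propagated annotations change whenever either the direct annotations or the graph itself changes, which is one reason enrichment results can drift over time.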
Assessments of GO/GOA have recently turned to considerations of changes over time. For example, we quantified the effect that annotation changes have on the apparent (annotated) function of genes, showing that on average changes over short periods (months) are minor, but changes over longer periods are much more substantial [7].
This and other work has shown that GO enrichment results may not be stable over time. However, the effects of changes are not likely to be uniform across data sets, nor easily predictable. Indeed, previous studies have either been anecdotal (considering a single or just a few examples [8][9][10][11], with the largest study analyzing around 100 [12]) or have yielded mixed findings. Groß et al. (2012) found that enrichment results were stable, based on analysis of two hit lists. Alam-Faruque et al. considered changes in results to be improvements due to focused curation, based on analysis of two data sets. Others have emphasized instability [11,12] or reported mixed impacts [9]. Given this variety of results and interpretations, there is clearly a need for researchers to assess the stability of their own specific enrichment results.
Here we report the development and application of a database (GOtrack) that contains historical information on GO going back to the early 2000s for human and major model organisms. The GOtrack web site enables quick exploration of GO and GO annotations over time, and evaluation of how changes impact interpretation of analyses derived from GOA. Using the data in GOtrack, we present several analyses of trends in GO annotations, complementing earlier work. We performed a large-scale analysis of enrichment results over time, using a corpus of over 2500 "hit lists". We confirm that GO enrichment analysis results can change over time.
While the web interface is the most complete and detailed way to interact with the data, we also offer a RESTful API to enable programmatic access to the data. Via this API, users can download GO annotations for a taxon, as well as GO, for any selected point in time. GOtrack does not contain all information on GO/GOA and thus complements other resources such as QuickGO [13] and AmiGO [14].
The GOtrack web interface offers views of history at the gene level, and at the GO term level. A third view provides a "global overview" of trends according to a variety of statistics. Finally, we offer a web tool to track changes in GO enrichment results over time. In this paper we provide only a high-level overview of basic functionality and readers are invited to explore the web interface for more information. Figure 2A shows an example of one type of data offered in the gene view, for the human gene GRIN1 (glutamate ionotropic receptor NMDA type subunit 1; [15] and Supplementary Figure 1A). The plot shows the number of GO terms directly annotated to the gene, with the mean of all genes from the same organism plotted for comparison.
GRIN1 is consistently more highly annotated than the average, and its trajectory is typical in that annotations rise over time, interrupted by drops and recoveries. In general, such changes can be due either to annotation curation (addition or removal of terms annotated to the gene) or to changes in the structure or content of the GO itself, such as addition of terms or relations. The GOtrack interface also allows users to inspect changes in the evidence codes used to support an annotation, and to directly compare annotations for a gene at up to four time points.
Jacobson et al.

To help users interpret the changes in number of terms over time, we provide additional plots of statistics derived from the annotations. The first of these is multifunctionality [16,17], which is related to the number of terms annotated to a gene, with a weighting to account for term specificity (where specificity is defined by how many genes are annotated with the term; see [17] for details). This more precisely captures how heavily annotated the gene is relative to other genes. The second derived statistic is semantic similarity. As time passes, changes in annotations can cause a gene to change "functional identity" [7]. To quantify this effect, we plot the Jaccard index between the annotations in the current edition and those in each previous edition. These and other plots and tables are presented on the web page for each gene.
The term-level view provides information on how a GO term has changed over time. This includes how many genes were annotated to it, either in total (Figure 2B) or broken down by evidence type (for example [18] and Supplementary Figure 1B), as well as changes in the GO structure that affect the term's relationships. Finally, the Global Trends page [19] shows species-level summaries of the numbers of annotated genes, genes annotated per term, annotations per gene, and the size of GO itself.

Long-term trends in GOA
In this section we present some analysis of the data in GOtrack, focusing on annotations (rather than GO itself).
As noted, genes vary strongly in how highly annotated they are, due to varying degrees of experimental and curation attention paid to the gene as well as potentially true biological differences in multifunctionality [17]. We previously reported that this bias tends to persist, that is, genes which are relatively highly annotated tend to stay that way [7]. We confirmed this is still the case five years later. For example, if we rank genes by how many direct annotations they have, the ranking at the earliest time point is correlated with the ranking at the latest time point (human: Spearman rank correlation 0.52; mouse 0.43; Arabidopsis 0.53). Thus we confirm that genes are not just unequal in their annotation, but that this inequality is stable over long periods.
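The rank-stability check above amounts to a Spearman correlation between per-gene direct-annotation counts at two editions. A minimal pure-Python sketch follows; the gene symbols and counts are invented for illustration (real values come from GOtrack's per-gene summary files):

```python
def ranks(values):
    """Rank values 1..n (assumes no ties, adequate for a sketch)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = float(rank)
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented per-gene direct-annotation counts at the earliest and
# latest editions.
early = {"TP53": 40, "GRIN1": 25, "OR2T1": 1, "ACTC1": 12}
late = {"TP53": 95, "GRIN1": 60, "OR2T1": 3, "ACTC1": 20}
genes = sorted(set(early) & set(late))  # genes present at both time points
rho = spearman([early[g] for g in genes], [late[g] for g in genes])
```

In this toy data the ranking is perfectly preserved (rho = 1.0); the values reported in the text (0.4-0.5) indicate strong but imperfect persistence.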

[…] (Figure 3B), direct annotations per gene (Figure 3C) and inferred annotations per gene (Figure 3D). This reveals that large jumps and drops are sometimes observed simultaneously in multiple, or even all, species. One notable event was a rapid increase in the number of annotated genes starting March 2011 for Arabidopsis, mouse and zebrafish (Figure 3A). Another dramatic event was a large drop in the mean number of direct annotations per gene in March 2012 for all species (Figure 3C). This drop is not visible in the plots for indirect annotations (Figure 3D), which would be consistent with a large-scale purging of redundant annotations (rejecting higher-level terms that are inferable from more specific terms). Other jumps are species-specific, such as the large increase in […] (Figure 3A); one of the terms deleted during the dip was "apoptosis" (GO:0006915). Viewing the annotation history for that term on the gene, we see that the term was […] (Figure 3B).
The enrichment tool has some limitations: we use a simple over-representation method (as do many tools, including the popular DAVID [22]), and the "background" set of genes is not settable by the user: it is the set of all genes annotated at the particular time point. But because GOtrack provides downloads of GO and GOA for any date, users can confirm findings with software of their choice, provided it allows user-provided GO and GOA as inputs (such as ErmineJ [16], whose annotation input format is directly supported).

[…] date for comparison because we expected the CP set to have greater stability, so comparing to the earliest date is the "worst case scenario" for the comparison with the experimentally derived CGP sets.
Our first key observation is that on average, for the CGP hit lists, the number of significant terms goes up dramatically (from 21±32 terms to 110±136 terms, mean ± standard deviation; p < 10^-15, Wilcoxon rank sum test).
The values are highly correlated (Figure 4A): hit lists that had few significant terms at the time of publication (henceforth t0) had relatively few at the most recent time point (tnow) (rank correlation 0.54). These results also held for the Canonical Pathways (growing from 37±59 to 246±216 terms, correlation 0.57). It is likely that these increases are due not just to increased annotation but also to the growth of GO to over 47,000 terms of increasing granularity.
The explosion in the number of significant terms is an obvious form of instability, but of course what matters more is whether the enriched terms resemble each other at tnow compared to t0. To evaluate this, we did direct comparisons of the enriched terms associated with each hit list (at t0 and tnow), using the Jaccard index (see Methods and Supplement). The Jaccard index was calibrated using a null distribution created by comparing pairs of unrelated hit lists (see Methods). To simplify the analysis, we binned the CGP hit lists by age into three groups of similar numbers of hit lists: up to 10 years, 10-12 years, and 12-16 years.
The results are shown in Figure 4B. Overall, 53% of the CGP hit lists had results which were more similar than 95% of the null trials. This fraction is much higher for relatively recent lists (71%, N=640) and lower for the older lists (55% for the middle tranche, N=960; and 38% for the oldest, N=973; Figure 4B). In comparison, 75% of the Canonical Pathways remained above this threshold, despite most of the comparisons being done to the earliest possible time point. The overall rank correlation (unbinned) between stability and age is -0.34 (CGP; -0.39 for Canonical Pathways). This demonstrates that it is possible for results to maintain a substantial degree of similarity over periods of greater than 15 years, but that in general, drift in the semantic content of enrichment results is very substantial after 12-16 years and is substantial but less striking at shorter time spans (<10 years).
A notable feature of the data shown in Figure 4B is that very low values of the complete Jaccard index were statistically significant. This shows the importance of using a null distribution to calibrate the scores, but clearly leaves something to be desired, as a Jaccard index of 0.01 seems negligible. However, this effect is due in large part to the increase in the number of terms over time (Figure 4A), which guarantees that the Jaccard index will drop. To explore this further, we tested six variants of the Jaccard index (see Supplement).
We looked for factors that might contribute to stability. For the CGP hit lists, the number of genes in a hit list was not strongly predictive of Jaccard stability (rank correlation 0.18).
Stability was only modestly correlated with the mean number of directly annotated terms (-0.12) or the mean multifunctionality of the genes in the hit list (-0.09). There were more obvious trends for the Canonical Pathways lists, which have higher stability than the CGP lists on average, despite the (artificially) long time elapsed between t0 and tnow (over 12 years; Figure 4B). The number of direct annotations per CP is higher (36 vs. 25.4 for CGP). However, this does not appear to explain the overall higher stability of the CP lists, because we get the same result for the subset of CPs with <35 mean direct annotations (mean of 22.9; correlation -0.48; overall correlation -0.46). Thus hit lists containing more highly annotated genes tend to be less stable. But given these low correlations (-0.12 for the CGP set) and without further insight, it appears to be difficult to predict (even in hindsight) which hit lists will yield stable results.
[…] result stability. Our analyses further highlight the necessity for users of GO/GOA to be cautious in their interpretation of any GO annotation, and to temper whatever trust they place in GO enrichment results.
Our evaluation of the stability of enrichment results differs in several important ways from earlier efforts. First, we matched GO and GOA for each time point (rather than fixing either GO or GOA while varying the other), which we feel is more realistic. We also analyzed a much larger number of hit lists (>2500 vs. a maximum of ~100 [12]) and considered time of publication to ensure comparisons were realistic. An obvious limitation is that GOtrack cannot see into the future. While the stability of any particular GO enrichment result might be high or low when looking backwards in time, it is generally impossible to know whether it will remain stable, because knowledge of biology as represented in GO/GOA is a work in progress. Indeed, we found it is difficult to predict which hit lists will give stable results. The strongest clue we could identify is how well annotated the genes in the hit list are: hit lists with highly annotated genes (by mean direct annotation count) tend to be less stable. We speculate that this is because highly annotated genes undergo more changes to their annotations, which can drive shifts in enrichment results, but we have yet to explore this further, and in any case the relationship is not strong enough to be usefully predictive. In addition, we did not assess other possible factors influencing stability, such as evidence codes [24], a topic we leave for future research.

Conclusions
The evolving and incomplete nature of GO/GOA has always been inherent and is well understood by the GO community. But it is seemingly less appreciated more broadly. For example, the extremely popular enrichment tool DAVID (over 32,000 citations as of May 2018 [26]) did not update its GO annotations for nearly seven years, an eon in GO history (and at this writing DAVID has not been updated for nearly two years [27]). We find it interesting that there wasn't a massive outcry in response to the use of such out-of-date GO annotations, suggesting either ignorance or apathy. While it might seem obvious that one would always want to use the latest GO annotations, this can be questioned. GO/GOA can change dramatically in a see-saw fashion over a period of months, suggesting that not all changes are improvements. Furthermore, we report a strong tendency for hit lists to yield ever more significant terms over time (Figure 4A), and it is not clear this comes with any increase in useful information. It could be that using GO/GOA from an earlier, simpler era might be beneficial for enrichment analyses (using a GO slim [28] may approximate this concept). While we may not be able to settle that question here, it is clear that whatever version of GO/GOA is used, it cannot be treated as a gold standard. Enrichment analysis should be considered exploratory, and never used as a primary finding [29]. Computational researchers should also be cautious in using GO/GOA as an optimization target when developing and evaluating algorithms, especially since changes over time are not the only concern [7,17].
GOtrack should be a valuable resource for biologists to gain a greater understanding of where GO annotations come from and how they change over time, as well as their impact on the major use case for GO/GOA, enrichment analysis. Our analysis of the data in GOtrack also revealed a number of interesting features, and it is likely that deeper analyses can be used to gain more insight into patterns of curation that might influence future efforts.

Methods
Dates after 2016-05-09 were obtained from a combination of goa_<species>.gpi.<edition>.gz and goa_<species>.gpa.<edition>.gz files. Mapping of historical annotations to a release of the Gene Ontology was done by selecting the ontology with the closest release date before that of the annotation file. Annotations were propagated up the GO graph as per the "true path rule" [5]. To convert release editions to dates, prior to edition 135 (July 2014) the release number of the file was compared to the dates given on the GOA news site [32]. For edition 135 onwards we used the date provided in the files. We note that there are some gaps in the available data, especially at early time points. For example, we lack data for human for September and October 2002. In addition, the spacing of dates is not uniform; while the median inter-edition gap is 28 days, there are a few gaps that are smaller (minimum 13 days) or larger (e.g. 40 days).
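The nearest-preceding-release rule can be sketched as a sorted-list lookup; the release dates below are invented for illustration:

```python
from bisect import bisect_right
from datetime import date

def matching_ontology(annotation_date, release_dates):
    """Pick the GO release with the closest date at or before the
    annotation file's date (release_dates must be sorted ascending)."""
    i = bisect_right(release_dates, annotation_date)
    if i == 0:
        raise ValueError("no ontology release precedes this annotation file")
    return release_dates[i - 1]

# Invented release dates for the sketch.
releases = [date(2014, 6, 1), date(2014, 7, 1), date(2014, 8, 1)]
chosen = matching_ontology(date(2014, 7, 15), releases)  # 2014-07-01
```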

Mapping of gene identifiers over time:
Gene product annotations are tracked historically using their associated UniProt accession number(s) [33]. Each gene product in UniProt has a unique primary accession, called the 'Primary (citable) accession number'. In addition to this, a gene product may also have secondary accession numbers which could have been created historically from merges and/or splits. During a merge, the first accession is retained as the primary while all others become secondary. During a split, a new primary accession is created for all products involved while their original accessions are retained as secondary. An accession is only deleted when its corresponding entry has been removed from UniProt. The mapping of primary to secondary accessions is retrieved from [34]. This mapping allows us to find the current primary accession of a historical annotation.
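The resolution of a historical accession to its current primary can be sketched as a simple lookup. The accessions below are invented; the real mapping comes from the file referenced as [34], and splits (one secondary mapping to several primaries) are ignored in this sketch:

```python
def resolve_accession(acc, secondary_to_primary):
    """Map a possibly-secondary UniProt accession to its current primary.

    secondary_to_primary is built from UniProt's secondary-to-primary
    mapping; accessions absent from it are assumed to already be
    primary (deleted accessions must be handled separately).
    """
    return secondary_to_primary.get(acc, acc)

# Toy example: Q00002 and Q00001 were merged, Q00001 kept as primary.
sec2prim = {"Q00002": "Q00001"}
```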
Enrichment analysis: GOtrack implements over-representation analysis using the hypergeometric distribution [16]. The background is the set of all annotated genes (for the time point being analyzed). For analyses presented in the paper, terms with between 20 and 200 genes were included, and only Biological Process terms were considered. The false discovery rate was controlled at 5% using the method of Benjamini and Hochberg [35]. The GOtrack enrichment tool allows these parameters to be varied by the user.
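A minimal sketch of over-representation testing with the hypergeometric distribution and Benjamini-Hochberg control, as described above; the term-size filtering (20-200 genes) and GO-specific bookkeeping are omitted, and the toy counts are invented:

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """Upper-tail P(X >= k) for X ~ Hypergeometric(N, K, n):
    N genes in the annotated background, K annotated with the term,
    n genes in the hit list, k of which carry the term."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

def bh_fdr(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: return the set of rejected indices."""
    m = len(pvalues)
    order = sorted(range(m), key=pvalues.__getitem__)
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    return {order[i] for i in range(cutoff)}

# Toy check: 5 of 10 hit-list genes carry a term annotated to
# 50 of 1000 background genes: strongly over-represented.
p = hypergeom_pvalue(5, 10, 50, 1000)
```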
Data analysis: Many of the analyses described are based on files available via the GOtrack web site [36] including the "summary" files by edition, terms and genes. Analyses were conducted with custom scripts written in R [37,38] and python. Correlations are Spearman Rank correlations except where indicated otherwise.

Analysis of MSigDB hit lists:
The MSigDB C2 collection [23] was downloaded from [39]. This corpus is divided into a set of "canonical pathways" (CP) and "chemical and genetic perturbations" (CGP). For the CGP hit lists, the publication associated with each hit list was extracted, and the date of publication (t0) was used to identify the nearest matching version of GO/GOA in GOtrack. Each hit list was analyzed for enrichment as described above, for t0 and a recent comparison time point (January 2018, tnow). We analyzed 2573 CGP hit lists that yielded at least five significant terms at either (or both) t0 and the comparison time point. CP lists (n=1327 after filtering) were treated the same way, except t0 was fixed at Nov 21 2005 (the mean date for the CGP lists).
To compare two sets of enrichment results, we explored several measures (see Supplement) but focus on a standard Jaccard index, |E0 ∩ E1| / |E0 ∪ E1|, where E0 and E1 are the sets of all significantly enriched GO terms for the same input hit list at two time points ("complete Jaccard"). The primary alternative measure we examined was a modified Jaccard that examines only the top five terms plus their inferred parent terms ("top-term-parents Jaccard"), similar to the measure proposed by [40]. See the Supplement for details and discussion.
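The complete Jaccard measure is straightforward to compute; the convention for two empty result sets below is a choice made for this sketch:

```python
def complete_jaccard(e0, e1):
    """Jaccard index between two sets of significantly enriched GO terms."""
    e0, e1 = set(e0), set(e1)
    if not e0 and not e1:
        return 1.0  # convention: two empty result sets are identical
    return len(e0 & e1) / len(e0 | e1)
```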
To generate a null distribution, we compare enrichment results from pairs of randomly selected hit lists (i.e., coming from different publications). Instead of comparing a hit list's results at t0 to its results at tnow, the data are permuted so that t0 of one hit list is compared to tnow of a randomly selected other hit list (with the same constraint that at least one of them must have 5 or more significant GO terms). We analyzed 1000 such permutations of the data and pooled the resulting scores. This null also inherently addresses the tendency of some GO terms to recur more frequently than others in independent enrichment analyses [16].
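The permutation scheme above can be sketched as follows; hit-list identifiers and term sets are invented, and the requirement that at least one member of each pair have five or more significant terms is omitted for brevity:

```python
import random

def null_jaccards(results_t0, results_tnow, n_perm=1000, seed=0):
    """Pooled null distribution of Jaccard indices from mismatched pairs.

    results_t0 / results_tnow map hit-list ids to their sets of
    significant GO terms at t0 and tnow. Each permutation pairs a hit
    list's t0 results with the tnow results of a different, randomly
    chosen hit list, and all scores are pooled.
    """
    rng = random.Random(seed)
    ids = list(results_t0)
    null = []
    for _ in range(n_perm):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        for a, b in zip(ids, shuffled):
            if a == b:
                continue  # never compare a hit list with itself
            e0, e1 = results_t0[a], results_tnow[b]
            union = e0 | e1
            if union:
                null.append(len(e0 & e1) / len(union))
    return null

# Toy example with two hit lists; the real corpus has thousands.
t0 = {"listA": {"GO:0006915", "GO:0008219"}, "listB": {"GO:0007165"}}
tnow = {"listA": {"GO:0006915"}, "listB": {"GO:0016301"}}
null = null_jaccards(t0, tnow)
```

An observed Jaccard index is then called "significantly stable" if it exceeds the 95th percentile of this pooled null, matching the calibration described in the text.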