The Gene Ontology knowledgebase in 2023

Abstract The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO—a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations—evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs)—mechanistic models of molecular “pathways” (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.


Introduction
Genes encode gene products, often proteins but also noncoding RNA molecules (ncRNAs), that perform functions at the molecular, cellular, and organismal levels. The Gene Ontology (GO) knowledgebase provides a comprehensive, structured, computer-accessible representation of gene function, for genes from any cellular organism or virus. The GO knowledgebase has become a critical component of life science research, supporting analysis of large-scale experiments and biological systems (Duck et al. 2016). It is designed to make expert knowledge of gene function accessible for bench scientists as well as computational analyses. The basic model underlying GO is the "molecular biology paradigm" (Ashburner et al. 2000; Thomas 2017), in which there are three types (aspects) of functional characteristics used to describe gene function:
• Molecular function (MF): the activities performed by a gene product at the molecular level
• Cellular component (CC): the locations, relative to cellular structures, where MFs are performed
• Biological process (BP): a "biological program" comprising molecular activities acting in concert to achieve a particular outcome; this program can be at the cellular level or at the organism level of multicellular organisms.
The GO knowledgebase consists of three components: the GO, GO annotations, and GO Causal Activity Models (GO-CAMs) (Fig. 1). The GO (Fig. 1a) structures our current knowledge of the types of functional characteristics that a gene product may possess into a connected graph-based representation. Each ontology term (called "class" in the field of ontologies) represents a functional characteristic that can be attributed to a gene product. Terms can have relationships between them, such as one term being more specific than another term (also called "subclass"), e.g. DNA-binding transcription factor activity is a subclass of transcription regulator activity. A GO annotation (Fig. 1b) is an association between a specific gene (or gene product) and a GO term and should be interpreted as a statement that the specified gene product possesses the functional characteristic represented by the GO term. Each GO annotation includes the evidence upon which it is based. Because each GO annotation covers only a single characteristic of gene function, multiple GO annotations are generally required to completely describe the function of a gene product. GO-CAMs (Fig. 1c) link multiple GO annotations together to create models of BPs by (1) connecting the activities of more than one gene product together into causal networks and (2) allowing the specification of the biological context (e.g. cell type and tissue type) in which the activities occur.

The GO knowledgebase is large and dynamic
For applications that use the components of the GO knowledgebase, it is crucial that the ontology and associated annotations represent the current state of knowledge and are not just an archive of all public data. Therefore, all aspects of the GO knowledgebase are dynamic (ontology, annotations, GO-CAMs, links to external ontologies, etc.), and citable, versioned updates are released on a monthly basis. Below, we describe each component of the knowledgebase, focusing on recent changes made to improve the resource during the past two years. Statistics and descriptions given here are based on the GO release 2022-11-03 (http://release.geneontology.org/2022-11-03, doi:10.5281/zenodo.7407024).

Ontology
The ontology component of the GO knowledgebase consists of the terms used to describe functional characteristics of gene products, which are linked together by relations into a labeled directed acyclic graph (like a hierarchy but with multiple parentages allowed). It also includes term definitions, synonyms, and relations to terms from external ontologies. The GO is available in different editions, including (1) the "basic" edition, which includes only core relationship types; (2) the core ontology, including additional relationship types; and (3) the "go-plus" edition which also includes relationships to terms in other ontologies. These editions are explained on the GO downloads page http://geneontology.org/docs/download-ontology/. The ontology contains 43,303 terms (Table 2), linked together by 88,099 relationships in the basic edition. When relationships to external terms are included, there are 121,698 relationships; release statistics can be viewed at http://geneontology.org/stats.
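As a minimal sketch of what "a hierarchy but with multiple parentages allowed" means in practice, the following toy graph gives one term two parents and collects all of its ancestors. The edge list is illustrative only: the first edge is the subclass example from the Introduction, while "hypothetical second parent" and the helper names `build_parent_map` and `ancestors` are assumptions made for this example, not part of any GO release or software.

```python
from collections import defaultdict

# Toy edges as (child, relation, parent). Only the first edge comes from the
# text; the "hypothetical second parent" is a placeholder used to show that a
# term may have more than one parent.
EDGES = [
    ("DNA-binding transcription factor activity", "is_a", "transcription regulator activity"),
    ("DNA-binding transcription factor activity", "is_a", "hypothetical second parent"),
    ("transcription regulator activity", "is_a", "molecular_function"),
    ("hypothetical second parent", "is_a", "molecular_function"),
]


def build_parent_map(edges):
    """Index the labeled edges by child term."""
    parents = defaultdict(set)
    for child, relation, parent in edges:
        parents[child].add((relation, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, ()):
            if parent not in seen:  # a shared ancestor is visited only once
                seen.add(parent)
                stack.append(parent)
    return seen


parent_map = build_parent_map(EDGES)
print(sorted(ancestors("DNA-binding transcription factor activity", parent_map)))
```

Because the graph is acyclic, the traversal always terminates; the `seen` set only prevents revisiting ancestors that are reachable through more than one parent.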
The GO is subject to constant review and revision to most accurately model the current biological knowledge. Revision of the ontology includes the addition or obsoletion of terms and reorganization of the relationship structure. New GO terms are added to represent concepts previously missing from the GO in response to published findings, or when a branch of GO is revised. Terms may be obsoleted when unused or inconsistently used in annotation, when they are redundant with other terms, or during revision of specific branches of the ontology.
Most of the revisions in the structure of GO are in response to advances in biological knowledge, as well as improvements in the precision of newer experimental approaches. In addition, because many branches of the ontology have grown organically in a bottom-up fashion by accumulating specific individual term requests, we also perform systematic review aimed at improving consistency and clarity while reducing redundancy. Additional revisions are initiated by internal review, and consistency and quality assurance checks. Revisions are also made following feedback from users. Whenever possible, changes are performed in collaboration with expert biocurators or domain specialists; recent examples include blood-brain barrier-related functions (Saverimuttu et al. 2021) and transcription factors (Gaudet et al. 2021).
At each release, we track all changes and report on our website the number of added, obsoleted, and merged terms in the ontology. Table 1 shows the number of GO terms added and removed (merged or obsoleted) over the past two-year period, for each aspect of GO. In the MF and CC aspects of the ontology, term creation versus term obsoletion have approximately balanced each other, such that the number of terms in these two branches has remained roughly constant. The most significant changes have been in the BP aspect of GO, with a net decrease of over 800 terms.
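The per-aspect bookkeeping described above can be sketched as a comparison of two release snapshots. This is an illustration under stated assumptions, not the GO release pipeline: the snapshot layout (term ID mapped to aspect and obsolete flag), the function names, and the `GO:9000001`-style identifiers are all hypothetical placeholders.

```python
from collections import Counter

# Hypothetical snapshots: term ID -> (aspect, is_obsolete). The IDs are
# placeholders, not real GO terms.
OLD_RELEASE = {
    "GO:9000001": ("BP", False),
    "GO:9000002": ("BP", False),
    "GO:9000003": ("MF", False),
}
NEW_RELEASE = {
    "GO:9000001": ("BP", False),
    "GO:9000002": ("BP", True),   # obsoleted in the newer release
    "GO:9000004": ("MF", False),  # newly added; GO:9000003 was merged away
}


def live_terms(snapshot):
    """Keep only non-obsolete terms, mapped to their aspect."""
    return {tid: aspect for tid, (aspect, obsolete) in snapshot.items() if not obsolete}


def diff_releases(old, new):
    """Count terms added and removed (merged or obsoleted) per aspect."""
    old_live, new_live = live_terms(old), live_terms(new)
    added = Counter(new_live[t] for t in new_live.keys() - old_live.keys())
    removed = Counter(old_live[t] for t in old_live.keys() - new_live.keys())
    return added, removed


added, removed = diff_releases(OLD_RELEASE, NEW_RELEASE)
for aspect in ("MF", "CC", "BP"):
    print(aspect, "added:", added[aspect], "removed:", removed[aspect],
          "net:", added[aspect] - removed[aspect])
```

In this toy data the MF aspect gains one term and loses one, so its size is unchanged, while BP shows a net decrease, mirroring the qualitative pattern reported for Table 1.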
Many of these revisions result from global reviews of the ontology to address clear inconsistencies in usage and changes in annotation practices. Terms that have been removed from the ontology over the last two years fall into several different categories, including the following:
• Terms that correspond to phenotypes and for which the understanding of the process was previously too incomplete to annotate to a different term. Examples include the following: regulation of spindle density (GO:0090225) and age-dependent general metabolic decline (GO:0007571).

Fig. 1.
Examples of the three components of the GO knowledgebase. a) The GO ontology consists of terms, e.g. DNA binding transcription factor activity, and relationships between the terms (arrows; black = is a, blue = part of, and orange = regulates). b) GO annotations associate a specific gene product (here, human ZNF410) with GO terms asserting its functional aspects ("GO Class" column, e.g. sequence-specific double-stranded DNA binding) and the evidence for each assertion with its traceable source ("Evidence" and "Reference" columns). c) The GO-CAM model combines individual GO annotations into a model, in this case a very simple model describing how human ZNF410 acts as a transcription factor to positively regulate (denoted by the arrow) transcription of the CHD4 gene, which in turn acts as a corepressor to repress (denoted by dashed lines) transcription of fetal hemoglobin genes (HBG1 and HBG2) in erythroid lineage cells. In this view, each box in the GO-CAM is labeled with the gene product and species abbreviation for simplicity.
• Terms that are combinations of multiple GO terms that can now be represented more precisely using GO-CAM models. Examples are chromatin remodeling in response to cation stress (GO:0043156) and regulation of cyclin-dependent protein serine/threonine kinase activity involved in G2/M transition of the mitotic cell cycle (GO:0031660).
• Revisions based on updated knowledge, either by GO editors, by authoritative databases, or in the literature. For example, alpha-taxilin (UniProt:P40222) was originally thought to be the high-molecular weight interleukin-14 (Ambrus et al. 1993); an erratum was later published (Ambrus et al. 1996)   database, and the corresponding GO term 1,6-dihydroxy-5-methylcyclohexa-2,4-dienecarboxylate dehydrogenase activity (GO:0018512) was obsoleted in GO.
• Single step reactions in the BP aspect of the ontology: there were many instances in the GO where the same activity was represented as both a MF and a BP, for example "histone kinase activity" and "histone phosphorylation." This was useful when fewer activities were characterized at the molecular level, and the best level of resolution for many experiments was that the gene has some uncharacterized role that led to histone phosphorylation, for example. However, with increasingly detailed molecular data, the redundancy between MF and BP annotations became unnecessary, and having a similar term in both aspects of the ontology led to inconsistency. This is an ongoing project and many BP terms still need to be obsoleted for this reason.
• Terms that refer to more than one ontological aspect: ubiquinone biosynthetic process monooxygenase activity (GO:0015997) included a BP within a MF; MAP kinase phosphatase activity involved in regulation of innate immune response (GO:0038078) included a MF within a BP; and histone deacetylation at centromere (GO:0031059) represented all three aspects: a MF in the BP branch of the ontology (histone deacetylation) that also included CC information (centromere).
• Misclassified terms, for example urea homeostasis (GO:0097274) and creatinine homeostasis (GO:0097273): while these compounds are important medical biomarkers, the normal process that they measure is proper renal function; therefore, these terms have been obsoleted. Annotations have been rehoused under renal tubular secretion (GO:0097254) (or one of its children) or removed if the paper supporting the annotation did not allow one to infer the process that affected the circulating levels of urea or creatinine.
• Reaction mechanisms: primary charge separation (GO:0009766) and enzyme active site formation (GO:0018307) were obsoleted because they represent substeps of reactions, which are beyond the scope of GO.
• Protein-modifying activity terms that mention specific substrates, for example [cytochrome c]-arginine N-methyltransferase activity (GO:0016275), which is captured by the more general arginine N-methyltransferase activity (GO:0016274). Substrates can be captured with the "has input" relationship in GO-CAM models and in annotation extensions. The exception to this is the histone code: for GO to represent this important mechanism regulating gene expression and chromatin structure, specific activities are created for known histone modifications, for example histone H2AR3 methyltransferase activity (GO:0070612) and histone H3T3 kinase activity (GO:0072354).
• Experimental assays and nonphysiological substrates: some experiments are easier to perform using analogs of physiological substrates. Because GO terms should represent in vivo functions, we have removed some terms that represent an experiment rather than its biological conclusion. An example is rubidium ion transport (GO:0035826): rubidium is used as a tracer for potassium ions (Gill et al. 2004) but has no physiological role in itself. Another example is regulation of nucleosome density (GO:0060303), which measures the degree of compaction of chromatin, and is a readout for heterochromatin assembly or disassembly.
Concomitantly with these term obsoletions, many new terms have been added to the ontology in the past two years. An example is molecular condensate scaffold activity (GO:0140693) for proteins that nucleate condensates that mediate liquid phase transition. This term represents a recent advance in the understanding of the organization of cellular biochemistry (Banani et al. 2017).
We have also clarified the level of specificity at which MF terms should be represented in GO. For example, we now strive to create GO terms that represent the range of in vivo substrate specificity of an enzyme or transporter. This is in contrast to earlier guidelines, in which a GO term was created for each separate molecular substrate tested in a single, isolated experimental assay or result, which could include nonphysiological substrates. With recent improvements in experimental technologies and practices, it is now often possible to annotate with a concept that more closely matches the biological substrate specificity range of a protein. Therefore, while GO makes cross-references to Enzyme Commission (EC) (McDonald and Tipton 2014), Rhea (Bansal et al. 2022), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al. 2022), and MetaCyc (Altman et al. 2013), GO does not necessarily create a different term for each of the reactions represented in these resources for each substrate on which a MF acts. For example, the GO term 3-oxoacyl-[acyl-carrier-protein] reductase (NADPH) activity (GO:0004316) represents the fact that the same gene product has a broad specificity toward 3-oxo-acyl groups, and therefore, we have obsoleted the more specific GO terms that refer to only one specific substrate, such as 3-oxo-cis-Delta9-hexadecenoyl-[acp] reductase activity (GO:0102072), 3-oxo-glutaryl-[acp] methyl ester reductase activity (GO:0102131), and 3-oxo-pimeloyl-[acp] methyl ester reductase activity (GO:0102132). For broad specificity enzymes and transporters, the activity on a specific substrate in a specific pathway can be captured by biocurators in a GO-CAM  or an annotation extension (Gene Ontology Consortium 2010) rather than in a GO term.

Annotations
A GO annotation is a statement asserting that a particular gene or gene product has a particular functional characteristic (GO term); examples are shown in Fig. 1b. New annotations are continually added to the knowledgebase. In the past two years, experimentally supported gene function annotations have been added from over 10,000 scientific papers. As of November 2022, the GO knowledgebase contains experimental knowledge from almost 173,000 papers. GO annotations derived from experimental data are added primarily by the annotation groups in the GO Consortium, which typically curate biological knowledge by organism (Table 2).
GO annotations are also regularly reviewed and may be edited or removed from the knowledgebase for various reasons, particularly when ontology terms are revised (see "Ontology" section above) or when annotations are invalidated by later experimental data. Annotations to terms that will be obsoleted are manually reviewed and annotations are made to a different term whenever possible. For example, when we edited the ontology for histone modifications, over 2,000 annotations to the obsoleted terms were manually reviewed, and histone modifying enzymes were reannotated to the appropriate MF term, while annotations from indirect effects were either removed or reannotated to different, appropriate GO terms. More minor annotation reviews occur regularly.
The Phylogenetic Annotation with the GO project (see below) involves an integrated biocurator review of annotations that has provided additional quality control. The GO user community also plays an important role in identifying incorrect annotations. Because each annotation can be traced to the published paper containing the underlying evidence or describing a method used to infer the annotation, users can quickly verify the accuracy of a given annotation. Potential errors can be reported by clicking on the "Help" link at the top of the GO homepage (http://geneontology.org). In addition, authors of a paper used to create GO annotations can easily retrieve and review all annotations from a given paper and suggest changes; this can be done from the PubMed abstract page (e.g. the PubMed page https://pubmed.ncbi.nlm.nih.gov/20516198/ (Lydeard et al. 2010)) by clicking on LinkOut and then the "Gene Ontology" link.

Phylogenetic annotations as a source of highly reviewed annotations
The Phylogenetic Annotation using GO (PAN-GO) project creates a set of biocurator-reviewed, selected GO annotations. The PAN-GO process is described in detail in Gaudet et al. (2011). Briefly, using the PAINT software tool, a biocurator reviews all experimentally supported GO annotations collected for all members of a protein family, in the context of a phylogenetic tree from the PANTHER resource (Thomas et al. 2022). They then select the most informative and nonredundant GO terms that represent the gene's functional characteristics. Biocurators then model the evolution of these characteristics in the tree by specifying branches along which the GO terms were gained or lost, taking into account events such as duplications, mutations, horizontal gene transfers, and taxonomic specificity. This allows for different members of the same family to be annotated with different GO terms when justified by the experimental data. All PAN-GO annotations can be traced to experimental evidence in one or more related genes. To date, a total of 8,196 protein families (out of 11,719 families with experimental data) have been curated. The PAN-GO curation effort has prioritized human gene-containing families, though many other families have also been curated. As a result, annotation coverage of a genome generally depends on how closely related it is to humans. PAN-GO annotations are available for 82% of human genes (compared with 68% with experimental evidence alone). Other vertebrate genomes have similarly high coverage, with genomes from other taxa covered at lower but still substantial levels (Tables 3 and 4). PAN-GO annotations are updated at each GO release and are included in the standard, downloadable GO annotation files. These annotations can be identified by the "IBA" (Inferred from Biological aspect of Ancestor) evidence code and are available for the 142 organisms included in PANTHER gene families (http://pantherdb.org/panther/summaryStats.jsp).

Protein binding and protein-containing complex annotations
We suggest that users should be particularly cautious when using GO annotations directly to the term protein binding (GO:0005515; see Table 2). These are highly specific annotations that include the protein binding partner in another field of the annotation (not in the GO term itself) and should not be used in applications such as gene set enrichment analysis. Instead, they are recommended for applications such as protein-protein interaction network construction for human proteins (which represent the vast majority of direct protein binding annotations in the knowledgebase). Since all protein functions encompass some type of binding (to a substrate or to another protein), GO strives to describe the molecular activity of proteins using at least one term that is not only under the binding branch of GO; see also the "noncatalytic MF" section above. Therefore, binding (GO:0005488) in isolation can be considered a limited functional description and is represented as a distinct branch of GO MF.
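The recommendation above can be sketched as a simple split of an annotation list: direct protein binding (GO:0005515) rows are routed to an interaction set, and everything else is kept for enrichment analysis. This is a minimal sketch under assumptions: the tuple layout, the gene names `GENE_A` and `GENE_B`, the second GO identifier (a placeholder), and the convention that the binding partner appears as a pipe-separated `DB:ID` entry in a "with/from" field are all illustrative and should be checked against the annotation file format actually in use.

```python
PROTEIN_BINDING = "GO:0005515"

# Hypothetical annotations: (gene, GO ID, evidence code, with/from field).
# Gene names and the non-binding GO ID are placeholders.
ANNOTATIONS = [
    ("GENE_A", "GO:0005515", "IPI", "UniProtKB:GENE_B"),
    ("GENE_A", "GO:9000010", "IDA", ""),
    ("GENE_B", "GO:0005515", "IPI", "UniProtKB:GENE_A"),
]


def split_for_analysis(annotations):
    """Separate direct protein binding rows from rows suitable for enrichment."""
    enrichment = []
    interactions = set()
    for gene, go_id, _evidence, with_from in annotations:
        if go_id == PROTEIN_BINDING:
            # The partner lives in the with/from field, not in the GO term.
            for partner in filter(None, with_from.split("|")):
                partner_id = partner.split(":", 1)[-1]
                interactions.add(frozenset((gene, partner_id)))
        else:
            enrichment.append((gene, go_id))
    return enrichment, interactions


enrichment_rows, interaction_pairs = split_for_analysis(ANNOTATIONS)
print(enrichment_rows)
print([sorted(pair) for pair in interaction_pairs])
```

Storing each interaction as a `frozenset` makes the pair unordered, so reciprocal annotations (A binds B, B binds A) collapse into a single edge of the interaction network.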

Annotation evidence
All annotations are supported by evidence, comprising two fields in the annotation file (Fig. 1b): an evidence code that describes the type of evidence, and a reference that lists a persistent identifier for tracing the source (provenance) of the original data. It has often been asserted that the most reliable annotations are those made using an experimental evidence code. However, we suggest that users take into account the type of experimental evidence and the level of review of the annotation (Table 5). Some types of experimental evidence, such as inference from a gene expression pattern (IEP), mutant phenotype (IMP), or genetic interaction (IGI), can often be suggestive of function but not definitive when considered in isolation; other annotations for the same gene are often useful to help interpret these annotations. "High-throughput" evidence codes should be treated with particular care. These codes (beginning with the letter H) denote experiments in which many genes are analyzed at the same time, and these annotations are not individually reviewed by either the paper's authors or a GO Consortium biocurator (Attrill et al. 2019). Conversely, many nonexperimental evidence types are carefully reviewed by experts. Phylogenetic annotations (IBA evidence code) are based on integration and expert assessment of experimental annotations and thus are individually reviewed twice: once in making the annotation from published experimental results and once in the context of all annotations for related genes (Gaudet et al. 2011). While annotations using the Inferred from Electronic Annotation (IEA) evidence code are considered automated, most of the underlying methods implement expert review of a subset of annotations to minimize false positives (for example, UniRule (MacDougall et al. 2021) and InterPro2GO (Paysan-Lafosse et al. 2022)). The GOC considers these annotations to be accurate, though they are often less specific than other annotations.
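One way to act on this guidance is to group annotations by evidence type and review level before analysis, rather than applying a single "experimental only" filter. The sketch below assumes tab-separated annotation rows in which the second, fifth, and seventh columns hold the gene product ID, the GO ID, and the evidence code, and in which comment lines start with "!"; the rows shown are truncated placeholders with hypothetical identifiers, and the category labels are this example's own wording, not official GO terminology.

```python
from collections import Counter

# Codes named in the text as suggestive but not definitive in isolation.
SUGGESTIVE_EXPERIMENTAL = {"IEP", "IMP", "IGI"}
# Other experimental codes; this grouping is an assumption of the example.
OTHER_EXPERIMENTAL = {"EXP", "IDA", "IPI"}


def review_category(code):
    """Map an evidence code to a coarse category reflecting its review level."""
    if code.startswith("H"):
        return "high-throughput (not individually reviewed)"
    if code == "IBA":
        return "phylogenetic (expert reviewed)"
    if code == "IEA":
        return "electronic"
    if code in SUGGESTIVE_EXPERIMENTAL:
        return "experimental (suggestive in isolation)"
    if code in OTHER_EXPERIMENTAL:
        return "experimental"
    return "other"


def read_annotations(lines):
    """Yield (gene product ID, GO ID, evidence code) from tab-separated rows."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        columns = line.rstrip("\n").split("\t")
        yield columns[1], columns[4], columns[6]


# Placeholder rows, truncated to the first seven columns for brevity.
ROWS = ["!example header line"] + [
    "\t".join(["DB", f"ID{n}", f"GENE_{n}", "enables", f"GO:900002{n}", "PMID:0", code])
    for n, code in enumerate(["IDA", "IMP", "HDA", "IBA", "IEA"])
]

counts = Counter(review_category(code) for _gene, _term, code in read_annotations(ROWS))
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

Keeping the categories separate lets downstream analyses decide, for instance, to retain IBA annotations while setting high-throughput annotations aside for separate inspection.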
GO evidence codes correspond to a subset of the terms found in the Evidence and Conclusion Ontology (ECO) (Nadendla et al. 2022). Combinations of particular GO internal references (GO_REFs) and evidence codes are also mapped to specific ECO terms (https://github.com/evidenceontology/evidenceontology/blob/master/gaf-eco-mapping.txt). Users needing to map granular ECO terms to GO evidence code abbreviations can use the mapping file provided by ECO (https://github.com/evidenceontology/evidenceontology/blob/master/gaf-eco-mapping-derived.txt).

GO causal activity models
Table 3. Genome coverage of PAN-GO annotations. Percentage of protein-coding genes with at least one PAN-GO-reviewed annotation, for different taxonomic groups.
GO-CAMs are models of causal influences between gene products or pathways. More precisely, a GO-CAM links the activities (GO MFs) of gene products together by causal relations that specify the effect of one activity on the other. Each element of a GO-CAM is an instance of an ontology class or other standard database identifier, so GO-CAMs are highly structured and amenable to computational analysis. The basic unit of a GO-CAM is a "gene product activity unit," which combines a GO MF annotation (molecular activity), together with GO CC (location) and GO BP (larger functional module) annotations that provide the biological context of the activity. The context can be further specified with other ontologies to capture the cell type (using the Cell Ontology (Diehl et al. 2016)), tissue/anatomical location (using several different ontologies depending on the species, e.g. Uberon (Mungall et al. 2012) for most vertebrates, other metazoan ontologies such as the Drosophila anatomy ontology (Costa et al. 2013) or the Caenorhabditis elegans anatomy ontology (Lee and Sternberg 2003), or nonanimal ontologies such as the Plant Ontology (Cooper and Jaiswal 2016)), or a temporal period (e.g. GO biological phase). Activity units are linked together by causal relationships from the Relations Ontology (Smith et al. 2005) to capture how they interact to impact larger pathways, modules, or processes. As of November 2022, GO Consortium annotation groups have created over 300 GO-CAM models that describe molecular pathways (defined as containing at least three distinct gene product activities linked into a causal chain). These models reflect curation priorities of the contributing groups. Most of the available GO-CAMs are for processes in human or mouse, with a limited number in zebrafish, Drosophila melanogaster, and C. elegans.
Many of the human GO-CAMs describe chromatin-mediated regulation of gene expression and immune response pathways, while the mouse GO-CAMs focus on metabolic and signaling pathways. GO-CAMs are accessible from the GO website homepage, by clicking on the "Browse GO-CAMs" link. GO-CAMs can be viewed as pathway diagrams (Fig. 2) and are currently available on GitHub at https://github.com/geneontology/noctua-models.
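To make the "gene product activity unit" and the pathway definition above concrete, the following sketch models activity units as small records linked by causal edges and checks whether a model contains a causal chain of at least three activities. It is an illustration only and does not reflect the actual GO-CAM data model or file format: the class names, field names, and relation label are assumptions, and the two-unit example is a simplified reading of the ZNF410 model in Fig. 1c.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActivityUnit:
    """One gene product activity with optional biological context."""
    gene_product: str
    molecular_function: str   # GO MF label
    occurs_in: str = ""       # GO CC or cell type label, if known
    part_of: str = ""         # GO BP label, if known
    inputs: tuple = ()        # targets of the activity ("has input")


@dataclass
class CausalModel:
    """Activity units linked by causal edges (upstream, relation, downstream)."""
    edges: list = field(default_factory=list)

    def add(self, upstream, relation, downstream):
        self.edges.append((upstream, relation, downstream))

    def units(self):
        return {unit for up, _rel, down in self.edges for unit in (up, down)}

    def longest_chain(self):
        """Number of activities in the longest causal chain."""
        downstream = {}
        for up, _rel, down in self.edges:
            downstream.setdefault(up, []).append(down)

        def depth(unit, visiting=frozenset()):
            if unit in visiting:  # stop if a feedback loop revisits a unit
                return 0
            following = downstream.get(unit, [])
            return 1 + max((depth(d, visiting | {unit}) for d in following), default=0)

        return max((depth(u) for u in self.units()), default=0)

    def is_pathway(self, minimum=3):
        """Apply the 'at least three linked activities' definition from the text."""
        return self.longest_chain() >= minimum


# Simplified reading of Fig. 1c; labels are illustrative.
znf410 = ActivityUnit("ZNF410", "DNA-binding transcription factor activity",
                      occurs_in="erythroid lineage cell", inputs=("CHD4",))
chd4 = ActivityUnit("CHD4", "transcription corepressor activity",
                    occurs_in="erythroid lineage cell", inputs=("HBG1", "HBG2"))

model = CausalModel()
model.add(znf410, "positively regulates", chd4)
print(model.longest_chain(), model.is_pathway())
```

With only two linked activities, this toy model falls below the three-activity threshold used above to count pathway models, which is consistent with the caption describing it as a very simple model.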

Community collaborations
The GO Consortium collaborates with experts in specific areas of molecular and cellular biology to systematically update and improve their representation in the ontology and the corresponding GO annotations and GO-CAMs. We recently revised the representation for transcription factors and transcriptional regulation in collaboration with the GREEKC Consortium (Kuiper et al. 2022). Additional collaborative projects include working with the DisProt project on improving the ontology and annotations for intrinsically disordered proteins (IDPs); revising processes that involve molecular pathways between interacting species, such as viral infection processes; and integrating the GO and annotations with external biochemical databases.
In 2021, the GO started a collaboration with DisProt (https://disprot.org/), the gold standard database of manually curated annotations from the literature for IDPs. IDPs lack a stable three-dimensional structure and are characterized by highly flexible and unstructured segments, i.e. intrinsically disordered regions (IDRs). DisProt has developed a custom ontology, the Intrinsically Disordered Proteins Ontology (IDPO), and used it to annotate the structural states of IDPs. The GO Consortium and DisProt have collaborated to refactor IDPO and map the IDPO terms to GO terms whenever possible (those related to functions and interactions of IDPs). The collaboration between the GO Consortium and the DisProt database included the creation and addition of new GO terms to align with already existing IDPO terms that were not yet available in GO. These newly created terms also include the molecular function activator activity (GO:0140677) and molecular function inhibitor activity (GO:0140678) terms, used to annotate MF regulators that activate/inhibit or increase/decrease the activities of their targets via noncovalent binding that does not result in covalent modification to the target. This collaboration resulted in more accurate and detailed annotation of the modes of action of IDPs, e.g. localization (GO:0051179, IDPO:00010) and DNA binding (GO:0003677, IDPO:00065), as well as providing GO annotations. Currently, more than 1,000 expert-curated annotations from DisProt are available in the GO knowledgebase, comprising more than 860 MF, 200 BP, and 10 CC annotations. The only terms in IDPO that could not be mapped to GO were those describing self-regulatory (e.g. self-activation and self-inhibition) and intrinsic disorder-specific functions (i.e. entropic chains), so these annotations are available only in DisProt.

Multiorganism interactions
A group that includes experts from within and outside the GO Consortium has been working together to improve and simplify the representation of interactions between organisms, including medically and agriculturally important host-pathogen interactions. Examples of these interactions include how a symbiont such as a virus enters its host, how the host's immune response recognizes and defends the body against a potentially harmful organism, and also beneficial interactions such as how plants form a symbiosis with nitrogen-fixing bacteria. The goal of this project is to revise the host-symbiont branch of GO BP to reflect the current scientific knowledge in the field and to ensure that genes are properly annotated to the new ontology terms and structure, building on previous work undertaken as part of the PAMGO consortium (Tyler et al. 2009). Symbionts in GO are broadly defined to include pathogens that infect a host organism. We expect that this revision will improve GO-based analyses of molecular studies of pathogens, the mechanisms by which they infect host cells, and host response processes. A major change is that the branch of GO under biological process involved in interspecies interaction between organisms (GO:0044419) has been reorganized. It now reflects important concepts such as the types of biological programs used by symbionts to enable infection and by hosts to prevent or manage infection, such as disruption of cellular component of another organism (GO:0140975), formation of structure involved in a symbiotic process (GO:0044111), killing of cells of another organism (GO:0031640), and modulation of process of another organism (GO:0035821). Each of these terms has multiple, more specific subclass terms. One challenge in this area was that some previous GO annotations for pathogen genes used terms that apply to normal host processes, such as regulation of defense response processes. Thus, it was not clear whether the pathogen gene was regulating its own defense process or that of a host.
With the new ontology terms and structure, these distinctions are clear for both GO biocurators and users of GO. In general, it was important to clearly represent that certain symbiont-initiated processes hijack various host cellular processes. This includes mechanisms to enter and exit the cell, either by binding to host membrane proteins or using the intracellular transport machinery, and using the host cellular machinery for genome replication, as well as transcription and translation. We have obsoleted terms that do not clearly distinguish hijacking from the functions that a host gene performs for the host organism, such as dissemination or transmission of symbiont from host by vector (GO:0044008) and positive regulation of viral release from host cell (GO:1902188). Conversely, a pathogenic symbiont triggers innate responses in the host that are not the evolved role of these symbiont proteins, such as induction by symbiont of host cytokine production (GO:0036523) and pathogen-associated molecular pattern-dependent induction by symbiont of host innate immune response (GO:0052033); these are not functions that a symbiont protein performs to enable its own survival and reproduction.

Integration with biochemical knowledgebases
For accurate representation of biochemical aspects of gene function, we work closely with the Rhea database of reactions (Bansal et al. 2022) and the ChEBI ontology of chemical entities (Hastings et al. 2016). Rhea provides precise representations of in vivo biochemical reactions, including precise chemical entity participants and their stoichiometry. Rhea uses ChEBI terms to represent chemical entities in a standardized, consistent manner. The Rhea database overlaps in content with the catalytic activity branch of the GO but provides additional detailed reaction information and in some cases provides additional specificity. We have improved GO mappings to Rhea, which now cover 4,399 GO catalytic activities (in the MF branch of GO). These mappings allow for nonexact matches when the chemical specificity differs between GO and Rhea. For example, Rhea has two reactions, each referring to a different type of beta glucoside (RHEA:69647 and RHEA:69655, narrow match), whereas GO:0008422, beta-glucosidase activity, covers both substrates, as no known enzyme is specific for just one of them. We have recently used the Rhea-GO mappings to include additional linkages between GO MF terms and ChEBI terms in the go-plus release (see below). Previously, ChEBI terms were linked only to general terms in the GO BP branch (e.g. between folate transport and folate), but the additional Rhea linkages have added a total of 4,334 distinct chemical entities linked via 20,307 relationships. The extensive linkage to chemical entities opens opportunities for using GO in other applications, e.g. metabolomics analyses.

Browsing GO and its annotations
GO and associated annotations can be searched directly from the GO home page (http://geneontology.org/) or queried using the AmiGO browser (http://amigo.geneontology.org/amigo) or the QuickGO tool (https://www.ebi.ac.uk/QuickGO/) (Munoz-Torres and Carbon 2017). Gene set enrichment analysis is also directly accessible from the GO home page, which launches the PANTHER gene analysis tool (http://pantherdb.org/webservices/go/overrep.jsp).

Ontology downloads
GO provides three editions of the ontology on the download page (http://geneontology.org/docs/download-ontology/) to accommodate various applications: go-basic, go, and go-plus (Table 6). All GO terms, including obsolete terms and term metadata such as definitions, cross-references, and synonyms, are available in all three editions. These editions differ in the set of relations they contain:
• go-basic contains the types of information that have been available for GO from the beginning of the project; hence, it only contains is a, part of, regulates, negatively regulates, and positively regulates relationships and excludes relationships that cross different aspects (BP, MF, or CC) of the ontology. This edition of the ontology is guaranteed to be acyclic and can safely be used to selectively propagate annotations across any relation. It is recommended for most GO-based software tools.
• go additionally includes has part and occurs in relationships that link terms across different aspects of the ontology (for example, a BP can have a has part relation to a MF term or an occurs in relation to a CC). This edition is not acyclic, and annotations should not be propagated across all the relationship types that it contains. This edition should not be used in most software tools that rely on the GO.
• go-plus is the fully axiomatized edition of the ontology and includes cross-ontology relationships to external ontologies including ChEBI, Cell Ontology, and Uberon.
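The selective propagation mentioned for go-basic can be sketched as follows. This is a minimal illustration over a toy graph, not the procedure used by any GO tool: the term identifiers are hypothetical placeholders, and restricting propagation to is a and part of (written `is_a` and `part_of` here) is an assumption chosen as a conservative default. The toy graph also contains one has part edge to show that relations outside the chosen set are not followed.

```python
# Toy ontology fragment: child term -> list of (relation, parent term).
# Term names are hypothetical placeholders, not real GO identifiers.
PARENTS = {
    "TERM:D": [("is_a", "TERM:B"), ("part_of", "TERM:C")],
    "TERM:B": [("is_a", "TERM:A")],
    "TERM:C": [("is_a", "TERM:A"), ("has_part", "TERM:X")],
    "TERM:A": [],
    "TERM:X": [],
}

# Relations over which annotations are propagated (an assumption).
SAFE_RELATIONS = {"is_a", "part_of"}


def ancestors(term, parents, relations=SAFE_RELATIONS):
    """Return all terms reachable from `term` over the given relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in parents.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents, relations=SAFE_RELATIONS):
    """Extend each gene's direct annotations with all ancestor terms."""
    result = {}
    for gene, terms in annotations.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents, relations)
        result[gene] = closure
    return result


direct = {"geneA": {"TERM:D"}}
propagated = propagate(direct, PARENTS)
```

In this toy example, the annotation of geneA to TERM:D is extended to TERM:B, TERM:C, and TERM:A, but not to TERM:X, because the has part edge is not in the set of relations being followed.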

Ontology subsets (GO slims)
GO subsets are condensed versions of the GO containing a portion of the terms, which are specified by tags within the ontology files that indicate if a given term is a member of a particular subset. GO subsets are particularly useful for providing a global overview of the range of functions and processes found in a given clade or organism's genome, or even of all the functions of a single gene. We have recently revised the "GO Generic subset," a subset maintained by the GO Consortium that aims to be general and applicable to any species. We have tested that the subset covers as many gene products as possible in various organisms (human, D. melanogaster, fission yeast, Arabidopsis thaliana, and Escherichia coli) with as little redundancy as possible. This new GO Generic Subset contains 75 BP terms, 40 MF terms, and 29 CC terms. The GO generic subset can be accessed at http://current.geneontology.org/ontology/subsets/goslim_generic.obo. Versions in .owl, .json, and .tsv are also available from http://current.geneontology.org/ontology/subsets/index.html. As part of the Alliance of Genome Resources, we have developed a widget that provides a graphical visualization of a gene's function in a "ribbon"-like display (Fig. 3). The widget can be customized to use any GO subset and uses the goslim_agr subset by default. This widget is implemented in the Alliance gene pages and in the UniProt entry pages. It accesses GO annotations using the GO API (application programming interface) and can be easily added to any webpage.

Fig. 2. GO-CAM model of the SARS-CoV-2-host interactions as displayed using the GO-CAM Pathway Widget (code available at https://github.com/geneontology/wc-gocam-viz) on the Alliance of Genome Resources gene pages (https://www.alliancegenome.org/gene/HGNC:20144#pathways). The model includes proteins from both humans (Hsap) and the SARS-CoV-2 virus (Scov2). A simplified representation of the causal model is shown in the main figure, in which each activity is labeled with the gene and organism. The model includes many additional details, which are displayed as "cards"; the inset shows the information for the activity of MAVS, which normally acts as a signaling adaptor located in the mitochondrial membrane. MAVS activity is suppressed directly by the SARS-CoV-2 M protein and indirectly by other SARS-CoV-2 proteins. Each of the "E" symbols on the right-hand side can be clicked to see the evidence for each assertion in the model.
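Returning to the subset tags described in the GO subsets section, the following minimal sketch shows one way that subset membership could be read from OBO-formatted text. It assumes that each term stanza begins with `[Term]`, that stanzas are separated by blank lines, and that membership is recorded on `subset:` lines. The sample stanzas and their `EX:` identifiers are hypothetical placeholders, not real GO terms, and a dedicated OBO parser would be preferable for production use.

```python
# Hypothetical OBO-style stanzas; identifiers are placeholders.
SAMPLE_OBO = """\
[Term]
id: EX:0000001
name: example process
namespace: biological_process
subset: goslim_generic
subset: goslim_agr

[Term]
id: EX:0000002
name: example function
namespace: molecular_function

[Term]
id: EX:0000003
name: example component
namespace: cellular_component
subset: goslim_generic
"""


def terms_in_subset(obo_text, subset_name):
    """Return the ids of [Term] stanzas tagged with the given subset."""
    members = []
    for stanza in obo_text.split("\n\n"):
        lines = [line.strip() for line in stanza.strip().splitlines()]
        if not lines or lines[0] != "[Term]":
            continue  # skip headers and non-term stanzas
        term_id = None
        subsets = set()
        for line in lines[1:]:
            tag, _, value = line.partition(": ")
            # Drop any trailing "! comment" text from the value.
            value = value.split(" ! ")[0].strip()
            if tag == "id":
                term_id = value
            elif tag == "subset":
                subsets.add(value)
        if term_id and subset_name in subsets:
            members.append(term_id)
    return members


generic_terms = terms_in_subset(SAMPLE_OBO, "goslim_generic")
```

Because a term can carry several `subset:` tags, the same ontology file can serve any number of subsets without duplicating terms.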

GO annotations
The two major sites for downloading GO annotations are geneontology.org and UniProt-GOA. geneontology.org is the website developed by the GO Consortium. This downloads site (http://current.geneontology.org/products/pages/downloads.html) provides a total of 7.5 million human and model organism annotations contributed by multiple groups. It contains all manually reviewed GO annotations and electronic (computationally predicted) annotations for the most commonly used organisms. For model organisms, all annotations use gene identifiers from the authoritative database (for example, FlyBase (FBgn), WormBase (WBGene), and SGD (S)). Human and other organisms without an authoritative dedicated database are represented by UniProtKB accession numbers. For these organisms, the GO website provides annotations to UniProt reference proteomes (https://www.uniprot.org/help/reference_proteome), which are generally one entry per gene, thus limiting redundancy in annotations. UniProt-GOA (https://www.ebi.ac.uk/GOA/uniprot_release) contains 1 billion annotations for all entries in UniProt (1,264,340 taxa), covering both reviewed (Swiss-Prot) entries of UniProt, and unreviewed (TrEMBL) entries. All annotations for the model organism genes are converted to UniProt protein identifiers. For most organisms, all annotations are electronic annotations generated via various pipelines (see above for evidence codes and references for different methods). In addition to these resources, GO annotations are also viewable in a number of biological databases, including model organism databases, UniProt (UniProt: the universal protein knowledgebase 2017), NCBI (Sayers et al. 2020), and The Alliance of Genome Resources (Alliance of Genome Resources Consortium 2022). These sites show GO annotations in the broader context of a gene product's expression pattern, phenotypes, metabolic and signaling pathways, etc.
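To illustrate the distinction between electronic and other annotations in a downloaded file, the following minimal sketch tallies annotations by evidence type. It assumes that the file uses the tab-delimited GAF layout, in which lines starting with "!" are comments and the evidence code is the seventh column, and that electronic annotations carry the IEA evidence code. The sample rows are invented for the example (and truncated to the first nine columns); they are not real annotations.

```python
from collections import Counter

# Hypothetical GAF-style rows (tab-separated, first nine columns only).
SAMPLE_GAF = "\n".join([
    "!gaf-version: 2.2",
    "\t".join(["ExDB", "EX001", "geneA", "enables",
               "EX:0000002", "EX_REF:1", "IDA", "", "F"]),
    "\t".join(["ExDB", "EX001", "geneA", "involved_in",
               "EX:0000001", "EX_REF:2", "IEA", "", "P"]),
    "\t".join(["ExDB", "EX002", "geneB", "located_in",
               "EX:0000003", "EX_REF:3", "IEA", "", "C"]),
])

EVIDENCE_COLUMN = 6  # zero-based index of the seventh column (assumed GAF layout)


def count_by_evidence_type(gaf_text):
    """Count annotations as 'electronic' (IEA) or 'other'."""
    counts = Counter()
    for line in gaf_text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        evidence = fields[EVIDENCE_COLUMN]
        counts["electronic" if evidence == "IEA" else "other"] += 1
    return counts


evidence_counts = count_by_evidence_type(SAMPLE_GAF)
```

A tally of this kind gives a quick sense of how much of a species' annotation set rests on computational prediction rather than manual review.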

Conclusions and future directions
The extensive and wide-ranging use of the GO knowledgebase, evidenced by its recent, peer-reviewed designation as a Global Core Biodata Resource (https://globalbiodata.org/scientific-activities/global-core-biodata-resources/), demands its continued development and expansion. We are focusing on several high-priority areas of development for the near future. For pathways, we will continue to accumulate GO-CAM models. The UniProt/Swiss-Prot curation team has ramped up their production of GO-CAM models and we expect to add models at a rapid rate. In parallel, we have started converting Reactome pathways into GO-CAMs (Good et al. 2021) and expect to release GO-CAM representations of most Reactome metabolic pathways in the near future. This will provide a complementary, causal flow representation of the chemical reaction-centered representation in Reactome. Conversion of Reactome signaling pathways is more challenging and will be released somewhat later. We are also working on converting the YeastPathways resource (https://pathways.yeastgenome.org) into GO-CAMs, making a large number of yeast metabolic pathways available. The increasing number of GO-CAM models will allow us to expand on the utility of these highly structured pathway and process representations. Some potential areas are automated pathway visualization, using the causal links and more granular gene sets to enhance enrichment analysis, and better generation of automated descriptions of gene function (e.g. Kishore et al. 2020).
With respect to ontology development, in addition to continuing to revise the ontology in response to recent discoveries, we see an immediate need for clearly delineating the level of biological organization at which a function is described. This includes distinguishing MFs from BPs, and distinguishing BPs that occur at the level of individual cells from those that occur at the level of multicellular organisms. For example, the term "homeostasis" (the maintenance of a roughly steady level of a molecule or ion) is used very broadly in the literature to refer to both processes that maintain a steady-state level within a cell and processes that maintain a steady state in blood or other fluid that is transported within a multicellular organism. Even in some publications, it is difficult to know which type of homeostasis is being tested.
We will continue to make the GO knowledgebase easier to use and more community-driven. One near-term priority is to make annotations available for download by species, with a single identifier for each distinct gene. We are also planning to create quick-start guides for common GO use cases, in both written and video forms. The immense user base of the GO and the need for much improvement and extension drive us to consider how to expand the number of people who contribute to the GO. From its inception, the GO has been a large, open, community project. However, we are planning additional routes through which the broader GO user community can contribute their expert feedback and knowledge to GO, improving the resource for all users. For now, users are encouraged to contact the GO Helpdesk (http://help.geneontology.org/) with any questions or to report any GO ontology terms or annotations that may be inaccurate or difficult to interpret.

Table 6. GO editions. Editions are distinguished by the relations and metadata that they include. All editions are updated at each GO release. External ontologies used in GO include the following: ChEBI, Uberon (Haendel et al. 2014), Relation Ontology (Smith et al. 2005), Cell Ontology (Diehl et al. 2016), Sequence Ontology (Mungall, Batchelor and Eilbeck 2011), Dicty Anatomy, CARO (Haendel et al. 2008), Fungal Anatomy Ontology (Fungal-Anatomy-Ontology, 2020), Plant Ontology (Walls et al. 2019), PATO (Gkoutos, Schofield and Hoehndorf 2018), and Protein Ontology (Natale et al. 2017).

Data availability
All GO code and resources are freely available for download and reuse. Software