OMA orthology in 2024: improved prokaryote coverage, ancestral and extant GO enrichment, a revamped synteny viewer and more in the OMA Ecosystem

Abstract In this update paper, we present the latest developments in the OMA browser knowledgebase, which aims to provide high-quality orthology inferences and facilitate the study of gene families, genomes and their evolution. First, we discuss the addition of new species in the database, particularly an expanded representation of prokaryotic species. The OMA browser now offers Ancestral Genome pages and an Ancestral Gene Order viewer, allowing users to explore the evolutionary history and gene content of ancestral genomes. We also introduce a revamped Local Synteny Viewer to compare genomic neighborhoods across both extant and ancestral genomes. Hierarchical Orthologous Groups (HOGs) are now annotated with Gene Ontology annotations, and users can easily perform extant or ancestral GO enrichments. Finally, we recap new tools in the OMA Ecosystem, including OMAmer for proteome mapping, OMArk for proteome quality assessment, OMAMO for model organism selection and Read2Tree for phylogenetic species tree construction from reads. These new features provide exciting opportunities for orthology analysis and comparative genomics. OMA is accessible at https://omabrowser.org.


Introduction
Genes which are related through speciation are called orthologs, as opposed to paralogs, which are related through duplication ( 1 ).This distinction is useful in a wide range of contexts, including phylogenetic tree inference, protein function prediction, or whole genome alignment (reviewed in ( 2 )).
For more than 18 years, the OMA ('Orthologous Matrix') database has elucidated orthologs among complete genomes across the entire tree of life (3)(4)(5)(6)(7).In this update paper, we report on the most recent developments, including new and updated species, ancestral Gene Ontology annotations and synteny reconstruction, and an overview of the tools we provide to facilitate the use of the OMA knowledgebase.

New and updated species-a better sampling of prokaryotic phylogenetic diversity
In this update, we've significantly broadened the representation of prokaryotic species within the OMA database.Since the last OMA paper, the number of bacteria species we include has risen from 1607 to 1965 in our latest release (+22%), and the number of archaea species from 152 to 173 (+14%; Figure 1 ).This new sampling aims to reflect the massive expansion of the described prokaryotic diversity recently enabled by metagenomics ( 8 ).Notably, we now include genomes from the novel Patescibacteria bacterial phylum (also known as the Candidate Phyla Radiation), as well as the DPANN and Asgardarchaeota archaeal lineages ( 8 ).This improved coverage of the prokaryotic diversity allows for a more comprehensive understanding of microbial genome evolution.
Our sampling strategy aims for a balanced representation, as we sought to include at least one species from each class in the Genome Taxonomy Database (GTDB ( 9 )).The representative genome for each taxonomic class was chosen based on the completeness and contamination as assessed by CheckM ( 10 ), the length of contigs and the species' importance within the class, reflected by the number of available genomes for this species.
Since the latest OMA paper we have also added 230 new Eukaryotic genomes (+47%).These additional genomes were selected to improve the sampling diversity.The species present in the OMA database can be found from the home page under Explore → Species / Release information .The prioritization of new and updated genomes is informed by our users, so we invite researchers to suggest new or updated genomes by filling in the following form: https:// omabrowser.org/suggest .

Ancestr al g enome g ene cont ent and g ene order
The OMA database uses HOGs to model ancestral genomes corresponding to each internal node of OMA's sampled Tree of Life.Conceptually, HOGs can be thought of as ancestral genes, as they encompass orthologs and paralogs descending from a common ancestral gene at a specific taxonomic level.Thus, the HOGs are proxies for ancestral genes in a common ancestor and the collection of HOGs at a given level are proxies for ancestral genomes.
To facilitate ancestral genome exploration, we now offer a total of 1133 Ancestral Genome pages on the OMA browser (Figure 2 A).Ancestral genomes can be accessed by searching for a specific taxa name or identifier, or by the Explore → Quick access to → Extant and ancestral genomes .These dedicated pages show all the HOGs specific to chosen taxonomic levels.From these pages, users can access details about that ancestral genome, including the number of descendant species found in the OMA database and the number of genes inferred to have existed in this common ancestor.The genes are displayed in a table by clicking on Ancestral genes in the left-hand menu.Recognizing that inferred counts can sometimes be influenced by HOG inference errors, particularly in the presence of highly divergent sequences, an option to filter for high-quality HOGs is provided.We define this 'Completeness Score' as the number of species included in the HOG divided by the total number of species in the clade, thus it ranges from 0 to 1 (default: 0.3).HOGs with a low Completeness Score could be dubious, as it would imply a high number of gene loss events.
HOGs offer insight into the evolutionary history of genes, thus the Ancestral Genome pages offer access to details concerning the evolutionary history of the HOGs present at that taxonomic level, which we compute with pyham ( 12 ).The left-hand menu displays the parental genome used to determine the evolutionary events that happened on the branch leading to the ancestor of interest.For example, if investigating the Mammalia genome, 'Amniota' is chosen as the most recent parental genome.The column 'Evolutionary event' in the Ancestral Genes table gives the status of the gene from the parental ancestor to the ancestor of interest.In this context, 'retained' denotes a gene's consistent presence as a single copy from the parent to the child.'Duplicated' signifies that the gene arose due to a duplication event between the previous common ancestor and the ancestor of interest.In contrast, 'gained' signifies OMA's inference that the gene emerged at the chosen taxonomic level (i.e. a root HOG).In the example in Figure 2 A, the evolutionary events refer to what happened to the gene on the branch leading from Amniota to Mammalia.The evolutionary event displayed in the table can be filtered using the left-hand menu.We also give the option to display 'Lost' genes, which are genes present in the parental genome, but not in the ancestral genome of interest.These 'phylostratigraphic' gene pools hold potential for functional enrichment-an avenue of exploration detailed in the section 'Ancestral and extant Gene Ontology enrichment analysis'.
As opposed to the previous version of OMA in which an ancestral genome was solely represented as a flat list of genes, the last update integrates ancestral gene order inferences to represent an ancestral genome as a collection of inferred contiguous regions of genes.An ancestral contig is depicted as a linear graph that can be interactively explored in our new ancestral gene order viewer by clicking on the 'Ancestral Gene Order' section of any Ancestral Genome page (Figure 2 B).In this representation, a node corresponds to a HOG inferred to be a gene present in the ancestral genome, and each edge links two HOGs inferred by parsimony to be of closest proximity, with a weight equal to the number of supporting contexts in descendant extant genomes.Users can visualize the inferred contiguous regions of genes, ordered by decreasing length.Within each ancestral contig, hovering over an edge will display its weight, while clicking on a node will display the identifier of the corresponding HOG, its functional description and the list of associated GO terms ranked by information content (see 'HOG (ancestral gene) annotation' section below).
This additional level of resolution introduced with ancestral gene order inferences unlocks two key applications.First, it makes it possible to track the chain of genomic rearrangements that occurred during the evolution of lineages, with opportunities to flag past rearrangements that correlate with adaptation (13)(14)(15).Second, the reconstructed gene orders allow for identifying conserved genomic neighborhoods, which can be indicative of functional coupling between adjacent genes ( 16 ,17 ).Within any clade of interest, including clades as old as Eukaryota, Bacteria or Archaea, the corresponding D 515 Figure 1.Impro v ed co v erage of prokary otic div ersity in the latest OMA browser release (J uly 2023).T he tw o panels displa y the bacterial tree ( A ) and the archaeal tree ( B ) from GTDB release 207 ( 9), collapsed at the le v el of a GTDB phylum and labeled according to the updated nomenclature of prokaryotic phyla ( 11 ).Leaves with an assembly identifier label correspond to deep branching genomes without closely related genomes that are treated by GTDB as forming a phylum on their own.The height of each histogram bar is indicative of the proportion of the total number of bacterial (A) or archaeal (B) genomes a v ailable in OMA originating from the corresponding ph ylum.Green reflects the number of genomes in the Aug2020 OMA release, and red the additional genomes in the Jul 2023 release.Despite the bacteria coverage being heavily biased towards the Pseudomonadota (formerly P roteobacteria), B acillota (f ormerly Firmicutes) and A ctinom y cetota (f ormerly A ctinobacteria) ph yla, with the updated co v erage, almost all ph yla in GTDB ha v e at least one representative.

Ancestral and extant local synteny viewer
In this update, we took advantage of the inferred ancestral contiguous regions of genes to propose a new unified, genecentric local synteny viewer.For any query HOG or extant gene, it is now possible to visualize the multiple alignment of its inferred genomic neighborhood at the taxonomic level of investigation with that of all its descendant orthologs and paralogs in intermediate and extant descendant genomes (Figure 3 ).This kind of unified local synteny viewer enabling interactive comparison of both extant and ancestral genomic neighborhoods in a phylogeny-aware context has been provided by specialized databases-notably the Yeast Gene Order Browser ( 25 ) and Genomicus ( 26 ), but the scale and depth of our reconstructions is unprecedented.
At the top of the view, the reference ancestral neighborhood of the query HOG is shown.It is composed of the central query HOG and up to five inferred flanking HOGs in both directions.To the left of the view, the reconciled gene tree derived from the query HOG is shown.The gene neighborhoods of extant genes in the query HOG are displayed to the right, corresponding to the leaves of the gene tree.Genes which descended from the same HOG have the same color, and genes which are colored grey indicate no inferred homology to the genes in the query HOG neighborhood.These features allow for visualizing and investigating the conservation of the genomic neighborhoods of a gene family in a phylogeny-aware context.
Users can collapse clades of extant genes.Upon collapsing, the collapsed clade of extant genes becomes represented by the inferred neighborhood of their last common ancestral gene (i.e. the HOG).This option provides a way to compare the genomic neighborhood of a gene in a more ancient ancestor to the descendant genes in intermediate ancestors or extant species.Users can continue to collapse older clades, further customizing the display.In addition to allowing for a more compressed visualization, collapsing nodes can also be

HOG (ancestral gene) annotation
The hierarchical nature of the HOG structure facilitates exploration of gene family and species evolution.By adding functional information to the HOGs, we can gain further insight into the functional evolution of genes and genomes, and track when specific functions appeared throughout the Tree of Life.We now annotate the HOGs at every level in the OMA taxonomy with Gene Ontology (GO) terms.Using the framework of the HOGs and the HOGProp method ( 27), we parsimoniously propagated GO terms from extant genes to the HOGs that encompass them.Thus, HOGProp yields ancestral genes annotated with GO terms.
In the latest release, we started with 486 837 274 GO annotations for 14 107 684 (63.86%) proteins over all extant species in OMA, and used this information to annotate 26 512 830 (88.14%)HOGs with at least 1 GO term, across 507 406 (55.58%) root-level HOGs.Of the root-level HOGs with GO annotations, the average number of GO terms is 27.This represents functional annotation for a significant proportion of the ancestral genomes at key taxonomic levels (Table 1 ).Examples of key clades in the OMA database, the total number of HOGs at that taxonomic level, and the number / percentage of HOGs annotated with at least 1 GO term.Note that the number of HOGs displayed in the table are unfiltered for HOG quality, thus they may not reflect the true number of ancestral genes.
Ancestral GO annotations are available from the HOG Group pages on the left-hand menu.

Ancestral and extant Gene Ontology enrichment analysis
Knowing the functions of ancestral genes allows for functional enrichment analysis.Traditionally performed on extant genomes, this analysis identifies functions that are over-represented in a study set (foreground) compared to the population (background).
Users can now carry out Gene Ontology enrichment analyses (GOEA) for sets of both extant genes or ancestral genes using the OMA browser.Extant gene GO enrichment is carried out as described in ( 28 ), where the study set is a userdefined set of extant genes, and the population is all genes in the extant genome of interest.For our newly-introduced ancestral gene GO enrichment feature, we exploit the HOG GO annotations by considering a study set of user-defined HOGs at a given taxonomic level, compared to the population of all HOGs defined at a given taxonomic level.Each GO term also implies its parental terms ( 29 ), so GO annotations are typically stored in a non-redundant manner, as a set of the most specific terms necessary to reconstruct the annotation.After propagating the more general GO terms for each member of the analysis, uncorrected P -values for overrepresentation of the study set compared to the population are generated using Fisher's exact test.Corrected P -values are obtained using either Bonferroni correction or the Benjamini-Hochberg procedure.To our knowledge, this method is the first instance of ancestral gene enrichment analysis available on a user-friendly browser.
To enable users to perform GO enrichments on their own sets of ancestral or extant genes of interest, we offer both a browser interface and API functionality.The interface is straightforward: select the analysis type (Extant or Ancestral) and specify a study set of genes (Figure 4 A).The input must be a list of identifiers separated by spaces, tabs, commas, or new lines.In the case of extant GO enrichment, OMA identifiers or other cross-reference ids are accepted; for ancestral GO en- richment, HOG identifiers from a specific taxonomic level are necessary (e.g.HOG:D0639603.1b)and must be indicated.For extant enrichments, OMA will identify the appropriate taxonomic level and report it in the output.
Upon computation, the results of the GO enrichment are displayed (Figure 4 B, C).A table lists the enriched GO terms, featuring the Gene Ontology unique identifier, GO term name, which one of the three sub-ontologies that the term belongs to (MF: Molecular Function, BP: Biological Process, CC: Cellular Component), the uncorrected P -value as determined by Fisher's exact test, the P -value corrected for multiple testing using the Bonferroni and Benjamini-Hochberg procedure, the number, ratio and proportion of genes in the study set and population which are annotated with the GO term, and finally the fold-change between the study set and population.
We also provide up to three plots to summarize the significant GO terms for the MF , BP , and CC categories.Here, as in REVIGO and GO-Figure ( 30 ,31 ), the two dimensions represent a semantic space for the significant GO terms, resulting from a multi-dimensional scaling of the pairwise simRel semantic similarity measures ( 32 ).GO terms will appear closer to each other the more semantically similar they are.The bubble colors denote corrected P -values and the bubble size reflects the information content ( 33 ).Bubbles provide term and enrichment details upon hover.Full results including the table, plots, study set and population set (including the taxonomic level if inferred) can be downloaded in compressed files.
The GO enrichment analysis form can be accessed from the home page ( Tools → GO enrichment analysis ) or from an Extant or Ancestral Genomes page.From an Ancestral Genome page, GO enrichments can be performed on subsets of genomes by selecting phylostratigraphy categories (retained, duplicated, lost, or gained).

Impro v ed sear c h engine
We have made significant enhancements to the OMA browser's search engine, resulting in a less ambiguous and more effective searching experience.This improvement centers on providing contextual understanding to query terms, which in turn leads to faster and more accurate search results.
This optimization is driven by the indexing of data such as the gene or HOG identifiers.The revamped search engine is built upon a token-based framework that relies on visual cues and logical associations.Users initiate a search by entering a query into the search field and pressing Space or Enter, generating a distinct token.These tokens consist of two parts: a prefix (field) that guides the search, and the query itself.
Tokens can be of different types, each representing a specific category: Gene, HOG, OMA Group, or Taxon.The prefixes of the tokens define the category that should be associated with the corresponding query term.For example, if a user inputs the token [go:4225], the search engine will look for genes in the OMA database that are annotated with the GO:0004225 Gene Ontology term.Without a token, '4225' could refer to a taxon, HOG, or gene identifier, resulting in slower and more ambiguous searches.
Multi-word queries are supported if users enclose them in quotation marks (e.g.[species:'homo sapiens']).Editing queries is possible by clicking on them to modify the input field.Users can select and adjust prefixes by clicking on the dropdown icon to select a different one.Additionally, if one starts typing a query without hitting Enter or Space, OMA will automatically suggest identifiers after a few seconds, streamlining the search process.
Combining different tokens opens up possibilities for advanced searches.For instance, a search input like [hog:60627 species:HUMAN] would yield human genes found in HOG:606207.D 519

New applications and tools in the OMA Ecosystem
The OMA Ecosystem is the collection of tools, software, data downloads, interactive visualizations and other resources, which are associated with the OMA knowledgebase.The main entry point is the OMA browser.In this section, we highlight some tools which are recent editions to the OMA Ecosystem.

OMAmer: mapping proteomes to existing HOGs
The OMA database contains HOGs with species covering a wide diversity of taxa.However, researchers are often interested in finding orthologs for proteins in other species.The OMAmer ( 34 ) software offers the possibility to quickly place protein sequences into existing HOGs from the OMA database.
OMAmer is a tool that helps identify orthologs and paralogs more effectively compared to sequence similarity-based methods.The limitation of naive methods is that the closest sequence may not necessarily belong to the same subfamily due to differences in evolutionary rates across the tree, among other complications ( 34 ).OMAmer overcomes this issue by using evolutionarily-informed k -mers for alignment-free mapping to HOGs.One of the key advantages of OMAmer is its ability to map not only to root HOGs (gene families), but also to the correct subfamilies.This level of delineation is crucial when studying the functional divergence of paralogs.OMAmer has demonstrated higher accuracy than approaches merely based on maximizing sequence similarity ( 35 ).
OMAmer users can quickly identify the most likely HOG for one or multiple proteins of interest.The OMA Browser provides an asynchronous tool that allows users to upload their custom proteomes and map them to HOGs.This tool has replaced the previous implementation of Fast Mapping, which was not based on OMAmer.The user can access the tool from the home page → Tools → Fast Mapping with OMAmer .
Alternatively, users can perform OMAmer searches locally, by using the software available at https://github.com/DessimozLab/omamer .A relevant OMAmer database is required, for which we now offer four precomputed OMAmer databases at each new release of the OMA database: one global database (LUCA.h5)and three clade-specific databases for more targeted searches.These OMAmer databases are available in the Download → Current release of the Browser and enable the processing of entire proteomes in minutes, allowing users to perform phylogenomics analyses without the computational cost of direct orthology calling.This is an option for users who would like to quickly and easily benefit from OMA's wealth of precomputed data for their dataset.
By utilizing OMAmer, users can then explore the available information for the identified gene family in the OMA browser or export the sequence data for further analysis.This streamlined process makes it easier for researchers to analyze their custom proteomes and gain insights into orthologs and paralogs.

OMArk: quality assessment of proteomes
OMArk is a new tool for quality assessment of proteomes, which can estimate not only the completeness, but also the prevalence of fragments, contamination and dubious gene models.OMArk uses OMAmer to place sequences rapidly into HOGs in the OMA database and then exploits the evolu-tionary information of the HOGs at that taxonomic level (i.e.presence of single-copy or duplicated genes) ( 36 ).Comparing user-defined proteomes to that which is expected, based on the OMA database, allows for evolutionarily-informed quality assessment.OMArk additionally identifies likely contamination within proteomes.
With the regular addition of high-quality and taxonomically diverse genomes to the OMA browser, we expect the resolution of OMArk estimates to improve with each new release.Since the November 2022 release, we now use OMArk quality assessments to update the OMA Browser by replacing lower quality proteomes with higher quality ones where available and to improve taxonomic diversity.This will continue to be part of our process for proteome selection and updates in the future, in combination with community requests.
A few examples of decisions that were influenced by OMArk in the latest release are as follows: we removed Drosophila biarmipes from our dataset due to contamination by Saccharomyces cerevisiae sequences.We added new birds ( Haliaeetus leucocephalus, Falco tinnunculus, Corvus moneduloides, Tyto alba ) and butterflies ( Vanessa tameamea and Papillo machao ), guided by OMArk quality metrics.In the latest release, we switched to the RefSeq annotation of the chicken genome and updated the tardigrade, Hypsibius dujardini, also guided by OMArk quality metrics.
With the increase in genomic data, OMArk will continue to help improve the OMA Browser dataset by improving orthology calling and other downstream analyses, which first depend on the quality of the underlying annotation data.OMArk is available at https:// github.com/DessimozLab/ OMArk and online at https://omark.omabrowser.org .

OMAMO: selecting the best model organisms for a biological process of interest
The conservation of pathways across species enables researchers to use non-human organisms as models.However, the most widely-used model species include higher animals such as mice and zebrafish, which is costly, time-consuming, and which requires ethical consideration.These problems highlight the need for identifying less complex model species.Existing tools that focus on gene conservation across species for identifying model organisms only consider a very sparse set of single-cell species or none at all.To address these issues, we developed 'Orthologous Matrix and Model Organisms' (OMAMO) ( 37 ), a software tool that is integrated into the OMA browser.It identifies the most suitable less complex organism for research, given a biological process of interest.The OMAMO algorithm relies on the fact that orthologous genes tend to be functionally conserved and have similar expression patterns, unlike other types of homologs ( 38 ).It uses GO annotations of orthologous species-human pairs to estimate the functional similarity of genes in a given biological process, thereby providing a pathway-oriented approach for model organism search.The software can be used through the OMA browser ( https:// omabrowser.org/oma/ omamo/ search/ ) by searching a biological process GO term.

Read2Tree: building phylogenetic species trees from reads
Phylogenetic trees play a crucial role in various contexts, but constructing high-quality trees remains highly challenging due

Figure 2 .
Figure 2. Ancestral Genome page for Mammalia.( A ) Screenshot of the Ancestral Genes table.Here, each row is a HOG, i.e. ancestral gene, at the tax onomic le v el of interest.For each gene, the H O G identifier f or that le v el and the H O G identifier of the entire gene f amily ('R oot H O G ID') is sho wn.T he 'Ev olutionary e v ent' column reflects if the gene w as retained in a single cop y, duplicated, or gained (originated) on the branch leading to this tax onomic le v el.T hese e v ents can be filtered using the side menu.( B ) Screenshot of the Ancestral Gene Order vie w er.Here, the reconstructed gene order is shown: HOGs at that taxonomic level are represented as rectangles, connected based on evidence of gene order in the extant genomes.Node color indicates the Completeness Score (0 = red, 1 = dark grey), and node size is the number of extant genes in the HOG.

D 517 Figure 3 .
Figure 3. Ancestral and extant unified local synteny viewer.The boxes represent extant or ancestral genes, with genes of the same color related by homology, i.e. found in the same HOG at the taxonomic level of interest.The focal HOG (or extant gene) and its homologs are outlined in black, with their names shown below.In this example, the focal HOG displayed is the cellular tumor antigen p53 at the Mammalia level.The top row is the reconstructed genomic neighborhood of this gene in the Mammalian common ancestor.The reconciled gene tree is displayed to the left, with the genomic neighborhood of each ancestral or extant genome shown next to it.Duplication nodes are shown in red.The viewer is interactive in that users can collapse nodes on the tree to display the genomic neighborhood in the ancestor corresponding to that node.Clicking on an ancestral gene displays the H O G ID and the functional description of the H O G, and clicking on an e xtant gene displa y s the identifiers, sequence length, chromosome, description and the H O G it belongs to.

Figure 4 .
Figure 4. Extant and ancestral Gene Ontology enrichment analysis.( A ) User interface to submit a study set of genes.( B ) The resulting table displa y s the GO terms o v er-represented in the study set with a Benjamini-Hochberg FDR corrected P -value ≤0.05 Notably, users have access to the study and population sets, as well as the entries in the study set annotated with the GO term.( C ) The Biological Process output plot showing enriched GO terms.Here, each bubble represents an enriched GO term, with the P -value indicated by color.The size of the bubble is proportional to the information content.Mousing over the bubbles gives the GO name, GO identifier, corrected P -value and information content.(Not shown: CC and MF.)

Table 1 .
GO annotation co v erage