Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges, enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster and offers more functionality. In total, eggNOG v3 contains 721 801 orthologous groups, encompassing a total of 4 396 591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101 208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is supplemented with multiple sequence alignments and maximum-likelihood trees, and broad functional descriptions are provided for 450 904 orthologous groups (62.5%).
Orthology, defined as homology via speciation (1), is a crucial concept in evolutionary biology and is essential for disciplines such as comparative genomics, metagenomics and phylogenomics. The concepts of orthology and paralogy, with the latter being defined as homology via duplication (1), have been used as a foundation to introduce the concept of clusters of orthologous groups: proteins that have evolved from a single ancestral sequence existing in the last common ancestor (LCA) of the species that are being compared, through a series of speciation and duplication events (2). Orthologous groups (OGs) have proven useful for functional analyses and the annotation of newly sequenced genomes (3–5) as orthologs tend to have equivalent functions (6).
A number of orthology prediction methods have been recently introduced that can be classified into (i) graph-based methods, from the reciprocal-best-hit approach (7) to more sophisticated methods, such as the identification of best-hit triangles (2,8–11) and other clustering-based approaches (12–15) or (ii) tree-based methods that can be further classified into methods that use tree reconciliation to infer orthologs (16–19) and those that do not (20,21). Their methodological advantages and disadvantages have been reviewed in refs (22–24).
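As an illustration of the simplest graph-based method mentioned above, the reciprocal-best-hit approach, the following minimal Python sketch may be helpful. The score tables, protein names and function names are hypothetical and are not taken from any eggNOG code; a real implementation would read BLAST or FASTA output and would also have to deal with ties and score thresholds.

```python
from typing import Dict, Set, Tuple

# Hypothetical similarity scores; keys are (query, subject) and higher
# scores mean greater similarity.
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Return the highest-scoring subject for every query."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Toy data for two genomes A and B (illustrative only).
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b1"): 120.0}
b_vs_a = {("b1", "a1"): 245.0, ("b1", "a3"): 118.0,
          ("b2", "a2"): 175.0, ("b2", "a1"): 88.0}

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(sorted(rbh_pairs))
```

In this toy example a3 has b1 as its best hit, but b1 prefers a1, so a3 is left without a reciprocal partner. Cases of this kind are one reason why the more elaborate graph-based and tree-based methods cited above were developed.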
An important point is that OGs depend on their taxonomic context. The broader the taxonomic range, the deeper the LCA is placed, resulting in larger OGs with lower resolution of the orthologous relationships. Conversely, a narrower taxonomic range yields more fine-grained groups. For this reason, the first and most successful resource, COG (2), provided OGs for certain taxonomic ranges, namely COGs for all three domains of life, KOGs for Eukaryotes (8) and arCOGs for Archaea (9). Some automatic orthology prediction methods also provide distinct sets of OGs for an increasing number of taxonomic groups [e.g. OrthoDB (10), eggNOG (11) and OMA (12)].
The functional annotation of OGs is particularly necessary, as functional insights from well-studied proteins/species can be transferred to uncharacterized orthologs. Moreover, several genome annotation tools [e.g. (25)] use the functional annotations of OGs to automatically map function information to large-scale genomic data. The most common form of orthologous group annotation is a consensus-based (longest common string) approach (9,12,18,21,26) in which the description of the OG is derived from available annotations of the member proteins. Only a few available resources conduct a more robust manual annotation of the groups (8) or incorporate multiple annotation sources for the description and annotate the groups with functional categories (8,11).
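The consensus-based idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example annotations are hypothetical, and the pairwise reduction used here is a simple heuristic that is not guaranteed to find the longest substring shared by all members. It is not the annotation procedure of eggNOG or of any of the cited resources.

```python
from difflib import SequenceMatcher
from functools import reduce
from typing import List


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous block of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def consensus_description(annotations: List[str]) -> str:
    """Fold the member annotations into one shared description.

    Heuristic: the running consensus is compared with each further
    annotation in turn, so the result can depend on the input order.
    """
    if not annotations:
        return ""
    return reduce(longest_common_substring, annotations).strip()


# Hypothetical annotations of three members of one orthologous group.
member_annotations = [
    "DNA polymerase III subunit alpha",
    "DNA polymerase III alpha subunit",
    "putative DNA polymerase III subunit",
]
consensus = consensus_description(member_annotations)
print(consensus)
```

The example also hints at the weakness of this approach: differences in word order or a single uninformative member annotation can shorten the consensus considerably, which is one motivation for combining several annotation sources.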
Here, we describe the third version of eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups), a database that provides orthologous groups for 943 Bacteria, 69 Archaea and 121 Eukaryotes. In total, 721 801 OGs have been computed including about twice as many orthologous relations for genes compared to the previous version. Most importantly, it contains considerably more taxonomically restricted OGs with higher resolution, covering 41 taxonomically relevant ranges such as Proteobacteria or Metazoans.
SELECTION OF GENOMES
We downloaded complete proteomes from RefSeq (27), Ensembl (28), UniProt (29), GiardiaDB (30), JGI (http://genome.jgi-psf.org/) and TAIR (31). This particular set of genomes also forms the basis for the most recent STRING (32) and STITCH (33) databases, allowing for easy integration across these databases.
The analyses were performed on 1133 complete genomes, encoding 5 214 234 proteins. The genomes were selected based on pertinence and quality. Apart from the many model organisms that were included in the database, the species were selected based on their taxonomic position to ensure a dense sampling of 41 selected taxonomical ranges (see below) as well as a broad coverage of the tree of life. As genome quality significantly affects the accuracy of orthology assignment (34,35), all genomes in eggNOG v3 were manually selected for genomic quality based on sequencing coverage and on genome completeness, as judged by the coverage of 40 phylogenetic marker genes (36,37).
CONSTRUCTION OF ORTHOLOGOUS GROUPS AT DIFFERENT TAXONOMIC LEVELS
The first step of the eggNOG pipeline is an all-against-all similarity search. Because the computational cost of such a search grows quadratically with the number of sequences, eggNOG v3 now uses the SIMAP database (38) for the required homology comparisons. SIMAP uses the FASTA heuristics (39), which are better than BLAST (40), previously used in eggNOG, at capturing sequences with a lower degree of similarity, at the cost of reduced speed.
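The quadratic growth can be made concrete with a short calculation. The sketch below counts unordered protein pairs for the 5 214 234 proteins reported above; counting unordered pairs rather than ordered query/subject comparisons is an assumption made for simplicity, and the function name is hypothetical.

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


N_PROTEINS = 5_214_234  # proteins in the 1133 genomes, as given in the text

pairs = unordered_pairs(N_PROTEINS)
print(f"pairwise comparisons: {pairs:.3e}")

# Doubling the number of proteins roughly quadruples the work.
ratio = unordered_pairs(2 * N_PROTEINS) / pairs
print(f"cost ratio after doubling the data set: {ratio:.4f}")
```

The number of pairs is on the order of 10^13, which illustrates why reusing precomputed similarities is attractive compared with recomputing them for every release.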
After the homology searches and the subsequent clustering step (11), 4 396 591 (84%) of all proteins investigated were assigned to at least one of the 721 801 orthologous groups generated by eggNOG (Figure 1). We extended the COGs, KOGs and arCOGs (8,9) to include all 1133 organisms, the 121 eukaryotic species and the 69 archaeal species, respectively. As an enhancement to the 4873 COGs, 4850 KOGs and 7538 arCOGs, additional groups have been created as non-supervised OGs (NOGs), eukaryote-specific NOGs (euNOGs) and archaea-specific NOGs (arNOGs), extending the original COGs/KOGs/arCOGs by 101 208 NOGs, 41 267 euNOGs and 11 387 arNOGs. To provide a higher resolution of orthologous groups in frequently used taxonomic ranks, we applied our procedure to several subsets of organisms separately. In addition to the levels of Eukaryotes (euNOGs) and Archaea (arNOGs), we now provide newly derived bacteria-specific NOGs (bactNOGs), so that all three domains of life are covered. Below these levels, the orthology is further resolved for 22 bacterial levels such as Firmicutes (firmNOGs), Proteobacteria (proNOGs) and Actinobacteria (actNOGs) (Figure 1), as well as for 14 major levels in the eukaryotic clade including Animals (meNOGs) and Fungi (fuNOGs).
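The nesting of these levels can be pictured as a small tree of group prefixes. The following Python sketch uses only the prefixes and taxonomic ranges named in the text; the parent links are an illustrative assumption that ignores any intermediate levels, and the dictionary layout and function name are hypothetical rather than part of the eggNOG data model.

```python
from typing import Dict, List, Optional, Tuple

# prefix -> (taxonomic range, parent prefix); parent links are assumed for
# illustration and skip any intermediate levels that may exist.
LEVELS: Dict[str, Tuple[str, Optional[str]]] = {
    "NOG": ("all three domains of life", None),
    "euNOG": ("Eukaryotes", "NOG"),
    "arNOG": ("Archaea", "NOG"),
    "bactNOG": ("Bacteria", "NOG"),
    "firmNOG": ("Firmicutes", "bactNOG"),
    "proNOG": ("Proteobacteria", "bactNOG"),
    "actNOG": ("Actinobacteria", "bactNOG"),
    "meNOG": ("Animals", "euNOG"),
    "fuNOG": ("Fungi", "euNOG"),
}


def lineage(prefix: str) -> List[str]:
    """Walk from a level up to the universal level."""
    path: List[str] = []
    current: Optional[str] = prefix
    while current is not None:
        path.append(current)
        current = LEVELS[current][1]
    return path


print(lineage("firmNOG"))
print(lineage("meNOG"))
```

Moving up such a lineage corresponds to placing the LCA deeper in the tree, so that a gene is expected to fall into larger and less finely resolved groups at each step.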
AUTOMATED ANNOTATION OF PROTEIN FUNCTION
An important feature of eggNOG v3 is the automatic functional annotation of the OGs. The groups are annotated with a function description based on the functional annotations of each protein member within the group (26) and in parallel with one of 25 functional categories (11) compatible with those provided by the COG and KOG databases (8).
In eggNOG v3, the functional annotation pipeline has similarly been optimized to scale to the large amount of data. This has led to a significant improvement in computation time while simultaneously increasing the total number of functionally annotated OGs. Between eggNOG v2 and eggNOG v3, for corresponding taxonomic levels, the total number of annotated OGs increased by 28.8% and 10.0% for function description and functional category, respectively. In summary, of the 721 801 OGs in eggNOG v3, 62.5% have a functional annotation and 47.6% have been classified into a functional category (for details see Figure 1).
As the exponential growth in the number of genomes, and of the genes therein, leads to considerable performance issues, a number of technical improvements and speedups have been introduced; for example, the parallelization of some key aspects of the OG pipeline has contributed to the performance enhancement.
One important step in the eggNOG pipeline is the inference of in-paralogs. Proteins that belong to a given subset of species and are more similar to each other than to proteins belonging to species outside that subset are defined as in-paralogs. In this release, we determined the aforementioned subsets automatically: for the universal, domain- and phylum-specific OGs, we grouped organisms within the same taxonomic order. For taxonomical ranges between the phylum and class, we used the taxonomical family, while for ranges below the class level we grouped given species together.
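The in-paralogy criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the eggNOG implementation: the score table, the protein and species names, and the function `inparalog_pairs` are hypothetical, and a missing score is simply treated as zero similarity.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple


def inparalog_pairs(
    scores: Dict[FrozenSet[str], float],
    species_of: Dict[str, str],
    subset: Set[str],
) -> Set[Tuple[str, str]]:
    """Pairs of proteins from `subset` species that are more similar to each
    other than either of them is to any protein outside the subset."""

    def score(p: str, q: str) -> float:
        return scores.get(frozenset((p, q)), 0.0)

    inside = sorted(p for p, s in species_of.items() if s in subset)
    outside = [p for p, s in species_of.items() if s not in subset]

    def best_outside(p: str) -> float:
        return max((score(p, o) for o in outside), default=0.0)

    pairs: Set[Tuple[str, str]] = set()
    for p, q in combinations(inside, 2):
        s = score(p, q)
        if s > 0 and s > best_outside(p) and s > best_outside(q):
            pairs.add((p, q))
    return pairs


def key(p: str, q: str) -> FrozenSet[str]:
    return frozenset((p, q))


# Toy data (illustrative only): human and chimp form the species subset.
species_of = {"hA1": "human", "hA2": "human", "hB": "human",
              "cA": "chimp", "yA": "yeast", "yB": "yeast"}
scores = {key("hA1", "hA2"): 300.0, key("hA1", "cA"): 400.0,
          key("hA2", "cA"): 310.0, key("hA1", "yA"): 150.0,
          key("hA2", "yA"): 140.0, key("cA", "yA"): 145.0,
          key("hB", "yB"): 500.0, key("hB", "hA1"): 100.0}

result = inparalog_pairs(scores, species_of, {"human", "chimp"})
print(sorted(result))
```

Here hB is excluded because it is more similar to the yeast protein yB than to any protein within the subset, whereas hA1, hA2 and cA are all closer to each other than to anything outside it.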
QUALITY ASSESSMENT OF eggNOG v3.0
So far, the majority of quality assessment tests have been based on the functional conservation of predicted orthologs (41–44); however, it has been acknowledged that a phylogeny-based benchmarking approach would be more appropriate (44,45). We therefore manually curated a set of orthologous groups exemplifying multiple caveats of orthology prediction (35), named Reference OGs (RefOGs), which were used to compare the quality of this release with that of eggNOG v2. As many as 95% of the reference orthologs can be detected in the new release compared to only 75% in the previous version (Figure 2). This is mainly due to the updated genome annotations in eggNOG v3. We estimated the impact of four error sources: (i) false assignments, (ii) missing orthologs, (iii) fusions and (iv) fissions (for details see Figure 2). eggNOG v3 is less influenced than eggNOG v2 by false assignments and missing orthologs. In particular, only 41% of the RefOGs are affected by missing orthologs in this release compared to 57% in the previous one. The high coverage of the benchmark set (95%) due to new genome annotations is the major contributor to this observation, highlighting the importance of frequent database updates, which is one of our goals. On the other hand, the previous release contains slightly fewer artificial fusions and fissions. As the coverage of the compared species affects the accuracy of orthology assignment (35), it can be expected that the addition of more species does not always improve all benchmark parameters.
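To make the four error sources more tangible, the sketch below compares one reference group with a set of predicted groups. The operational definitions used here are assumptions chosen for illustration and may differ from those underlying Figure 2; the gene names, group names and the function `assess_refog` are likewise hypothetical.

```python
from typing import Dict, Set


def assess_refog(
    ref_name: str,
    refogs: Dict[str, Set[str]],
    predicted: Dict[str, Set[str]],
) -> Dict[str, object]:
    """Compare one RefOG with the predicted groups (assumed definitions).

    missing: reference genes absent from the best-matching predicted group
    false:   genes of the best-matching group that are not in the RefOG
    fission: the RefOG is spread over more than one predicted group
    fusion:  the best-matching group also holds genes of another RefOG
    """
    ref = refogs[ref_name]
    overlapping = {gid: genes for gid, genes in predicted.items() if genes & ref}
    if not overlapping:
        return {"missing": set(ref), "false": set(),
                "fission": False, "fusion": False}
    # Best match = largest overlap; ties are broken by group name.
    best_id = max(sorted(overlapping), key=lambda gid: len(overlapping[gid] & ref))
    best = overlapping[best_id]
    other_ref_genes: Set[str] = set().union(
        *(genes for name, genes in refogs.items() if name != ref_name)
    )
    return {
        "missing": ref - best,
        "false": best - ref,
        "fission": len(overlapping) > 1,
        "fusion": bool(best & other_ref_genes),
    }


# Toy benchmark (illustrative only).
refogs = {"RefOG1": {"g1", "g2", "g3", "g4"}, "RefOG2": {"g5", "g6"}}
predicted = {"OG_a": {"g1", "g2", "g5"}, "OG_b": {"g3"}, "OG_c": {"g6", "g7"}}

report = assess_refog("RefOG1", refogs, predicted)
print(report)
```

In this toy case RefOG1 is affected by all four error types at once, which shows that the categories are not mutually exclusive and that a gain in one benchmark parameter need not carry over to the others.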
To improve the usability of eggNOG v3, a new, modernized web interface was developed. As with the previous versions, the new interface provides data that can be downloaded under the Creative Commons Attribution 3.0 License at http://eggnog.embl.de. The available data include the OGs, protein sequences, multiple sequence alignments and precomputed gene trees (Figure 3), as well as the annotation of 62% of the OGs. Possible queries include multiple OG names, gene names and/or protein names. One goal of the new interface is to simplify the navigation of the various OGs by means of (i) a cleaner, more intuitive layout and (ii) an interactive species tree on the right side of the search results. The interactive species tree facilitates navigation across the different hierarchical levels by following the orthologs through the taxonomic levels. Homo sapiens serves as the default species for protein name queries; however, this can be changed to any of several common species within the search results. The multiple sequence alignments can be displayed using the Jalview applet (46) or downloaded in aligned or unaligned form. Precomputed phylogenetic trees are also provided and can be viewed together with any assigned PFAM (47) and SMART (48) domains via iTOL (49) or downloaded in Newick format.
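As a small usage illustration for trees downloaded in Newick format, the following Python sketch lists the leaf labels of a tree string. The tree, its labels and the function name are hypothetical, and the regular expression only handles simple unquoted labels; for real analyses a dedicated phylogenetics library would be the safer choice.

```python
import re
from typing import List


def newick_leaf_names(newick: str) -> List[str]:
    """Return the leaf labels of a simple Newick string.

    Assumes unquoted labels: a leaf label is taken to be any name that
    directly follows '(' or ','. Labels of internal nodes, which follow
    ')', are ignored.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Hypothetical gene tree with branch lengths (illustrative only).
tree = "((protA:0.10,protB:0.20):0.05,(protC:0.15,protD:0.12):0.07,protE:0.30);"
leaves = newick_leaf_names(tree)
print(leaves)
```

The resulting list can be used, for example, to check which members of an orthologous group are present in a tree before it is passed on to a viewer.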
With eggNOG v3, we provide one of the most comprehensive and up-to-date databases of orthologous groups available that delivers protein function annotation for 1133 genomes across the three domains of life. Not only does eggNOG v3 cover a broad taxonomic spectrum, but it also supplies orthologous groups for 41 manually selected taxonomic ranges. The modern, easy-to-use web interface facilitates the usage of the database with novel extended functionalities, such as an interactive species tree to assist the navigation through the increased number of hierarchical levels. Our future plans include the ongoing improvement of the quality of orthology and functional assignments, a further increase of taxonomic ranges and technical improvements to manage the computational challenges that come along with the expected exponential increase of available genomes.
FUNDING
EMBL; MetaHit RTD EC (201052); Novo Nordisk Foundation Center for Protein Research; Swiss Institute of Bioinformatics; and the University of Zurich through its Research Priority Program ‘Systems Biology and Functional Genomics’. Funding for open access charge: EMBL (internal).
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
We would like to thank Yan Yuan for all his help and support on all technical and infrastructure issues we encountered during this project.