Abstract

Humans have coexisted with pathogenic microorganisms throughout its history of evolution. We have never halted the exploration of pathogenic microorganisms. With the improvement of genome-sequencing technology and the continuous reduction of sequencing costs, an increasing number of complete genome sequences of pathogenic microorganisms have become available. Genome annotation of this massive sequence information has become a daunting task in biological research. This paper summarizes the approaches to the genome annotation of pathogenic microorganisms and the available popular genome annotation tools for prokaryotes, eukaryotes and viruses. Furthermore, real-world comparisons of different annotation tools using 12 genomes from prokaryotes, eukaryotes and viruses were conducted. Current challenges and problems were also discussed.

Introduction

As the earliest known life forms, microorganisms or microbes have been developing on Earth for approximately four billion years [1–3]. Since modern humans (Homo sapiens) appeared approximately 300 000 years ago [4], microorganisms have interacted with humans and secretly lived together on Earth [5]. Until the first century B.C., our ancestors speculated that some unseen creatures possibly caused serious diseases [6]. In the late 1880s, disease-causing microorganisms, namely, pathogens, were gradually established and studied by scientists. Since then, scientists have been studying why and how these microorganisms, including bacteria, fungi, viruses and parasites, make us sick [7]. To date, approximately 1400 species of human pathogenic microorganisms have been identified. Of these, almost 90 were first reported in humans since 1980 [8]. Although we have accumulated much useful understanding of the emergence, evolution, pandemic and pathogenic mechanism of pathogenic microorganisms, there are still some crucial gaps in our knowledge. For instance, we still know very little about the diversity of pathogens that humans may be exposed to. The prior conditions for determining which pathogens can infect humans and estimating these pathogens’ transmission potential are more challenging [9, 10]. Since the first complete genome sequence of the microorganism Haemophilus influenza was published in 1995 [11], the whole-genome era’s arrival has brought a new dawn for us to fill these gaps. Over the past two decades, the genomes of hundreds of eukaryotes and hundreds of thousands of prokaryotes, including pathogenic microorganisms, have been sequenced. Moreover, many thousands of complete genomes are coming soon [12]. Genome annotation has become a critical element for us to understand genomic biology, especially the genomes of pathogenic microorganisms. Over the years of efforts, genome annotation has flourished. Today, we know not only protein-coding genes but also many other functional elements, such as promoters, enhancers, noncoding RNAs and DNA methylation sites.

Approaches to genome annotation of pathogenic microorganisms

Since the first complete genome was sequenced 25 years ago [11], people have been trying to use various approaches to annotate the genomes more completely and accurately. Generally, genome annotation mainly consists of three parts: structural gene annotation, functional gene annotation and noncoding sequence annotation (Figure 1). Each part contains many different methods for annotations.

Figure 1

Genome annotation strategies.

Structural gene annotation identifies the coordinates of genes in a genome. Although pathogenic microorganisms cover a wide range of species, including eukaryotes, prokaryotes and viruses, their structural gene annotations generally fall into similar categories as follows: homology-based gene prediction and ab initio gene prediction. For homology-based structural gene annotation, the sequences of proteins, transcripts, genes or even genomes from related species or databases are aligned to the genome to be annotated [13–16]. Commonly, methods based on BLAST [17] or hidden Markov models (HMMs) [18] are used for alignment. BLAST is mainly used to find homologs based on whole sequence alignment, while HMM is mainly used to find remote homologs based on domain structures. If the selected reference and the genome to be annotated are closely related, such as belonging to the same species, BLAST is often selected. Conversely, HMM is often selected if they have a distant relationship. As the output, the homologs and their coordinates in the genome are identified. Boundaries between introns and exons are also found. Ab initio structural gene annotation is a method that uses statistical models to identify gene features, such as exons, introns, start codons and stop codons. The process of ab initio structural gene annotation usually includes the following two components: model parameter estimation and gene prediction. For conventional ab initio methods, model parameter estimation requires constructing training sets of known gene structures. Transcriptome sequences, expressed sequence tags (EST), gene sequences or protein sequences from closely or distantly related species can be used as an initial point to estimate the parameters of the ab initio statistical model [19–21]. Other ab initio methods use self-training algorithms for model parameter estimation, which does not require a user to generate or provide any training set [22–24]. Regardless of the methods used for model parameter estimation, the gene prediction model based on these parameters can identify the gene coordinates in a new genome. In short, homology-based and ab initio gene predictions have advantages and limitations. Furthermore, these predictions are applied in structural gene annotation of various pathogenic bacteria, viruses, fungi and protozoa. In reality, many pipelines for genome annotation use the combination of the two categories to improve genome annotation’s accuracy and efficiency [13, 15, 25–27].

Functional gene annotation predicts gene functions, namely, assigns gene (family) names and attributes. Once the genes in a new genome are obtained, it is natural that we want to know their functions. Similar to structural gene annotation, functional gene annotation also has two strategies in principle, i.e. homology based and ab initio. However, almost all genome annotation pipelines adopt the homology-based strategy for functional gene annotation [13, 15, 20, 26–33], although some ab initio algorithms for gene function prediction have been reported [34–36]. Generally, based on BLAST or HMM, the predicted proteins are searched against a hierarchical collection of protein homologs, protein families and domain architectures from various public databases, such as GenBank, SwissProt and Pfam, among others. Then, the hit name with the highest-precedence evidence is assigned to the proteins.

Although the core feature of genome annotation has been gene prediction, many reports about functional elements in the genome’s noncoding region have prompted efforts to expand the genome annotation to noncoding sequences. Currently, many genome annotation pipelines include the annotation of noncoding functional elements, such as promoters, enhancers, noncoding RNAs and transposable elements [15, 30–32]. In these pipelines, noncoding sequences are annotated by special programs to find various noncoding functional elements. Generally, these programs also employ a homology-based or ab initio strategy to identify distinctive noncoding functional elements in the genome [37–40].

In addition, for eukaryotic pathogens such as fungi and protozoa, sometimes the mitochondrial genome needs to be annotated. Unlike the nuclear genome, the mitochondrial genome uses a different codon translation table, which brings unique annotation challenges [13]. The current genome annotation pipelines generally do not include mitochondrial genome annotations. A researcher may have to employ separate annotation programs or pipelines for this task [41, 42].

Genome annotation tools for pathogenic microorganisms

For the genome annotation of pathogenic microorganisms, many popular tools or pipelines can be selected, which can be divided into two types: stand-alone software and web-based systems. Generally, most stand-alone software tools require the command line to run, which requires the users to have some computer skills of the command line and Linux operating system. In contrast, the web-based system does not require users to have any specific computer skills. Users usually only need to upload their data files in the specified format onto the website. The genome annotation pipeline in the background will automatically analyze the data and return the results to the users. However, the stand-alone version of the software usually runs much faster than the web-based system. A flowchart of the general guideline is provided to help new researchers in the field (Figure 2).

Figure 2

Flowchart of a general guideline for choosing annotation tools.

For the genome annotation of prokaryotic pathogens, popular tools include RAST [31], RASTtk [43], DOE-JGI Microbial Genome Annotation Pipeline (MGAP) [44], GeneSAS [45], Prokka [30] and PGAP [15], among others. RAST (https://rast.nmpdr.org/), as a web-based system of prokaryotic genome annotation, was published in 2008 [31]. To date, RAST has annotated over 60 000 distinct genomes by over 12 000 users [46]. Using this server, protein-encoding, rRNA and tRNA gene coordinates and their functions can be predicted. Additionally, the metabolic network can be reconstructed based on the predicted information. Moreover, the annotated genome can be browsed in the SEED environment [47]. In 2015, a modular and extensible version of RAST, namely, the RAST tool kit (RASTtk), was released [43]. RASTtk is available not only as embedded options on the RAST web server but also as a stand-alone tool to be installed and run locally (https://github.com/TheSEED/RASTtk-Distribution/releases/). The DOE-JGI MGAP is also a web-based system for bacterial and archaeal genome annotation (https://gold.jgi.doe.gov/index). MGAP can annotate protein-encoding and RNA genes, gene functions and CRISPR elements in a genome. The users need first to register a project and submit their assembled nucleotide sequence datasets to the Genomes Online Database (GOLD), and then these datasets are annotated by MGAP [44]. GenSAS (https://www.gensas.org/) is another online platform that can be used for prokaryotic genome annotation [45]. The platform provides a variety of optional tools for structural gene annotation, functional gene annotation and genome visualization. Prokka (https://github.com/tseemann/prokka) is a popular stand-alone tool for prokaryotic genome annotation [30]. Prokka can quickly annotate a bacterial genome on a desktop computer, and it can annotate protein-coding genes, RNA genes and their functions and provide standard-compliant output files for genome browsers. Many integrated genome analysis tools, such as ASA3P [48], TORMES [49] and DFAST [50], directly embed Prokka as their genome annotation function. PGAP (https://github.com/ncbi/pgap) is another stand-alone pipeline for genome annotation of prokaryotes [15]. NCBI has used PGAP for prokaryotic genome annotation (https://www.ncbi.nlm.nih.gov/genome/annotation_prok/). Coding genes, noncoding genes, gene function and CRISPR loci can be predicted with this pipeline, and genes are predicted by the combination of homology-based and ab initio methods.

Because of the complexity of eukaryotic genomes [51], many eukaryotic genome annotation tools, such as PASA [52], Augustus [53], SNAP [54], EVidenceModeler (EVM) [25], MAKER [55], BRAKER [56] and GeneMark-EP+ [57], only focus on structural annotation of genomes. Only a small number of eukaryotic genome annotation tools can perform functional annotation. GenSAS (https://www.gensas.org/) is a web server that can perform structural annotation, functional annotation and visualization [45]. Multiple gene prediction tools, such as Augustus, SNAP and EVM, can be chosen for genome structural annotation in GenSAS. For functional annotation in GenSAS, BLAST+ [58] and Diamond [59] are used to search public databases, such as NCBI RefSeq [60], SwissProt [61] and TrEMBL [61], or any protein files provided by users. Companion is a web server (http://companion.sanger.ac.uk) for the annotation of parasite genomes [33]. A stand-alone version of Companion (https://github.com/sanger-pathogens/companion) is also available for large genome annotation and user confidentiality. The tool performs gene finding, functional annotation, pseudogene detection, comparative analyses based on reference genomes and visualization of annotation and analysis results. For the structural annotation of the genome in the tool, Augustus, SNAP and RATT [62] are used. For functional annotation of the genome in the tool, the OrthoMCL [63] and Pfam [64] databases are searched for homologs by BLAST+ and HMMER [65]. GAAP (http://gaap.hallym.ac.kr/) is a stand-alone pipeline for eukaryotic genome assembly and annotation [66]. The pipeline can directly accept raw DNA/RNA data from the NGS platform and assemble them into the genome and transcriptome. Then, the assembled genome is annotated for structure and function by the pipeline. Augustus, SNAP, Maker, EVM and PASA are combined for genome structural annotation in the pipeline. For the functional annotation portion of the pipeline, commercial software Blast2GO [67] and some public databases, such as NCBI, Pfam, Gene Ontology [68] and KEGG Pathway [69], are used to perform the function. genome annotator light (GAL) (https://hub.docker.com/u/cglabiicb/) is a docker-based package for genome structural and functional annotations for eukaryotes and prokaryotes [70].

Unlike prokaryotic and eukaryotic genome annotation tools, viral genome annotation tools have appeared only relatively recently, as most of them have been published in the past few years. Viral genome annotation system (http://cefg.uestc.cn/vgas/) provides two versions (stand-alone and web server) for viral genome annotation [29], and it uses an ab initio method improved from ZCURVE_V [71] for gene prediction. The resulting predicted genes are aligned with the SwissProt database by BLAST+ for functional annotation. Viral Annotation DefineR (VADR, https://github.com/nawrockie/vadr) is a reference-based tool for viral genome annotation, and it was developed by NCBI and integrated into GenBank for viral submissions [16]. By searching RefSeq, an appropriate reference is identified for the genome to be annotated. The genes and their functions are predicted based on the mapping between the reference and the target genome. Similarly, VAPiD [28] and VIGOR [72] are also reference-based tools and provide the capability for genome submissions to NCBI GenBank.

Performance test of genome annotation tools

To compare the performances of these genome annotation tools, 12 genomes with complete annotation information from prokaryotes, eukaryotes and viruses were selected to test and compare the above genome annotation tools (Table 1). Among them, Clostridioides difficile 630, Klebsiella pneumoniae subsp. pneumoniae HS11286, Neisseria gonorrhoeae FA 1090 and Staphylococcus aureus subsp. aureus NCTC 8325 are important bacterial pathogens from prokaryotes. Plasmodium falciparum 3D7, Toxoplasma gondii ME49 and Babesia microti strain R.I. are eukaryotic parasites. Aspergillus fumigatus Af293 is a fungal pathogen. SARS-CoV-2 (COVID-19), Dengue virus 2 Thailand/16681/84, Hepatitis C virus subtype 1a and Norwalk virus are important viral pathogens. These genomes were used to test prokaryotic, eukaryotic and viral genome annotation tools.

Table 1

Information about selected reference genome annotation

Table 1

Information about selected reference genome annotation

For the testing of prokaryotic genome annotation tools, PROKKA, RAST, RASTtk, GeneSAS and PGAP were selected to annotate the four prokaryotic genomes (Table 2 and Table 3). All these tools showed high accuracy of genome annotation. For most genomes, the annotation accuracy of these tools reached approximately 90%. Of these tools, PGAP showed the best annotation performance. For example, for the genomes of Clostridioides difficile 630, Neisseria gonorrhoeae FA 1090 and Staphylococcus aureus subsp. aureus NCTC 8325, PGAP predicted the most genes (96.39%, 94.42% and 90.78%, respectively) that were completely identical to the reference annotation. Although PROKKA predicted the fewest genes identical to the reference annotation in most genomes, over 80% of the genes from the reference annotation were still detected by PROKKA. To further compare the similarity between the reference annotation and the result from each tested tool, the revised similarity score from BEACON [73] was used as follows:
$$ SimilarityScore=\frac{Genes\ with\ same\ start\ position}{Totalx+ Totalz}\ast 2\ast 100 $$
where Totalx and Totalz are the total numbers of genes from the tested tool and the reference annotation, respectively. The similarity scores indicated that these tools’ annotations had high similarity with the reference annotation for most genomes. For example, for the genomes of Clostridioides difficile 630, Neisseria gonorrhoeae FA 1090 and Staphylococcus aureus subsp. aureus NCTC 8325, the similarity scores of PGAP reached 98.28%, 96.56% and 95.16%, respectively. The similarity scores of RAST and RASTtk were also over 90% for the three genomes. Although the similarity score of PGAP in the genome of Klebsiella pneumoniae subsp. pneumoniae HS11286 did not exceed 90%, it still reached 89.93%. For functional annotation, the coding sequences (CDSs) with functional annotation occupied more than 80% of the total CDSs in most reference annotations. For most genomes, PGAP achieved the highest functional annotation similarity rates using the reference annotations as the gold standard, while PROKKA had the lowest rates, such as 55.54% in Clostridioides difficile 630 and 57.58% in Neisseria gonorrhoeae FA 1090. It is worth noting that the functional annotation in the GeneSAS webserver was presented in the form of links to other protein databases and not included in the outputted files. Although PGAP outperformed in structure and function annotations of the prokaryotic genome, its running was the slowest and its requirement for the computing performance of a computer was the highest. In our test, PGAP simply could not run on a computer with 8 CPU cores and 15 GiB memory. Even on a computer with 16 CPU cores and 376 GiB memory, it still ran more than 5 h. In contrast, PROKKA and RASTtk only took a few minutes on the computer with 8 CPU cores and 15 GiB memory.
Table 2

The software selected for the performance test of genome annotation

Applicable objectSoftwareaSoftware release typeAverage running speedbSource
Prokaryotic genomePROKKAStand aloneAbout 2 minhttps://github.com/tseemann/prokka
RASTtkStand aloneAbout 4 minhttps://github.com/TheSEED/RASTtk-Distribution/releases/
RASTWeb basedAbout 40 minhttps://rast.nmpdr.org/
GeneSAS_genemarkSWeb basedLess than 1 hhttps://www.gensas.org/
PGAPStand aloneAbout 5 hhttps://github.com/ncbi/pgap
Eukaryotic genomeCompanion_webWeb basedAbout 10 hhttp://companion.sanger.ac.uk
Companion_clStand aloneAbout 15 hhttps://github.com/sanger-pathogens/companion
GeneSAS_genemarkESWeb basedLess than 1 hhttps://www.gensas.org/
GALStand aloneAbout 1 dayhttps://hub.docker.com/u/cglabiicb/
GAAPStand aloneMore than 1 dayhttp://gaap.hallym.ac.kr/
Viral genomeVADRStand aloneAbout 1 minhttps://github.com/nawrockie/vadr
VAPiDStand aloneAbout 1 minhttps://github.com/rcs333/VAPiD
GeneSAS_genemarkSWeb basedA few minuteshttps://www.gensas.org/
GeneSAS_glimmer3Web basedA few minuteshttps://www.gensas.org/
Applicable objectSoftwareaSoftware release typeAverage running speedbSource
Prokaryotic genomePROKKAStand aloneAbout 2 minhttps://github.com/tseemann/prokka
RASTtkStand aloneAbout 4 minhttps://github.com/TheSEED/RASTtk-Distribution/releases/
RASTWeb basedAbout 40 minhttps://rast.nmpdr.org/
GeneSAS_genemarkSWeb basedLess than 1 hhttps://www.gensas.org/
PGAPStand aloneAbout 5 hhttps://github.com/ncbi/pgap
Eukaryotic genomeCompanion_webWeb basedAbout 10 hhttp://companion.sanger.ac.uk
Companion_clStand aloneAbout 15 hhttps://github.com/sanger-pathogens/companion
GeneSAS_genemarkESWeb basedLess than 1 hhttps://www.gensas.org/
GALStand aloneAbout 1 dayhttps://hub.docker.com/u/cglabiicb/
GAAPStand aloneMore than 1 dayhttp://gaap.hallym.ac.kr/
Viral genomeVADRStand aloneAbout 1 minhttps://github.com/nawrockie/vadr
VAPiDStand aloneAbout 1 minhttps://github.com/rcs333/VAPiD
GeneSAS_genemarkSWeb basedA few minuteshttps://www.gensas.org/
GeneSAS_glimmer3Web basedA few minuteshttps://www.gensas.org/

aFor stand-alone software except for PGAP, they are tested at a workstation with 8 CPU cores and 15 GiB memory. Because high-performance computing (at least over 4 GB memory/CPU) is required for PGAP, PGAP is tested at a server with 16 CPU cores, 376 GiB memory.

bThe average of three running times of all related genome annotations.

Table 2

The software selected for the performance test of genome annotation

Applicable objectSoftwareaSoftware release typeAverage running speedbSource
Prokaryotic genomePROKKAStand aloneAbout 2 minhttps://github.com/tseemann/prokka
RASTtkStand aloneAbout 4 minhttps://github.com/TheSEED/RASTtk-Distribution/releases/
RASTWeb basedAbout 40 minhttps://rast.nmpdr.org/
GeneSAS_genemarkSWeb basedLess than 1 hhttps://www.gensas.org/
PGAPStand aloneAbout 5 hhttps://github.com/ncbi/pgap
Eukaryotic genomeCompanion_webWeb basedAbout 10 hhttp://companion.sanger.ac.uk
Companion_clStand aloneAbout 15 hhttps://github.com/sanger-pathogens/companion
GeneSAS_genemarkESWeb basedLess than 1 hhttps://www.gensas.org/
GALStand aloneAbout 1 dayhttps://hub.docker.com/u/cglabiicb/
GAAPStand aloneMore than 1 dayhttp://gaap.hallym.ac.kr/
Viral genomeVADRStand aloneAbout 1 minhttps://github.com/nawrockie/vadr
VAPiDStand aloneAbout 1 minhttps://github.com/rcs333/VAPiD
GeneSAS_genemarkSWeb basedA few minuteshttps://www.gensas.org/
GeneSAS_glimmer3Web basedA few minuteshttps://www.gensas.org/
Applicable objectSoftwareaSoftware release typeAverage running speedbSource
Prokaryotic genomePROKKAStand aloneAbout 2 minhttps://github.com/tseemann/prokka
RASTtkStand aloneAbout 4 minhttps://github.com/TheSEED/RASTtk-Distribution/releases/
RASTWeb basedAbout 40 minhttps://rast.nmpdr.org/
GeneSAS_genemarkSWeb basedLess than 1 hhttps://www.gensas.org/
PGAPStand aloneAbout 5 hhttps://github.com/ncbi/pgap
Eukaryotic genomeCompanion_webWeb basedAbout 10 hhttp://companion.sanger.ac.uk
Companion_clStand aloneAbout 15 hhttps://github.com/sanger-pathogens/companion
GeneSAS_genemarkESWeb basedLess than 1 hhttps://www.gensas.org/
GALStand aloneAbout 1 dayhttps://hub.docker.com/u/cglabiicb/
GAAPStand aloneMore than 1 dayhttp://gaap.hallym.ac.kr/
Viral genomeVADRStand aloneAbout 1 minhttps://github.com/nawrockie/vadr
VAPiDStand aloneAbout 1 minhttps://github.com/rcs333/VAPiD
GeneSAS_genemarkSWeb basedA few minuteshttps://www.gensas.org/
GeneSAS_glimmer3Web basedA few minuteshttps://www.gensas.org/

aFor stand-alone software except for PGAP, they are tested at a workstation with 8 CPU cores and 15 GiB memory. Because high-performance computing (at least over 4 GB memory/CPU) is required for PGAP, PGAP is tested at a server with 16 CPU cores, 376 GiB memory.

bThe average of three running times of all related genome annotations.

Table 3

Performance test of prokaryotic genome annotation tools based on four species

SpeciesSoftwareGenesCDSCDS with functionCDS without functionrRNAtRNAGenes with same start and end position (%)aGenes with same start position (%)Similarity scoreb
Clostridioides difficileRef_annotationc390438503353 (87.09%)497 (12.91%)3288///
PROKKA391338242124 (55.54%)1700 (44.46%)/893547 (90.86%)3714 (95.13%)95.02%
RASTtk403139122930 (74.90%)982 (25.10%)32873557 (91.11%)3727 (95.47%)93.94%
RAST426139042929 (75.03%)975 (24.97%)962613721 (95.31%)3900 (99.90%)95.53%
GeneSAS_genemarkS40873966//32893615 (92.60%)3755 (96.20%)93.98%
PGAP388638293410 (89.06%)419 (10.94%)32883763 (96.39%)3828 (98.05%)98.28%
Klebsiella pneumoniaeRef_annotation586857794085 (70.69%)1694 (29.31%)2562///
PROKKA554054513889 (71.34%)1562 (28.66%)/884760 (81.12%)5100 (86.91%)89.41%
RASTtk584457314826 (84.21%)905 (15.79%)25885092 (86.78%)5394 (91.92%)92.11%
RAST607057314857 (84.75%)874 (15.25%)752645216 (88.89%)5517 (94.02%)92.43%
GeneSAS_genemarkS55445544////4581 (78.07%)5018 (85.51%)87.94%
PGAP546755275014 (90.72%)513 (9.28%)25884858 (82.79%)5097 (86.86%)89.93%
Neisseria gonorrhoeaeRef_annotation204419731600 (81.09%)373 (18.91%)1154///
PROKKA220121451235 (57.58%)910 (42.42%)/551713 (83.81%)1836 (89.82%)86.50%
RASTtk264225751765 (68.54%)810 (31.46%)12551731 (84.69%)1881 (92.03%)80.28%
RAST277625751783 (69.24%)792 (30.76%)361651839 (89.98%)1999 (97.80%)82.95%
GeneSAS_genemarkS23572357////1421 (69.52%)1683 (82.34%)76.48%
PGAP203019601629 (83.11%)331 (16.89%)12551930 (94.42%)1967 (96.23%)96.56%
Staphylococcus aureusRef_annotation284227671238 (44.74%)1529 (55.26%)1659///
PROKKA269326301720 (65.40%)910 (34.60%)/622315 (81.46%)2476 (87.12%)89.47%
RASTtk276326872157 (80.28%)530 (19.72%)16602353 (82.79%)2523 (88.78%)90.03%
RAST291526872157 (80.28%)530 (19.72%)481802471 (86.95%)2647 (93.14%)91.96%
GeneSAS_genemarkS26832683////2320 (81.63%)2458 (86.49%)88.98%
PGAP278227042327 (86.06%)377 (13.94%)16592580 (90.78%)2676 (94.16%)95.16%
SpeciesSoftwareGenesCDSCDS with functionCDS without functionrRNAtRNAGenes with same start and end position (%)aGenes with same start position (%)Similarity scoreb
Clostridioides difficileRef_annotationc390438503353 (87.09%)497 (12.91%)3288///
PROKKA391338242124 (55.54%)1700 (44.46%)/893547 (90.86%)3714 (95.13%)95.02%
RASTtk403139122930 (74.90%)982 (25.10%)32873557 (91.11%)3727 (95.47%)93.94%
RAST426139042929 (75.03%)975 (24.97%)962613721 (95.31%)3900 (99.90%)95.53%
GeneSAS_genemarkS40873966//32893615 (92.60%)3755 (96.20%)93.98%
PGAP388638293410 (89.06%)419 (10.94%)32883763 (96.39%)3828 (98.05%)98.28%
Klebsiella pneumoniaeRef_annotation586857794085 (70.69%)1694 (29.31%)2562///
PROKKA554054513889 (71.34%)1562 (28.66%)/884760 (81.12%)5100 (86.91%)89.41%
RASTtk584457314826 (84.21%)905 (15.79%)25885092 (86.78%)5394 (91.92%)92.11%
RAST607057314857 (84.75%)874 (15.25%)752645216 (88.89%)5517 (94.02%)92.43%
GeneSAS_genemarkS55445544////4581 (78.07%)5018 (85.51%)87.94%
PGAP546755275014 (90.72%)513 (9.28%)25884858 (82.79%)5097 (86.86%)89.93%
Neisseria gonorrhoeaeRef_annotation204419731600 (81.09%)373 (18.91%)1154///
PROKKA220121451235 (57.58%)910 (42.42%)/551713 (83.81%)1836 (89.82%)86.50%
RASTtk264225751765 (68.54%)810 (31.46%)12551731 (84.69%)1881 (92.03%)80.28%
RAST277625751783 (69.24%)792 (30.76%)361651839 (89.98%)1999 (97.80%)82.95%
GeneSAS_genemarkS23572357////1421 (69.52%)1683 (82.34%)76.48%
PGAP203019601629 (83.11%)331 (16.89%)12551930 (94.42%)1967 (96.23%)96.56%
Staphylococcus aureusRef_annotation284227671238 (44.74%)1529 (55.26%)1659///
PROKKA269326301720 (65.40%)910 (34.60%)/622315 (81.46%)2476 (87.12%)89.47%
RASTtk276326872157 (80.28%)530 (19.72%)16602353 (82.79%)2523 (88.78%)90.03%
RAST291526872157 (80.28%)530 (19.72%)481802471 (86.95%)2647 (93.14%)91.96%
GeneSAS_genemarkS26832683////2320 (81.63%)2458 (86.49%)88.98%
PGAP278227042327 (86.06%)377 (13.94%)16592580 (90.78%)2676 (94.16%)95.16%

aPercentage is equal to (the number of detected identical genes/ref_annotation genes)*100.

bSimilarityScore = ((Genes with same start position)/(Totalx + Totalz))*2*100. Totalx and Totalz are the total number of genes in the software annotation and reference annotation. The function is from BEACON (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539851/).

cref_annotation means the reference annotation. It is from NCBI RefSeq.

Table 3

Performance test of prokaryotic genome annotation tools based on four species

SpeciesSoftwareGenesCDSCDS with functionCDS without functionrRNAtRNAGenes with same start and end position (%)aGenes with same start position (%)Similarity scoreb
Clostridioides difficileRef_annotationc390438503353 (87.09%)497 (12.91%)3288///
PROKKA391338242124 (55.54%)1700 (44.46%)/893547 (90.86%)3714 (95.13%)95.02%
RASTtk403139122930 (74.90%)982 (25.10%)32873557 (91.11%)3727 (95.47%)93.94%
RAST426139042929 (75.03%)975 (24.97%)962613721 (95.31%)3900 (99.90%)95.53%
GeneSAS_genemarkS40873966//32893615 (92.60%)3755 (96.20%)93.98%
PGAP388638293410 (89.06%)419 (10.94%)32883763 (96.39%)3828 (98.05%)98.28%
Klebsiella pneumoniaeRef_annotation586857794085 (70.69%)1694 (29.31%)2562///
PROKKA554054513889 (71.34%)1562 (28.66%)/884760 (81.12%)5100 (86.91%)89.41%
RASTtk584457314826 (84.21%)905 (15.79%)25885092 (86.78%)5394 (91.92%)92.11%
RAST607057314857 (84.75%)874 (15.25%)752645216 (88.89%)5517 (94.02%)92.43%
GeneSAS_genemarkS55445544////4581 (78.07%)5018 (85.51%)87.94%
PGAP546755275014 (90.72%)513 (9.28%)25884858 (82.79%)5097 (86.86%)89.93%
Neisseria gonorrhoeaeRef_annotation204419731600 (81.09%)373 (18.91%)1154///
PROKKA220121451235 (57.58%)910 (42.42%)/551713 (83.81%)1836 (89.82%)86.50%
RASTtk264225751765 (68.54%)810 (31.46%)12551731 (84.69%)1881 (92.03%)80.28%
RAST277625751783 (69.24%)792 (30.76%)361651839 (89.98%)1999 (97.80%)82.95%
GeneSAS_genemarkS23572357////1421 (69.52%)1683 (82.34%)76.48%
PGAP203019601629 (83.11%)331 (16.89%)12551930 (94.42%)1967 (96.23%)96.56%
Staphylococcus aureusRef_annotation284227671238 (44.74%)1529 (55.26%)1659///
PROKKA269326301720 (65.40%)910 (34.60%)/622315 (81.46%)2476 (87.12%)89.47%
RASTtk276326872157 (80.28%)530 (19.72%)16602353 (82.79%)2523 (88.78%)90.03%
RAST291526872157 (80.28%)530 (19.72%)481802471 (86.95%)2647 (93.14%)91.96%
GeneSAS_genemarkS26832683////2320 (81.63%)2458 (86.49%)88.98%
PGAP278227042327 (86.06%)377 (13.94%)16592580 (90.78%)2676 (94.16%)95.16%
SpeciesSoftwareGenesCDSCDS with functionCDS without functionrRNAtRNAGenes with same start and end position (%)aGenes with same start position (%)Similarity scoreb
Clostridioides difficileRef_annotationc390438503353 (87.09%)497 (12.91%)3288///
PROKKA391338242124 (55.54%)1700 (44.46%)/893547 (90.86%)3714 (95.13%)95.02%
RASTtk403139122930 (74.90%)982 (25.10%)32873557 (91.11%)3727 (95.47%)93.94%
RAST426139042929 (75.03%)975 (24.97%)962613721 (95.31%)3900 (99.90%)95.53%
GeneSAS_genemarkS40873966//32893615 (92.60%)3755 (96.20%)93.98%
PGAP388638293410 (89.06%)419 (10.94%)32883763 (96.39%)3828 (98.05%)98.28%
Klebsiella pneumoniaeRef_annotation586857794085 (70.69%)1694 (29.31%)2562///
PROKKA554054513889 (71.34%)1562 (28.66%)/884760 (81.12%)5100 (86.91%)89.41%
RASTtk584457314826 (84.21%)905 (15.79%)25885092 (86.78%)5394 (91.92%)92.11%
RAST607057314857 (84.75%)874 (15.25%)752645216 (88.89%)5517 (94.02%)92.43%
GeneSAS_genemarkS55445544////4581 (78.07%)5018 (85.51%)87.94%
PGAP546755275014 (90.72%)513 (9.28%)25884858 (82.79%)5097 (86.86%)89.93%
Neisseria gonorrhoeaeRef_annotation204419731600 (81.09%)373 (18.91%)1154///
PROKKA220121451235 (57.58%)910 (42.42%)/551713 (83.81%)1836 (89.82%)86.50%
RASTtk264225751765 (68.54%)810 (31.46%)12551731 (84.69%)1881 (92.03%)80.28%
RAST277625751783 (69.24%)792 (30.76%)361651839 (89.98%)1999 (97.80%)82.95%
GeneSAS_genemarkS23572357////1421 (69.52%)1683 (82.34%)76.48%
PGAP203019601629 (83.11%)331 (16.89%)12551930 (94.42%)1967 (96.23%)96.56%
Staphylococcus aureusRef_annotation284227671238 (44.74%)1529 (55.26%)1659///
PROKKA269326301720 (65.40%)910 (34.60%)/622315 (81.46%)2476 (87.12%)89.47%
RASTtk276326872157 (80.28%)530 (19.72%)16602353 (82.79%)2523 (88.78%)90.03%
RAST291526872157 (80.28%)530 (19.72%)481802471 (86.95%)2647 (93.14%)91.96%
GeneSAS_genemarkS26832683////2320 (81.63%)2458 (86.49%)88.98%
PGAP278227042327 (86.06%)377 (13.94%)16592580 (90.78%)2676 (94.16%)95.16%

aPercentage is equal to (the number of detected identical genes/ref_annotation genes)*100.

bSimilarityScore = ((Genes with same start position)/(Totalx + Totalz))*2*100. Totalx and Totalz are the total number of genes in the software annotation and reference annotation. The function is from BEACON (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539851/).

cref_annotation means the reference annotation. It is from NCBI RefSeq.

For the test of eukaryotic genome annotation tools, the Companion web server, Companion command line, GeneSAS, GAL and GAAP were selected to annotate the four eukaryotic genomes (Table 2 and Table 4). Although the Companion web server and its command-line generally performed better on genome annotation, they could only annotate parasite genomes and required reference genome annotation from related species. GeneSAS with the genemarkES method usually performed well on structural gene annotation for most genomes. However, GeneSAS did not have functional gene annotation in the output files. GAL and GAAP generally showed lower annotation accuracy for most genomes, but they could annotate any eukaryotic genome. For GAAP, no functional annotation was tested as a commercial tool (Blast2GO Cloud) was required. Additionally, GAAP is not a wrapped-up pipeline, thereby requiring users to perform all steps separately by themselves. For the running speed, the Companion web server, Companion command-line and GAAP required approximately 1 day to complete the annotation, while GeneSAS completed the annotation in 1 h.

Table 4

Performance test of eukaryotic genome annotation tools based on four species

SpeciesSoftwareGenesCDSCDS with functionCDS without functionrRNAtRNAGenes with same start and end position (%)aGenes with same start positionSimilarity scoreb
Plasmodium falciparumRef_annotationc545753543616 (67.54%)1738 (32.46%)2845///
Companion_web519651303072 (59.89%)2058 (40.11%)14475009 (91.80%)5027 (92.12%)94.38%
Companion_cl452143922672 (60.84%)1720 (39.16%)19464317 (79.11%)4345 (79.62)87.09%
GeneSAS_genemarkES51845108//14623929 (72.00%)4388 (80.41%)82.47%
GAL170617061362 (79.84%)344 (20.16%)//323 (5.92%)753 (13.80%)21.02%
GAAP53775377/ d///2799 (51.30%)3657 (67.01%)67.51%
Toxoplasma gondiiRef_annotation892582924008 (48.34%)4284 (51.66%)424183///
Companion_web4996448824415083011931639 (18.36%)1976 (22.14%)28.39%
Companion_cl11 29710 5202151 (20.45%)8369 (79.55%)5661911067 (12.06%)1488 (16.67%)14.72%
GeneSAS_genemarkES/////////
GAL34 28834 28827 368 (79.82%)6920 (20.18%)//116 (1.30%)796 (8.92%)3.68%
GAAP26 20426 204////96 (1.08%)545 (6.11%)3.10%
Babesia microtiRef_annotation368535672335 (65.46%)1232 (34.54%)1668///
Companion_web315130751 (0.03%)3074 (99.97%)8642262 (61.38%)2642 (71.70%)77.30%
Companion_cl265525721474 (57.31%)1098 (42.69%)8632461 (66.78%)2507 (68.03%)79.09%
GeneSAS_genemarkES30663066////2427 (65.86%)2740 (74.36%)81.17%
GAL210421042031 (96.53%)73 (3.5%)//710 (19.27%)1200 (32.56%)41.46%
GAAP17121712////466 (12.65%)852 (23.12%)31.57%
Aspergillus fumigatusRef_annotation985996306977 (72.45%)2653 (27.55%)/229///
Companion_web/////////
Companion_cl/////////
GeneSAS_genemarkES97069706////6147 (62.35%)7424 (75.30%)75.89%
GAL897889788623 (96.05%)355 (4%)//6150 (62.38%)7250 (73.54%)76.98%
GAAP86818681////5908 (59.92%)7036 (71.37%)75.90%
SpeciesSoftwareGenesCDSCDS with functionCDS without functionrRNAtRNAGenes with same start and end position (%)aGenes with same start positionSimilarity scoreb
Plasmodium falciparumRef_annotationc545753543616 (67.54%)1738 (32.46%)2845///
Companion_web519651303072 (59.89%)2058 (40.11%)14475009 (91.80%)5027 (92.12%)94.38%
Companion_cl452143922672 (60.84%)1720 (39.16%)19464317 (79.11%)4345 (79.62)87.09%
GeneSAS_genemarkES51845108//14623929 (72.00%)4388 (80.41%)82.47%
GAL170617061362 (79.84%)344 (20.16%)//323 (5.92%)753 (13.80%)21.02%
GAAP53775377/ d///2799 (51.30%)3657 (67.01%)67.51%
Toxoplasma gondiiRef_annotation892582924008 (48.34%)4284 (51.66%)424183///
Companion_web4996448824415083011931639 (18.36%)1976 (22.14%)28.39%
Companion_cl11 29710 5202151 (20.45%)8369 (79.55%)5661911067 (12.06%)1488 (16.67%)14.72%
GeneSAS_genemarkES/////////
GAL34 28834 28827 368 (79.82%)6920 (20.18%)//116 (1.30%)796 (8.92%)3.68%
GAAP26 20426 204////96 (1.08%)545 (6.11%)3.10%
Babesia microtiRef_annotation368535672335 (65.46%)1232 (34.54%)1668///
Companion_web315130751 (0.03%)3074 (99.97%)8642262 (61.38%)2642 (71.70%)77.30%
Companion_cl265525721474 (57.31%)1098 (42.69%)8632461 (66.78%)2507 (68.03%)79.09%
GeneSAS_genemarkES30663066////2427 (65.86%)2740 (74.36%)81.17%
GAL210421042031 (96.53%)73 (3.5%)//710 (19.27%)1200 (32.56%)41.46%
GAAP17121712////466 (12.65%)852 (23.12%)31.57%
Aspergillus fumigatusRef_annotation985996306977 (72.45%)2653 (27.55%)/229///
Companion_web/////////
Companion_cl/////////
GeneSAS_genemarkES97069706////6147 (62.35%)7424 (75.30%)75.89%
GAL897889788623 (96.05%)355 (4%)//6150 (62.38%)7250 (73.54%)76.98%
GAAP86818681////5908 (59.92%)7036 (71.37%)75.90%

aPercentage is equal to (genes with same start and end position/Ref_annotation genes)*100.

bSimilarityScore = ((genes with same start position)/(Totalx + Totalz))*2*100. Totalx and Totalz are the total numbers of genes in the software annotation and reference annotation. The function is from BEACON (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539851/).

cref_annotation means the reference annotation. It is from NCBI RefSeq.

dThe blank is here because the commercial tool CloudBlast is required for functional annotation.

Table 4

Performance test of eukaryotic genome annotation tools based on four species

SpeciesSoftwareGenesCDSCDS with functionCDS without functionrRNAtRNAGenes with same start and end position (%)aGenes with same start positionSimilarity scoreb
Plasmodium falciparumRef_annotationc545753543616 (67.54%)1738 (32.46%)2845///
Companion_web519651303072 (59.89%)2058 (40.11%)14475009 (91.80%)5027 (92.12%)94.38%
Companion_cl452143922672 (60.84%)1720 (39.16%)19464317 (79.11%)4345 (79.62)87.09%
GeneSAS_genemarkES51845108//14623929 (72.00%)4388 (80.41%)82.47%
GAL170617061362 (79.84%)344 (20.16%)//323 (5.92%)753 (13.80%)21.02%
GAAP53775377/ d///2799 (51.30%)3657 (67.01%)67.51%
Toxoplasma gondiiRef_annotation892582924008 (48.34%)4284 (51.66%)424183///
Companion_web4996448824415083011931639 (18.36%)1976 (22.14%)28.39%
Companion_cl11 29710 5202151 (20.45%)8369 (79.55%)5661911067 (12.06%)1488 (16.67%)14.72%
GeneSAS_genemarkES/////////
GAL34 28834 28827 368 (79.82%)6920 (20.18%)//116 (1.30%)796 (8.92%)3.68%
GAAP26 20426 204////96 (1.08%)545 (6.11%)3.10%
Babesia microtiRef_annotation368535672335 (65.46%)1232 (34.54%)1668///
Companion_web315130751 (0.03%)3074 (99.97%)8642262 (61.38%)2642 (71.70%)77.30%
Companion_cl265525721474 (57.31%)1098 (42.69%)8632461 (66.78%)2507 (68.03%)79.09%
GeneSAS_genemarkES30663066////2427 (65.86%)2740 (74.36%)81.17%
GAL210421042031 (96.53%)73 (3.5%)//710 (19.27%)1200 (32.56%)41.46%
GAAP17121712////466 (12.65%)852 (23.12%)31.57%
Aspergillus fumigatusRef_annotation985996306977 (72.45%)2653 (27.55%)/229///
Companion_web/////////
Companion_cl/////////
GeneSAS_genemarkES97069706////6147 (62.35%)7424 (75.30%)75.89%
GAL897889788623 (96.05%)355 (4%)//6150 (62.38%)7250 (73.54%)76.98%
GAAP86818681////5908 (59.92%)7036 (71.37%)75.90%
SpeciesSoftwareGenesCDSCDS with functionCDS without functionrRNAtRNAGenes with same start and end position (%)aGenes with same start positionSimilarity scoreb
Plasmodium falciparumRef_annotationc545753543616 (67.54%)1738 (32.46%)2845///
Companion_web519651303072 (59.89%)2058 (40.11%)14475009 (91.80%)5027 (92.12%)94.38%
Companion_cl452143922672 (60.84%)1720 (39.16%)19464317 (79.11%)4345 (79.62)87.09%
GeneSAS_genemarkES51845108//14623929 (72.00%)4388 (80.41%)82.47%
GAL170617061362 (79.84%)344 (20.16%)//323 (5.92%)753 (13.80%)21.02%
GAAP53775377/ d///2799 (51.30%)3657 (67.01%)67.51%
Toxoplasma gondiiRef_annotation892582924008 (48.34%)4284 (51.66%)424183///
Companion_web4996448824415083011931639 (18.36%)1976 (22.14%)28.39%
Companion_cl11 29710 5202151 (20.45%)8369 (79.55%)5661911067 (12.06%)1488 (16.67%)14.72%
GeneSAS_genemarkES/////////
GAL34 28834 28827 368 (79.82%)6920 (20.18%)//116 (1.30%)796 (8.92%)3.68%
GAAP26 20426 204////96 (1.08%)545 (6.11%)3.10%
Babesia microtiRef_annotation368535672335 (65.46%)1232 (34.54%)1668///
Companion_web315130751 (0.03%)3074 (99.97%)8642262 (61.38%)2642 (71.70%)77.30%
Companion_cl265525721474 (57.31%)1098 (42.69%)8632461 (66.78%)2507 (68.03%)79.09%
GeneSAS_genemarkES30663066////2427 (65.86%)2740 (74.36%)81.17%
GAL210421042031 (96.53%)73 (3.5%)//710 (19.27%)1200 (32.56%)41.46%
GAAP17121712////466 (12.65%)852 (23.12%)31.57%
Aspergillus fumigatusRef_annotation985996306977 (72.45%)2653 (27.55%)/229///
Companion_web/////////
Companion_cl/////////
GeneSAS_genemarkES97069706////6147 (62.35%)7424 (75.30%)75.89%
GAL897889788623 (96.05%)355 (4%)//6150 (62.38%)7250 (73.54%)76.98%
GAAP86818681////5908 (59.92%)7036 (71.37%)75.90%

aPercentage is equal to (genes with same start and end position/Ref_annotation genes)*100.

bSimilarityScore = ((genes with same start position)/(Totalx + Totalz))*2*100. Totalx and Totalz are the total numbers of genes in the software annotation and reference annotation. The function is from BEACON (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539851/).

cref_annotation means the reference annotation. It is from NCBI RefSeq.

dThe blank is here because the commercial tool CloudBlast is required for functional annotation.

Another thing to note is that all tested tools showed extremely low annotation accuracy for T. gondii ME49. According to the NCBI genome assembly information, the genome dataset contained over 2000 scaffolds, except for 15 chromosomes and plasmids (Table S1). Therefore, we hypothesized that the reason for the low annotation accuracy of all tested tools might be the incomplete assembly of the genome. To verify this hypothesis, we selected another strain of T. gondii (RH-88), which had a more complete genome assembly, to test all the tools. The annotation results showed that these tools had higher annotation accuracy for strain RH-88 than strain ME49 (Table S2), supporting our hypothesis.

VADR, VAPiD and GeneSAS were selected to annotate the four viral genomes (Table 2 and Table 5). Of these tools, only VADR included the reference model of SARS-CoV-2. To make an unbiased comparison among these tools, the reference model of SARS-CoV-2 was removed from VADR. Because VADR’s default reference models only include Flaviviridae and Caliciviridae, as a result, it cannot find a reference meeting the similarity standard for homology-based annotation of SARS-CoV-2. Although users could build their own reference model for specific genome annotation, only RefSeq genomes are allowed to be used, and the length of a reference genome has to be less than 25 K in VADR. In contrast, VAPiD and GeneSAS can annotate the genome of any virus. To further explore and compare the detailed differences in annotation capabilities of these tools, we visualized the genome annotation results of these tools for SARS-CoV-2 (Figure S1). We found that the main difference in genome annotation of these tools occurred in predicting overlapping genes. It was difficult for most tools to accurately predict overlapping genes in a viral genome. The genome annotation results indicated that VADR and VAPiD generally had more stable and better annotation performance than GeneSAS. With regard to the running speed, all the tested tools needed only one or 2 min to complete the annotation.

Table 5

Performance test of viral genome annotation tools based on four species

SpeciesSoftwareGenesCDSCDS with functionCDS without functionGenes with same start and end position (%)aGenes with same start positionSimilarity scoreb
SARS-CoV-2Ref_annotationc11134 (30.77%)9 (69.23%)///
VADRd///////
VAPiD10104 (40.00%)6 (60.00%)9 (81.82%%)9 (81.82%)86%
GeneSAS_genemarkS99//6 (54.55%)7 (63.64%)61%
GeneSAS_glimmer31212//8 (72.73%)9 (81.82%)78%
Dengue virusRef_annotation1110///
VADR11101 (100%)1 (100%)100%
VAPiD11101 (100%)1 (100%)100%
GeneSAS_genemarkS11//1 (100%)1 (100%)100%
GeneSAS_glimmer322//1 (100%)1 (100%)66.67%
Hepacivirus CRef_annotation1330///
VADR13301 (100%)1 (100%)100%
VAPiD13301 (100%)1 (100%)100%
GeneSAS_genemarkS22//0 (0%)0 (0%)0%
GeneSAS_glimmer333//1 (100%)1 (100%)50%
Norwalk virusRef_annotation3330///
VADR33303 (100%)3 (100%)100%
VAPiD33303 (100%)3 (100%)100%
GeneSAS_genemarkS33//3 (100%)3 (100%)100%
GeneSAS_glimmer333//3 (100%)3 (100%)100%
SpeciesSoftwareGenesCDSCDS with functionCDS without functionGenes with same start and end position (%)aGenes with same start positionSimilarity scoreb
SARS-CoV-2Ref_annotationc11134 (30.77%)9 (69.23%)///
VADRd///////
VAPiD10104 (40.00%)6 (60.00%)9 (81.82%%)9 (81.82%)86%
GeneSAS_genemarkS99//6 (54.55%)7 (63.64%)61%
GeneSAS_glimmer31212//8 (72.73%)9 (81.82%)78%
Dengue virusRef_annotation1110///
VADR11101 (100%)1 (100%)100%
VAPiD11101 (100%)1 (100%)100%
GeneSAS_genemarkS11//1 (100%)1 (100%)100%
GeneSAS_glimmer322//1 (100%)1 (100%)66.67%
Hepacivirus CRef_annotation1330///
VADR13301 (100%)1 (100%)100%
VAPiD13301 (100%)1 (100%)100%
GeneSAS_genemarkS22//0 (0%)0 (0%)0%
GeneSAS_glimmer333//1 (100%)1 (100%)50%
Norwalk virusRef_annotation3330///
VADR33303 (100%)3 (100%)100%
VAPiD33303 (100%)3 (100%)100%
GeneSAS_genemarkS33//3 (100%)3 (100%)100%
GeneSAS_glimmer333//3 (100%)3 (100%)100%

aPercentage is equal to (Genes with same start and end position/Ref_annotation genes)*100.

bSimilarityScore = ((Genes with same start position)/(Totalx + Totalz))*2*100. Totalx and Totalz are the total number of genes in the software annotation and reference annotation. The function is from BEACON (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539851/).

cref_annotation means the reference annotation. It is from NCBI RefSeq.

dVADR with default reference models (Flaviviridae and Caliciviridae) could not annotate the genome of SARS-CoV-2, because VADR could not find anyone in the default model library to meet the similarity standard for homology-based annotation. VADR can annotate SARS-CoV-2 by using its SARS-CoV-2 specific reference model.

Table 5

Performance test of viral genome annotation tools based on four species

SpeciesSoftwareGenesCDSCDS with functionCDS without functionGenes with same start and end position (%)aGenes with same start positionSimilarity scoreb
SARS-CoV-2Ref_annotationc11134 (30.77%)9 (69.23%)///
VADRd///////
VAPiD10104 (40.00%)6 (60.00%)9 (81.82%%)9 (81.82%)86%
GeneSAS_genemarkS99//6 (54.55%)7 (63.64%)61%
GeneSAS_glimmer31212//8 (72.73%)9 (81.82%)78%
Dengue virusRef_annotation1110///
VADR11101 (100%)1 (100%)100%
VAPiD11101 (100%)1 (100%)100%
GeneSAS_genemarkS11//1 (100%)1 (100%)100%
GeneSAS_glimmer322//1 (100%)1 (100%)66.67%
Hepacivirus CRef_annotation1330///
VADR13301 (100%)1 (100%)100%
VAPiD13301 (100%)1 (100%)100%
GeneSAS_genemarkS22//0 (0%)0 (0%)0%
GeneSAS_glimmer333//1 (100%)1 (100%)50%
Norwalk virusRef_annotation3330///
VADR33303 (100%)3 (100%)100%
VAPiD33303 (100%)3 (100%)100%
GeneSAS_genemarkS33//3 (100%)3 (100%)100%
GeneSAS_glimmer333//3 (100%)3 (100%)100%
SpeciesSoftwareGenesCDSCDS with functionCDS without functionGenes with same start and end position (%)aGenes with same start positionSimilarity scoreb
SARS-CoV-2Ref_annotationc11134 (30.77%)9 (69.23%)///
VADRd///////
VAPiD10104 (40.00%)6 (60.00%)9 (81.82%%)9 (81.82%)86%
GeneSAS_genemarkS99//6 (54.55%)7 (63.64%)61%
GeneSAS_glimmer31212//8 (72.73%)9 (81.82%)78%
Dengue virusRef_annotation1110///
VADR11101 (100%)1 (100%)100%
VAPiD11101 (100%)1 (100%)100%
GeneSAS_genemarkS11//1 (100%)1 (100%)100%
GeneSAS_glimmer322//1 (100%)1 (100%)66.67%
Hepacivirus CRef_annotation1330///
VADR13301 (100%)1 (100%)100%
VAPiD13301 (100%)1 (100%)100%
GeneSAS_genemarkS22//0 (0%)0 (0%)0%
GeneSAS_glimmer333//1 (100%)1 (100%)50%
Norwalk virusRef_annotation3330///
VADR33303 (100%)3 (100%)100%
VAPiD33303 (100%)3 (100%)100%
GeneSAS_genemarkS33//3 (100%)3 (100%)100%
GeneSAS_glimmer333//3 (100%)3 (100%)100%

aPercentage is equal to (Genes with same start and end position/Ref_annotation genes)*100.

bSimilarityScore = ((Genes with same start position)/(Totalx + Totalz))*2*100. Totalx and Totalz are the total number of genes in the software annotation and reference annotation. The function is from BEACON (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539851/).

cref_annotation means the reference annotation. It is from NCBI RefSeq.

dVADR with default reference models (Flaviviridae and Caliciviridae) could not annotate the genome of SARS-CoV-2, because VADR could not find anyone in the default model library to meet the similarity standard for homology-based annotation. VADR can annotate SARS-CoV-2 by using its SARS-CoV-2 specific reference model.

Current challenges for genome annotation

Over the past two decades, genome annotation has experienced rapid development and prosperity, making our understanding of genomes more explicit and profound. Genome annotation has become a necessary foundation or critical element in various biological fields. However, the accuracy of genome annotation has always been a considerable challenge for researchers. The impact on the annotation accuracy comes mainly from two aspects as follows: the quality of genome assembly and the selection of annotation tools. Draft assemblies with errors and contamination decrease the accuracy of genome annotation. The existence of many draft genomes has led to less accurate genome annotation [12]. Further improvements in sequencing and assembly technologies may increase the quality of genome assembly, thereby improving the accuracy of downstream genome annotation. The other aspect impacting the accuracy of an annotation is the selection of annotation tools. Although many tools can currently be selected for genome annotation, the annotation results obtained from different tools contain apparent differences. Due to the lack of recognized evaluation standards and systems, it is challenging to evaluate and identify the quality of annotation tools objectively. The introduction of unified evaluation standards or systems would improve the overall quality and accuracy of genome annotation tools.

For the current annotation tools, functional annotation is based on searching homologs in protein databases. However, some reports have indicated that approximately 30%–40% of proteins in databases are under ‘unknown function’ [74–76]. Moreover, for many proteins from new species, their homologs are not found in current databases [77]. Therefore, some nonhomology-based functional annotation methods should be developed and integrated into genome annotation tools to enhance the ability to annotate proteins without homologs or only with homologs of unknown function in public databases.

More ab initio algorithms for viral genome annotation need to be developed to enhance the strength of viral genome annotation tools. In addition, more stand-alone genome annotation pipelines for eukaryotes with full functions should also be developed.

Key Points
  • We summarize the strategies and approaches for the genome annotation of pathogenic microorganisms.

  • Popular genome annotation tools for eukaryotes, prokaryotes and viruses are collected, and their characteristics are presented.

  • Real-world comparisons of different genome annotation tools using sample genomes are conducted.

  • Although genome annotation methods have shown rapid development and prosperity, researchers still face challenges such as the inconsistency of genome annotations between tools and the large proportion of proteins with unknown functions.

Funding

National Institute of Allergy and Infectious Diseases (grant no. 3U19AI089672 to L.C. and X.L.).

Yibo Dong: Research associate in bioinformatics research, College of Public Health, University of South Florida.

Chang Li: Postdoc scholar in bioinformatics and genomics research, College of Public Health, University of South Florida.

Kami Kim: Professor and director, Division of Infectious Disease and International Medicine, Department of Internal Medicine, Morsani College of Medicine, University of South Florida.

Liwang Cui: Cohen professor in Malaria Research, Division of Infectious Disease and International Medicine, Department of Internal Medicine, Morsani College of Medicine, University of South Florida.

Xiaoming Liu: Associate professor in population genomics, bioinformatics and genetics of human diseases research, College of Public Health, University of South Florida.

References

1.

Schopf
JW
.
Fossil evidence of Archaean life
.
Philos Trans R Soc Lond B Biol Sci
2006
;
361
(
1470
):
869
85
.

2.

Altermann
W
,
Kazmierczak
J
.
Archean microfossils: a reappraisal of early life on Earth
.
Res Microbiol
2003
;
154
(
9
):
611
7
.

3.

Cavalier-Smith
T
.
Cell evolution and Earth history: stasis and revolution
.
Philos Trans R Soc Lond B Biol Sci
2006
;
361
(
1470
):
969
1006
.

4.

McBrearty
S
,
Brooks
A
.
The revolution that wasn't: a new interpretation of the origin of modern human behavior
.
J Hum Evol
2000
;
39
(
5
):
453
563
.

5.

Rook
G
,
Bäckhed
F
,
Levin
BR
, et al. 
Evolution, human–microbe interactions, and life history plasticity
.
Lancet
2017
;
390
(
10093
):
521
30
.

6.

Cato
V
.
On Agriculture
.
Cambridge
:
Harvard University Press
,
1930
.

7.

Casadevall
A
,
Pirofski
L
.
Microbiology: ditch the term pathogen
.
Nature
2014
;
516
(
7530
):
165
6
.

8.

Taylor
LH
,
Latham
SM
,
Woolhouse
MEJ
.
Risk factors for human disease emergence
.
Philos Trans R Soc Lond B Biol Sci
2001
;
356
:
983
9
.

9.

Institute of Medicine (U.S.)
. Forum on microbial threats. Infectious disease emergence: past, present, and future. In:
Microbial Evolution and Co-Adaptation
.
Washington D.C.
:
National Academies Press (U.S.)
,
2009
.

10.

Woolhouse
MEJ
.
Population biology of emerging and re-emerging pathogens
.
Trends Microbiol
2002
;
10
:
S3
7
.

11.

Fleischmann
RD
,
Adams
MD
,
White
O
, et al. 
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd
.
Science
1995
;
269
(
5223
):
496
512
.

12.

Salzberg
SL
.
Next-generation genome annotation: we still struggle to get it right
.
Genome Biol
2019
;
20
(
1
):
92
.

13.

Haas
BJ
,
Zeng
Q
,
Pearson
MD
, et al. 
Approaches to fungal genome annotation
.
Mycology
2011
;
2
(
3
):
118
41
.

14.

Birney
E
,
Clamp
M
,
Durbin
R
.
GeneWise and Genomewise
.
Genome Res
2004
;
14
(
5
):
988
95
.

15.

Tatusova
T
,
DiCuccio
M
,
Badretdin
A
, et al. 
NCBI prokaryotic genome annotation pipeline
.
Nucleic Acids Res
2016
;
44
(
14
):
6614
24
.

16.

Schäffer
AA
,
Hatcher
EL
,
Yankie
L
, et al. 
VADR: validation and annotation of virus sequence submissions to GenBank
.
BMC Bioinform
2020
;
21
(
1
):
211
.

17.

Altschul
S
,
Gish
W
,
Miller
W
, et al. 
Basic local alignment search tool
.
J Mol Biol
1990
;
215
(
3
):
403
10
.

18.

Yoon
BJ
.
Hidden markov models and their applications in biological sequence analysis
.
Curr Genomics
2009
;
10
(
6
):
402
15
.

19.

Brejová
B
,
Vinar
T
,
Chen
Y
, et al. 
Finding genes in Schistosoma japonicum: annotating novel genomes with help of extrinsic evidence
.
Nucleic Acids Res
2009
;
37
(
7
):
e52
.

20.

Parra
G
,
Bradnam
K
,
Korf
I
.
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes
.
Bioinformatics
2007
;
23
(
9
):
1061
7
.

21.

Majoros
WH
,
Pertea
M
,
Salzberg
SL
.
TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders
.
Bioinformatics
2004
;
20
(
16
):
2878
9
.

22.

ter-Hovhannisyan
V
,
Lomsadze
A
,
Chernoff
YO
, et al. 
Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training
.
Genome Res
2008
;
18
(
12
):
1979
90
.

23.

Lomsadze
A
,
ter-Hovhannisyan
V
,
Chernoff
YO
, et al. 
Gene identification in novel eukaryotic genomes by self-training algorithm
.
Nucleic Acids Res
2005
;
33
(
20
):
6494
506
.

24.

Hyatt
D
,
Chen
G
,
Locascio
PF
, et al. 
Prodigal: prokaryotic gene recognition and translation initiation site identification
.
BMC Bioinform
2010
;
11
:
119
.

25.

Haas
BJ
,
Salzberg
SL
,
Zhu
W
, et al. 
Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments
.
Genome Biol
2008
;
9
:
R7
.

26.

Allen
JE
,
Salzberg
SL
.
JIGSAW: integration of multiple sources of evidence for gene prediction
.
Bioinformatics
2005
;
21
(
18
):
3596
603
.

27.

Cook
DE
,
Valle-Inclan
JE
,
Pajoro
A
, et al. 
Long-read annotation: automated eukaryotic genome annotation based on long-read cDNA sequencing
.
Plant Physiol
2019
;
179
(
1
):
38
54
.

28.

Shean
RC
,
Makhsous
N
,
Stoddard
GD
, et al. 
VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank
.
BMC Bioinform
2019
;
20
(
1
):
48
.

29.

Zhang
KY
,
Gao
YZ
,
Du
MZ
, et al. 
Vgas: a viral genome annotation system
.
Front Microbiol
2019
;
10
:
184
.

30.

Seemann
T
.
Prokka: rapid prokaryotic genome annotation
.
Bioinformatics
2014
;
30
(
14
):
2068
9
.

31.

Aziz
RK
,
Bartels
D
,
Best
AA
, et al. 
The RAST server: rapid annotations using subsystems technology
.
BMC Genom
2008
;
9
:
75
.

32.

Thibaud-Nissen
F
,
Souvorov
A
,
Murphy
T
, et al.  Eukaryotic genome annotation pipeline. In:
The NCBI Handbook
[Internet]
, 2nd edn.
Bethesda (M.D.)
:
National Center for Biotechnology Information (U.S.)
,
2013
.

33.

Steinbiss
S
,
Silva-Franco
F
,
Brunk
B
, et al. 
Companion: a web server for annotation and analysis of parasite genomes
.
Nucleic Acids Res
2016
;
44
(
W1
):
W29
34
.

34.

Zhou
N
,
Jiang
Y
,
Bergquist
TR
, et al. 
The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens
.
Genome Biol
2019
;
20
:
244
.

35.

Kulmanov
M
,
Hoehndorf
R
.
DeepGOPlus: improved protein function prediction from sequence
.
Bioinformatics
2020
;
36
(
2
):
422
9
.

36.

Jones
P
,
Binns
D
,
Chang
HY
, et al. 
InterProScan 5: genome-scale protein function classification
.
Bioinformatics
2014
;
30
(
9
):
1236
40
.

37.

Lagesen
K
,
Hallin
PF
,
Rødland
E
, et al. 
RNAmmer: consistent and rapid annotation of ribosomal RNA genes
.
Nucleic Acids Res
2007
;
35
(
9
):
3100
8
.

38.

van Baren
MJ
,
Brent
MR
.
Iterative gene prediction and pseudogene removal improves genome annotation
.
Genome Res
2006
;
16
(
5
):
678
85
.

39.

Laslett
D
,
Canback
B
.
ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences
.
Nucleic Acids Res
2004
;
32
(
1
):
11
6
.

40.

Kolbe
DL
,
Eddy
SR
.
Fast filtering for RNA homology search
.
Bioinformatics
2011
;
27
(
22
):
3102
9
.

41.

Jex
AR
,
Hall
RS
,
Littlewood
DT
, et al. 
An integrated pipeline for next-generation sequencing and annotation of mitochondrial genomes
.
Nucleic Acids Res
2010
;
38
(
2
):
522
33
.

42.

Bris
C
,
Goudenege
D
,
Desquiret-Dumas
V
, et al. 
Bioinformatics tools and databases to assess the pathogenicity of mitochondrial DNA variants in the field of next generation sequencing
.
Front Genet
2018
;
9
:
632
.

43.

Brettin
T
,
Davis
J
,
Disz
T
, et al. 
RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes
.
Sci Rep
2015
;
5
:
8365
.

44.

Huntemann
M
,
Ivanova
NN
,
Mavromatis
K
, et al. 
The standard operating procedure of the DOE-JGI microbial genome annotation pipeline (MGAP v.4)
.
Stand Genomic Sci
2015
;
10
:
86
.

45.

Humann
JL
,
Lee
T
,
Ficklin
S
, et al.  Structural and functional annotation of eukaryotic genomes with GenSAS. In:
Kollmar
M
(ed).
Gene Prediction: Methods and Protocols
.
New York
:
Humana Press
,
2019
,
29
51
.

46.

Overbeek
R
,
Olson
R
,
Pusch
GD
, et al. 
The SEED and the rapid annotation of microbial genomes using subsystems technology (RAST)
.
Nucleic Acids Res
2014
;
42
(
Database issue
):
D206
14
.

47.

Overbeek
R
,
Disz
T
,
Stevens
R
.
The seed: a peer-to-peer environment for genome annotation
.
Communications of the ACM
2004
;
47
(
11
):
46
51
.

48.

Schwengers
O
,
Hoek
A
,
Fritzenwanker
M
, et al. 
ASA3P: an automatic and scalable pipeline for the assembly, annotation and higher-level analysis of closely related bacterial isolates
.
PLoS Comput Biol
2020
;
16
(
3
):
e1007134
.

49.

Quijada
NM
,
Rodríguez-Lázaro
D
,
Eiros
JM
, et al. 
TORMES: an automated pipeline for whole bacterial genome analysis
.
Bioinformatics
2019
;
35
(
21
):
4207
12
.

50.

Tanizawa
Y
,
Fujisawa
T
,
Nakamura
Y
.
DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication
.
Bioinformatics
2018
;
34
(
6
):
1037
9
.

51.

Cooper
GM
.
The Cell: A Molecular Approach
, 2nd edn.
Sinauer Associates
:
Sunderland M.A
,
2000
.

52.

Haas
BJ
,
Delcher
AL
,
Mount
SM
, et al. 
Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies
.
Nucleic Acids Res
2003
;
31
(
19
):
5654
66
.

53.

Stanke
M
,
Steinkamp
R
,
Waack
S
, et al. 
AUGUSTUS: a web server for gene finding in eukaryotes
.
Nucleic Acids Res
2004
;
32
(
Web Server issue
):
W309
12
.

54.

Korf
I
.
Gene finding in novel genomes
.
BMC Bioinform
2004
;
5
:
59
.

55.

Cantarel
BL
,
Korf
I
,
Robb
SM
, et al. 
MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes
.
Genome Res
2008
;
18
(
1
):
188
96
.

56.

Hoff
KJ
,
Lomsadze
A
,
Borodovsky
M
, et al. 
Whole-genome annotation with BRAKER
.
Methods Mol Biol
2019
;
1962
:
65
95
.

57.

Brůna
T
,
Lomsadze
A
,
Borodovsky
M
.
GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins
.
NAR Genomics and Bioinformatics
2020
;
2
(
2
):
lqaa026
.

58.

Camacho
C
,
Coulouris
G
,
Avagyan
V
, et al. 
BLAST plus: architecture and applications
.
BMC Bioinform
2009
;
10
:
421
.

59.

Buchfink
B
,
Xie
C
,
Huson
DH
.
Fast and sensitive protein alignment using DIAMOND
.
Nat Methods
2015
;
12
(
1
):
59
60
.

60.

O'Leary
NA
,
Wright
MW
,
Brister
JR
, et al. 
Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation
.
Nucleic Acids Res
2016
;
44
(
D1
):
D733
45
.

61.

The UniProt Consortium
.
UniProt: the universal protein knowledgebase
.
Nucleic Acids Res
2017
;
45
(
D1
):
D158
69
.

62.

Otto
TD
,
Dillon
GP
,
Degrave
WS
, et al. 
RATT: rapid annotation transfer tool
.
Nucleic Acids Res
2011
;
39
(
9
):
e57
.

63.

Li
L
,
Stoeckert
CJ Jr
,
Roos
DS
.
OrthoMCL: identification of ortholog groups for eukaryotic genomes
.
Genome Res
2003
;
13
(
9
):
2178
89
.

64.

Finn
RD
,
Coggill
P
,
Eberhardt
RY
, et al. 
The Pfam protein families database: towards a more sustainable future
.
Nucleic Acids Res
2016
;
44
(
D1
):
D279
85
.

65.

Wheeler
TJ
,
Eddy
SR
.
nhmmer: DNA homology search with profile HMMs
.
Bioinformatics
2013
;
29
:
2487
9
.

66.

Kong
J
,
Huh
S
,
Won
JI
, et al. 
GAAP: a genome assembly + annotation pipeline
.
Biomed Res Int
2019
;
2019
:
4767354
.

67.

Conesa
A
,
Götz
S
,
García-Gómez
JM
, et al. 
Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research
.
Bioinformatics
2005
;
21
(
18
):
3674
6
.

68.

The Gene Ontology Consortium
,
Ashburner
M
,
Ball
CA
, et al. 
Gene ontology: tool for the unification of biology
.
Nat Genet
2000
;
25
(
1
):
25
9
.

69.

Kanehisa
M
,
Goto
S
.
KEGG: Kyoto encyclopedia of genes and genomes
.
Nucleic Acids Res
2000
;
28
(
1
):
27
30
.

70.

Panda
A
,
Chaudhari
NM
,
Tripathy
S
.
Genome annotator light (GAL): a Docker-based package for genome analysis and visualization
.
Genomics
2020
;
112
(
1
):
127
34
.

71.

Guo
FB
,
Zhang
CT
.
ZCURVE_V: a new self-training system for recognizing protein-coding genes in viral and phage genomes
.
BMC Bioinform
2006
;
7
:
9
.

72.

Wang
S
,
Sundaram
JP
,
Spiro
D
.
VIGOR, an annotation program for small viral genomes
.
BMC Bioinform
2010
;
11
:
451
.

73.

Kalkatawi
M
,
Alam
I
,
Bajic
VB
.
BEACON: automated tool for bacterial GEnome annotation ComparisON
.
BMC Genom
2015
;
16
(
1
):
616
.

74.

Meier
M
,
Sit
RV
,
Quake
SR
.
Proteome-wide protein interaction measurements of bacterial proteins of unknown function
.
Proc Natl Acad Sci U S A
2013
;
110
(
2
):
477
82
.

75.

Nadzirin
N
,
Firdaus-Raih
M
.
Proteins of unknown function in the Protein Data Bank (PDB): an inventory of true uncharacterized proteins and computational tools for their analysis
.
Int J Mol Sci
2012
;
13
(
10
):
12761
72
.

76.

Horan
K
,
Jang
C
,
Bailey-Serres
J
, et al. 
Annotating genes of known and unknown function by large-scale coexpression analysis
.
Plant Physiol
2008
;
147
(
1
):
41
57
.

77.

Antczak
M
,
Michaelis
M
,
Wass
MN
.
Environmental conditions shape the nature of a minimal bacterial genome
.
Nat Commun
2019
;
10
(
1
):
3100
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)