Driving through stop signs: predicting stop codon reassignment improves functional annotation of bacteriophages

Abstract The majority of bacteriophage diversity remains uncharacterized, and new intriguing mechanisms of their biology are being continually described. Members of some phage lineages, such as the Crassvirales, repurpose stop codons to encode an amino acid by using alternate genetic codes. Here, we investigated the prevalence of stop codon reassignment in phage genomes and its subsequent impacts on functional annotation. We predicted 76 genomes within INPHARED and 712 vOTUs from the Unified Human Gut Virome Catalogue (UHGV) that repurpose a stop codon to encode an amino acid. We re-annotated these sequences with modified versions of Pharokka and Prokka, called Pharokka-gv and Prokka-gv, to automatically predict stop codon reassignment prior to annotation. Both tools significantly improved the quality of annotations, with Pharokka-gv performing best. For sequences predicted to repurpose TAG to glutamine (translation table 15), Pharokka-gv increased the median gene length (median of per genome median) from 287 to 481 bp for UHGV sequences (67.8% increase) and from 318 to 550 bp for INPHARED sequences (72.9% increase). The re-annotation increased median coding capacity from 66.8% to 90.0% and from 69.0% to 89.8% for UHGV and INPHARED sequences predicted to use translation table 15. Furthermore, the proportion of genes that could be assigned functional annotation increased, including an increase in the number of major capsid proteins that could be identified. We propose that automatic prediction of stop codon reassignment before annotation is beneficial to downstream viral genomic and metagenomic analyses.

Bacteriophages, hereafter phages, are increasingly recognized as a vital component of microbial communities in all environments where they have been studied in detail [1][2][3].Phages are known to drive bacterial evolution and community composition through predator-prey dynamics and their potential as agents of horizontal gene transfer [4,5].The use of viral metagenomics, or viromics, has massively expanded our understanding of global viral diversity and shed light on the ecological roles that phages play [1][2][3].
Much of the study into viral communities has been conducted on the human gut.Here, viromics has uncovered ecologically important viruses that are difficult to bring into culture using standard laboratory techniques [6], shown the potential roles of viruses in disease states [3], and allowed for the recovery of enormous phage genomes larger than any brought into culture [7].As the majority of phage diversity remains uncharacterized, new and enigmatic diversification mechanisms are being described continually, including the potential use of alternative translation tables.
Lineage-specific stop codon reassignment has been described previously in bacteriophages [8,9], whereby a stop codon is repurposed to encode an amino acid.Notably, annotations of Lak "megaphages" assembled from metagenomes were observed to exhibit unusually low coding density (∼70%) when genes are predicted using the standard bacterial, archaeal, and plant plastid genetic code (translation table 11) [7], much lower than the value observed for most cultured phages of ∼90% [10].The Lak megaphages were predicted to repurpose the TAG stop codon into an as-of-yet unknown amino acid [7].More recently, uncultured members of Crassvirales have been predicted to repurpose TAG to glutamine (translation table 15) and TGA to tryptophan (translation table 4) [9], and since then, the use of translation table 15 has been experimentally validated in two phages belonging to Crassvirales [11].Although the reasons for stop codon reassignment in viruses are not yet understood, it has been suggested that stop codon reassignment is involved in the regulation of lytic genes that are involved in late-stage infection [12].
As stop codon reassignment may be widespread in human gut viruses, we trained a fork of Prodigal [13], named prodigal-gv, to predict stop codon reassignment in phages [14] and implemented it in the pyrodigal-gv library to provide efficient Cython bindings to Prodigal-gv with pyrodigal [15].Additionally, the virus discovery tool geNomad incorporates pyrodigal-gv to predict stop codon reassignment for viral sequences identified in metagenomes and viromes [14].Similarly, others have developed a tool for the detection of stop codon reassignment named MgCod [16].However, the detection of translation table 15 still has limited support in many tools, and the impacts of stop codon reassignment on functional annotation are rarely considered in viral genomics and metagenomics.
To assess the extent of stop codon reassignment in studied phage genomes and the impacts on functional annotation, we extracted phage genomes from INPHARED [10] and predicted those using alternative stop codons.We also added high-quality and complete vOTUs from the Unified Human Gut Virome Catalogue (UHGV; https://github.com/snayfach/UHGV)predicted to use alternative codons.The viral genomes were re-annotated using modified versions of the commonly used annotation pipelines Prokka [17] and Pharokka [18], implementing prodigalgv and pyrodigal-gv for gene prediction (see Supplementary Methods).Hereafter, the modified versions are referred to as Prokka-gv and Pharokka-gv.
From INPHARED, 49 genomes (0.24%) were predicted to use translation table 15, and 27 (0.13%) were predicted to use translation table 4. From the UHGV, 666 vOTUs (1.2%) were predicted to use translation table 15, and 46 (0.08%) were predicted to use translation table 4.These genomes and vOTUs were not constrained to one particular clade of viruses, being predicted to occur on both dsDNA viruses of the realm Duplodnaviria and ssDNA viruses of the realm Monodnaviria.At the family level, we see clear lineages of viruses that conserve this feature, such as the Suoliviridae of Crassvirales; however, it also appears sporadically in other families that are not widely known to re-purpose stop codons, such as the Demerecviridae (Supplementary Table S1).The appearance of stop codon repurposing on distant lineages of viruses suggests this is a phenomenon that has arisen on multiple occasions.The lower frequency of these genomes in cultured isolates (INPHARED) versus human viromes (UHGV) may be due to culturing and sequencing biases, perhaps including modifications to DNA that are known to be recalcitrant to sequencing.
Although the mechanism for stop codon reassignment in phages is not fully understood, suppressor tRNAs are suggested to play a role [8,19] Prediction of stop codon reassignment led to improved annotations for both Prokka and Pharokka, although the extent of this varied with the two datasets, translation tables, and annotation pipelines tested (Fig. 1; Supplementary Table S2; Supplementary Results).As Pharokka-gv outperformed Prokka-gv on all metrics tested, only Pharokka-gv is discussed further, and the equivalent results for Prokka-gv can be found in Supplementary Results.Despite using the same method for initially predicting ORFs, Prokka-gv filters more predicted ORFs than Pharokka-gv, which likely caused the difference in results.
The largest improvements to annotations were observed for sequences predicted to use translation table 15, for which Pharokka-gv increased the median gene length (median of per genome medians) from 287 to 481 bp for UHGV sequences (67.8% increase) and from 318 to 550 bp for INPHARED sequences (72.9% increase; Fig. 1A).This was also ref lected in an increase of median coding capacity from 66.8% to 90.0% for UHGV and 69.0% to 89.8% for INPHARED (Fig. 1B).Overall, these improved gene calls led to an increased gene length and a reduction in the number of predicted genes per kb (Supplementary Table S2).This was mirrored by an increase in the proportion of predicted proteins that could be assigned functions, with the median proportion of unannotated "hypothetical proteins" decreasing from 83.1% to 76.4% for UHGV and from 84% to 76.4% for INPHARED (Fig. 1C).As it is commonly used as a phylogenetic marker for bacteriophages, we investigated how commonly the major capsid protein (MCP) could be identified with and without predicted stop codon reassignment [21].For those viruses we predicted to use translation table 15, annotation using the default translation table 11 only resulted in the MCP being identified in 407/715 (56.9%) of the genomes.In contrast, using translation table 15 with Pharokka-gv, we could identify the MCP in 475/715 (66.4%).
When investigating the sequences for which translation table 4 was predicted to be optimal, a substantial increase was also observed for UHGV sequences, with Pharokka-gv increasing median gene length (median of per genome medians) from 350 to 518 bp (a 48.0% increase in length; Fig. 1A), resulting in an increase of median coding capacity from 78.0% to 90.4% (Fig. 1B), and a decrease in the median proportion of unannotated hypothetical proteins from 79.3% to 73.2% (Fig. 1C).However, the same was not observed for the 27 INPHARED genomes predicted to use translation table 4. Reannotation resulted in a modest increase in median gene length (median of per genome medians) from 573 to 588 bp (a 2.6% increase in length; Fig. 1A).Median coding capacity was not increased, with both Pharokka and Pharokka-gv obtaining 89.1% (Fig. 1B).As the median gene length and coding capacity for INPHARED sequences predicted to use translation table 4 are in line with expected values, their prediction to use an alternate translation table may not be true.Similarly, many of these sequences belong to the viruses Mycoplasma and Sprioplasma, bacteria that are known to use translation table 4. Perhaps similarities of these viruses and their hosts have led to the prediction of translation table 4. Reassuringly, the prediction of translation table 4 has not hindered the quality of annotations for those genomes that have not observed a clear improvement in functional annotation.
The analysis of viral (meta)genomes relies on accurate protein predictions, with predicted ORFs being used in common analyses, including (pro)phage prediction, functional annotation, and phylogenetic analyses.The clear differences in protein predictions with/without predicted stop codon reassignment will likely have downstream impacts upon these analyses.However, this We have demonstrated the impacts of stop codon reassignment in the functional annotation of phages and provided tools for the automatic prediction and annotation of viral genomes that repurpose stop codons.Our analysis highlights the need for accurate viral ORF prediction and further experimental validation to elucidate the mechanisms of stop codon reassignment.

Figure 1 .
Figure 1.Re-annotating with predicted stop codon reassignment increases the quality of annotations.Comparison of (A) median predicted gene length (bp), (B) coding capacity (%), and (C) proportion of unannotated "hypothetical" proteins for INPHARED genomes and UHGV vOTUs annotated with Pharokka (translation table 11 only) and Pharokka-gv (prediction of stop codon reassignment), grouped by dataset and predicted stop codon reassignment.Grey lines indicate pairing of the genomes across the two annotation strategies tested.Asterisk indicates significance at P ≤ 10e-10 with P determined by a paired sample T test and adjusted with the Benjamini-Hochberg procedure.
. Consistent with previous findings, we found 375/715 (52.4%) phages predicted to use translation table 15 encoded at least one suppressor tRNA corresponding to the amber stop codon (Sup-CTA tRNA), and 11/73 (15.1%) of those predicted to use translation table 4 encoded at least one suppressor tRNA corresponding to the opal stop codon (Sup-TCA tRNA) [8, 19, 20].Although fewer of those predicted to use translation table 4 encoded the relevant suppressor tRNA, 22/27 (81%) of the INPHARED phages predicted to use translation table 4 were viruses of Mycoplasma or Spiroplasma.As Mycoplasma and Sprioplasma are known to use translation table 4, many of the viruses predicted to use translation table 4 may be simply using the same translation table as their host.