Analysis of Spounaviruses as a Case Study for the Overdue Reclassification of Tailed Bacteriophages

ABSTRACT It is almost a cliche that tailed bacteriophages of the order Caudovirales are the most abundant and diverse viruses in the world. Yet, their taxonomy still consists of a single order with just three families: Myoviridae, Siphoviridae, and Podoviridae. Thousands of newly discovered phage genomes have recently challenged this morphology-based classification, revealing that tailed bacteriophages are genomically even more diverse than once thought. Here, we evaluate a range of methods for bacteriophage taxonomy by using a particularly challenging group as an example, the Bacillus phage SPO1-related viruses of the myovirid subfamily Spounavirinae. Exhaustive phylogenetic and phylogenomic analyses indicate that the spounavirins are consistent with the taxonomic rank of family and should be divided into at least five subfamilies. This work is a case study for virus genomic taxonomy and the first step in an impending massive reorganization of the tailed bacteriophage taxonomy.

a balanced minimum evolution tree with branch support via FASTME including subtree 145 pruning and regrafting post-processing (Lefort et al. 2015) for each of the formulas D0, D4, 146 and D6, respectively. Trees were visualized with FigTree (Rambaud 2007). Taxon 147 demarcations at the species, genus and family rank were estimated with the OPTSIL program 148 (Göker et al. 2009 were visualized using Itol (Letunic and Bork 2007). 174 Multiple alignments were generated for each OPC using Clustal Omega (Sievers et al. 175 2011). For each cluster, the amino acid identity between all protein pairs inside a cluster was 176 determined using multiple alignment. For all genome pairs, the AAI (Konstantinidis and 177 Tiedje 2005) was then computed and transformed into distance using the formula: 178 The resulting distance matrix was clustered and visualized as described above. 180 OPCs and multiple alignments for each cluster were used to determine a distance 181 similar to the distance used to generate the Phage Proteomic Tree. To estimate protein 182 distances, in this case, the distance (dist) function of the seqinR package (Charif et al. 9 DAB = 1− + + 192 in which SAB and SBA represent the summed bit-scores between tBLASTx searches of 193 A versus B, and B versusA, respectively, while SAA and SBB represent the summed tBLASTx 194 bit-scores of the self-queries of A and B, respectively. The resulting distance matrix was 195 clustered with BionJ (Gascuel 1997). 196 To investigate a genomic synteny-based classification signal, we developed a gene order-197 based metric built on dynamic programming, the Gene Order Alignment Tool (GOAT,Schuller 198 et al.: Python scripts are available on request, manuscript in preparation). GOAT first identified 199 protein-coding genes in the 93 spounavirin and spouna-like virus genomes using Prodigal 200 V2.6.3 in anonymous mode (Hyatt et al. 2010), and assigned them to the latest pVOGs 201 (Grazziotin et al. 2017)). pVOG alignments (9,518) were downloaded (http://dmk-202 brain.ecn.uiowa.edu/pVOGs/) and converted to profiles of hidden Markov models (HMM) 203 using HMMbuild (HMMer 3.1b2, (Finn et al. 2011)). Proteins were assigned to pVOGs using 204 HMMsearch (E-value <10-2) and used to generate a synteny profile of every genome. GOAT 205 accounted for gene replacements and distant homology by using an all-vs-all similarity matrix 206 between pVOG pairs based on HMM-HMM similarity (HH-suite 2.0.16) (Söding et al. 2005)). 207 Distant HHsearch similarity scores between protein families were calculated as the average of 208 reciprocal hits and used as substitution scores in the gene order alignment. The GOAT 209 algorithm identified the optimal gene order alignment score between two virus genomes by 210 implementing semi-global dynamic programming alignment based only on the order of pVOGs 211 identified on every virus genome. To account for virus genomes being cut at arbitrary positions 212 during sequence assembly, GOAT transmutes the gene order at all possible positions and in 213 both sense and antisense directions in search of the optimal alignment score. The optimal 214 GOAT alignment score GAB between every pair of virus genomes A and B, was converted to 215 a distance DAB as follows: 216 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; DAB = 1 − + + 217 in which GAB and GBA represent the optimal GOAT score between A and B, and B and 218 A, respectively, while GAA and GBB represent the GOAT scores of the self-alignments of A 219 and B, respectively. This pairwise distance matrix was clustered with BionJ (Gascuel 1997). 220 Prokka re-annotated genomes were used to create pan-, core-, and accessory genomes of 221 all selected spounavirins and spouna-like viruses (Seemann 2014 (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; Estimated models were used to generate phylograms with FastTree 2.1.7 (Price et al. 2010). 243 The program implements the approximately maximum-likelihood method with Shimodaira-244 Hasegawa tests to generate the tree and calculate support of the splits. This approach is much 245 faster than "traditional" maximum-likelihood methods with negligible accuracy loss (Price et 246 al. 2010;Darriba et al. 2011;Liu et al. 2011). 247

250
To determine the phylogenetic relationship between 93 known and alleged spounavirins, 251 we used genomic, proteomic and marker gene-based comparative strategies. Regardless of the 252 adopted phylogenetic approach applied, five separate, clear-cut clusters were identified. We 253 believe that these clusters have a common origin and ought to come together under one 254 umbrella taxon. We suggest to name this taxon "Herelleviridae," in honor of the 100th 255 anniversary of the discovery of prokaryotic viruses by Félix d'Hérelle ( The fourth cluster ("Twortvirinae," named in honor of Frederick William Twort, the 264 bacteriologist who discovered prokaryotic viruses in 1915) gathers staphylococci-infected 265 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; viruses that are similar to Staphylococcus phage Twort. The remaining cluster 266 ("Jasinskavirinae," named in honor of Stanisława Jasińska-Lewandowska who was one of the 267 first to study Listeria and its viruses) consists of viruses infecting Listeria that are similar to 268 Listeria phage P100. The classification in five clusters left three viruses unassigned at this rank: 269 Lactobacillus phage Lb338, Lactobacillus phage LP65, and Brochothrix phage A9. 270 These robust clusters can be further subdivided into smaller clades that correspond well 271 with the currently accepted genera. The evidence supporting this suggested taxonomic re-272 classification is presented in the following sections. Enterococcus viruses clustered as a clade representing a new genus (here suggested to be 281 named "Kochikohdavirus" after the place of origin of the type virus of the clade, 282 Enterococcus phage φEF24C (Uchiyama et al. 2008a(Uchiyama et al. , 2008b Fig. 3), a tBLASTx-based measure that compares whole genome 289 sequences at the amino acid level (Mizuno et al. 2013). 290 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; The patterns coalesced at a higher taxonomic level when the genomes were analyzed 291 using tBLASTx ( Supplementary Fig. 4). The Enterococcus viruses clustered into a single group 292 sharing 41% genome identity, whereas the Bacillus viruses fell into two major groups, a group 293 combining the genera Spo1virus and Cp51virus, and the remainder. All Staphylococcus viruses 294 clustered above ≈36% genome identity, whereas Listeria viruses grouped with more than 79% 295 genome identity. Overall, all these genomes were related at the level of at least 15% genome 296 identity. Lactobacillus and Brochothrix viruses remained genomic orphans, peripherally 297 related to the remainder of the viruses in this assemblage. 298 299 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  Clustering was performed using nucleotide similarities (BLASTn) or B) translated nucleotide 307 similarities (tBLASTx,). Genomes were compared in a pairwise fashion using Gegenees, 308 transformed into a distance matrix, clustered using R, and visualized as trees using Interactive 309 tree of life (Itol). The trees were rooted at Brochotrix phage A9. Genera and suggested 310 subfamilies are delineated with colored squares and colored circles, respectively. 311 312 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

313
The virus proteomic tree showed five robust groupings corresponding with the 314 suggested subfamilies (Fig. 2). Viruses that infect Bacillus fell into two groups as described 315 before, represented by the revised Spounavirinae subfamily and the suggested new subfamily 316 "Bastillevirinae." Similarly, the Listeria and Staphylococcus viruses formed their own 317 clusters, "Jasinskavirinae" and "Twortvirinae", respectively. This clustering suggests that the 318 major Bacillus, Listeria, and Staphylococcus virus groups are represented, but that further 319 representatives are required from the under-sampled groups. The suggested "Brockvirinae" 320 subfamily is under-sampled, and the grouping observed in the tree was not as well-supported 321 as the other clusters. 322 Among 1,296 singleton proteins and 2,070 protein clusters defined using the 323 orthologous protein clusters (OPC) approach, we identified 12 clusters common for all viruses 324 (  Table 3). This finding was in stark contrast with the results 327 from core genome analysis using Roary, which revealed only one core gene (the tail tube 328 protein gene). Upon closer inspection of the gene annotations, we found that these analyses 329 CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; The pairwise comparison of the predicted proteome content of the viruses revealed a 344 very low overall similarity at the protein level ( Supplementary Fig. 7). Most viruses shared less 345 than 10% of their proteins. However, at the suggested new subfamily rank, we observed 346 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. Genomic fluidity is a measure of the dissimilarity of genomes evaluated at the gene 354 level (Kislyuk et al. 2011). Accordingly, the genomic fluidity results followed those obtained 355 using proteome content analysis ( Supplementary Fig. 8). Despite a high genomic fluidity for 356 most of these viruses, the newly suggested subfamilies and genera were all supported. 357 The topology of the dendrogram obtained using the average amino acid identity (AAI) 358 approach also supported the suggested new taxonomic scheme ( Supplementary Fig. 8). The 359 AAI was greater than 35% within each subfamily and greater than 67% within each genus. The 360 genomic fluidity (0.15), suggesting that the protein sequences of wphviruses might have 364 evolved rapidly. 365 366 367 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; The pangenome of the spounavirins and spouna-like viruses (4,182 genes) as calculated 374 using Roary (Page et al. 2015) was further analyzed by clustering the genomes based on the 375 presence or absence of the accessory genes ( Supplementary Fig. 9). The obtained tree 376 supported the current division of the viruses into approved genera and the suggested new 377 subfamilies. 378 Many virus genomes are thought to be highly modular, with recombination and 379 horizontal gene transfer potentially resulting in "mosaic genomes" (Juhala et al. 2000;380 Krupovic et al. 2011). By clustering the spounavirin and spouna-like virus genomes based 381 solely on the gene order of their genomes, we investigated whether gene synteny was preserved 382 ( Supplementary Fig. 10). The results revealed that genomic rearrangements leave a measurable 383 evolutionary signal in all lineages, since the genomic architecture analysis clustered all viruses 384 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; according the suggested taxa. The potential exception was Bacillus phage Moonbeam 385 (Cadungog et al. 2015). However, we did not observe the high modularity that may be expected 386 with rampant mosaicism. The lack of considerable mosaicism supports the recent findings by 387 Bolduc et al. that, at most, about  Spounavirinae as a subclade of "Bastillevirinae" and the second protein shuffled viruses from 400 the genera Silviavirus and Kayvirus. This result may indicate that some degree of horizontal 401 gene transfer occurs between groups, which share common hosts. 402 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/220434 doi: bioRxiv preprint first posted online Nov. 16, 2017; are provisionally stashed within "unclassified" bins attached to the order or associated families 417 (Adams et al. 2017;Simmonds et al. 2017). We believe that this work is an important step 418 toward solving the problem of these "phage orphans". This study represents the first example 419 of a true taxonomic assessment from an "ensemble of methods". We are encouraged that the 420 combination of genome sequence analyses, virus proteomic trees, core protein clusters, gene 421 order genomic synteny (GOAT), and single gene phylogenies yields consistent and 422 complementary results. Convergence of the results reasserts the usefulness of genome-based 423 classification at a higher taxonomic rank and the ability of these methods to accommodate viral 424 diversity. 425 All evidence considered, we suggest that the spounavirins should be removed from the 426 family Myoviridae and given a family rank. Hence, we propose establishing a new family 427 "Herelleviridae", in the order Caudovirales next to a smaller Myoviridae family. The new 428 family would contain five subfamilies: Spounavirinae (sensu stricto), "Bastillevirinae", 429 with long non-contractile tails, and podovirids forming particles with short non-contractile 442 ones. By disconnecting morphotype and family classification, taxonomically related clades can 443 be grouped across different morphotypes. Such an approach would solve the problems of the 444 muviruses that are suggested to be classified in the family "Saltoviridae" ( . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.