-
PDF
- Split View
-
Views
-
Cite
Cite
Bhavya Papudeshi, Michael J Roach, Vijini Mallawaarachchi, George Bouras, Susanna R Grigson, Sarah K Giles, Clarice M Harker, Abbey L K Hutton, Anita Tarasenko, Laura K Inglis, Alejandro A Vega, Cole Souza, Lance Boling, Hamza Hajama, Ana Georgina Cobián Güemes, Anca M Segall, Elizabeth A Dinsdale, Robert A Edwards, Sphae: an automated toolkit for predicting phage therapy candidates from sequencing data, Bioinformatics Advances, Volume 5, Issue 1, 2025, vbaf004, https://doi.org/10.1093/bioadv/vbaf004
- Share Icon Share
Abstract
Phage therapy offers a viable alternative for bacterial infections amid rising antimicrobial resistance. Its success relies on selecting safe and effective phage candidates that require comprehensive genomic screening to identify potential risks. However, this process is often labor intensive and time-consuming, hindering rapid clinical deployment.
We developed Sphae, an automated bioinformatics pipeline designed to streamline the therapeutic potential of a phage in under 10 minutes. Using Snakemake workflow manager, Sphae integrates tools for quality control, assembly, genome assessment, and annotation tailored specifically for phage biology. Sphae automates the detection of key genomic markers, including virulence factors, antimicrobial resistance genes, and lysogeny indicators such as integrase, recombinase, and transposase, which could preclude therapeutic use. Among the 65 phage sequences analyzed, 28 showed therapeutic potential, 8 failed due to low sequencing depth, 22 contained prophage or virulent markers, and 23 had multiple phage genomes. This workflow produces a report to assess phage safety and therapy suitability quickly. Sphae is scalable and portable, facilitating efficient deployment across most high-performance computing and cloud platforms, accelerating the genomic evaluation process.
Sphae source code is freely available at https://github.com/linsalrob/sphae, with installation supported on Conda, PyPi, Docker containers.
1 Introduction
With the escalating global challenge of antimicrobial resistance (AMR) comes an increasing demand for alternative treatments against bacterial infections. Bacteriophages, or phages, are viruses that infect bacteria and are ubiquitous in the environment. The use of phages to treat bacterial infections is being explored worldwide as a replacement for antimicrobials. In the USA, Australia and parts of Europe, this treatment option is typically administered as a last resort care for severely ill patients under compassionate use (Wang et al. 2022, Singh et al. 2023). For phage therapy to be most effective, thorough safety assessments of the phage isolates must be performed before treatment. This includes experimental testing to confirm that the phage is a pure isolate and can infect the targeted pathogen variant. Additionally, phages are screened to specifically select lytic phages that infect, replicate, and quickly kill the bacterial host over temperate or lysogenic phages that integrate into the host genome during infection and remain stable (Bondy-Denomy et al. 2016, Altamirano and Barr 2021). Temperate phages are not preferred as they can protect the host by improving their fitness and may confer phage resistance through repressor-mediated immunity and/or superinfection exclusion (Samson et al. 2013, Bondy-Denomy et al. 2016). Additionally, phages are screened for large burst sizes and short latent periods to ensure quick and sustained infectivity and high adsorption rates to ensure effectiveness at low concentrations. The presence of these qualities is essential for high virulence to overwhelm the bacteria quickly (Rohde et al. 2018).
Phages and bacteria are locked in an evolutionary arms race where bacterial defense mechanisms like CRISPR-Cas systems co-evolve with phage countermeasures and can propagate throughout bacterial populations (Sorek et al. 2008, Fineran et al. 2009, Yirmiya et al. 2024). Interestingly, it has been shown that the development of phage resistance by the host often coincides with a loss of antibiotic resistance (Oromí-Bosch et al. 2023), allowing antibiotics to augment phage therapy by eliminating bacteria as they switch from an antibiotic- to a phage-resistant state. This synergy can be enhanced by using phage cocktails consisting of a range of phages with a combined specificity for a broad host range to further reduce the evolution of phage resistance within a bacterial infection. Especially if the cocktail includes phages with distinct mechanisms of host recognition and/or host factors so that resistance to one phage does not confer resistance to all phages (Altamirano and Barr 2021, Torres-Barceló et al. 2022, Wandro et al. 2022). Consequently, phage therapy has significant potential to be an effective treatment strategy for combating antibiotic resistance.
Efforts have been renewed to isolate phages for antibiotic-resistant bacterial pathogens in Europe, the USA, and Australia. The use of bacteriophages as therapeutic applications is subject to stringent regulatory oversight, particularly concerning toxin production and AMR genes. Ideally, phage isolates are sequenced during screening to predict their genetic potential for safety and efficacy (Luong et al. 2020, Cobián Güemes et al. 2023, Grigson et al. 2023). Bioinformatics analysis is now an indispensable component of this approach, ensuring sequencing data is processed efficiently to guide decision-making. For time-sensitive applications, rapid and scalable computational tools are essential, especially for large-scale screening initiatives. However, current analysis workflows can be time-consuming and require manual intervention, limiting their throughput and scalability.
Phage genomes are typically small, with a median size of about 40 kb, and can usually be assembled easily into complete genomes. However, the assembly process using default assembly tools obfuscates genome termini signals (Grigson et al. 2023). The recently published Phables algorithm (Mallawaarachchi et al. 2023) uses the assembly graph and read coverage to identify and correctly resolve genome termini. Alternatively, the HYbrid and Poly-polish Phage Assembly method utilizes long-read assemblies in combination with short-read sequencing (Elek et al. 2023). Phage genome sequences can also be contaminated with contigs from the bacterial host due to contamination during DNA extraction or due to induction of host prophages, resulting in mixed phage lysates (Cobián Güemes et al. 2023). Tools such as ViralVerify (Raiko 2021) identify and remove putative host contigs. Additionally, phage assemblies may be split over multiple contigs. Therefore, it is important to utilize tools such as CheckV (Nayfach et al. 2021) to determine if the assembly represents a single complete phage genome, and in identification of direct terminal repeats (DTRs). In some cases, even a single phage lysate can yield multiple phage genomes, making such tools indispensable for accurate phage identification (Gendre et al. 2022).
Once assembled, genome annotation tools like Pharokka (Bouras et al. 2023a) predict genes and assign biological functions using database searches against genes with known functions. However, assigning biological functions remains challenging, as 65% of viral proteins lack sequence homology to a protein with a known function (Grigson and Edwards 2023). Nonetheless, specific genes that serve as markers for temperate lifestyle (such as integrase genes) or confer phage resistance, including a search for toxin, virulence factors, or AMR, are screened for. Such genes are attributed to the risk of horizontal gene transfer (HGT) and propagation of resistance through bacterial populations. These genes are exclusionary criteria for phage therapeutic use; however, in cases where lytic phages are unavailable, engineered phages with disabled integrase and repressor functions have been demonstrated as an option (Dedrick et al. 2019, Strathdee et al. 2023). Meanwhile, anti-CRIPSR (Acr) proteins against their host and depolymerase genes are preferred as they can be advantageous in infection (Grigson et al. 2023). However, running all these tools sequentially is time-consuming and resource intensive.
Previous studies describe step-by-step tutorials and guidelines for assembling high-quality phage genomes and best practices for predicting and annotating their genes (Shen and Millard 2021, Turner et al. 2021, Grigson et al. 2023). We have developed Sphae, a rapid phage characterization workflow designed to streamline the selection of phage therapy candidates. This name is derived from “spae” which means “to foretell” with a modified spelling (s-ph-ae) denoting its specific focus on predicting a phage’s suitability for therapeutic use. This workflow helps quickly select phage therapy candidates based on their genomic potential, which can lead to faster medical interventions and improved patient survival outcomes. We developed this workflow to ensure reproducibility and consistency in the outputs, as using different databases and software versions can influence the results. This workflow is easy to install and run and generates a final summary text file with phage characteristics that anyone can examine to determine the therapeutic potential of a phage.
2 Methods
2.1 Workflow input
Sphae requires sequencing reads in fastq format, either paired-end short reads from Illumina or MGI sequencing platforms or unpaired long reads from Oxford Nanopore sequencing platforms. Oxford Nanopore raw sequencing output is in fast5 or pod5 format, which must be basecalled using Guppy or Dorado to convert the reads to fastq format before running this workflow.
2.2 Snakemake workflow manager
We utilized the Snakemake workflow manager (Köster and Rahmann 2012), which facilitates the automated installation of packages and dependencies. We also utilized Snaketool, which provides a user-friendly command line interface for Sphae to make running the pipeline as easy as possible (Roach et al. 2022). Workflow managers such as Snakemake provide scalability, reproducibility, reentrancy (Welivita et al. 2018), parallel processing of multiple samples, and integration for running commands and various steps on high-performance computing (HPC) systems and cloud-based environments (Roach et al. 2022). Therefore, we employed this template to leverage the capabilities of the Snakemake workflow manager in developing our pipeline for carrying out quality control, genome assembly, and annotation (Fig. 1).

Sphae workflow overview. The workflow processes sequencing reads from short- and/or long-read data in fastq format. The command sphae run, starts with quality control, filtering out low-quality reads and adaptor sequences. Processed reads are assembled, and the resulting assemblies are processed to confirm complete phage genomes in each sample. The phage genomes are annotated to identify the genes and assign biological functions. The final output folder contains the assembled genome (fasta format), annotations (GenBank format), a Circos plot (PNG format), and a summary text file detailing phage characteristics.
2.3 Steps in the workflow
Quality control: Fastp (Chen et al. 2018) and Filtlong (Wick 2018) are run to remove low-quality reads and trim adaptor sequences to ensure only high-quality reads are retained for downstream analysis.
Read subsampling: Rasusa (Hall 2022) is run to subsample up to 10 million base pairs per sample to keep an ideal 25× to 100× genome coverage for phage assembly (Grigson et al. 2023).
Assembly process: Paired-end short reads are assembled using MEGAHIT (Li et al. 2016), while long-read assemblies are conducted using Flye (Kolmogorov et al. 2019). Although recent advances in Nanopore sequencing chemistry have reduced the need for long-read polishing (Bouras et al. 2024b), Medaka (Nanoporetech Consortium 2022) is used to correct older, more error-prone reads.
Completeness assessment: Assembled contigs are classified using:
ViralVerify (Raiko 2021) to identify viral, plasmid, or bacteria origin using gene content,
CheckV (Nayfach et al. 2021) to determine the completeness of the viral contigs by comparing the genomes against a database of viral genomes and identifying the conserved gene markers and regions,
custom Python script to assess contig connectivity within the assembly graph (Mallawaarachchi et al. 2023), and
Overall, only contigs classified as viral by ViralVerify (longer than 1000 base pairs and having a completeness score of over 70%) are selected for further analysis. In cases of multiple genomes in a sample, each genome is saved as a separate phage genome.
Gene annotation is performed using Pharokka (Bouras et al. 2023a). Gene prediction is conducted using PHANOTATE (McNair et al. 2019) or Pyrodigal (Larralde 2022), followed by functional annotation through comparison with the PHROGs database (Terzian et al. 2021). In addition, genes are also run against:
AMR gene databases: CARD (Alcock et al. 2020),
virulence factor database; VFDB (Liu et al. 2022),
CRISPR recognition tool; MinCED (Bland et al. 2007),
anti-CRISPR (Acr) gene detection using AcrDB (Huang et al. 2021),
anti-phage systems using DefenseFinder (Tesson et al. 2022), and
tRNA genes using tRNAscanSE (Chan et al. 2021) and tmRNA using ARAGORN (Laslett and Canback 2004).
Taxonomic assignment is performed within Pharokka, via MASH (Ondov et al. 2016) that compares the genome against the phage INPHARED database (Cook et al. 2021).
Hypothetical gene analysis: To address the prevalence of remaining hypothetical genes, Sphae uses:
Phold applies the ProstT5 (Heinzinger et al. 2024) protein language model to generate a structural representation for each gene. These are compared against a database of predicted phage protein structures using FoldSeek (van Kempen et al. 2024) to assign potential functions.
The resulting Genbank files are further processed through Phynteny (Grigson and Mallawaarachchi 2023), which utilizes a long short-term memory model trained with phage synteny to refine gene function predictions.
Phage therapy suitability: The annotated genome is systematically analyzed for key markers, including integrase, recombinase, transposase, toxins, AMR, and virulence genes.
2.4 Workflow output
Each workflow step yields a set of files, not all directly pertinent for deciding the therapeutic potential of the phage. Sphae workflow produces a “FINAL” directory containing essential summary files to streamline the output. These files include:
assembled phage genome (.fasta)
phage annotations (.gbk)
genome plot (.png)
summary table (.tsv): annotations from the three tools, tracking which tool assigned a function to the gene
summary (.txt): phage characteristics described in Table 1
Phage characteristic . | Value . | Explanation . |
---|---|---|
Sample name | Bc01 | Sample name |
Total length of reads after QC and subsampling | 5 363 156 bp | Total length of reads used for assembly to help calculate genome coverage |
Length | 100 743 | Length of the phage genome assembled |
Circular | False | Was the genome assembled to be circular, according to the information provided in the assembly graph? For more information, you can visualize the file ending in .gfa with Bandage (Wick et al. 2015). |
Graph connections | 0 | If the assembly generated fragmented contigs due to low coverage, the graph shows potential connections, offering clues for identifying terminal repeats and low-complexity regions. Visualize the file ending in .gfa with Bandage (Wick et al. 2015). |
Direct terminal repeat (DTR) found | – | Is DTR detected by CheckV (Nayfach et al. 2021) in the phage contig |
Completeness | 100.0 | Phage completeness score from CheckV |
Contamination | 0.0 | Contamination score from CheckV |
Taxon description | Kehishuvirus sp. tikkala | Assigned taxon name from Pharokka (Bouras et al. 2023a) output, comparing the phage genome against the INPHARED database (Cook et al. 2021) using Mash (Ondov et al. 2016) |
Taxa result: matching hashes | 972/1000 | How close the phage isolated is to the assigned taxon. Results from Pharokka Mash sketch against the INPHARED database |
Lowest taxon classification | Kehishuvirus | The lowest taxon rank assigned |
Isolation host of the described taxa | Bacteroides cellulosilyticus | Bacterial host of the assigned taxa from the INPHARED database |
Number of CDS | 154 | Number of genes identified in the genome from Pharokka results |
Total number of CDS annotated as “hypothetical protein” | 91 | Counting only the genes annotated as hypothetical, which have not been assigned a biological function or have ambiguous descriptions in Phynteny (Grigson and Mallawaarachchi 2023) output |
GC content (proportion) | 0.35 | GC content from Pharokka result |
Percent coding density | 91.3 | Phages generally have high coding capacity, so if the density is low, it could indicate issues with gene calling for this phage |
Prophage or temperate lifestyle markers | No integrases, recombinases, or transposases found | These genes indicate the phage can have a temperate lifestyle, which would most likely exclude it from use in therapy. Results from Pharokka, Phold, and Phynteny searches |
Toxin genes | No toxins found | Search for genes with “toxins” in the gene description from the final Phynteny output |
Virulence genes | No antimicrobial resistance (AMR) genes found, no virulence factors found | Search against the CARD (Alcock et al. 2020) and VFDB (Liu et al. 2022) databases using Pharokka and Phold results |
Defense genes | No anti-CRISPR or spacers found, no defense genes found | Pharokka and Phold search the genes against ACR (Huang et al. 2021) and DefenseFinder (Tesson et al. 2022) databases |
Phage characteristic . | Value . | Explanation . |
---|---|---|
Sample name | Bc01 | Sample name |
Total length of reads after QC and subsampling | 5 363 156 bp | Total length of reads used for assembly to help calculate genome coverage |
Length | 100 743 | Length of the phage genome assembled |
Circular | False | Was the genome assembled to be circular, according to the information provided in the assembly graph? For more information, you can visualize the file ending in .gfa with Bandage (Wick et al. 2015). |
Graph connections | 0 | If the assembly generated fragmented contigs due to low coverage, the graph shows potential connections, offering clues for identifying terminal repeats and low-complexity regions. Visualize the file ending in .gfa with Bandage (Wick et al. 2015). |
Direct terminal repeat (DTR) found | – | Is DTR detected by CheckV (Nayfach et al. 2021) in the phage contig |
Completeness | 100.0 | Phage completeness score from CheckV |
Contamination | 0.0 | Contamination score from CheckV |
Taxon description | Kehishuvirus sp. tikkala | Assigned taxon name from Pharokka (Bouras et al. 2023a) output, comparing the phage genome against the INPHARED database (Cook et al. 2021) using Mash (Ondov et al. 2016) |
Taxa result: matching hashes | 972/1000 | How close the phage isolated is to the assigned taxon. Results from Pharokka Mash sketch against the INPHARED database |
Lowest taxon classification | Kehishuvirus | The lowest taxon rank assigned |
Isolation host of the described taxa | Bacteroides cellulosilyticus | Bacterial host of the assigned taxa from the INPHARED database |
Number of CDS | 154 | Number of genes identified in the genome from Pharokka results |
Total number of CDS annotated as “hypothetical protein” | 91 | Counting only the genes annotated as hypothetical, which have not been assigned a biological function or have ambiguous descriptions in Phynteny (Grigson and Mallawaarachchi 2023) output |
GC content (proportion) | 0.35 | GC content from Pharokka result |
Percent coding density | 91.3 | Phages generally have high coding capacity, so if the density is low, it could indicate issues with gene calling for this phage |
Prophage or temperate lifestyle markers | No integrases, recombinases, or transposases found | These genes indicate the phage can have a temperate lifestyle, which would most likely exclude it from use in therapy. Results from Pharokka, Phold, and Phynteny searches |
Toxin genes | No toxins found | Search for genes with “toxins” in the gene description from the final Phynteny output |
Virulence genes | No antimicrobial resistance (AMR) genes found, no virulence factors found | Search against the CARD (Alcock et al. 2020) and VFDB (Liu et al. 2022) databases using Pharokka and Phold results |
Defense genes | No anti-CRISPR or spacers found, no defense genes found | Pharokka and Phold search the genes against ACR (Huang et al. 2021) and DefenseFinder (Tesson et al. 2022) databases |
Phage characteristic . | Value . | Explanation . |
---|---|---|
Sample name | Bc01 | Sample name |
Total length of reads after QC and subsampling | 5 363 156 bp | Total length of reads used for assembly to help calculate genome coverage |
Length | 100 743 | Length of the phage genome assembled |
Circular | False | Was the genome assembled to be circular, according to the information provided in the assembly graph? For more information, you can visualize the file ending in .gfa with Bandage (Wick et al. 2015). |
Graph connections | 0 | If the assembly generated fragmented contigs due to low coverage, the graph shows potential connections, offering clues for identifying terminal repeats and low-complexity regions. Visualize the file ending in .gfa with Bandage (Wick et al. 2015). |
Direct terminal repeat (DTR) found | – | Is DTR detected by CheckV (Nayfach et al. 2021) in the phage contig |
Completeness | 100.0 | Phage completeness score from CheckV |
Contamination | 0.0 | Contamination score from CheckV |
Taxon description | Kehishuvirus sp. tikkala | Assigned taxon name from Pharokka (Bouras et al. 2023a) output, comparing the phage genome against the INPHARED database (Cook et al. 2021) using Mash (Ondov et al. 2016) |
Taxa result: matching hashes | 972/1000 | How close the phage isolated is to the assigned taxon. Results from Pharokka Mash sketch against the INPHARED database |
Lowest taxon classification | Kehishuvirus | The lowest taxon rank assigned |
Isolation host of the described taxa | Bacteroides cellulosilyticus | Bacterial host of the assigned taxa from the INPHARED database |
Number of CDS | 154 | Number of genes identified in the genome from Pharokka results |
Total number of CDS annotated as “hypothetical protein” | 91 | Counting only the genes annotated as hypothetical, which have not been assigned a biological function or have ambiguous descriptions in Phynteny (Grigson and Mallawaarachchi 2023) output |
GC content (proportion) | 0.35 | GC content from Pharokka result |
Percent coding density | 91.3 | Phages generally have high coding capacity, so if the density is low, it could indicate issues with gene calling for this phage |
Prophage or temperate lifestyle markers | No integrases, recombinases, or transposases found | These genes indicate the phage can have a temperate lifestyle, which would most likely exclude it from use in therapy. Results from Pharokka, Phold, and Phynteny searches |
Toxin genes | No toxins found | Search for genes with “toxins” in the gene description from the final Phynteny output |
Virulence genes | No antimicrobial resistance (AMR) genes found, no virulence factors found | Search against the CARD (Alcock et al. 2020) and VFDB (Liu et al. 2022) databases using Pharokka and Phold results |
Defense genes | No anti-CRISPR or spacers found, no defense genes found | Pharokka and Phold search the genes against ACR (Huang et al. 2021) and DefenseFinder (Tesson et al. 2022) databases |
Phage characteristic . | Value . | Explanation . |
---|---|---|
Sample name | Bc01 | Sample name |
Total length of reads after QC and subsampling | 5 363 156 bp | Total length of reads used for assembly to help calculate genome coverage |
Length | 100 743 | Length of the phage genome assembled |
Circular | False | Was the genome assembled to be circular, according to the information provided in the assembly graph? For more information, you can visualize the file ending in .gfa with Bandage (Wick et al. 2015). |
Graph connections | 0 | If the assembly generated fragmented contigs due to low coverage, the graph shows potential connections, offering clues for identifying terminal repeats and low-complexity regions. Visualize the file ending in .gfa with Bandage (Wick et al. 2015). |
Direct terminal repeat (DTR) found | – | Is DTR detected by CheckV (Nayfach et al. 2021) in the phage contig |
Completeness | 100.0 | Phage completeness score from CheckV |
Contamination | 0.0 | Contamination score from CheckV |
Taxon description | Kehishuvirus sp. tikkala | Assigned taxon name from Pharokka (Bouras et al. 2023a) output, comparing the phage genome against the INPHARED database (Cook et al. 2021) using Mash (Ondov et al. 2016) |
Taxa result: matching hashes | 972/1000 | How close the phage isolated is to the assigned taxon. Results from Pharokka Mash sketch against the INPHARED database |
Lowest taxon classification | Kehishuvirus | The lowest taxon rank assigned |
Isolation host of the described taxa | Bacteroides cellulosilyticus | Bacterial host of the assigned taxa from the INPHARED database |
Number of CDS | 154 | Number of genes identified in the genome from Pharokka results |
Total number of CDS annotated as “hypothetical protein” | 91 | Counting only the genes annotated as hypothetical, which have not been assigned a biological function or have ambiguous descriptions in Phynteny (Grigson and Mallawaarachchi 2023) output |
GC content (proportion) | 0.35 | GC content from Pharokka result |
Percent coding density | 91.3 | Phages generally have high coding capacity, so if the density is low, it could indicate issues with gene calling for this phage |
Prophage or temperate lifestyle markers | No integrases, recombinases, or transposases found | These genes indicate the phage can have a temperate lifestyle, which would most likely exclude it from use in therapy. Results from Pharokka, Phold, and Phynteny searches |
Toxin genes | No toxins found | Search for genes with “toxins” in the gene description from the final Phynteny output |
Virulence genes | No antimicrobial resistance (AMR) genes found, no virulence factors found | Search against the CARD (Alcock et al. 2020) and VFDB (Liu et al. 2022) databases using Pharokka and Phold results |
Defense genes | No anti-CRISPR or spacers found, no defense genes found | Pharokka and Phold search the genes against ACR (Huang et al. 2021) and DefenseFinder (Tesson et al. 2022) databases |
2.5 Phage sampling and sequencing
Escherichia coli strain CoGEN001851 (BEI Resources: Catalog number NR-4359) was received as a glycerol stock from BEI resources. The strain was plated on Brain-Heart Infusion media, supplemented with 1.5% agar (w/v), MgSO4, and MgCl2 to a final concentration of 10 mM and 2 mM, respectively. The culture plates were incubated at 37°C for 24 h. The phages were isolated from untreated sewage water (influent) collected from the waste treatment plant in Cardiff, California, as described in Papudeshi et al. (2023). An isolated plaque was selected from each plate and purified further. Phage DNA was then extracted, and E.coli phages were sequenced using Oxford Nanopore MinION sequencing according to the manufacturer’s instructions, using Oxford Nanopore Rapid Barcoding Sequencing Kit (SQK-RBK0004) and sequenced on Flowcell R9.4.1 (FLO-MIN106) as described in Papudeshi et al. (2023). The sequencing data were deposited to the Sequence Read Archive (SRA) in Bioproject PRJNA737576. The resulting fast5 reads were run through Guppy v6.0.1 with model dna_r9.4.1_450bps_hac for the Nanopore sequenced isolates. The resulting fastq reads were then run through the Sphae workflow.
2.6 Datasets
The workflow was tested on phages isolated from the above commercially available E.coli strains, and with publicly available sequence reads or genomes for Klebsiella, Salmonella, and Achromobacter phages (Table 2 and Table S1). Additionally, we included one dataset with five samples that included mixed Caudovirictes phages from multiple bacterial hosts to demonstrate the potential of Sphae workflow in assembling and separating each phage (Table 2 and Table S1). The reads were downloaded from SRA using sra-tools in fastq format as input for Sphae.
Study . | Number of phage samples . | Sequencing platform . | Bacterial host . | Bioproject . | Reference . |
---|---|---|---|---|---|
E.coli phages | 14 | MinION | E.coli strain CoGEN001851 (BEI Resources: Catalog number, NR-4359) | PRJNA737576 | This study |
Klebsiella phages | 20 | MinION, Illumina NextSeq | Klebsiella michiganensis, Klebsiella oxytoca, Klebsiella quasipneumoniae, Klebsiella variicola | PRJNA914245 | (Elek et al. 2023) |
Salmonella phages | 11 | Illumina MiSeq | Salmonella enterica subsp. enterica serovar Typhimurium (ATCC 14028S) | PRJNA914245 | (Gendre et al. 2022) |
Achromobacter phages | 15 | Illumina MiSeq | Achromobacter xylosoxidans strain 19–32 | PRJEB33638 | (Cobián Güemes et al. 2023) |
Mixed Caudovirictes phages | 5 | Illumina MiSeq | PRJNA222858 | NA |
Study . | Number of phage samples . | Sequencing platform . | Bacterial host . | Bioproject . | Reference . |
---|---|---|---|---|---|
E.coli phages | 14 | MinION | E.coli strain CoGEN001851 (BEI Resources: Catalog number, NR-4359) | PRJNA737576 | This study |
Klebsiella phages | 20 | MinION, Illumina NextSeq | Klebsiella michiganensis, Klebsiella oxytoca, Klebsiella quasipneumoniae, Klebsiella variicola | PRJNA914245 | (Elek et al. 2023) |
Salmonella phages | 11 | Illumina MiSeq | Salmonella enterica subsp. enterica serovar Typhimurium (ATCC 14028S) | PRJNA914245 | (Gendre et al. 2022) |
Achromobacter phages | 15 | Illumina MiSeq | Achromobacter xylosoxidans strain 19–32 | PRJEB33638 | (Cobián Güemes et al. 2023) |
Mixed Caudovirictes phages | 5 | Illumina MiSeq | PRJNA222858 | NA |
Study . | Number of phage samples . | Sequencing platform . | Bacterial host . | Bioproject . | Reference . |
---|---|---|---|---|---|
E.coli phages | 14 | MinION | E.coli strain CoGEN001851 (BEI Resources: Catalog number, NR-4359) | PRJNA737576 | This study |
Klebsiella phages | 20 | MinION, Illumina NextSeq | Klebsiella michiganensis, Klebsiella oxytoca, Klebsiella quasipneumoniae, Klebsiella variicola | PRJNA914245 | (Elek et al. 2023) |
Salmonella phages | 11 | Illumina MiSeq | Salmonella enterica subsp. enterica serovar Typhimurium (ATCC 14028S) | PRJNA914245 | (Gendre et al. 2022) |
Achromobacter phages | 15 | Illumina MiSeq | Achromobacter xylosoxidans strain 19–32 | PRJEB33638 | (Cobián Güemes et al. 2023) |
Mixed Caudovirictes phages | 5 | Illumina MiSeq | PRJNA222858 | NA |
Study . | Number of phage samples . | Sequencing platform . | Bacterial host . | Bioproject . | Reference . |
---|---|---|---|---|---|
E.coli phages | 14 | MinION | E.coli strain CoGEN001851 (BEI Resources: Catalog number, NR-4359) | PRJNA737576 | This study |
Klebsiella phages | 20 | MinION, Illumina NextSeq | Klebsiella michiganensis, Klebsiella oxytoca, Klebsiella quasipneumoniae, Klebsiella variicola | PRJNA914245 | (Elek et al. 2023) |
Salmonella phages | 11 | Illumina MiSeq | Salmonella enterica subsp. enterica serovar Typhimurium (ATCC 14028S) | PRJNA914245 | (Gendre et al. 2022) |
Achromobacter phages | 15 | Illumina MiSeq | Achromobacter xylosoxidans strain 19–32 | PRJEB33638 | (Cobián Güemes et al. 2023) |
Mixed Caudovirictes phages | 5 | Illumina MiSeq | PRJNA222858 | NA |
2.7 Benchmarking
We benchmarked Sphae’s performance on five published datasets with 65 samples (Table 2) to compare its functionality and performance. These datasets include known phages in each sample as they were experimentally isolated, assembled, and annotated to serve as reliable references. Previous studies have described guidelines (Shen and Millard 2021, Turner et al. 2021, Grigson et al. 2023) for assembling high-quality phage genomes and annotating their genes; we have used these tutorials as a framework to develop Sphae. All programs and dependency versions used for benchmarking can be found in Table S2. This adaptable workflow is designed with versatility, making it compatible with future updates and new software. As there are no comparable workflows, we assessed the workflow performance using datasets with varying complexities, different numbers of samples, and different sequencing platforms, including samples with single or multiple phages.
Running the workflow in parallel mode processes each phage genome as an individual job, thus speeding the output time. This can be set up on HPC systems using a user-provided profile.
2.8 Runtime performance comparison
To evaluate Sphae’s runtime, we measured the wall-clock runtime on a RedHat Linux release 8.10 machine with an AMD EPYC 7551 CPU @ 2.55 GHz. We analyzed sequencing data for a Klebsiella phage Amrap using both paired-end and long-read sequencing methods with default settings in Sphae. The analysis was conducted on six or eight threads and 32 GB of memory to mimic commonly available consumer hardware. Each paired-end, long-read with polishing, long-read without polishing, and annotate modes were executed five times with the same command, and the median wall-clock times with 8 and 16 threads were recorded.
3 Results
3.1 Determining complete genomes from assembly
Depending on the complexity and genome coverage of the phage, assembly steps can result in different results (Fig. 2). Ideally, the phage genomes are completely assembled into circular or linear genomes (Fig. 2A and B). In other cases, the DTR that plays a role in packaging cannot be resolved due to its low complexity during assembly; in this case, the code considers the longer contig as a final genome representation (Fig. 2C). Similarly, the DTR regions can cause multiple genomes to be tangled in an assembly graph (Fig. 2D). In this case, all the contigs identified as complete phage genomes by CheckV are considered separate phage genomes from a sample. In the final case, the assembly generates fragmented phage genomes; if the contigs are long enough to determine if they are components of a phage genome (Fig. 2E), or they may be too fragmented, making it challenging to determine if they are viral (Fig. 2F). In both the latter cases, the poor quality of the assembly can lead to poor annotation and, therefore, they are not considered further in the workflow.

Assembly graphs visualized using Bandage: (A) complete circular phage genome, (B) complete linear phage genome, (C) near-complete phage genome, with terminal repeats hard to assemble, (D) multiple phage genomes in one assembly, (E) fragmented phage genome, likely due to low genome coverage, and (F) multiple phage genomes in one assembly—in this case, there are three phages in the sample.
3.2 Assembly summary
We assembled 65 samples across the five datasets, described in Table 1, using Sphae v1.4.3 with the tools and their version listed in Table S2, which assembled 84 phages. In the summary output (Table S3), we indicate if the assemblies have failed, if the assembly itself has not produced contigs, or if the assembled contigs were fragmented.
In the E.coli dataset, some sequences lacked sufficient genome coverage, resulting in unassembled phage genomes (Fig. S1). Seven of the 14 samples were assembled, 4 generated fragmented assemblies, and 3 failed during assembly (Fig. S1). This dataset highlighted how Sphae captures the presence of poorly sequenced samples, suggesting to the user that further sequencing data is required to generate suitable genomes for these phages.
In the case of Klebsiella phages, short- and long-read sequences were assembled separately, revealing differences between the two sequencing platforms. Paired-end reads generated complete, circular assemblies with assembly graphs, including one sample featuring one region with multiple contigs tangled together (Fig. 2C, Fig. S2). Conversely, Nanopore read assemblies resulted in complete, linear phage genomes (Fig. 2B, Fig. S3), lacking the DTR region (Table S3). With the Salmonella and Achromobacter phage datasets, complexity arose from samples containing multiple phage genomes. While Sphae was able to assemble phage genomes for each sample (Figs S4 and S5), two samples (Se_F6 and Salfasec_13) contained two assembled phage genomes (Fig. S4B and J), and two samples (Se_F3 and Se_F1) contained three phage genomes (Fig. S4C and E). This observation aligns with the genome characteristics outlined in the original publication (Gendre et al. 2022), confirming the presence of multiple phages in specific samples. However, 3 of the 11 samples were potentially contaminated with E.coli X174, likely introduced during the sequencing process. Many Illumina sequences contain X174 contamination as it is used as a spike-in during 16S rRNA sequencing. Similarly, the Achromobacter phage dataset had multiple samples containing two phage genomes per isolate, with 11 out of the 15 phages having either 30, 40, or 70 kb genome lengths. The assembly graph illustrates a structure similar to Fig. 2D, with two phages connected by the DTR region (Fig. S5).
We further ran Sphae on a dataset comprising five mixed Caudoviricetes samples (SRR8788475, SRR8869231, SRR8869234, SRR8869239, and SRR8869241), demonstrating Sphae’s capacity to accurately resolve and separate multiple phages within each sample. For instance, sample SRR8788475 included four phages, and Sphae assembled all four phages (Fig. S6B, Table S3), similarly two phages in SRR8869231 were assembled (Fig. S6C), three phages from SRR8869239 (Fig. S6E) and SRR8869241 (Fig. S6F). Interestingly, sample SRR8869234 was listed to include two phages, but Sphae assembled three phages, Staphylococcus, Klebsiella, and Enterobacter phage (Fig. S6D). Importantly, the resulting assembly graphs across all samples were connected by short sequence fragments (Fig. 2D), reflecting the complexity of resolving multiple phages.
3.3 Phage annotation
Phage genes were identified in the 84 assembled phages using PHANOTATE with a default translation table 11. Since phages can potentially use alternative stop codons (Borges et al. 2022, Peters et al. 2022, Pfennig et al. 2023), the summary report includes coding density. If low coding density is reported, the assembled phage genomes can be rerun with sphae annotate, by changing the config file to utilize Pharokka’s pyrodigal-gv gene prediction (Larralde 2022). The average coding density for the 84 phages is 95.17%, with a median of 95.20%, confirming the appropriateness of the default translation table.
To enhance the accuracy of gene annotation, Sphae employs an approach that leverages sequence similarity via Pharokka, structural information through Phold, and synteny information through Phynteny. These methods were selected to provide a multi-faceted view of the gene functions and improve annotations. Initially, 8321 genes were predicted across all 84 phages, with 4871 genes (58.53%) classified as hypothetical proteins, which includes genes with ambiguous descriptions based solely on sequence similarity searches. However, integrating structural and synteny information, an additional 553 proteins were annotated, highlighting the effectiveness of this combined approach (Fig. 3B). Although synteny information did not improve annotation for these datasets, it has been shown to improve annotations in other phages (Kang et al. 2017). A summary of the annotated genes based on their PHROG categories showed Salmonella, Achromobacter, and mixed phage datasets included genes across all categories, but E.coli and Klebsiella (8 out of 10) datasets did not include genes from transcriptional regulation and integration and excision (Fig. 3A).

Overview of phage genome characteristics across datasets. (A) Proportions of genes in each PHROG function category are represented by dot sizes, with larger dots indicating higher proportions. Each row corresponds to a dataset, including Achromobacter, E. coli, Klebsiella (long-read and short-read), mixed phages, and Salmonella. (B) Stacked bars display the proportion of genes annotated by three types: Pharokka annotations, Phold annotations, and hypothetical proteins, indicated by distinct fill patterns. (C) Presence or absence of specific marker genes such as integrases, transposases, recombinases, toxin genes, and AMR genes is shown as filled or unfilled squares. (D) The determination of phage therapy candidates is shown in the last column, where a filled square indicates a candidate, and an unfilled square indicates non-candidacy.
To identify specific genes of interest for screening these phages for potential therapeutic use, we started with the presence of integrases, which were found in 15 phages from the Salmonella dataset, 13 from the Achromobacter dataset, and 3 (Serratia, Staphylococcus, Escherichia) phages from the mixed phage dataset (Fig. 3C). The presence of an integrase suggests that these phages are temperate and can persist using the lysogenic cycle. They may protect their host against other phages or express genes altering host functions. Additionally, 10 Salmonella phages contained transposase genes, four phages (Enterobacter, 2 Klebsiella, and Staphylococcus) from the mixed phage dataset contained recombinases, and two (Klebsiella and Escherichia phage) included two toxin genes. While none of the assembled phage genomes encoded antimicrobial genes, four phages contained virulence factors. Specifically, a phage from the Salmonella dataset and three phages (two Serratia phages and an Acinetobacter phage) from the mixed phage dataset were found to encode immune-modulating virulence genes. While the specific functions of these gene products remain unknown, their presence raises concerns and would disqualify these phages from consideration for therapeutic use. Overall, these 31 phages exhibit markers indicative of a prophage lifestyle or the presence of virulence factors, suggesting they may not be suitable candidates for phage therapy (Fig. 3D).
Among the remaining 48 phages, 12 encoded anti-CRISPR proteins: six from E.coli, a Salmonella phage, and five from Achromobacter phages. An Escherichia phage from a mixed dataset contained defense genes (Fig. 3C). However, 19 of the 48 potential phage therapy candidates came from samples containing multiple phages, necessitating re-isolation to ensure pure cultures. This reduces the viable candidates for phage therapy to 28 phages: 7 against E.coli, 19 against Klebsiella, 2 against Achromobacter, and 1 against Pseudomonas (Fig. 3D). No pure candidates were identified from the Salmonella dataset.
3.4 Sphae runtime performance
Sphae was executed five times on Klebsiella phage Amrap across various sequencing modes, and thread counts to assess differences in median runtime performance. This repetition allowed for robust comparisons, highlighting the variations in efficiency between configurations. Sphae paired-end sequencing mode took a median of 42 minutes on 8 threads but dropped significantly to a median of 9 minutes and 43 seconds on 16 threads. In long-read mode, the workflow was completed in a median of 14 minutes on 8 threads and 7 minutes and 24 seconds on 16 threads. Additionally, when Medaka polishing was omitted during the long-read mode, the median runtime increased to 17 minutes and 9 seconds on 8 threads, but similarly dropped to 8 minutes and 28 seconds on 16 threads. The sphae annotate command runs only the annotation steps of the workflow, taking a median of 6 minutes and 13 seconds on 8 threads compared to 6 minutes and 31 seconds on 16 threads (Table S4). Increasing thread count significantly reduces runtime for assembly-related tasks but does not always benefit annotation steps.
4 Discussion
Sphae is a reproducible workflow that automates the fundamental bioinformatics steps used in phage therapy to identify candidates for therapeutic use. By integrating 12 bioinformatics tools and nine Python scripts into a unified workflow, Sphae enables seamless execution using a single command. This workflow addresses key challenges in phage therapy by detecting induced prophages, multiple phage species in a sample, DTRs that could influence HGT. Leveraging Snakemake’s parallelization capabilities, Sphae can process multiple phages simultaneously, often within 10 minutes on 16 threads per phage sample. This makes Sphae a user-friendly solution for clinical applications and allows for rapid detection of phages with therapeutic potential.
We analyzed five datasets including 65 samples, to benchmark Sphae. These datasets included both short-read and long-read sequencing data, assembling 84 phage genomes, of which 28 phages could be used for therapy (Fig. 4). We found that phage samples can contain multiple phages, and Sphae reports the characteristics of these phages, making it easier to identify potential candidates for phage therapy that could be further purified if a therapeutic phage candidate is identified. In some instances, contaminants such as E.coli X174 in Illumina sequencing or phage in Nanopore sequences were detected as they are used as sequencing controls. In other cases, induced prophages may be present, identifiable by the presence of the same or highly similar sequences across all samples, as demonstrated in the Achromobacter dataset in this study. Finally, in cases where the phage fails, Sphae reports at which step the sample failed, if it was during assembly, or if the assembly was fragmented, as demonstrated with the E.coli dataset. These findings underscore the importance of thorough characterization and identification of phages for their potential therapeutic use.

Flowchart summarizing the analysis of 65 phage samples across five datasets, detailing the number of assembled phages, therapeutic candidates, failed assemblies, impure cultures, and phages containing prophage or virulence gene markers. Diagram generated using SankeyMATIC.
4.1 Sphae analysis reveals genomic insights into phage biology
Phage isolation is challenging as a plaque could have multiple phages from the environment, induced prophages, or other contaminants within a single sample. Bacterial isolates frequently contain prophages, and it has been reported that the average prophage density is 2.4% (McKerral et al. 2023, Inglis et al. 2024). The prophage excision can contaminate the therapeutic phage lysate, increasing the risk of HGT, including unwanted AMR and virulence genes (Pfeifer et al. 2022, Botelho et al. 2023). Here, we demonstrate that Sphae effectively captures the prophage contamination cases and informs the user when the sample might require further purification, as shown with the Achromobacter dataset, allowing for detecting and excluding phages that could be therapeutically problematic.
Sphae not only assembles and annotates phage genomes from various bacterial hosts but also identifies integrases, transposases, and recombinases—key enzymes involved in the integration and recombination of phage and bacterial DNA (Turner et al. 2021). These enzymes are central to HGT, particularly in facilitating the movement of genes between phages and hosts, which has implications for phage therapy. In the 84 phages analyzed, integrases were detected in 17 phages, transposases in 10 phages, and recombinases in four phages (Fig. 3C). While these three genes are associated with temperate lifecycle, recombinases are also part of recombination systems within lytic phages to help with DNA repair and enable the formation of concatemers in genome packaging. Therefore, the presence of recombinase genes is not a clear indication of a temperate lifecycle, further investigation is required.
Another critical aspect of phage biology is phage genome packaging, Phage packaging mechanisms, such as cos and pac packaging, can influence the likelihood of HGT events (Catalano and Morais 2021, Borodovich et al. 2022). For instance, cos site phages are less likely to carry out generalized transduction, while pac site or headful packaging wherein the bacterial DNA is mistakenly packaged into the phage capsids, facilitating gene transfer between the bacteria (Borodovich et al. 2022). Sphae addresses this by identifying the low-complexity DTR in genomes, typically associated with headful packaging, providing insights into the packaging processes. Sphae detected DTRs in 57 of the 84 phages. However, DTRs were detected in 83.82% Illumina sequenced phage genomes, while none were detected in Nanopore assemblies. Klebsiella dataset included 10 phages on both platforms, and DTRs were detected only on Illumina sequences, as noted in the original publication (Elek et al. 2023). This finding highlights two key points:(1) low-complexity regions such as DTRs are more reliable in short-read sequencing data, and (2) sequencing platforms influence the detection of packaging signals and completeness of the assembly. However, current bioinformatic tools cannot easily differentiate between the different packaging mechanisms or detect the correct copy number of repeats, as this influences completeness, which also depends on the type of phage it is (Borodovich et al. 2022).
These mechanisms are relevant to determine if AMR genes and virulence factors in the phage can be transferred to the bacterial hosts or introduced into the bacterial population. Sphae, therefore also searches for AMR genes and virulence factors. In the datasets tested, none of the phages encoded AMR, but four genomes included virulence factors. Overall, identification of these genes and reporting them in the summary file is aimed at making the detection of phage therapy candidates effective. As more phages are sequenced, Sphae could serve as a valuable tool not only for identifying therapy candidates but also for advancing studies on phage evolution and host interaction dynamics.
4.2 Sphae follows FAIR principles
This workflow promotes adherence to the Findable, Accessible, Interoperable, Reusable, and Reproducible (FAIR) principles (Wilkinson et al. 2016). While developing this workflow, we addressed a number of challenges generally associated with such workflows. This included creating comprehensive documentation with test datasets and structured output, making it easier to navigate and interpret results. While we provide the users with only pertinent outputs in the “RESULTS” directory, the intermediate files are retained so researchers can adapt their approach to resolve assembly complexities.
In instances of assembly failures, Sphae retains intermediate files that outline the steps where the breakdown occurred. For example, poor assemblies resulting from insufficient genome coverage can prompt more sequencing of the sample, if feasible. Additional adjustments such as altering the subsampled reads or switching to alternative assemblers could also be considered. Alternative assembler options include SPAdes (Bankevich et al. 2012), which handles a full spectrum of k-mers; Canu (Koren et al. 2017), which utilizes Overlap-Layout-Consensus assemblers; or hybrid assemblies with tools like Unicycler (Wick et al. 2017) or Plassembler (Bouras et al. 2023b), which may be necessary to resolve assembly complexities. Cases of fragmented assemblies connected in an assembly graph can be resolved using Phables (Mallawaarachchi et al. 2023). This ensures that even when complete assemblies are not immediately achievable, researchers can refine their approach to resolve assembly complexities, especially in time-sensitive cases.
Sphae workflow also tracks the versions of the software tools used, enhancing reproducibility. We also emphasize the pre-processing steps to ensure standard execution and minimize human error while providing users with readable errors. The challenges and solutions are presented in Table 3.
Challenges . | Solution . |
---|---|
Variability in tools and programming languages | Snakemake workflow manager allows the integration of tools written in multiple languages. |
Lack of version, parameters documentation, and installation of multiple programs | Snakemake allows logging each step, keeping track of the tool version and the command run with the default parameters listed. Each software is automatically downloaded to its separate conda environment with dependencies or via a pre-built Docker/Singularity container. |
Portability of the workflow | The workflow is available through conda, pip, pre-built containers, and source installation in GitHub or via a pre-built Docker/Singularity container. |
Hardware and software dependencies | The workflow’s configuration file includes resource information that the user can update for the system on which the workflow is running. In addition, a pipeline can be set to talk to job schedulers on high-performance computing (HPC) systems. |
Error handling | Provide detailed logs with information identifying the step at which the error occurred for each rule and an overall Snakemake .log file. |
Addition of new tools | New tools can be quickly added as a new rule to the workflow. This critical feature allows new and improved tools to be integrated as they are developed. |
Challenges . | Solution . |
---|---|
Variability in tools and programming languages | Snakemake workflow manager allows the integration of tools written in multiple languages. |
Lack of version, parameters documentation, and installation of multiple programs | Snakemake allows logging each step, keeping track of the tool version and the command run with the default parameters listed. Each software is automatically downloaded to its separate conda environment with dependencies or via a pre-built Docker/Singularity container. |
Portability of the workflow | The workflow is available through conda, pip, pre-built containers, and source installation in GitHub or via a pre-built Docker/Singularity container. |
Hardware and software dependencies | The workflow’s configuration file includes resource information that the user can update for the system on which the workflow is running. In addition, a pipeline can be set to talk to job schedulers on high-performance computing (HPC) systems. |
Error handling | Provide detailed logs with information identifying the step at which the error occurred for each rule and an overall Snakemake .log file. |
Addition of new tools | New tools can be quickly added as a new rule to the workflow. This critical feature allows new and improved tools to be integrated as they are developed. |
Challenges . | Solution . |
---|---|
Variability in tools and programming languages | Snakemake workflow manager allows the integration of tools written in multiple languages. |
Lack of version, parameters documentation, and installation of multiple programs | Snakemake allows logging each step, keeping track of the tool version and the command run with the default parameters listed. Each software is automatically downloaded to its separate conda environment with dependencies or via a pre-built Docker/Singularity container. |
Portability of the workflow | The workflow is available through conda, pip, pre-built containers, and source installation in GitHub or via a pre-built Docker/Singularity container. |
Hardware and software dependencies | The workflow’s configuration file includes resource information that the user can update for the system on which the workflow is running. In addition, a pipeline can be set to talk to job schedulers on high-performance computing (HPC) systems. |
Error handling | Provide detailed logs with information identifying the step at which the error occurred for each rule and an overall Snakemake .log file. |
Addition of new tools | New tools can be quickly added as a new rule to the workflow. This critical feature allows new and improved tools to be integrated as they are developed. |
Challenges . | Solution . |
---|---|
Variability in tools and programming languages | Snakemake workflow manager allows the integration of tools written in multiple languages. |
Lack of version, parameters documentation, and installation of multiple programs | Snakemake allows logging each step, keeping track of the tool version and the command run with the default parameters listed. Each software is automatically downloaded to its separate conda environment with dependencies or via a pre-built Docker/Singularity container. |
Portability of the workflow | The workflow is available through conda, pip, pre-built containers, and source installation in GitHub or via a pre-built Docker/Singularity container. |
Hardware and software dependencies | The workflow’s configuration file includes resource information that the user can update for the system on which the workflow is running. In addition, a pipeline can be set to talk to job schedulers on high-performance computing (HPC) systems. |
Error handling | Provide detailed logs with information identifying the step at which the error occurred for each rule and an overall Snakemake .log file. |
Addition of new tools | New tools can be quickly added as a new rule to the workflow. This critical feature allows new and improved tools to be integrated as they are developed. |
4.3 Sphae is a modular workflow solution
The tools were chosen based on best practices in phage genome characterization (Shen and Millard 2021, Turner et al. 2021, Grigson et al. 2023). The focus was on achieving high accuracy and benchmarking for low runtime results. Workflow managers offer the advantage of isolating each software in its environment (Köster and Rahmann 2012, Roach et al. 2022). This means that as the software is improved or new tools are published, they can be quickly added and replace outdated modules. Additionally, more samples can be added to each dataset, and the workflow will run only the new samples, with previously used tool versions if the conda environments were kept. The complete workflow, along with the individual modules, supports reentrancy, allowing steps to be resumed in case they were interrupted.
In Sphae, we have added the option, sphae run, to run the entire workflow beginning with sequencing reads to generate final annotations and a summary report. However, the sphae annotate module has been included to allow end-users to run only the annotation steps on pre-assembled phage genomes, leveraging Sphae’s approach to improving the number of annotated genes. This module was added for two reasons: first, the assembled genomes can be re-circularized to start from large terminase subunit (terL) or other user-selected genes using tools like Dnaapler (Bouras et al. 2024a) and visualized with Clinker (Gilchrist and Chooi 2021) or pyGenomeViz. Second, phages sometimes reassign stop codons by using alternative genetic codes (Borges et al. 2022, Peters et al. 2022, Cook et al. 2024) end-users can change the config file to run pyrodigal-gv (Larralde 2022) for gene prediction in Pharokka instead of the default PHANOTATE (McNair et al. 2019). The need for changing tools can be predicted from the coding density reported in the summary.txt file. Phages generally have high coding density to minimize non-coding regions; low-density coding regions suggest that the annotation tools may have incompletely annotated the phage genome (McNair et al. 2019).
4.4 Future improvements
The ongoing isolation and analysis of phages continue to enhance our grasp of phage biology, evolution and phage-host interactions. Although short-read platforms have traditionally been used for sequencing most phages, there is a growing adoption of long-read sequencing methods such as Oxford Nanopore and PacBio sequencing. An advantage of long-read sequencing is its ability to detect phage DNA modifications, like methylation (Simpson et al. 2017, Sun et al. 2023), which may play a role in phage resistance and adaptability to microbial communities. While there are over 2000 phage sequences available in the SRA from Illumina platforms, fewer than 300 phages have been sequenced using long-read technologies such as PacBio and Nanopore platforms (source: https://www.ncbi.nlm.nih.gov/sra). With the increasing availability of long-read sequencing data and the development of automated tools for identifying methylation in phage genomes with minimal manual intervention, we anticipate the integration of this feature into the workflow as a distinct module. Additionally, alternate codon reassignment, recently identified in phage genomes (Borges et al. 2022, Larralde 2022, Peters et al. 2022), is now included in Sphae offering users insights into unique coding adaptations, and insights into coding adaptations relevant to host specificity. Tools like Prfect that predict programmed ribosomal frameshifts producing longer proteins (McNair et al. 2023), also present an exciting future integration. These enhancements will enable end-users to explore this specialized genome feature, as our understanding of phage biology and evolution improves. The tools and modules within Sphae will be regularly updated to accommodate these advancements to include useful summary reports, ensuring users can easily access and interpret the latest advancements in a user-friendly manner.
5 Conclusions
Sphae is a bioinformatics workflow designed to quickly and comprehensively characterize phage isolates and identify phage therapy candidates, addressing the urgent need for effective alternatives to combat AMR. By seamlessly integrating high-quality genomic data and automated analysis, Sphae not only enhances our understanding of phage biology and evolution but also empowers researchers to make informed decisions in the fight against resistant bacterial pathogens.
Author contributions
Bhavya Papudeshi (Conceptualization [lead], Data curation [lead], Formal analysis [lead], Investigation [equal], Methodology [equal], Software [equal], Validation [lead], Visualization [lead]), Michael J. Roach (Conceptualization [equal], Methodology [equal], Writing—review & editing [equal]), Vijini Gimhani Mallawaarachchi (Methodology [equal], Software [equal]), George Bouras (Methodology [equal], Software [equal], Validation [equal]), Susanna R. Grigson (Methodology [equal], Software [equal]), Sarah K. Giles (Data curation [equal], Validation [supporting]), Clarice M. Harker (Data curation [supporting], Validation [supporting]), Abbey L. K. Hutton (Validation [supporting], Visualization [supporting]), Anita Tarasenko (Data curation [supporting], Validation [supporting], Visualization [supporting]), Laura K. Inglis (Visualization [equal], Writing—review & editing [supporting]), Alejandro A. Vega (Data curation [supporting]), Cole Souza (Data curation [supporting]), Lance Boling (Data curation [supporting]), Hamza Hajama (Validation [supporting]), Ana Cobain (Validation [supporting]), Anca Segall (Conceptualization [equal], Data curation [supporting], Investigation [supporting], Methodology [supporting], Project administration [supporting], Supervision [supporting]), Elizabeth A. Dinsdale (Funding acquisition [supporting], Project administration [supporting], Supervision [supporting]), and Robert Edwards (Conceptualization [equal], Data curation [equal], Formal analysis [equal], Funding acquisition [lead], Investigation [lead], Methodology [equal], Project administration [lead], Resources [lead], Software [equal], Supervision [supporting])
Supplementary data
Supplementary data are available at Bioinformatics Advances online.
Conflict of interest
None declared.
Funding
This research was supported by Flinders University through the DeepThought High-Performance Cluster (HPC). Additional resources and services were provided by the Australian Nectar Research Data Commons (ARDC) Nectar Infrastructure, the Pawsey Supercomputing Research Center, and the National Computational Infrastructure (NCI), funded by the Australian Government.
R.A.E. was supported by an award from the National Institute of Health (NIH); National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) [Grant number RC2DK116713] and an award from the Australian Research Council (ARC) [Grant number DP220102915].
Data availability
All raw data used to assess Sphae are publicly available through the NCBI Sequence Read Archive (SRA), with SRA accession numbers listed in Supplementary Table S1. The Sphae workflow code is openly accessible on GitHub at https://github.com/linsalrob/sphae.