Abstract

Summary

Phigaro is a standalone command-line application that detects prophage regions, taking raw genome and metagenome assemblies as input. It also produces dynamic annotated ‘prophage genome maps’ and marks possible transposon insertion spots inside prophages. It is applicable to mining prophage regions from large metagenomic datasets.

Availability and implementation

Source code for Phigaro is freely available for download at https://github.com/bobeobibo/phigaro along with test data. The code is written in Python.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Bacteriophages (phages) are viruses that infect bacteria, and they have recently attracted increasing interest because of the alarming spread of antibiotic-resistant strains of pathogenic bacteria. Phages are now considered an alternative to antibiotics in medicine (Lin et al., 2017; Waters et al., 2017), veterinary practice (Squires, 2018) and the food industry (Gutiérrez et al., 2016, 2017). They are known for their substantial impact on diverse ecosystems, from animals’ intestinal tracts to oceans. Phages can sometimes benefit their hosts by transporting virulence factors and other beneficial genes among bacterial strains. To date, our knowledge of bacteriophage diversity is narrow because the number of isolated and sequenced bacteriophage genomes is negligible compared to the huge proportion of viral ‘dark matter’ found in metagenomes (Yutin et al., 2018). Many undiscovered viral sequences of the Myoviridae, Podoviridae, Siphoviridae, Inoviridae and Microviridae families lie within sequenced bacterial genomes in the form of prophages, as those families are known to have temperate life cycles, and even more unknown prophages are likely to lie within metagenomes. Existing command-line tools for prophage prediction tend to output a limited selection of annotations and visualizations, and generally do not mark overlapping mobile elements such as transposons. Here, we present Phigaro, a novel high-throughput command-line tool that predicts and annotates prophage sequences in both genomic and metagenomic assembled data and provides a dynamic visualization interface.

2 Phigaro overview

Phigaro is a Python package that accepts one or more FASTA files of assembled contigs as input. The core of the program is the PhigaroFinder algorithm, which defines regions of putative prophages based on preprocessed input data. The preprocessing is performed sequentially by two external programs. First, FASTA files are processed by Prodigal v2.6.3 (Hyatt et al., 2010), which returns a list of genes with their coordinates, GC content and other properties for a given sequence, along with the predicted protein sequences. Then the protein sequences are annotated with HMMSCAN v3.2.1 (Potter et al., 2018) using phage-specific profile hidden Markov models (HMMs) from prokaryotic Virus Orthologous Groups (pVOGs) (Grazziotin et al., 2017). A gene is considered ‘phage-like’ if its protein product matches one of the pVOG profile HMMs.
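The two preprocessing steps can be sketched as command lines assembled in Python. This is a minimal sketch and not Phigaro's actual invocation: the file names, the helper names build_prodigal_cmd, build_hmmscan_cmd and run_preprocessing, and the choice of Prodigal's metagenomic procedure (-p meta) are assumptions made for illustration. The flags themselves are standard Prodigal and HMMER options, and the E-value default is the optimized value reported in Section 2.2.

```python
import subprocess


def build_prodigal_cmd(fasta, proteins_out, genes_out, meta=True):
    """Gene prediction: -a writes translated proteins, -o writes gene coordinates."""
    cmd = ["prodigal", "-i", fasta, "-a", proteins_out, "-o", genes_out]
    if meta:
        # Assumption: metagenomic procedure, which suits short or mixed contigs.
        cmd += ["-p", "meta"]
    return cmd


def build_hmmscan_cmd(hmm_db, proteins, table_out, evalue=0.00445):
    """Search each protein against a profile HMM database (pressed with hmmpress)."""
    return ["hmmscan", "--tblout", table_out, "-E", str(evalue), hmm_db, proteins]


def run_preprocessing(fasta, hmm_db):
    """Run both steps in order; the intermediate file names are hypothetical."""
    subprocess.run(build_prodigal_cmd(fasta, "proteins.faa", "genes.out"), check=True)
    subprocess.run(build_hmmscan_cmd(hmm_db, "proteins.faa", "hits.tbl"), check=True)
```

Building the argument lists separately from running them keeps the sketch easy to inspect without Prodigal or HMMER installed.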

2.1 PhigaroFinder algorithm

For each gene, the PhigaroFinder algorithm computes the probability that it is located in a prophage region. The algorithm uses two pre-computed sets of pVOG profile HMMs: the ‘black list’ and the ‘white list’. These lists were formed based on pVOG distributions inside and outside known prophage regions in 54 bacterial genomes (Supplementary Table S1). They correct the initial set of pVOG profile HMMs so that regions with a high density of pVOG-matching genes that are, in fact, not true prophage regions are not detected. The ‘black list’ consists of pVOGs that are likely to be found (according to the Fisher test at the 5% significance level) in regions unrelated to prophages throughout the 54 bacterial genomes (e.g. the ones annotated as ‘ABC transporters’, ‘plasmid partition proteins’, etc.), whereas the ‘white list’ is the opposite: it consists of pVOGs that are more likely to be found in prophage regions than in other regions (e.g. annotated as ‘capsid proteins’, ‘terminases’, etc.). To compute each gene’s scores, input data are transformed into two sequences of indicators using the Prodigal and HMMER3 outputs. The sequence of indicators for computing ‘phage scores’ is formed as follows:

  • 0 for a gene whose protein product does not match any pVOG profile HMMs

  • (1 + ‘black_penalty’) for a gene whose protein product does match a pVOG profile HMM from the ‘black list’

  • (1 + ‘white_bonus’) for a gene whose protein product does match a pVOG profile HMM from the ‘white list’

  • 1 for a gene whose protein product does match a pVOG profile HMM that is on neither the ‘black’ nor the ‘white’ list (‘neutral’ genes)
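The four cases above can be written as a small lookup. This is a minimal sketch: the function name gene_indicator, the use of None for ‘no pVOG match’ and the example pVOG identifiers are hypothetical, while the default penalty and bonus are the optimized values reported in Section 2.2.

```python
BLACK_PENALTY = -2.2  # optimized 'black list' penalty (Section 2.2)
WHITE_BONUS = 0.7     # optimized 'white list' bonus (Section 2.2)


def gene_indicator(pvog, black_list, white_list,
                   black_penalty=BLACK_PENALTY, white_bonus=WHITE_BONUS):
    """Return the phage-score indicator for one gene.

    pvog is the identifier of the matched pVOG profile HMM, or None
    when the gene's protein product matches no pVOG profile.
    """
    if pvog is None:
        return 0.0
    if pvog in black_list:
        return 1.0 + black_penalty
    if pvog in white_list:
        return 1.0 + white_bonus
    return 1.0  # 'neutral' gene


# Hypothetical pVOG identifiers, for illustration only.
black = {"VOG_B"}
white = {"VOG_W"}
indicators = [gene_indicator(p, black, white)
              for p in [None, "VOG_B", "VOG_W", "VOG_N"]]
```

With the default values, a ‘black list’ match contributes a negative indicator (about -1.2), so it lowers the score of its neighbourhood instead of merely failing to raise it.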

Then, a triangular window function is applied to compute ‘phage scores’. This function weights each gene by its proximity to the middle of the sliding window (Oppenheim, 1999). In the score of gene i, w is the window width and Ind_n is the nth gene’s indicator.

Final scores are obtained depending on the chosen mode: ‘basic’, ‘abs_gc’ or ‘without_gc’. By default, PhigaroFinder runs in ‘basic’ mode, which we find applicable for most cases. However, there might be prophages whose GC content is significantly lower than or equal to that of their hosts. For those, we suggest using the ‘abs_gc’ and ‘without_gc’ modes. For the ‘basic’ and ‘abs_gc’ modes, the final score is computed as the product of the phage score and the GC score.

GC scores of the ‘basic’ mode are obtained for each gene similarly to phage scores, from gc_cont_n, the GC content of the gene obtained from Prodigal output, and 1_pVOG(g_n), an indicator function that shows whether the gene’s protein product matches a pVOG profile HMM. These GC scores help to sharpen the boundaries of prophage regions in the resulting score. GC scores of the ‘abs_gc’ mode are obtained as deviations of the GC content from the mean GC value of the bacterial regions, where gc_cont_n is the GC content of a gene obtained from Prodigal output and mean_gc is the mean GC value of the bacterial regions. The final score for the ‘without_gc’ mode is equal to the phage score.
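The windowed scoring described above can be illustrated with a short sketch. It is not the published formula: the exact triangular weights, their normalization, the handling of contig ends and the way the ‘basic’ GC term combines GC content with the pVOG indicator are all assumptions, and the names triangular_weights, windowed_sum and final_scores are hypothetical. The ‘abs_gc’ mode is left out because its formula cannot be inferred from the text.

```python
def triangular_weights(w):
    """Weights that peak at the window centre and fall linearly towards the edges."""
    half = w // 2
    return [1.0 - abs(k) / (half + 1) for k in range(-half, half + 1)]


def windowed_sum(values, w):
    """Triangular-weighted sum of values in a window centred on each gene."""
    half = w // 2
    weights = triangular_weights(w)
    result = []
    for i in range(len(values)):
        total = 0.0
        for k, weight in zip(range(-half, half + 1), weights):
            n = i + k
            if 0 <= n < len(values):  # assumption: genes beyond contig ends add nothing
                total += weight * values[n]
        result.append(total)
    return result


def final_scores(indicators, gc_contents, has_pvog, w, mode="basic"):
    """Per-gene final scores for the 'basic' and 'without_gc' modes."""
    phage = windowed_sum(indicators, w)
    if mode == "without_gc":
        return phage
    if mode == "basic":
        # Assumption: GC content counts only for genes with a pVOG match.
        gc_terms = [gc if hit else 0.0 for gc, hit in zip(gc_contents, has_pvog)]
        gc = windowed_sum(gc_terms, w)
        return [p * g for p, g in zip(phage, gc)]
    raise ValueError("mode must be 'basic' or 'without_gc' in this sketch")


# Five genes, one pVOG match in the middle, window width 4.
phage_only = final_scores([0, 0, 1, 0, 0], [0.5] * 5,
                          [False, False, True, False, False], 4, mode="without_gc")
basic = final_scores([0, 0, 1, 0, 0], [0.5] * 5,
                     [False, False, True, False, False], 4, mode="basic")
```

In this toy input the single matching gene produces a score that is highest at its own position and decays linearly over its neighbours, which is the smoothing effect the triangular window is meant to provide.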

Finally, the algorithm determines phage regions based on the sequence of resulting scores. Phage regions are defined as ranges of genes with scores exceeding the ‘minimum score threshold’, given that at least one of the genes has a score exceeding the ‘maximum score threshold’. Thus, for each input contig, Phigaro returns a set of prophage regions with their coordinates and marks the possible presence of a transposon insertion inside the prophage sequence, defined as more than two consecutive transposases or integrases. Phigaro produces annotated ‘prophage genome maps’ in which prophages are visualized dynamically on a webpage by displaying their proteins as arrows, color-coded by phage functional module (Supplementary Fig. S2).
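The two-threshold rule and the transposon mark can be sketched as follows. This is a minimal illustration under stated assumptions: regions are reported as inclusive gene-index ranges, the function names call_regions and has_transposon_signal are hypothetical, and the annotation labels ‘transposase’ and ‘integrase’ stand in for whatever labels the real annotation provides.

```python
def call_regions(scores, min_threshold, max_threshold):
    """Return (first_gene, last_gene) index pairs for putative prophage regions.

    A region is a maximal run of genes scoring above min_threshold that
    contains at least one gene scoring above max_threshold.
    """
    regions = []
    start = None
    peak = float("-inf")
    # A trailing sentinel closes a run that reaches the end of the contig.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score > min_threshold:
            if start is None:
                start, peak = i, score
            else:
                peak = max(peak, score)
        elif start is not None:
            if peak > max_threshold:
                regions.append((start, i - 1))
            start = None
    return regions


def has_transposon_signal(annotations, min_run=3):
    """True if more than two consecutive genes are transposases or integrases."""
    run = 0
    for label in annotations:
        run = run + 1 if label in ("transposase", "integrase") else 0
        if run >= min_run:
            return True
    return False


scores = [10, 45.5, 47, 45.6, 10, 45.5, 45.7, 10]
regions = call_regions(scores, min_threshold=45.39, max_threshold=46.0)
```

In the example, the second run of genes stays above the minimum threshold but never exceeds the maximum threshold, so only the first run is reported.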

2.2 PhigaroFinder parameter optimization

To optimize PhigaroFinder parameters, we used a ‘gold standard’ set of 54 bacterial genomes with manually annotated prophage positions (Casjens, 2003) (Supplementary Table S1). During a two-step optimization process, the ‘black list’ penalty, ‘white list’ bonus, threshold values, HMMSCAN E-value and window width were chosen. Parameter selection was done by grid search (Supplementary Figs S3–S5), using as metrics the Jaccard index (a measure of the similarity of two sets) and the Positive Predictive Value (PPV, the probability that a predicted positive result is a true positive result):

$$\mathrm{Jaccard} = \frac{L_i}{L_u}, \qquad \mathrm{PPV} = \frac{L_i}{L_p}$$

where $L_i$ is the length of the intersection of predicted and true prophage regions, $L_u$ is the length of the union of predicted and true prophage regions and $L_p$ is the length of the predicted phage region. Thus, the best set of parameters was estimated in ‘basic’ mode as follows:
  • ‘Black list’ penalty: -2.2

  • ‘White list’ bonus: +0.7

  • Minimum score threshold: 45.39

  • Maximum score threshold: 46.0

  • HMMSCAN E-value: 0.00445

  • Window size: 32 ORFs

For this set of parameters, Jaccard index was 0.627, and PPV was 0.872.
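Both metrics can be computed directly from region coordinates. The sketch below assumes that regions are inclusive (start, end) coordinate pairs on a single sequence and treats each region set as a set of covered positions; the function names are hypothetical, and how values were averaged across genomes is not specified here.

```python
def covered_positions(regions):
    """Set of all positions covered by a list of inclusive (start, end) regions."""
    covered = set()
    for start, end in regions:
        covered.update(range(start, end + 1))
    return covered


def jaccard_and_ppv(predicted, true):
    """Jaccard index (L_i / L_u) and PPV (L_i / L_p) for two region sets."""
    pred = covered_positions(predicted)
    ref = covered_positions(true)
    l_i = len(pred & ref)  # length of the intersection
    l_u = len(pred | ref)  # length of the union
    l_p = len(pred)        # length of the predicted regions
    jaccard = l_i / l_u if l_u else 0.0
    ppv = l_i / l_p if l_p else 0.0
    return jaccard, ppv


jaccard, ppv = jaccard_and_ppv(predicted=[(10, 19)], true=[(15, 24)])
```

A prediction that covers only part of a true prophage keeps a high PPV but loses Jaccard index, which is why the two metrics are reported together.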

2.3 Performance analysis

Phigaro performance was compared to that of other prophage prediction tools using a manually constructed dataset with previously annotated prophage regions: 14 organisms, 25 prophage regions in total. Although several prophage prediction tools exist to date [such as PHASTER (Arndt et al., 2016), VirSorter (Roux et al., 2015), Phage_Finder (Fouts, 2006), ProphET, Prophinder (Lima-Mendez et al., 2008) and PhiSpy (Akhter et al., 2012)], only the first two accept unannotated FASTA sequences as input. To compare the performance of the three tools, we used the same metrics as in the grid search procedure: Jaccard index and PPV (Table 1). The validation dataset and the detailed validation results can be found in Supplementary Tables S2–S4 and Supplementary Figures S6.1–S6.5.

Table 1.

Performance of Phigaro compared to other prophage prediction tools accepting unannotated FASTA sequence as input

| Program | App/web | Jaccard index | PPV | Average time |
|---|---|---|---|---|
| Phigaro (basic mode, default) | Standalone | 0.402 | 0.829 | 270 s |
| Phigaro (abs_gc mode) | Standalone | 0.339 | 0.674 | 270 s |
| Phigaro (without_gc mode) | Standalone | 0.240 | 0.538 | 270 s |
| PHASTER | Web/API | 0.478 | 0.631 | 138 s + time in queue |
| VirSorter (levels 1&4) | Standalone | 0.070 | 0.071 | 2829 s |
| VirSorter (levels 1&2&4&5) | Standalone | 0.578 | 0.592 | 2829 s |
| VirSorter (levels 1&2&3&4&5&6) | Standalone | 0.338 | 0.350 | 2829 s |

Although Phigaro performs worse than PHASTER and VirSorter (levels 1&2&4&5) in terms of mean Jaccard index, its mean PPV is the best among the compared tools. Also, its mean execution time is comparable with that of PHASTER when the time spent waiting in the queue, which varies and can reach several days, is excluded. Overall, we show that Phigaro has decent performance compared to existing prophage prediction tools. In addition, the tool marks possible transposons inserted into prophages and provides dynamic visualizations to inspect the genome annotation and organization of prophages.

Acknowledgements

The authors thank the Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency for providing computational resources for this project.

Funding

This work was supported by RFBR (grant number 16-54-21012) and SNSF (grant identifier IZLRZ3_163863).

Conflict of Interest: none declared.

References

Akhter S. et al. (2012) PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res., 40, e126.

Arndt D. et al. (2016) PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res., 44, W16–W21.

Casjens S. (2003) Prophages and bacterial genomics: what have we learned so far? Mol. Microbiol., 49, 277–300.

Fouts D.E. (2006) Phage_finder: automated identification and classification of prophage regions in complete bacterial genome sequences. Nucleic Acids Res., 34, 5839–5851.

Grazziotin A.L. et al. (2017) Prokaryotic virus orthologous groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res., 45, D491–D498.

Gutiérrez D. et al. (2016) Bacteriophages as weapons against bacterial biofilms in the food industry. Front. Microbiol., 7, 825.

Gutiérrez D. et al. (2017) Applicability of commercial phage-based products against Listeria monocytogenes for improvement of food safety in Spanish dry-cured ham and food contact surfaces. Food Control, 73, 1474–1482.

Hyatt D. et al. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.

Lima-Mendez G. et al. (2008) Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics, 24, 863–865.

Lin D.M. et al. (2017) Phage therapy: an alternative to antibiotics in the age of multi-drug resistance. World J. Gastrointest. Pharmacol. Ther., 8, 162–173.

Oppenheim A.V. (1999) Discrete-Time Signal Processing. Prentice Hall, Upper Saddle River, NJ.

Potter S.C. et al. (2018) HMMER web server: 2018 update. Nucleic Acids Res., 46, W200–W204.

Roux S. et al. (2015) VirSorter: mining viral signal from microbial genomic data. PeerJ, 3, e985.

Squires R.A. (2018) Bacteriophage therapy for management of bacterial infections in veterinary practice: what was once old is new again. N. Z. Veterinary J., 66, 229–235.

Waters E.M. et al. (2017) Phage therapy is highly effective against chronic lung infections with Pseudomonas aeruginosa. Thorax, 72, 666–667.

Yutin N. et al. (2018) Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut. Nat. Microbiol., 3, 38–46.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Alfonso Valencia