Elizaveta V Starikova, Polina O Tikhonova, Nikita A Prianichnikov, Chris M Rands, Evgeny M Zdobnov, Elena N Ilina, Vadim M Govorun, Phigaro: high-throughput prophage sequence annotation, Bioinformatics, Volume 36, Issue 12, June 2020, Pages 3882–3884, https://doi.org/10.1093/bioinformatics/btaa250
Abstract
Phigaro is a standalone command-line application that detects prophage regions in raw genome and metagenome assemblies. It also produces dynamic annotated ‘prophage genome maps’ and marks possible transposon insertion sites inside prophages. It is applicable to mining prophage regions from large metagenomic datasets.
Source code for Phigaro is freely available for download at https://github.com/bobeobibo/phigaro along with test data. The code is written in Python.
Supplementary data are available at Bioinformatics online.
1 Introduction
Bacteriophages (phages) are viruses that infect bacteria, and interest in them has grown with the alarming spread of antibiotic-resistant strains of pathogenic bacteria. Phages are now considered an alternative to antibiotics in medicine (Lin et al., 2017; Waters et al., 2017), veterinary medicine (Squires, 2018) and the food industry (Gutiérrez et al., 2016, 2017). They are known for their substantial impact on diverse ecosystems, from animals’ intestinal tracts to oceans. Phages can sometimes provide benefits to their hosts by transporting virulence factors and other beneficial genes among bacterial strains. To date, our knowledge of bacteriophage diversity remains limited because the number of isolated and sequenced bacteriophage genomes is negligible compared with the huge proportion of viral ‘dark matter’ found in metagenomes (Yutin et al., 2018). Many undiscovered viral sequences of the Myoviridae, Podoviridae, Siphoviridae, Inoviridae and Microviridae families lie within sequenced bacterial genomes in the form of prophages, as those families are known to have temperate life cycles, and even more unknown prophages are likely present within metagenomes. Existing command-line tools for prophage prediction tend to output a limited selection of annotations and visualizations, and generally do not mark overlapping mobile elements such as transposons. Here, we present Phigaro, a novel high-throughput command-line tool that predicts and annotates prophage sequences in both genomic and metagenomic assembled data and provides a dynamic visualization interface.
2 Phigaro overview
Phigaro is a Python package that accepts one or more FASTA files of assembled contigs as input. The core of the program is the PhigaroFinder algorithm, which defines regions of putative prophages based on preprocessed input data. The preprocessing is carried out sequentially by two external programs. First, FASTA files are processed by Prodigal v2.6.3 (Hyatt et al., 2010), which returns, for a given sequence, a list of genes with their coordinates, GC content and other properties, along with the predicted protein sequences. Then the protein sequences are annotated with HMMSCAN v3.2.1 (Potter et al., 2018) using phage-specific profile hidden Markov models (HMMs) from prokaryotic Virus Orthologous Groups (pVOGs) (Grazziotin et al., 2017). A gene is considered ‘phage-like’ if its protein product matches one of the pVOG profile HMMs.
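The two preprocessing steps can be illustrated with a minimal sketch that only assembles the command lines for Prodigal and hmmscan. It is not Phigaro's own invocation: the function name, the output file names, the use of Prodigal's metagenomic mode (`-p meta`) and the GFF output format are assumptions made for illustration, and the pVOG HMM database is assumed to have been prepared with `hmmpress` beforehand.

```python
from pathlib import Path


def build_preprocessing_commands(fasta, workdir, pvog_hmm_db, evalue=0.00445):
    """Return (prodigal_cmd, hmmscan_cmd) as argument lists.

    Hypothetical helper: file names and the choice of flags are
    illustrative assumptions, not taken from the Phigaro source code.
    """
    workdir = Path(workdir)
    proteins = workdir / "proteins.faa"   # predicted protein sequences
    genes = workdir / "genes.gff"         # gene coordinates and properties
    hits = workdir / "pvog_hits.tbl"      # per-protein pVOG matches

    # Step 1: gene prediction on the assembled contigs.
    prodigal_cmd = [
        "prodigal",
        "-i", str(fasta),
        "-a", str(proteins),
        "-o", str(genes),
        "-f", "gff",
        "-p", "meta",  # assumption: metagenomic mode
    ]
    # Step 2: annotate the predicted proteins against pVOG profile HMMs.
    hmmscan_cmd = [
        "hmmscan",
        "--tblout", str(hits),
        "-E", str(evalue),
        str(pvog_hmm_db),
        str(proteins),
    ]
    return prodigal_cmd, hmmscan_cmd
```

Each list could then be passed to `subprocess.run(cmd, check=True)`, in that order, since hmmscan consumes the protein file written by Prodigal.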
2.1 PhigaroFinder algorithm
For each gene, the PhigaroFinder algorithm computes the probability of it being localized in a prophage region. The algorithm uses two pre-computed sets of pVOG profile HMMs: the ‘black list’ and the ‘white list’. These lists were formed based on pVOG distributions inside and outside known prophage regions in 54 bacterial genomes (Supplementary Table S1). They correct the initial set of pVOG profile HMMs so that regions with a high density of pVOG-matching genes that are not true prophage regions are not reported. The ‘black list’ consists of pVOGs that are likely to be found (according to a Fisher test at the 5% significance level) in regions unrelated to prophages throughout the 54 bacterial genomes (e.g. the ones annotated as ‘ABC transporters’, ‘plasmid partition proteins’, etc.), whereas the ‘white list’ is the opposite: it consists of pVOGs that are more likely to be found in prophage regions than in other regions (e.g. annotated as ‘capsid proteins’, ‘terminases’, etc.). To compute each gene’s scores, input data are transformed into two sequences of indicators using data obtained from the Prodigal and HMMER3 outputs. The sequences of indicators for computing ‘phage scores’ are formed as follows:
- 0 for a gene whose protein product does not match any pVOG profile HMM
- (1 + ‘black_penalty’) for a gene whose protein product matches a pVOG profile HMM from the ‘black list’
- (1 + ‘white_bonus’) for a gene whose protein product matches a pVOG profile HMM from the ‘white list’
- 1 for a gene whose protein product matches a pVOG profile HMM that is on neither the ‘black’ nor the ‘white’ list (‘neutral’ genes)
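The indicator rules listed above can be written as a short function. This is a minimal sketch, not the PhigaroFinder implementation: the function name, the representation of a gene by its best pVOG hit (or `None`), and the pVOG identifiers in the usage note are hypothetical; the default penalty and bonus are the values reported in Section 2.2.

```python
BLACK_PENALTY = -2.2  # 'black list' penalty reported in Section 2.2
WHITE_BONUS = 0.7     # 'white list' bonus reported in Section 2.2


def phage_indicator(pvog_id, black_list, white_list,
                    black_penalty=BLACK_PENALTY, white_bonus=WHITE_BONUS):
    """Map one gene to its 'phage score' indicator.

    pvog_id is the pVOG matched by the gene's protein product,
    or None when no pVOG profile HMM matched (assumed representation).
    """
    if pvog_id is None:
        return 0.0                   # no pVOG match
    if pvog_id in black_list:
        return 1.0 + black_penalty   # match on the 'black list'
    if pvog_id in white_list:
        return 1.0 + white_bonus     # match on the 'white list'
    return 1.0                       # 'neutral' pVOG match


def phage_indicators(pvog_hits, black_list, white_list):
    """Apply phage_indicator to every gene of a contig, in gene order."""
    return [phage_indicator(h, black_list, white_list) for h in pvog_hits]
```

With the reported parameters, a ‘black list’ gene contributes approximately -1.2 and a ‘white list’ gene approximately 1.7, for example in `phage_indicators([None, "VOG_B", "VOG_W", "VOG_N"], {"VOG_B"}, {"VOG_W"})`.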
Finally, the algorithm determines phage regions based on the sequence of resulting scores. Phage regions are defined as ranges of genes with scores exceeding the ‘minimum score threshold’, given that at least one of the genes has a score exceeding the ‘maximum score threshold’. Thus, for each input contig, Phigaro returns a set of prophage regions with their coordinates. It also flags a possible transposon insertion inside a prophage sequence when more than two consecutive genes encode transposases or integrases. Phigaro produces annotated ‘prophage genome maps’ in which prophages are visualized dynamically on a webpage, with their proteins displayed as arrows color-coded by phage functional module (Supplementary Fig. S2).
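The two-threshold rule and the transposon flag described above can be sketched as follows. This is an illustration under stated assumptions rather than Phigaro's code: both function names are hypothetical, per-gene scores are taken as already computed, regions are returned as inclusive gene-index ranges, and transposases and integrases are recognized by a simple keyword match on annotation strings.

```python
def find_phage_regions(scores, min_threshold, max_threshold):
    """Return (first_gene, last_gene) index pairs for runs of genes whose
    scores all exceed min_threshold and whose peak exceeds max_threshold."""
    regions = []
    start = None
    peak = float("-inf")
    for i, score in enumerate(scores):
        if score > min_threshold:
            if start is None:
                start, peak = i, score   # open a new candidate run
            else:
                peak = max(peak, score)
        elif start is not None:
            if peak > max_threshold:     # keep only runs with a high peak
                regions.append((start, i - 1))
            start = None
    if start is not None and peak > max_threshold:
        regions.append((start, len(scores) - 1))
    return regions


def has_transposon_signature(annotations, min_run=3):
    """True if at least min_run consecutive genes are annotated as
    transposases or integrases ('more than two' means three or more)."""
    run = 0
    for annotation in annotations:
        label = annotation.lower()
        if "transposase" in label or "integrase" in label:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False
```

A run that stays above the minimum threshold but never exceeds the maximum threshold is discarded, which is what distinguishes this rule from a single cutoff.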
2.2 PhigaroFinder parameter optimization
- ‘Black list’ penalty: -2.2
- ‘White list’ bonus: +0.7
- Minimum score threshold: 45.39
- Maximum score threshold: 46.0
- HMMSCAN E-value: 0.00445
- Window size: 32 ORFs
For this set of parameters, the Jaccard index was 0.627 and the positive predictive value (PPV) was 0.872.
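For reference, the two metrics can be computed from predicted and reference prophage coordinates as in the sketch below. It assumes that both metrics are evaluated per base pair over inclusive (start, end) intervals on a single contig; the text does not state the unit of comparison, and the function names are hypothetical.

```python
def covered_positions(regions):
    """Set of all positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in regions:
        positions.update(range(start, end + 1))
    return positions


def jaccard_and_ppv(predicted, reference):
    """Return (Jaccard index, PPV) for predicted vs. reference regions.

    Jaccard = |P ∩ R| / |P ∪ R|; PPV = |P ∩ R| / |P|.
    Both are defined as 0.0 when their denominator is empty.
    """
    pred = covered_positions(predicted)
    ref = covered_positions(reference)
    true_positive = len(pred & ref)
    union = len(pred | ref)
    jaccard = true_positive / union if union else 0.0
    ppv = true_positive / len(pred) if pred else 0.0
    return jaccard, ppv
```

The Jaccard index penalizes both missed and spurious bases, whereas PPV only penalizes spurious ones, so a conservative predictor can score high on PPV and lower on the Jaccard index.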
2.3 Performance analysis
Phigaro’s performance was compared with that of other prophage prediction tools using a manually constructed dataset with previously annotated prophage regions: 14 organisms and 25 prophage regions in total. Although several prophage prediction tools exist to date [such as PHASTER (Arndt et al., 2016), VirSorter (Roux et al., 2015), Phage_Finder (Fouts, 2006), ProphET, Prophinder (Lima-Mendez et al., 2008) and PhiSpy (Akhter et al., 2012)], only the first two accept unannotated FASTA sequences as input. To compare the performance of the three tools (Phigaro, PHASTER and VirSorter), we used the same metrics as in the grid search procedure: Jaccard index and PPV (Table 1). The validation dataset and the detailed validation results table can be found in Supplementary Tables S2–S4 and Supplementary Figures S6.1–S6.5.
Table 1. Performance of Phigaro compared to other prophage prediction tools accepting unannotated FASTA sequence as input
Program | App/web | Jaccard index | PPV | Average time |
---|---|---|---|---|
Phigaro (basic mode, default) | Standalone | 0.402 | 0.829 | 270 s |
Phigaro (abs_gc mode) | Standalone | 0.339 | 0.674 | 270 s |
Phigaro (without_gc mode) | Standalone | 0.240 | 0.538 | 270 s |
PHASTER | Web/API | 0.478 | 0.631 | 138 s + time in queue |
VirSorter (levels 1&4) | Standalone | 0.070 | 0.071 | 2829 s |
VirSorter (levels 1&2&4&5) | Standalone | 0.578 | 0.592 | 2829 s |
VirSorter (levels 1&2&3&4&5&6) | Standalone | 0.338 | 0.350 | 2829 s |
Although Phigaro performs worse than PHASTER and VirSorter in terms of mean Jaccard index, its mean PPV (0.829 in the default mode) is the highest among the compared tools (Table 1). Its mean execution time is also comparable with that of PHASTER, excluding the time spent waiting in PHASTER’s queue, which varies and can take up to several days. Overall, we show that Phigaro has decent performance compared to existing prophage prediction tools. In addition, the tool marks possible transposons inserted into prophages and provides dynamic visualizations for inspecting the genome annotation and organization of prophages.
Acknowledgements
The authors thank the Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency for providing computational resources for this project.
Funding
This work was supported by RFBR (grant number 16-54-21012) and SNSF (grant identifier IZLRZ3_163863).
Conflict of Interest: none declared.
References