Summary: Prophinder is a prophage prediction tool coupled with a prediction database, a web server and a web service. Predicted prophages will help to fill the gaps in the currently sparse phage sequence space, which should cover an estimated 100 million species. Systematic and reliable predictions will enable further studies of the contribution of prophages to the bacteriophage gene pool and a better understanding of gene shuffling between prophages and phages infecting the same host.
Availability: Software is available at http://aclame.ulb.ac.be/prophinder
Supplementary information: Supplementary data are available at http://aclame.ulb.ac.be/Tools/Prophinder/evaluations_table.html.
Phages are viruses infecting prokaryotes. The sub-group of temperate phages has the capability to remain in their host, in a latent stage, as prophages. Most of the prophages are found integrated in the host chromosome while some are established as plasmids. Whether functional or defective, prophages can recombine with other phages and/or prophages, a central mechanism in bacteriophage evolution (Hendrix et al., 1999).
At present, the coverage of the phage sequence space remains very narrow, even though phages are the most abundant organisms on Earth. The observed diversity gives only a hint of the real variety of the phage population, estimated at 100 million species (Rohwer, 2003). Previous studies support the view that a common gene pool is available to double-stranded (ds) DNA tailed phages (Hendrix et al., 1999). A significant portion of this pool resides in prophages, providing many recombination opportunities for other prophages residing in the same host or for infecting phages. Detecting prophages in prokaryotic genomes will therefore largely expand the phage sequence space and facilitate studies on gene exchange within the phage population.
We present Prophinder, an algorithm that combines similarity searches, statistical detection of phage-gene enriched regions and genomic context for prophage prediction. A database with prophage predictions in sequenced prokaryotic genomes has been developed with a Web interface for browsing the results. Prophinder can also be accessed via a programmatic interface (Web services), ensuring interoperability with other software tools.
Prophinder is written in Perl and uses the ACLAME database (Leplae et al., 2004) as the source of phage data for similarity searches, gene annotation and detection of conserved pairs of genes found in phage genomes.
2.1 Input data
The algorithm takes as input a prokaryotic genome sequence in GenBank format with annotated positions of genes and coding sequences (CDSs).
2.2 Detection of phage-like CDSs in the prokaryotic genome
Phage-like CDSs in a prokaryotic genome are identified by gapped BLASTP search (Altschul et al., 1997) of all the translated CDSs from the input genome against all phage proteins in ACLAME.
2.3 Detecting phage-like dense regions (PGDRs)
Our method of prophage prediction is based on the detection of genomic segments statistically enriched in phage-like genes. Each set of n consecutive CDSs is modeled as a succession of n trials (CDSs) that can each result in a success (phage-like) or a failure (not phage-like). The exceptionality of the enrichment is estimated with the binomial P-value, which represents the probability of observing by chance at least s phage-like CDSs in a set of n consecutive CDSs.
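The binomial P-value described above can be sketched in Python as follows. This is a minimal illustration, not the Prophinder implementation (which is written in Perl): the function names, the exhaustive window scan and the estimate of the per-CDS success probability p as the genome-wide fraction of phage-like CDSs are all assumptions made for the example.

```python
from math import comb


def binomial_tail(s: int, n: int, p: float) -> float:
    """Probability of observing at least s successes in n trials,
    each succeeding independently with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))


def scan_windows(flags, max_window):
    """Score every run of consecutive CDSs up to max_window long.

    flags: one boolean per CDS in genome order, True if phage-like.
    Returns tuples (start, end, s, n, p_value) with inclusive indices.
    """
    # Assumption: p is the genome-wide fraction of phage-like CDSs.
    p = sum(flags) / len(flags)
    results = []
    for start in range(len(flags)):
        s = 0
        for end in range(start, min(start + max_window, len(flags))):
            s += flags[end]          # running count of phage-like CDSs
            n = end - start + 1      # window length in CDSs
            results.append((start, end, s, n, binomial_tail(s, n, p)))
    return results
```

For instance, `min(scan_windows(flags, 50), key=lambda r: r[4])` would return the most significantly enriched window among those of at most 50 CDSs.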
2.4 Selecting the putative prophages
All PGDRs are sorted by decreasing sig values. Mutually overlapping PGDRs are then compared on the basis of hierarchical rules: (1) PGDRs containing an integrase gene always take precedence over overlapping PGDRs lacking an integrase gene; (2) PGDRs with higher sig take precedence over those with lower sig. The next step separates tandem prophages found in a single PGDR, based on the presence of integrase genes and/or instances of conserved gene pairs found in the ACLAME phage genomes.
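One possible reading of the two precedence rules is a greedy pass over the ranked candidates, sketched below. The `PGDR` record, its field names and the greedy strategy are hypothetical and chosen for illustration; the sig values are taken as given, and the splitting of tandem prophages is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PGDR:
    start: int            # index of the first CDS in the region
    end: int              # index of the last CDS in the region (inclusive)
    sig: float            # significance score of the region
    has_integrase: bool   # True if the region contains an integrase gene


def overlaps(a: PGDR, b: PGDR) -> bool:
    """True if the two regions share at least one CDS."""
    return a.start <= b.end and b.start <= a.end


def select_pgdrs(candidates):
    """Keep a non-overlapping subset of the candidate regions.

    Rule 1: regions with an integrase outrank regions without one.
    Rule 2: among equals, higher sig outranks lower sig.
    """
    ranked = sorted(candidates, key=lambda r: (not r.has_integrase, -r.sig))
    kept = []
    for region in ranked:
        if not any(overlaps(region, k) for k in kept):
            kept.append(region)
    return sorted(kept, key=lambda r: r.start)
```

In this sketch a low-sig region carrying an integrase displaces an overlapping high-sig region without one, while regions that overlap nothing are always retained.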
2.5 Iterative process
Some PGDRs with lower sig values, often representing small prophages or prophage remnants, may escape the selection. These can be recovered through an iterative process, by running a new round of selection each time on the same scoring matrix, in which the PGDRs selected in the previous iteration are masked (sig scores set to −1). The iterative process stops either when no new PGDR is detected or when the number of iterations set by the user is reached.
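The masking loop and its two stopping conditions can be illustrated with the short sketch below. The dictionary representation of the scoring matrix, the function names and the toy selection rule (`top_region`, which picks the single best region above a threshold) are assumptions made for the example and stand in for the actual selection step.

```python
MASKED = -1.0  # sig value assigned to regions selected in an earlier round


def run_iterations(sig, select, max_iterations):
    """Repeat the selection on a progressively masked scoring matrix.

    sig: dict mapping (start, end) to a sig score.
    select: callable that returns the keys chosen in one round.
    Stops when a round selects nothing or max_iterations is reached.
    """
    sig = dict(sig)  # work on a copy so the caller's matrix is untouched
    selected = []
    for _ in range(max_iterations):
        new = select(sig)
        if not new:
            break
        selected.extend(new)
        for key in new:
            sig[key] = MASKED
    return selected


def top_region(sig, threshold=0.0):
    """Toy selection rule: the best-scoring region, if above threshold."""
    best = max(sig, key=sig.get, default=None)
    return [best] if best is not None and sig[best] > threshold else []
```

With this toy rule, each round recovers the next-best region until the remaining scores fall below the threshold or the iteration limit is hit.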
2.6 Secondary search
For more exhaustive prophage detection in bacterial genomes, Prophinder can load a set of prophages and mask them when counting the phage-like CDSs. This lowers the expected probability (p) of phage-like CDSs in the genome, leading to additional putative prophages from the new significance matrix. This option may be useful for genomes with high average density of phage-like CDSs. Prophages predicted using this option need to be analyzed cautiously. The secondary search can be immediately run after completing the predictions using the predicted prophages as the source for masking the phage-like CDSs.
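The effect of masking on the expected probability p can be illustrated with a small sketch. It assumes that p is estimated as the fraction of phage-like CDSs in the genome and that masked CDSs are removed from the count of phage-like CDSs but not from the total; both points, like the function name, are assumptions for illustration only.

```python
def background_probability(flags, masked_ranges=()):
    """Estimate p, the chance that a CDS is phage-like.

    flags: one boolean per CDS in genome order, True if phage-like.
    masked_ranges: inclusive (start, end) CDS index ranges of known
    prophages whose phage-like CDSs are excluded from the count.
    """
    masked = set()
    for start, end in masked_ranges:
        masked.update(range(start, end + 1))
    hits = sum(1 for i, is_phage in enumerate(flags) if is_phage and i not in masked)
    return hits / len(flags)
```

With a lower p, a window containing the same number of phage-like CDSs receives a smaller binomial P-value, which is how additional candidate regions can emerge.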
By default, Prophinder is executed with several maximum window sizes (300, 200, 100, 50 and 20 CDSs). A consensus is then created by combining the results from all the predictions. The consensus is the default solution proposed to users.
3 ASSESSMENT OF PROPHINDER PERFORMANCE
Prophinder predictions were evaluated against a collection of annotated prophages provided by S. Casjens [extension from (Casjens, 2003)]. The Supplementary Table (accessible at http://aclame.ulb.ac.be/Tools/Prophinder/evaluations_table.html) provides the evaluation procedure and results. This evaluation gives a sensitivity of 79% and a positive predictive value of 94% for Prophinder. For the sake of comparison, Phage_Finder (Fouts, 2006) achieves a sensitivity of 67% and a positive predictive value of 94% under the strict settings. However, it cannot be ruled out that Phage_Finder can reach higher sensitivity than Prophinder with other combinations of parameters. The two methods are thus capable of detecting a large number of prophages in bacterial genomes while producing very few false positives. Many false negatives in both methods consist of small prophages with only a few genes similar to those of known phages. This limitation is expected for such methods, but is likely to be reduced as more phages are annotated. Prophinder, as an additional asset, is capable of detecting tandem prophages as such. Moreover, the execution time for predicting prophages in a genome such as Escherichia coli O157:H7 EDL933 (NC_002655) on a Pentium 4 2.6 GHz with 1 GB of memory is around 40 min for Prophinder and 130 min for Phage_Finder.
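For reference, the two performance measures quoted above follow the standard definitions sketched below. The counts in the usage note are invented purely to show the arithmetic and are not taken from the evaluation; the actual procedure is described in the Supplementary Table.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of annotated prophages that were predicted."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Fraction of predictions that match an annotated prophage."""
    return true_pos / (true_pos + false_pos)
```

With hypothetical counts of 79 true positives, 21 false negatives and 5 false positives, `sensitivity(79, 21)` gives 0.79 and `positive_predictive_value(79, 5)` gives about 0.94.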
4 DATABASE AND WEB INTERFACE
A relational database has been developed to store all Prophinder predictions. A Web interface (http://aclame.ulb.ac.be/prophinder) allows browsing of the prediction results. A GenBank entry can be submitted to the server (http://aclame.ulb.ac.be/perl/Aclame/Prophages/prophinder.cgi) to run Prophinder and view the prediction results. Execution parameters can be set, such as the BLASTP E-value threshold, the maximum window sizes to be used for screening the genome, the secondary search option, etc. All the options are documented on the web form. The Prophinder database is regularly updated with predictions from newly sequenced genomes using the default parameters.
5 WEB SERVICE
To facilitate the use of Prophinder in automated processes, such as an annotation pipeline, a Web service has been developed. The WSDL, which is the primary programmatic interface for web clients, defines the services and is accessible at http://aclame.ulb.ac.be/prophinder/prophinder.wsdl. The Web service allows Prophinder to be run remotely on GenBank entries. The predictions stored in the Prophinder database can also be retrieved through the Web service. Two Perl clients for the Prophinder Web service are available for download.
We are grateful to Sherwood Casjens for providing the manually annotated prophage data and to Olivier Sand and Morgane Thomas-Chollier for their help in the web service development. Our work is supported by ESA-PRODEX (contract C90254), the Fonds de la Recherche Fondamentale Collective (FRFC), the Actions de Recherche Concertés du Ministre de la Communauté Française de Belgique and the Université Libre de Bruxelles (ULB). G.L.M. was supported by the Fonds Xenophilia, ULB.
Conflict of Interest: none declared.