WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs

Abstract Summary WIsH predicts prokaryotic hosts of phages from their genomic sequences. It achieves 63% mean accuracy when predicting the host genus among 20 genera for 3 kbp-long phage contigs. Over the best current tool, WisH shows much improved accuracy on phage sequences of a few kbp length and runs hundreds of times faster, making it suited for metagenomics studies. Availability and implementation OpenMP-parallelized GPL-licensed C ++ code available at https://github.com/soedinglab/wish. Supplementary information Supplementary data are available at Bioinformatics online.

For a potential bacterial or archaeal host H, a Markov model of order k is trained on its genome by a maximum-likelihood approach. The maximum likelihoods solution of the Markov model's conditional probability of observing nucleotide x k+1 given preceding nucleotides x 1 ...x k+1 is where #x 1 ...x k+1 is the number of times that the string x 1 . . . x k+1 has been observed in the genome of H and α is a pseudo-count parameter. The log-likelihood per position for a sequence y = y 1 ...y N of a phage Φ under the model H is: log p H (y i+k |y i ...y i+k−1 ) WIsH returns a matrix that contains in its cell (i, j) the mean log-likelihood LL(Φ j |H i ) of the phage sequence Φ j under the model H i in the training set.

p-values
If the phage φ does not infect the prokaryotic host H, we assume that composition of their genomes will be independent. Then LL(Φ|H) will be an average of independent identically distributed random variables log p H (y i+k |y i ...y i+k−1 ), thus following a Gaussian distribution by the central limit theorem. For a given host H, by selecting phage sequencesΦ 1 , ...Φ n that do not infect H, one can fit by maximum-likelihood the parameters µ H , σ H of the Gaussian null-distribution for H: If an organism H is predicted to be the host of a phage Φ with a score of s = LL(Φ|H), then we can compute the following p-value: where S ∼ N (µ H , σ H ).

Benchmark details 2.1 Building new null-models
We provide in the file KeggGaussianFits.tsv the pre-computed parameters for the Gaussian nulldistribution of every prokaryotic genome used in the benchmark. If a user wants to compute the parameters for the null-distribution for another prokaryotic genome newProk of genus G, the following process can be used: Get a (large) set Z of phages that are known to infect another genus than G, build a model with WIsH using the genome file newProk.fna, run the WIsH prediction of the model of newProk against the set Z, fit (e.g. using maximum likelihood) a gaussian distribution over the scores given in the likelihood. matrix file, add a line to a file negFits.tsv with the following format: newProk<TAB>Mean<TAB>StandardDeviation To get the p-values associated to a prediction, the user can then use the options -b -n negFits.tsv when running a prediction.

p-value and accuracy
We define the precision as the fraction of correct predictions among all predictions: We moreover define the "fraction predicted" as: Fraction predicted = T P + F P T P + F P + T N + F N The accuracy is defined as the precision when making prediction on all the phage sequences (i.e.precision for a predicted fraction of 1.0). Figure 1 shows the precision and the fraction of predictions which the user can expect for different p-values. The user can fix a suited threshold depending on whether a low false discovery rate or a high amount of predictions is desired.

Influence of the order of the Markov model
As shown the figure 2, the accuracy is generally maximal for order 8. The drop in accuracy for higher orders could be explained by the too high specificity of the models to their host, and fail at detecting a strong signal in more distant genomes such as phages that infect them. Another point is that at order 9 the model parameters cannot be estimated accurately enough anymore, because a genome of typical length 5 Mbp will yield only about 5.10 6 /4 10 = 4.8 counts per 10-mer, but will still yield 19 counts per 9-mer. The optimal order k = 8 is thus a general compromise learning as complex a model as possible with conditional probabilities that are still sufficiently reliably estimated.
The order of the model can be tuned accordingly to the size of the prokaryotic genome use for training. Figure 3 shows the accuracies of the prediction depending on the order of the Markov model when training on truncated versions of the prokaryotic genomes (down to 25kbp and 1Mbp).

ROC curve on WIsH benchmark
For plotting the ROC curves in figure 4, z-scores were used instead of raw log-likelihood (calling WIsH with -z option) to have comparable values between two different phages. As for log-likelihood values, the higher the z-score, the more probable the interaction is. The areas under the ROC curve are comparable when using WIsH or VirHostMatcher for full-genome phage sequences, although slightly better for WIsH.

Evaluation under lower prokaryotic diversity
In real use cases, the genera richness of the prokaryotic fraction is typically between 10 and 150 genera (Santigli et al., 2016;França et al., 2016;Nasidze et al., 2009). If the prokaryotic fraction has been sequenced as well, one can restrict the prediction of the host to the genera that are present (or even dominant) in the prokaryotic fraction. By randomly subsampling fewer genus (respectively orders), Figure 5 (respectively Figure 6) present an estimate of the expected accuracy when restricting the prediction to the corresponding number of genus (resp. orders).  4 Evaluation on 16S sequence identity Yarza et al. (2014) state that "although there are official rules for the nomenclature of bacteria and archaea, the entities that are known as taxa and their hierarchical classifications are artificial constructs and so are somewhat subjective". As a consequence, bacteria of different genera can be very closely related whereas bacteria of the same genus can be more divergent. For example, certain Escherichia and Shigella strains have more than 99% residue identity for their 16S rRNA genes and certain Escherichia and Salmonella), whereas some bacteria in the same genus Escherichia have less than 96% residue identity among their 16S rRNA genes. Thus, a classification of bacteria and archaea based on a more quantitative criterion would be preferable in general and, in particular, to test the accuracy of the WIsH predictions. Yarza et al. (2014) proposes such a classification based on 16S sequence identity using 16S sequence identity thresholds that typically correspond to the usual taxonomy (these thresholds are reported in table 3). Tables 2 and 3 show the accuracies of WIsH and VHM under these 16S taxonomic criteria for 3kbp contigs and full-length genomes respectively. Figure 7 shows the relation between the p-value of the best prediction made by WIsH and the 16S identity between the predicted host and the actual one.  Table 3: Accuracies of the predictions on the WIsH benchmark (full-length phage genomes) using 16S rRNA residue identity thresholds for taxonomic assignment.

Evaluation with distant prokaryotic genomes
Since the diversity of the prokaryotic world is yet not fully known, the genomes of the actual host of a phage may still be unavailable. In this case, one shall rely on more distantly available genomes (organism in the same genus as the actual host for instance). We evaluate here the accuracies of the prediction when removing from the evaluation the genomes of prokaryotes closely related to the actual host. More specifically, for a given phage, we predict only using models built on prokaryotes belonging to a different genus than the actual infected genus. We kept only the phages (N = 1016) infecting genus having more than a single genus in their family, since it cannot otherwise lead to any right prediction at family level. The resulting accuracies in table 4 do not prove to be enough for a reliable prediction of the taxonomy of the host. However, by restricting the number of possible orders in the prediction, as done in the section 3, we show on figure 8 that the prediction accuracies reach more acceptable values. WIsH and VirHostMatcher have comparable accuracy at family level, and WIsH improve on VirHost-Matcher on higher taxa prediction. VirHostMatcher is slightly better than WIsH at family level when testing on 10kbp contigs (cf. Figure 9), but the accuracy of both tools at family level is so low that it will not be useful in practice.
In practice, for each viral metagenome generated the prokaryotic fraction of the sample can be obtained and sequenced as well. By binning the prokaryotic metagenomic contigs, one can learn the WIsH models on the set of sequences of each bin (with model order that can be adjusted to the total length of the bins according to the graphs in section 2.3), and use them to scan the associated viral metagenome. If a virus is present in a significant amount, there is good hope for the infected organism to be also part of the prokaryotic fraction, and therefore to be accurately detected by WIsH.  Table 4: Accuracies of the predictions on the WIsH benchmark on 3kpb contigs when using only distantly related genomes from the host.

Annotation of Earth's virome contigs
Hosts have been predicted using WIsH for the 125,842 metagenomic viral contigs (mVCs) of the Earth's virome (Paez-Espino et al., 2016) using prokaryotic models from the WIsH benchmark dataset. For a given p-value threshold on the prediction, the expected precision was measureddefined as the fraction of the WIsH predictions that match the original host annotation of (Paez-Espino et al., 2016). These annotations originally covered 7.7% of the contigs using alternative host prediction methods such as CRISPR or tRNA sequence matches. Here, only the host annotations whose genus was represented in our prokaryotic dataset were kept, finally amounting to 5.3% of the mVCs. Figure 10 shows the expected precision with respect to the fraction of newly annotated contigs. For instance, setting a p-value threshold of 10 −1 (red line) allows to find annotation for more than half of the unannotated contigs, while being consistent at 70% with the original host family annotation.

Influence of encoded tRNAs
The more tRNA encoded in the viral genome, the worse was the prediction. The number of true (in green) and wrong (in red) predictions at species and genus level respectively are plotted in figures 11 and 12. At species level, all predictions made for phage genomes encoding more than 4 tRNAs are wrong. This is also significant at genus level since the 10 correctly predicted phages encoding more than 30 tRNA in figure 12 come down to only 1 single phage infecting the genus Mycobacterium (the 10 instances are all coming from the cluster C1 described in Pope et al. (2015)).
Taxonomic level Specie Genus Family Order Class Phylum p-value 5.10 −12 3.10 −11 9.10 −16 2.10 −16 2.10 −11 5.10 −11 Table 6: p-value under a Wilcoxon rank-sum test of number of encoded tRNAs between phages with correctly and wrongly predicted host. Figure 10: Precision (using original annotation as reference) at family level vs. fraction of newly annotated contigs. Figure 11: Distribution of the number of encoded tRNAs for correct (green) or wrong (red) host predictions for 1,420 phages of the WIsH benchmark, at species level. Figure 12: Distribution of the number of encoded tRNAs for correct (green) or wrong (red) host predictions for 1,420 phages of the WIsH benchmark, at genus level.

Influence of number of coding sequences on prediction
As for the number of encoded tRNAs, the more coding sequences (CDS) present in the genome, the less accurate the host prediction gets.
Taxonomic level Specie Genus Family Order Class Phylum p-value 1.10 −12 1.10 −5 9.10 −9 1.10 −7 3.10 −7 1.10 −7 Table 7: p-value under a Wilcoxon rank-sum test of number of coding sequences between phages with correctly and wrongly predicted host. Figure 13: Distribution of the number of CDS for correct (green) or wrong (red) best host predictions for 1,420 phages of the WIsH benchmark, at species level. Figure 14: Distribution of the number of CDS for correct (green) or wrong (red) best host predictions for 1,420 phages of the WIsH benchmark, at genus level.