SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts

Abstract Summary SpacePHARER (CRISPR Spacer Phage–Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage–host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes. Availability and implementation SpacePHARER is available as an open-source (GPLv3), user-friendly command-line software for Linux and macOS: https://github.com/soedinglab/spacepharer. Supplementary information Supplementary data are available at Bioinformatics online.


I. ALGORITHM DESCRIPTION
The query spacer set Q has N q translated ORFs q of CRISPR spacers (Q = {q 1 , ...q Nq }) from one prokaryotic genome. Phage proteome target set T has N t phage protein sequences t (T = {t 1 , ...t Nt }). These protein sequences are extracted in the input preprocessing step (Step 0) of the algorithm from each spacer set and each phage genome by scanning them in six translational frames. We refer to similarity between q and t as hit, and similarity between Q and T as match. The SpacePHARER algorithm relies on a statistic for the combination of hits between a spacer sequence set and a phage protein sequence set. The idea is that combining together several sub-significant hits (due to weak homologies or the typical length of spacers) can be highly informative and result in a significant match. Steps 2 and 3 of the algorithm test if the pairwise P-values of the best hit of sequences in the query set with those in the target set are due to homologous relationships or entirely due to chance.

A. (1) MMseqs2 protein-level search
The SpacePHARER algorithm first searches all q's against all t's using the fast, sensitive MMseqs2 proteinlevel search [10], with VTML40 substitution matrix [8], gap open cost of 16, gap extension cost of 2, and a short, spaced k-mer pattern for the prefilter stage (10111011) with six informative ('1') positions. Spaced k-mers are utilized in MMseqs2 to reduce the correlation between k-mers at neighboring positions, and to achieve better sensitivity and speed. The spaced k-mer pattern is chosen such that it is short in length in order to produce consecutive double k-mer matches (which are demanded by MMseqs2) within spacer fragments of 10-12 aa, and that the number of maximum overlapping informative positions is minimized.
Perfect or near-perfect hits (with no or 1-2 mismatches on the nucleotide level) are shown to be very reliable signals in predicting phage-host relationship and improve the taxonomic certainty of the prediction, even if there is only a single hit between a phage-host pair [4]. However, those hits are not well reflected in the pairwise Pvalue of the protein-level search. Therefore, all q − t hits reported from the sensitive protein-level search will be aligned again on the nucleotide level with match reward of 1, mismatch penalty of 1, gap open cost of 10 and gap extension cost of 2. The protein-level search will compute a protein pairwise P-value (p prot ) for each hit and nucleotide alignment a nucleotide pairwise P-value (p nucl ). In order to prioritize near-perfect hits on the nucleotide level to gain precision without losing much sensitivity, we compute the pairwise P-value as exp (min {(0.5 log p prot + 0.5 log p nucl ) , log p nucl }) (1)

B. (2) Computing P-value of best hit
All hits of each q against the N t proteins in a specific phage genome T are examined by their pairwise P-values, and the hit with the lowest pairwise P-value ("best hit") is retained. SpacePHARER computes the P-value of the best hit p bh (q) using first order statistics, i.e. the P-value of taking the minimum pairwise P-value (p(q)), given that a total of N t pairwise P-values were examined: Nt (2)

C. (3) Combining P-values using a modified truncated product method
In this step, we aim to combine the evidence from several best hits between a spacer set Q and a phage genome T . We sort the p bh of the given set Q of N q sequences in ascending order and denote the i'th p bh as p i . When combining independent P-values of individual hits, one needs to take into account the number of individual hits and the strength of each hit. The truncated product method combines independent P-values into a score by multiplying all p bh (q) smaller than a threshold p 0 [14], where I(·) is the indicator function that returns 1 if the argument is true and otherwise returns 0.
In SpacePHARER, we modified the truncated product method for better performance. We take the product of the smallest best-hit P-value p 1 times the ratio between p i and the threshold p 0 for all further p i below the threshold p 0 : For the threshold, we set p 0 = 1/(N q + 1), which corresponds to marginal significance, with an E-value of N q /(N q + 1) just below 1. This ensures that the combined score for null model-distributed P-values p i only rarely gets boosted by a contribution from the secondbest p i .

D. (4) Determining true predictions
SpacePHARER predicts matches de novo, i.e. without relying on any known phage-host relationships, by controlling for estimated false discovery rate (FDR). The FDR is the proportion of false predictions among all predictions: We implemented an FDR estimation approach similar to that of the R package "fdrtool" [11]. In essence, we estimate the FDR by a Grenander decreasing density estimate of the empirical cumulative distribution function (ECDF). This non-parametric approach achieves its robustness by ensuring monotonicity of the FDR.
SpacePHARER uses a null model dataset to estimate the proportion of false predictions. The same search and statistical computation procedures described in Steps 1, 2 and 3 of the algorithm are performed on a given null model dataset, e.g. inverted phage ORFs or eukaryotic viral ORFs. Inverting target ORFs as null model dataset can be easily performed by specifying one parameter when preparing the input.
To compute an empirical P-value for each query spacer set Q, we sort for each Q the combined scores S comb of matches in the original target dataset of phage proteomes in ascending order. For each S comb value in the target dataset, we calculate an empirical P-value p emp by using the fraction of Q−T matches with a combined score that is below S comb in the null model dataset. We denote the number of Q − T matches below the cutoff as K and the total number of matches using the null model dataset as N null . The empirical P-value is then computed as where, to stabilize the estimate, we used half pseudocounts with P-values at 0 and 1. In the following, we abbreviate these empirical P-values as p, or p Q for query set Q.
If we knew the fraction π 0 of false positives among all Q − T matches, we could in principle estimate the false discovery rate simply as where p π 0 is the fraction of false positives with empirical P-value less than p i . F emp (p) is the empirical cumulative distribution function of the p Q , in other words F emp (p) is the number of query sets Q with best matches p Q ≤ p.
We can increase the robustness of the estimate by using the fact that the true probability distribution of P-values f (p) must be monotonously decreasing. This will also ensure that the FDR decreases with increasing p, which is often violated with the simple procedure above. The Grenander estimate [11] is a simple, efficient procedure to obtain a robust estimateF (p) of F (p) from F emp (p) that has monotonously decreasing densityf (p) = dF (p)/dp. We simply obtain the convex hull of the area under the F emp (p) curve, that is, the smallest functionF (p) witĥ F (p) ≥ F emp (p) that yields a convex area under the curve. This results in a piecewise constant, monotonously decreasing density functionf (p) = dF (p)/dp with steps at points p i with p last = 1. We estimate the proportion of true null hypotheses π 0 as the average density using the last two steps, Finally, we compute the estimated FDR corresponding to each empirical P-value p (Fig.1A) as By default, SpacePHARER has an FDR cutoff of 0.05, and reports all matches in the test whose S comb corresponds to this FDR value or lower. Users can select other suitable FDR cutoffs to retain more or fewer predictions.

E. (5) Scanning for possible PAMs
For some CRISPR-Cas systems, protospacer adjacent motifs (PAMs) are required for the recognition of foreign invader sequences. After reporting phage-host pairs and their hits, SpacePHARER can perform a scan for possible PAMs. For this, SpacePHARER by default extracts 10 nt long fragments flanking the matched protospacer region at the 5' and 3' side, in guide-centric orientation (PAM is located on the strand that matches the spacer sequence). Users can increase or decrease the length of the flanking sequence. Both the 5' and 3' flanking sequences are searched in a list of consensus PAM patterns from representative CRISPR-Cas systems [5]. Since many CRISPR detection tools cannot reliably predict the orientation of the CRISPR array, the 5' and 3' flanking sequences on the reverse strand are also searched and two additional possible PAMs are reported. Users should refer to all possible PAMs without the accurate orientation information of the array.

II. OPTIMIZING PARAMETERS FOR SHORT FRAGMENTS SEARCH
Different substitution matrices are optimal for comparing sequences that have diverged to different degrees. By default, MMseqs2 search [10] uses the BLOSUM62 matrix with standard gap penalties: gap open cost of 11 and gap extend cost of 1 , which is more suited for long alignments and detecting weak protein similarities. Conversely for shorter sequences and higher protein similarity, one should consider a "shallower" (higher bit score per aligned column) matrix, and higher gap penalties to prevent gaps [9]. Searching with VTML40 matrix [8] with gap open cost of 16 and gap extend cost of 2 yielded the highest sensitivity with 20% on our test dataset at FDR cutoff of 0.05 ( Figure S2). We introduced a series of VTML matrices in MMseqs2 to solve general problems of short sequence search. After introducing the additional nucleotide alignment step, the search parameter combination (VTML40 matrix, gap open cost of 16 and gap extend cost of 2) remains the highest in sensitivity (result not shown).

III. PREDICTING MATCHES USING BLASTN
We compared SpacePHARER's performance with the state-of-the-art method using BLASTN. To generate a comparable result, we performed the search step with BLASTN and the downstream FDR control with SpacePHARER. We used BLASTN [1] to first query the 80% test spacer dataset against 7,824 phage genomes, then against 7,824 inverted phage genomes or 11,304 eukaryotic viral genomes as a null model database. For all searches we used the parameters: -max_target_seqs 10000000 -dust no -word_size 7 -outfmt '6 std qcovs' and recorded the running time. Hits with at least 95% sequence identity and 95% query(spacer) coverage (i.e., one or two mismatches were allowed) were retained. We grouped the hits into matches (unique phage-host genome pairs) and retained the minimum pairwise Evalue of the hits. We sorted the pairwise E-values of hits in ascending order for both searches and counted the matches at a given pairwise E-value cutoff. Therefore, we could calculate an FDR in the same way SpacePHARER does (described in section I.D) and compare the number of true predictions produced by the two methods ( Figure  1B).

IV. HOST TAXONOMIC RANK ANALYSIS
To assess the sensitivity of SpacePHARER at different host taxonomic rank, we searched with CRISPR spacers extracted from 1,066 bacterial genomes against 809 phage genomes with annotated host taxonomy [4], then against inverted ORFs of the 809 phage genomes as null model dataset. For each phage, SpacePHARER predicted the host's lowest common ancestor (LCA) based on a weighted LCA procedure [7].
We demanded a stricter FDR cutoff of 0.02 for matches that should be taken into account for the host taxonomic rank prediction. In order to limit the number of false taxonomic predictions due to incomplete databases, the LCA result was further corrected according to the average nucleotide sequence identity of the reported matches [6]. We used the following cutoffs for maximal taxonomic resolution: > 86% (species), > 84% (genus), > 82% (family), > 80% (order), > 78% (class), > 76% (phylum), > 74% (kingdom). Lower values were assigned at the superkingdom level. The taxonomic FDR cutoff and sequence identity cutoffs are user-definable parameters for the weighted LCA procedure.
We searched with the above-mentioned spacer dataset against phage genomes using BLASTN with parameters: blastn-short -dust no -word_size 7 -outfmt '6 std qcovs' -evalue 1 -gapopen 10 -gapextend 2 -penalty -1 [4]. Hits with at least 95% sequence identity and 95% query(spacer) coverage were retained (i.e., one or two mismatches were allowed). For each phage, the bacterium with the lowest pairwise E-value was predicted to be its host. Note that in Edwards et al., the authors searched with the phage genomes against the spacer dataset, and demanded 100% spacer coverage.
For ranks lower than phylum, we only included the predictions with the taxonomic resolution of the respective rank or below. At the species level, SpacePHARER predicted 142/237 hosts (60%), compared to 112/232 hosts of BLASTN (48%). SpacePHARER predicted the correct host for more phages at all taxonomic ranks, while including most of the BLASTN predictions on the same rank and sometimes even those agreeing only on a higher rank( Figure 1C, Figure S3).
Incomplete reference databases remain an issue for phage-host relationship predictions. To simulate scenarios where the database is very incomplete, we progressively exclude 25% and 50% of the host genomes in the spacer dataset, and compare the performance between BLASTN and SpacePHARER. SpacePHARER predicted the correct host for more phages than BLASTN at all taxonomic ranks when we searched with 50% and 75% of original host spacer dataset ( Figure S4).

V. IDENTIFYING MIS-ANNOTATIONS IN EUKARYOTIC VIRAL DATASET
Throughout this study we used the set of eukaryotic viral genomes as a null model dataset, assuming any match between a prokaryotic genome and a eukaryotic virus is false. Here, we used SpacePHARER's second mode of FDR control to detect viruses that were potentially misannotated as eukaryotic viruses. To that end, we first ran the SpacePHARER workflow with the full spacer dataset against the eukaryotic viral genomes as the target database, and then, against inverted eukaryotic viral ORFs as the null model database. We used the null set to estimate the FDR as described in section I.D.
By applying the same FDR cutoff of 0.05, we identified 11 viruses out of the 11,304 that matched a prokaryotic host (yielding a total of 12 matches). We observed three groups within these matches. The first group consisted of two matches between the smacovirus family (KP264966.1 and KY086299.1) and the archaeon CP005934.1 (Candidatus Methanomassiliicoccus intestinalis). Indeed this family has been recently reported as mis-annotated as "eukaryotic virus" by Díez-Villaseñor and Rodriguez-Valera [3]. The second group consisted of two matches between KT809302.1 (Haloarcula californiae icosahedral virus 1) and family Halobacteriaceae (CP001687.1 and LIST01000008.1). These matches are likely due to misannotation of the virus as "eukaryotic virus". The labeled host of this virus is Haloarcula californiae, which is an archaeon that belongs to the same family as our matches. The third group consisted of 8 members of the genus Mimivirus that were matched to HE978663.1 (Ruminococcus sp. JC304) and JAAF01000022.1 (Fusobacterium necrophorum DAB). Table I shows the standard output from SpacePHARER of this search. We suspect the matches of the third group are due to spacer misannotation and do not represent a real virus-host relationship. It was previously reported that Mimiviruses acquire bacterial genes, even of the class Clostridia [12] [13]. In the case of Ruminococcus sp. JC304, when we inspected the bacterial genomic region from which the spacers were extracted, we found that the entire region is likely to be a full bacterial ORF, rather than a CRISPR array. Thus, we conclude that in these cases, the misannotation is of the CRISPR array, rather than of the virus.

Name
Version SpacePHARER Git: 1d1f1b2 BLASTN 2.9.0+  Figure 1C, but on incomplete databases. The host spacer dataset was progressively depleted from 100% of genomes (1,066) to 75% (800) and 50% (533). Performance is evaluated by the number of host predictions that agree with annotated host taxonomy at different taxonomic ranks. with '#', followed by the prokaryote accession (the file from which spacers were extracted), viral genome accession, S comb and the number of hits in the match. Each hit line starts with '>', followed by the spacer sequence header, viral genome accession, p bh , spacer start, spacer end, viral genome start, viral genome end, and the possible PAM sequences on forward and reverse strand (5'|3'). Additionally (not shown), the aligned sequences can be printed following each hit line.