In silico abstraction of zinc finger nuclease cleavage profiles reveals an expanded landscape of off-target sites

Gene-editing nucleases enable targeted modification of DNA sequences in living cells, thereby facilitating efficient knockout and precise editing of endogenous loci. Engineered nucleases also have the potential to introduce mutations at off-target sites of action. Such unintended alterations can confound interpretation of experiments and can have implications for development of therapeutic applications. Recently, two improved methods for identifying the off-target effects of zinc finger nucleases (ZFNs) were described–one using an in vitro cleavage site selection method and the other exploiting the insertion of integration-defective lentiviruses into nuclease-induced double-stranded DNA breaks. However, application of these two methods to a ZFN pair targeted to the human CCR5 gene led to identification of largely non-overlapping off-target sites, raising the possibility that additional off-target sites might exist. Here, we show that in silico abstraction of ZFN cleavage profiles obtained from in vitro cleavage site selections can greatly enhance the ability to identify potential off-target sites in human cells. Our improved method should enable more comprehensive profiling of ZFN specificities.


INTRODUCTION
Gene-editing nucleases, such as zinc finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs) and clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated (Cas) nucleases, can be used to create targeted sequence alterations with high efficiencies in numerous cell types and organisms (1)(2)(3)(4)(5)(6)(7). Repair of nuclease-induced doublestranded breaks can be exploited to introduce either insertion/deletion (indel) mutations via non-homologous end-joining (NHEJ) or specific sequence alterations from a donor template via homology-directed repair (1,2). Comprehensive delineation of unintended off-target mutations is important for customized nucleases in many biological applications and will be essential for developing therapeutic strategies based on these proteins.
Two different methods have recently been described for characterizing the genome-wide specificities of ZFNs (8,9), but neither study comprehensively identified the fullspectrum of possible off-target mutations. One method, previously developed by Liu and colleagues, used an in vitro cleavage site selection assay to identify sequences from a large partially degenerate library (based on the intended target DNA site) that can be cleaved by ZFNs. In vitro selections with a CCR5-targeted ZFN pair identified 36 potential off-target cleavage sites that occur in the sequence of the human genome; analysis of these sites in human cells in which CCR5-targeted ZFNs had been expressed revealed nine bona fide off-target sites (8). Another approach, described by von Kalle and colleagues, exploited the incorporation of integrase-deficient lentivirus (IDLV) DNAs into nuclease-induced doublestranded breaks to map ZFN cleavage sites in human cells (9). Application of this approach to the same CCR5-targeted ZFNs characterized with the in vitro selection approach identified four off-target genomic sites.
However, the substantial lack of overlap between offtarget sites identified in these two studies (only one site was common to both sets) strongly suggested that neither identified all possible off-target sites. In addition, these results also suggested that a broader range of potential off-target sites might exist beyond the sets identified by these two methods.
Here, we show that in silico abstraction of ZFN cleavage profiles generated by the selection method of Liu and colleagues provides an improved approach to screen the human genome for potential ZFN off-target sites. This enhanced strategy identifies both previously described as well as dozens of additional off-target sites for a ZFN pair targeted to CCR5 gene. We also show that this improved method works effectively for another ZFN pair targeted to the VEGFA gene. Our results demonstrate that the potential landscape of off-target mutagenesis effects for ZFNs may be broader than delineated in previous studies.

Plasmids
The plasmids encoding ZFNs targeted to sites in the human CCR5 (10) and VEGFA (11) genes were modified to include heterodimeric EL/KK FokI mutations (12) and were constructed as described in Pattanayak et al. (8) (Supplementary Figure S1).

Processing of in vitro selection data
Sequence reads from the in vitro cleavage assay reported by Pattanayak were used to generate nucleotide windows comprising the core 9 bp (VEGFA) and 12 bp (CCR5) zinc finger recognition sites as well as the adjacent nucleotides for each ZFN half-site. Sequences shown to cleave efficiently in vitro were considered active. The preselection library sequences minus those seen in the active set were considered not efficiently cleaved and labeled as the inactive class. Duplicate entries were removed unless they were identified as independent cleavage events either by experiment or sequence variation in the spacer. Several classifiers including SVMs, decision trees and Naı¨ve Bayes were tested in 10-fold cross-validation analyses using WEKA v3.5.7 (13). Naı¨ve Bayes performed as well or better than the rest of the classifiers and was used exclusively going forward in this study. The test set was built from human genome build HG36 was parsed into similar windows using spacers of 5 and 6 nt separating the two zinc finger half-sites.

Validation of putative cleavage sites
Individual windows are scored 0 to 1 with windows of lower scores representing sequences that are more likely to be cleaved by the CCR5 ZFNs. K562 cells were treated with catalytically active CCR5-224 ZFNs or a vector-only control, genomic DNA was harvested, and deep sequencing was used to analyze loci of interest as described in Pattanayak et al. (8), with the exception that sequencing was carried out for each paired-end library with a 150-cycle MiSeq run (Illumina; Harvard Biopolymers Facility, Boston, MA for CCR5 samples and Dana Farber Cancer Institute for VEGFA samples). Oligonucleotides used to amplify genomic loci of interest are listed in Supplementary Table S10.
Data processing to identify putative mutagenic NHEJ events Individual reads were mapped using primer sequences to the individual amplicons and aligned using the Needleman-Wunsch algorithm with affine gap penalties (14). Alignments with <40 bp (minimum combined length of primer) to the reference were excluded, and targets with <1500 reads in either the treated or untreated samples were excluded. Individual alignments were combined to generate a multiple sequence alignment. Identical alignments were counted, condensed and verified to map HG37.57 using BLAT (http://www. kentinformatics.com/). Sequences that mapped preferentially to an alternate target were excluded. Potential NHEJ events required indels of at least 2 nt in length that originated from within the spacer between the ZFN halfsites.

RESULTS
We sought to improve the original strategy of Liu and colleagues by addressing its inability to interrogate cleavage site libraries in vitro to a depth sufficient to identify all possible off-target sites present in the human genome. To do this, we added a machine-learning-based step that uses cleavage site preferences from the in vitro selection experiments to predict what sequences in the human genome are most likely to be cleaved ( Figure 1). We used standard machine-learning techniques to construct Naı¨ve Bayes classifiers that quantify how the nucleotide identity at each position within a DNA site differs between members of a partially degenerate library that were cleaved efficiently in vitro and those that were not ('Materials and Methods' section). The scores generated by each classifier range from 0 to 1, with lower scores representing a higher probability that any given site will be cleaved ('Materials and Methods' section).
We performed an initial test of our approach by developing a classifier based on in vitro site selection data previously obtained for ZFNs targeted to a site in the human CCR5 gene. As shown in Supplementary  Table S1, application of this CCR5 ZFN classifier to the human genome resulted in the overwhelming majority of potential target sites having a high classifier score: 11 421 321 184 of 11 421 337 066 potential sites (99.999861%) received a score higher than 0.75. By contrast, only 15 882 sites (0.000139% of all potential sites) had a score lower than 0.75, and only 1123 sites (0.00000983% of all potential sites) had a score below 0.5. Importantly, all 12 bona fide off-target sites identified previously by the in vitro cleavage site selection, and the IDLV integration methods had scores below 0.75. In addition, 11 of these 12 sites fall within the top 25% of sites with scores below 0.75 (Supplementary Table S2) (8)(9)(10).
Having established classifier score cutoffs that enable identification of all previously known off-target sites for the CCR5-targeted ZFNs, we next prospectively tested whether other sites with scores below 0.75 might include additional bona fide off-target sites. However, a comprehensive analysis of all sites with scores below 0.75 would require deep sequencing of 15 882 different alleles, an experiment that would be challenging and expensive to perform, given the current cost of next-generation sequencing. Therefore, we instead systematically assessed a smaller sampling of sites by first grouping them based on their position in exonic or non-exonic genomic sequence and then binning sites within each of these groups according to their classifier scores (i.e.-0.0 to 0.1, 0.1 to 0.2, etc.). To achieve high levels of nuclease activity that would facilitate detection of lower frequency off-target events, we used conditions described by Liu and colleagues to overexpress CCR5-targeted ZFNs in K562 cells ('Materials and Methods' section). We then used deep sequencing to assess the top 13 scoring sites (if available) within each bin for evidence of NHEJmediated indel mutations in the genomic DNA of these cells.
Analysis of 138 sites identified NHEJ-mediated indel mutations not only at the intended CCR5 target site and at a previously known off-target site in the CCR2 gene but also at 21 new off-target sites ( Table 1). As expected, the percentage of bona fide off-target sites found within each classifier score bin was inversely correlated with the magnitude of the score (i.e.-a greater percentage of actual off-target sites were identified in the lower score bins).
For example, 35% (16 of 46) of the screened targets with scores in the first tercile (lowest scores) showed significant evidence of NHEJ-mediated indel mutations compared with 13% (6 of 46) and 2% (1 of 46) of sites with scores in the second and third terciles, respectively (Supplementary Table S3).
To test the generalizability of our classifier-based approach, we used it to predict off-target sites for another pair of ZFNs targeted to the human VEGFA locus (Supplementary Table S4). Previous work using the in vitro cleavage site selection assay had identified a large number of potential off-target sites for this ZFN pair in human cells (Supplementary Table S5) (8). We used this selection data to build a classifier that we used to score every possible site in the human genome ('Materials and Methods' section). As we observed with the CCR5 classifier, only a small number (7242) of genomic sites had a classifier score below 0.75, and only 936 sites had a score below 0.5. In addition, all 31 bona fide off-target sites identified previously with the in vitro selection data all had scores below 0.6, with all but one of these sites having scores below 0.5 (Supplementary Table S6). We assessed 159 potential off-target sites (identified using the same stratified sampling approach we used for the CCR5 ZFNs) for evidence of off-target mutations from genomic DNA of human cells in which the VEGFAtargeted nucleases had been expressed. This systematic stratified analysis identified 34 bona fide off-target sites, including eight that were previously identified by Pattanayak et al. (8) and 26 that were novel (Table 2). We note that that the majority of these novel off-target sites had low classifier scores, again demonstrating the predictive capability of our method (Table 2). Furthermore, several of the sites we predicted to be off-target sites that did not show a statistically significant level of NHEJ mutations in this study had been previously confirmed as off-targets when screened with a greater depth of sequencing reads by Pattanayak et al. (8), suggesting that a greater number of the predicted off-target sites might show evidence of mutation with deeper sequence sampling.

DISCUSSION
Our results show in silico abstraction of in vitro cleavage data provides a strategy that more broadly identifies all potential off-target sites of ZFN activity in human cells than previously described methods. Our classifier-based approach not only successfully re-identified all previously known off-target sites for two different ZFNs but also enabled the identification of many additional novel offtarget sites, including some that differ from the target sequence by as many as 8 (of 24) or 6 (of 18) bp for the CCR5-or VEGFA-targeted ZFNs, respectively (Tables 3  and 4) (8)(9)(10). Because sequences harboring these numbers of mismatches will occur frequently in the human genome, identifying these off-targets would have been previously intractable by simple mismatch counting approaches; indeed, using such strategies would require screening hundreds of thousands of potential sites (Supplementary  Tables S7 and S8).
Importantly, because we only assessed a small sampling of the top scoring potential off-target sites in cells, we believe that the full range of potential off-target sites for the two ZFN pairs we examined is likely more expansive than just those identified in this study. This expectation is supported by another experimental screen (data not shown) that identified six additional bona fide offtarget sites for the CCR5-targeted ZFNs (Supplementary  Table S9 and Supplementary Discussion). Collectively, these results clearly demonstrate that ZFN off-target sites may occur at low rates much more widely on a genome-wide scale than suggested by data from previously described reports.
Although our data clearly demonstrate that sites with low classifier scores are highly enriched for bona fide offtarget sites, our results also show that bona fide offtargets are present (albeit at a much lower frequency) among loci with higher classifier scores. This suggests that comprehensive identification of off-target sites will require interrogation of a large number of loci by deep sequencing. We expect that decreases in the price per base and increases in the number of bases that can be sequenced should increase the number of potential sites with low classifier scores that can be examined, thereby enabling the identification of a greater number of bona fide off-target sites. However, until such reductions in sequencing costs become reality, an alternative approach might be to look at off-targets with the best scores or to pre-screen off-targets bioinformatically for sites that fall in regions of high priority such as promoters, exons and non-coding RNAs. We note that the number of off-target sites identified by our approach may be larger or smaller depending on the cell type examined as well as the level and duration of ZFN expression. Not all of the sites with low classifier scores we examined showed evidence of mutagenesis.
Potential reasons for this might include DNA methylation of the target site or chromatin status of the gene. These parameters will be cell-type specific and would not be accounted for by in vitro selections or in silico classifiers. As large-scale efforts such as ENCODE and the NIH Roadmap Epigenomics Mapping Project define these variables in multiple different cell types, it may be possible to use such information to prioritize sites with low classifier scores and thereby to increase the yield of bona fide off-target sites identified by deep sequencing. In addition, we expressed ZFNs from a strong constitutive CMV promoter using transiently transfected plasmids and harvested genomic DNA from cells 5 days post-transfection. Lower levels and shorter durations of ZFN expression might be expected to induce fewer off-target mutations, whereas higher levels and longer durations might induce an even greater number of such mutations. More broadly, the combined strategy of using in vitro cleavage site selection data together with machine-learning-based classifiers might also be extended to specificity information from other sources (e.g.-SELEX or bacterial selection) and to define the specificities of nucleases built on other platforms (e.g.-TALENs or CRISPR-Cas RNA-guided nucleases). The use of machine learning to improve the predictive power of data derived from in vitro selection experiments could be particularly useful for ZFNs composed of greater numbers of fingers in each monomer and for TALENs. These nucleases target longer sites, making it challenging to adequately sample all potential off-targets even in an in vitro system. Continuing to better define off-target effects of targeted nucleases will provide important information to guide refinement of the genome-wide specificities of these reagents. These improvements will be critically important, as these targeted nucleases are more widely applied for both research and therapeutic approaches.