TALENoffer: genome-wide TALEN off-target prediction

Transcription activator-like effector nucleases (TALENs) have become an accepted tool for targeted mutagenesis, but undesired off-targets remain an important issue. We present TALENoffer, a novel tool for the genome-wide prediction of TALEN off-targets. We show that TALENoffer successfully predicts known off-targets of engineered TALENs and yields a competitive runtime, scanning complete mammalian genomes within a few minutes. Availability: TALENoffer is available as a command line program from


INTRODUCTION
The DNA binding domain of transcription activator-like (TAL) effectors is composed of highly conserved tandem repeats, where amino acids 12 and 13 of a repeat [repeat-variable diresidue (RVD)] determine DNA binding specificity. Each repeat binds to 1 bp of the DNA in a contiguous, non-overlapping fashion (Boch et al., 2009; Moscou and Bogdanove, 2009). Recently, we developed TALgetter, a tool for predicting putative targets of natural TAL effectors, and we showed that this tool achieves an improved prediction accuracy compared with previous approaches (Grau et al., 2013).
The DNA binding domain of TAL effectors can be fused with a FokI endonuclease domain to yield TAL effector nucleases (TALENs), where homo- or heterodimers of TALENs need to bind to opposite strands of the DNA in 5′-3′ orientation and within a restricted distance range to specifically cut the DNA double strand. TALENs have been established as a second genome-editing technique besides zinc-finger nucleases (Gaj et al., 2013; Miller et al., 2011). Although the binding of TALENs is highly specific, undesired off-targets in addition to the targeted genomic region remain an important issue (Hockemeyer et al., 2011; Mussolino et al., 2011; Osborn et al., 2013; Tesson et al., 2011) that may cause severe side effects. Hence, tools for the computational prediction of TALEN off-targets have been developed, namely, idTALE (http://idtale.kaust.edu.sa) and Paired Target Finder (PTF) (https://tale-nt.cac.cornell.edu; Doyle et al., 2012). Here, we only consider PTF because the 'Search for TALEN target' application of idTALE is not applicable to custom input data.
In this article, we present TALENoffer, an alternative tool for predicting TALEN off-targets. TALENoffer applies the statistical model of TALgetter to the more complex problem of TALEN off-target prediction. This requires novel methods for ranking off-targets and accelerated scanning approaches to achieve acceptable runtimes, which are explained in the following sections.

Statistical model
The statistical model of TALENoffer assumes that the probability of a nucleotide of a target site depends on the RVD of the corresponding repeat. In addition, it reflects that different RVDs contribute differently to the activity of TAL effector constructs (Streubel et al., 2012) (details in Supplementary Methods). Given an RVD sequence $y = y_1, \ldots, y_L$ and model parameters $\lambda$, this model assigns each putative monomer target site $x = x_0, \ldots, x_L$ a likelihood $P(x \mid y, \lambda)$. Based on the likelihood, we define a relative score $s(x \mid y, \lambda) := \frac{1}{L+1} \log P(x \mid y, \lambda)$.
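The relative score can be illustrated with a minimal sketch. It assumes a simplified model in which positions are independent and each RVD has a fixed nucleotide distribution; the probability values, the position-0 distribution and the function name `relative_score` are hypothetical and are not the fitted TALENoffer parameters. The RVD-specific contribution to activity mentioned above is omitted.

```python
import math

# Hypothetical RVD-to-nucleotide probabilities (illustrative values only,
# not the fitted TALgetter/TALENoffer parameters).
RVD_PROBS = {
    "NI": {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    "HD": {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    "NN": {"A": 0.30, "C": 0.05, "G": 0.60, "T": 0.05},
    "NG": {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
}
# Assumed distribution for position 0, which precedes the first repeat.
POS0_PROBS = {"A": 0.05, "C": 0.15, "G": 0.05, "T": 0.75}


def relative_score(site, rvds):
    """Return log P(x|y) / (L + 1) for a site x_0..x_L and RVDs y_1..y_L."""
    if len(site) != len(rvds) + 1:
        raise ValueError("site must have one more position than the RVD sequence")
    # Position 0 is scored with its own distribution.
    log_p = math.log(POS0_PROBS[site[0]])
    # Positions 1..L are scored by the RVD of the corresponding repeat.
    for nucleotide, rvd in zip(site[1:], rvds):
        log_p += math.log(RVD_PROBS[rvd][nucleotide])
    return log_p / (len(rvds) + 1)
```

Dividing by $L + 1$ turns the log-likelihood into an average per position, which is what makes scores of monomers with different numbers of repeats comparable.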

Ranking and filtering off-targets
Given two TALEN monomers with RVD sequences $y_1$ and $y_2$ of length $L_1$ and $L_2$, respectively, a distance $d$ between TALEN monomers and a putative off-target site $x = x_0, \ldots, x_{L_1+d+L_2+1}$, we first determine the relative scores $s_1 = s(x_0, \ldots, x_{L_1} \mid y_1, \lambda)$ and $s_2 = s(x^c_{L_1+d+L_2+1}, \ldots, x^c_{L_1+d+1} \mid y_2, \lambda)$ of the two monomer target sites, where $x^c$ denotes the complement of nucleotide $x$. We define the score $s$ of the complete off-target site $x$ as the sum $s = s_1 + s_2$ of the two relative scores. This scoring scheme allows for ranking off-target sites, given TALEN monomers of different lengths, although typically $L_1 = L_2$.
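The index arithmetic of the pair score can be sketched as follows. The first monomer is scored on the first $L_1 + 1$ positions of the site, and the second monomer on the reverse complement of the last $L_2 + 1$ positions. The helper `match_scorer` is a hypothetical stand-in for the relative score of a monomer, using made-up match and mismatch probabilities; only the slicing and strand handling are the point of this sketch.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Read a sequence on the opposite strand in 5'-3' orientation."""
    return "".join(COMPLEMENT[n] for n in reversed(seq))


def match_scorer(expected, p_match=0.9):
    """Hypothetical stand-in for a monomer's relative score (toy model)."""
    p_mismatch = (1.0 - p_match) / 3.0

    def score(site):
        logs = [math.log(p_match if a == b else p_mismatch)
                for a, b in zip(site, expected)]
        return sum(logs) / len(expected)

    return score


def pair_score(site, len1, len2, score1, score2):
    """Return s = s1 + s2 for a site x_0..x_{L1+d+L2+1}."""
    d = len(site) - (len1 + 1) - (len2 + 1)
    if d < 0:
        raise ValueError("site is too short for both monomer target sites")
    s1 = score1(site[:len1 + 1])                          # x_0 .. x_{L1}
    s2 = score2(reverse_complement(site[-(len2 + 1):]))   # x^c_{end} .. x^c_{L1+d+1}
    return s1 + s2
```

For example, with a first monomer expecting `TACG` and a second monomer expecting `TGGC` on the opposite strand, the site `TACG` + `ACGT` + `GCCA` (spacer of 4 bp) is a perfect match for both monomers.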
We report off-targets yielding a score $s$ that exceeds a threshold $t = s^* + 2\log(q)$, where $s^*$ denotes the score of the best-matching theoretical off-target site for the current pair of TALEN monomers. Parameter $q$ specifies that the average likelihood over all positions of the off-target site shall be at least $q \cdot 100\%$ of the average likelihood over all positions of the best-matching site. We additionally require each monomer score $s_i$ to exceed a threshold $t_i = s_i^* + \log(q \cdot 0.9)$, where $s_i^*$ denotes the best monomer score of $y_i$. This allows for only mild compensation between the two monomers of a common off-target site. In the TALENoffer application, users may select pre-defined or enter custom values of $q$ (see also Supplementary Fig. S2). In addition, we limit the ranked sites that are reported to a user-specified number. We consider homo- and heterodimers of the TALEN monomers and both DNA strands because all can lead to off-target effects.
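The interplay of the pair threshold and the two monomer thresholds can be made concrete with a short sketch. It assumes that the best pair score is the sum of the two best monomer scores, $s^* = s_1^* + s_2^*$, and that "exceeds" is meant inclusively; the function names are hypothetical.

```python
import math


def thresholds(best_s1, best_s2, q):
    """Return (t, t1, t2) for best monomer scores s1*, s2* and parameter q."""
    t_pair = best_s1 + best_s2 + 2.0 * math.log(q)   # t = s* + 2 log(q)
    t1 = best_s1 + math.log(0.9 * q)                 # t_1 = s_1* + log(0.9 q)
    t2 = best_s2 + math.log(0.9 * q)                 # t_2 = s_2* + log(0.9 q)
    return t_pair, t1, t2


def passes_filter(s1, s2, best_s1, best_s2, q):
    """Check both monomer thresholds and the threshold on the pair score."""
    t_pair, t1, t2 = thresholds(best_s1, best_s2, q)
    return s1 >= t1 and s2 >= t2 and (s1 + s2) >= t_pair
```

Because $t_1 + t_2 = s^* + 2\log(0.9\,q) < t$, a site whose monomers both sit at their individual limits is rejected by the pair threshold. One monomer may therefore fall slightly below the average requirement only if the other one compensates, which is the mild compensation described above.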

Runtime optimization
Using a naive scanning approach for predicting TALEN off-targets, we would shift a sliding window of width $L_1 + 1$ along the DNA sequence and compute the score $s_1$, given the first TALEN monomer $y_1$, within this window. Whenever we find a hit, i.e. $s_1 \geq t_1$, we also scan the reverse complement of the downstream sequence for sufficiently good hits, given the second monomer $y_2$, within the user-specified distance range.
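A minimal sketch of this naive scan is given below. It covers one strand and one monomer order only, assumes sequences over A, C, G and T, and takes the monomer scoring functions as arguments; `match_scorer` is a hypothetical toy scorer and all names are illustrative rather than part of TALENoffer.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[n] for n in reversed(seq))


def match_scorer(expected, p_match=0.9):
    """Hypothetical stand-in for a monomer's relative score (toy model)."""
    p_mismatch = (1.0 - p_match) / 3.0
    return lambda site: sum(
        math.log(p_match if a == b else p_mismatch)
        for a, b in zip(site, expected)) / len(expected)


def naive_scan(seq, len1, len2, score1, score2, t1, t2, t_pair, d_min, d_max):
    """Return (start, distance, score) hits sorted by decreasing score."""
    w1, w2 = len1 + 1, len2 + 1
    hits = []
    for start in range(len(seq) - w1 + 1):
        s1 = score1(seq[start:start + w1])
        if s1 < t1:
            continue  # first monomer does not bind well enough here
        for d in range(d_min, d_max + 1):
            begin2 = start + w1 + d
            end2 = begin2 + w2
            if end2 > len(seq):
                break  # second site would run past the sequence end
            # The second monomer binds the opposite strand.
            s2 = score2(reverse_complement(seq[begin2:end2]))
            if s2 >= t2 and s1 + s2 >= t_pair:
                hits.append((start, d, s1 + s2))
    return sorted(hits, key=lambda hit: -hit[2])
```

Every window is scored in full for the first monomer, so the cost grows with sequence length times monomer length, which motivates the lookup-table strategy described next.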
The scanning approach of TALENoffer is based on this naive approach but uses a speed-up strategy, which is illustrated in Figure 1 for one TALEN monomer. Given the TALEN monomer, we compute a partial score for each possible 8-mer prefix and store it in lookup table 1, together with the information whether a target site with this prefix may yield a sufficiently large total score $s_i$ (details in Supplementary Methods). For lookup table 2, we proceed in complete analogy using partially overlapping 8-mer infixes. Both lookup tables can be accessed efficiently, given the prefix and infix of a putative monomer target site. Scanning input sequences for off-target sites, we test whether both lookup tables indicate that the putative target site under the sliding window might exceed threshold $t_1$. If this is the case, we only need to compute the remaining score for the nucleotides beyond lookup table 2 to yield the total score $s_1$. If $s_1$ exceeds the threshold, we apply the same strategy for putative target sites of the second monomer on the opposite strand within the user-specified distance range.
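The pruning idea can be sketched for a single monomer as follows. This is a simplified illustration, not the TALENoffer implementation: it assumes independent positions, so the second table covers the k-mer directly after the prefix instead of a partially overlapping infix, and it assumes sequences over A, C, G and T. The helper `toy_log_probs`, the parameter `k` and all function names are hypothetical.

```python
import itertools
import math


def toy_log_probs(expected, p_match=0.9):
    """Hypothetical per-position log-probabilities for a site x_0..x_L."""
    p_mis = (1.0 - p_match) / 3.0
    return [{n: math.log(p_match if n == e else p_mis) for n in "ACGT"}
            for e in expected]


def build_table(log_probs, start, k, threshold_total):
    """Map each k-mer at positions start..start+k-1 to (partial, may_pass)."""
    # Best achievable log-likelihood of all positions outside this k-mer.
    best_elsewhere = sum(max(p.values()) for i, p in enumerate(log_probs)
                         if not start <= i < start + k)
    table = {}
    for kmer in itertools.product("ACGT", repeat=k):
        partial = sum(log_probs[start + i][n] for i, n in enumerate(kmer))
        # Small tolerance so rounding never prunes a borderline hit.
        may_pass = partial + best_elsewhere >= threshold_total - 1e-9
        table["".join(kmer)] = (partial, may_pass)
    return table


def scan_with_tables(seq, log_probs, k, t):
    """Return (start, relative score) for windows with score >= t."""
    width = len(log_probs)
    if width < 2 * k:
        raise ValueError("site must span at least 2 * k positions")
    threshold_total = t * width  # threshold on the unnormalised log-likelihood
    table1 = build_table(log_probs, 0, k, threshold_total)
    table2 = build_table(log_probs, k, k, threshold_total)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        part1, ok1 = table1[window[:k]]
        part2, ok2 = table2[window[k:2 * k]]
        if not (ok1 and ok2):
            continue  # this window cannot reach the threshold
        # Only now score the positions beyond the second table.
        rest = sum(log_probs[i][window[i]] for i in range(2 * k, width))
        score = (part1 + part2 + rest) / width
        if score >= t:
            hits.append((start, score))
    return hits
```

The flag stored with each k-mer is an upper-bound test: if even the best possible completion of the site cannot reach the threshold, the window is skipped after two table lookups. The pruning therefore never discards a window that the naive scan would report.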
In addition to this speed-up strategy, TALENoffer is multithreaded to allow for simultaneously loading, parsing and scanning input data.

Finding known off-targets
For evaluating predictions, we use TALENs and reported off-targets from several recent studies (details in Supplementary Methods, Supplementary Table S2, Supplementary Fig. S4). For all of these datasets, the intended TALEN target is reported at rank 1 by TALENoffer and PTF. However, we observe differences between the two tools for the predicted off-targets. Tesson et al. (2011) designed a TALEN pair for targeting IgM in Rattus norvegicus. The off-target reported by Tesson et al. (2011) is predicted by TALENoffer at rank 2 and by PTF at rank 6. Mussolino et al. (2011) targeted CCR5 in human. The off-target CCR2 is reported only by TALENoffer (rank 2) because of an atypical A at position 0 of one monomer target site, which is not allowed by PTF. Hockemeyer et al. (2011) targeted PPP1R12C in human and reported two off-targets, which are both predicted by TALENoffer (ranks 47 and 195), whereas PTF predicts only one of these off-targets (rank 159). Osborn et al. (2013) targeted human COL7A1 and reported three off-targets. Off-target GGT1 is reported by TALENoffer at rank 106 and by PTF at rank 72. PRMT2 is reported by TALENoffer (rank 3) but not by PTF. The third off-target is reported by neither approach because of its large number of mismatch positions (11).

Runtime comparison
We compare the runtime of TALENoffer with runtime optimization with that of PTF in Table 1 for example datasets of different sizes. We find that PTF requires 3.0-5.8 times the runtime of TALENoffer on the same input datasets. Considering memory consumption, PTF consistently allocates less memory than TALENoffer. However, for all input datasets, TALENoffer requires at most 2 GB of memory, which allows for execution on current standard computers (details in Supplementary Table S1).

CONCLUSION
We present TALENoffer, a novel tool for the genome-wide prediction of TALEN targets and off-targets, which successfully predicts known off-targets of engineered TALENs and yields a competitive runtime. TALENoffer is implemented using the open-source Java library Jstacs (Grau et al., 2012) and is available as a command line program and as a Galaxy (Blankenberg et al., 2010) web application, which can also be installed to a local Galaxy server.

Table 1. [Table body not recovered; surviving runtime entries: 17 min 57 s, 3 min 5 s.] Note: As an example TALEN, we use two TALEN monomers with 15 repeats (details in Supplementary Methods) with a distance of 12-24 bp between monomer target sites and with multithreading enabled. All values are measured on a standard laptop (Intel Core i7, ULV, dual core 2 GHz).

Fig. 1. Speed-up strategy of TALENoffer. Two lookup tables of partial scores are represented by boxes, where the light shaded part of lookup table 2 serves as condition for computing the partial likelihood of the dark shaded part. The rightmost part of the putative off-target site only needs to be considered if both lookup tables indicate a score above the threshold.