Nuclease-mediated gene editing by homologous recombination of the human globin locus

TAL-effector nucleases (TALENs) are engineered proteins that can stimulate precise genome editing through specific DNA double-strand breaks. Sickle cell disease and β-thalassemia are common genetic disorders caused by mutations in β-globin, and we engineered a pair of highly active TALENs that induce modification of 54% of human β-globin alleles near the site of the sickle mutation. These TALENs stimulate targeted integration of therapeutic, full-length β-globin cDNA to the endogenous β-globin locus in 19% of cells prior to selection as quantified by single molecule real-time sequencing. We also developed highly active TALENs to human γ-globin, a pharmacologic target in sickle cell disease therapy. Using the β-globin and γ-globin TALENs, we generated cell lines that express GFP under the control of the endogenous β-globin promoter and tdTomato under the control of the endogenous γ-globin promoter. With these fluorescent reporter cell lines, we screened a library of small molecule compounds for their differential effect on the transcriptional activity of the endogenous β- and γ-globin genes and identified several that preferentially upregulate γ-globin expression.


INTRODUCTION
Sickle cell disease is the most common monogenic disease worldwide and is caused by a single point mutation in the β-globin gene. Painful clinical symptoms begin shortly after birth as mutated β-globin subunits replace non-defective γ-globin chains in the predominant form of hemoglobin. Current pharmacological treatment with hydroxyurea partially reverses this globin switching by increasing the production of γ-globin (1,2). This has led to broad interest in developing other compounds and discovering new mechanisms that preferentially upregulate γ-globin (2)(3)(4)(5), and also in developing methods to study globin regulation (6,7). Analyses of differential expression of β- and γ-globin generally have been limited to hemoglobin electrophoresis or qRT-PCR, but recent reports have described a method of using the expression of fluorescent molecules driven by the β- and γ-globin promoters as a readout of differential globin regulation. In those studies, the authors integrated into the genome a bacterial artificial chromosome (BAC) containing the entire 200 kb β-globin locus (which includes both β-globin and γ-globin among other genes), modified such that the β- and γ-globin promoters drive expression of fluorescent proteins (6,7). The integration of the complete genomic locus presumably maintains much of the physiologically relevant regulation of expression, but it does not allow for the direct analysis of the endogenous locus and is confounded by the fact that integration is in a random genomic location and that some cells gain multiple copies of the BAC. In addition, a BAC-based strategy creates a system in which the globin locus is triploid rather than diploid and this change may also affect the regulatory dynamics. Alternatively, direct modification of the endogenous β- and γ-globin loci eliminates those confounding variables.
Endogenous genomic loci can be precisely altered using engineered zinc finger nucleases (ZFNs) (8)(9)(10)(11) and TAL-effector nucleases (TALENs) (12)(13)(14). ZFNs and TALENs are composed of a specifically engineered DNA binding domain fused to the FokI endonuclease domain. Binding of a pair of ZFNs or TALENs to contiguous sites leads to the dimerization of the FokI domain, resulting in a targeted DNA double-strand break. Repair of the break can proceed by mutagenic non-homologous end joining or by high-fidelity homologous recombination with a homologous DNA donor template. Compared to ZFNs, TALENs seem to cause lower levels of cytotoxicity (15). The TALEN recognition domain is characterized by repeated arrays of 34 amino acids that are conserved except in positions 12 and 13. These two amino acids comprise the repeat variable domain (RVD), which contacts the DNA and provides the nucleotide recognition specificity of each repeat (16,17). Unlike the other DNA bases, which each show a strong preference for a single RVD, guanine can be recognized by at least two RVDs with different binding characteristics. The asparagine-asparagine (NN) RVD can form a high-affinity hydrogen bond with guanine, but is not specific because it can also hydrogen bond with adenine (18,19). Conversely, the asparagine-lysine (NK) RVD seems to be more specific for guanine (13) but is less commonly found in naturally occurring TAL-effector proteins (17).
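To make the one-RVD-per-base recognition code concrete, the following Python sketch translates a DNA target into an RVD array. The choice of NN or NK for guanine follows the description above; the NI, HD and NG assignments for adenine, cytosine and thymine are the commonly cited pairings and, like the function name and the example sequence, are assumptions added for illustration rather than details of the TALENs built in this study.

```python
# Commonly cited RVD-to-base pairings. Assumption: these three pairings are
# not stated in the text, which only discusses NN and NK for guanine.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "T": "NG"}


def rvd_array(target, guanine_rvd="NN"):
    """Return one RVD per base of `target`, read 5' to 3'."""
    if guanine_rvd not in ("NN", "NK"):
        raise ValueError("guanine_rvd must be 'NN' or 'NK'")
    rvds = []
    for base in target.upper():
        if base == "G":
            # Guanine can be read by NN (higher affinity, also binds A)
            # or by NK (reported as more specific).
            rvds.append(guanine_rvd)
        elif base in RVD_FOR_BASE:
            rvds.append(RVD_FOR_BASE[base])
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return rvds


# Hypothetical 6 bp site, not one of the binding sites from this study.
print(rvd_array("GATTCG", guanine_rvd="NK"))
```

Switching `guanine_rvd` between "NN" and "NK" changes only the repeats that read guanine, which mirrors how an NK TALEN can be re-built as an NN version against the same target.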
Recent reports have described the development and use of β-globin ZFNs to correct the sickle mutation in human iPS cells. The low rates of confirmed targeting described in these studies (1/300 (20) and 28/286 (21) drug resistant clones were targeted) could be increased by improving the efficiency and toxicity profile of the engineered nucleases. Here, we used highly active and minimally toxic β-globin TALENs to stimulate homologous recombination of therapeutic β-globin cDNA to the endogenous β-globin locus in 19% of cells prior to selection. To analyse the efficiency of both the cutting by the TALENs and the rate of targeted integration, we employed a rapid, accurate and economical deep sequencing method known as single molecule real time (SMRT) sequencing (22). We then describe a new method to generate reporter cells that express fluorescent proteins from endogenous genomic promoters. By using TALENs to target a promoterless GFP in-frame to the endogenous β-globin ATG start site and a promoterless tdTomato in-frame to the endogenous γ-globin ATG start site, we generated a robust endogenous reporter in the context of a common genetic disease. Finally, because γ-globin upregulation is therapeutic in sickle cell disease, we used these fluorescent reporter cells to screen small molecule compounds that preferentially upregulate γ-globin expression compared to β-globin.

Nuclease and targeting vector construction
β-globin NK TALENs were synthesized (Genscript) using the Δ152 N-terminal domain and the +63 C-terminal domain previously described (13), fused to the FokI nuclease domain and cloned into pcDNA3.1 (Invitrogen). β-globin NN TALENs and γ-globin NN TALENs were synthesized using a Golden Gate cloning strategy (23) and cloned with the same N- and C-termini and nuclease domain into pcDNA3.1. The β-globin ZFNs were synthesized using the B2H selection strategy previously described (24). The β-Ubc-GFP targeting vector was synthesized by PCR amplifying arms of homology from genomic DNA isolated from K562 cells using the primers in Supplementary Figure S10 and cloning a Ubc-GFP expression cassette in between the arms. The β-in-frame-cDNA and β-in-frame-GFP targeting vectors were synthesized by overlap PCR to insert β-globin cDNA (OriGene) or GFP directly in-frame to the β-globin ATG start codon using the primers in Supplementary Figure S10. Silent mutations were introduced into the β-globin cDNA sequence at every sixth base pair between the nuclease cut site and the end of exon 1. The MGMT P140K drug selection cassette (generous gift from Dr Stan Gerson) was cloned into the targeting vector inside the arms of homology. The γ-in-frame-tdTomato targeting vector was generated by genomic PCR of the 5′ and 3′ arms of homology using primers in Supplementary Figure S10. TdTomato was fused in-frame to the γ-globin ATG start codon by overlap PCR. A neomycin phosphotransferase cassette was cloned in between the arms of homology.

In vitro transcription of nucleases
TALEN and ZFN mRNA was synthesized in vitro with the MEGAscript T7 kit (Ambion), polyadenylated in vitro with the poly(A) tailing kit (Ambion) and purified with the MEGAclear kit (Ambion) following the manufacturer's protocols. Two versions of mRNA were synthesized, using unmodified nucleotides or using pseudouridine-5′-triphosphate (Trilink) in place of UTP (25).

SSA and toxicity assays
A single-strand annealing (SSA) reporter was generated by disrupting the GFP gene by duplicating an internal 42 bp region and separating the duplicated region with a 72 bp fragment from the β-globin region containing the nuclease recognition sites. The SSA reporter and each nuclease were transfected by calcium phosphate into HEK293T cells and analysed on an Accuri C6 flow cytometer (Accuri) after 2 days. The toxicity assay was performed as previously described (24). Briefly, HEK293T cells were co-transfected by calcium phosphate with a pair of nucleases and a GFP expression plasmid. The cells were analysed by FACS for percent GFP positive on day 2 and day 6. The day 2/day 6 ratio was normalized to a non-toxic nuclease sample.
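The normalization step of the toxicity assay can be sketched in a few lines of Python. This is only one reading of the description above: it assumes that "normalized" means dividing the sample's day 2/day 6 ratio by the same ratio from the non-toxic control, and the function names and the example percentages are hypothetical.

```python
def toxicity_ratio(pct_gfp_day2, pct_gfp_day6):
    """Day 2 / day 6 ratio of percent GFP-positive cells."""
    if pct_gfp_day6 <= 0:
        raise ValueError("day 6 percentage must be positive")
    return pct_gfp_day2 / pct_gfp_day6


def normalized_toxicity(sample, control):
    """Sample ratio divided by the ratio of a non-toxic control.

    `sample` and `control` are (day 2, day 6) percent GFP-positive pairs.
    Assumption: normalization is a simple division by the control ratio.
    """
    return toxicity_ratio(*sample) / toxicity_ratio(*control)


# Hypothetical values: the nuclease sample loses half of its GFP-positive
# cells between day 2 and day 6, while the control is unchanged.
print(normalized_toxicity(sample=(40.0, 20.0), control=(40.0, 40.0)))
```

Under this reading, a value near 1 means that GFP-positive cells persist as well as in the control, and larger values indicate a greater loss of transfected cells.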

Surveyor nuclease assay
The Surveyor nuclease assay was performed as previously described (26). Briefly, 6 × 10⁵ HEK293T cells were lipofected with 1.5 µg of each nuclease or 10⁶ K562 cells were nucleofected with 2.5 µg of each nuclease unless otherwise indicated. After 3 days, genomic DNA was isolated using the DNeasy kit (Qiagen) and the locus of interest was PCR amplified with Accuprime polymerase (Invitrogen) using the primers in Supplementary Figure S10. 200 ng of the PCR product was treated with the Surveyor nuclease (Transgenomic) following the manufacturer's protocol. HEK293T cells were used to characterize the β-globin nucleases because of the presence of a naturally occurring SNP in K562s.

SMRT sequencing and cDNA targeting
PCR products prior to cutting by the Surveyor nuclease were prepared for SMRT sequencing following the manufacturer's protocol (Pacific Biosciences). For the SMRT sequencing of the β-globin cDNA targeting events, 10⁶ K562 cells were nucleofected with 10 µg β-in-frame-cDNA and 1 µg each of βL4 and βR4 TALENs. Aliquots were removed after 3 days, when the first round of selection was begun by adding 50 µM O6BG (Sigma) for 1 hour and then adding 40 µM BCNU (Sigma) for 1 hour before changing the media. Cells were allowed to recover for 7-10 days, at which time another aliquot was harvested and another round of selection started. Genomic DNA was isolated (Qiagen) and the β-globin region was PCR amplified using primers in Supplementary Figure S10, which did not amplify random integrants. Primers with unique 3 bp tags were used in the PCR reactions from each time point, such that the samples could be combined and analysed in one SMRT sequencing reaction. Data were analysed using CLC Genomics Workbench software.
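The pooling of time points by 3 bp primer tags implies a demultiplexing step before analysis. The sketch below shows the idea in Python; it is not the CLC Genomics Workbench workflow used here, and the tag sequences, sample labels and function name are hypothetical because the actual tags are not given in the text.

```python
from collections import defaultdict

# Hypothetical 3 bp tags and sample labels; the real tags are not listed.
TAGS = {"ACG": "day 3", "TGA": "pulse 1", "CTT": "pulse 2"}


def demultiplex(reads, tags=TAGS, tag_len=3):
    """Group reads by their leading tag and strip the tag from each read."""
    bins = defaultdict(list)
    for read in reads:
        # Reads whose first bases match no known tag are kept separately.
        sample = tags.get(read[:tag_len].upper(), "unassigned")
        bins[sample].append(read[tag_len:])
    return dict(bins)


reads = ["ACGTTTT", "TGAAAAA", "GGGCCCC", "acgCC"]
for sample, seqs in demultiplex(reads).items():
    print(sample, seqs)
```

This minimal version only inspects the 5′ end of each read in one orientation; a real pipeline would also need to handle reverse-complemented reads and sequencing errors within the tag.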

Generation of fluorescent reporter cell lines
10⁶ K562 cells were nucleofected with 10 µg of the targeting vector and 1 µg of each TALEN. β-globin-GFP cells were enriched by four rounds of selection with O6BG and BCNU and clones were established by limiting dilution. γ-globin-tdTomato cells were enriched by treatment with 500 µg/ml G418 and clones were established by limiting dilution. Targeting was confirmed by genomic PCR spanning the integration junctions using primers in Supplementary Figure S10.

Quantitative real-time PCR
Clonal populations of β-globin-GFP cells and γ-globin-tdTomato cells that were targeted at one allele were treated for 4 days with 400 µM hydroxyurea, and total mRNA was harvested by Trizol/chloroform extraction and purified on RNeasy columns (Qiagen). 1 µg total RNA was used to synthesize cDNA with the iScript cDNA kit (Bio-Rad) following the manufacturer's protocol. Biological triplicates were each assayed in triplicate by qRT-PCR with SYBR green (Applied Biosystems) on a CFX384 real-time thermocycler (Bio-Rad) using the primers in Supplementary Figure S10 and the following conditions: initial denaturation (3 min at 95 °C), 3-step PCR cycle (10 s at 95 °C, 30 s at 55 °C, 5 s at 65 °C, 40 cycles). PCR efficiency (between 91% and 119%) was calculated using serial dilutions of template for each primer set. mRNA expression was quantified using the 2^(-ΔΔCt) method relative to the housekeeping gene GAPDH.
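For readers unfamiliar with the quantification, the following Python sketch works through the 2^(-ΔΔCt) calculation and the standard conversion of a serial-dilution slope into percent PCR efficiency. The function names and all Ct values are hypothetical and serve only to illustrate the arithmetic; they are not data from this study.

```python
import math


def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCt) method."""
    d_ct_treated = ct_target_treated - ct_ref_treated   # dCt, treated
    d_ct_control = ct_target_control - ct_ref_control   # dCt, untreated
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


def pcr_efficiency(slope):
    """Percent efficiency from the slope of Ct versus log10(template)."""
    return (10.0 ** (-1.0 / slope) - 1.0) * 100.0


# Hypothetical Ct values: the target crosses threshold 3 cycles earlier
# after treatment, while the reference gene (GAPDH here) is unchanged.
print(fold_change_ddct(22.0, 18.0, 25.0, 18.0))

# A slope of -1/log10(2), about -3.32, corresponds to perfect doubling.
print(round(pcr_efficiency(-1.0 / math.log10(2.0)), 1))
```

The 2^(-ΔΔCt) formula assumes that target and reference amplify with roughly 100% efficiency, which is why the efficiency of each primer set is checked against a dilution series.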

Statistical analysis
Data from at least three samples were used to determine significance by statistical analysis. Mean ± SD is reported. Statistical significance was determined by Student's t-test and P-values < 0.05 were considered significant.

Design and characterization of β- and γ-globin TALENs
To develop a system that robustly and rapidly reports on the activity of both the β-globin and γ-globin loci, we designed a gene-targeting strategy using engineered nucleases. Recent reports have described low but significant levels of genome modification at the endogenous β-globin locus using ZFNs (20,21), and we first sought to improve the rate of gene targeting at the β-globin locus by designing custom TALENs to that site. First, we identified four putative left (βL1-βL4) and four right (βR1-βR4) TALEN binding sites near the sickle mutation in β-globin (Figure 1A and Supplementary Figure S1), and synthesized the eight individual TALENs using the NK RVD to bind each guanine. Notably, we made slight modifications of the final TALEN expression vector to include the N- and C-terminal TALEN truncations that have been shown to be sufficient for optimal TALEN activity (13). In an extrachromosomal SSA assay, we identified six TALEN pairs that stimulated SSA at least 10-fold above background (Supplementary Figure S2). We then re-constructed the most active TALEN pair (βL4-NK/βR4-NK) to contain the NN RVD (βL4-NN/βR4-NN) using the Golden Gate cloning strategy previously described (23). To investigate their activities at the endogenous chromosomal β-globin locus, we used the Surveyor nuclease assay in HEK293T cells. 293T cells were used instead of hematopoietic K562 cells because a SNP in one β-globin allele of K562s confounded analysis in the Surveyor nuclease assay (data not shown). The NK versions modified up to 18% of alleles (Supplementary Figure S5A) and the NN TALENs modified 48% of alleles (Figure 1B and Supplementary Figure S5A). As a comparison, we also used a modification of the 'oligomerized pool engineering' (OPEN) method to generate ZFNs to the β-globin locus (24,27). These ZFNs were made independently from the ones reported by Sebastiano et al. (21) but are designed to the same target sequence and are very similar in the amino acid sequence of the alpha-helices that mediate DNA binding (Supplementary Figure S3). Although the ZFNs were much more cytotoxic than the TALENs (Supplementary Figure S4), the ZFNs were also active, modifying up to 12% of β-globin alleles in the Surveyor nuclease assay (Supplementary Figure S5A). Interestingly, delivery of TALENs as mRNA did not increase the already high frequency of cutting, but delivery of the ZFNs as mRNA increased the signal from 12% to 35% (Supplementary Figure S5B). Importantly, the TALENs showed only 4% modification at the δ-globin locus (Supplementary Figure S5C), which has high sequence homology with β-globin (Supplementary Figure S1).
To confirm the frequency of genome modification by βL4-NN/βR4-NN, we used SMRT sequencing, a rapid, high-throughput method for sequencing of the β-globin locus following TALEN treatment (22). SMRT sequencing allows for simultaneous analysis of up to 30 000 sequences, as well as multiplexing various samples at once. Analysis of 14 215 β-globin sequences revealed TALEN modification of 54% (Figure 1C).
Next, to modify the endogenous γ-globin locus, we designed and constructed three left (γL1-γL3) and two right (γR2-γR3) NN TALENs that bind sequences near the ATG start codon of γ-globin (Figure 1D and Supplementary Figure S6). Because of the sequence identity between Aγ-globin and Gγ-globin, these TALEN pairs do not distinguish the two loci. To measure the activity of the γ-globin TALENs, we again used the Surveyor nuclease assay, which resulted in modification of up to 44% of γ-globin alleles with the γL3/γR2 pair (Figure 1E). Two other TALEN pairs modified >30% of γ-globin alleles (Supplementary Figure S7). Analysis of 14 790 γ-globin SMRT sequences revealed a modification rate of 53% with γL3/γR2 (Figure 1F). As expected, because of the lack of sequence homology between the β-globin and γ-globin loci, the γ-globin TALENs had no activity at the β-globin locus (data not shown).

TALEN-mediated β-globin targeting by homologous recombination
We then sought to determine at what frequency these highly active TALENs stimulated gene targeting by homologous recombination (Figure 2A). First, to target the β-globin locus, we designed a targeting vector with ~1 kb arms of homology 5′ and 3′ of the TALEN cut site. In between the homology arms, we included a Ubc-GFP expression cassette that, upon successful homologous recombination, would be stably integrated into the β-globin locus (Figure 2B, 'β-Ubc-GFP' targeting vector). Gene targeting was achieved by nucleofection of β-Ubc-GFP with βL4-NN and βR4-NN TALEN expression plasmids into erythroleukemic K562 cells, and resulted in stable integration of Ubc-GFP in 19% of transfected cells (13% overall) compared to <1% in the absence of TALENs (Figure 2C). We then compared the activities of the NK and NN β-globin TALENs in the gene-targeting assay. In confirmation of the Surveyor assay data, the NN versions stimulated a significantly higher rate of targeted integration compared to the NK TALENs. Interestingly, when paired with βR4-NN, both βL4-NK and βL4-NN stimulated high rates of targeting (~20%). However, when paired with βR4-NK, βL4-NK resulted in 1.8% stable GFP expression, while βL4-NN led to 4.5% stable GFP expression (Supplementary Figure S8). Despite high rates of modification in the Surveyor assay (Supplementary Figure S5A), the ZFNs did not stimulate targeting of the β-globin locus and targeted integration of the Ubc-GFP cassette could not be discriminated from background random integrants (Figure 2C and Supplementary Figure S8). In this direct comparison of ZFNs and TALENs designed to target nearly the same sequence (Supplementary Figure S1), we found that the TALENs were significantly better because of their greater cutting activity, significantly greater stimulation of targeting and their lower toxicity. These data also demonstrate better activity with TALENs using NN as the RVD to recognize guanine compared to NK, but show that NK TALENs can have excellent activity in the correct context.

Targeting β-globin cDNA to the endogenous β-globin locus
We next sought to target full-length β-globin cDNA to the endogenous β-globin ATG start site. In this way, endogenous β-globin regulatory elements would express β-globin from the cDNA instead of from the wild-type genomic sequence, a strategy that would be clinically relevant for both sickle cell disease and β-thalassemia. We modified the β-Ubc-GFP targeting vector, replacing the Ubc-GFP cassette with β-globin cDNA fused in-frame to the natural β-globin ATG start codon, already present in the 5′ arm of homology (Figure 3A, 'β-in-frame-cDNA' targeting vector). Also included in the β-in-frame-cDNA targeting vector was a drug selection cassette encoding a mutant form of methylguanine methyltransferase (MGMT P140K), which allowed for enrichment of targeted cells by treatment with the combination of O6-benzylguanine (O6BG) and carmustine (BCNU).
To determine the frequency of targeting and the efficiency of drug selection, we again employed SMRT sequencing. First, we targeted K562s with the β-in-frame-cDNA targeting vector using βL4/βR4 TALENs. Then we pulsed the samples three times with O6BG and BCNU and harvested gDNA after each pulse. To amplify the β-globin locus, we used a forward primer that is 5′ and outside the start of the 5′ homology arm and a reverse primer in exon 2 of β-globin (Figure 3B). In this way, random integrants were not amplified. The presence of intron 1 in the wild-type genomic DNA sequence of this locus, and its absence in the targeted β-globin cDNA, allowed us to determine the ratio of targeted alleles to wild-type alleles after each pulse based on the length of the sequence, which could then be confirmed by the sequence content (Figure 3B). In the absence of drug selection, 8% of the alleles were targeted as determined by analysing the sequence of 1100 alleles in the TALEN-treated sample. The targeting frequency of 8% of alleles is consistent with the observed rate of β-Ubc-GFP targeting in 19% of cells because there are three copies of the β-globin locus in K562 cells (Figure 2C). Pulsing the targeted cells with O6BG/BCNU up to three times resulted in the enrichment of targeted alleles such that they accounted for >60% of all sequenced alleles (Figure 3C). Since K562s are known to be aneuploid with three copies of the globin locus (28), a post-selection modified allele frequency of 60% is consistent with a highly purified population in which nearly 100% of cells are targeted at one or multiple β-globin alleles.
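The relationship between allele-level and cell-level frequencies in a three-copy locus can be checked with simple arithmetic, sketched here in Python. The function names are hypothetical, and the calculation assumes that every cell carries exactly three copies of the locus and that all alleles are amplified and sequenced with equal efficiency.

```python
def max_cell_fraction(allele_fraction, copies=3):
    """Upper bound on the fraction of targeted cells.

    The bound is reached when every targeted cell carries exactly one
    targeted allele out of `copies`.
    """
    return min(1.0, allele_fraction * copies)


def expected_allele_fraction(cell_fraction, targeted_per_cell, copies=3):
    """Allele frequency if `cell_fraction` of cells each carry
    `targeted_per_cell` targeted alleles out of `copies`."""
    return cell_fraction * targeted_per_cell / copies


# 8% of alleles targeted before selection: at most 24% of cells targeted,
# the same range as the 19% of cells seen with the Ubc-GFP vector.
print(round(max_cell_fraction(0.08), 2))

# A fully targeted population gives 33% of alleles with one targeted copy
# per cell and 67% with two, which brackets the >60% seen after selection.
print(round(expected_allele_fraction(1.0, 1), 2))
print(round(expected_allele_fraction(1.0, 2), 2))
```

Under these assumptions, an allele frequency above one third after selection can only arise if a substantial share of cells carry more than one targeted allele.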

Generation of fluorescent β- and γ-globin reporters by endogenous locus tagging
Next, we redesigned the β-Ubc-GFP targeting vector such that a promoterless GFP was fused in-frame to the β-globin ATG start codon (Figure 4A, 'β-in-frame-GFP' targeting vector). In this way, upon targeting to the endogenous β-globin locus, GFP would be driven by the endogenous β-globin promoter and would be subject to the regulatory elements controlling β-globin expression. We targeted the β-in-frame-GFP targeting vector to the β-globin locus, using either βL4/βR4 TALENs or ZFNs.
Under the same experimental conditions that resulted in targeting rates of 19% with the β-Ubc-GFP targeting vector, the β-in-frame-GFP targeting experiment resulted in a much lower percentage of GFP positive cells, which is attributable not to lower targeting frequencies but to the naturally low level of β-globin expression in K562s (28,29). That is, the level of GFP expression driven by the β-globin gene is too low to be seen above background in many cells. Nonetheless, in the presence of βL4 and βR4 TALENs, there was a significantly higher percentage of GFP positive cells than in control samples (Figure 4B, white bars). Selection with two pulses of O6BG and BCNU resulted in significant enrichment of GFP positive cells in the TALEN and ZFN samples compared to the targeting vector alone (Figure 4B). Notably, with up to four pulses with O6BG and BCNU, the overall percentage of GFP positive cells never increased above 20% (data not shown). We believe this is due to the low activity of the β-globin promoter in K562s. When we sorted for GFP positive cells from the TALEN sample, over the course of 2 weeks in culture, the population went from being >95% GFP positive to ~15% (data not shown). A second sort again resulted in a population of >95% GFP positive cells that fell to 15%. We attributed this observation to the low level of β-globin expression in K562s, such that at any given time 15% of the population expressed GFP at a high enough level to be detected by flow cytometry. When we analysed 48 individual clones from the drug selected TALEN sample, we observed three distinct patterns of GFP expression that we designated 'high,' 'medium' and 'low' (Figure 4C). To determine whether these clonal populations expressed GFP because of targeting to the β-globin locus, we used a genomic PCR assay spanning the junction of integration (Figure 4A, arrows). In this way, the presence of a PCR product indicates correct targeting to the endogenous β-globin locus. Indeed, 11 of 12 analysed clones showed targeted integration (Figure 4D). Interestingly, the one clone that did not produce a PCR product and thus was not targeted (clone #1) was a 'high' GFP expressing clone that had undergone random integration. Although we did not investigate the specific site of integration in this clone, based on its expression profile, it was likely near strong promoter elements that drive robust expression of the transgene. Of the original 48 clones, only 4 had 'high' GFP expression, corresponding to the absence of targeting to the β-globin locus by junction PCR (data not shown). These data demonstrate that targeted cells show low levels of GFP expression because of the low activity of the β-globin promoter in K562s, and that high-expressing cells are paradoxically more likely to be the result of random integration.
To develop a fluorescence-based reporter of the endogenous γ-globin locus, we targeted tdTomato in-frame to the ATG start codon of γ-globin, using a homologous targeting vector containing in-frame tdTomato followed by a neomycin drug resistance cassette (Figure 5A, 'γ-in-frame-tdTomato'). Unlike β-globin, γ-globin is highly expressed in K562 cells, so the fluorescent readout from the targeted γ-in-frame-tdTomato accurately reflected the overall integration rate despite the lack of an exogenous promoter. Co-transfection of γ-in-frame-tdTomato with γL3/γR2 TALENs resulted in stable tdTomato expression in 34% of transfected cells (23% overall), compared to <1% in samples without TALENs (Figure 5B and C). Genomic PCR spanning the integration junction (Figure 5A, arrows) revealed the presence of a targeted band in samples treated with any of the three most active pairs of γ-globin TALENs (Figure 5E, left).
To create a dual-fluorescent reporter that expresses GFP from the endogenous β-globin locus and tdTomato from the endogenous γ-globin locus (Figure 5D), we used the γ-globin TALENs to target the γ-in-frame-tdTomato vector to the γ-globin locus in a previously targeted β-globin-GFP clone (Figure 5E, right). In this way, we generated three cell lines that report on the activity of endogenous globin promoters: the β-globin-GFP reporter, the γ-globin-tdTomato reporter and the β-globin-GFP/γ-globin-tdTomato dual reporter (Figure 5F). In the clones selected as reporter cell lines, expression from the fluorescent transgenes remained stable over the course of more than 4 months in culture.

Using endogenous fluorescent reporter cells to screen globin-modulating compounds
Next, we sought to establish these fluorescent reporter lines as tools that can be used to compare the globin-modulating activities of small molecule compounds. Hydroxyurea is used clinically to increase the production of γ-globin and it has been shown to upregulate γ-globin in K562s (30,31). K562s treated for 4 days with 400 µM hydroxyurea showed a significant 62-fold increase in β-globin expression as measured by qRT-PCR. γ-globin mRNA levels were even more elevated than β-globin transcripts after treatment with hydroxyurea, increasing 932-fold (Figure 6A). Next, we treated the β-globin-GFP reporter cells and the γ-globin-tdTomato reporter cells with hydroxyurea and measured mean fluorescence intensity on day 4. GFP and tdTomato intensities were significantly higher compared to untreated cells, and the increase in tdTomato was significantly greater than the increase in GFP, mirroring the changes in β- and γ-globin expression levels. These results show that the reporter cell lines can be used to rapidly, accurately and robustly measure the activity of the endogenous globin loci.
To expand our analysis, we treated cells from the β-globin-GFP, γ-globin-tdTomato and β-globin-GFP/γ-globin-tdTomato cell lines with 5 concentrations of 17 different compounds shown previously to modulate globin expression (Supplementary Figure S9). Of these, 10 significantly increased the expression of the endogenous γ-globin locus, the most striking of which were guanine, guanosine, apicidin and hydroxyurea (Figure 6B). Similarly, 10 compounds increased the expression of endogenous β-globin, with the best inducers being guanosine, guanine and GMP (Figure 6C). The ideal pharmacological therapy for sickle cell disease is a drug that preferentially induces the production of γ-globin compared to β-globin. Therefore, the most relevant analysis was of the ratio of induction of γ- to β-globin (Figure 6D). Compounds such as guanosine increased the expression of both γ- and β-globin (Figure 6E). However, apicidin was a strong inducer of γ-globin but had no activity at the β-globin promoter (Figure 6F). Importantly, hydroxyurea, the clinical standard of care for induction of γ-globin, had one of the highest γ/β induction ratios of all the screened compounds. In this way, we have established a system to robustly, rapidly and simultaneously report on the activity of the endogenous β- and γ-globin promoters.
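The γ/β induction ratio used to rank compounds can be illustrated with a short Python sketch. It assumes that induction is expressed as the fold change in mean fluorescence intensity (MFI) of treated over untreated reporter cells, which is one plausible reading of the analysis above; the function names and MFI values are hypothetical.

```python
def fold_induction(treated_mfi, untreated_mfi):
    """Fold change in mean fluorescence intensity after treatment."""
    if untreated_mfi <= 0:
        raise ValueError("untreated MFI must be positive")
    return treated_mfi / untreated_mfi


def gamma_beta_ratio(tdtomato_treated, tdtomato_untreated,
                     gfp_treated, gfp_untreated):
    """Ratio of gamma-globin (tdTomato) to beta-globin (GFP) induction."""
    gamma = fold_induction(tdtomato_treated, tdtomato_untreated)
    beta = fold_induction(gfp_treated, gfp_untreated)
    return gamma / beta


# Hypothetical MFIs: tdTomato rises 3-fold and GFP 1.5-fold, so the
# compound favours gamma-globin by a factor of 2.
print(gamma_beta_ratio(3000.0, 1000.0, 150.0, 100.0))
```

A ratio above 1 indicates preferential induction of the γ-globin reporter, a ratio near 1 indicates that both reporters respond similarly, and a compound with no effect on GFP reduces the ratio to its γ-globin fold induction.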

DISCUSSION
The emergence of the TALEN platform for engineering nucleases has made possible the rapid, open-source generation of highly active genome editing proteins. TALENs have been used to cause site-specific gene disruption and gene targeting in yeast (32,33), plants (23), nematodes (34), zebrafish (35,36), rats (37) and human cells (12)(13)(14). A recent report described TALENs designed to human β-globin and showed 5% gene correction of a mutated GFP gene, which had been disrupted by the insertion of the β-globin sequence recognized by the TALENs, but did not describe their activity at the endogenous β-globin locus (38). The authors then used β-globin TALENs and a transposon-based targeting strategy to correct the sickle mutation in patient-derived iPS cells (39). In a third report, Cradick et al. designed a CRISPR/Cas9 system to target β-globin and showed efficient modification of the endogenous locus but demonstrated significant off-target effects (40). Here, we synthesized and compared the activities of NN-TALENs, NK-TALENs and ZFNs designed to the same genomic region in the human β-globin gene. We sought to induce a DNA double-strand break near the site of the sickle mutation, which limited the number of potential TALEN binding sites that adhered to the 5′ T design rule (23). Although several of the TALENs without a 5′ T did show nuclease activity at the β-globin sequence (Supplementary Figure S2), notably the most active pair βL4/βR4 adhered to the 5′ T rule. The highly active βL4 TALEN monomer was designed such that the most C-terminal RVD binds to the sickle thymine and not the wild-type adenine. In spite of this 1 bp mismatch, βL4/βR4 were as active as the 'wild-type sequence' βL4/βR4 TALENs in non-sickle cell lines (data not shown). The promiscuity of the TALEN pair designed to the sickle site for the wild-type sequence highlights the necessity of a thorough analysis of off-target effects of this nuclease pair. However, the activity at the wild-type sequence itself is not a concern in the potential therapeutic applications of this TALEN pair, as it would only be clinically used in patients with two mutated alleles.
Using the βL4/βR4 TALENs, we targeted β-globin cDNA to the ATG start codon of the endogenous β-globin locus in human cells and used a deep sequencing method to precisely detect rates of targeting. Then we developed a TALEN-based locus tagging method to report on the activity of endogenous promoters by targeting GFP and tdTomato to the start codons of the endogenous β-globin and γ-globin genes, respectively. Finally, we showed that our endogenously tagged reporter cells provide a rapid and facile method to analyse the globin-modulating activities of small molecule compounds.
Our strategy of using SMRT sequencing to validate the activity of engineered nucleases as determined by the Surveyor nuclease assay allows for the analysis of many more sequences as compared to standard Sanger sequencing methods at a fraction of the cost of other deep sequencing platforms such as Illumina. We believe that using deep sequencing to determine cutting and targeting frequencies will be especially beneficial in primary cells such as CD34+ hematopoietic stem cells in which these rates are considerably lower compared to cell lines.
Using TALENs to target full-length β-globin cDNA to the endogenous β-globin locus provides an alternate method to gene conversion of the sickle mutation using ZFNs as recently described (20,21). First, we showed considerably higher nuclease activity, using a TALEN platform that has been shown in side-by-side comparisons to be less toxic than ZFNs (15). In terms of toxicity, we showed using a previously described toxicity assay that the βL4/βR4 TALENs have considerably less cellular toxicity than both the β-globin ZFNs and the widely used CCR5 ZFNs (Supplementary Figure S4). Analysis of the highly similar δ-globin locus revealed that the βL4/βR4 TALENs have minimal activity at that site (Supplementary Figure S5). True genome-wide, site-specific analysis for off-target activity is the focus of ongoing research.
With regard to nuclease activity, in the b-in-frame-GFP targeting experiments, which have low background signal because the construct lacks an exogenous promoter, we could detect targeted integration with the ZFNs after drug selection, showing that the ZFNs are capable of stimulating gene targeting at the β-globin locus. However, we were unable to detect targeting of the b-Ubc-GFP cassette with ZFNs at levels above background random integration, presumably because of extremely low targeting efficiency and the toxicity of the ZFNs. In summary, our bL4/bR4 TALENs are more active and less toxic than OPEN-generated ZFNs in both genomic and functional assays.
Another improvement in our strategy is that cDNA targeting would be therapeutic in both sickle cell disease, in which the causative mutation is at codon 6 of the β-globin gene, and β-thalassemia, in which causative mutations can occur throughout the length of the β-globin gene. The co-conversion of the sickle mutation with the downstream integration of a drug resistance cassette in the first intron, as described (20,21), has been demonstrated to be less efficient when there is homologous sequence between the site of the conversion and the insertion of the selectable marker (41), such as the first exon of β-globin. Therefore, when we designed the b-in-frame-cDNA targeting vector, we introduced silent mutations at every sixth nucleotide of the cDNA sequence between the nuclease cut site and the end of the first exon. By reducing the homology between the genomic locus and the cDNA, we shunted the repair to proceed via homologous recombination with the 3′ arm of homology (instead of with the short stretch of homology in exon 1 of the cDNA), ensuring that the drug selection cassette is also targeted to the locus. Unlike previous gene therapy trials that relied on random integration of β-globin and described the importance of β-globin introns for the expression of the transgene (42), our next-generation approach directly modifies the endogenous locus, preserving the extra-genic regulatory elements. Because the intervening sequence 2 (IVS2) has been shown to increase expression of β-globin cDNAs up to 500-fold (43), if we find that expression of the β-globin cDNA is too low in primary cells, we can test whether adding the IVS2 sequence to the construct increases expression. In contrast to prior experiments testing the importance of IVS2, in our targeting experiments IVS2 is retained at the locus, and thus any regulatory effects it might have could still be preserved.
The effect that including this intronic sequence in the targeting construct would have on the efficiency of homologous recombination would also have to be tested. Additionally, we chose to use the MGMT P140K-based drug selection strategy because it is effective in vitro (44) and relies on the FDA-approved compounds O6BG and BCNU, which can enrich for targeted cells in vivo (45).
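The recoding step described above, in which silent mutations are placed at every sixth nucleotide to reduce homology between the cDNA and exon 1, can be illustrated with a minimal Python sketch. It assumes that the changes fall on the third (wobble) position of every second codon and that the first available synonymous base is chosen; the text does not specify which substitutions were actually used. The function names and the short input fragment are hypothetical and serve only as an illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def synonymous_wobble(codon: str) -> str:
    """Return a codon differing only at position 3 that encodes the same
    amino acid, or the original codon if no such alternative exists
    (for example ATG or TGG)."""
    for base in BASES:
        candidate = codon[:2] + base
        if candidate != codon and CODON_TABLE[candidate] == CODON_TABLE[codon]:
            return candidate
    return codon


def recode_every_sixth(cds: str) -> str:
    """Introduce a silent change at the wobble base of every second codon,
    i.e. at nucleotides 6, 12, 18, ... of the input (1-based)."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    recoded = [
        synonymous_wobble(c) if idx % 2 == 1 else c
        for idx, c in enumerate(codons)
    ]
    return "".join(recoded)


# Illustrative in-frame fragment (hypothetical input, not the actual vector).
original = "ATGGTGCATCTGACTCCTGAGGAG"
recoded = recode_every_sixth(original)
mismatches = [i for i, (a, b) in enumerate(zip(original, recoded)) if a != b]
```

In this sketch the encoded protein is unchanged while the nucleotide identity to the genomic exon is interrupted at regular intervals, which is the property the targeting vector design relies on. Codons without a synonymous wobble alternative are simply left as they are.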
Dozens of reports have analysed the effect of drugs on globin expression, primarily by analysing transcript levels by qRT-PCR, hemoglobin electrophoresis or benzidine staining (4,6,7,46-55). We established a method to generate fluorescent cell lines as accurate reporters of differential globin expression, and validated them by comparing the induction of β- and γ-globin mRNA transcripts with the increase in GFP and tdTomato signal following treatment with hydroxyurea. Although we did not directly control for potential cell cycle effects on fluorescence following drug treatment, analysis of transcript levels by qRT-PCR confirms that these changes in fluorescence are due to altered gene expression. We then used the fluorescent globin reporters in a mini drug screen to demonstrate their utility as tools to rapidly and accurately measure modulations in globin expression. Although we used compounds that have previously been described as γ-globin inducers, we found that more than half of them also significantly increased expression from the β-globin locus. One mechanism by which small molecule compounds affect globin expression is through the induction of erythroid differentiation (29). The degree to which these compounds affected the extent of differentiation of this cell line and the mechanism of globin induction by these compounds were not directly investigated here. No matter the mechanism of induced globin expression, these data highlight the importance of simultaneously evaluating both β- and γ-globin expression in globin-induction studies. This is the first proof-of-principle example of using precise genome engineering to rapidly and efficiently generate cell lines with endogenous promoter reporters, validating the output by direct comparison to mRNA transcript levels, and then using the dual reporter cell line to screen for small molecules that differentially regulate two genes.
In this way, we introduce a novel method to analyse endogenous promoter activity in the context of the most prevalent monogenic disease.
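To make the notion of differential induction concrete, the following Python sketch computes a γ/β induction ratio from dual-reporter fluorescence readings and ranks compounds by it. It is only one plausible way to summarize such a screen: the class and function names, the compound labels and all numeric values are hypothetical, and it assumes that tdTomato reports γ-globin promoter activity and GFP reports β-globin promoter activity, each normalized to an untreated control.

```python
from dataclasses import dataclass


@dataclass
class ReporterReading:
    """Mean fluorescence for one compound (arbitrary units, hypothetical)."""
    compound: str
    gfp_treated: float         # beta-globin reporter, treated cells
    gfp_untreated: float       # beta-globin reporter, untreated control
    tdtomato_treated: float    # gamma-globin reporter, treated cells
    tdtomato_untreated: float  # gamma-globin reporter, untreated control


def fold_induction(treated: float, untreated: float) -> float:
    """Fold change of treated signal over the untreated control."""
    if untreated <= 0:
        raise ValueError("untreated signal must be positive")
    return treated / untreated


def gamma_beta_ratio(reading: ReporterReading) -> float:
    """Ratio of gamma-reporter fold induction to beta-reporter fold
    induction; values above 1 indicate preferential gamma induction."""
    gamma = fold_induction(reading.tdtomato_treated, reading.tdtomato_untreated)
    beta = fold_induction(reading.gfp_treated, reading.gfp_untreated)
    return gamma / beta


def rank_by_gamma_preference(readings):
    """Sort readings from most to least gamma-selective."""
    return sorted(readings, key=gamma_beta_ratio, reverse=True)


# Made-up example values, not data from this study.
readings = [
    ReporterReading("compound_B", 400.0, 100.0, 200.0, 100.0),
    ReporterReading("compound_A", 150.0, 100.0, 300.0, 100.0),
]
ranked = rank_by_gamma_preference(readings)
```

A compound that raises both reporters equally yields a ratio of 1, so the ratio separates γ-selective inducers from compounds that increase expression from both loci.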
Historically, many globin expression studies were done in K562 cells because they are ubiquitous erythroid precursors and are highly amenable to in vitro experimentation. However, despite their widespread use, K562s are an imperfect system with which to study the intricacies of globin biology because of their non-physiological levels of β- and γ-globin expression. Indeed, our results similarly show a very high level of baseline tdTomato expression in the g-in-frame-tdTomato targeted cells and a low basal level of GFP expression in the b-in-frame-GFP cells. Despite this, we are able to demonstrate robust differential expression of β- and γ-globin upon induction by various pharmacological compounds, including high γ/β induction with hydroxyurea, the only compound clinically approved for this purpose. With these limitations in mind, we chose K562s as our model system because they can tolerate transfection of large amounts of DNA, allowing for optimization of the vital genome engineering aspects of this strategy. It is clear that alternative cell lines and ultimately primary erythroid progenitors are required to mechanistically describe and validate the methods of globin modulation that are suggested in this proof-of-principle work. As transfection methods for primary cells improve and with the discovery of potentially less toxic modified RNAs, we anticipate achieving biologically relevant levels of genome modification in these cells.
Despite the limitations of K562s, there have been no fewer than 20 reports in the literature in the last year alone describing globin modulation in this cell line. Here, we describe a novel method to concurrently evaluate β- and γ-globin expression, using compounds that have been previously described to regulate globin expression. Having validated the effectiveness of this multi-fluorescent endogenous globin expression approach, we are now transitioning this work into a more biologically relevant cell line that we can use in an unbiased high-throughput drug screen to identify novel γ-globin-specific inducers as the next generation of pharmacologic therapy for patients with sickle cell disease. More generally, this strategy could be broadly applied to generate multi-color reporter cell lines that allow rapid screening for conditions and compounds that promote the activity of a particular pathway or determine cellular fate.