Optimization of scarless human stem cell genome editing

Efficient strategies for precise genome editing in human induced pluripotent stem cells (hiPSCs) will enable sophisticated genome engineering for research and clinical purposes. The development of programmable sequence-specific nucleases such as Transcription Activator-Like Effector Nucleases (TALENs) and Cas9-gRNA allows genetic modifications to be made more efficiently at targeted sites of interest. However, many opportunities remain to optimize these tools and to broaden their range of application. We present several improvements. First, we developed functional re-coded TALEs (reTALEs), which not only enable simple one-pot TALE synthesis but also allow TALE-based applications to be performed using lentiviral vectors. We then compared genome-editing efficiencies in hiPSCs mediated by 15 pairs of reTALENs and Cas9-gRNA targeting CCR5 and optimized ssODN design in conjunction with both methods for introducing specific mutations. We found that Cas9-gRNA achieved 7–8× higher non-homologous end joining efficiencies (3%) than reTALENs (0.4%) and moderately superior homology-directed repair efficiencies (1.0% versus 0.6%) when combined with ssODN donors in hiPSCs. Using the optimal design, we demonstrated a streamlined process to generate seamlessly genome-corrected hiPSCs within 3 weeks.

The correlation analysis of genome editing efficiency and epigenetic state.

We first devised a robust protocol in which gRNA for Cas9-mediated genome editing can be synthesized directly by incubating two 100-mer oligos of customized sequence with the linearized backbone in an isothermal assembly mixture. We detected >90% assembly efficiency, as confirmed by Sanger sequencing. In parallel, to expedite reTALE construct synthesis, we created a library of RVD dimer blocks and backbone constructs (Supplementary Figure 2a) for a robust and cost-effective assembly protocol (TASA, TALE Single-incubation Assembly). TASA enabled us to assemble re-TALEs in a one-pot, one-hour reaction (Supplementary Figure 2b). We found perfect re-TALE assemblies with the following success rates: re-TALE-12.5, 46%; re-TALE-14.5, 32%; and re-TALE-16.5, 18% (Supplementary Figure 2c). Alternatively, re-TALE-16.5s can be assembled in a two-stage protocol (Materials and Methods) with 90% efficiency.

Supplementary Note 2: Statistical analysis of genome editing NGS data
(1) HDR specificity analysis. We used an exact binomial test to compute the probability of observing a given number of sequence reads containing the 2bp mismatch. Based on the sequencing results for the 10bp windows before and after the targeting site, we estimated the maximum base change rates of the two windows (P1 and P2). Under the null hypothesis that changes at the two target bp are independent, the expected probability of observing the 2bp mismatch at the targeting site by chance is the product of these two probabilities (P1*P2). Given a dataset containing N total reads, of which n are HDR reads, we calculated the p-value of the observed HDR efficiency under this null.
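The specificity test above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: it assumes a one-sided test (the probability of seeing at least n chance reads among N), and the function names and the example values of N, n, P1 and P2 are hypothetical.

```python
import math

def binom_upper_tail(n_obs, n_total, p):
    """Return P(X >= n_obs) for X ~ Binomial(n_total, p)."""
    if n_obs <= 0:
        return 1.0
    if n_obs > n_total or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    tail = 0.0
    for k in range(n_obs, n_total + 1):
        # log of C(n_total, k) * p^k * (1 - p)^(n_total - k), kept in log
        # space so that large read counts do not overflow
        log_term = (math.lgamma(n_total + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_total - k + 1)
                    + k * log_p + (n_total - k) * log_q)
        tail += math.exp(log_term)
    return min(tail, 1.0)

def hdr_specificity_pvalue(n_total, n_hdr, p1, p2):
    """One-sided p-value (an assumption of this sketch) for n_hdr reads.

    p1, p2: maximum base change rates in the two flanking 10bp windows.
    Under the null, a read carries both changes with probability p1 * p2.
    """
    return binom_upper_tail(n_hdr, n_total, p1 * p2)

if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the call
    print(hdr_specificity_pvalue(n_total=100_000, n_hdr=50, p1=1e-3, p2=1e-3))
```

With a background rate of P1*P2 = 1e-6 and 100,000 reads, only about 0.1 chance reads are expected, so 50 observed HDR reads give a vanishingly small p-value.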
(2) HDR sensitivity analysis. In our experimental design, the ssODN donors contained a 2bp mismatch against the targeted genomic sequence, so we expected the base changes at the two target bp to occur together whenever the ssODN was incorporated into the genome. Other, unintended sequence changes would be unlikely to occur at the same time, so we predicted unintended changes to be much less interdependent. Based on these assumptions, we used mutual information (MI) to measure the mutual dependence of simultaneous two-base-pair changes in all other pairs of positions, and we estimated the HDR detection limit as the smallest HDR efficiency at which the MI of the targeted 2bp site is higher than the MI of all other position pairs. For a given experiment, we first identified HDR reads carrying the intended 2bp mismatch in the original fastq file, and we then simulated a set of fastq files with diluted HDR efficiencies by systematically removing different numbers of HDR reads from the original dataset. MI was computed between all pairs of positions within a 20bp window centered on the targeting site. In these calculations, the MI of the base composition between any two positions is computed. Thus, unlike our HDR specificity measure above, this measure does not assess the tendency of position pairs to change to any particular pair of target bases, only their tendency to change at the same time (Figure S4A, Table S4). We coded our analysis in R, and MI was computed using the package infotheo.
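The dilution procedure and the MI comparison described above can be sketched as follows. The authors report using R with infotheo; the Python below is an independent, minimal sketch under stated assumptions: reads are already aligned and trimmed to equal-length strings covering the analysis window, MI is the plain empirical estimate in nats, and all function names and the toy reads are hypothetical.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Empirical MI (nats) between base calls at two positions across reads."""
    n = len(col_a)
    joint, pa, pb = Counter(zip(col_a, col_b)), Counter(col_a), Counter(col_b)
    return sum((c / n) * math.log(c * n / (pa[a] * pb[b]))
               for (a, b), c in joint.items())

def target_vs_background(reads, target_pair):
    """Return (MI of the target pair, max MI over all other position pairs)."""
    cols = list(zip(*reads))
    target = mutual_information(cols[target_pair[0]], cols[target_pair[1]])
    background = max(mutual_information(cols[i], cols[j])
                     for i in range(len(cols)) for j in range(i + 1, len(cols))
                     if (i, j) != tuple(target_pair))
    return target, background

def dilute(reads, is_hdr, fraction_removed):
    """Drop the given fraction of HDR reads; keep every non-HDR read."""
    hdr = [r for r in reads if is_hdr(r)]
    other = [r for r in reads if not is_hdr(r)]
    return other + hdr[:round(len(hdr) * (1 - fraction_removed))]

def detection_limit(reads, target_pair, is_hdr, removal_fractions):
    """Smallest diluted HDR efficiency at which target MI exceeds background."""
    for frac in sorted(removal_fractions, reverse=True):  # most diluted first
        diluted = dilute(reads, is_hdr, frac)
        target, background = target_vs_background(diluted, target_pair)
        if target > background:
            return sum(map(is_hdr, diluted)) / len(diluted)
    return None

def toy_reads():
    """Hypothetical 6bp reads: wild type, HDR (GT->CA at 2-3), single errors."""
    return (["ACGTAC"] * 1000 + ["ACCAAC"] * 10
            + ["GCGTAC"] * 5 + ["ACGTAT"] * 5)

if __name__ == "__main__":
    is_hdr = lambda r: r[2:4] == "CA"
    print(detection_limit(toy_reads(), (2, 3), is_hdr, [1.0, 0.9, 0.5, 0.0]))
```

Because MI is computed on base composition, the sketch, like the measure described above, detects positions that change together without regard to which bases they change to.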
(3) Correlations between genome editing efficiency and epigenetic state. We computed Pearson correlation coefficients to study possible associations between epigenetic parameters (DNase I HS or nucleosome occupancy) and genome engineering efficiencies (HDR, NHEJ). The DNase I hypersensitivity dataset was downloaded from the UCSC Genome Browser:
hiPSCs DNase I HS: /gbdb/hg19/bbi/wgEncodeOpenChromDnaseIpsnihi7Sig.bigWig
To compute P-values, we compared the observed correlation to a simulated distribution built by randomizing the position of the epigenetic parameter (N=100,000). Observed correlations higher than the 95th percentile or lower than the 5th percentile of the simulated distribution were considered potential associations.
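A randomization test of this kind can be sketched as below. This is a simplified illustration, not the authors' procedure: instead of re-sampling genomic positions from the bigWig track, it shuffles the epigenetic values among the targeted sites, which is an assumption of this sketch. The function names and the example data are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant input
    return sxy / math.sqrt(sxx * syy)

def randomization_test(efficiency, signal, n_perm=100_000, seed=0):
    """Compare the observed correlation with a shuffled-signal null.

    Returns (observed r, 5th percentile, 95th percentile, flagged), where
    flagged is True when the observed r falls outside the 5th-95th range.
    """
    rng = random.Random(seed)
    observed = pearson(efficiency, signal)
    shuffled = list(signal)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between site and signal
        null.append(pearson(efficiency, shuffled))
    null.sort()
    low = null[int(0.05 * n_perm)]
    high = null[min(int(0.95 * n_perm), n_perm - 1)]
    return observed, low, high, (observed > high or observed < low)

if __name__ == "__main__":
    # Hypothetical values for 15 target sites (not data from the study)
    hdr = [0.1, 0.4, 0.2, 0.6, 0.3, 0.5, 0.2, 0.7, 0.1, 0.3, 0.4, 0.6, 0.2, 0.5, 0.3]
    dnase = [12, 30, 8, 25, 40, 16, 22, 9, 33, 18, 27, 14, 36, 20, 11]
    print(randomization_test(hdr, dnase, n_perm=10_000))
```

Using the 5th and 95th percentiles as thresholds mirrors the criterion stated above; a site-level shuffle is only a stand-in for randomizing positions along the genome.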
Supplementary Figure 1 (c) TASA assembly efficiency for re-TALEs possessing different monomer lengths. The blocks used for assembly are illustrated on the left and the assembly efficiency is presented on the right.
Supplementary Figure 3. The functionality and sequence integrity of Lenti-reTALEs.
(a) Schematic representation of the fluorescence reporter system for testing the activity of lentiviral particles encoding re-TALE.
(d) PCR of genomic DNA from 10 independent colonies infected by lentiviral particles encoding re-TALE-TF. We found that all the colonies carried the desired full-length reTALE cassette.
Supplementary Figure 4. The sensitivity and reproducibility of GEAS.
Information-based analysis of HDR detection limit. Given the dataset of re-TALENs (#10)/ssODN, we identified the reads containing the expected editing (HDR) and systematically removed these HDR reads to generate different artificial datasets with a "diluted" editing signal.
We generated datasets with 100, 99.9, 99.8, 98.9, 97.8, 89.2, 78.4, 64.9, 21.6, 10.8, 2.2, 1.1, 0.2, 0.1, 0.02, and 0% removal of HDR reads, giving artificial datasets with HDR efficiencies ranging from 0 to 0.67%. For each individual dataset, we estimated the mutual information (MI) of the background signal (in purple) and of the signal obtained at the targeting site (in green). We observed that MI at the targeting site is markedly higher than the background when the HDR efficiency is above 0.0014%. We estimated a limit of HDR detection between 0.0014% and 0.0071%. The MI calculation is described in the Methods.
We used Pearson correlation to study possible associations between DNase I sensitivity and genome engineering efficiencies (HR, NHEJ). We compared the observed correlation to a randomized set (N=100,000). Observed correlations higher than the 95th percentile or lower than the 5th percentile of the simulated distribution were considered potential associations. We did not observe any significant correlation between DNase I sensitivity and NHEJ/HR efficiencies.
Supplementary Figure 7. The impact of homology pairing in ssODN-mediated genome editing.
(a) In the experiment described in Figure 3b, we found that overall HDR, as measured by the rate at which the middle 2bp mismatch (A) was incorporated, decreased as the secondary mismatches (B) increased their distance from A (the relative position of B to A varies from -30 to +30bp). The higher rates of incorporation when B is only 10bp away from A (-10bp and +10bp) may reflect a lesser need for pairing of the ssODN against genomic DNA proximal to the dsDNA break.
(b) Distribution of gene conversion lengths along the ssODN. We observed that at each distance of B from A, a fraction of HDR events incorporates only A while another fraction incorporates both A and B (see Figure 3b). These two kinds of events may be interpreted in terms of gene conversion tracts (Elliott et al., 1998), whereby A+B events represent long conversion tracts that extend beyond B and A-only events represent shorter ones that do not reach B. Under this interpretation, a distribution of gene conversion lengths in both directions along the oligo can be estimated (we defined the middle of the ssODN as 0, conversion tracts towards the 5' end of the ssODN as the - direction, and those towards the 3' end as the + direction). Gene conversion tracts progressively decrease in incidence as their lengths increase, a result very similar to gene conversion tract distributions seen with dsDNA donors, but on a highly compressed distance scale of tens of bp for the ssDNA oligo vs. hundreds of bases for dsDNA donors.
(c) Assays for gene conversion tracts using a single ssODN that contains a series of mutations and measuring contiguous series of incorporations. Here, we used an ssODN donor with three pairs of 2bp mismatches (orange) spaced at intervals of 10nt on either side of the central 2bp mismatch (top). We detected only a few genomic sequencing reads (62) carrying at least one mismatch defined by the ssODN among >300,000 reads covering this region. We plotted all of these reads (bottom), color-coded by sequence. Orange: defined mismatches; green: wild-type sequence. Genome editing with this ssODN gave rise to a pattern in which the middle mutation alone was incorporated 85% (53/62) of the time, with multiple B mismatches incorporated at other times. Although the numbers of B incorporation events were too low to estimate a distribution of tract lengths >10bp, it is clear that the short tract region from -10 to +10bp predominates.
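The tract-length interpretation in panel (b) amounts to a simple ratio: at each B offset, the share of HDR events that incorporate both A and B estimates the fraction of conversion tracts reaching at least that far. A minimal sketch of this arithmetic follows; the function name and the read counts passed to it are hypothetical, and only the 53/62 figure comes from panel (c).

```python
def tract_reach_fraction(events):
    """Estimate, per B offset, the fraction of tracts that extend to B.

    events: {offset_bp: (a_only_reads, a_plus_b_reads)}; negative offsets lie
    towards the 5' end of the ssODN, positive offsets towards the 3' end.
    """
    fractions = {}
    for offset, (a_only, a_plus_b) in sorted(events.items()):
        total = a_only + a_plus_b
        fractions[offset] = a_plus_b / total if total else float("nan")
    return fractions

if __name__ == "__main__":
    # Hypothetical counts, for illustration only
    counts = {-30: (95, 5), -10: (50, 50), 10: (30, 70), 30: (90, 10)}
    print(tract_reach_fraction(counts))
    # Panel (c): middle mutation alone in 53 of 62 edited reads
    print(round(100 * 53 / 62))
```

Plotting these fractions against offset gives the kind of decaying profile described in panel (b), under the assumption that A-only events are tracts that stop short of B.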
Supplementary Figure 8. Cas9-gRNA nuclease and nickase genome editing efficiencies.
PGP1 iPSCs were co-transfected with a combination of nuclease (C2) (Cas9-gRNA, cleaves both strands) or nickase (Cc) (Cas9D10A-gRNA, cleaves the non-complementary strand) and ssODNs of different orientations (Oc and On). All ssODNs possessed an identical 2bp mismatch against the genomic DNA in the middle of their sequence. The assessment of HDR is described in the Methods.