Guide-specific loss of efficiency and off-target reduction with Cas9 variants

Abstract High-fidelity clustered regularly interspaced palindromic repeats (CRISPR)-associated protein 9 (Cas9) variants have been developed to reduce the off-target effects of CRISPR systems at a cost of efficiency loss. To systematically evaluate the efficiency and off-target tolerance of Cas9 variants in complex with different single guide RNAs (sgRNAs), we applied high-throughput viability screens and a synthetic paired sgRNA–target system to assess thousands of sgRNAs in combination with two high-fidelity Cas9 variants HiFi and LZ3. Comparing these variants against wild-type SpCas9, we found that ∼20% of sgRNAs are associated with a significant loss of efficiency when complexed with either HiFi or LZ3. The loss of efficiency is dependent on the sequence context in the seed region of sgRNAs, as well as at positions 15–18 in the non-seed region that interacts with the REC3 domain of Cas9, suggesting that the variant-specific mutations in the REC3 domain account for the loss of efficiency. We also observed various degrees of sequence-dependent off-target reduction when different sgRNAs are used in combination with the variants. Given these observations, we developed GuideVar, a transfer learning-based computational framework for the prediction of on-target efficiency and off-target effects with high-fidelity variants. GuideVar facilitates the prioritization of sgRNAs in the applications with HiFi and LZ3, as demonstrated by the improvement of signal-to-noise ratios in high-throughput viability screens using these high-fidelity variants.


INTRODUCTION
Cluster ed r egularly interspaced palindromic r epeats (CRISPR) / CRISPR-associated protein 9 (CRISPR / Cas9) technology is a widely used tool for genome editing and is currently being tested in clinical trials for therapeutic a pplications (1)(2)(3)(4)(5). Many a pplications of this technology utilize wild-type Streptococcus pyogenes Cas9 (WT Sp-Cas9) owing to its high on-target acti vity. Howe v er, the of f-target ef fects of the WT SpCas9 have raised critical concerns in scientific and clinical applications (6)(7)(8)(9). To address this issue, high-fidelity Cas9 variants have been been reported to be associated with enhanced specificity while maintaining on-target activity comparable with WT SpCas9 ( 16 , 18 ). Howe v er, the cleavage acti vities of these high-fidelity variants were only tested in complex with a limited number of single guide RNAs (sgRNAs), and it is unclear if the improvements made by the variants are general for all sgRNAs or are specific to a subset of sgRNAs.
The efficiency and off-target tolerance of the CRISPR / Cas9 system are affected not only by the Cas9 protein, but also by the sequence context of the sgRNA (19)(20)(21). We and others have proposed sets of sequence rules for the prediction and optimization of sgR-NAs with WT SpCas9 ( 7 , 19 , 22 , 23 ). Some of the rules, such as a guide-intrinsic mismatch tolerance (GMT) that can be explained by a kinetic model, are generally applicable to different Cas9 variants ( 24 ). Howe v er, structural analysis has shown that the mutations introduced into the Cas9 variants significantly alter the interacting patterns between the Cas9 protein and the RN A / DN A heteroduplex ( 12 ). Ther efor e, it is expected that the sequence determinants underlying the sensitivity and specificity of Cas9 variants could be di v ergent from the rules deri v ed for WT SpCas9 ( 21 ).
To systematically explore the variant-specific sequence rules that determine the editing efficiency of Cas9 variants, we applied high-throughput viability screens to evaluate the efficiency of ∼24000 sgRNAs complexed with HiFi, LZ3 or WT SpCas9. Mor eover, we measur ed the off-target tolerance of Cas9 variants in the context of 328 sgRNAs and 1753 target sequences using a synthetic paired sgRNAtarget system. These data allowed comprehensi v e modeling of the efficiency and off-target effect of the variants, which further led to the de v elopment of a new machine learning frame wor k for sgRNA design for application with the highfidelity Cas9 variants.

Lentivirus production and titration
For virus packaging, HEK293T cells (4 × 10 6 ) were seeded into a 10 cm tissue culture dish containing 10 ml of fresh medium and kept at 37 • C overnight. A 4 g aliquot of endotoxin-free lentiviral plasmid (lentiCas9, lentiHiFi or lentiLZ3), 4 g of psPAX2 (Addgene, #12260) and 2 g of pMD2.G (Addgene, #12259) wer e mix ed in 500 l of Opti-MEM (Gibco) with 30 l of X-tremeGene HP DNA transfection reagent (Roche, #06366236001) at room temperature for 10 min, and then dropwise added to the 10 cm dish with HEK293T cells. The supernatant containing lentivirus was filtered through a 0.45 m syringe filter 48 h after transfection. Lentivirus was aliquoted and frozen at -80 • C until use. To test endogenous gene knockout, lentivirus production was performed using the same procedure employing the endotoxin-free lentiviral plasmid lentiGuide-sgYAP1 or -sgMTAP.
For virus titration, cells were seeded into 24-well plates with 1 × 10 5 cells / well for 6-8 h for DLD-1 or overnight for HEK293T cells before adding various volumes of lentivirus. H2171 cells were seeded into a 24-well plate with 5 × 10 5 cells / well and incubated with various volumes of lentivirus just after seeding. To increase infection efficiency, 8 g / ml polybr ene (Millipor e, #TR-1003-G) was supplemented. After a 48 h infection, cells were subjected to selection with Puro (2 g / ml). Cell viability was tested using CellTiter-Glo ® reagent (Promega, #G7572) 3 days after selection and the virus titer was determined based on the cell survival rate.

Stable cell line construction
The color ectal adenocar cinoma cell line DLD-1 and small cell lung cancer cell line H2171 are two model cell lines that can be easily transfected and transduced, and have been successfully utilized in high-throughput CRISPR knockout screens (25)(26)(27)(28)(29)(30)(31)(32). We therefore selected DLD-1 and H2171 for construction of stable cell lines with constituti v e e xpression of WT SpCas9 or Cas9 variants. Lentivirus infection was performed with the same procedure as virus titration mentioned above using a similar virus volume for WT SpCas9, HiFi and LZ3. V irus v olumes were adjusted accordingly to keep Cas9 expression consistent among WT SpCas9, HiFi and LZ3. After Puro (2 g / ml) selection, the pooled cells with individual Cas9 were subcultured and expanded for determining Cas9 expression by western blot. For the generation of HEK293T cell lines expressing doxy cy cline-inducib le WT SpCas9 or Cas9 variants, cells were infected with pCW -Cas9, pCW -HiFi or pCW-LZ3 lentivirus followed by Puro selection. Monoclonal HEK293T cells with inducible WT SpCas9 or Cas9 variants were produced by limiting dilution. To verify the knockout efficiency of the stable cell lines, the cells were infected with lentiviral lentiGuide-sgYAP1 or -sgMTAP. The knockout of YAP1 or MTAP in the resulting cells after Blast selection was determined by western blot. An sgRNA targeting the AAVS1 gene was used as the control.

sgRNA library design
The EpiC library and T1 library were synthesized and constructed as described in our previous publications ( 22 , 31 ). For Tiling library design, we compiled a set of essential transcription factor genes in DLD-1 cells which encode multiple domain proteins, with particular attention on zinc finger proteins. To define the essentiality, we calculated the CRISPR score using the CRISPR knockout screen datasets in DepMap ( 25 ) and set the threshold to < -0.5. These proteins are largely not well identified. We manually added several genes encoding proteins with well-identified domains as the positi v e control genes. All the sgRNAs (19 nt) targeting e xons (e xtent 20 bp up-and downstream) were included as long as there is a protospacer adjacent motif (PAM; NGG). The sgRNAs for the top 50 essential genes (four sgRNAs per gene) in our previous CRISPR knockout screen data were chosen as positi v e sgRNA controls ( 31 ). The 500 sgRNAs targeting intergenic regions deri v ed from the Sabatini dataset ( 33 ) were included as the negati v e sgRNA controls. The redundant sgRNAs and those targeting multiple loci were removed and yielded a final Tiling library with 11762 sgRNAs. Each sgRNA was flanked with 5 sequence: TATCTT GT GGAAAGGACGAAA-CACCg and 3 sequence: GTTTTA GA GCTA GAAATA G-CAAGTTAAAAT. This oligo library was synthesized as a pool by Custom Array Inc. (Bothell, WA, USA). The sequences of sgRNAs are provided in Supplementary  Table S1.

Plasmid library construction and transformation
We amplified the synthesized oligo library using the following primer pair GGCTTT AT AT ATCTT GT GGAAAG-GA CGAAA CA CCG (forward) and CT AGCCTT ATTT-T AACTTGCT ATTTCT AGCTCT AAAAC (re v erse) with NEBNext High-Fidelity Master Mix (NEB, #M0531) for 10 cycles. The pooled library cloning and transformation processes were performed according to our previous publication ( 31 ). Briefly, the amplified oligo library was purified and ligated into BsmBI-digested LentiGuide-Blast using Gibson assembly. The expression of an sgRNA and the Blast resistance gene is dri v en under the U6 promoter and the EF1 ␣ promoter, respecti v ely. The transformation was performed with 2 l of the ligation product for each tube of electrocompetent cells (Lucigen, #60242) using a GenePulser (BioRad) according to the manufacturer's protocol, and the cells were plated onto 15 cm plates with carbenicillin selection (50 g / ml). After 14 h, all the colonies were collected as a pool for plasmid libr ary extr action with Endotoxin-Free Plasmid Maxiprep (Qiagen, #12362).

Pooled screen
The lentivir al libr aries wer e pr epar ed as described pr eviously ( 31 ). Before determining the multiplicity of infection (MOI), DLD-1 and HEK293T cells were seeded into 10 cm dishes at the cell density of 6 × 10 6 cells / dish in 10 ml of fresh medium for 6-8 h and ov ernight, respecti v ely. H2171 cells were seeded into 6-well plates with 5 × 10 6 cells / well. For transduction, cells were incubated with different volumes of virus supplemented with 8 g / ml polybrene for 36-48 h and then selected with Blast (20 g / ml for DLD-1 and HEK293T, 10 g / ml for H2171) for 5 days. Cell survival rate was measured by CellTiter-Glo assay and cell counting, and MOIs for different virus volumes were calculated compared with the non-Blast treated cells.
The pooled CRISPR knockout screens were performed using a similar procedure to the MOI testing mentioned in this section. DLD-1 cells with WT SpCas9, HiFi or LZ3 were infected with the EpiC library or Tiling library. H2171 cells with WT SpCas9, HiFi or LZ3 were infected with the EpiC library. Three independent replicates were set for each cell line with a minimum of 2 × 10 7 cells / replicate aiming for at least 500-fold coverage. Cells were infected with the lentivir al libr ary at an MOI of ∼0.3 for 36-48 h, followed by Blast selection for 5 days. The resulting cells were collected and passaged in fresh medium at the density of 1 × 10 6 cells / ml for H2171 cells and 1 × 10 7 cells / 15 cm dish for DLD-1 cells. Cells were passaged e v ery 2-3 days and cell number at a minimum of 500-fold coverage of the library was maintained during each passaging. After 20 days of maintenance (23 days for the DLD-1 cell lines with the EpiC library), cells were collected for genomic DN A (gDN A) extraction using the Blood & Cell Culture Midi kit (Qiagen, #13343) according to the manufacturer's instructions.
For off-target evaluation, we performed high-throughput screens in HEK293T cells with doxy cy cline-inducib le WT SpCas9, HiFi or LZ3 using a synthetic dual-target system ( 22 ). The library T1 was designed with an sgRNA expression cassette paired with its relevant off-target and on-target sequences arranged in tandem. The screening procedure Nucleic Acids Research, 2023, Vol. 51, No. 18 9883 was performed as the screens in the DLD-1 cells. Cell pellets were collected on day 9 after Cas9 induction for gDNA extraction.

Library amplification and deep sequencing
The sgRNA inserts were obtained by two rounds of polymerase chain reaction (PCR) amplification from gDNA. The first round was performed with primer pairs listed in Supplementary Table S2 with 16 cycles using NEBNext High Fidelity Master Mix (NEB, # M0541) for the screens with EpiC and Tiling libraries and NEBNext Ultra II Q5 Master Mix (NEB, #M0544) for the screens with the T1 library. A 40 g aliquot of gDNA was used as the template in eight PCRs to achie v e ∼500-fold coverage. The resulting products were purified with a PCR purification kit (Qiagen) and 5 l of purified products were used as the template for the second round of PCR with 12 cycles. The PCR products were gel purified, quantified by Qubit and qPCR, and deep sequenced using Nextseq500 at MDACC-Smithville Next Generation Sequencing Core. Single-end (75 bp) sequencing was conducted for EpiC and Tiling libraries, and paired-end (75 bp) sequencing was used for the T1 library.

CRISPR screen data processing
The pooled CRISPR knockout screen data were processed using MoPAC ( https://sourceforge.net/projects/mopac/ ). Basically, the sequencing reads were first aligned to the sgRNA library and counted. The read counts were then processed with a quality control module, a rank-weighted average algorithm for gene essentiality measurement, a normalization module for removing biases caused by different depths of selection, and the assessment of dropout effects of each sgRNA based on the normal distribution (Z-score).
For the high-throughput off-target screen data generated with a synthetic dual-target system, we adopted the same computational pipeline described in our previous study to call the read counts for fiv e different types of indels and calculate the off-on ratios for each sgRNA-target pair ( 22 ).

Sequence feature analysis for efficiency loss with Cas9 variants
To determine the sequence features associated with efficiency loss with HiFi and LZ3, we first extracted sgRNAs that target essential genes and classified them into three different groups based on their dropout effects in WT SpCas9 and the variant screens: the 'Inefficient' group (Z-score > -3 in the WT SpCas9 screen), the 'WT efficient only' group [Zscore < -3 in the WT SpCas9 screen and > 2-fold standard devia tion (SD) dif ference between WT SpCas9 and the variant screens] and the 'Both efficient' group (Z-score < -3 in the WT SpCas9 screen and < 2-fold SD difference between WT SpCas9 and the variant screens). As for HF1, the sgR-NAs were also classified into three different groups based on the corresponding indel rates in WT SpCas9 and HF1 screens: the 'Inefficient' group (indel rate < 0.6 in WT Sp-Cas9 screen), the 'WT efficient only' group (indel rate > 0.6 in WT SpCas9 and > 2-fold SD difference between WT Sp-Cas9 and HF1 screens) and the 'Both efficient' group (indel rate > 0.6 in WT SpCas9 screen and < 2-fold SD difference between WT SpCas9 and HF1 screens). We then calculated the log odds ratios of different types of nucleotides at each position along the spacer sequence between the 'Both efficient' and 'WT efficient only' groups. The sequence logos were generated using the logomaker software ( 34 ).
We used the Kullback-Leib ler (KL) di v ergence to assess the importance of different nucleotide positions. Let N 1 i , N 2 i , N 3 i and N 4 i r epr esent the fr equency of A, T, C and G at the i th position in the 'Both efficient' group, and i and M 4 i r epr esent the fr equency of A, T, C and G at the i th position in the 'WT efficient only' group. The KL di v ergence was calculated as follows: Larger KL di v ergence values indicate larger differences between two groups of sgRNAs at a specific nucleotide position and higher importance of the specific position for the efficiency loss with Cas9 variants.

Efficiency prediction for HiFi and LZ3
To predict the on-target efficiency for Cas9 variants using the CRISPR screen da ta genera ted in this study, we extracted sgRNAs targeting essential genes in the variant screens with both Tiling and EpiC libraries, which resulted in 9875 sgRNAs in the training dataset. For each sgRNA, we averaged the Z-scores in the HiFi and LZ3 screens as the final measurement for the knockout efficiency for Cas9 variants.
We first tried f our con ventional machine learning methods: a linear r egr ession model (LR), a support vector regression model with RBF kernel (SVM), a random forest r egr ession model (RF) and a gradient boost tree regression model (GBT). All the methods were implemented through the corresponding modules from the scikit-learn package in Python. Next, we implemented a two-step transfer learning strategy to predict the on-target efficiency of sgRNAs for Cas9 variants. In the pre-training step, we took two recurrent neural network (RNN) models as our source models, which wer e pr e-trained on the large datasets with > 50000 sgRNAs in WT SpCas9 and SpCas9-HF1 scr eens, r especti v ely. The bidirectional long short-term memory (BiL-STM) layers of two RNN models were then frozen to avoid destroying any of the information they contain during future training rounds. In the fine-tuning step, the outputs of BiLSTM layers of the two RNN models together with mono-and dinucleotide sequence features were passed to downstream dense layers, the parameters of which were further fine-tuned with HiFi and LZ3 screen data described above. For the mononucleotide features, the 20 nt sgRNA sequence was binarized into a 4 × 20 two-dimensional array, with 0s and 1s indicating the absence or presence of four different nucleotides (A, T, C and G) at e v ery single position. For the dinucleotide features, the 20 nt sgRNA sequence was binarized into a 16 × 19 two-dimensional array, with 0s and 1s indicating the absence or presence of 16 different dinucleotides (AT, A C, AG , AA, TT, TA, TG, TC, CC, CA, CG, CT, GG, GA, GT and GC) at each of the constituti v e positions along the sgRNA sequence. To assess the effects of mono-and dinucleotide features on the performance, we built four different prediction models: (i) without any additional sequence features; (ii) with mononucleotide sequence features; (iii) with dinucleotide sequence features; and (iv) with both mono-and dinucleotide sequence features. We perf ormed 5-f old cross-valida tion and calcula ted the Spearman's correlations between predicted scores and experimentally measured dropout effects as evaluation metrics. This process was randomly repeated 10 times to ensure robustness and reliability.
To evaluate the performance of different machine learning methods, we adopted a cross-validation approach by randomly splitting the dataset 10 times into two groups, among which 80% were used as the training set and 20% were used for the testing. The performance was assessed by computing the Spearman correlation between the predicted and observed efficiency scores. For the independent evaluation of our transfer learning-based model which we named GuideVar-on, we compared our model with two other deep learning-based models, DeepHF ( 35 ) and Deep-SpCas9variants ( 17 ), on a dataset generated in the previous study, which includes the on-target indel rates for 59 sgR-NAs targeting different genomic loci with different Cas9 variants ( 18 ). DeepHF is the combination of the RNN model and important biological features, which predicts the efficiency of sgRNAs for eSpCas9(1.1), SpCas9-HF1 and WT SpCas9. The source codes of DeepHF were downloaded from https://github.com/izhangcd/DeepHF . Deep-SpCas9variants is a CNN-based deep learning model that predicts the activity of 16 different types of variants at any target sequence. The source codes of the DeepSp-Cas9variants were downloaded from https://github.com/ NahyeKim/DeepSpCas9variants . Of note, only the models predicting the efficiency for SpCas9-HF1 were used for the comparison. The performance was evaluated as the Spearman correlation between the predicted efficiency score and experimentally measured indel frequency for each sgRNA.

Estimation of off-target effect from dual-target assay
To assess the relati v e off-target effect of an sgRNA at the off-target site compared with its on-target sequence, we calculated the off-on ratio using the following formula: Where C 1 , C2 and C 3 indicate the read counts of indels a t of f-target site only, a t on-target site only and at both onand off-target sites, respecti v ely.
Because HiFi and LZ3 showed highly consistent offtarget effect (Figure 3 C), we averaged the of f-on ra tios of HiFi and LZ3 at specific target sequences for a more accurate measurement of off-target effects with the variants.
As the result, we computed the off-on ratios for 815 sgRNA-target pairs with different mismatch types and positions from our training set. To further measure the degree of off-target reduction by the variants (VT) relati v e to WT SpCas9, we calculated the VT / WT ratio using the f ollowing f ormula: where r h f , r lz and r wt indicate off-on ratios for HiFi, LZ3 and WT SpCas9, respecti v ely.

Calculation of mismatch-associated energy barrier
To quantify the energy barrier associated with a specific mismatch type between sgRNA and the targeted DNA, we computed the difference of base-stacking energy between DN A-DN A and DN A-RN A hybridization over three nucleotides centered at the mismatch position based on the nucleic acid duplex energy parameters adopted from the previous study ( 36 ). To determine the influence of the energy barrier on reducing off-target effects by the variants, we computed the ratio of off-target effects (off-on ratio) between the variants and WT SpCas9 (VT / WT ratio) for each sgRNA-target pair and the corresponding energy barriers. We also calculated the Pearson correlation between the expected energy barrier for a certain mismatch type and the median VT / WT ratio of all sgRNA-target pairs with that type of mismatch.

Off-target prediction for HiFi and LZ3
We model the sequence-specific off-target effect as the combination of four factors: (i) the individual mismatch effect (IME), which is the averaged effect for a certain type of misma tch tha t occurred a t the specific position of the sgR-NAs deri v ed from our pre vious WT SpCas9 screen ( 22 ); (ii) the GMT, which measures the off-target tolerance from the intrinsic sequence of an sgRNA and is predictable with a CNN model; (iii) the mismatch type (MT), which describes the type of the mismatch between an sgRNA and the target sequence (the MT is encoded as a 12 bit binary vector r epr esenting the occurr ence of 12 differ ent mismatch types rAdA, rAdC, rAdG, rCdA, rCdC, rCdT, rGdA, rGdG, rGdT, rUdC, rUdG and rUdT, where the value for the exactly occurring mismatch type is 1 and all the others are 0s); and (iv) the mismatch position (MP), which describes the position of the mismatch between an sgRNA and the target sequence. The MP is encoded as a 20 bit binary vector r epr esenting 20 positions along the target sequence, where the value at the mismatch-occurring position is 1 and all the other positions are 0s. We tested a simple linear regression (LR) model as well as three non-linear models including a support vector r egr ession model with RBF kernel (SVM), a random forest regression model (RF) and a gradient boost tree regression model (GBT), to integrate the four factors described above. All the methods were implemented through the corresponding modules from the scikitlearn package in Python. To evaluate the performance of different methods, we adopted a cross-validation approach by randomly splitting the dataset 20 times into two groups, of which 80% were used as the training set and the remaining 20% were used for the testing. The performance was assessed by computing the Spearman correlation between the predicted off-target effects and experimentally measured of f-on ra tios.  Gi v en sgRNA-target pairs with N ( N > 1) mismatches, we first predict the of f-target ef fects for each singlemismatched pair with the machine learning models described above, denoted as m 1 , m 2 , m 3 , . . . , m n . The overall of f-target ef fect is computed by incorporating the combina torial ef fects (CE) estima ted from our previous WT Sp-Cas9 screen data. We denote δ i j as the CE between two misma tches tha t occur a t the i th and the j th nucleotide relati v e to the PAM ( i, j = 0 , 1 , 2 , . . . , 20) . The overall combinatorial effect is computed as the geometric mean of all the possible combinations among the N mismatches: The final off-target effect (GuideVar-off) is then computed as the following formula: GuideV ar of f = δ all * n i = 1 m i Considering that there are no off-target prediction methods specifically for Cas9 variants, we sought to compare GuideVar-off with the three state-of-art methods used to predict of f-target ef fects for WT SpCas9. (i) Eleva tion is a two-layer r egr ession model wher e the first layer learns to predict the effects of a single mismatch and the second layer learns how to combine single-misma tch ef fects into a final scor e ( 37 ). The sour ce codes of Elevation were downloaded fr om https://github.com/Micr osoft/Elevation . (ii) CRISPR-Net is a more recent deep learning method using a recurrent convolutional network, which shows superior performance compared with other machine learning approaches ( 38 ). The source codes to implement CRISPR-Net were downloaded from https://codeocean.com/capsule/9553651/ tree/v1 . (iii) MOFF is a model-based off-target predictor that includes three factors corresponding to the multiplication of IME, CE and GMT ( 22 ). The source codes to implement MOFF were downloaded from https://github. com/MDhewei/MOFF . We compared the methods on the TTISS dataset, which contains 630 detected off-target sites across 59 sgRNAs with WT SpCas9, HiFi, LZ3 and SpCas9-HF1 ( 18 ). For each sgRNA-target pair, we measured the off-target effect as the off-on ratio, which is calculated as the detected off-target reads divided by on-target reads. We adopted the Spearman correlations between the measured and predicted off-on ratios for quantitati v e evaluations. The correlation between of f-target ef fects (VT / WT ra tio) and the expected energy barriers caused by the mismatches during R-loop formation. ( H ) Scatter plot showing the association between the energy barrier caused by a specific mismatch type and the median VT / WT ratio for that mismatch type. rXdY refers to the mismatch type where the sgRNA sequence is X and the target DNA sequence is Y.

sgRNA design for high-fidelity Cas9 variants using GuideVar
To facilitate the rational sgRNA design for the CRISPR applications using high-fidelity Cas9 variants, we built a unified computational frame wor k GuideVar, which consists of the following steps. First, gi v en the sequences of a group of sgRNA candidates, GuideVar will predict the on-target score for each sgRNA using GuideVar-on. Second, Guide-Var will map the sgRNAs to the genome to search for potential off-target sites harboring up to fiv e mismatches using CRISPRitz, a software for rapid and high-throughput in silico off-target site identification ( 39 ). A final off-target score for each sgRNA is calculated as the logarithm of the sum of the predicted of f-target ef fects for all the potential off-target sites. Third, GuideVar will classify all the sgR-NAs into three categories based on the predicted on-and off-target scores: the first class is of high on-target score ( > 0.5) and lo w off-tar get score ( < 1.0), which will be selected as priority; the second class is of high on-target score but high off-target score ( > 1.0), which can still be used but is less pr eferr ed; and the third class is of low on-target score ( < 0.5), which is highly recommended to be discarded. The source codes of GuideVar are available at https://github. com/MDhewei/GuideVar .
To test the frame wor k, we applied GuideVar to all the sgRNAs in the EpiC library and compared the average log-fold changes (LFCs) of sgRNAs targeting essential genes and non-essential genes within each category classified by GuideVar. To quantitate the separation, we computed the strictly standardized mean difference (SSMD) between the sgRNAs targeting essential and non-essential genes. A higher SSMD indica tes grea ter separa tion between the essential and non-essential genes ( 40 ).

Guide-specific loss of efficiency revealed by high-throughput knockout screens with Cas9 variants HiFi and LZ3
High-fidelity Cas9 variants have been developed for increased specificity of the Cas9 system. Among over a dozen Cas9 v ariants av ailable to date, HiFi and LZ3 are two variants that show optimal trade-off between activity and specificity ( 18 ). To systematically compare the editing efficiency between WT SpCas9 and Cas9 variants, we performed high-throughput viability screens in cell lines with stab le e xpression of WT SpCas9, HiFi or LZ3. A colorectal Nucleic Acids Research, 2023, Vol. 51, No. 18 9889 adenocarcinoma cell line DLD-1 and a small cell lung cancer cell line H2171 were individually engineered with WT SpCas9 or the two Cas9 v ariants b y lentiviral transduction. Western blot confirmed equivalent expression of the three types of Cas9 and effecti v e gene knockouts in all six engineered cell lines (Supplementary Figure S1). The procedures of high-throughput viability screens are demonstrated in Figure 1 A and detailed in the Methods. Here, we used two independent lentiviral plasmid libraries, each containing ∼12000 sgRNAs (Supplementary Figure S2). The first library, named EpiC, contains sgRNAs targeting ∼1000 epigenetic regulators and cancer-related genes, as well as control sgRNAs targeting core essential and non-essential genes ( 31 ). The second is a Tiling sgRNA library that contains all sgRNAs targeting the exons of 26 essential transcription factors r equir ed for the proliferation and survival of DLD-1 cells. Of the six engineered cell lines, the DLD-1 cells were transduced with either the EpiC or Tiling libraries, and the H2171 cells were transduced with the EpiC library. After an ∼20 day selection, the sgRNA sequences were amplified, sequenced and counted for abundance relati v e to plasmid DNA (Supplementary Tables S3-S5).
We first conducted quality control analysis based on the correlation of independent replicates and the dropout effects corresponding to sgRNAs targeting essential genes and non-essential control loci (Supplementary Figuresthe S3-S5). Our results confirmed reliability and consistency of the screening data in this study. Next, we tested the consistency of dropout effects among the screens with WT SpCas9, HiFi and LZ3, based on pairwise Pearson correlations of the LFCs of sgRNA abundance (Figure 1 B; Supplementary Table S6). In DLD-1 cells transduced with the EpiC library, HiFi and LZ3 showed a high correlation ( r = 0.93), whereas the correlations decreased when comparing HiFi and LZ3 against WT SpCas9 ( r = 0.87 and 0.83, respecti v ely). Upon further examination, we found that a fraction of sgRNAs was associated with significantly less LFCs when HiFi and LZ3 were used, as compared with the screen with WT SpCas9 (Figure 1 B) Tables S7  and S8). Thus, the correlation analysis suggests a guidespecific variant-dependent cell dropout effect.
We reason that two factors may account for the decreased dropout effects with Cas9 variants. First, the variants reduce the off-target effect which could lead to a defect in cell viability. Second, some sgRNAs can be associated with loss of efficiency when targeting essential genes using the variants. To dissect these two factors, we examined the sgRNAs in the EpiC library that target non-essential and essential genes individually. Considering the high correlation between HiFi and LZ3, we averaged the LFCs of sgRNA abundance corresponding to the two variants in DLD-1 cells for further analysis. For the sgRNAs targeting non-essential genes, 151 (1.8%) out of 8224 showed a high dropout effect (LFC < -1) in the WT SpCas9 screen (Figure 1 C). As expected, the dropped-out sgRNAs are associated with high off-targeting potential measured by the MOFF score ( 22 ), indicating that a small fraction of sgR-NAs leads to a viability defect of cells independent of the function of their target genes due to the off-target effects (Figure 1 D). Among them, 88 (58.3%) showed decreased dropout effects ( > 2-fold difference in sgRNA abundance between the variants and WT SpCas9 screens) in HiFi and LZ3 screens, suggesting that the majority of the off-target effects wer e r educed b y the v ariants. For the sgRNAs targeting essential genes, 2021 (50.7%) of 3990 showed dropout effects (LFC < -1) in the WT SpCas9 screen. Among them, 379 (18.8%) sgRNAs are associated with decreased dropout effects ( > 2-fold difference) in the variant screens compared with the WT SpCas9 screen (Figure 1 E, blue dots). Interestingl y, those sgRN As depleted onl y in the WT SpCas9 scr een ar e associated with similar MOFF scor es to those without a dropout effect, suggesting that the difference in dropout effects is mainly caused by the loss of efficiency of the sgRNAs in variant screens instead of the reduction of of f-target ef fects (Figure 1 F). Indeed, among all the sgR-NAs in the EpiC library that were depleted only in the WT SpCas9 screen, nearly 80% target essential genes where ontarget functional perturbations lead to phenotypic changes (Figure 1 G). We observed similar results to the EpiC library in H2171 cells and to the Tiling library in DLD-1 cells (Supplementary Figure S6C-E). Taken together, these lines of evidence indicate that sgRNA-specific loss of efficiency is the major causal factor that accounts for the difference in dropout effects between the screens with Cas9 variants and that with WT SpCas9.

Sequence features that contribute to the loss of efficiency with Cas9 variants
Next, we asked if the nucleotide sequence context of sgR-NAs contributes to the loss of efficiency in Cas9 variant screens. To answer this question, we combined the sgRNAs targeting essential genes in the EpiC and Tiling libraries and categorized them into three groups: 'Inefficient' (not efficient in either WT SpCas9 or Cas9 variant screens), 'WT efficient only' (efficient only in the WT SpCas9 screen) and 'Both ef ficient' (ef ficient in both WT SpCas9 and Cas9 variant screens). We then computed the log odds ratios of spacer nucleotide frequency at positions 1-19 between the 'Both efficient' and 'WT efficient only' groups. We observed highly consistent sequence features between HiFi and LZ3 ( Figure  2 A, B; Supplementary Figure S7). Of note, the identified sequence featur es ar e also consistent for sgRNAs with or without a mismatch at the 5 -appended 'g' nucleotide (Supplementary Figure S8), which has been reported to impact the activity of sgRNAs in complex with Cas9 variants ( 41 ). We further extended the analysis to an orthogonal dataset generated through direct measurement of indel frequency induced by WT SpCas9 or HF1, another high-fidelity Cas9 variant ( 35 ). Despite different types of Cas9 variants and experimental settings, many sequence features are reproducible among HiFi, LZ3 and HF1 (Figure 2 C; Supplementary Figure S7), suggesting a common mechanism underlying the sequence-dependent loss of efficiency among the three variants. To determine the importance of nucleotide positions, we calculated the KL di v ergence that measures the difference in nucleotide distribution at each position between the 'Both efficient' and 'WT efficient only' groups. We found that sequences in thr ee r egions significantly contribute to the loss of efficiency: (i) positions 2-5 in the seed region proximal to the PAM; (ii) positions 7-11, excluding position 9, at the junction of the seed and nonseed regions; and (iii) positions 15-18 in the non-seed region (Figure 2 D).
Pr evious r eports indica ted tha t sgRNA ef ficiency with WT SpCas9 is highly associated with the nucleotide sequence in the seed region ( 3 , 7 , 42 ). With regard to the variants, howe v er, our results showed that the sequence context in the non-seed region also contributes to the variantspecific editing efficiency. We hypothesized that the observed sequence features in the non-seed region are associated with the structural alteration mediated by the mutations introduced into the variants. Comparing the mutations in HiFi, LZ3 and HF1, we found that all three variants harbor mutations in REC3, a non-catalytic domain that targets complementarity and governs the HNH nuclease to regulate catalytic competence (Figure 2 E, upper) ( 11 , 16 , 18 , 43 ). Interestingly, the structural analysis revealed that positions 9-11 and 15-17, but not 12-14, interact with the REC3 domain of WT SpCas9 protein (Figure 2 E, bottom) ( 44 ). Of note, it has been reported that mutations in REC3 can disrupt the interaction between REC3 and the RN A / DN A heteroduplex, w hile the interaction is r equir ed for the catalytic function ( 12 , 43 , 45 , 46 ). Ther efor e, our results support a model in which specific spacer sequences at positions 15-18 and / or 9-11 mediate the activity of Cas9 variants via conformational regulation of REC3-RN A / DN A interaction.

Sequence-dependent off-target tolerance of Cas9 variants
Despite Cas9 variants being able to greatly reduce off-target effects compared with WT SpCas9, some sgRNAs targeting non-essential genes showed significant dropout effects in our viability screens when HiFi and LZ3 were used, corresponding to high off-target tolerance (Figure 1 C, D). To systema tically investiga te the sequence-specific of f-target tolerance of the Cas9 variants, we evaluated the off-target effects of 328 sgRNAs using a synthetic paired sgRNAtarget system that allows high-throughput measurement of on-target and off-target indel rates in parallel ( Figure 3 A; Supplementary Table S9) ( 22 ). For each sgRNA, we designed se v en off-target sequences, comprising three targets with one mismatch (1-MM), three with two mismatches (2-MM) and one with three mismatches . Consistent with the viability screens (Figure 1 B; Supplementary Figure  S6A, B), the on-target indel rates of HiFi and LZ3 showed a high corr elation, wher eas the corr elations between the variants and WT SpCas9 were much lower ( Supplementary Figure S9). In general, the off-target effects, as measured by of f-on ra tios, w ere significantly low er for the two Cas9 variants compared with WT SpCas9 (Figure 3 B; Supplementary Figure S10; Supplementary Table S10), confirming the overall reduced off-target tolerance of Cas9 variants. The of f-target ra tes of HiFi and LZ3 were also highly correlated, in comparison with the r educed corr elation between the variants and WT SpCas9 (Figure 3 C).
Upon further examination, we found that 317 sgRNAtarget pairs were associated with different degrees of reduction of off-target effects when the variants were used (Fig-ure 3 D). To quantify the reduction of of f-target ef fects by the variants, we computed the ratio of off-target effects between the variants and WT SpCas9 (VT / WT ratio) for 317 1-MM sgRNA-target pairs that are associated with strong of f-target ef fects with WT SpCas9 (of f-on ra tio > 0.5). We then asked if the position of the mismatch accounts for the variation of the VT / WT ratio. Interestingly, we observed significantly lower VT / WT ratios when a mismatch occurs at positions 15-18 of the RN A / DN A duplex that interacts with the REC3 domain of Cas9 (Figure 3 E). This observation suggests that the variant-specific mutations of REC3 and the mismatches at positions 15-18 can be synergistic in blocking Cas9-mediated DNA cleavage. In addition to the mismatch positions, we also found that the nucleotide contexts of RN A-DN A mismatches are associated with different degrees of VT / WT ratios (Figure 3 F). In previous studies, we and others have reported that an RN A-DN A mismatch leads to an energy barrier during R-loop progression, where the quantity of the energy barrier is predictable from the context of the mismatch and its adjacent nucleotides ( 22 , 36 , 47 , 48 ). Ther efor e, we computed the expected energy barrier based on the sequences of the 317 sgRNA-target pairs and found that those pairs with greater energy barriers are associated with stronger reduction of off-target effects by the variants (Figure 3 G). Among the 12 single mismatch types, rUdC (i.e. uracil in RNA and cytosine in DNA), rCdC, rAdG, rAdC and rAdA are associated with greater energy barriers and smaller VT / WT ratios ( Figure  3 H), suggesting that the variants are more effecti v e in ov ercoming the off-target effects at the genomic off-target sites harboring these mismatches. Taken together, these lines of evidence indicate that the degree of off-target reduction by the variants is dependent on the position and nucleotide context of the mismatches.

GuideVar facilitates optimal sgRNA selection for Cas9 variants
We next sought to predict the sequence-specific on-target efficiency and off-target effects for HiFi and LZ3 using the da tasets genera ted in this study. Over the past years, machine learning models have been developed and trained to predict on-target efficiency and off-target effects of WT Sp-Cas9 and a few Cas9 variants based on a large volume of datasets ( 17 , 22 , 35 , 49 ). To take advantage of these pretrained models, we de v eloped a transfer learning framework, named GuideVar, by which previous models on WT SpCas9 and HF1 are 'transferred' to new models to enhance the predicti v e power on HiFi and LZ3.
GuideVar consists of two modules: GuideVar-on for the prediction of on-target efficiency, and GuideVar-off for the prediction of an off-target effect. In GuideVar-on, we transferred two long short-term memory (LSTM) models in DeepHF that were trained on > 50000 sgRNAs ( 35 ). As shown in Figure 4 A, the flattened layers of the LSTM models were conca tena ted, together with the mono-and dinucleotide sequence features (see the Methods; Supplementary Figure S11), as the inputs of a multilayer neural network that was trained using the HiFi and LZ3 data in this study. Based on 10-fold cross-validation, the transfer learning model significantly improved the predictive power of on-target efficiency compared with the two source models in DeepHF and other machine learning methods (Figure  4 B). We further evaluated the performance of GuideVar-on using an independent TTISS dataset in which the on-target indel rates were measured for 59 sgRNAs with the expression of HiFi or LZ3 ( 18 ). As a result, GuideVar-on outperformed two recent deep learning-based models for the prediction of sgRNA efficiency (Figure 4 C; Supplementary Table S11), suggesting its robustness for applications with HiFi and LZ3.
Recently, we de v eloped MOFF, an off-target predicti v e model for WT SpCas9 ( 22 ). MOFF integrates multiple sequence-dependent factors, including IME parameterized with a matrix and a GMT modeled using a CNN. We note that the degree of off-target reduction by the variants is dependent on the mismatch position (MP) and mismatch type (MT) (Figure 3 E, F). Therefore, GuideVar-off incorporates MP and MT, together with the quantitati v e measures of IME and GMT from MOFF, as the inputs for model learning (Figure 4 D). Based on cross-validation, we tested a panel of linear or non-linear machine learning models, including LR, RF, SVM and GBT. Among them, GBT achie v ed the best performance ( Supplementary Figure S12), thus it was implemented in GuideVar-off. We observ ed progressi v e improv ement in predicti v e power when the four types of sequence features (GMT, IME, MT and MP) were successi v ely added to the model (Supplementary Figure S13), suggesting that all the features are associated with off-target effects of the variants. To predict off-target effects when multiple mismatches are present, GuideVar-off adopted the ␦ coefficients in MOFF to compute the combina torial ef fect of multiple mismatches ( 22 ). We compared the performance of GuideVar-off with other state-of-the-art methods based on the TTISS dataset ( 18 ), where GuideVaroff consistently outperformed other models for both HiFi and LZ3 (Figure 4 E; Supplementary Table S12).
Finally, we built a unified computational frame wor k that integrates the GuideVar-on and GuideVar-off for the selection of optimal sgRNAs in applications with HiFi and LZ3. To test the frame wor k, we applied GuideVar to all the sgRNAs in the EpiC library (Figure 4 F), resulting in 7553 (61.8%) sgRNAs that wer e pr edicted to be of high efficiency and lo w off-tar get effects, as well as 681 (5.6%) highly off-targeting sgRNAs and 3980 (32.6%) inefficient sgRNAs. For each category of sgRNAs, we compared the dropout effects of the sgRNAs that target essential genes and non-essential genes and calculated the SSMD, a measure of the effect size of sgRNA libraries ( 40 , 50 , 51 ). As expected, the sgRNAs in the high-efficiency lo w-off-tar get group showed the highest effect size (SSMD = 1.73 and 1.64 for HiFi and LZ3, respecti v ely) (Figure 4 G; Supplementary Figure S14). The high-fidelity Cas9 variants are rarely used in high-throughput CRISPR screens due to the compromise of efficiency. Consistently, we observed lower SSMD scores for HiFi and LZ3 when the entire EpiC library was used for screening. Howe v er , with the aid of GuideVar , the selected subset of sgRNAs showed an almost equivalent effect size among WT SpCas9, HiFi and LZ3 (Figure 4 H). Ther efor e, GuideVar will potentially facilitate the usage of high-fidelity Cas9 variants for high-throughput CRISPR screens when an off-target effect is a critical concern. For the convenience of users, we computed GuideVar scores for all the sgRNAs targeting the exomes of human, mouse and monkey genomes. This r esour ce will facilitate broader application of high-fidelity Cas9 variants aided by sgRNA prioritization.

DISCUSSION
High-fidelity Cas9 variants have been de v eloped to reduce the of f-target ef fect of the CRISPR / Cas9 system, but the variants could also compromise the on-target efficiency, which limits their applications. In this study, we found that the efficiency loss of HiFi and LZ3 is guide specific and dependent on the sequence of the spacer. Importantly, the sequence context at positions [15][16][17][18] relati v e to the PAM is highly associated with the efficiency loss. Because nucleotides at positions 15-18 of the RN A / DN A heteroduplex interact with the REC3 domain of Cas9 where mutations are introduced into the variants, our results suggest a variant-specific sequence rule for on-target efficiency. Of note, most Cas9 variants de v eloped to date harbor mutations in the REC3 domain ( 12 , 13 , 43 ). We observed similar sequence rules on HiFi and LZ3 to those of HF1, another REC3 mutant variant, using a da taset genera ted from an orthogonal study (Figure 2 A-C). Therefore, the sequence rules deri v ed from this study are likel y to be a pplicable to other Cas9 variants. Despite the elaborate engineering of Cas9 variants, we also observed various degrees of offtarget reduction by the variants over a large panel of sgR-NAs and targets. The guide-specific off-target reduction is dependent on the position and nucleotide context of the mismatch. In particular, a mismatch at positions 15-18 is associa ted with ef fecti v e off-target reduction by the variants, suggesting synergistic effects between REC3 mutations and those RN A-DN A misma tches a t positions 15-18. In future study, it would be interesting to explore the structural conformations with REC3 mutations and the RN A-DN A mismatches for in-depth understanding of the mechanism underlying the synergistic effects. Together, these findings not only elucidate the guide-specific loss of efficiency and offtarget reduction, but also suggest the feasibility of optimizing sgRNA design by the implementation of sequence rules for Cas9 variants.
Gi v en the observations of guide-specific loss of efficiency and off-target reduction by the variants, it is expected that the high-fidelity variants could effecti v ely reduce off-target effects with little sacrifice of efficiency for a subset of sgR-NAs. We de v eloped GuideVar to facilitate guide prioritization in applications with HiFi and LZ3. In GuideVar, we introduced a transfer learning frame wor k by which previous machine learning models ar e transferr ed to the new model to enhance the predicti v e power. We demonstrated the utilization of GuideVar via a high-throughput screen, in which a computationally selected subset of sgRNAs exhibited improved performance with the Cas9 variants, where the effect size is equivalent to or e v en better than WT SpCas9 (Figur e 4 H). Ther efor e, we anticipa te tha t GuideVar holds potential to broaden the application of the high-fidelity Cas9 variants in scientific and clinical settings. Moreover, the success of transfer learning in our study suggests its future utilization for sgRNA optimization, considering the ra pidl y growing body of CRISPR datasets and computational models.

DA T A A V AILABILITY
The raw sequencing data (fastq files) generated in this study are deposited at the Sequence Read Archi v e (SRA) under the BioProject accession number PRJNA977711. The GuideVar scores of all the sgRNAs targeting the exomes within genomes of human, mouse and monkey are deposited at FigShare and are freely accessible at https:// figshare.com/s/721551365ff2c4b92976 .