Quantum biological insights into CRISPR-Cas9 sgRNA efficiency from explainable-AI driven feature engineering

Abstract CRISPR-Cas9 tools have transformed genetic manipulation capabilities in the laboratory. Empirical rules-of-thumb have been developed for only a narrow range of model organisms, and mechanistic underpinnings for sgRNA efficiency remain poorly understood. This work establishes a novel feature set and new public resource, produced with quantum chemical tensors, for interpreting and predicting sgRNA efficiency. Feature engineering for sgRNA efficiency is performed using an explainable-artificial intelligence model: iterative Random Forest (iRF). By encoding quantitative attributes of position-specific sequences for Escherichia coli sgRNAs, we identify important traits for sgRNA design in bacterial species. Additionally, we show that expanding positional encoding to quantum descriptors of base-pair, dimer, trimer, and tetramer sequences captures intricate interactions in local and neighboring nucleotides of the target DNA. These features highlight variation in CRISPR-Cas9 sgRNA dynamics between E. coli and H. sapiens genomes. These novel encodings of sgRNAs enhance our understanding of the elaborate quantum biological processes involved in CRISPR-Cas9 machinery.


INTRODUCTION
CRISPR-Cas9 is revolutionizing genome-editing using a single guide RN A (sgRN A) to direct precise cleavage at endogenous locations in the genome ( 1 , 2 ).The first step to engineer or modify a specific region using CRISPR-Cas9 is to computationally predict cutting efficiencies of potential sgRNAs.The CRISPR-Cas9 system depends on this designed sgRNA to target the protein complex to a region flanked by a 3 NGG protospacer adjacent motif (PAM).The CRISPR-Cas9 system is successful only if both specificity and efficiency occur at the target loci ( 3 ).To inform sgRNA sequence choices, genomic feature analyses have associa ted sgRNA a ttributes with cutting ef ficiency for CRISPR-Cas9 systems (4)(5)(6)(7)(8).
The efficiency of CRISPR-Cas9 systems is defined as the percentage of transgenic samples in which mutations are introduced at the intended target.Due to the time and effort r equir ed to transform many species, efficiency is critical.The calculated efficiency, while correlated, does not directl y ca pture the specificity (on-target versus off-target cuts) that occur.Predicting sgRNA efficiency r equir es car eful consideration of relationships among the sgRNA sequence, genomic features of the target region, and activity within the CRISPR-Cas9 system.Some of these relationships have been extensively investigated ( 9 ).Among them, nucleotide composition of the target sequence is the most thoroughly studied contributor to sgRNA efficiency ( 3 , 10-12 ).Specific nucleotide patterns have been associated with sgRNA efficiency; including the presence of guanine and absence of thymine near the PAM sequence, pr efer ence for cytosine near the cut site, and overall GC content ( 3 , 11 , 13 ).The seed region -defined as the fiv e to ten bases of the target sequence nearest the PAM -is of centr al consider ation for these patterns in sgRNA sequence composition ( 10 , 12 , 14 ).
While nucleotide sequence patterns are observed across species, their influence on CRISPR-Cas9 sgRNA association and target cleav age v aries ( 1 , 15-18 ).The added complexities of chromatin structure have started to be considered, enhancing understandings of CRISPR-Cas9 dynamics.For example, human models were expanded with information about the insertion point within the gene sequence ( 19 ) and secondary structure of the target sequence ( 20 , 21 ).Target regions with low nucleosome occupancy and high chromatin accessibility have also been investigated (22)(23)(24)(25)(26).These structural nuances underscore e v en greater variation in CRISPR-Cas9 system mechanisms across organisms.
DNA is less protected in prokaryotic cells than in eukaryotic cells because of a simpler chromatin structure; and target r egions ar e often more accessible ( 3 ).In contrast, mammalian cells have highly active non-homologous end-joining (NHEJ) systems, which induce repair mechanisms for the DNA double strand break during CRISPR-Cas9 integration.In prokaryotes, sgRNA activity is correlated with cellular survival because double stranded breaks are lethal to the cell in the absence of NHEJ ( 27 ).These pronounced differences in structure and function illuminate, in part, why models trained for mammalian species have failed to provide sgRNAs that consistently integrate with target sequences across other kingdoms.This insufficiency spurred de v elopment of organism-tailored models, including those for plants ( 1 ), yeast ( 18 ) and bacteria ( 28 ).Expanding the breadth and chemical specificity of model feature sets provides useful avenues for extending stateof-the-art sgRNA efficiency prediction to other organisms and non-model species.To achie v e this ne xt le v el of model prediction power, quantum chemical properties warrant consideration.
Bridging chemistry and physics, quantum chemical properties capture the ways in which electron density impact the reactivities and energetics of molecules.Some properties, such as the HOMO-LUMO gap (highest occupied molecular orbital-lowest unoccupied molecular orbital energy ga p; H-L ga p), describe how electron density is distributed among atoms.Meanwhile, other properties, like hydrogen-bonding energy or -stacking energy, describe how a system's total energy changes as molecules interact.Such properties depend on how the molecular electron densities shift in response to one another.Incorporating quantum chemical detail when characterizing or predicting bio-logical processes has been transformati v e for biology; providing ne w frame wor ks for vie wing processes, identifying novel features, and enhancing mechanistic understandings ( 29 , 30 ).This work spotlights quantum properties including HOMO-LUMO ga ps, hydro gen bonding, and stacking interactions to investigate the complex molecular interactions of the DNA double helix and the DNA / CRISPR-Cas9 sgRNA hybrid.
Machine learning models excel at identifying patterns in data to inform outcomes; but advances in predicti v e power are bottlenecked by the depth and breadth of training data.Current methods of fea ture evalua tion for CRISPR-Cas9 efficiency are trained on experimental sgRNA cutting efficiency data from a narrow range of eukaryotic species, including human, mouse, and zebrafish ( 4 ).While these models' species-by-species rules for sgRNA pr ediction ar e informati v e, their insights are rarely generalizable.Therefore, to de v elop advanced predicti v e models, the training data must be sufficiently detailed to capture the complexities of genomic structure and content that influence efficiencies of CRISPR-Cas9 integration and target cleavage.
Here we use machine learning approaches to unravel these species-dependent rules of sgRNA efficiency.Many current AI model generation approaches use techniques such as neural networks that can obscure associations behind a 'black box' of decision schemes.We sought to understand feature contributions to cutting efficiency for Esc heric hia coli through an e xplainab le-artificial intelligence (XAI) approach.We used iterati v e Random Forest (iRF), an XAI method designed for model transparency and feature evaluation, to assess CRISPR-Cas9 efficiency and improve our understanding of the system's underlying biological mechanisms.When trained on detailed feature sets, XAI models provide a shared basis for predicting sgRNA efficiency across organisms.This wor k e xtends sgRNA efficiency modeling to assess both E. coli and Homo sapiens da tasets.Additionally, our model integra tes a novel and interdisciplinary feature set that includes quantum chemical properties.

E. coli.
A publicly accessible E. coli dataset published by ( 28 ) was utilized.Briefly, this dataset contains 55670 unique sgRNAs that are profiled by co-expressing a genome-scale library with a pooled screening strategy.The data was compiled from three different Cas9 variations including Cas9 ( Str eptococcus pyog enes ), eSpCas9, and Cas9 ( recA).The eSpCas9 is a Cas9 that has been r eengineer ed for improved specificity and the Cas9 ( recA) was de v eloped by knockout of recA blocking DSB repair.The dataset contains both sgRNA sequence and empirical CRISPR-Cas9 efficiency scores for each of the respecti v e guides.The cutting efficiency scor es wer e calculated by taking the binary logarithm (log x ) of the selected read count to the control read count.We focused on the Cas9 dataset for analyses within this manuscript.

H. sapiens.
A publicly accessible H. sapiens dataset published by ( 10 ) was utilized.This dataset contains 1278 unique sgRNAs based on a human malignant melanoma cell-line (A375) viability analysis.The cutting efficiency was determined in the same manner as described above with the log 2 fold change calcula ted rela ti v e to the change in abundance during a two-week growth period.Additionally, a larger dataset curated by ( 33 ) which contains four pub licly-accessib le human experimental sgRNA efficiency datasets ( 11 , 19 , 39 ) including multiple cell lines (HCT116, HEK293T, HELA and HL60) was considered.Cutting efficiency values were defined as the log 2 fold change in the measured knockout efficacy.

Multi-species.
The multi-species model included sgRNA ef ficiency da ta from all previously described da tasets.When model training occurs on datasets spanning multiple species, all data is min-max normalized on a scale of 0 to 1 and composed into a matrix of sgRNA and cutting efficiency scores for model input.To eliminate species bias due to sample size, a consistent subsampling of 15 000 sgRNAs was utilized from E. coli and H. sapiens ; and the species was encoded as a binary feature.

Feature matrix
Quantum chemical pr oper ties ( 29 ).Quantum chemical pr operties pr ovide unique insights into the factors influencing sgRNA efficiency in CRISPR-Cas9 systems.Canonical of DN A-DN A and DN A-RN A duplex es wer e modeled to assess these factors.The analysis included quantum chemical properties of individual nucleotide bases; base-pairs; and base-pair dimers , trimers , and tetramers.In this way, a fourteen-Å ngstrom (4 bp) cut-off distance was invoked for long-r angeinter actions in the sgRN A. Additionall y, a new sliding-window approach for the nucleotide base positions was de v eloped for sgRNA interactions with the target DNA.Base-pair interactions were encoded into blocks, which subdivided the twenty-nucleotide sgRNA.In this approach, fiv e ranges of interactions were assessed, from intramolecular to intermolecular.HOMO-LUMO gap has been described as a signpost for a molecule's kinetic stability ( 40 ).It describes the energetics of allowed electron transitions, and the likelihood of processes involving electron mobility.Structurally, the H-L gap reflects a molecule's landscape of phase dependence for electronic wave function interactions --both constructi v e and destructi v e --tha t origina te covalent molecular interactions ( 41 ).Hydro gen bonding, meanw hile, is a contextual property.It reflects an energetic pr efer ence for arrangements of molecules in relation to one another.Hydrogen bonding directs non-covalent interactions between molecules, playing roles in thermodynamic stability and the energetics of protein folding, as two examples ( 42 ).Stacking interactions are similarly contextual interactions and occur between aromatic rings.Stacking inter actions r ange frominter actions within the rings --of the overlapping p -orbital electron density --to steric repulsions from e xocy clic groups, which are implicated in DNA twisting ( 43 ).Hydrogen bonding and stacking interactions differ in the chemical species that participate (Supplementary Figur e S2).Wher eas hydrogen bonding interactions occur between hydrogen and a hydrogen bond acceptor, stacking interactions occur between aro-matic moieties.Quantifying the energetics of these interactions complements a detailed feature set for machine learning models and sgRNA efficiency prediction.
The density-functional-based tight binding method (DFTB) is a powerful approach for large-scale atomistic simula tions and calcula ting quantum chemical properties.This work uses the DFTB3 / 3ob parameter set (third order parametrization for biological and organic systems).Calculations with the DFTB3 / 3ob parameter set yield excellent molecular geometries (Wang and Berkelbach, 2020), which compare favorably with more resource-intensi v e methods.For example, DFTB3-3OB structures exhibit maximum absolute deviations of 0.045 Å ngstr oms fr om MP2 / 6-31G(d) methods (second order Møller-Plesset perturbation theory with six-primiti v e split valence polarized Pople basis; ( 44 )).
Initial coordinates .Nucleotide coordina tes were collected from PubChem ( 45 ).B-DNA base-pairs were extracted from crystal structure data ( 46 ) (PDBID: 167D; ( 46 )).RNA hybrids and DN A-RN A hybrids wer e pr epar ed by stericsdri v en structure ov erlay in Bio via Disco very Studio software (Dassault Syst èmes , S .E.).The nucleotide base and base-pair geometries were optimized through a gradient descent method with the simulation procedure described below.By optimizing the building blocks (the bases and basepairs) and then systematically constructing kmers, we maintain an internal consistency of coordinates across the full ( n = 904) set of systems.This reduction of the full conformational freedom of the kmers was motivated by the work of Šponer et al. ( 59 ), and serves primarily to prevent assertion of an optimal geometry where the context is not considered; and secondarily to avoid specialization to a context which may not be extensible or generalizable across the di v ersity of availab le conforma tional sta tes and speciesdependent DNA winding or remodeling.After optimization, each base-pair was aligned with the xy-plane in Open-Pymol software (Schr ödinger, Inc.).In analogy to Gil et al. ( 47 ), each base-pair was then translated such that its centroid was the origin of coordinates.To complete the unambiguous set of transformations, the pyrimidine carboxyl gr oups pr ovided a final constraint.For this, the thymine carbonyl bond and cytosine carbonyl carbon were rotated to be normal to one another.K-mer construction.For all constructs, base-pairs were stacked at a distance of 3.5 Å along the z-axis, and rotated 36 degrees about their centroids (the origin).Structur es wer e pr epar ed with scripts ex ecuted in Open-Pymol.All non-chimeric single-strand combinations were assessed.In total, fiv e nucleotide bases, size base-pairs (bp), 32 bp dimers, 156 bp trimers and 716 bp tetramers were evaluated.Compiled k-mers were assessed by single-point energy calculations using the simulation procedure described below.
Simulation.Calculations were carried out at the DFTB3-D3(BJ) / 3ob le v el of theory with Grimme's D3(BJ) dispersion correction ( 48 , 49 ).Dispersion corrections were included to capture non-covalent interactions, resolving van der Waals and London dispersion forces in detail.Grimme's dispersion correction was selected to describe medium and short-range dispersion effects ( 50 ).
Additionally, a 'COSMO' model was used with water as a solvent ( co nductor-like s creening mo del).This model approximated solvent interactions and contextualized the geometries and energy calculations to a water environment.Total system energy, HOMO-LUMO gap, and other quantum tensors were compiled for assessment in an Iterati v e Random Forest model.Methods justification.To our knowledge, there is no consensus method (or pr eferr ed method) for simulating oligonucleotides beyond base-pairs.For simulations with individual nucleotide bases and base-pairs, LC-DFT is indicated in pr efer ence to high-le v el ab initio methods for recovering frontier orbital energies ( 55 ).Despite its excellent performance in this context, LC-DFT typically scales as N 4-6 , with N the number of atoms ( 56 , 57 ).Furthermore, the total number of systems to sample for all canonical gRNA sequence fragments increases combinatorically with the degree of base-pair polymerization (trimers, n = 156; tetramers, n = 716).These are competing imperati v es in this work which rapidly escalate the computational demand.Ther efor e, a low-scaling method is essential to describe e v en modest windows of the gRNA structure with quantum chemical detail.We select the DFTB method because it provides reliable geometries ( 58 ) with scaling (N3 ( 57 )) that is amenable to the large number of constructs considered in this work ( n = 904).We compare the DFTB3 / 3ob model output with LC-BLYP / pVTZ (longr ange corrected, r ange separ ated, Becke-Lee-Yang-Parr functional with triple z eta polariz ed valence basis) calculations up to base-pair dimers.We find that the top features (base-pair H-L ga p, hydro gen bonding energy, basepair stacking energy) and the regions of special interest (3 and 5 termini) are reproduced across models and methods (Supplementary Figur e S4).Mor eov er, the le v el of model prediction is consistent ( > 81%) across models, with the top fifty features emphasizing quantum chemical properties (Supplementary Figure S4).We include frontier orbital and ground-state energies at DFTB3 / 3ob, LC-DFT / pVTZ and HF / 6-31G** le v els of theory (Hartree-Fock with sixprimiti v e split valence Pople basis and d-/ p-type polarization functions) to compare the ordering of features across methods (Supplemental Table 5).
Pub lic r esour ce .All calculated quantum chemical properties for nucleotides and k-mers have been compiled in Supplementary Table S1.
Positional encoding.Matrix generation involved extraction of se v eral isola ted fea ture sets.Position-independent and position-dependent encoding of the 20 bp sgRNA were performed as described by Doench et al., 2014.Briefly, position-independent featur es wer e determined by the count of nucleotides within the 20 bp sequence both as a single base (A / C / T / G) and as paired bases (AA / AC / AG / etc.).Position-dependent featur es wer e r epresented using binary variables (1 r epr esents pr esence at that position) to encode the position of nucleotide bases up to base-pair oligomers.To describe all possible combinations, each position was described by four bits, with a binary value for each of the four possible bases (A / C / T / G).P air ed bases are further encoded with a binary value for each of the 16 possible base pair combinations.Additionally, we encoded the PAM (NGG) sequence by incorporating position-independent encoding of the N nucleotide.The combination of positional encoding approaches resulted in 384 features for each sgRNA assessed.
Further positional encoding was conducted with k-mers to extend sequence fragment descriptions to the full guide RNA.This encoding considers combinations of fragments, capturing how the fragment's context in the full transcript influences sgRNA binding and efficiency.Here, a k -mer is simply a sequence of characters in a string.We utilized k -mers to capture nucleotide neighborhoods, considering multiple base pairs with defined dependent positions.This was done through a stepwise integration of additional nucleotides as described above in a position-dependent manner including nucleotides in groups of two to fiv e.The binary matrix includes the positional encoding using a sliding window so that each position from 1 to 20 (less the k -mer length) is encoded.
'Raw' featur es .Se v er al r aw value featur es wer e determined including GC content (ratio from 0-1 r epr esenting the proportion of the sgRNA sequence that is composed of GC), temperature of melting of the DNA duplex (calculated by the Watson-Crick formula of Tm( • C) = 64.9+ 41 * (nG + nC-16.4)/ (nA + nT + nG + nC)), minimum free energy as a function of RNA structure (calculated with Vi-ennaRNA; ( 51 )), distance of the target sequence to the closest downstream PAM (utilizing the known genome assembly this was determined by the number of bases between position 20 of the sgRNA and the nearest NGG), and location relati v e to the target gene (r epr esented by TSS, TTS and quartiles of gene sequence (Q1-Q4)).These calculated values resulted in an additional 5 features for each sgRNA assessed.
Normalization and correlation assessment.The predicti v e measur e, cutting efficiency scor e, was min-max normalized to ensure tr ansfer ability across models , control, methods , datasets and species.The distributions of high or low cutting efficiency scor es differ ed between species, and these skews were maintained during normalization as to not bias the technical efficiency of one species against the other.
Additionally, values were assessed for high le v els of correla tion between fea tures.W hen any two features resulted in a correlation higher than 0.9 one of the features was removed to eliminate the chance for split weights in feature importance during model training.The feature set utilized continuous and discrete variables with varied distributions.

Iter ative r andom f or est model
Random forest (RF) is a non-linear r egr ession model which incorporates an ensemble of decision trees that trace the algorithm's decision process.Iterati v e Random Forest (iRF) expands on the Random Forest method and is described in ( 52 , 53 ).Briefly, iRF is an ad vanced form of RF that implements a boosting and feature culling process based on the feature importance values from the previous iteration's random forest.These processes further iterate and amplify the fea tures tha t repea tedly indica te high predicti v e capacity.The boosting process produces a similar effect to Lasso in a linear model frame wor k.In iRF, a Random Forest is genera ted where fea tures are unweighted and randomly sampled, at any gi v en node in the decision trees, with equal probability.This process establishes feature importance scores that are used to weight features in the next forest.This iterati v e method provides an amplifica tion ef fect, increasing the chance that important featur es ar e evaluated at any gi v en node ( 52 , 53 ).
For this study, the process of weighting and generating a new Random Forest is repeated 10 times with 1000 trees and incorporates a 5-fold cross validation.For each run, the data is separated into an 80 / 20 training / test split where 80% of the data is used for training the model and the remaining data (not utilized in model training) is used for testing.Each feature is ranked by its importance in the tree building step and the direction of impact is determined based on the signed correlation of feature value and cutting efficiency value.Specifically, the feature matrix described above (incorporating positional encoding of the sgRNA nucleotide composition through a one-hot binary method, and quantum chemical properties) is utilized to predict the sgRNA cutting efficiency.Multiple iRF models were produced with different feature sets to assess top importancescoring, highly-predicti v e features; details on these models are in Table 1 .
Model assessment across quantum calculations.The iRF model was run using LC-DFT and DFTB3 quantum calculations for nucleotide base, base pair, and base-pair dimer kmers of the sgRNA to assess the variation in model output when featur es wer e calcula ted a t dif ferent le v els of quantum theory (Supplementary Figure S5).Both the predictability of the iRF models (Supplementary Figure S5A) and the top importance features (Supplementary Figure S5B) were identified.The HL-ga p, hydro gen bonding, and stacking energies are maintained as key features across the models.These properties' locations within the 3 region of the sgRNA are also conserved (Supplementary Figure S5B).It is observed that 94 out of the top 100 features are consistent between the two models.

Validation
Oligo design.As part of another project, 120000 unique synthetic guide RNAs (sgRNAs) were synthesized by Agilent (Santa Clara, CA) as 90mers consisting of the 35-bp J23119 Anderson promoter, the 20-bp spacer, and the 35bp 5 end of the sgRNA.The pool of single-stranded DNA molecules recei v ed from Agilent were dissolv ed in 200 l Elution Buffer (EB; 10 mM Tris, pH 8.0) and heated at 42 • C until all visible solids had dissolved.
The sgRNA sequences were picked to minimize potential specificity issues.Namel y, onl y sequences with unique seeds and no matches to sequences adjacent to non-canonical PAMs were chosen.Additionally, Cas9 was expressed at a moder ate r ather than a high le v el, which should further reduce the effects of decreased specificity.
Oligo processing and donor library production.Once in solution, second strand synthesis proceeded by mixing 20 l oligo solution, 2.5 l 10 M oWGA139, and 25 l NEB 2 × Q5 Hot Start Master Mix into each of two 0.2 ml PCR tubes.This reaction was placed in a thermal cycler and the following program was run: 98 • C 1 min, 68 • C 10 s, 72 • C 5 min, 4 • C hold.Once the hold was reached, 2.5 l 10 M oWGA140 was added to each tube, and the following program was run: 98 • C 1 min; 25 cycles of 98 • C 5 s, 68 • C for 10 s, 72 • C for 15 s; 72 • C for 5 min, 4 • C hold.The PCR product was purified by running on a 3% agarose gel in TAE buffer until the band w as halfw ay down the gel, removing a gel slice containing the band, and purifying the DNA with the NEB Monarch DNA Gel Extraction Kit.To clone this purified double-stranded oligo, a PCR of the pSS9-gRNA vector ( 62 ) was performed using the NEB Q5 Hot Start 2X Master Mix as manufacturer's instructions with the primers oWGA137 and oWGA138.5 l DpnI (NEB) was added directly to the PCR product and incubated at 37 • C overnight.This reaction was cleaned by adding 45 l water and 40 l MagBio HighPrep PCR Cleanup magnetic beads and following manufacturer's instructions.Quality of this backbone vector fragment was empirically tested by transforming 100 ng into Lucigen E. cloni Supreme DUO electrocompetent cells per manufacturer's instructions; as ∼40 colonies were produced in this reaction, the backbone was considered suitable for Gibson assembly.100 ng of vector backbone and 17.56 ng oligo library PCR product were added to a 20 l NEBuilder HiFi Assembly reaction per manufacturer's instructions (a 5:1 insert:vector ratio).Afterwards, the reaction was cleaned via drop dialysis with a 0.02 micr on nitr ocellulose filter floated on 250 ml of 18.2 M water. 2 l of this dialysis product was added to each of 6 25-ul Lucigen E. cloni Supreme DUO comp cell aliquots, and electroporation was performed using a BioRad Mi-croPulser set to the E. coli 1 program.950 l Lucigen Recovery Buffer was added to each transformation, these six solutions were combined, and the totality was incubated at 37 • C for 1 h.Afterwards, this recovery culture was added to 100 ml LB + 100 g / ml carbenicillin and incubated at 37 • C ov ernight.The ne xt morning, the culture was harvested, the supernatant was discarded, and the cell pellets were suspended in 200 ml + 200 g / ml carbenicillin and incubated at 37 • C for four h.This fresh culture was harvested via centrifugation, then plasmid DNA was extracted using a Zymo Resear ch Midipr ep Kit.This library, r eferr ed to her eafter as the Donor Library, was sent for sequencing by Illumina.
Host competent cell pr epar ation.E. coli K12 MG1655 containing pX2-Cas9 was struck from a freezer stock to an LB + 50 g / ml kanamycin agar plate, and a single colony was picked to 5 ml of LB + kan in a culture tube.This culture was incubated at 37 • C overnight, then it was used to inoculate 110 ml of LB + kan + 0.2% arabinose to an OD600 to 0.1.Growth was monitored over the course of 3 h, and 100 ml of the cells were harvested once an OD 600 of ∼0.6 was measured.The cell pellet was washed 3x with 10% glycerol, then suspended to a final volume of 1000 ul.
Host library production.To pr epar e the library for transformation, 10 g of library vector was mixed with 100 pg each of pWGA128 and pWGA130 to provide internal controls for escape and cutting, respecti v ely.52.5 l of con- trolled library solution was added to 525 l of electrocompetent cells, and 55 l of this mixture was electroporated in a 1 mm gap cuvette by a BioRad Micropulser set to the E. coli 1 program 10 times in total, adding 950 l of SOC to each electroporation immediately after the shock.These 10 ml of cells and media were combined into a culture tube, then 1 l of 1 ng / l pWGA129 plasmid was added to the culture to act as a control for washing.The recovery culture was incuba ted a t 37 • C for 1 h, then centrifuged and the superna tant aspira ted.The cell pellet was suspended in then entir ely transferr ed to 100 ml of LB + kan + 100 g / ml carbenicillin and cultured at 37 • C for 6 h.50 ml of culture had cells harvested, washed once with 50 ml DNAse I Wash Buffer (10 mM Tris, 2.5 mM MgCl 2 , 0.5 mM CaCl 2 ), then the pellet was suspended in 1.5 ml of DNAse I Wash Buffer and transferred to a 1.5 ml microcentrifuge tube.This pellet was washed two more times with 1.5 ml DNAse I Wash Buffer, then suspended in 990 l of the same.10 l of NEB DNAse I was added, and the reaction was incubated for 15 minutes a t 37 • C .The cell pellet was harvested, and plasmid DNA was extracted using the NEB Monar ch Minipr ep Kit.
This library, r eferr ed to her eafter as the Host Library, was sent for sequencing by Illumina.
Illumina library preparation.For both Donor and Host Library plasmid pools, PCR using 'phased primers' amplified the gRNA spacers to be sequenced.Briefly, fiv e forwar d and fiv e re v erse primers were made, each that bound to the same site of the library plasmids but possessing zero to four additional random bases at the 5 end; the purpose of these extra bases is to alleviate issues inherent to sequencing amplicons on Illumina platforms, as the very low complexity of amplicon molecules pre v ents high-quality base calls from being made by the Illumina system.These forward and re v erse primer mixes were used together with 30 ng of library vector in a Q5 Hot Start PCR as per manufacturer's instructions.
The PCR product was purified on 3% gel and isolated using the NEB Monarch DNA Gel Extraction Kit, quantified using the Promega Quantus fluorometer and the QuantiFluor ONE dsDNA System, and then used in the NEBNext Ultra II DNA Library Prep Kit for Illumina.During the amplification step, one forward and one re v erse primer per library from the NEBNext Multiplex Oligos for Illumina (Dual Index Primers Set 1) kit was used to complete the protocol.
Illumina library sequencing.Sequencing libraries were validated on an Agilent Bioanalyzer using a DNA7500 chip, and the final libr ary concentr ation was determined by Qubit (Thermo Fisher Waltham, MA) with the broad range double stranded DNA assay.A paired end sequencing run (2 × 75 bp) was completed on an Illumina MiSeq instrument (Illumina, San Diego, CA) with 20% PhiX spike in using v3 chemistry following the manufactures denaturing and loading recommendations.The Donor Library was sequenced using three separate sequencing runs, while the Host Library was sequenced using two separate runs.
Data processing.P air ed end r eads wer e merged using the USEARCH -f astq mer g epair s command with thefastq maxdiffs option set to 10 ( 63 , 64 ).The merged pairs were then matched to the original design file in FASTA format using USEARCH ( usear ch -usear ch local path / to / r eads .fastq -db path / to / designs .fasta -str and both -threads 30 -id 0.9 -top hit only -tar g et cov 1 -user out path / to / output.tsv-userfields query + tar g et + id + qr o wdots + tr o wdots + qr o w + tr o w ).The output .tsvfile was parsed in Python to count the instances of perfect reads for each design in the design file and to calculate the abundances of perfect reads for each design (calculated as number of perfect reads for a design divided by the number of merged pairs used in the initial USEARCH matching, and these data for all runs were combined into a single spreadsheet.In that spreadsheet, new columns to calculate values for the average donor library reads, average donor library ab undance, avera ge host library abundance, host wash control correction, delta, log 2 delta, and P-value for each design.Briefly: average donor library reads, donor abundances, and host abundances were calculated with the AVERAGE function in Excel; the host wash control corrected abundance was calculated as the average host library abundance minus the product of the ratio of the average host abundance of the pWGA129 wash control sequence, the ratio of the total mass of DNA used (1 g) to the ratio of wash control plasmid added to the recovery (1 ng), and the donor abundance of the cassette; delta was calculated as the host wash control corrected abundance divided by average donor library abundance; log 2 delta was calculated by log2 delta; and the P -value was calculated by the Student's t -test function in Excel for a two-tailed heteroscedastic test.Designs possessing either less than 10 average donor library reads or a Pvalue equal to or greater than 0.05 were moved to 'graveyard' tabs in the spreadsheet and excluded from any further analysis.The resulting validation data are contained in Supplementary Table 6.
The iRF model was run using the same criteria as described in the 'Iterati v e Random Forest Model' section above.A classification iRF model using five-fold crossvalidation with 1000 decision trees per random forest and ten iterations was trained with the Guo et al.Cas9 sgRNA dataset.Classes were defined by sgRNAs with bad (0) or good ( 1 ) cutting efficiency scores as determined by the 2500 guides at each tail of the Log2 normalized score distribution.The cross-validation test resulted in an ROC AUC of 0.8606.This model was then used to predict the class for a novel 120k sgRNA E. coli dataset as described above.Specifically, the 2500 guides at either tail of the Log 2 normalized score distribution were assessed and found to have an accuracy of AUC = 0.7633.See supplemental for classification model test results (Supplementary Figure S5).

Adv anced featur e engineering metrics
In combination with iterati v e Random Forest, in-house scripts for advanced interpretation of machine learning output were utilized: Random intersection trees (RIT) uses binary predictor variables to identify interactions among features in a model.In short, RIT starts by collecting the full set of matrix variables in a 'high level interaction'.Classes of observations are assigned to each variable and then, through iteratively making a random observation and identifying which variable wer e pr edicti v e of that observation, the interaction set is narrowed to a subset of patterns in the prediction variables ( 53 , 54 ).This process is repeated for each observation class.The algorithm works by assessing the node-split forest paths from iRF to find features that occur consecuti v ely along the path (more often than would be expected by chance).The result of RIT is a set of interactions that are retained with high probability, potentially in a non-linear manner, and are ther efor e consider ed informati v e to the model as a whole when the features are in combination.
For these analyses, we use internal R and Python scripts designed for iterati v e and e xtensi v e tree-based random forest decision processes.These scripts extend conventional RIT methods for enhanced feature engineering and model comparison.They provide metrics for feature interpretation, including normalized importance scores (for comparison across models), and fea ture ef fect scores (the magnitude and direction of that feature's influence on the model).We also identify and characterize interacting features.For this, the number of samples captur ed, RIT (pr evalence of a feature set in the model), adjusted RIT (the difference between the model and e xpected pre valence of a feature set), and set importance, are key indicators.

Feature importance with iterative random forest
We assess iRF-based predicti v e models that le v erage quantum descriptors for multiple degrees of base-pair polymeriza tion.This da ta ca ptures interactions within the sgRN A nucleotide sequence, along with properties of the individual bases.This approach combines the increased interpretability of XAI methods for feature interpretation with the novel incorporation of quantum chemical properties to further mechanistic understandings of CRISPR-Cas9.Models were generated for sgRNA efficiency in E. coli and H. sapiens .Varia tions in fea tur e importance across kingdoms wer e also assessed.
A pub licly availab le E. coli dataset was used to generate a detailed feature matrix to determine highly influential properties for predicting CRISPR-Cas9 cutting efficiencies (Figure 1 ).In this model, the dependent variable (Yvector) is the experimental cutting efficiency score for each sgRNA.We used quantum chemical features of nucleotides to capture the intricacies of the multi-step CRISPR-Cas9 mechanism.In addition, base-pair oligomers ( k -mers) up to tetramers were incorporated, describing nucleotide position in the sgRNA structure with binary encoding (onehot k-mers) and quantum chemical properties (quantum k -mers).These k-mers assess the subsequence structure of the 20bp sgRNA by segmenting the sequence into fragments of one to four bases.These features provide a basis for interpreting the influences of independent and dependent nucleotides on sgRNA efficiency, and progress structurally informed understandings of sgRNA efficiency.Featur es pr eviously determined to influence CRISPR-Cas9 efficiency in mammalian species were also encoded, including GC content, melting temperature, and one-hot encoding of single and paired nucleotides in the target sequence.We also included metrics such as the distance between the target sequence and the nearest PAM (NGG sequence) and location of the target sequence relati v e to the nearest gene.To sharpen insight into the contributions of each feature, predicti v e capabilities were assessed by Pearson correlations and accuracy ( R 2 ) metrics (Supplemental Table 1).The iRF model using the complete feature matrix results in a predicti v e accuracy of 0.51, which is competiti v e with the most predicti v e models currently available.Quantum k-mers and one-hot k-mers contribute the largest feature set contributions, isolating essential features for sgRNA engineering.

Feature engineering highlights the role of quantum mechanics
A feature engineering approach clarifies the factors influencing sgRNA efficiency by identifying the model's most important variables.A total of 6232 features were used in an iRF model for E. coli (the full E. coli feature set; Supplementary Table S2).This model was trained on 32 374 sgRNA and tested on 8094 sgRNA sequences.This complete fea ture ma trix cast as an iRF frame wor k resulted in significant correlations between predicted and experimental sgRNA cutting efficiency values (Figure 2 A).Furthermore, high prediction le v els were found in the iRF model using only the quantum chemical properties feature set; and accuracy incr eased incr ementally as additional featur es and kmers were incorporated (Figure 2A; Supplementary Figure S1; Supplemental results).Below, we focus on a subset of fea tures tha t were the most predicti v e of sgRNA efficiency (Figure 2 B, C).
Based on feature importance values produced by iRF (Supplementary Table S4), top features emphasized positionally-encoded k-mers of quantum chemical properties and one-hot encoding of the target sequence (Figure 2 B).These top features were maintained as highly important across two quantum chemical methods (Supplementary Figure S5).Top features were localized to positions 18 through 20 of the target sequence.This region is proximal to the sgRNA tailpin structure, the target DNA PAM sequence, and cut site for the Cas9 nuclease.The most important feature was the HOMO-LUMO energy gap for the base pair at position 20 of the target sequence (Figure 2 B).This feature alone accounted for more than 6% of the variance in empirical sgRNA efficiency.The next most important feature was the base-pair dimer stacking energy at bases 19 and 20; accounting for ∼3% of the variance.Hydrogen bonding energy of the base pair at position 20 shows a similar contribution.Following these features in importance, we observ e se v eral position-dependent base pair dimer , trimer , and tetramer quantum chemical properties.Each of these features accounted for 1-2% of the dataset variance.Se v eral one-hot encoding sequences were also important features, including cytosine positions 15 and 16 (CC pos15) and a CCA beginning at position 19.Additionally, se v eral features with high feature importance scor es ar e consistent with trends in the literature, including GC content and melting temperature ( 31 ).
Each decision tree within an iRF model selects features based on their contribution to the predicti v e ability against the experimental dataset.Because of complex relationships between features and sgRNA, individual features may not be influential for each sgRNA in the model.Ther efor e, each feature's average number of affected sgRNAs was also calculated for all decision trees within the iRF model.This average was compared with the total number of training data sgRNAs to determine the relati v e proportion of sgR-NAs that each feature influenced (Figure 2 C).The twenty highest-magnitude fea tures af fected between 20% and 85% of sgRNA samples.The three most-commonly influential features included the base pair 20 HOMO-LUMO gap energy, the base-pair 19-20 dimer hydrogen bond energy, and the base-pair 1-4 tetramer hydrogen bond energy (Figure 2 C).These top features span both positi v e and negati v e associations with predicted cutting efficiency scores.

Assessing feature association with sgRNA efficiency
Each feature was assigned a direction (positi v e or negati v e) and ef fect size, calcula ted with a random intersection tree (RIT)-based approach ( 32 ) (Figure 3 A, Supplementary Figure S3).These components describe a feature's relationship with the cutting efficiency score, allowing for greater interpreta tion of tha t fea ture's role in the CRISPR-Cas9 mechanism.For example, it has been shown that higher melting temperatures and greater GC content decrease guide ef ficiency ( 31 ).This anti-correla ted rela tionship is demonstrated in our model by a negati v e feature effect value (Figure 3A, C).Among the important features, both positi v e and negati v e correlations with the predicted cutting efficiency score were observed (Figure 3 ).
The top positional encoding features also showed varied directions of correlation with sgRNA cutting efficiency.Two essential features are the HOMO-LUMO gap and the hydrogen bond strength at position 20 between the sgRNA and target sequence (Figure 3A, C, Supplementary Figure S2).The HOMO-LUMO gap is positi v ely correlated with sgRNA cutting efficiency, while the hydrogen bond strength at the same position is anti-correlated.Further, the direction of the hydrogen bond str ength featur e effect v alue v aries with position and encoding length --whether base-pair monomers , dimers , trimers , or tetramers are consider ed.Hydrogen bond str ength at positions 18-20 hav e negati v e effects, w hile hydro gen bond strength at position 1 has a positi v e effect.This contrast indicates varied pr efer ences for hydrogen bonding energy across regions of the target sequence.One-hot encoding indicates position 15 CC as anti-correlated, while position 19 GC is positi v ely correlated with sgRNA cutting efficiency.Additionally, our model indicates that increased distance to PAM is anti-correlated with sgRNA cutting efficiency.

Quantum chemical insights into kingdom-specific dynamics
Current species-trained models are inadequate for prediction across organisms.To assess the organism specificity of our iRF model, we tested the efficacy of the full E. coli -trained model on se v eral pub licly accessib le H. sapiens datasets ( 10 , 33 ).The r esulting pr edictions wer e extr emely poor, with a Pearson correlation of 0.016.This supports that key features identified by models trained on experimental data from a single species are not predicti v e across species, particularl y w here varied CRISPR-Cas9 interactions and complex DNA structures contribute.
E. coli and H. sapiens r epr esent differ ent kingdoms, Eubacteria and Animalia.These classifications span singlecelled to multi-celled or ganisms; varied or ganellar makeup; and di v ersity in genomic and epigenomic structures and compositions.To compare the predicti v e capability of the newly integrated feature set across kingdoms, we generated a model trained on H. sapiens -specific data ( 10 , 33 ).The full fea ture ma trix was generated as described for the E. coli model, using the specified sgRNA sequences in the Doench et S3).The iRF model was pr epar ed with the same fiv e-fold cross validation scheme as for the E. coli models.The resulting model had a Pearson correlation of 0.50 (Table 1 ).This is competiti v e with se v eral of the top predicti v e cutting efficiency models currently available for human genome editing ( 4 ).
To explore this result, we cross-examined the twenty highest-scoring features for each species-trained model (Figure 3 ).Similarly to those identified in the E. coli model, quantum chemical tensors in the target sequence's seed region (sgRNA position 10-20, closest to the PAM sequence) appear to dri v e the H. sapiens model prediction (Figure 3 B).While quantum chemical properties as a feature set are important for sgRNA efficiency in both E. coli and H. sapiens , the most predicti v e features vary considerably between models.In E. coli , the occupancy of frontier orbitals for PAM-adjacent nucleotides was a driving factor in CRISPR-Cas9 cutting (Figure 3 C).In the H. sapiens model, howe v er, properties in central regions of the target sequence were highlighted (positions 5-15; Figure 3 D).Key features for this model emphasize hydrogen bonding energy and stacking interactions, along with electron occupancy (Figure 3B / D).These features signpost novel mechanistic interpretations focused on central regions of the target sequence that can be explored in future biological studies.

DISCUSSION
Current sgRNA efficiency prediction models are limited by a narrow range of species data for training.To enhance the predicti v e accuracy, many models use deep learning techniques that can obscure interpretability of their feature effects.This work expands understanding of the factors influencing sgRNA efficiency for a bacterial dataset with an e xplainab le-AI method.Furthermore, we incorporate quantum chemical properties to provide novel insights into sgRNA ef ficiency, d ynamics, and propose interpretations for the CRISPR-Cas9 mechanism.
A panel of E.coli sgRNA sequences were encoded into a ma trix incorpora ting PAM sequence distance, sgRNA melting temperature, GC content, one-hot binary encoding, and quantum chemical properties.This detailed feature set was used to train an XAI iRF model against experimentally calcula ted cutting ef ficiencies.The goal of this model was to predict on-target CRISPR-Cas9 activity to better understand sgRNA efficiency in the genome.The predicti v e capacity of the machine learning model was enhanced by advanced k-mer features (binary and quantum) and is competiti v e with currently available models.The XAI methodology permitted investigation of the underlying features by quantifying feature importance scores.
Quantum chemical properties carried the highest importance for prediction of sgRNA efficiency.This feature set is novel in the domain of CRISPR-Cas9 models and enhances the model's biological interpretability.Beyond position-specific sequence information, which is commonly encoded in a binary matrix, quantum chemical properties signpost the varied nucleotide interactions tha t media te the CRISPR-Cas9 mechanism.The sgRNA seed region featured quantum properties with high predicti v e capacities.Descriptors of hydrogen bonding energy, stacking interactions, and HOMO-LUMO gaps enrich the interpretation of why this region plays a vital role in CRISPR-Cas9 ef ficiency.Particularly, we note indica tions of mechanistic competition for pr eferr ed structural features.We focus on three main themes: locality in the 'seed r egion', degr ee of base-pair polymerization, and mechanistic competition (Figure 3 C).
The 'seed' region, the fiv e to ten base pairs on the target sequence's 3 end nearest to the PAM sequence and cleavage site, has been a focus of sgRNA construction across mammalian species.Se v eral of the top E. coli -based features, especially quantum properties for base-pair and dimer structur es, ar e essential in positions 18-20 (Figure 3A; Figure 3 C).In contr ast, tetr amer features in the seed region are highlighted further from the PAM sequence.These differences suggest a structur e-activity r elationship and may indica te separa te mechanistic steps involving these regions.
Looking to the multi-step CRISPR-Cas9 mechanism, we postula te tha t considering both DN A-DN A double helix unwinding and subsequent DN A-RN A binding are essential for interpreting these results.
This distinction can be seen when interpreting a positi v e correlation between the hydrogen bond stacking energy at position 18 of the target sequence (Figure 3 A).Mechanistically, this indicates that position 18 is important for DNA-RNA binding.Once helix melting has been initia ted a t the target sequence's 3 end, the remaining sequence composition is potentially less important for unwinding.Ther efor e, we suggest that while lower hydrogen bond strength at position 20 is energetically preferable for DNA double helix melting, higher hydrogen bond strength at position 18 is important for strong DN A-RN A binding (Figure 3 A).
In another view, a positive correlation between the HOMO-LUMO gap energy and cutting efficiency is observed at position 20.The HOMO-LUMO gap may capture conformational changes that are occurring during the initial integration of the CRISPR-Cas9 complex.We note recent work identifying the 'phosphate lock loop' in this interpreta tion.W hen the PAM sequence is identified and bound, the DNA kinks to enable DNA helix unwinding and permit DN A-RN A binding.These structural e v ents are stabilized by a phosphate lock loop proximal to the PAM (34)(35)(36)(37).While a high HOMO-LUMO gap in this region may describe a change in molecular stability, the weaker hydro-gen bonding may relate to the DNA double helix unwinding that follows.
A further discovery was the variation in influential properties for sgRNA efficiency across species.The novel quantum chemical property feature set is tr ansfer able across species because of its construction from simple nucleotide sequences.While the model does not provide a comprehensi v e vie w of nucleotide binding and interactions in complex genomes, it does provide a structural grounding for mechanistic interpreta tions tha t cannot be captured with conventional binary encodings.Species-tailored iRF models generated utilizing E. coli and H. sapiens data exemplify this incr ease in interpr etability.Mor eover, the H. sapiens model provides sufficient predicti v e power while also allowing for feature engineering insight.Such analytical interpretability is not currently available with other top predicti v e models in the field due to their deep learning focus ( 38 ).Furthermore, our model performance points to the beneficial integration of quantum chemical properties, not only for interpretation but also for sgRNA efficiency prediction.
This novel feature set was shown to be of high importance across species; with each species model emphasizing different quantum chemical properties or locations of interest.This highlights uses for understanding complex mechanisms across di v erse species and datasets.For example, recent work, including that by Palermo et al., has made strides in its combination QM / MM description of the S. pyogenes CRISPR-Cas9 system, with particular focus on the scissile phosphodiester bond [1,2].In light of our work, sampling a larger region of the guide and a range of transcript candidates, focusing on the p -20 to p -15 region, could provide powerful insight into the S. pyogenes CRISPR-Cas9 mechanism and its relationship to guide sequence composition ( 60 , 61 ).
Varia tion in fea ture importance across models supports critiques that organism-tailored models are not transferable across species.This was further emphasized by the very low performance of the E. coli trained model as a predictor of H. sapiens sgRNA efficiency.Not only is model exploration across species needed; new datasets with standardized scoring methodologies ar e r equir ed for such studies to be conducted.Current publicly accessible cutting libraries vary across factors that can greatly influence the calculated cutting efficiency scores; including sgRNA sequence length, type of assay, and the use of a dead Cas9 (dCas9), Cas9 with a non-complementary sequence as the guide, no Cas9 or no guide as a control.Each of these variations influence the reported efficiency values and complicate synthesis of multiple datasets within a single model or comparison across datasets.Ther e ar e further challenges in assessing cutting specificity in bacteria due to the general lack of NHEJ machinery.This causes many double stranded breaks to result in cell lethality.Future work will generate genome-wide cutting efficiency libraries to expand on currently available data as well as standardize libraries across species for comparison and model integration.This work established a novel feature set, with quantum chemical tensors, that advances the mechanistic interpretation and predicti v e accuracy of the model and will become a r esour ce for continued work in the field.Initial insights into essential variables for understanding the CRISPR-Cas9 mechanism have been identified through feature engineering techniques.These advances provide avenues for improving CRISPR-Cas9 sgRNA generation and identify points of interest for experimental follow-up in the CRISPR-Cas9 mechanism.Furthermore, these advances provide insights into species variation of CRISPR-Cas9, and indicate methods for predicti v e model enhancement.

DA T A A V AILABILITY
No new data were generated or analysed in support of this r esear ch, except for guide RNA sequences and depletion scores, which are included in supplemental material.Source code for data processing can be found on GitHub under an MIT license.Code for generating the fea ture ma trix can be found at https://github.com/nosha003/sgRNA iRF .Code for iterati v e Random Forest (iRF) can be found at https://github.com/jailGroup/RangerBasediRF .

NOTES ADDED IN PROOF
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy.The United States Government retains and the publisher, by accepting the article for publica tion, acknowledges tha t the United Sta tes Government retains a non-e xclusi v e, paid-up, irre vocab le, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.The Department of Energy will provide public access to these results of federally sponsored r esear ch in accordance with the DOE Public Access Plan ( http://ener gy.gov/do wnloads/doe-public-access-plan ).

SUPPLEMENT ARY DA T A
Supplementary Data are available at NAR Online.

Figure 1 .
Figure1.Explainable-AI method for analysis of feature importance on prediction of sgRNA ef ficiency.Fea tures are forma tted to generate a wide matrix with rows r epr esenting each sgRNA, corresponding experimental cutting efficiency scores and columns for all feature values.This information matrix is analyzed with an iterati v e Random Forest (iRF) method.

Figure 2 .
Figure 2. Identifying model variation based on feature input and assessing feature importance in E. coli .( A ) Violin plot of R 2 values based on iRF model genera tion with isola ted fea ture input (fea ture ca tegories described in Table 1 ).( B ) The top 50 features from the full fea ture ma trix iRF model ranked by normalized feature importance score and color-coded by feature category.( C ) Dot plot of features from full feature matrix iRF model showing the number of samples (sgRNAs) that were influenced by that feature (y-axis) versus the normalized importance of the feature (x-axis).Color temperature increases with the feature effect score (red, negati v e; b lue, positi v e) and dot size is scaled by the normalized importance score.( D ) Violin plot of R 2 values for the top 5, 10, 20, 50, 100, 200, 500 and 1000 features, based on full feature iRF model output.There is a plateau of information gained from including features with low importance scores.

Figure 3 .
Figure 3. Explainable-AI interpretation through iRF output metrics and featur es' dir ectional influence on cutting efficiency.(A, B) The top 20 features from the full fea ture ma trices ranked by normalized importance score and color-coded by the direction of the effect.Positi v e correlations with the cutting efficiency score are blue while anti-correlations with cutting efficiency score are pink for E. coli ( A ) and H. sapiens ( B ). (C, D) sgRN A-DN A interaction highlighting quantum chemical features of top importance, their locations, and correlated associations with cutting efficiency scores in E. coli ( C ) and H. sapiens ( D ).DNA strand r epr esented in gray (target sequence) and blue (target complementary sequence), sgRNA shown in yellow, and PAM sequence displayed with NGG stars.The feature effect direction is indicated with arrows, up (blue arrow) indicates a positi v ely correlated relationship between the feature value and the cutting ef ficiency value.Fea ture bars indica te quantum properties (HL gap, purple; Stacking interactions, green; H-bonding, blue) and the length of the bar indicates the k -mer size.Multi-colored bars indicate the same k-mer at the same position has multiple features assessed as highly important.The E. coli (C) model shows e xtensi v e localization of important features, primarily bp, trimer and tetramers at positions 11-20.Hydrogen bonding has outlier importance at position 1-5.Hydrogen bonding and stacking energy features are observed in both correlated and anti-correlated relationships with cutting efficiency (depending on their k -mer and position) while HL-gap is consistently a positi v e relationship nearest the PAM sequence.The H. sapiens (D) model has lesser feature localization, with many features overlapping in positions 5-15.For features of high importance (hydrogen bonding, stacking energy, and HL-gap), the feature-specific directional effects span both positi v e and negati v e relationships with cutting efficiency, dependent on the feature length and position.Similarly to the E. coli model, bp, trimers and tetramers are the most predicti v e.The number of electrons H. sapiens a top feature for the H. sapiens model that is not among the top feature in E. coli .