Structural basis for substrate recognition and cleavage by the dimerization-dependent CRISPR–Cas12f nuclease

Abstract
Cas12f, also known as Cas14, is an exceptionally small type V-F CRISPR–Cas nuclease that is roughly half the size of comparable nucleases of this type. To reveal the mechanisms underlying substrate recognition and cleavage, we determined the cryo-EM structures of the Cas12f-sgRNA-target DNA and Cas12f-sgRNA complexes at 3.1 and 3.9 Å, respectively. An asymmetric Cas12f dimer is bound to one sgRNA for recognition and cleavage of a dsDNA substrate with a T-rich PAM sequence. Despite its dimerization, Cas12f adopts an activation mechanism conserved among the type V nucleases, which requires coordinated conformational changes induced by the formation of the crRNA-target DNA heteroduplex, including the closed-to-open transition in the lid motif of the RuvC domain. Only one RuvC domain in the Cas12f dimer is activated by substrate recognition, and the substrate bound to the activated RuvC domain is captured in the structure. A truncated sgRNA designed with the aid of the structure, which is less than half the length of the original sgRNA, is still active for target DNA cleavage. Our results expand our understanding of the diverse type V CRISPR–Cas nucleases and facilitate potential genome editing applications using the miniature Cas12f.


INTRODUCTION
Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated (Cas) proteins constitute adaptive immune systems in bacteria and archaea that protect against infection by mobile genetic elements (MGEs) (1-4). In Class 2 CRISPR-Cas systems, a single effector nuclease associates with guide RNAs (gRNAs) to recognize target DNA with complementary sequences. Class 2 systems are further divided into three types: type II exemplified by Cas9 nucleases, type V represented by Cas12 nucleases, and type VI epitomized by Cas13 nucleases (5). Type V systems have the most subtypes discovered to date, including Cas12a-k (5,6). Most Cas12 nucleases target double-stranded DNA for cleavage, with the exception of Cas12g, which targets RNA substrates (6,7), and Cas12k, which is inactive for substrate cleavage (8).
Cas12f, also known as Cas14, is the smallest class 2 CRISPR-Cas effector reported to date, with a length of ∼400-700 amino acids (9). Cas12f proteins were identified almost exclusively within a superphylum of symbiotic archaea, DPANN (9). Initially found to be specific for ssDNA, Cas12f was recently reported to also recognize dsDNA with 5′ T-rich protospacer adjacent motifs (PAMs) (10). Cas12f associates with a crRNA and a tracrRNA, which can be fused into a single guide RNA (sgRNA), to target substrate DNA. Cas12f is a Mg2+-dependent endonuclease that functions best in low salt concentrations and at ∼46 °C (10). Similar to other Cas12 nucleases, Cas12f is capable of cleaving non-specific ssDNA in trans after binding complementary target DNA, thus enabling its development for nucleic acid detection (9).
The plasmid encoding full-length Cas12f (Un1Cas12f1) with an N-terminal 10xHis-MBP-tag was purchased from Addgene (#112500). The plasmid was transformed into Escherichia coli BL21(DE3) cells, which were grown to OD600 = 0.5 in Terrific Broth (TB). Protein overexpression was induced by adding 0.5 mM IPTG followed by incubation at 18 °C overnight. The cells were collected, resuspended in buffer A containing 25 mM Tris-HCl (pH 7.6), 1 M NaCl, 5% glycerol, 1 mM PMSF and 5 mM β-mercaptoethanol, and disrupted by sonication. Cell lysate was clarified by centrifugation. The supernatant was loaded onto Ni-NTA resin and washed with buffer B containing 25 mM Tris-HCl (pH 7.6), 1 M NaCl, 30 mM imidazole, and 5 mM β-mercaptoethanol, and the Cas12f protein was eluted with buffer B supplemented with 250 mM imidazole. The His-MBP-tag was removed by overnight incubation with TEV protease at 4 °C. The target protein was exchanged into buffer C containing 25 mM Tris-HCl (pH 7.6), 500 mM NaCl, 2 mM DTT and 5% glycerol, loaded onto a HiTrap SP HP column (GE Healthcare), eluted with a linear NaCl gradient (0.1-2 M), and further purified by size-exclusion chromatography over a Superdex 200 column (GE Healthcare) in buffer D containing 25 mM Tris-HCl (pH 7.6), 150 mM NaCl, 2 mM DTT and 1 mM MgCl2. Fractions were concentrated and stored at -80 °C.
To assemble the Cas12f-sgRNA binary complex, Cas12f proteins were incubated with sgRNA (Supplementary Table S1) at a ratio of 1:1.2 at 37 °C for 30 min in buffer D. To reconstitute the Cas12f-sgRNA-target DNA complex, Cas12f D510A mutant proteins were incubated with guide RNA at 37 °C for 30 min, followed by addition of the target DNA (Supplementary Table S1) synthesized by IDT, at a ratio of 1:1.2:1.3. After 30 min, the reaction mixture was subjected to size-exclusion chromatography (SEC) over a Superdex 200 column (GE Healthcare) equilibrated with buffer D for further purification.
sgRNA preparation
sgRNAs were produced by in vitro transcription using the HiScribe T7 High Yield RNA synthesis kit (NEB) with PCR-amplified gBlocks (IDT) as templates. sgRNAs were purified over a Resource-Q column (GE Healthcare) and eluted with a linear NaCl gradient (50 mM-1000 mM) in 25 mM Tris-HCl (pH 8.0). The eluted sgRNAs were concentrated and stored at -80 °C.

Mutagenesis
Single amino acid mutations were introduced by the QuikChange site-directed mutagenesis method. Mutations affecting multiple amino acids were introduced by ligating the inverse PCR-amplified backbone with mutation-bearing DNA oligonucleotides via the In-Fusion Cloning Kit (ClonTech). All mutants were confirmed by Sanger sequencing.

In vitro DNA cleavage assay
Target DNA containing the 5′-TTTA-3′ PAM was ordered from IDT and cloned into a pET28-MHL vector using the In-Fusion Cloning Kit (ClonTech). Plasmids were linearized before use. Cas12f proteins (200 nM) were mixed with guide RNA at a ratio of 1:1.1 at 37 °C for 30 min in cleavage buffer containing 2.5 mM Tris-HCl (pH 7.6), 50 mM NaCl, 10 mM MgCl2, and 0.5 mM DTT, and then linearized plasmids (5 nM) were added. The reactions were quenched by adding EDTA and proteinase K (Thermo Fisher Scientific) after 45 min. The cleavage products were resolved on 0.7% agarose gels and visualized by ethidium bromide staining.

Electron microscopy
Aliquots of 4 μl of Cas12f-sgRNA binary complex (1 mg/ml) and Cas12f-sgRNA-dsDNA ternary complex (1 mg/ml) were applied to glow-discharged UltrAuFoil holey gold grids (R1.2/1.3, 300 mesh). The grids were blotted for 2 s and plunged into liquid ethane using a Vitrobot Mark IV. Cryo-EM data were collected with a Titan Krios microscope (FEI) operated at 300 kV, and images were collected using Leginon (27) at a nominal magnification of 81 000× (resulting in a calibrated physical pixel size of 1.05 Å/pixel) with a defocus range of -0.8 to -2.0 μm. The images were recorded on a K3 direct electron detector in super-resolution mode at the end of a GIF-Quantum energy filter operated with a slit width of 20 eV. A dose rate of 20 electrons per pixel per second and an exposure time of 3.12 s were used, generating 40 movie frames with a total dose of ∼54 electrons/Å2. Statistics for cryo-EM data are listed in Table 1.
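As a rough cross-check of the exposure settings quoted above, the nominal dose can be estimated from the dose rate, the exposure time and the physical pixel size. The sketch below is illustrative only: the function name is hypothetical, and the calculation assumes the dose rate refers to physical (not super-resolution) pixels. The nominal value it gives is close to, but not identical with, the reported ∼54 electrons/Å2; the text does not give the calibration details that would explain the difference.

```python
def nominal_dose_per_A2(dose_rate_e_per_px_s, exposure_s, pixel_size_A):
    """Nominal accumulated dose in electrons per square angstrom.

    Assumes the dose rate is quoted per physical pixel (an assumption,
    not stated explicitly in the text).
    """
    electrons_per_pixel = dose_rate_e_per_px_s * exposure_s
    return electrons_per_pixel / (pixel_size_A ** 2)


# Values taken from the data-collection paragraph.
DOSE_RATE = 20.0      # electrons per pixel per second
EXPOSURE = 3.12       # seconds
PIXEL_SIZE = 1.05     # angstrom per physical pixel
N_FRAMES = 40

total_dose = nominal_dose_per_A2(DOSE_RATE, EXPOSURE, PIXEL_SIZE)
dose_per_frame = total_dose / N_FRAMES
exposure_per_frame = EXPOSURE / N_FRAMES

print(f"nominal total dose: {total_dose:.1f} e/A^2")
print(f"nominal dose per frame: {dose_per_frame:.2f} e/A^2")
print(f"exposure per frame: {exposure_per_frame:.3f} s")
```

The per-frame figure is the quantity typically supplied to dose-weighting steps during motion correction.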

Image processing
The movie frames were imported to RELION-3 (28). Movie frames were aligned using MotionCor2 (29) with a binning factor of 2. Contrast transfer function (CTF) parameters were estimated using Gctf (30). A few thousand particles were auto-picked without a template to generate 2D averages for subsequent template-based auto-picking. The auto-picked and extracted particle datasets were split into batches for 2D classification, which was used to exclude false and bad particles that fell into 2D averages with poor features. Particles from different views were used to generate an initial model in cryoSPARC (31). 3D classification was then performed to separate compositional and conformational heterogeneity. The homogeneous dataset was used for final 3D refinement with C1 symmetry.
For the Cas12f-sgRNA binary complex dataset, 1 846 279 particles were auto-picked and extracted from 1391 dose-weighted micrographs. 448 190 particles were selected from 2D classification and used for 3D classification. 154 190 particles were selected from 3D classification and used for final 3D refinement.
For the Cas12f-sgRNA-dsDNA ternary complex dataset, 3 284 618 particles were auto-picked and extracted from 2450 dose-weighted micrographs. 992 872 particles were selected from 2D classification and used for 3D classification. 384 132 particles were selected from 3D classification and used for final 3D refinement. Focused refinement around the Nuc domain was further performed to improve the local map quality. Cryo-EM image processing is summarized in Table 1.
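The particle counts reported for the two datasets can be turned into retention fractions to see how aggressively each classification step pruned the data. The following is a minimal sketch using only the numbers quoted above; the dictionary layout and function name are illustrative choices, not part of the processing pipeline.

```python
# Particle and micrograph counts as reported for each dataset.
DATASETS = {
    "binary": {
        "micrographs": 1391,
        "picked": 1_846_279,
        "after_2d": 448_190,
        "after_3d": 154_190,
    },
    "ternary": {
        "micrographs": 2450,
        "picked": 3_284_618,
        "after_2d": 992_872,
        "after_3d": 384_132,
    },
}


def retention(counts):
    """Fractions of particles kept at each stage, plus picking density."""
    return {
        "kept_by_2d": counts["after_2d"] / counts["picked"],
        "kept_by_3d": counts["after_3d"] / counts["after_2d"],
        "overall": counts["after_3d"] / counts["picked"],
        "picked_per_micrograph": counts["picked"] / counts["micrographs"],
    }


stats = {name: retention(counts) for name, counts in DATASETS.items()}

for name, s in stats.items():
    print(
        f"{name}: 2D kept {s['kept_by_2d']:.1%}, "
        f"3D kept {s['kept_by_3d']:.1%}, "
        f"overall {s['overall']:.1%}, "
        f"{s['picked_per_micrograph']:.0f} picks/micrograph"
    )
```

In both datasets only roughly one in ten auto-picked particles contributes to the final refinement, which is consistent with the permissive, template-free initial picking described above.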

Model building, refinement, and validation
De novo model building of the Cas12f-sgRNA-target DNA structure was performed manually in COOT (32) guided by secondary structure predictions from PSIPRED (33). Refinement of the structure models against the corresponding maps was performed using the phenix.real_space_refine tool in Phenix (34). For the Cas12f-sgRNA complex, the structure model of the Cas12f-sgRNA-target DNA complex was fitted into the cryo-EM map, and each domain was manually adjusted in COOT. The resultant model was refined against the corresponding cryo-EM map using the phenix.real_space_refine tool in Phenix. 3D FSC analysis for the presented maps was performed using the Remote 3DFSC Processing Server (https://3dfsc.salk.edu/upload/) (35).

Structural visualization
Figures were generated using PyMOL and UCSF Chimera (36).

Overall structure of Cas12f-sgRNA-target DNA
We assembled a Cas12f-sgRNA-target DNA ternary complex by incubating an inactive Un1Cas12f1 D510A mutant (529 amino acids (a.a.), 61.5 kDa) (10), an sgRNA (222 nucleotides), and a target dsDNA with a TTTA PAM sequence (60 bp) (Supplementary Figure S1A). Using cryo-EM, we determined the structure of this complex at 3.1 Å resolution (Figure 1A, Supplementary Figure S1B-G, and Table 1). The resultant map allowed us to build the atomic model of the whole complex (Supplementary Figure S2), except for three residues at the N-terminus, four residues at the C-terminus, and flexible regions in the sgRNA and target DNA that are discussed below. The most striking feature of the structure is the presence of two copies of Cas12f in the complex (named Cas12f.1 and Cas12f.2) (Figure 1A, B), in contrast to all previously determined structures of other class 2 effectors. The overall structure of the Cas12f-sgRNA-target DNA ternary complex is consistent with a recent study (37) that was published during the preparation of this paper. Despite its small size, Cas12f contains all the conventional domains found in other known Cas12 nuclease structures (Supplementary Figure S3). Cas12f monomers consist of REC1 and WED domains in the N-terminal half and the RuvC, REC2 (included as part of RuvC in (37)), and Nuc (the target nucleic acid-binding or TNB domain in (37)) domains in the C-terminal half (Figure 1C). The closest match to Cas12f is Cas12g, with 767 amino acids; both Cas12f and Cas12g are classified into branch 3 of type V nucleases based on phylogenetic analysis (6,7). The biggest difference is the REC1 domain, which can be further divided into two subdomains: REC1 N (referred to as a zinc finger or ZF domain in (37)) and REC1 C.
REC1 N contains two anti-parallel helices connected by a CCCH zinc finger motif with a zinc ion chelated by four cysteines (C475, C478, C500 and C503), while REC1 C is composed of a bundle of three anti-parallel helices, which is the primary dimerization interface of Cas12f (Figure 1C).

Structure of sgRNA
The sgRNA of Cas12f contains a 140-nt tracrRNA at the 5′ end and a 37-nt crRNA at the 3′ end (17-nt repeat-derived and 20-nt spacer-derived sequences), connected by a linker (Figure 2A,B and Supplementary Figure S4). Four stem-loop structures (Stems 1-4) are present in the tracrRNA (Figures 1A and 2A,B). Stem 1 (1-21) contains seven base pairs and is solvent exposed but lacks direct interactions with the Cas12f subunits. Deletion of Stem 1 (ΔStem 1) shows comparable activity to the full-length tracrRNA in substrate cleavage assays (Figure 2C). Stem 2 (22-69) is a long duplex primarily interacting with Cas12f.2 that connects the N-terminal and C-terminal halves of Cas12f.2 (Figure 1A). The 10-bp duplex (23-33 and 59-69) bound to the C-terminal half of Cas12f.2 is structurally ordered, while the rest (34-58) is curved and flexible due to disturbance of the Watson-Crick base pairing in the duplex (Figures 1A and 2B). Partial deletion of Stem 2 (ΔStem 2) or of both Stems 1 and 2 (ΔStems 1&2) results in reduced activity in substrate cleavage assays (Figure 2C), indicating that Stem 2 is required for optimal activity. The 5-bp Stem 3 (72-88) is located in the center of the sgRNA structure and contributes a loop (78-83) that forms the anti-repeat:repeat duplex 1 (AR:R 1) with the repeat-derived region of the crRNA, which is critical for correct positioning of the spacer-derived guide (Figure 2A,B). Following Stem 3 is a long duplex, Stem 4 (94-127), that lies between the two copies of Cas12f and establishes extensive interactions with both of them. Consequently, replacement of Stem 4 with a UUUU linker (ΔStem 4) significantly reduces substrate cleavage activity (Figure 2C). The 3′ end of the tracrRNA (132-140) establishes the second duplex with the repeat-derived region of the crRNA, the AR:R 2 duplex. Deletion of the AR:R 2 duplex (ΔAR:R 2) shows a moderate reduction in substrate cleavage activity.
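The nucleotide ranges given for the tracrRNA elements can be collected into a small lookup table, which makes it easy to check segment lengths and to see which positions fall between the named elements. This is a bookkeeping sketch only: it assumes the quoted ranges are 1-based and inclusive, and the variable and function names are hypothetical.

```python
TRACR_LENGTH = 140  # nt, as stated for the tracrRNA portion of the sgRNA

# Named tracrRNA elements with the (start, end) positions quoted in the text.
SEGMENTS = {
    "Stem 1": (1, 21),
    "Stem 2": (22, 69),
    "Stem 3": (72, 88),
    "Stem 4": (94, 127),
    "AR:R 2 strand": (132, 140),
}


def segment_lengths(segments):
    """Length of each element, treating ranges as inclusive."""
    return {name: end - start + 1 for name, (start, end) in segments.items()}


def unassigned_positions(segments, total_length):
    """Positions in 1..total_length not covered by any named element."""
    covered = set()
    for start, end in segments.values():
        covered.update(range(start, end + 1))
    return [pos for pos in range(1, total_length + 1) if pos not in covered]


lengths = segment_lengths(SEGMENTS)
junctions = unassigned_positions(SEGMENTS, TRACR_LENGTH)

for name, length in lengths.items():
    print(f"{name}: {length} nt")
print("positions between named elements:", junctions)
```

Under these assumptions the named elements account for all but a handful of junction nucleotides in the tracrRNA, which is one way to see why deleting whole stems shortens the guide so substantially.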
Altogether, with the exception of Stem 1, the stem-loop structures in the tracrRNA play a role in Cas12f activity. However, none of the deletion mutations completely abolishes the activity of the complex. Notably, deletion of Stems 1 and 2 and AR:R 2 (ΔStems 1&2 & AR:R 2), which reduces the sgRNA from 222 nt to 90 nt, still shows considerable substrate cleavage activity (Figure 2C). These results lay the foundation for designing smaller and simpler guide RNAs for potential application of Cas12f in genome editing.
[...] (Figures 1B and 3A). Specifically, five hydrophobic amino acids (I118, Y121, Y122, Y126 and L182) from each monomer establish a hydrophobic patch that associates the two monomers. Mutation of any of those residues to glycine reduces the cleavage activity of Cas12f, with Y121G, I126G and L182G exhibiting significant effects (Figure 3B). Furthermore, mutations of two residues (Y121 and Y122) or four residues (I118, Y121, Y122 and Y126) to either glycine or glutamic acid completely abolish the cleavage activity (Figure 3B). These results suggest that dimerization is essential for substrate cleavage by Cas12f. In addition to REC1 C, the REC2 domains form a second contact between Cas12f.1 and Cas12f.2 through electrostatic interactions and their contacts with Stem 4 of the tracrRNA (Figure 3C). Apart from the dimerization interfaces, both Cas12f molecules establish extensive interactions with one copy of sgRNA, suggesting that the sgRNA plays an important role in coordinating the two Cas12f molecules within the complex (37).

PAM recognition
The PAM sequence is recognized at the interface of REC1 C and the WED domain. The hydroxyl group of S142 and the guanidino group of R163 form two hydrogen bonds with base A(-1) of the TTTA PAM sequence in the non-target strand (Figure 3D). The amide group of Q197 forms a pair of hydrogen bonds with A(-3) of the target strand, while Y202 forms a hydrogen bond with A(-4) of the target strand (Figure 3D). Alanine substitution of any of these residues reduces substrate cleavage activity, with S142A, R163A, and Q197A almost completely abolishing activity (Figure 3E). In addition to the sequence-specific interactions, S286, Y146, and K196 also establish non-sequence-specific interactions with the PAM duplex (Figure 3D). Y146A and K196A mutations also severely reduce the ability of the complex to degrade substrate DNA (Figure 3E). PAM recognition is also critical for subsequent strand separation of target DNA to facilitate hybridization between guide RNA and target DNA (38,39). A helix from REC1 C (a.a. 134-152) is inserted between the two strands of target DNA at the +1 position, with H139 packing against the adenine base of A(+1) and therefore maintaining the target DNA after the PAM in an unwound conformation (Figure 3D).
Notably, those residues are exposed to solvent in the other subunit, Cas12f.2, of the ternary complex; therefore, alanine substitutions of any of them should not impact substrate recognition (Figure 3F). Interestingly, Stem 4 of the sgRNA is located near the PAM recognition site in Cas12f.2, likely preventing substrate binding at this site (Figure 3F).
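To make the PAM requirement concrete, the sketch below scans a DNA sequence for the TTTA PAM and reports the 20-nt stretch immediately 3′ of it as a candidate protospacer. It is a simplified illustration rather than a guide-design tool: the function names are hypothetical, only the supplied strand is searched (treated as the non-target strand, with the PAM 5′ of the protospacer), and no other T-rich PAM variants, off-target checks or reverse-strand searches are attempted.

```python
def find_pam_sites(non_target_strand, pam="TTTA", spacer_len=20):
    """Return (pam_index, protospacer) for each PAM followed by a full spacer.

    pam_index is the 0-based position of the first PAM base in the input.
    """
    seq = non_target_strand.upper()
    sites = []
    start = seq.find(pam)
    while start != -1:
        proto_start = start + len(pam)
        protospacer = seq[proto_start:proto_start + spacer_len]
        # Skip PAMs too close to the 3' end to allow a full-length spacer.
        if len(protospacer) == spacer_len:
            sites.append((start, protospacer))
        start = seq.find(pam, start + 1)
    return sites


def spacer_rna(protospacer):
    """Guide spacer matching the non-target strand, written as RNA."""
    return protospacer.replace("T", "U")


# Hypothetical test sequence: 2-nt flank, PAM, 20-nt protospacer, 2-nt flank.
example = "GG" + "TTTA" + "ACGTACGTACGTACGTACGT" + "CC"
sites = find_pam_sites(example)

for index, protospacer in sites:
    print(index, protospacer, spacer_rna(protospacer))
```

Because the spacer pairs with the target strand, its sequence matches the non-target strand protospacer with U in place of T, which is what `spacer_rna` returns.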

crRNA-DNA heteroduplex recognition
The 19-20 bp crRNA-DNA heteroduplex is located in the central channel formed by Cas12f, similar to other Cas12 proteins (Supplementary Figure S3). The heteroduplex is recognized by positively charged residues from both Cas12f.1 and Cas12f.2, while the non-target strand is held predominantly by the N-terminal half of Cas12f.2 (Figure 1A and Supplementary Figure S5A, B). The PAM-proximal end of the heteroduplex is primarily recognized by Cas12f.1, whereas the PAM-distal end is bound to Cas12f.2 (Supplementary Figure S5A, B). Single alanine substitutions of the residues involved in the recognition of the crRNA-DNA heteroduplex mostly result in modest reductions in the cleavage activity of Cas12f (Supplementary Figure S5C). However, alanine substitution of R396 severely reduces the substrate cleavage activity (Supplementary Figure S5C). R396 engages the phosphate group at position +8 of the target DNA strand, a position shown to be a critical checkpoint for Cas12a (20,40).

Nuclease site of Cas12f
The conserved triplet of acidic residues (D326, E422 and D510) from the RuvC domain is located at the interface between the RuvC and Nuc domains (Figure 4A). R490 from the Nuc domain is also located in the active site, and alanine substitution of this residue results in loss of cleavage activity (9). Lying on top of the acidic residues is the lid motif, which plays a vital role in regulating the RuvC active site (26). Interestingly, the lid motif in Cas12f.1 is in an open conformation, consistent with formation of the crRNA-target DNA heteroduplex (Figure 4A). However, the lid motif in Cas12f.2 is in a closed conformation, although this active site is closer to the 5′ end of the target strand (Figure 4B). Two purine bases from Stem 2 of the tracrRNA, G(24) and A(62), insert into the inactive RuvC catalytic pocket, likely further inhibiting substrate access (Figure 4B and Supplementary Figure S5D). This observation indicates that only the RuvC domain in Cas12f.1 is activated upon target DNA binding. Interestingly, we observed an extended density, assigned as substrate DNA trapped in the RuvC active site of Cas12f.1, that likely originates from excess DNA oligos used in complex assembly. The resolution does not allow for unambiguous assignment of bases but was clear enough for us to build a 5-nt poly-C model (Figure 4C). The backbone of the substrate is located in proximity to the triplet of acidic residues, with R490 from the Nuc domain sitting on the other side of the backbone (Figure 4A). The stacking of bases in the substrate is broken between C(3) and C(4) because the side chains of M427 and W433 occupy the space of base C(4). Consequently, base C(4) rotates by ∼90° and packs against the side chain of F487 (Figure 4C). The rotation of C(4) and its close proximity to the triplet of acidic residues indicate that the phosphate group connecting C(3) and C(4) is the scissile phosphate targeted for cleavage.
This configuration of the substrate DNA in the RuvC active site is consistent with previous observations in Cas9 (41), Cas12b (22) and Cas12i (26). Alanine substitution of W443 significantly reduces the cleavage activity of Cas12f, while substitution of F487 has only a minor effect, suggesting that W443 plays a dominant role in positioning the substrate DNA for cleavage (Figure 4D).
The lid motif bridges the substrate and the crRNA-target DNA heteroduplex. Replacement of the lid motif with alanine or a GSGSGS linker deactivates Cas12f (Figure 4E). These results add to our mechanistic understanding of substrate configuration and cleavage within the RuvC nuclease domain.

Activation mechanism of Cas12f
To understand the mechanism of Cas12f activation by target DNA, we determined the cryo-EM structure of the Cas12f-sgRNA binary complex at 3.9 Å (Supplementary Figure S6 and Table 1). This structure reveals a 5-nt pre-ordered seed sequence in the crRNA adjacent to the PAM duplex (Supplementary Figure S6J). A 5′-end seed sequence was also observed in other Cas12 nucleases, including Cas12a (17), Cas12b (22,23) and Cas12i (26). The binary complex structure also allows us to reveal the conformational changes in Cas12f upon target DNA recognition (Figure 5B, C, and Movie S1). The most significant conformational changes happen in the C-terminal half of Cas12f. [...] (26). Although both copies of the Cas12f effector protein are necessary for the function of the complex, this evidence suggests that Cas12f still adopts a conserved mechanism for activation of the RuvC nuclease site, like other type V nucleases.

DISCUSSION
In this paper, we show that two copies of Cas12f bind to one sgRNA for target recognition and cleavage. Dimerization of Cas12f is likely to compensate for the small size of Cas12f, allowing for recognition of the ∼20-bp crRNA-target DNA duplex, which is a conserved length for substrate recognition in most class 2 CRISPR-Cas systems (Supplementary Figure S3). The most notable differences between Cas12f and other type V effectors are the lengths of the REC1 and REC2 domains. The REC1 domain of Cas12f is composed of ∼170 a.a., in comparison to ∼300 a.a. in other Cas12 proteins (315 a.a. in Cas12a (19), ∼377 a.a. in Cas12b (22), ∼276 a.a. in Cas12e (25) and ∼353 a.a. in Cas12i (26)). Additionally, the REC2 domain in Cas12f is composed of ∼68 a.a., in comparison to ∼200 a.a. in other Cas12 proteins (∼252 a.a. in Cas12a (19), ∼200 a.a. in Cas12b (22), ∼177 a.a. in Cas12e (25) and ∼203 a.a. in Cas12i (26)). Both the REC1 and REC2 domains are involved in the recognition and stabilization of the 20-bp crRNA-target DNA duplex, formation of which induces the conformational changes required for activation of the RuvC domain. Minimal lengths of the REC1 and REC2 domains are thought to be indispensable for their proper function. Despite the miniature size of its REC1 and REC2 domains, dimerization renders Cas12f an effective RNA-guided nuclease similar to other Cas12 proteins. In detail, Cas12f.1 functions as a conventional Cas12 effector that contributes the canonical RuvC, Nuc, and WED domains. The combination of the REC1 domain of Cas12f.1 and the REC1 and WED domains of Cas12f.2 is [...] (Figure S5A,B), likely regulating the length of the crRNA-target DNA duplex, similar to W382 in the REC2 domain of the Acidaminococcus sp. Cas12a (42).
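The domain-length comparison above can be summarized numerically. The sketch below uses only the approximate lengths quoted in the text and compares one and two copies of each Cas12f domain with the average of the other Cas12 proteins listed. Doubling the Cas12f lengths is a crude way to illustrate the dimer and is an assumption of this sketch, not an analysis performed in the paper; names are illustrative.

```python
from statistics import mean

# Approximate domain lengths (a.a.) as quoted in the Discussion.
DOMAIN_LENGTHS = {
    "REC1": {"Cas12f": 170, "Cas12a": 315, "Cas12b": 377, "Cas12e": 276, "Cas12i": 353},
    "REC2": {"Cas12f": 68, "Cas12a": 252, "Cas12b": 200, "Cas12e": 177, "Cas12i": 203},
}


def compare_to_others(lengths, reference="Cas12f"):
    """Ratio of the reference length (one and two copies) to the mean of the rest."""
    others = [value for name, value in lengths.items() if name != reference]
    others_mean = mean(others)
    return {
        "others_mean": others_mean,
        "monomer_ratio": lengths[reference] / others_mean,
        "dimer_ratio": 2 * lengths[reference] / others_mean,
    }


summary = {domain: compare_to_others(values) for domain, values in DOMAIN_LENGTHS.items()}

for domain, s in summary.items():
    print(
        f"{domain}: others average {s['others_mean']:.1f} a.a., "
        f"one copy {s['monomer_ratio']:.2f}x, two copies {s['dimer_ratio']:.2f}x"
    )
```

On these figures, two copies of the Cas12f REC1 domain together are comparable in length to a single REC1 domain of the larger effectors, whereas two REC2 domains still fall short, in line with the description of both domains as miniature.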
Although two Cas12f molecules are required for target recognition and cleavage, Cas12f adopts the conserved activation mechanism of the type V nucleases that requires coordinated conformational changes induced by the formation of the crRNA-target DNA heteroduplex. In summary, our results unravel the mechanism of Cas12f and add to our understanding of mechanisms behind the diverse type V CRISPR-Cas effectors.

DATA AVAILABILITY
Cryo-EM reconstructions of Cas12f-sgRNA-target DNA and Cas12f-sgRNA complexes have been deposited in the Electron Microscopy Data Bank under the accession numbers EMD-23158 and EMD-23157, respectively. Coordinates for atomic models of Cas12f-sgRNA-target DNA and Cas12f-sgRNA complexes have been deposited in the Protein Data Bank under the accession numbers 7L49 and 7L48, respectively.