CasPEDIA Database: a functional classification system for class 2 CRISPR-Cas enzymes

Abstract
CRISPR-Cas enzymes enable RNA-guided bacterial immunity and are widely used for biotechnological applications including genome editing. In particular, the Class 2 CRISPR-associated enzymes (Cas9, Cas12 and Cas13 families) have been deployed for numerous research, clinical and agricultural applications. However, the immense genetic and biochemical diversity of these proteins in the public domain poses a barrier for researchers seeking to leverage their activities. We present CasPEDIA (http://caspedia.org), the Cas Protein Effector Database of Information and Assessment, a curated encyclopedia that integrates enzymatic classification for hundreds of different Cas enzymes across 27 phylogenetic groups spanning the Cas9, Cas12 and Cas13 families, as well as evolutionarily related IscB and TnpB proteins. All enzymes in CasPEDIA were annotated with a standard workflow based on their primary nuclease activity, target requirements and guide-RNA design constraints. Our functional classification scheme, CasID, is described alongside current phylogenetic classification, allowing users to search related orthologs by enzymatic function and sequence similarity. CasPEDIA is a comprehensive data portal that summarizes and contextualizes enzymatic properties of widely used Cas enzymes, equipping users with valuable resources to foster biotechnological development. CasPEDIA complements phylogenetic Cas nomenclature and enables researchers to leverage the multi-faceted nucleic-acid targeting rules of diverse Class 2 Cas enzymes.


Introduction
CRISPR-Cas (clustered regularly interspaced short palindromic repeats, CRISPR-associated) systems provide adaptive immunity in bacteria and archaea through RNA-guided recognition and Cas-mediated destruction of foreign nucleic acids (1,2). These immune systems are exceptionally diverse, comprising 6 types and 33 subtypes according to a recent classification (3). Beginning with the discovery of RNA-guided endonuclease activity conferred by Cas9, insights into the enzymatic activities of CRISPR-Cas enzymes have precipitated a veritable wave of biotechnological innovation (4-6).
In particular, the Class 2 Cas enzymes have been a driver of biotechnological development owing to their single-protein nature. Class 2 Cas enzymes can be separated into 3 families: Cas9, Cas12 and Cas13, from Type II, V and VI CRISPR systems, respectively (3). Because these proteins all employ a processed CRISPR RNA (crRNA) to guide protein activity towards a sequence of interest, they can be easily 'programmed' to target unique sequences by simple design of a spacer (i.e. targeting) sequence. However, as researchers have explored the genetic diversity of these systems, it has become clear that (i) the RNA-guided biochemical activity, (ii) constraints on targeting context and (iii) ways crRNAs are processed differ dramatically across, and within, families. While these differences reflect opportunities for biotechnological development, there does not yet exist a centralized resource for comparing biochemical activity to complement existing genetic classification efforts (3). As a result, it remains difficult to compare and contrast these enzymes functionally across and within subfamilies.
Here, we present CasPEDIA (http://caspedia.org), which provides users with summary information about the capabilities and limitations of Class 2 Cas technologies to facilitate tool selection and to highlight opportunities for future biotechnological development. We introduce CasID, a Cas enzyme classification scheme, to facilitate functional comparison between RNA-guided Class 2 Cas enzymes. The optimal selection of a CRISPR enzyme depends heavily on the intended application, and CasPEDIA allows for efficient comparison between enzymes by both their biochemical properties and their previously established uses. As a flexible database, CasPEDIA can be updated to accommodate the emergence of novel CRISPR-Cas enzymes and their applications.

The website's Tool Finder (http://caspedia.org/tool_finder.html) may be used to explore and tabulate enzymes that possess each feature below.
CasPEDIA introduces a systematic, enzymatic nomenclature for the functional classification of Class 2 Cas proteins, summarized in Table 1. This classification, termed CasID, is directly inspired by the ENZYME Classification (E.C.) system, but is tailored to the unique properties of these RNA-guided enzymes (7). Each enzyme in CasPEDIA receives a 3-decimal number reflecting its biochemical activities as an RNA-guided enzyme. Briefly, CasPEDIA's classification schema can be seen in Table 1 and is summarized here.

As shown in Figure 2, each wiki contains 7 sections, including Quick Review, Summary, Applications, Experimental Considerations, Nucleotide Sequence and Protein Structure. The Quick Review section, located at the top of the page, enables rapid access to essential information including: enzyme classification (a description of the CasID and phylogenetic classification), core properties (e.g. protospacer length, PAM/PFS, length of the nucleotide coding sequence, etc.) and external resources (e.g. RefSeq identifiers for the gene and protein, UniProtKB ID, Conserved Domains Database IDs, etc.) (9-11). Next is a high-level Summary section, detailing the nuclease's origins, novel properties and common uses. The Applications section then provides a literature review for the enzyme, with subheaders for Gene Editing examples in model organisms, Tools and Diagnostics utilizing the enzyme and Engineered Variants with expanded properties. This is followed by the Experimental Considerations section, a brief introduction to performing experiments with the Cas enzyme. It includes details on construct design, appropriate delivery modalities and a list of algorithms for gRNA design. The Nucleotide Sequence section is complete with downloads and a genome browser, created with igv.js (12), displaying the nuclease's sequence and the architecture of its CRISPR array. The subsequent section covers Protein Structure, which includes a summary of the protein's domains from UniProt (11), Pfam annotations (13) and structures from the PDB (14), or predicted with AlphaFold2 (15), visualized using 3Dmol.js (16). Citations are provided for all wiki content and indexed at the bottom of the webpage. To assist users in locating relevant wiki entries, CasPEDIA includes extensive search features and navigational pages, discussed below.

Navigating CasPEDIA
CasPEDIA provides multiple search tools to connect users with pertinent CRISPR-Cas enzymes and wiki content (Figure 2). Scientists can use the search bar, located on the homepage, to search for Cas nucleases by name (e.g. AsCas12a, SpyCas9a), RefSeq protein ID, or function using CasID nomenclature. Each search returns a table of matching protein entries and displays for each entry: Enzyme Name, CasID, Protein Accession (RefSeq protein ID, when available), Nuclease Activity, Targeting Requirement, gRNA Design and Multiplexability, and PAM. The search bar can also be used to query the database with a protein sequence using DELTA-BLAST (17) with default parameters. This approach allows for remote homology detection with the support of NCBI's Conserved Domain Database (CDD) (10) for domain-enhanced sequence searches across the CasPEDIA database. The resulting table is sortable by all fields, including E-value, to assist users in finding a nuclease of interest.
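A single search bar that accepts names, RefSeq protein IDs and CasIDs has to decide which kind of query it received. The sketch below is one way this could be done and is not CasPEDIA's actual implementation: the function name `classify_query`, both regular expressions and the accession `WP_000000000.1` are assumptions made for illustration, and the RefSeq pattern is deliberately simplified.

```python
import re

# Assumed shape of a CasID: three dot-separated integers, e.g. "1.1.1".
CASID_RE = re.compile(r"^\d+\.\d+\.\d+$")

# Simplified (assumed) shape of a RefSeq protein accession:
# a two-letter prefix ending in "P", an underscore, digits, optional ".version".
REFSEQ_PROTEIN_RE = re.compile(r"^[ANWXYZ]P_\d+(\.\d+)?$")


def classify_query(query: str) -> str:
    """Return 'casid', 'refseq_protein' or 'name' for a search-bar query."""
    q = query.strip()
    if CASID_RE.match(q):
        return "casid"
    if REFSEQ_PROTEIN_RE.match(q.upper()):
        return "refseq_protein"
    # Anything else is treated as an enzyme name such as "AsCas12a".
    return "name"


for example in ["1.1.1", "AsCas12a", "WP_000000000.1"]:  # last one is a made-up accession
    print(example, "->", classify_query(example))
```

Checking the CasID pattern first keeps the dispatch unambiguous, because a CasID can never match the accession pattern and enzyme names are the fallback.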
A separate page, entitled Tool Finder, directs users through a series of drop-down menus (fields include: Cis-Activity Substrate, Trans-Activity Substrate, Targeting Requirements, and gRNA Design and Multiplexability) and generates a table of all Class 2 systems within CasPEDIA that demonstrate, or are conservatively predicted to demonstrate, the selected properties.
CasPEDIA also supports phylogenetic navigation, complementing evolutionary classifications from previous studies (3,8). The Phylogeny page of the website provides summaries of the Type II, Type V and Type VI systems that make up Class 2. We provide dedicated pages for each system type, containing subtype descriptions and an interactive tree whose leaves redirect to wiki entries.
While CasPEDIA wiki entries are organized by protein type (i.e. nuclease name and corresponding species) and CasID, users may also locate information for examples of engineered variants and gene-editing tools. Term searches for engineered variants are unsupported at this time, but variant details can be found by searching for the parental enzyme by name and scanning the "Engineered Variants" section of the parental wiki entry. Furthermore, a designated page for fusion proteins is available (i.e. Tool Glossary), organizing the expanding list of base editors and prime editors by function, as well as proteins used for CRISPR interference (CRISPRi), CRISPR activation (CRISPRa) and other tools.
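The drop-down filtering performed by the Tool Finder can be pictured as matching every selected property against each enzyme record. The following is a minimal sketch under stated assumptions: the records, enzyme names, field keys (`cis`, `trans`, `targeting`) and property values are hypothetical placeholders rather than CasPEDIA data, and the function `tool_finder` is not part of the website's code.

```python
from typing import Optional

# Hypothetical records: names and property values are placeholders, not CasPEDIA data.
ENZYMES = [
    {"name": "ExampleCas-1", "cis": "dsDNA", "trans": "none", "targeting": "3' PAM"},
    {"name": "ExampleCas-2", "cis": "dsDNA", "trans": "ssDNA", "targeting": "5' PAM"},
    {"name": "ExampleCas-3", "cis": "ssRNA", "trans": "ssRNA", "targeting": "3' PFS"},
]


def tool_finder(records, **selected: Optional[str]):
    """Return the records that match every selected property.

    A value of None stands for a drop-down left at 'any' and is ignored.
    """
    chosen = {key: value for key, value in selected.items() if value is not None}
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in chosen.items())
    ]


hits = tool_finder(ENZYMES, cis="dsDNA", trans=None)
print([record["name"] for record in hits])
```

Treating an unset menu as "match anything" means that leaving every menu unset returns the full table, while each additional selection can only narrow the result.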

CasPEDIA data curation
CasPEDIA is a community project, curated from the literature by a panel of CRISPR researchers. Wiki content was managed through a series of forms, which were distributed amongst curators and editors for completion. To ensure accuracy and objectivity, citations from peer-reviewed publications and databases were required. Citations are provided at the base of each page. Structural and sequence information was taken from the literature or from databases such as PDB, NCBI, UniProt and Pfam. The CasPEDIA Consortium and Scientific Communications Team at the Innovative Genomics Institute reviewed all entries prior to initial release.

Future developments
Currently, CasPEDIA only contains entries for the enzymatic activities of Cas effectors in Class 2 CRISPR-Cas systems, as there is limited distinction between the enzymatic activity of the protein and the mature CRISPR-Cas complex. The current CasPEDIA entries include representatives from all 27 phylogenetic subtypes encoded within the Cas9, Cas12 and Cas13 families. We also provide entries for the related proteins IscB (HEARO) and TnpB, important variants used in biotechnological applications, and enzymatic subtypes (e.g. Cas12c1 versus Cas12c2). Enzymes derived from Class 2 CRISPR systems represent only a fraction of the overall Cas protein diversity (3). Class 1 CRISPR-Cas systems and CRISPR adaptation comprise the most abundant CRISPR systems and enzymes across bacterial and archaeal genomes (3). Owing to their multiprotein nature, Class 1 CRISPR-Cas interference complexes coordinate multiple enzymatic activities in target nucleic acid recognition, and their adoption for biotechnology has thus been difficult (2,25-27). Adapting CasID for these enzyme complexes would facilitate greater adoption and subsequent innovation by the biotechnology community and is a clear priority for future iterations. Additionally, new Class 2 CRISPR-Cas systems are emerging at a rapid pace. During the preparation of the CasPEDIA database alone, seven new systems were reported (20,28-34). We anticipate that many new systems will emerge by the next update of CasPEDIA.
CasPEDIA is an actively evolving database, which will grow through community engagement and sustained content management. CRISPR scientists are encouraged to contact the CasPEDIA Consortium to suggest new wiki entries and features, as well as to update current wikis with emergent discoveries. These efforts will maintain the relevance of the database as a useful resource for future scientists. Prospective volunteers can follow detailed directions on the Contact page of the website to contribute.

Data availability
CasPEDIA is freely accessible at http://caspedia.org, and data are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The website is compatible with all devices, including tablets and mobile phones. A complete inventory of enzymes in CasPEDIA, along with CasID numbers, can be downloaded on the Tool Finder page. Text content for the wikis is available upon request, with more information provided on the Contact page of the website. Illustrations from CasPEDIA are available for non-commercial use under a Creative Commons Attribution-NonCommercial-

No constraints. 3′ positioning means the 3′ CRISPR repeat is used.
4 | Protospacer-adjacent motif (PAM). This is a required sequence encoded in the non-targeted strand. 5′ positioning also means the 5′ CRISPR repeat is used.
5 | Protospacer-flanking sequence (PFS). This is a prohibited sequence encoded in the targeted strand, also referred to as an anti-tag. 5′ positioning also means the 5′ CRISPR repeat is used.
6 | No constraints. 5′ positioning means the 5′ CRISPR repeat is used.

+ non-CRISPR-associated endogenous factors in the native host required for CRISPR processing
2 | crRNA + tracrRNA required for CRISPR processing
3 | crRNA required for CRISPR processing
4 | ω

Figure 2. Overview of CasPEDIA Entry for SpyCas9a (1.1.1) from the database. (A) CasID diagram and functional description. (B) Resources for accessing native sequences and gRNA design for the Cas enzyme. (C) Functional and phylogenetic classification of SpyCas9 (CasID 1.1.1). (D) Biological properties of this Cas enzyme, including protein, gene and gRNA properties. (E) Overview of the Cas enzyme including a summary of the enzyme, applications, experimental considerations, protein structure and gene browser (below the visualized portion). (F) Link to homepage containing CasID Definitions and search bar, accommodating queries for Cas enzymes by CasID, protein name or protein family. (G) Icon for Tool Finder, where users can search CasPEDIA for enzymes with specific properties. (H) Redirects to Cas Phylogeny page for browsing the website by protein family. (I) Tool Glossary of common CRISPR-Cas systems. (J) Contact Page. (K) FAQ and general information.
CasPEDIA introduces a systematic, enzymatic nomenclature for the functional classification of Class 2 Cas proteins, sum-

Dimension | Value | Description
Primary Nuclease Activity | 1 | Targets dsDNA + no trans-activity. Cleavage products are predominantly blunt. However, additional trimming of DNA cleavage products may occur on a timescale much slower than that of the initial cuts. RNA-guided RuvC domains are also capable of targeted, PAM-independent ssDNA cleavage.
2 | 3′ Protospacer-flanking sequence (PFS). This is a prohibited sequence encoded in the targeted strand, also referred to as an anti-tag. 3′ positioning also means the 3′ CRISPR repeat is used.
CasPEDIA is organized in wiki format, with dedicated web pages for an initial set of 33 nucleases (Figure 2).

Familiar to most CRISPR biotechnologists is 'Nuclease Activity', describing which nucleic acids are predominantly cut in cis (i.e. guide RNA-targeted) or in trans (i.e. non-guide RNA-targeted). For example, SpyCas9a (CasID 1.1.1) is an RNA-guided enzyme with targeted double-stranded DNA (dsDNA) activity with blunt cleavage and no trans-activity, employs a 3′ PAM positioning, and requires multiple synthetic gRNAs for multiplexable design. The enzymatic classification of Class 2 CRISPR proteins is intended to complement evolutionary classification efforts (3,8). In tandem with phylogenetic classification, we hope that the consolidation of enzymatic and sequence information fosters the further development of CRISPR-based biotechnologies.
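Because a CasID is a three-part number, it can be handled in software as a small structured value rather than a free-text label. The sketch below parses a CasID such as the 1.1.1 shown for SpyCas9 in Figure 2 into its three positions. The class name, the field names and the assumed order (nuclease activity, targeting requirement, gRNA design) are illustrative assumptions, not an official CasPEDIA data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CasID:
    """Three-part functional code; field names and order are assumptions for illustration."""

    nuclease_activity: int
    targeting_requirement: int
    grna_design: int

    @classmethod
    def parse(cls, text: str) -> "CasID":
        """Parse a string such as '1.1.1' into its three integer positions."""
        parts = text.strip().split(".")
        if len(parts) != 3 or not all(part.isdecimal() for part in parts):
            raise ValueError(f"expected three dot-separated integers, got {text!r}")
        return cls(*(int(part) for part in parts))

    def __str__(self) -> str:
        return f"{self.nuclease_activity}.{self.targeting_requirement}.{self.grna_design}"


spy = CasID.parse("1.1.1")
print(spy)                    # round-trips to the original string
print(spy.nuclease_activity)  # first position of the code
```

Keeping the three positions as separate fields lets enzymes be grouped or compared on one dimension at a time, for instance collecting all entries that share a nuclease activity regardless of their targeting requirement.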