TransCRISPR–sgRNA design tool for CRISPR/Cas9 experiments targeting specific sequence motifs

Abstract Eukaryotic genomes contain several types of recurrent sequence motifs, e.g. transcription factor motifs, miRNA binding sites and repetitive elements. CRISPR/Cas9 can facilitate the identification and study of crucial motifs. We present transCRISPR, the first online tool dedicated to searching for sequence motifs in user-provided genomic regions and designing optimal sgRNAs targeting them. Users can obtain sgRNAs for chosen motifs, for up to tens of thousands of target regions in 30 genomes, either for the Cas9 or the dCas9 system. TransCRISPR provides user-friendly tables and visualizations summarizing features of the identified motifs and designed sgRNAs, such as genomic localization, quality scores and closest transcription start sites. Experimental validation of sgRNAs for MYC binding sites designed with transCRISPR confirmed efficient disruption of the targeted motifs and an effect on the expression of MYC-regulated genes. TransCRISPR is available from https://transcrispr.igcz.poznan.pl/transcrispr/.


INTRODUCTION
Several types of recurrent sequence motifs exist in eukaryotic genomes and are important components of complex regulatory networks (1). For example, transcription factors (TFs) recognize specific sequence motifs in DNA and regulate target gene expression (2). MicroRNAs bind to their target transcripts via short complementary seed sequences (3). Splicing also depends on the recognition of specific sequences by the spliceosome (4). The role and significance of these motifs can be examined by blocking or disrupting them, e.g. with the CRISPR/Cas9 system (5)(6)(7)(8)(9)(10)(11). Clustered Regularly Interspaced Short Palindromic Repeats/Cas9 (CRISPR/Cas9) has become one of the most powerful tools for genome editing and has revolutionized genome engineering. However, reliable and informative CRISPR/Cas9 experiments require high specificity and efficiency. In answer to this need, several online CRISPR/Cas9 tools have been developed that design optimal single-guide RNAs (sgRNAs) targeting specific sequences and predict their off- and on-target scores to increase specificity and efficiency (12)(13)(14)(15)(16). Although they offer a wide range of possibilities, none of them allows searching for a specific motif in a given sequence and designing sgRNAs targeting this motif.
Here, we present a highly versatile online tool, transCRISPR, created to identify specific motifs in the sequence of interest and to design sgRNAs with optimal off- and on-target scores. It can be applied both to single sequences and to large lists of genome coordinates, enabling the design of sgRNA libraries for genome-wide CRISPR/Cas9 screens.

Data
Genomic coordinates for coding exons, non-coding exons, introns and transcription start sites (TSS) were downloaded from UCSC using a database interface. Full download command: mysql {genome name} -h genome-mysql.soe.ucsc.edu -u genome -A -e 'select * from ncbiRefSeqCurated' -NB > {genome file}. These data were further automatically processed to create specialized .bed files with genes and localization elements for each of the available genomes. Whole genome sequences were downloaded as FASTA files.
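As an illustration of this processing step, the following minimal sketch turns one downloaded row into a one-base BED record marking the TSS. It assumes that rows of the ncbiRefSeqCurated table follow the UCSC genePred extended layout with a leading bin column (bin, name, chrom, strand, txStart, txEnd, ..., name2 as the 13th field); the function name and the BED name format are hypothetical and are not taken from the transCRISPR code.

```python
def tss_bed_line(row: str) -> str:
    """Convert one tab-separated table row into a 1-bp BED line for the TSS.

    Assumed column order (genePred extended with a leading bin column):
    bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount,
    exonStarts, exonEnds, score, name2, ...
    """
    fields = row.rstrip("\n").split("\t")
    transcript, chrom, strand = fields[1], fields[2], fields[3]
    tx_start, tx_end = int(fields[4]), int(fields[5])
    gene = fields[12]
    # Coordinates are 0-based and half-open, so the first transcribed base
    # is txStart on the '+' strand and txEnd - 1 on the '-' strand.
    tss = tx_start if strand == "+" else tx_end - 1
    return "\t".join([chrom, str(tss), str(tss + 1),
                      f"{gene}|{transcript}", "0", strand])


if __name__ == "__main__":
    # Made-up example row, not real annotation data.
    example = "585\tNM_000001.1\tchr1\t-\t1000\t5000\t1200\t4800\t1\t1000,\t5000,\t0\tGENE1"
    print(tss_bed_line(example))
```

Sorting the resulting lines by chromosome and start position would then give per-chromosome files of the kind described in the Implementation section.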

Implementation
TransCRISPR was built with Django (in the Python programming language). MariaDB is used as the database, Celery with Redis as the task queue system, Daphne for websocket communication, Nginx as the web server, and the Bootstrap-based Gentelella template for the layout, with Highcharts as the data visualization library. Docker with Docker Compose is used for management purposes. Off-targets are identified using Cas-OFFinder (17) with a maximum of 4 (standard option) or 3 (rapid option) mismatches; the CFD score is then calculated for each off-target, as well as a cumulative CFD score (17, 18). For on-target value calculation, a dockerized version of Azimuth is used (18). Two queue systems are available, one for short calculations and one for larger queries, so that short tasks can be processed quickly. The software is freely available online: https://transcrispr.igcz.poznan.pl.
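The mismatch thresholds of the two off-target modes can be illustrated with a short sketch. This is not how Cas-OFFinder works internally (it searches whole genomes); the code only shows the filtering criterion on a list of candidate sites that is assumed to be already available. All names are hypothetical, and CFD scoring is deliberately omitted because it relies on the published mismatch penalty matrix.

```python
# Maximum number of mismatches allowed in each mode, as stated in the text.
MAX_MISMATCHES = {"standard": 4, "rapid": 3}


def count_mismatches(guide: str, site: str) -> int:
    """Number of positions at which two equally long sequences differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(a != b for a, b in zip(guide.upper(), site.upper()))


def keep_off_targets(guide, candidate_sites, mode="standard"):
    """Return (site, mismatches) for candidates within the mode's limit."""
    limit = MAX_MISMATCHES[mode]
    kept = []
    for site in candidate_sites:
        n = count_mismatches(guide, site)
        if n <= limit:
            kept.append((site, n))
    return kept


if __name__ == "__main__":
    guide = "A" * 20
    sites = ["CCC" + "A" * 17, "CCCC" + "A" * 16, "CCCCC" + "A" * 15]
    print(keep_off_targets(guide, sites, mode="standard"))
    print(keep_off_targets(guide, sites, mode="rapid"))
```

The rapid mode simply discards the four-mismatch candidates, which is why it reduces the number of sites that need to be scored.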
Motif positions are searched either by exact matching, for motifs defined as sequences without IUPAC codes, or with regular expressions, for motifs containing IUPAC codes. Motif matrices are first converted to IUPAC sequences using a selected rule set and then searched with regular expressions. Reverse complementary sequences are also generated and used for the search on the reverse strand. For each motif position found, potential guides are generated using the rules for either Cas9 or dCas9.
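The regular expression search on both strands can be sketched as follows. This is a minimal illustration, not the transCRISPR source: the function names are hypothetical, and two details are assumptions made here, namely that overlapping occurrences are reported (via a lookahead) and that palindromic motifs are reported only once.

```python
import re

# IUPAC nucleotide codes expressed as regular expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif: str) -> str:
    """Reverse complement of a motif, including IUPAC ambiguity codes."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def iupac_to_regex(motif: str) -> str:
    """Translate an IUPAC motif into a regular expression pattern."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(sequence: str, motif: str):
    """Return sorted (start, end, strand) hits, 0-based and half-open."""
    sequence = sequence.upper()
    strands = [("+", motif.upper())]
    rc = reverse_complement(motif)
    if rc != motif.upper():  # assumption: report palindromic motifs once
        strands.append(("-", rc))
    hits = []
    for strand, pattern_motif in strands:
        # The lookahead lets overlapping occurrences be found as well.
        pattern = re.compile(f"(?=({iupac_to_regex(pattern_motif)}))")
        for match in pattern.finditer(sequence):
            hits.append((match.start(), match.start() + len(pattern_motif), strand))
    return sorted(hits)


if __name__ == "__main__":
    print(find_motif("TTCACGTGAA", "CANNTG"))
```

For a motif without ambiguity codes the generated pattern contains only literal bases, so the same function also covers the exact search case.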
For target sequences defined as coordinates, the respective sequences are extracted from the downloaded genome files and extended by 30 nucleotides before and 30 nucleotides after the defined coordinates. This enables the design of guides for motifs at the border of the sequence. This extension is not possible for target sequences provided as raw sequences or in FASTA format.
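The flank extension amounts to simple interval arithmetic, sketched below. Clamping the extended interval to the chromosome boundaries is an assumption made for this example; the text does not state how regions near chromosome ends are handled, and the function name is hypothetical.

```python
FLANK = 30  # nucleotides added on each side, as stated in the text


def extend_region(start: int, end: int, chrom_length: int, flank: int = FLANK):
    """Extend a 0-based, half-open region by `flank` nt on both sides.

    The result is clamped to [0, chrom_length] (an assumption of this sketch).
    """
    return max(0, start - flank), min(chrom_length, end + flank)


if __name__ == "__main__":
    print(extend_region(100, 200, chrom_length=1000))
    print(extend_region(10, 990, chrom_length=1000))
```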
Localization of motifs in relation to genes is determined as follows. First, data downloaded from UCSC are sorted and saved to dedicated .bed files, each containing data from a single chromosome sorted by sequence start. For each chromosome, the respective .bed file is searched sequentially for the previously sorted motif positions. If a motif lies on a boundary (e.g. intron-exon or exon-intergenic) or in a position where transcript variants differ with respect to intron/exon annotation, the following hierarchy is applied: coding exon > non-coding exon > intron > intergenic.
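The hierarchy can be expressed as a priority lookup over all annotation features overlapping a motif. The sketch below uses a plain linear scan for clarity rather than the sorted sequential search described above; the function name and the feature tuple layout are hypothetical.

```python
# Priority order taken from the hierarchy stated in the text.
PRIORITY = ["coding exon", "non-coding exon", "intron", "intergenic"]


def classify_motif(motif_start: int, motif_end: int, features):
    """Return the highest-priority feature type overlapping the motif.

    `features` is an iterable of (start, end, kind) tuples on the same
    chromosome, with 0-based, half-open coordinates.
    """
    overlapping = {
        kind
        for start, end, kind in features
        if start < motif_end and motif_start < end
    }
    for kind in PRIORITY:
        if kind in overlapping:
            return kind
    return "intergenic"  # no annotated feature overlaps the motif


if __name__ == "__main__":
    annotation = [
        (100, 200, "coding exon"),
        (200, 300, "intron"),
        (300, 350, "non-coding exon"),
    ]
    # A motif spanning the exon-intron boundary is assigned to the exon.
    print(classify_motif(195, 205, annotation))
```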
To determine the closest up- and downstream TSS, data from UCSC are similarly downloaded, sorted and saved to .bed files containing gene localization. During the analysis, the TSSs of the closest upstream and downstream gene are selected and saved for each of the sorted motifs.
For each motif and guide, a name is generated. If a target sequence name is given in the input (header in FASTA format or last column in the .bed file), this name is used as the prefix; otherwise, a generic prefix is created.
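With TSS positions sorted per chromosome, the two neighbouring TSSs of a motif can be found by binary search. In this sketch "upstream" and "downstream" are defined purely by genomic coordinate (lower and higher positions, respectively), which is an assumption of the example; the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def closest_tss(sorted_tss, motif_start: int, motif_end: int):
    """Return (upstream, downstream) TSS positions for one motif.

    `sorted_tss` is an ascending list of TSS positions on one chromosome.
    Upstream is the nearest TSS at or before motif_start, downstream the
    nearest TSS at or after motif_end; None is returned where none exists.
    """
    i = bisect_right(sorted_tss, motif_start)
    upstream = sorted_tss[i - 1] if i > 0 else None
    j = bisect_left(sorted_tss, motif_end)
    downstream = sorted_tss[j] if j < len(sorted_tss) else None
    return upstream, downstream


if __name__ == "__main__":
    tss_positions = [100, 500, 900]
    print(closest_tss(tss_positions, 300, 310))
```

Because both the motifs and the TSSs are sorted, a single forward pass over the two lists would work as well; binary search is used here only to keep the example short.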

RESULTS
TransCRISPR is an online tool dedicated to designing sgRNAs targeting various sequence motifs. This software performs several steps to calculate and display results for a given input: (i) processing sequences; (ii) processing motifs and finding motifs in sequences; (iii) calculating off-targets and on-targets; (iv) finding localization and calculating statistics.

User interface
To run a query, the user provides input data and selects available options (Figure 1). In Step 1, the reference genome is selected. Currently, thirty genome assemblies are available, including human, mouse, rat, fruit fly, zebrafish and C. elegans.
TransCRISPR offers several ways of submitting queries. In Step 2, sequence motifs can be entered directly in the window, uploaded as a FASTA or comma-separated file, or provided as a motif matrix in various formats, including output files of programs analyzing transcription factors (e.g. JASPAR, TRANSFAC). Motifs provided as a sequence can contain A, C, G, T nucleotides or IUPAC codes. Next, in Step 3, the target regions in which motifs will be searched for are provided. Target sequences may be pasted directly in the window, uploaded as text (FASTA or comma-separated format) or provided as genomic coordinates.
Nucleic Acids Research, 2023, Vol. 51, Web Server issue W579
In Step 4, several parameters are defined. If motifs were entered as a motif matrix, the user can choose the criteria according to which motifs will be generated from the matrix (see Manual for detailed information). TransCRISPR currently supports the canonical S. pyogenes PAM NGG as well as the variants NGA and NGCG (19). Next, one can choose between the Cas9, dCas9 and custom variants, which define how the sgRNAs are searched with respect to the motifs. In the Cas9 variant, only guides that lead to a cut within the motif (taking into account that the cut occurs 3 nt upstream of the PAM) are designed. In the dCas9 variant, any guides that overlap with at least one nucleotide of the motif are included. The third option is 'Custom', where the user defines the maximum distance of the PAM from the motif. This allows searching for PAMs within a given range from the motif and designing relevant sgRNAs. By default, off-targets with up to four mismatches are analyzed for all sgRNAs found. To reduce the analysis time, the 'rapid' option can be chosen, which includes only off-targets with up to three mismatches. The user can optionally enter an e-mail address to be informed when the calculations are done. During the analysis, the user is informed about its progress and current step, as well as about the task queue. For convenience, the details of the query can be checked later (Figure 2). Results are available on the website for seven days.
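The Cas9 and dCas9 selection rules described in Step 4 can be sketched for the NGG PAM as follows. This is an illustration under stated assumptions, not the transCRISPR implementation: guides are taken to be 20-nt protospacers, "cut within the motif" is interpreted as the cut position lying strictly inside the motif, and the dCas9 overlap is evaluated on the protospacer without the PAM. All names are hypothetical.

```python
import re

GUIDE_LEN = 20  # assumed protospacer length


def candidate_guides(seq: str):
    """Yield (start, end, strand, cut) for every NGG PAM in `seq`.

    Coordinates are 0-based and half-open on the '+' strand; `cut` is the
    boundary index at which Cas9 cleaves, 3 nt away from the PAM.
    """
    seq = seq.upper()
    # '+' strand: protospacer followed by NGG.
    for m in re.finditer(r"(?=([ACGT]{20}[ACGT]GG))", seq):
        p = m.start()
        yield p, p + GUIDE_LEN, "+", p + GUIDE_LEN - 3
    # '-' strand: the PAM appears as CCN in front of the protospacer.
    for m in re.finditer(r"(?=(CC[ACGT][ACGT]{20}))", seq):
        q = m.start()
        yield q + 3, q + 3 + GUIDE_LEN, "-", q + 6


def guides_for_motif(seq: str, motif_start: int, motif_end: int, variant="Cas9"):
    """Select guides for one motif according to the chosen variant."""
    selected = []
    for start, end, strand, cut in candidate_guides(seq):
        if variant == "Cas9":
            keep = motif_start < cut < motif_end  # cut falls inside the motif
        elif variant == "dCas9":
            keep = start < motif_end and motif_start < end  # any overlap
        else:
            raise ValueError(f"unsupported variant: {variant}")
        if keep:
            selected.append((start, end, strand))
    return selected


if __name__ == "__main__":
    target = "ACGTACGTACGTACGTACGT" + "AGG" + "TTTTTTTTTT"
    print(guides_for_motif(target, 15, 19, variant="Cas9"))
    print(guides_for_motif(target, 18, 24, variant="dCas9"))
```

The second call shows why the dCas9 variant returns more guides: the motif at positions 18-24 overlaps the protospacer, although the predicted cut site lies outside it.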

Analysis results and data visualization
The results page provides information about the number of motifs and sgRNAs found, the average number of guides per motif, and the average on- and off-target scores, which are additionally presented as histograms. The distribution of guides per motif and their genomic localization are presented as downloadable pie charts. Information about the motifs found includes the motif sequence, its position in the uploaded genomic sequence, its localization relative to genes (coding or non-coding exon, intron or intergenic), and the genomic positions of the transcription start sites (TSSs) of the closest up- and downstream genes. Information about the genomic localization of motifs and the closest TSSs is available only if genomic coordinates were provided for the target regions. Next, the sgRNAs found for the motifs are presented in a table showing their sequence, relative position in the target region and DNA strand, together with the calculated on- and off-target scores. It is possible to view the details of the most significant off-targets.
Results may be filtered by several parameters, which are described in detail in the manual. In some instances, identified motifs may partially overlap, and in such a situation some sgRNAs may be duplicated in the motif view. To obtain the list of nonredundant sgRNAs, the user should switch to the 'Unique guides' mode.
The results can be downloaded in several formats (xlsx, csv, tsv, bed). In the Excel file, separate sheets provide results per motif and per unique guide. It is also possible to download a track for visualizing motifs and guides, with their on- and off-target scores color-coded, in the UCSC Genome Browser. Moreover, by clicking Display in Genome Browser, the user is taken directly to the Genome Browser with this track loaded.
A detailed explanation of how to prepare a query and analyze the results is provided in the manual available on the transCRISPR webpage, also as a downloadable pdf. To get familiar with transCRISPR and the available options, it is advised to run one of the preloaded examples.
A summary of the features available in transCRISPR and a comparison with other available tools for sgRNA design is presented in Table 1.

Case 1: example of library design
We used transCRISPR to find motifs and sgRNAs within the ChIP peaks for the MYB transcription factor in human GM12878 cells (query details in Figure 2; 3748 MYB ChIP peaks retrieved from UCSC Table Browser). First, we chose the NGG PAM, the Cas9 variant and the standard off-target mode (Figure 3). After the calculations were finished, the upper panel showed the summary of motifs and guides (Figure 3A). Of the 975 identified motifs, 59.3% were targeted by 947 guides (mostly 1 or 2 sgRNAs per motif; for some, up to six sgRNAs were designed). The average on- and off-target scores were above 50, which indicated an overall good quality of the designed sgRNAs. Detailed information provided on histograms showed that the majority of sgRNAs had off-target scores above 70, and only a few had scores below 50. The predicted cutting efficiency was medium, as the majority of sgRNAs had an on-target score around 50. The identified motifs were mainly localized in introns or non-coding exons, much less frequently in intergenic regions, and only a few in coding exons. Next, we filtered the results to exclude motifs present in coding regions and guides with off-target scores <30. As a result, we obtained 956 motifs and 909 guides, with the average off-target score increased to 83 (Figure 3B). Figure 4 shows the various modes of presenting results in transCRISPR.
Changing the Cas9 variant in the query to dCas9 increased the proportion of motifs targeted by guides to >92% (Figure 5A). This is expected, as the rules for sgRNA design are broader in this option. In line with this, the number of guides per motif was more diversified (up to 8), with a prevalence of 1-4 sgRNAs per motif. The average on- and off-target values, as well as their distribution on histograms, were similar to those in the Cas9 mode. When the filters recommended for the dCas9 mode were applied, i.e. motifs localized between −200 nt and +100 nt relative to a TSS were excluded, the number of motifs found went down to 873 and the number of designed guides decreased to 2815 (Figure 5B).

Case 2: experimental validation of transCRISPR design
To confirm that transCRISPR is able to design sgRNAs efficiently targeting sequence motifs, we took as an example genes involved in the purine biosynthesis pathway which are known to be regulated by MYC: PPAT, GART, PFAS, PAICS and ATIC. MYC ChIP peaks proximal to these genes were retrieved from ENCODE data for K562 cells and used as the target sequence in transCRISPR, while the MYC motif matrix was obtained from JASPAR. For each MYC peak (PPAT and PAICS are localized in a head-to-head orientation with a common MYC peak within their promoter), transCRISPR identified 1-3 MYC binding motifs and designed sgRNAs targeting them (Figure 6A-D). The sgRNAs showed good specificity (score 75-99) and moderate predicted efficiency (score 52-75). Designed sgRNAs were cloned into the lentiCRISPRv2 puro vector and used to transduce K562 cells (Supplementary Methods, Supplementary Table S1-3). Based on TIDE analysis (25), all sgRNAs resulted in efficient DNA editing (100% for all sgRNAs, except for ATIC E-box 2, which was 78%) and disruption of the E-box motifs (Figure 6A-D, Supplementary Figure S1). Importantly, CRISPR editing of E-box sequences decreased expression of the studied genes. The most pronounced effect was observed for PFAS. Disruption of the intergenic E-box localized between PPAT and PAICS did not affect either gene, while targeting the two intronic E-boxes within PAICS reduced expression of PAICS but not PPAT (Figure 6E).
This experiment demonstrated that transCRISPR enables the design of efficient sgRNAs for targeted sequence motif disruption, which allows designing and performing experiments to answer biologically relevant questions.

DISCUSSION
We developed transCRISPR to facilitate the study of specific sequence motifs. This is a unique tool that enables the design of sgRNAs targeting a particular sequence in the region of interest and provides an important novel functionality for sgRNA design algorithms. The ability to disrupt or block sequence motifs can significantly facilitate research into their function. Importantly, transCRISPR offers a wide range of functionalities and options to tailor the results to the user's needs and is applicable to small queries as well as to the design of genome-wide sgRNA libraries.
The biggest improvement introduced by transCRISPR is a versatile, efficient and well-tested pipeline, which includes both custom code and available tools and algorithms. This pipeline is packaged for use in the queuing system, and the results are instantly displayed to the user in the form of comprehensive tables and diagrams. For calculating off-targets we use the Cas-OFFinder software, which has been shown to work much more efficiently on GPUs. Unfortunately, we did not manage to obtain a dedicated webserver with powerful graphics cards; therefore, we decided to increase the number of CPU cores available for calculations. If our server becomes overloaded, we will further increase the number of available cores.
We designed a consistent API for our webserver, but after multiple complex tests that took up to several days to complete, we decided to remove the possibility of submitting queries through the API, to prevent overuse of our server. At the moment, only the options to check status and download results (in JSON format) are available.

Future plans
We plan to further expand the list of available reference genomes. We will also include more Cas9 variants recognizing various PAM sequences. This will enable a more comprehensive design of sgRNAs targeting specific motifs. We welcome all suggestions for improvement and development