CRISPR-DIPOFF: an interpretable deep learning approach for CRISPR Cas-9 off-target prediction

Abstract CRISPR Cas-9 is a groundbreaking genome-editing tool that harnesses bacterial defense systems to alter DNA sequences accurately. This innovative technology holds vast promise in multiple domains like biotechnology, agriculture and medicine. However, such power does not come without its own peril, and one such issue is the potential for unintended modifications (Off-Target), which highlights the need for accurate prediction and mitigation strategies. Though previous studies have demonstrated improvement in Off-Target prediction capability with the application of deep learning, they often struggle with the precision-recall trade-off, which limits their effectiveness, and they do not provide a proper interpretation of the complex decision-making process of their models. To address these limitations, we have thoroughly explored deep learning networks, particularly recurrent neural network-based models, leveraging their established success in handling sequence data. Furthermore, we have employed a genetic algorithm for hyperparameter tuning to optimize these models' performance. The results from our experiments demonstrate significant performance improvement compared with the current state-of-the-art in Off-Target prediction, highlighting the efficacy of our approach. Furthermore, leveraging the power of the integrated gradient method, we make an effort to interpret our models, resulting in a detailed analysis and understanding of the underlying factors that contribute to Off-Target predictions, in particular the presence of two sub-regions in the seed region of the single guide RNA, which extends the established biological hypothesis of Off-Target effects. To the best of our knowledge, our model can be considered the first to combine high efficacy, interpretability and a desirable balance between precision and recall.


B. Transformer-Based Approach
In recent years, the development of transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers) and ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), has revolutionized natural language processing tasks. One key aspect that has contributed to the success of these models is pretraining, which involves training the models on large-scale datasets, allowing them to learn general language representations. We designed an ELECTRA-based pipeline to utilize large pretrained models for Off-Target prediction. ELECTRA is considered an improvement over BERT due to its adversarial training approach, which leads to better model generalization. It is computationally more efficient, requiring fewer parameters and less pretraining time. ELECTRA also exhibits improved robustness and stability, making it a preferred choice for natural language processing tasks.

Data Preprocessing
We utilized GRCh38 (Genome Reference Consortium Human Build 38) as the dataset for pretraining; it is a comprehensive assembly of the human genome [38] containing the autosomes, sex chromosomes, and the mitochondrial chromosome. We collected this dataset from the Ensembl [49] project's repository. We followed a data processing approach similar to that used in DeepCRISPR's pretraining, but used GRCh38 instead of GRCh37 as the source for the human genome sequence. We used the Cas-OFFinder tool to find sample pairs that allowed up to six nucleotide mismatches where the first sequence (sgRNA) ends with 'NGG'. This process resulted in a pretraining dataset comprising approximately 317 million samples. In our study, we explored two approaches to construct the vocabulary: overlapping tri-mer and byte pair encoding (BPE). Both approaches use five special tokens: [UNK], [CLS], [SEP], [MASK], and [PAD], representing unknown, classification, separator, mask, and padding tokens, respectively. Overlapping tri-mer tokenization divides a sequence into overlapping three-nucleotide segments. The other approach, unigram byte pair encoding (BPE), is a subword tokenization technique that iteratively merges the most frequent character pairs in a corpus to create a vocabulary of subword units.
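To make the overlapping tri-mer scheme concrete, the following is a minimal sketch of how such a vocabulary can be built; the function and variable names are ours and purely illustrative, not the authors' code (a BPE vocabulary would instead be learned from the corpus with a subword tokenizer library).

```python
# Illustrative sketch of the overlapping tri-mer vocabulary described above.
# Names (trimer_tokenize, TRIMER_VOCAB, TOKEN_TO_ID) are hypothetical helpers.
from itertools import product

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def trimer_tokenize(seq):
    """Split a nucleotide sequence into overlapping 3-mers, e.g. 'GACGT' -> ['GAC', 'ACG', 'CGT']."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

# Tri-mer vocabulary: 5 special tokens plus all 4^3 = 64 possible tri-mers.
TRIMER_VOCAB = SPECIAL_TOKENS + ["".join(p) for p in product("ACGT", repeat=3)]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TRIMER_VOCAB)}
```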
After forming the vocabulary, we preprocessed the pretraining samples. Each sequence pair was first tokenized. The tokens of the two sequences were separated by a [SEP] token, and a [CLS] token was inserted before the first token of the sgRNA. Another [SEP] token was added at the end of the target DNA tokens. The maximum sequence length was set to 48, and the remaining positions after the input tokens were filled with [PAD] tokens. An input mask was generated to identify the two sequences (0 for sgRNA and 1 for the potential target DNA), and an attention mask differentiated the actual input tokens from the [PAD] tokens.
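A minimal sketch of this packing of an (sgRNA, target DNA) pair into the fixed 48-position input is shown below; it builds on the hypothetical trimer_tokenize helper from the previous snippet and is an illustration under our assumptions, not the authors' preprocessing code.

```python
# Sketch: encode an sgRNA / target-DNA pair as [CLS] sgRNA [SEP] DNA [SEP] [PAD]...
# with token-type ids (0 = sgRNA segment, 1 = DNA segment) and an attention mask.
MAX_LEN = 48

def encode_pair(sgrna, dna):
    sg_tokens = trimer_tokenize(sgrna)    # a 23-nt sequence yields 21 overlapping tri-mers
    dna_tokens = trimer_tokenize(dna)
    tokens = ["[CLS]"] + sg_tokens + ["[SEP]"] + dna_tokens + ["[SEP]"]
    # assumption: [CLS] and the first [SEP] belong to the sgRNA segment
    token_type_ids = [0] * (len(sg_tokens) + 2) + [1] * (len(dna_tokens) + 1)
    attention_mask = [1] * len(tokens)
    pad = MAX_LEN - len(tokens)           # for two 23-mers this leaves 3 [PAD] positions
    input_ids = [TOKEN_TO_ID.get(t, TOKEN_TO_ID["[UNK]"]) for t in tokens]
    return {
        "input_ids": input_ids + [TOKEN_TO_ID["[PAD]"]] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }
```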

Pretraining the ELECTRA model
In the process of pretraining ELECTRA, approximately 15% of the input tokens were selected for replacement with a [MASK] token. This masked portion of the data was then passed through a generator, which substituted the masked tokens with generated tokens. The discriminator's task was to determine whether each token had been replaced by the generator or not. To accommodate limited resources, we conducted pretraining using two relatively smaller versions of the ELECTRA model, namely, the Tiny and the Small models. The training process consisted of 100,000 steps with a batch size of 128. Towards the end of the training, the loss stabilized over the last few thousand steps. The two smaller versions of the ELECTRA model employed the parameter configurations shown in Table 5. We kept all other parameters consistent with the official ELECTRA code repository. The Tiny model underwent training for approximately 7 days, while the Small model was trained for 10 days. Both models were trained separately using tri-mer and byte pair encoded tokens.
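As a rough illustration of this replaced-token-detection objective, the sketch below wires a small generator and discriminator together using the HuggingFace transformers ELECTRA classes; note that our pretraining used the official ELECTRA repository, so this is only an analogy of the objective, and the layer sizes, greedy sampling, and loss weighting here are illustrative rather than the exact configurations of Table 5.

```python
# Minimal sketch of ELECTRA's replaced-token-detection objective (illustrative sizes).
import torch
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining

vocab_size = 69  # e.g. 64 tri-mers + 5 special tokens (illustrative)

gen_cfg = ElectraConfig(vocab_size=vocab_size, hidden_size=64, num_hidden_layers=3,
                        num_attention_heads=1, intermediate_size=256,
                        max_position_embeddings=48)
disc_cfg = ElectraConfig(vocab_size=vocab_size, hidden_size=256, num_hidden_layers=12,
                         num_attention_heads=4, intermediate_size=1024,
                         max_position_embeddings=48)
generator = ElectraForMaskedLM(gen_cfg)        # fills in the [MASK] positions
discriminator = ElectraForPreTraining(disc_cfg)  # detects which tokens were replaced

def electra_step(input_ids, attention_mask, mask_token_id, mask_prob=0.15):
    # 1) mask roughly 15% of the non-padding tokens
    mask = (torch.rand_like(input_ids, dtype=torch.float) < mask_prob) & attention_mask.bool()
    masked = input_ids.clone()
    masked[mask] = mask_token_id
    mlm_labels = input_ids.clone()
    mlm_labels[~mask] = -100                   # only masked positions contribute to MLM loss

    # 2) generator proposes replacements (greedy here instead of sampling, for brevity)
    gen_out = generator(input_ids=masked, attention_mask=attention_mask, labels=mlm_labels)
    corrupted = torch.where(mask, gen_out.logits.argmax(dim=-1), input_ids)

    # 3) discriminator predicts, per token, whether it was replaced
    disc_labels = (corrupted != input_ids).long()
    disc_out = discriminator(input_ids=corrupted, attention_mask=attention_mask,
                             labels=disc_labels)
    return gen_out.loss + 50.0 * disc_out.loss  # ELECTRA up-weights the discriminator loss
```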

Finetuning the ELECTRA Model for Off-Target Prediction
Following pretraining, the ELECTRA model was finetuned on a specific downstream task: it was trained on the labeled DeepCRISPR dataset. During finetuning, the model's parameters were adjusted to optimize its performance on the task's objective using supervised learning. We conducted experiments with different numbers of layers (ranging from 1 to 3) attached to the output of the [CLS] token of the ELECTRA model and finetuned them. Models pretrained on both tri-mer and byte pair encoded tokens were finetuned separately. Initially, during the finetuning process, the model faced difficulties in learning: it exhibited a tendency to predict all samples as either 1 or 0.
To overcome this, we employed a training technique known as gradual unfreezing [16]. This involved initially freezing the weights of the ELECTRA model and updating only the output layer weights for a few iterations. Subsequently, we gradually unfroze the weights of each ELECTRA layer, starting from the layer closest to the output layer in a bottom-up manner. While this approach led to gradual improvement in the performance of the Tiny model, the Small model still exhibited a tendency to predict all samples as either 0 or 1. As a result, we have focused on comparing the results obtained using the Tiny model. The additional layers attached to the ELECTRA encoder, ranging from 1 to 3 and including the output layer, were designed specifically for classification; by introducing these extra layers, we aimed to capture more intricate patterns and improve the model's ability to make accurate predictions. Gradual unfreezing allows a controlled update of the model's weights, starting with the output and additional layers and progressively unfreezing the encoder layers closest to them. In our experiments, we conducted separate finetuning procedures on the Tiny models that were pretrained using Tri-Mer Encoded (TME) and Byte Pair Encoded (BPE) input tokens. The results of these experiments, shown in Table 6 and Table 7 for the TME and BPE models, respectively, shed light on the impact of these encoding approaches on the model's performance. Notably, we observed a significant performance advantage for the TME models compared to the BPE models. Among the TME models, the one that yielded the best results incorporated two additional layers, a hidden layer and an output layer, following the ELECTRA encoder layers.
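The gradual-unfreezing schedule can be sketched as follows, assuming a HuggingFace ElectraModel backbone with a small classification head attached to the [CLS] output; the helper names and the exact schedule granularity are ours, for illustration only.

```python
# Sketch of gradual unfreezing for a finetuned ELECTRA classifier (assumed structure:
# model.electra is an ElectraModel backbone, model.classifier is the added head).
def freeze_backbone(model):
    """Start finetuning with every ELECTRA encoder weight frozen; only the head trains."""
    for p in model.electra.parameters():
        p.requires_grad = False

def unfreeze_top_layers(model, k):
    """Unfreeze the k encoder layers closest to the output (the head stays trainable)."""
    encoder_layers = model.electra.encoder.layer   # ModuleList of transformer blocks
    for layer in list(encoder_layers)[-k:]:
        for p in layer.parameters():
            p.requires_grad = True

# Typical schedule (illustrative): train the head alone for a few iterations, then call
# unfreeze_top_layers(model, k) with k = 1, 2, ... every few epochs until the whole
# backbone participates in finetuning.
```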

Results of ELECTRA model
Though the overall result does not outperform the baseline models and the RNN-based models, we observe that gradual unfreezing improved the performance of the models. Gradual unfreezing provides better results because it helps to mitigate the problem of catastrophic forgetting. Catastrophic forgetting is a phenomenon that occurs when a machine learning model is finetuned on a new task and loses its ability to perform well on the original task. While finetuning the ELECTRA model, we are essentially updating the weights of the model to better fit the new task. However, if we update all of the weights at once, it is possible that the model will forget how to perform the original task, which is capturing the general context of the sequences. Gradual unfreezing helps to mitigate this problem by unfreezing the layers of the model from the last layer to the first layer. The last layer of the model contains the least general knowledge, so it is the least likely to be affected by finetuning on a new task.
Figure 9 shows the effect of gradual unfreezing on performance metrics for the best-performing ELECTRA model. It shows that most of the metrics improved gradually, except recall. The initial model had a tendency to predict a large number of Off-Targets, which hurt the precision of the model. As the model gradually struck a balance between precision and recall, precision increased and recall decreased.

Observations
Instead of training the model with the complete human genome, which would have required significant time and resources, we followed the pretraining approach of DeepCRISPR by selecting samples generated by Cas-OFFinder [3]. This resulted in a much smaller pretraining dataset, which was manageable within our limited computational resources. However, this limited pretraining dataset may have hindered the generalizability of the model. The overlapping nature of tri-mers during training could also have led the model to easily predict a masked token based on the previous or next token, potentially resulting in earlier convergence and limited generalization. To address these limitations, a more appropriate approach would involve training the ELECTRA model with the entire human genome sequence. This would not only enhance its application for Off-Target prediction but also make it valuable for other tasks related to human genome analysis. We note, however, that despite its below-par performance with respect to some of the baselines and other models in our CRISPR-DIPOFF suite, we find it worth discussing for the following reasons. Firstly, Large Language Models (LLMs) have the potential to propel the advancement of bioinformatics, similar to their impact on NLP. Secondly, despite pretraining ELECTRA in our resource-constrained setting, it showed an indication of possible performance improvement. This suggests that a properly pretrained ELECTRA or similar LLM could be a Swiss army knife for any biological sequence-related prediction task. Finally, the interpretations of the finetuned ELECTRA model act as a second, independent (computational) validation of our interesting observation obtained from the LSTM model's interpretation, as discussed in the following section.

Interpretation of the ELECTRA Model
The ELECTRA model, finetuned for Off-Target prediction, did not perform as expected. Among the different ELECTRA models, the one pretrained on tri-mer encoded tokens and finetuned with two additional layers performed better than the others. Despite not meeting our initial performance expectations, we have attempted to interpret the model in a limited capacity. Given the complexity and size of the model, which comprises approximately 3.2 million parameters, interpreting its inner workings becomes an arduous task. To overcome this challenge, we focused our interpretation efforts on the embedding layers of the finetuned ELECTRA model. The ELECTRA model consists of three specific embedding layers for token, token type, and position embedding. We employed the integrated gradients method to calculate attribution scores for these layers. The input tokens used in our experiments are tri-mer encoded sgRNA and DNA sequences, separated by a [SEP] token, with a [CLS] token at the start and another [SEP] token at the end. In order to align with the maximum sequence length, the last three empty positions were filled with [PAD] tokens. Integrated gradients require a baseline sample to compute the gradient along the path; generally, an input vector containing all zeros is used for that purpose. In the case of ELECTRA, the baseline sample was prepared with [PAD] tokens.
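A minimal sketch of this attribution computation, using Captum's LayerIntegratedGradients on the token-embedding layer with a [PAD]-filled baseline, is given below; `model`, `pad_token_id`, and the input tensors are assumed to come from the finetuning pipeline, and the model structure is our assumption rather than the paper's exact code.

```python
# Sketch: integrated gradients at the token-embedding layer of a finetuned ELECTRA
# classifier, using a baseline sequence made entirely of [PAD] tokens.
import torch
from captum.attr import LayerIntegratedGradients

def forward_logits(input_ids, token_type_ids, attention_mask):
    # `model` is the finetuned ELECTRA classifier; assumed to return class logits.
    return model(input_ids=input_ids,
                 token_type_ids=token_type_ids,
                 attention_mask=attention_mask)

lig = LayerIntegratedGradients(forward_logits,
                               model.electra.embeddings.word_embeddings)

baseline_ids = torch.full_like(input_ids, pad_token_id)   # all-[PAD] baseline sample

attributions = lig.attribute(
    inputs=input_ids,
    baselines=baseline_ids,
    additional_forward_args=(token_type_ids, attention_mask),
    target=1,                           # attribution with respect to the positive class
)
token_scores = attributions.sum(dim=-1)  # collapse the embedding dimension: one score per token
```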
We computed the average attribution scores for each token across positive, negative, and overall predictions on the finetuned model. To observe the change in the embedding layer, we also computed attribution scores for overall predictions on the pretrained model (i.e., the model before finetuning). The token importances are illustrated in Figure 10, which clearly demonstrates that the tokens associated with the DNA sequence exhibit significantly higher attribution scores than the sgRNA tokens. This observation is well aligned with the understanding that the sgRNA sequence is engineered, while the target DNA sequence is responsible for introducing mismatches. In practical scenarios, a single sgRNA can potentially interact with numerous off-target DNA sites, and the final determination of off-target effects depends on the mismatches introduced by the DNA sequences. It appears that the embedding layers of the finetuned ELECTRA model have successfully captured this relationship from the training data used during finetuning. The comparison of attributions before and after finetuning illustrates how the embedding layer has changed and become aware of the biological significance of the specific task of Off-Target prediction.
Similar to the interpretation of our LSTM model, we observe that the attribution scores for negative and overall predictions are almost the same. The attribution scores for overall predictions show four roughly distinguishable regions in the DNA tokens. The first region is related to the PAM, and it affects the final prediction negatively. There are two regions with contiguous positive scores: one in the seed region (positions 15 to 20) and one in the PAM-distal region (positions 1 to 6). The region in the middle of these two contains contiguous negative attributions. This is consistent with our findings in the LSTM model's interpretation, though the regions do not match exactly nucleotide by nucleotide. This further strengthens our observation that there are two sub-regions in the seed region and that one of them might be tolerant of mismatches. Further investigation is required to validate this observation. We anticipate that improving the performance with large-scale pretraining of the ELECTRA model could unlock more complex biological relationships and a deeper understanding of Off-Target effects.

C. Comparison of Performance
A comparison of our models with previous studies is shown in Table 8.

Table 2.
Parameters and Results of Best RNN, LSTM, and GRU Models Obtained from Genetic Algorithm with 5-Channel Encoded Input. AUPRC scores have been calculated on the validation set. (Columns: Model Type, Iteration, Hidden Size, LSTM Layers, Bi-LSTM, Hidden Layers, Dropout Probability, Batch Size, Epochs, Learning Rate, AUPRC.)

Table 3.
Parameters and Results of Best RNN, LSTM, and GRU Models Obtained from Elitist Genetic Algorithm with 4-Channel Encoded Input. AUPRC scores have been calculated on the validation set. (Columns as in Table 2.)

Table 4.
Parameters and Results of Best RNN, LSTM, and GRU Models Obtained from Elitist Genetic Algorithm with 5-Channel Encoded Input. AUPRC scores have been calculated on the validation set. (Columns as in Table 2.)

Fig. 9.
Effect of gradual unfreezing on different performance metrics. All the performance metrics increased gradually as the parameters of the ELECTRA layers were unfrozen for finetuning one by one. The only exception was recall, which decreased as precision increased.

Fig. 10.
Attribution scores at the embedding layers for all the input tokens of the finetuned ELECTRA model for (a) positive, (b) negative, and (c) overall predictions before finetuning, and (d) overall predictions after finetuning, with respect to the positive class.

Table 1.
Parameters and Results of Best RNN, LSTM, and GRU Models Obtained from Genetic Algorithm with 4-Channel Encoded Input. AUPRC scores have been calculated on the validation set. (Columns: Model Type, Iteration, Hidden Size, LSTM Layers, Bi-LSTM, Hidden Layers, Dropout Probability, Batch Size, Epochs, Learning Rate, AUPRC.)

Table 5.
Parameters for the ELECTRA Tiny and Small models.

Table 6.
Results of finetuned ELECTRA models pretrained with Tri-Mer Encoded (TME) input tokens. The results show how performance improved as the ELECTRA layers were gradually unfrozen.

Table 7.
Results of finetuned ELECTRA models pretrained with Byte Pair Encoded (BPE) input tokens. The results show how performance improved as the ELECTRA layers were gradually unfrozen.

Table 8.
Comparison of the results of our study with baseline studies. The models of our study have the prefix "CRISPR-DIPOFF". Our LSTM model outperformed the models from previous studies by a fair margin.