SeedMatchR: identify off-target effects mediated by siRNA seed regions in RNA-seq experiments

Abstract Motivation On-target gene knockdown, using siRNA, ideally results from binding fully complementary regions in mRNA transcripts to induce direct cleavage. Off-target siRNA gene knockdown can occur through several modes, one being a seed-mediated mechanism mimicking miRNA gene regulation. Seed-mediated off-target effects occur when the ∼8 nucleotides at the 5’ end of the guide strand, called a seed region, bind the 3’ untranslated regions of mRNA, causing reduced translation. Experiments using siRNA knockdown paired with RNA-seq can be used to detect siRNA sequences with off-target effects driven by the seed region. However, there are limited computational tools designed specifically for detecting siRNA off-target effects mediated by the seed region in differential gene expression experiments. Results SeedMatchR is an R package developed to provide users a single, unified resource for detecting and visualizing seed-mediated off-target effects of siRNA using RNA-seq experiments. SeedMatchR is designed to extend current differential expression analysis tools, such as DESeq2, by annotating results with predicted seed matches. Using publicly available data, we demonstrate the ability of SeedMatchR to detect cumulative changes in differential gene expression attributed to siRNA seed region activity. Availability SeedMatchR is available on CRAN. Documentation and example workflows are available through the SeedMatchR GitHub page at https://github.com/tacazares/SeedMatchR.


Introduction
RNA interference approaches utilizing small interfering RNA (siRNA) to downregulate gene expression have shown efficacy for the treatment of genomic-related diseases (Hu et al. 2020, Alshaer et al. 2021, de Brito et al. 2022, Schlegel et al. 2022).Argonaut (AGO) proteins complex with siRNA to form the machinery that targets transcripts with full sequence complementarity to the siRNA antisense sequence (guide) for degradation (Gorski et al. 2017, Alshaer et al. 2021, Kobayashi et al. 2021, Kobayashi et al. 2022).Knockdown of mRNA due to full sequence complementarity is considered an on-target hit (Alshaer et al. 2021, Kobayashi et al. 2021, 2022, Schlegel et al. 2022).MicroRNA (miRNA) are small non-coding RNA that also control gene expression through the AGO system by binding with varying degrees of sequence complementarity to the 3' UTR regions of protein coding transcripts (Bartel 2009, Gorski et al. 2017, McGeary et al. 2019).The 5' end of the guide sequence of both miRNA and siRNA, called a seed region, is responsible for mediating site recognition (Bartel 2009, Kobayashi et al. 2022).The seed region is roughly defined as the first eight nucleotides of the 5' end of the guide sequence (Bartel 2009).Binding of the seed region to the 3' UTR region of transcripts can result in repression of protein translation, as opposed to the cleavagemediated degradation of transcripts by siRNA complementarity (McGeary et al. 2019, Alshaer et al. 2021, Kobayashi et al. 2021, 2022).Due to overlapping mechanisms of action between siRNA and miRNA guide sequences, siRNA can also take on a miRNA like mechanisms of gene repression that are considered off-target hits (Janas et al. 2018, Kobayashi et al. 2021, 2022, Schlegel et al. 2021, 2022).
Current research has shown that siRNA can bind the 3' UTR of transcripts through a seed-mediated binding mechanism at suprapharmacological doses (Janas et al. 2018, Schlegel et al. 2021, 2022).Off-target binding of siRNA seed regions to the 3' UTR of transcripts has been shown to lead to hepatotoxicity in rodent models and cumulative changes in the gene expression profiles that can be detected with RNA-seq (Janas et al. 2018, Schlegel et al. 2021, 2022).Despite the growing interest around developing therapeutic siRNA with limited off-target effects, there is still a lack of software specifically designed to bring together all aspects of siRNA off-target detection.For example, many algorithms have been developed to predict the binding of miRNAs to a target transcript using state-of-the-art approaches, but do not provide additional functions for visualizing or evaluating the statistical significance of those predictions in the context of a differential expression analysis (Agarwal et al. 2018, McGeary et al. 2019, Soutschek et al. 2022).Many popular tools, such as TargetScan, provide tools for miRNA target prediction and assessment, but tools are written in several languages that require an experienced bioinformatician to run (Agarwal et al. 2018).SeedMatchR is an R package developed to fill the need for a single source of accessible tools to help diagnose seed-mediated off-target effects of siRNA using RNA-seq experiments.The goal of SeedMatchR is to provide an easy-to-use framework for assessing siRNA off-target effects that can easily fit into new or established differential expression workflows (Supplementary Fig. S1A).SeedMatchR has functions to help with common bioinformatics tasks such as preparing transcriptome annotations, defining seed regions for a given siRNA, statistical testing for cumulative changes in gene expression profiles, and generating publication quality figures.

Visualize siRNA and seed definitions
The functions available through SeedMatchR were designed to make it easier to explore your siRNA experiments.The plot_seeds() function will plot the siRNA sequence, in addition to the default seed definitions available for that siRNA (Fig. 1A).The get_seed() function returns an object containing the Biostrings sequences for a given seed definition (Fig. 1B).Alternatively, users can also define a custom seed definition by providing the start and stop position of interest.These functions are useful for reproducibly defining the target DNA sequence that is part of the string search and visualizing how it was derived (Fig. 1B).

Prepare required annotations
There are four required inputs for SeedMatchR: a data.frame of differential expression results, a species-specific GTF GRanges object, a species-specific DNAStringSet for genomic DNA, and a siRNA guide sequence !8 nt (Fig. 1C, Lawrence et al. 2013, Pages et al. 2023).The SeedMatchR package provides functions to help quickly and reproducibly generate the required annotations, taking advantage of the Bioconductor environment.Genomic features of interest (3' UTR, 5' UTR, exons, introns, or CDS) are derived from a Txdb object generated using the species-specific GTF file (Lawrence et al. 2013).Built-in SeedMatchR annotation databases use ENSEMBL gene IDs, but SeedMatchR can be used with custom annotations as well.The SeedMatchR documentation contains tutorials on how to generate required annotations as well as how to adjust parameters for custom gene sets.

Search annotations for seed matches
Bioconductor contains the packages Biostrings and GenomicFeatures which are tools that can be used for performing string searches in common genomic annotations (Lawrence et al. 2013, Pages et al. 2023).The function SeedMatchR() is designed to extend differential expression analysis workflows by annotating each gene with the number of matches found to an siRNA seed using the function vcountpattern() (Fig. 1A, Pages et al. 2023).The inputs to the primary SeedMatchR() function are a differential expression results data.frame, a GTF GRanges object, a DNAStringSet object of features to search, and the guide sequence for the siRNA.SeedMatchR was developed and tested on differential expression data from DESeq2, but any differential expression method can be used if the appropriate columns are represented.

Statistical analysis
The functions de_fc_ecdf() and ecdf_stat_test() are provided to streamline the comparison of empirical cumulative distribution functions (ECDF) for genes with or without a seed match (Fig. 1D).Comparison of the ECDFs for genes with a seed match to a background set has been applied to assess the effects of seed-mediated binding on gene expression profiles in miRNA and siRNA experiments (Grimson et al. 2007, Janas et al. 2018, McGeary et al. 2019, Schlegel et al. 2021, 2022).The ECDF analysis is used to gauge whether the distribution of log 2 (fold changes) observed between two sets of genes is different using the Kolmogorov-Smirnov (KS) test.Statistical testing with the KS test is performed by comparing the D statistic (Dstat), or difference, between the two ECDFs.In the case of seed-mediated effects, we expect the distribution of log 2 (fold changes) for transcripts with seed matches to be shifted to the left (downregulated) compared to a background set.We use a one-sided KS test to determine if the ECDF of transcripts with a seed match is less than the background distribution.The statistical functions provided can be run independently on any differential expression results looking to compare the ECDFs of different gene categories.For example, de_fc_ecdf() can be used to plot ECDFs for any set of genes, such as those derived from alternative binding prediction algorithms like scanMiR (Soutschek et al. 2022).

Seed-mediated off-target analysis
SeedMatchR comes with functions to download and process an example dataset from siRNA experiments targeting the Ttr transcript in female rat liver at suprapharmacological doses (Schlegel et al. 2022).The purpose of this study was to demonstrate the mitigation of hepatotoxicity caused by off-target seed-binding through chemical modifications of the siRNA seed region (Schlegel et al. 2022).The siRNA sequence with seed driven off-target effects is named Ttr D1.For this analysis, DESeq2 was used to perform differential expression analysis of rats treated with siRNA targeting the Ttr gene with different modifications compared to a vehicle control group (Schlegel et al. 2022).Each treatment condition has four biological replicates that are combined for the differential expression analysis using DESeq2.Using SeedMatchR, we can replicate the published results showing a significant shift in the ECDF for genes with a seed match compared to those without any seed matches in rats treated with 30 mg/kg of siRNA (Fig. 1D).Concordant with published results, cumulative down-regulation of predicted siRNA target genes expression was mitigated by the addition of a glycol nucleic acid at position g7 of the seed region in the siRNA called Ttr D4 (Fig. 1E, Schlegel et al. 2022).

Predicting effects of siRNA treatment on miRNA target gene expression
siRNA and miRNA share the same AGO machinery used for binding and mediating mRNA knock-down (Gorski et al. 2017, Alshaer et al. 2021).Early studies have shown that introduction of exogenous small RNA can perturb the activity of endogenous miRNAs by co-opting the AGO2 machinery (Khan et al. 2009, Saito andSaetrom 2012).SeedMatchR can be used to assess potential off-target effects on miRNA pathways using ECDF analysis of predicted miRNA targets compared to background sets of genes (Fig. 1F).Analysis of potential perturbation of miRNA pathways requires knowledge of miRNA expression in the experimental model (Supplementary Fig. S1B).This example uses the previous dataset for the Ttr D1 sequence as before, but we focus on the SeedMatchR can be used to detect siRNA seed-mediated off-target effects in RNA-seq experiments by comparing the ECDF of log 2 (fold changes) between genes with and without a seed match.A KS test is used to calculate the Dstat where the difference between the two groups is the greatest (Dstat: 0.138674, P-value: 7.74 Â 10 À8 ).(E) ECDF curve of log 2 (fold changes) for a siRNA glycol nucleic acid modification that helps reduce off-target effects compared to siRNA without the modification (Dstat: 0.049007, P-value: 8.239 Â 10 À2 ).(F) SeedMatchR can be used to detect cumulative changes in the expression profiles of miRNA target genes.The ECDF for miRNA targets shows a shift to the right compared to background genes (Dstat: 0.197063, P-value: 2.332 Â 10 À7 ).(G) SeedMatchR can be used to detect small activating RNA activity using promoter sequences as input.This analysis detected a positive increase in the ECDF of genes that have promoters with seed matches (Dstat: 0.048989, P-value: 2.295 Â 10 À3 ).
10 mg/kg dose.We collected public data for miRNA expression in 6-week-old female rat liver available from a tissue survey of miRNA expression in rats (Yao et al. 2022).The read counts per miRNA were normalized by library size and averaged across replicates for the control samples in the experiment.The top 10 expressed miRNAs were considered for downstream analysis.The miRDB database was used to predict the targets of the top 10 expressed miRNAs with a target score of !90 (Chen and Wang 2020).Using ECDF analysis of miRNA target genes shows a positive increase in the expression of predicted miRNA targets compared to background (Fig. 1F).These finding suggest that miRNA pathways could potentially be perturbed by siRNA treatment.

Using SeedMatchR with any genomic feature
SeedMatchR functions can be used with a DNAStringSet for any type of feature.Since SeedMatchR is built on GenomicFeatures and Biostrings functions, users can use any of the built-in functions from these packages to generate custom sets of features.For example, some small RNA molecules are designed to activate gene expression through stabilizing the promoter regions of some gene bodies in the nucleus (Kwok et al. 2019).The main difference between these small activating RNA from siRNA is the sequence they are designed to target (Kwok et al. 2019).To assess the potential for gene activation, transcript features can be swapped out for genomic promoter sequences (Supplementary Fig. 1C).Users can derive promoter sequences using the function getPromoterSeq() from the GenomicFeatures package.Using the same data from previous examples for Ttr D1 siRNA treatment at 30 mg/kg, there is a slight increase in the positive fold changes of some genes with a canonical seed match (Fig. 1G).These results suggest potential gene expression activation upon treatment of siRNA at high doses.

Conclusion
The SeedMatchR R package is intended to improve the reproducibility and access to commonly performed tasks in siRNA and miRNA experiments paired with RNA-seq data.The example datasets included in this publication illustrate the flexibility of SeedMatchR for detecting different types of off-target effects described in the literature that can attributed to different aspects siRNA-based gene knockdown.

Figure 1 .
Figure 1.Overview of SeedMatchR functions and example applications.(A) The function plot_seeds() is used for exploring siRNA sequences and potential seed definitions.The purple box is highlighting the default seed (mer7m8) used by SeedMatchR.(B) Example outputs from get_seed() showing the seed sequence and target sequence that will be searched for the mer7m8 (left) and mer6 (right) seeds.(C) SeedMatchR inputs and outputs.There are 4 required inputs: a siRNA guide sequence, a DESeq2 results data.frame, a species-specific GTF, and a feature specific DNAStringSet.The output of SeedMatchR is an additional column containing the seed match counts.(D)SeedMatchR can be used to detect siRNA seed-mediated off-target effects in RNA-seq experiments by comparing the ECDF of log 2 (fold changes) between genes with and without a seed match.A KS test is used to calculate the Dstat where the difference between the two groups is the greatest (Dstat: 0.138674, P-value: 7.74 Â 10 À8 ).(E) ECDF curve of log 2 (fold changes) for a siRNA glycol nucleic acid modification that helps reduce off-target effects compared to siRNA without the modification (Dstat: 0.049007, P-value: 8.239 Â 10 À2 ).(F) SeedMatchR can be used to detect cumulative changes in the expression profiles of miRNA target genes.The ECDF for miRNA targets shows a shift to the right compared to background genes (Dstat: 0.197063, P-value: 2.332 Â 10 À7 ).(G) SeedMatchR can be used to detect small activating RNA activity using promoter sequences as input.This analysis detected a positive increase in the ECDF of genes that have promoters with seed matches (Dstat: 0.048989, P-value: 2.295 Â 10 À3 ).