eSkip-Finder: a machine learning-based web application and database to identify the optimal sequences of antisense oligonucleotides for exon skipping

Abstract Exon skipping using antisense oligonucleotides (ASOs) has recently proven to be a powerful tool for mRNA splicing modulation. Several exon-skipping ASOs have been approved to treat genetic diseases worldwide. However, a significant challenge is the difficulty in selecting an optimal sequence for exon skipping. The efficacy of ASOs is often unpredictable, because of the numerous factors involved in exon skipping. To address this gap, we have developed a computational method using machine-learning algorithms that factors in many parameters as well as experimental data to design highly effective ASOs for exon skipping. eSkip-Finder (https://eskip-finder.org) is the first web-based resource for helping researchers identify effective exon skipping ASOs. eSkip-Finder features two sections: (i) a predictor of the exon skipping efficacy of novel ASOs and (ii) a database of exon skipping ASOs. The predictor facilitates rapid analysis of a given set of exon/intron sequences and ASO lengths to identify effective ASOs for exon skipping based on a machine learning model trained by experimental data. We confirmed that predictions correlated well with in vitro skipping efficacy of sequences that were not included in the training data. The database enables users to search for ASOs using queries such as gene name, species, and exon number.


INTRODUCTION
Exon skipping is a strategy that uses antisense oligonucleotides (ASOs) to exclude specific exons from the mature mRNA transcript of a given gene. ASOs are short nucleic acid analogs of diverse chemistry that recognize target mRNA sequences by base pairing. Once hybridized to their targets, ASOs act as steric blockers that prevent splicing factors and other critical proteins from accessing these sequences (1). Through this mechanism, ASOs can be designed to modulate splicing, for example, by targeting exonic splice enhancer sequences. Given its simplicity and versatility, exon skipping has evolved to become a promising treatment for various genetic disorders, particularly muscular dystrophies (2,3).
Exon skipping is showing promise as a therapy to treat Duchenne muscular dystrophy (DMD) and other genetic diseases (1). Most cases of DMD are caused by large, out-of-frame deletions in the DMD gene, leading to an absence of the sarcolemma-stabilizing dystrophin protein in muscle cells (4–6). Exon skipping was adapted to make out-of-frame DMD mutations in-frame by removing incompatible exons from the final transcript. In this manner, exon skipping facilitates the production of shorter but partially functional dystrophin protein in muscle, ameliorating DMD pathology. Recent years have seen the approval of four exon-skipping ASOs for DMD therapy by the U.S. Food and Drug Administration (FDA): eteplirsen (2016, Sarepta), golodirsen (2019, Sarepta), viltolarsen (2020, NS and NS Pharma), and casimersen (2021, Sarepta) (7–9). In addition, the FDA approved the first n-of-1 clinical trial with an exon-skipping ASO named milasen to treat a single patient with Batten's disease in 2018 (10).
While these approvals support the outlook of exon skipping as a viable therapeutic strategy for genetic diseases, there is much to improve, especially regarding efficacy. For instance, eteplirsen restored dystrophin to only about 1% of healthy levels after 180 weeks of treatment in DMD patients (7). Previous studies from our group demonstrate the utility of in silico methods to design more effective ASOs (11–14). In one study, we developed an ASO with 12-fold higher in vitro exon skipping efficacy than eteplirsen using an in silico predictive tool based on statistical modelling (12). Such work and others have since uncovered numerous factors that could influence the exon skipping efficacy of an ASO, including length, proximity to splice sites, target mRNA secondary structure, chemistry, and binding energy, among others (13,15–19), all of which would be useful considerations in ASO design. However, previously developed online tools lack the capacity to simultaneously integrate many parameters critical to ASO design.
To address this gap, we previously developed a computational method using a mathematical model based on 60 descriptor candidates as well as experimental data to design highly effective ASOs for exon skipping (13). Here, we improved this framework further using machine-learning algorithms and have developed eSkip-Finder, a web server to aid the design of effective ASOs for exon skipping. The overview of the web server is presented in Figure 1. One part of eSkip-Finder is a first-of-its-kind comprehensive database of exon skipping ASOs for DMD and other genes. This database was populated using published scientific literature and patents as sources, and contains information such as ASO chemistry, ASO sequence, and experimentally obtained skipping efficacies. The second part is a first-of-its-kind machine learning-based application to predict highly effective ASO sequences for exon skipping, based on a training set of 566 skipping values from 209 unique ASOs extracted from the database above. Here, we describe the features of eSkip-Finder in depth and outline the ways by which it can be used for the design of exon skipping ASOs.

Construction of database
A database of exon-skipping ASOs and their skipping efficacy was built by manually collecting and curating research papers and patents written in English. The database compiles data on exon-skipping ASOs for various genes, including their sequence, target exon, chemistry, literature information, and experimental information such as the ASO concentration, the cell type used for testing, and the target species. The database statistics as of 15 April 2021 are shown in Supplementary Table S1. The complete dataset extracted for each ASO in the database is provided on the web server.
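As an illustration of how records with the fields listed above might be stored and queried by gene, species, and exon number, the following is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the table name, column names, and demo rows are all hypothetical placeholders and do not reflect the actual eSkip-Finder schema or data.

```python
import sqlite3

# Hypothetical schema: column names are illustrative only.
SCHEMA = """
CREATE TABLE aso (
    id               INTEGER PRIMARY KEY,
    gene             TEXT NOT NULL,
    species          TEXT NOT NULL,
    target_exon      INTEGER NOT NULL,
    sequence         TEXT NOT NULL,
    chemistry        TEXT NOT NULL,
    concentration_um REAL,
    cell_type        TEXT,
    skipping_pct     REAL,
    source           TEXT
)
"""

# Placeholder rows with made-up values, used only to exercise the query.
DEMO_ROWS = [
    ("DMD", "human", 51, "ACGU" * 5, "PMO", 10.0, "RD", 40.0, "placeholder-ref-A"),
    ("DMD", "human", 51, "UGCA" * 5, "PMO", 10.0, "RD", 15.0, "placeholder-ref-B"),
    ("DMD", "mouse", 23, "GGCC" * 5, "PMO", 5.0, "C2C12", 30.0, "placeholder-ref-C"),
]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database and load the placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO aso (gene, species, target_exon, sequence, chemistry, "
        "concentration_um, cell_type, skipping_pct, source) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        DEMO_ROWS,
    )
    return conn


def search(conn: sqlite3.Connection, gene: str, species: str, exon: int):
    """Return (sequence, chemistry, skipping_pct) rows, best efficacy first."""
    cur = conn.execute(
        "SELECT sequence, chemistry, skipping_pct FROM aso "
        "WHERE gene = ? AND species = ? AND target_exon = ? "
        "ORDER BY skipping_pct DESC",
        (gene, species, exon),
    )
    return cur.fetchall()
```

Calling `search(build_demo_db(), "DMD", "human", 51)` returns the two matching placeholder rows, ordered by reported skipping efficacy.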

Predictive model of exon-skipping efficacy
We extracted skipping data that met the following criteria from the database to prepare our training and test datasets: (i) an absolute skipping efficacy was given as a numerical value; (ii) the ASO concentration used in the experiment was given; (iii) rhabdomyosarcoma (RD) cells were used in the experiment, to normalize experimental conditions; (iv) the skipping efficacy was not given as an EC50 value; (v) the ASO targeted a single contiguous (not dual-targeting) sequence in the pre-mRNA of dystrophin.
(15), which is not included in the training dataset. The correlation between predicted and experimental skipping efficacies (R²) was 0.7, as shown in Supplementary Figure S3.
The training and test datasets had a similar distribution of skipping efficacy, and they did not share identical sequences, as shown in Supplementary Figure S1.
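The selection criteria and the requirement that datasets share no identical sequences can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `SkippingRecord` fields, the 20% hold-out fraction, and the random split are hypothetical, and the sketch does not attempt to match the efficacy distributions of the two sets.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SkippingRecord:
    # All field names are hypothetical; they mirror criteria (i)-(v) in the text.
    gene: str
    sequence: str
    efficacy_pct: Optional[float]      # None when no numerical efficacy was reported
    concentration_um: Optional[float]  # None when the concentration was not reported
    cell_type: str
    is_ec50: bool
    is_dual_targeting: bool


def meets_criteria(r: SkippingRecord) -> bool:
    """Apply criteria (i)-(v) to a single record."""
    return (
        r.efficacy_pct is not None            # (i) numerical absolute efficacy
        and r.concentration_um is not None    # (ii) concentration reported
        and r.cell_type == "RD"               # (iii) RD cells only
        and not r.is_ec50                     # (iv) not an EC50 value
        and not r.is_dual_targeting           # (v) single contiguous target ...
        and r.gene == "DMD"                   # ... in the dystrophin pre-mRNA
    )


def split_by_sequence(
    records: List[SkippingRecord], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[SkippingRecord], List[SkippingRecord]]:
    """Hold out whole sequences so train and test never share a sequence."""
    sequences = sorted({r.sequence for r in records})
    random.Random(seed).shuffle(sequences)
    n_test = max(1, round(len(sequences) * test_fraction))
    held_out = set(sequences[:n_test])
    train = [r for r in records if r.sequence not in held_out]
    test = [r for r in records if r.sequence in held_out]
    return train, test
```

Splitting on unique sequences rather than on individual measurements matters here because the same ASO can appear at several concentrations; a per-row split would leak sequences between the two sets.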
We built a predictive model for the relative skipping efficacy of a target exon of dystrophin mRNA using the support vector regressor (SVR) implemented in scikit-learn version 0.23.2 (20). First, 32 features, tabulated in Table 1 and Supplementary Table S3, were prepared by feature engineering of the ASO and/or its target exon sequence; examples include the predicted binding score between the ASO and its target exon (21), the predicted local RNA structure at the target site (22), and the GC contents of the ASO and target exon. We also included the ASO concentration used in experimental studies as a feature. More details on the features used are provided elsewhere (13). Each feature was standardized before fitting the model. To select a small set of important features, we built SVR models for all possible combinations of fewer than seven features, where the experimental ASO concentration was always included as a selected feature. The upper limit of six features was chosen according to the available computational resources. For each model, the hyper-parameters C, gamma, and epsilon were optimized by a grid search in which the training data were split 100 times into 80% used to build a model and 20% used to validate it, under the condition that the two subsets did not share identical sequences. Finally, we selected the SVR model that yielded the highest average R² on the validation sets, as shown in Supplementary Figure S2; its features are given in Table 1. The selected models for PMO and 2′OMe were applied to the test set, yielding R² values of 0.6 and 0.7, as shown in Figure 2. The correlation between experimental and predicted skipping efficacy was confirmed for various concentrations. The contribution of each feature to predictive performance (feature importance) was estimated by permutation importance (23). The importance of each feature was defined as the decrease in the R² value when that feature was randomly permuted in the test set.
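The feature-subset search with grid-searched SVR hyper-parameters described above can be sketched with scikit-learn as follows. This is an illustrative sketch, not the authors' code: `GroupShuffleSplit` is assumed as a way to keep identical sequences out of both halves of each 80/20 split, the grid values are arbitrary, and the defaults (`max_features=3`, `n_splits=5`) are deliberately smaller than the six features and 100 splits reported in the text so that the example runs quickly.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Arbitrary example grid; the values searched in the paper are not given here.
PARAM_GRID = {
    "svr__C": [1.0, 10.0],
    "svr__gamma": ["scale", 0.1],
    "svr__epsilon": [0.1, 1.0],
}


def select_features(X, y, groups, conc_idx, max_features=3, n_splits=5, seed=0):
    """Search all feature subsets that contain the concentration column.

    `groups` holds one sequence identifier per row, so that rows sharing a
    sequence never fall on both sides of a train/validation split.
    Returns (best mean validation R^2, column indices, best hyper-parameters).
    """
    others = [j for j in range(X.shape[1]) if j != conc_idx]
    cv = GroupShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    best = (-np.inf, None, None)
    for k in range(max_features):  # k extra features besides concentration
        for subset in combinations(others, k):
            cols = [conc_idx, *subset]
            # Standardize each feature, then fit the SVR, inside every split.
            search = GridSearchCV(
                make_pipeline(StandardScaler(), SVR()),
                PARAM_GRID,
                scoring="r2",
                cv=cv,
            )
            search.fit(X[:, cols], y, groups=groups)
            if search.best_score_ > best[0]:
                best = (search.best_score_, cols, search.best_params_)
    return best
```

The number of candidate models grows combinatorially with the feature limit, which is consistent with the text's remark that the upper limit was set by the available computational resources.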
The feature importance calculation was repeated 100 times and the averaged values are shown in Table 1. The current model is focused on the prediction of the relative skipping efficacy of ASOs. However, other parameters should also be considered when designing ASOs, one of which is the off-target effect. Other bioinformatics tools such as SKIP-E (https://skip-e.geneticsandbioinformatics.eu/) could complement it.
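The permutation importance estimate, defined as the drop in R² when one feature is shuffled in the test set and averaged over 100 repeats, can be illustrated with scikit-learn's `permutation_importance`. This is a sketch on synthetic data under stated assumptions: the data, the fixed SVR settings, and the use of this particular scikit-learn function are illustrative choices, not a description of the authors' implementation.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: column 0 dominates, column 1 is weak, column 2 is irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=120)
X_train, X_test = X[:90], X[90:]
y_train, y_test = y[:90], y[90:]

# Standardize features, then fit an SVR with example hyper-parameters.
model = make_pipeline(StandardScaler(), SVR(C=10.0)).fit(X_train, y_train)

# Shuffle each feature of the test set 100 times and record the mean
# decrease in R^2 relative to the unshuffled test score.
result = permutation_importance(
    model, X_test, y_test, scoring="r2", n_repeats=100, random_state=0
)
importances = result.importances_mean  # one averaged value per feature
```

Because the importance is measured on held-out data, a feature the model merely memorized during training does not receive credit; only features that help prediction on unseen samples show a large drop.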

Implementation
The selected predictive models (Figure 2 and Table 1) are implemented on the web server with scikit-learn (20). Features of local accessibility scores of target exon sequences and binding scores between ASOs and their target exons were calculated with the ViennaRNA Package (22) and RNAstructure (21). The dictionary of NI scores was retrieved from Ref. (24). The concentrations of ASOs were set to typical values, that is, 3 μM for PMO and 0.1 μM for 2′OMe. The database was built using PostgreSQL.
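Since the predictor evaluates candidate ASOs for a given target sequence and ASO length, a first step can be sketched as enumerating every antisense window and computing a simple sequence feature such as GC content. The function names, the RNA alphabet, and the sliding-window enumeration below are assumptions made for illustration; the structure-based features (ViennaRNA, RNAstructure) are not reproduced here.

```python
from typing import Iterator, Tuple

# Watson-Crick complement table for the RNA alphabet (assumed representation).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq: str) -> str:
    """Upper-case the sequence and convert DNA-style T to U."""
    return seq.upper().replace("T", "U")


def candidate_asos(target: str, length: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, aso) for every window of `length` nt in the target.

    Each ASO is the reverse complement of its window, written 5'->3'.
    """
    rna = to_rna(target)
    for start in range(len(rna) - length + 1):
        window = rna[start:start + length]
        yield start, window.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    rna = to_rna(seq)
    return (rna.count("G") + rna.count("C")) / len(rna)
```

For example, `list(candidate_asos("AUGGCC", 4))` gives `[(0, "CCAU"), (1, "GCCA"), (2, "GGCC")]`; each candidate would then be paired with a concentration and the remaining features before being passed to the trained model.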

Case study
Database search. The web server provides an intuitive search interface of relevant information on exon skipping