Alternative splicing impacts microRNA regulation within coding regions

Abstract MicroRNAs (miRNAs) are small non-coding RNA molecules that bind to target sites in different gene regions and regulate post-transcriptional gene expression. Approximately 95% of human multi-exon genes can be spliced alternatively, which enables the production of functionally diverse transcripts and proteins from a single gene. Through alternative splicing, transcripts might lose the exon with the miRNA target site and become unresponsive to miRNA regulation. To check this hypothesis, we studied the role of miRNA target sites in both coding and non-coding regions using six cancer data sets from The Cancer Genome Atlas (TCGA) and Parkinson’s disease data from PPMI. First, we predicted miRNA target sites on mRNAs from their sequence using TarPmiR. To check whether alternative splicing interferes with this regulation, we trained linear regression models to predict miRNA expression from transcript expression. Using nested models, we compared the predictive power of transcripts with miRNA target sites in the coding regions to that of transcripts without target sites. Models containing transcripts with target sites perform significantly better. We conclude that alternative splicing does interfere with miRNA regulation by skipping exons with miRNA target sites within the coding region.


INTRODUCTION
MicroRN As (miRN As) are short (16-27 nucleotides ( 1 )) non-coding RNAs tha t regula te post-transcriptional gene expression.They usually repress the target gene by destabilizing its transcript and / or by r epr essing its translation ( 2 ).Through a complementary target site (position 2-8 from the 5 end, commonly r eferr ed to as seed sequence) they bind to their target mRNA and guide the RNA-induced silencing complex (RISC) to degrade it ( 3 ).In mammals, miR-NAs regulate > 60% of all protein-coding genes ( 4 ).They play an important role in health and disease.For example, tissue-specific miRNAs control cell dif ferentia tion ( 5 ) and miRNA downregulation is associated with tumorigenesis, e.g., the downregulation of li v er-specific miRNA miR-122 in hepatocellular carcinoma (HCC) ( 6 , 7 ).By analyzing the expression of 11 major human cancers from the Cancer Genome Atlas (TCGA), Li et al. ( 8 ) showed that the correlation between miRNA and target gene expression is reduced in tumors compared to normal tissue.Since individual miRNAs are able to sim ultaneousl y downregulate several target genes and thereby affect whole pathways, they ar e inter esting therapeutical targets ( 9 ).
MiRNAs are known to bind to the 3 untranslated region (3 -UTR) of their targets ( 10 ).Howe v er, Lytle et al. ( 11 ) moved a target site of let-7a miRNA from the 3 -UTR to the 5 -UTR in human HeLa cells and demonstrated that both 5 -UTR and 3 -UTR can be targeted.Lee et al. ( 12 ) found that not only the 5 -end of miRNAs can interact with the 3 -UTR of mRNAs but also vice versa .They identified many mRN As that sim ultaneousl y contain 5 -end and 3 -end target sites enabling combinatorial interactions between a single miRNA and both UTRs of an mRNA.While previous experiments found the reduction of protein le v els by around 40%-60% when using only 3 -UTR ( 5 ), the authors observ ed an e v en gr eater r eduction of protein abundance by also including miRNA target sites in the 5 -UTR.They validated their findings experimentally using hsa-miR-34a binding to both 3 -UTR and 5 -UTR of AXIN2.
A gene's coding region can also contain potential miRNA target sites.Forman et al. ( 13 ) analyzed publicly available proteomics datasets and demonstrated that miRNA target sites in coding r egions ar e functional but less conserved and effecti v e in r epr ession or inhibition of target genes than 3 -UTR sites.Hausser et al. ( 14 ) analyzed putati v e miRNA target sites in coding regions that were predicted computationally or inferred based on expression changes upon miRNA transfection.Target sites in the coding regions were found to have a smaller impact on mRNA stability but to be more effecti v e in inhibiting translation, while 3 -UTR sites trigger mRNA degradation more efficiently.The authors concluded that a combination of both enables fine-tuning of the miRNA regulatory effects.
The miRNA-media ted regula tion through interaction with target sites in the coding regions might be affected by alternati v e splicing (AS).A pproximatel y 95% of human multi-exon genes can be spliced alternativel y ( 15 ), w hich leads to different combinations of exons in resulting transcripts.If the exon with the miRNA target site is spliced out, the transcript might evade miRNA regulation.Howe v er, the impact of alternati v e splicing on miRNA-mediated mRNA regulation has been previously addressed only in one study.Han et al. ( 16 ) studied the impact of alternati v e splicing and alternati v e polyadenylation of 3 -UTRs on miRNA-mediated r epr ession efficiency in bladder cancer.They demonstrated that miRNA might fail to regulate alternati v ely spliced transcripts missing 3 -UTR exons.
The following open questions remain: Does alternati v e splicing of exons in the coding region affect miRNAmedia ted regula tion?How does this ef fect generalize to other tissues and conditions?
We address these questions and evaluate the impact of alternati v e splicing of coding regions on miRNA regulation on a transcriptome-wide scale for se v eral types of cancer.To that end, we computationally predict miRNA target sites using TarPmiR ( 17 ) and filter out miRNA-gene pairs based on their expression level and correlation between gene and miRNA expression.We then construct nested linear r egr ession models to predict miRNA expression based on the expression of transcripts with and without miRNA target sites and conclude that alternati v e splicing does indeed interfere with miRNA-mediated mRNA regulation.

Workflow
Figure 1 illustrates the workflow we used to investigate the impact of alternati v e splicing on miRNA-mediated gene e xpr ession r egulation.Human miRNA and mRNA sequences were input to TarPmiR for miRNA target site prediction on the mRNAs (Figure 1  • all transcripts (ALLT): the full models contain all transcripts; the reduced models contain non-binding transcripts.This setting allows us to investigate the role of miRNA target sites independent of their location in the gene.
• transcripts not binding in non-coding regions (TNBN): the full models contain non-binding transcripts and transcripts only binding in coding regions; the reduced models contain non-binding transcripts.This setting allows us to investigate the impact of miRNA target sites in coding regions in the absence of miRNA target sites in noncoding regions.• transcripts binding in non-coding regions (TBN): the full models contain transcripts binding in non-coding regions and transcripts with binding in both coding and noncoding regions; the reduced models contain transcripts binding in non-coding regions.This setting allows us to investigate the impact of miRNA target sites in coding regions in the presence of miRNA target sites in the noncoding regions.
After filtering (Figure 1 .3),we constructed nested linear r egr ession models for each remaining miRNA-gene pair to pr edict miRNA expr ession from transcript expr ession (Figure 1 .4)and filtered the models by the root mean squared error (RMSE) (Supplementary Figure S1).This procedure was repeated on both random subsets (Figure 1 .5)and random subsets with additionall y perm uted labels (Figure 1 .6).Finally, between each pair of full and reduced models we performed a likelihood ratio test (Figure 1 .7).

Prediction of miRNA target sites
Sequence data.We downloaded 2656 human mature miRNA sequences from miRBase ( 18 ) (release 22.1).The Figur e 1. Anal ysis pipeline using miRN A and mRN A expression and sequence data.(1) TarPmiR target site prediction, (2) ca tegoriza tion in non-binding vs. binding transcripts, (3) filtering ( A ) for expression and variance above the chosen thresholds (see Materials and Methods), ( B ) alternati v ely spliced genes, ( C ) negati v e correlation of miRNA and gene expression, (4) per miRNA-gene pair nested linear r egr ession: non-binding transcript r egr ession and all transcript r egr ession, (5) subsampling of nested models, (6) subsampling and label randomization of nested models, (7) likelihood ratio test between nested model pairs.The pipeline was run for the three settings ALLT (all transcripts), TNBN (transcripts not binding in non-coding region) and TBN (transcripts binding in non-coding region) separately from step 2) on.
Ensembl database ( 19 ) release 100 (GRCh38.p13assembly) was used as a source for the 249 750 mRNA sequences (228 116 primary assembly sequences + 21 634 alternati v e assembly sequences), as well as coding region annotation.We converted all uracil bases to thymine bases as cDNA format is necessary for miRNA target site prediction.miRNA tar g et site prediction with TarPmiR.We executed TarPmiR on the Ensembl mRNA sequences and miRBase mature miRNA sequences using the default parameters.The final output file contains miRNA target site candidates with binding probabilities > 50% (Supplementary Figure S2).As computational prediction of target sites can lead to false-positi v e predictions, we kept only transcripts that contain target sites with binding probabilities > 80% ('binding transcripts') and transcripts that contain no target sites with binding probabilities > 50% ('non-binding' transcripts) (Figure 3 A).The TarPmiR default is 50% but we raised the threshold to 80% to focus on high-confidence predictions (Supplementary Figure S3).Transcripts that contain only target sites with a binding probability between 50% and 80% were filtered out to avoid noise.
The predicted target sites were then mapped back onto the exons and categorized into target sites in the coding region and target sites in the non-coding region (5 -UTR and 3 -UTR).Target sites overlapping both coding and noncoding regions were assigned to both categories.Expression filter.To further reduce the number of potential false-positi v e predictions, we applied an e xpression filter.MiRN A and mRN A expression data were col-lected from The Cancer Genome Atlas (TCGA).We used the Xena platform ( 20 ) to download the batch effect corrected, TPM normalized, and log-transformed gene expression data (version 2016-09-01), transcript expression data (version 2019-02-25) and miRNA mature strand expression data (version 2016-12-29) from the T CGA P an-Cancer (P ANCAN) cohort.W e investigated the following cancer types: Brain lower grade glioma (LGG), Kidney chromophobe carcinoma (KICH), Li v er hepatocellular carcinoma (LIHC), Kidney renal cell carcinoma (KIRC), and Breast Invasi v e Carcinoma: Invasi v e Lobular Carcinoma (ILC) and Invasi v e Ductal Carcinoma (IDC).We selected tissues with a high proportion of alternati v ely spliced genes such as Brain tissue (LGG), Li v er tissue (LIHC) and Kidney tissue (KICH, KIRC).Breast tissue (IDC, ILC) was chosen for comparison due to the high number of samples available in TCGA ( 21 ).We filtered out miRNAs with expression variance smaller than 0.2 between samples within a cancer type dataset to reduce noise and to pre v ent ov erfitting of the model.We filtered out genes and transcripts that are not expressed in 25% or more samples within a dataset.
We also ran the pipeline on Parkinson's disease data from P arkinson's Progr ession Markers Initiative (PPMI) ( 22 ).We used the transcripts per million (TPM) abundance estimates from Salmon after basic QC from long RNA transcriptome sequencing of PPMI samples (B38, release IR3) for transcripts and genes and reads per million (RPM) normalized read counts for all miRNAs from miRBase v22 ( 18 ) that passed basic QC from small RNA transcriptome sequencing of PPMI samples (Project ID 133).Transcript and gene  counts were provided as log 2 ( tpm + 0.001), whereas miRNA counts were transformed from log 2 ( rpm + 1) to log 2 ( rpm + 0.001).Next we also filtered genes and transcripts that are not expressed in 25% or more samples and selected 100 most highly expressed miRNAs for further analysis.
Alternative splicing filter.To account for alternative splicing, we kept only miRNA-gene pairs where a gene has at least one transcript containing a miRNA target site with TarPmiR probability > 80% in the investigated region and at least one other transcript containing no miRNA target site with TarPmiR probability > 50% in the same investigated region.
Correlation filter.miRNAs most fr equently r epr ess the expression of their targets ( 23 , 24 ).Therefore, we expect a negati v e correlation between miRNA expression and target gene expression.To focus on the down-regulating effect that most miRNAs have on target gene expression, the Pearson standard correla tion coef ficient between miRNA expression and gene expression was calculated for all miRNAgene pairs.We kept only pairs with a negati v e Pearson correla tion coef ficient for further analysis.

Nested linear r egr ession on expression
Full and reduced models.The samples for all miRNAgene pairs were divided into training (80%) and test sets (20%).We constructed nested linear r egr ession models to pr edict miRNA expr ession from transcript expr ession.Per miRNA-gene pair a full model was trained on all NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3 5 transcripts of the gene using the ordinary least squares method.
, where m is the miRNA expression, t is transcript expression, k is the number of all transcripts of the gene without target sites in the investigated region, l is the number of all transcripts of the gene with target sites in the investigated region and ␣, ␤, ␥ are the regression coef ficients associa ted with each independent variable.Accordingly a reduced model was trained on the transcripts without target sites in the investigated region (see Workflow Description).
, where m is the miRNA expression, t is transcript expression, k is the number of all transcripts of the gene without target sites in the investigated region and ␣, ␤ are the regression coef ficients associa ted with each independent variable.
To investigate the possible impact of the most correlated miRNA with a target site in the non-coding region, we constructed full and reduced models as following.For a miRNA-gene pair (M1, G1) the full model was defined as: , where m 1 is the expression of miRNA M1, m 2 is the expression of miRNA M2 that is the most correlated to m 1 , t is transcript expression, k is the number of all transcripts of G1 without any target sites for M1, l is the number of all transcripts of G1 with M1 target sites only in the coding region and ␣, ␤, ␥ , ␦ are the r egr ession coef ficients associa ted with each independent variable.The reduced model was defined as: , where m 1 is the expression of miRNA M1, m 2 is the expression of miRNA M2 that is most correlated to m 1 , t is transcript expression, k is the number of all transcripts of G1 without any target sites for M1 and ␣, ␤, ␦ are the regression coefficients associated with each independent variable.
For the resulting models, the RMSE was calculated on the test sets.All nested models with an error smaller than 0.7 for the reduced and corresponding full models were kept.The threshold was determined through visual inspection of the remaining models after filtering (Supplementary Figure S1) aiming to select those tha t demonstra ted improvements beyond the average performance while preserving a reasonable number of candidates for further analysis.The likelihood ratio test was used between the reduced and full model, a ppl ying the Benjamini Hochberg multiple testing correction with a family-wise error rate of ␣ = 0.05.We then calcula ted the ra tio of nested models with sta tistically significant adjusted P -values ( P < 0.05), meaning nested models, where the full model outperformed the reduced model.
To estimate the distribution of ratios for the data set, we randomly sampled 10 000 miRNA-gene pairs 1000 times and calculated the ratio of nested models, where the full model outperformed the reduced model.
To quantify the effect size of improvement from the reduced model to the full model, Cohen's f 2 was calculated using the following formula: W hile the coef ficient of determina tion ( R -squared) measures how much of the independent variab le (miRNA e xpression) is explained by changes in our dependent variables (transcript expressions), the adjusted R -squared was used to penalize the model for adding irrelevant independent variables.A small effect size is defined as a Cohen's f 2 value of 0.02, a medium effect size as a Cohen's f 2 value of 0.15, and a large effect size as a value of 0.35 ( 25 ).
Randomization.To show that the obtained ratio of nested models with significant adjusted P -value ( P < 0.05) cannot be reached by chance, we repeated the subsampling procedure described above with 10 000 miRNA-gene pairs 1000 times, but shuffled the transcript binding labels within a gene while preserving the absolute number of binding transcripts and the absolute number of non-binding transcripts per gene.Per iteration, the reduced models were retrained and filtered by RMSE, as described above.For the remaining nested models, the likelihood ratio test statistic was calculated.We obtained the ratio of nested models with significant P -values after Benjamini Hochberg multiple testing correction as for the non-randomized models above.

Computational prediction of miRNA-gene pairs
Starting from 2656 miRNA and 249 750 mRNA sequences TarPmiR predicted 983 499 270 target sites with probability > 50% (Supplementary Figure S2), among those 424 345 096 target sites with probability > 80%.In total this amounted to 288 056 794 miRNA-transcript pairs with target sites.This high number might indicate a high number of false-positi v e pr edictions (Supplementary Figur e S4).In particular, target site predictions are sequence-specific and hence do not account for a lack of expression of either a miRNA or its target in a specific condition, cell type or tissue.Thus, we used miRN A and mRN A expression data for the six cancer types from TCGA.
The expression matrices contain 178,927 transcripts of 57 471 genes before the expression filter.Table 1 shows the number of samples per cancer type dataset with both miRNA and gene expression.The number of miRNA samples after filtering by expression variance slightly differs between datasets.Furthermore Table 1  miRNAs per gene and vice versa after filtering.We filtered for miRNA-gene pairs with expression above the chosen thresholds, alternati v e splicing, and negati v e Pearson correlation between miRNA and gene expression (see Methods).

Impact of alternative splicing on miRNA-mediated regulation in cancer
Previous studies focused on the role of miRNA target sites in non-coding regions (10)(11)(12).First, using the ALLT setting, we investigated the impact of miRNA binding on transcript expression independent of the location of target sites.This setting might reflect the mixed effect, where target sites in the non-coding region have more impact compared to target sites in the coding regions.To check this hypothesis, we de v eloped two other settings described below.
Target sites in the coding regions might be spliced out due to alternati v e splicing, and if those target sites are important for miRNA-mediated regulation, the resulting transcript could evade miRNA regulation, i.e. its expression will be independent of miRNA regula tion.To investiga te this possibility, we used the TNBN setting.
Transcripts often contain a mixture of miRNA target sites -both in coding and non-coding regions.Target sites in the non-coding regions might fully overpower the effect of target sites in coding regions.To investigate this possibility, we used the TBN setting.
We calculated the likelihood ratio test between the nested models for each miRNA-gene pair and compared the ratio of models, where the full model statistically significantly outperformed the reduced model for ALLT, TNBN and TBN, respecti v ely.Ne xt, we compared this ratio with the ratio of such models after r andomizing tr anscript category labels (Figure 4 for LGG and KICH, Supplementary Figure S6 for LIHC, KIRC, ILC, IDC).
The full models were found to consistently outperform the reduced models, whereas the magnitude of performance gain varies between cancer types.For Invasi v e Lobular Carcinoma (ILC) the effect is the weakest and the distribution of the ratios overlaps between real and randomized experiments.In the setting TNBN we see that also target sites in the coding region have an effect, supporting the notion that acti v e miRNA target sites are not only found in non-coding regions.
For all analyzed cancer types and settings (ALLT, TNBN, TBN), this difference is significant (Mann-Whitney U test P -value < 0.05).This observation supports the hypothesis that on transcriptome-wide scale, alternati v e splicing impacts miRNA regulation by splicing out miRNA target sites in the coding regions.To calculate the effect size we used Cohen's f 2 (see Materials and Methods).KICH clearly showed the best improvements between reduced and full model with Cohen's f 2 scores higher than one.For all diseases and settings we provide a list of all miRNA-gene pairs with the highest effect size (Cohen's f 2 ≥ 0.35) (see Data Availability).

Gene Set Enrichment analysis
We performed a Gene Set Enrichment analysis of the top 500 genes that predict miRNA expression with most significant p-values using the Molecular Signatures Database ( 26 , 27 ) for all six cancer types and settings.We over lay ed the top 500 genes with all non-computational collections (C1, C2, C3, C5, C6, C7 ( 28 ), C8, H ( 29)).For all the cancer types and settings, we found cancer-related functions within the top ten most significantly enriched genesets of the noncomputational collections for all diseases and settings besides for LGG setting TNBN (Supplementary Tables S2-S19).Of those, we found an overlap with a geneset related to the specific cancer type for cancer type IDC in all settings, for ILC settings TBN and ALLT, for KIRC setting ALLT, and for LIHC setting TBN.
While overlaying the top 500 genes of all the cancer types and settings with only oncogenic genesets (C6), we found significant overlaps for all cancer types and settings (Supplementary Tables S20-S37).
We investigated the miRNA regulatory interactions by extracting the top 500 genes with corrected P -value < 0.1 for the top 10 miRNAs with most significantly corrected likelihood test P -values.For each disease and setting we performed Gene Set Enrichment analysis as described above.For 30% of miRNAs a significant overlap of genes with any oncogenic genesets was observed in ILC for all settings, and in KICH setting ALLT for 50% of miRNAs.For the rest of the diseases we found significant overlaps with all noncomputational collections for the genes for a minimum of 70% miRNAs for all settings.
miRNAs are able to regulate se v eral target genes at the same time and thereby affect whole pathways.To examine this we used miRPathDB ( 30 ) and checked the miR-NAs with the top 100 most significant corrected p-values against the KEGG database (Supplementary Figures S7-S12).We found miRNAs significantly enriched for cancer for all investigated cancer types and settings, among that LGG settings TBN and ALLT were significantly enriched for glioma.

Impact of alternative splicing on miRNA-mediated regulation in Parkinson's disease
To prove the observed effect generalizes also to other diseases besides cancer, the pipeline was executed as described above on Parkinson's disease data from Parkinson's Progression Markers Initiative (PPMI) ( 22 ).
Due to the huge number of samples available (1.613 samples for 1.957 miRNA) compared to the biggest analysed The ratio of models with statistically significant ( < 0.05) corrected p-values of the likelihood ratio test statistic calculated between nested r egr ession models is shown as the dashed green line.To estimate the distribution, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs (green histogram) and to estimate the impact of alternati v e splicing, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs while r andomizing the tr anscript binding labels within a gene (y ellow histogr am).Dist describes the difference between the average ratio of models based on subsampled real miRNA-gene pairs and after randomizing the transcript category labels.This is shown separately for diseases Brain lower grade glioma (LGG) and Kidney chromophobe carcinoma (KICH) for settings ALLT, TNBN and TBN.cancer type in TCGA (IDC: 179 samples for 544 miRNA), we decided to focus on the top 100 most expressed miRNAs, while keeping all samples and genes.
Supplementary Table S38 shows the number of samples with both miRNA and gene expression, the number of miRNA samples after filtering by expression variance and the number of miRNA-gene pairs after filtering.Supplementary Table S39 shows the intermediate numbers of miRNA-gene pairs after the single filtering steps.We filtered for expressed, alternate spliced miRNA-gene pairs with negati v e Pearson correlation between miRNA and gene expression (see Methods).
For all three settings the likelihood ratio test between the nested models for each miRNA-gene pair and the ratio of models, where the full model statistically significantly outperformed the reduced model, was calculated.Then this ratio was compared to the ratio of outperforming models with randomized labels (Figure 5 ).
The full models were found to consistently outperform the reduced models, but the magnitude of performance gain was seen largest for setting ALLT.For all settings this difference was found significant (Mann-Whitney U test P -value < 0.05).This observation provides additional evidence supporting the notion that the influence of AS on miRNA regulation in the coding regions generalizes from cancer to disease such as Parkinson.

Influence of multiple miRNAs
In our pipeline we model the relationship between one gene and one miRNA at a time.In setting TNBN we focus on miRNA -gene pairs with target sites for this miRN A onl y in the coding region.Howe v er, other miRNAs might target the non-coding region of the same transcript.To investigate the possible effect of other miRNAs, for each gene-miRNA pair we identified the miRNA which expression has the highest Pearson correlation to the miRNA of interest and which is predicted to bind to the non-coding region of the gene with a binding probability > 50%.We included its expression as a covariate in the model (see Methods) and retrained all nested linear r egr ession models for disease KICH setting TNBN: The ratio of models with statistically significant ( < 0.05) corrected p-values of the likelihood ratio test statistic based on subsampled real miRNA-gene pairs is shifted towards the left compared to the distribution of the ratio after randomizing the transcript category labels (Figure 6 ).While we could observe a significant difference, we see a less strong signal (just ≈ 2% of models with significant likelihood ratio) than in the KICH TNBN, where we did not take into account multiple miRNA interactions (around 11% of models).
Overall, our findings demonstrate the effect of miRNA target sites in coding regions and re v eal that alternati v e splicing plays a significant role in influencing these interactions.

DISCUSSION
We investigated the impact of miRNA target sites in the coding regions based on miRNA and mRNA expression data using a nested linear r egr ession approach.On a transcriptome-wide scale, we demonstra ted tha t alterna ti v e splicing impacts miRNA regula tion, puta ti v ely by splicing out miRNA target sites in the coding regions.We observe The ratio of models with statistically significant ( < 0.05) corrected p-values of the likelihood ratio test statistic calculated between nested r egr ession models is shown as the dashed green line.To estimate the distribution, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs (green histogram) and to estimate the impact of alternati v e splicing, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs while r andomizing the tr anscript binding labels within a gene (y ellow histogr am).Dist describes the difference between the average ratio of models based on subsampled real miRNA-gene pairs and after randomizing the transcript category labels.This is shown for Parkinson disease for settings ALLT, TNBN and TBN.

Figure 6.
The ratio of models with statistically significant ( < 0.05) corrected P -values of the likelihood ratio test statistic calculated between nested r egr ession models is shown as the dashed green line.To estimate the distribution, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs (green histogram) and to estimate the impact of alternati v e splicing, the ratio was calculated 1000 times for random subsets of miRN A-gene pairs w hile randomizing the transcript binding labels within a gene (y ellow histogr am).Dist describes the difference between the average ratio of models based on subsampled real miRNA-gene pairs and after r andomizing the tr anscript category labels.This is shown for Kidney chromophobe carcinoma (KICH) for setting TNBN with the addition of the most correlated miRNA as a covariate.
the same effect in six cancer types and in Parkinson's disease.We specifically investigated miRNA target sites in the coding region, as the effecti v eness of those target sites is not yet well understood.While now it is widely accepted that miRNAs target not only 3 -UTR but also 5 -UTR ( 11 , 12 ), r esear ch also indicates functional miRNA regulation by binding to the protein-coding regions.This is supported by both statistical analysis of the conservation of target sites ( 31 ) and se v eral e xperimentally validated e xample interactions (32)(33)(34), though the miRNA target sites in the coding region were found to have a smaller gene regulatory effect ( 35 ).McGeary et al. estimated binding affinities to predict miRNA r epr ession efficacy using a biochemical model of miRNA-mediated r epr ession and a convolutional neural network ( 36 ).The fitted penalty for binding to coding region arri v ed a t a 5.5-fold reduced af finity compared to non-coding regions.In their underlying analysis canonical target sites were more effecti v e due to the perfect seed pairing facilita ting ef fecti v e targeting of the silencing comple x.In contrast, non-canonical sites were less pr evalent, r equiring greater length to achie v e similar affinity values.Although non-canonical sites contributed measurably to r epr ession in the cell, their affinity was lower than the top four canonical sites' affinity.We demonstrated on a transcriptome-wide scale, that miRNA can bind to target sites in the coding exons, ther efor e r epr esenting the possible mechanism which disturbance might lead to complex disease development: the aberrant splicing might allow a gene to escape miRNA regulation or, vice versa, the miRNA dysregulation might lead to the disruption of transcript expression and / or ratio.As miRNAs are important players in complex disease de v elopment, e.g. in cancer progression ( 37 ) and a promising tool for cancer therapy, our study highlights the need for further r esear ch to unravel the underlying mechanisms and its implications in health and disease.
The interplay between miRNA regulation and alternati v e splicing in cancer de v elopment has rarely been addressed ( 16 ).Howe v er, se v eral miRNA-gene pairs, for which we showed a significant impact of alternati v e splicing, hav e been demonstrated to be important for cancer subtypes.
It is known that microRNA-200 family miRNAs target genes ZEB1 and ZEB2, which both are involved in EMT and tumour metastasis ( 38 ).In our analysis, we found that miRNA miR-141-5p regulates ZEB2 ( P ≈ 2.39 e −4 in LGG setting TBN and P ≈ 9.00 e −5 in LGG setting ALLT).miR-141-5p inhibits glioma cell growth and migration by repressing ZEB1 expression ( 39 ).In pancreatic cancer, howe v er, tr eatment of MiaP aCa-2 cells with gemcitabine caused an upregulation of the ZEB1 protein through alternati v e polyadenylation of the transcript ( 40 ).Thereby the ZEB1 3 -UTR was shortened and miRNA target sites in the last exon deleted.We were able to observe gene ZEB1 evading regulation by miR-141-5p through alternati v e splicing ( P ≈ 3.32 e −5 in LGG using setting TBN).
Tumor suppressor miR-30c is known to inhibit prostate cancer by targeting the 3 -UTR of the SRSF1 splicing factor oncoprotein to downregulate its expression in prostate cancer ( 41 ).This expression is correlated with the pathological NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3 9 stage of prostate cancer and biochemical r ecurr ence.SRSF1 is also known to be ov er-e xpressed in kidney tumor ( 42 ).We found the miRNA miR-30c-1-3p and gene SRSF1 interaction significant in KICH setting TNBN ( P ≈ 2.39 e −2 ) and in KIRC setting ALLT ( P ≈ 3.03 e −2 ).In renal cancer 3 -UTR variants of SRSF1 wer e discover ed with differing miRNA target sites ( 43 ), a differential regulation mechanism potentially existing for miR-30c as well.We found SRSF1 also interacts with miR-7 in lower grade glioma --more specifically with miR-7-2-3p ( P ≈ 9.22 e −8 for setting ALLT).The splicing factor SRSF1 transcript, besides being r epr essed by miR-7, is also targeting the miRNA through binding, thereby generating a negati v e feedback loop ( 44 ).
In KIRC setting ALLT we found miR-18a-3p regulating K-Ras expression ( P ≈ 1.41 e −3 ), an interaction which was pre viously shown e xperimentally ( 45 ).MiR-18a* acts as a tumor suppressor by targeting oncogene K-R as.K-R as is known to be alternati v ely spliced into two isoforms K-Ras 4B, which is anti-apoptotic and ubiquitously expressed, and K-Ras 4A, which is pro-apoptotic and expressed in only a subset of tissues such as kidney, lung and colon ( 46 ).In renal cell carcinoma oncogene K-Ras 4A was observed as upregulated and the isoform's influence on cell survival and proliferation shown ( 47 ).
W hile cancer-rela ted terms were frequently observed among the top ten gene sets in the Gene Set Enrichment analysis, se v eral other enriched terms lacked a direct association with cancer.These likely indicate tissue-specific effects of miRN A regulation, w hich suggests that the interplay of miRN A regulation and AS probabl y also plays a role in tissue dif ferentia tion ( 48 ).
Our study provides the first steps toward investigating the role of alternati v e splicing in miRNA regulation and has se v eral limitations.First of all, the current availability of miRN A and mRN A expression data from the same tissue and condition limits the study.Howe v er, the effect is clearly seen in different cancer types and Parkinson's disease, which suggests that this effect might be commonly observed.W hile we demonstra ted the ef fect on two dif ferent diseases, it is important to recognize that this does not automaticall y impl y an identical effect in healthy individuals.Moreover, cancer samples, with their eleva ted muta tional burden, are likely to exhibit deviating transcript expression pa tterns.Additional investiga tions involving healthy populations are necessary to ascertain the generalizability and context-specific nature of these findings.The other limita tion is tha t machine learning methods for miRNA target site prediction continue to exhibit a high false positi v e rate ( 49 ).Despite its high false negati v e rate, Riffo-Campos et al. recommended TargetScan as the most complete and widely used sequence-based miRNA target site prediction tool ( 50 ).We opted for the state-of-the-art tool TarPmiR ( 17 ) because of its superior recall and pr ecision compar ed to other commonly used methods such as TargetScan ( 51 ), miRanda ( 52 ), and miRmap ( 53 ).Moreover, in addition to conventional fea tures, TarPmiR integra tes seven fea tures, including the length and position of the longest consecuti v e pairs, contributing to its improved performance, particularly in identifying 'non-seed-matching' target sites.We tried to further mitigate the false positi v e rate by introducing an expression correlation filter but still miRNA target pr ediction r emains an open challenge.Additionally, pr edicting targets for non-conserved or ne wly discov ered miR-NAs remains challenging due to limited homology and experimental data, compounded by the absence of a gold standard ( 54 ).
In our analysis, we consider miRNA-gene interactions as binary, while genes acting as competing endogenous RN As actuall y form a complex gene-regulatory network based on miRNA competition ( 55 , 56 ).We chose a simple binary model over a more complex network model with nto-n interactions as the r esults ar e easier to interpret and clearly support our findings.We also simplified the model of miRNA binding to a target site and did not take into account the more di v erse nature of binding.For example, binding within the ORF can involve only partial complementarity in a nucleation bulge interaction or 'seed-like' motifs ( 57 ), which makes it necessary to look further than direct seed matches to predict these non-conserved target sites ( 58 ).This stands opposed to canonical miRNA binding using seed pairing, where the miRNA binds on its complementary positions 2-8 (the seed region) to the 3 -UTR of a target mRNA.Further studies are needed to pr edict differ ent types of miRNA target sites and investigate if they have a different impact on miRNA regulation.Furthermore, many transcripts show target sites for se v eral miRNAs (Supplementary Figure S13), and multiple miRNAs might e v en target the same exon.miRNAs are also known to act cooperati v ely to regulate their targets and this miRN A-miRN A cooperation was shown to differ across cancer types with cancer-specific hubs ( 59 ).The m ulti-miRN A effect should be taken into account in the futur e to r efine the miRNA-transcript r elationship by e.g.adding more miRNAs as dependent variables to multiple linear r egr ession.In our study, we accounted for the most correlated miRNA with the target site within the same transcript, howe v er, the regulation via other miRNAs might be much more complex ( 60 ).For example, we can not rule out completely that the effect we observe could be explained by other correlated miRNAs that do not bind to coding exons but to the 3 or 5 UTR of the transcripts we investigate.
The most promising miRNA-gene pairs might be feasib le for e xperimental validation.Howe v er, the demonstrated effect has a transcriptome-wide meaning that might not be clearly observed for single miRNA-gene pairs.Also we found that only a small percentage, around 10-15%, of the miRNA-gene pairs e xhibiting alternati v e splicing leading to the exclusion of miRNA target sites were suitable for our modeling due to the need to mitigate the impact of numerous other confounding effects.Although miRNAs and genes had to be removed from the analysis, it does not negate the possibility of alternati v e splicing-mediated miRNA regula tion af fecting these miRNAs and genes.To be able to conclusively demonstrate the existence of such an effect rigorous filtering was necessary but further r esear ch is needed to ascertain the prevalence and significance of alternati v e splicing-influenced miRNA regulation in different biological conte xts.Ne v ertheless, we pub lished the list of significant cancer-specific miRNA-gene interactions affected by alternati v e splicing (see Data Availability), and provide a basis for further experimental investigation of specific interactions and the influence of alternati v e splicing.In the 10 NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3 futur e, we ar e planning to provide our findings as a userfriendly database where the miRNA-gene pairs can be investigated visually.
Another interesting aspect worth considering in the future is the question of the effect of miRNA regulation on the alternati v e splicing machinery, as miRNAs can bind to splicing factors and alter splicing activity ( 61 ).This work focuses on the more common miRNA-mediated downregula tion ra ther than upregula tion, as miRNA-media ted upr egulation is rar e and curr entl y not sufficientl y understood ( 23 , 62 , 63 ).
Finally, tools are needed to study the functional consequences of d ysregula ted transcripts.As current pa thway enrichment analyses do not take into account AS, this also presents avenues for future research.Therefore tools such as mirPathDB could benefit from operating on the transcript le v el similar to NEASE, which connects splicing and pathway analysis by functionally enriching AS e v ents ( 64 ).

CONCLUSION
We hav e de v eloped a ne w computational method to assess the influence of miRNA binding in coding regions on the whole transcriptome.We studied the impact of alternati v e splicing on miRNA regulation on the whole transcriptome for se v eral cancer types and Par kinson's disease while focusing on the coding region.Using sequence data, miRNA target sites were predicted on human mRNA and the difference in correlation between miRNA expression and binding vs. non-binding transcript expression was investigated using nested linear r egr ession models.We wer e able to show that miRNAs binding in coding r egions ar e effecti v e at reducing transcript expression and that transcripts that splice out the miRNA target site are less affected by miRNA-mediated downregulation.Beyond the influence of alternati v e splicing, we show evidence that the coding region plays a role in miRNA regulation.Our findings suggest that further clinical studies can be directed at studying miRNA target sites in the coding region.

DA T A A V AILABILITY
All data presented are deri v ed from previously published data sets as indicated.The used input data and output data including miRNA-gene pairs with p-values, miRNA-gene pairs with Cohen's f 2 ≥ 0.35 and the results of the Gene Set Enrichment analysis for all six investigated cancer types and the Python code used for this study are available at https: // doi.org / 10.6084 / m9.figshare.21821181.
Figure 1 illustrates the workflow we used to investigate the impact of alternati v e splicing on miRNA-mediated gene e xpr ession r egulation.Human miRNA and mRNA sequences were input to TarPmiR for miRNA target site prediction on the mRNAs (Figure 1 .1,for details see Materials and Methods).Based on the binding probability and the location of a target site for each miRNA-transcript pair, we categorized transcripts into four types (Figures 1 .2, 2 A): (a) nonbinding transcripts; (b) transcripts with binding in the coding region; (c) transcripts with binding in non-coding regions; (d) transcripts with binding in both coding and noncoding regions.The number of miRNA-transcript pairs in each category is shown in Figure 3 B. To investigate the role of miRNA target sites in coding regions opposite to noncoding regions, from here on the same steps were performed separately for each of the three settings (Figure 2 B):

Figure 2 .
Figure 2. ( A ) Transcripts are divided into four different transcript types: ( A ) non-binding transcripts, ( B ) transcripts with target sites only in the coding region, ( C ) transcripts with target sites only in the non-coding region (3 -UTR or 5 -UTR), ( D ) transcripts with target sites in both the coding and noncoding region B Structure of the nested miRNA-gene-le v el linear regression models.The full model is trained on: all transcripts (ALLT), only transcripts without target sites in non-coding region (TNBN), and only transcripts with target sites in non-coding region (TBN).Accordingly, the reduced model is only trained on a subset of the transcripts without target sites in the investigated region.

Figure 3 .
Figure 3. Ca tegoriza tion of pr edicted TarPmiR target sites befor e any filtering ( A ) in target sites with binding probability between 50% and 80% and above 80% and ( B ) in miRNA-transcript pairs based on target region (coding / non-coding region).

Figure 4 .
Figure 4.The ratio of models with statistically significant ( < 0.05) corrected p-values of the likelihood ratio test statistic calculated between nested r egr ession models is shown as the dashed green line.To estimate the distribution, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs (green histogram) and to estimate the impact of alternati v e splicing, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs while r andomizing the tr anscript binding labels within a gene (y ellow histogr am).Dist describes the difference between the average ratio of models based on subsampled real miRNA-gene pairs and after randomizing the transcript category labels.This is shown separately for diseases Brain lower grade glioma (LGG) and Kidney chromophobe carcinoma (KICH) for settings ALLT, TNBN and TBN.

Figure 5 .
Figure 5.The ratio of models with statistically significant ( < 0.05) corrected p-values of the likelihood ratio test statistic calculated between nested r egr ession models is shown as the dashed green line.To estimate the distribution, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs (green histogram) and to estimate the impact of alternati v e splicing, the ratio was calculated 1000 times for random subsets of miRNA-gene pairs while r andomizing the tr anscript binding labels within a gene (y ellow histogr am).Dist describes the difference between the average ratio of models based on subsampled real miRNA-gene pairs and after randomizing the transcript category labels.This is shown for Parkinson disease for settings ALLT, TNBN and TBN.

Table 1 .
illustrates the number of miRNA-gene pairs after filtering, while Supplementary Table S1 shows the intermediate numbers of miRNA-gene pairs after the single filtering steps and Supplementary Figure S5 additionally shows the distribution of the number Number of samples, miRNAs and miRNA-gene pairs after all filtering steps shown for settings ALLT, TNBN, TBN and for the investi-