Exploring miRNA–target gene pair detection in disease with coRmiT

Abstract A wide range of approaches can be used to detect micro RNA (miRNA)–target gene pairs (mTPs) from expression data, differing in the ways the gene and miRNA expression profiles are calculated, combined and correlated. However, there is no clear consensus on which is the best approach across all datasets. Here, we have implemented multiple strategies and applied them to three distinct rare disease datasets that comprise smallRNA-Seq and RNA-Seq data obtained from the same samples, obtaining mTPs related to the disease pathology. All datasets were preprocessed using a standardized, freely available computational workflow, DEG_workflow. This workflow includes coRmiT, a method to compare multiple strategies for mTP detection. We used it to investigate the overlap of the detected mTPs with predicted and validated mTPs from 11 different databases. Results show that there is no clear best strategy for mTP detection applicable to all situations. We therefore propose the integration of the results of the different strategies by selecting the one with the highest odds ratio for each miRNA, as the optimal way to integrate the results. We applied this selection-integration method to the datasets and showed it to be robust to changes in the predicted and validated mTP databases. Our findings have important implications for miRNA analysis. coRmiT is implemented as part of the ExpHunterSuite Bioconductor package available from https://bioconductor.org/packages/ExpHunterSuite.

Preprocessed reads were mapped to the reference genome using Bowtie (v1.0) [2], using the options -seedmms 0maqerr 80 -seedlen 10 -all -m 7 -best -strata.Reads from the PMM2-CDG and DCM datasets were mapped to the human genome GRCh38; reads from the LD dataset were mapped to the mouse genome GRCm38.Similar mappings were combined following the miRDeep2 criteria using an in-house script collapse bwt.rb, which is included in the workflow (https: //github.com/seoanezonjic/DEG_workflow).Then, preprocessed reads and combined mapping files were analyzed with miR-Deep2 [3] using known miRNAs for human or mouse, previously downloaded from the miRBase database, to predict existing and novel miRNAs for each sample, disabling the arguments for maximum number of precursors to analyze and the minimum read stack height thresholds using -g -1 -a 1.We selected all predicted miRNAs with a miRDeep2 score greater than 0. These options are the most sensitive possible: we used them to ensure that all potential miRNAs would be detected.We later implemented an additional step for miRNA quantification, using these predicted miRNAs.As such, miRNAs for which there was little evidence of expression were removed at this stage.We obtained the sequences of the precursors of the predicted miRNAs from the genome using a custom Ruby script.These sequences were merged to produce a fasta file.Redundancy was removed by merging similar sequences using CD-HIT-EST (v4.8.1) [4] with the option -c 1.

miRNA quantification and expression analysis
Preprocessed reads were mapped to the non redundant predicted miRNA precursors using Bowtie2 (v2.2.9) [5] with default options for single reads and were quantified using sam2counts [6](Supplementary Figure 1B).This resulted in a table of counts for all miRNAs across samples.This was analyzed by degenes hunter.R [7] to obtain DEMs between case and control samples as indicated in the experimental design.The count matrices were normalized using the counts per million (CPM) method implemented by the edgeR [8] package and miRNAs with fewer than 2 CPM were filtered out.Four differential expression detection packages were executed: DESeq2 [9], edgeR [8], limma [10] and NOISeq [11].Packages that did not find any miRNAs with absolute base 2 logarithmic fold change (log2FC) > 1 and False Discovery Rate (FDR) < 0.05, as calculated by each method, were removed.DEMs for downstream analysis where those detected by all remaining packages.The CPM table was used as input to the weighted gene correlation network analysis method (WGCNA) [12] to group miRNAs into co-expression modules.
RNA-Seq analysis RNA-Seq samples were preprocessed with SeqtrimBB using the default template for transcriptomic data.Preprocessed reads were mapped to the genome and quantified using STAR (v2.5.3) [13] using the -no-discordant -no-mixed options for paired reads.
This resulted in tables of counts of genes per sample which were then analyzed using degenes hunter.R [7] to detect differentially expressed genes (DEGs) between case and control samples.The table of counts was normalized using the CPM.DESeq2, edgeR, limma and NOISeq were executed; packages that did not find any expressed genes with absolute log2FC > 1 and FDR < 0.05 were removed from the analysis.The DEGs were those genes detected by any package (in contrast to the DEMs which had to be detected by all packages).The CPM table was used to find co-expression modules using WGCNA as described above for the miRNA analysis.

Processing input
We calculated the Pearson correlation coefficient between genes and miRNAs based on the following criteria: For genes, they must be DEG, or be included in a coexpression module that contains at least one DEG with a KME > 0.7, and the gene itself has a KME > 0.7 for that module.The KME measure, also known as module membership, is the Pearson correlation value between the gene expression profile and the module eigengene, defined as the first right-singular vector of the standardized module expression data [12,14,15].
For miRNAs, only DEMs were used.For each miRNA-/RNA module, we used either the eigengene or the hub gene as representative of the module.We defined the hub of a module as the gene/miRNA with the highest KME.The input of coRmiT can be filtered through the option -tag filter and module membership threshold can be set by the option -module membership cutoff Using single expression profiles and/or module representative profiles as input (eigengene profiles or hub gene expression profiles) we performed 7 correlation strategies using coRmiT that could be categorized into one three groups as showed in Table 1.
Table 1.Summary of the correlation strategies implemented in coR-miT.Strategies are classified according to the input data into three groups that combines CPM and co-expresison modules.The names of the strategies (column "Strategy name") are composed from the gene and miRNA profiles (columns "Gene profiles" and "miRNA profiles") that are used for correlation analysis.CPM means counts per million matrix, eigengene is the first right-singular vector of the standardized expression data of a module and HUB is the expression profile of the gene/miRNA with highest correlation with the module eigengene.

Strategy group
Gene Supplementary Figure 1.General overview of the Expression analysis workflow.A miRNA detection using smallRNA-Seq samples; B miRNA quantification and expression analysis pipeline; C Gene expression analysis using RNA-Seq samples; D mTPs analysis using data obtained from the previous steps.DEGs and DEMs represents the differentially expressed genes and miRNAs respectively.CPM matrix is the expression matrix normalized to show counts per million mapped reads for each gene.Eigen and hub are the representative profiles of each co-expression module, i.e. the eigengene and the hub gene profiles.mTPs are the miRNA-target gene pairs

Comparing the performance of correlation strategies
Each of the strategies resulted in a list of predicted mTPs (strategy mTPs), which consisted of all miRNAs and target gene pairs with a correlation value lower than a given threshold.We then evaluated the results of all strategies obtained using a range of correlation thresholds, from -0.9 to -0.5 in intervals of 0.05.This was achieved by comparing the mTP lists for each strategy/threshold combination with experimentally validated and computationally predicted mTPs obtained from the multiMiR [16] meta-database, which integrates mTP data from eight predictive tools and three validation databases.The user can also load their own benchmark file of mTPs, such as mTPs predicted using deep learning methods like DMISO [17], TargetNet [18], miRBind [19] or ncRNAInter [20] using the option -add databases.
We only used the mTPs included in multiMiR that existed in at least one validation database (multiMiR mTPs) and/or were predicted by at least two prediction tools to avoid false positives.Then, we compared strategy mTPs to the multiMiR mTPs to calculate the overall odds ratio for each strategy and threshold using the following equation: Where O represents the strategy mTPs that were found in the multiMiR mTPs and S s represents those that were not, S d represents possible permutations between expressed miRNA and genes that were found in the multiMiR mTPs but were not found by the strategy and R the possible permutations between expressed miRNA and genes that did not appear in either strategy or multiMiR mTPs.
Selecting the optimal mTPs for each miRNA to integrate the results of the different strategies We have designed a selection-integration method that, for each miRNA, suggests potential mTPs from the data by using the results of the best peforming strategy and threshold.
For this, we computed a miRNA specific odds ratio for each DEM, comparing the mTPs found for each strategy and correlation threshold with the multiMiR mTPs.The miRNA specific odds ratio was calculated with the same formula as the overall odds ratio mentioned previously.The one-tailed Fisher's exact test was used as a significace measure, along with the odds ratio.The significant strategy/correlation threshold combinations were those that led to a p-value < 0.05 and odds ratio higher than 1.
For the selection-integration method, for each miRNA, we ranked the significant strategies and correlation threshold combinations by their odds ratios.When multiple strategies had the same odds ratio, we ranked more highly the strategy with the more restrictive correlation threshold the higher number of pairs that were multiMiR mTPs.The mTPs from the top ranked strategy and correlation threshold combination for each miRNA were selected.

Validation of differential expression of targets in Lafora Disease
The expression levels of the putative LD targets were analyzed by RT-qPCR to confirm differential expression.cDNA was prepared from total RNA extracts using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems, Life Technologies, Foster City, CA, USA) following the manufacturer's recommendations.We adapted the calculations for introducing 200 ng/µL of RNA in a 20 µL reaction volume.For a single sample, the master mix preparation comprised 10 µL of RNA, 2 µL of pool RT primers (5x), 0.8 µL of dNTPs (100 mM), 1 µL of MultiScribe reverse transcriptase (50 U/µL), 2 µL of 10X reverse transcription buffer, and 4.2 µL of nuclease-free water.A negative control of nuclease-free water was included to ensure no contamination with genomic DNA.The parameters for thermal cycling were 10 min at 25 • C, 120 min at 37 • C, and 5 min at 85 • C. For the real-time quantitative PCR, three technical replicates of each cDNA sample were generated for each gene to be tested, all in a 384-well plate using a QuantStudio 5 Real-Time PCR System thermocycler (Thermo Fisher Scientific).For each 10 µL of master mix, 1 µL of cDNA and 3.5 µL of nuclease-free water was mixed with 0.5 µL TaqMan ® Gene Expression Assay (20x) (specific for each gene: Mm00436931 m1 for Tert, Mm00498375 m1 for Tgm1, Mm04209424 g1 for Trem2, Mm00490624 m1 for Smc1a, Mm00433489 m1 for Gabrg2, Mm00433473 m1 for Gabrb3, Mm01253033 m1 for Gfap, Mm00449152 m1 Tyrobp, Mm00475988 m1 for Arg1, Mm00440207 m1 for Psmb8 ) and 5 µL of TaqMan Gene Expression Master Mix (Applied Biosystems).The reactions were carried out using the following parameters: 50 • C for 2 min, 95 • C for 10 min, 40 cycles of 95 • C for 15 s, and 60 • C for 1 min.Finally, all RT-qPCR data were captured using the QuantStudio™ Design and Analysis Software (version 1.5.1,Thermo Fisher).The relative expression of each sample was calculated using the 2-∆∆CT method (Livak & Schmittgen, 2001), using the CT (cycle threshold) value of Gapdh (Mm99999915 g1) as an internal control for the normalization of the CT of each sample.Graphical representations and statistical analyzes were conducted using GraphPad Prism v.8.3.0 for Windows (GraphPad Software, San Diego, CA, USA, www.graphpad.com).The selection of the targets was performed manually according their potential key role regulating the LD behaviour from the Online Repository Table 11.