PAREameters: computational inference of plant microRNA-mRNA targeting rules using RNA sequencing data

MicroRNAs (miRNAs) are short, non-coding RNAs that influence the translation-rate of mRNAs by directing the RNA-induced silencing complex to sequence-specific targets. In plants, this typically results in cleavage and subsequent degradation of the mRNA. This can be captured on a high-throughput scale using degradome sequencing, which supports miRNA target prediction by aligning degradation fragments to reference mRNAs enabling the identification of causal miRNA(s). The current criteria used for target prediction were inferred on experimentally validated A. thaliana interactions, which were adapted to fit that specific subset of miRNA interactions. In addition, the miRNA pathway in other organisms may have acquired specific changes, e.g. lineage-specific miRNAs or new miRNA-mRNA interactions, thus previous criteria may not be optimal. We present a new tool, PAREameters, for inferring targeting criteria from RNA sequencing datasets; the stability of inferred criteria under subsampling and the effect of input-size are discussed. We first evaluate its performance using experimentally validated miRNA-mRNA interactions in multiple A. thaliana datasets, including conserved and species-specific miRNAs. We then perform comprehensive analyses on the differences in flower miRNA-mRNA interactions in several non-model organisms and quantify the observed variations. PAREameters highlights an increase in sensitivity on most tested datasets when data-inferred criteria are used.


INTRODUCTION
Small RNAs (sRNAs) are short, non-coding RNAs with roles in transcriptional and post-transcriptional gene regulation in eukaryotes [1]. In plants, the latter mode of action is achieved via a class of sRNAs, the microRNAs (miRNAs), which reduce the amount of messenger RNA (mRNA) available for translation by directing the RNA-induced silencing complex (RISC) to their sequence-specific mRNA target(s) and inducing cleavage and subsequent degradation of the mRNA [2]. The Parallel Analysis of RNA ends (PARE) protocol [3], also known as degradome sequencing, captures the 5' ends of downstream cleaved mRNAs, which are used to quantitatively predict miRNA-mRNA interactions.
Improvements to next-generation sequencing technologies have resulted in larger and more diverse experiments, including multi-omics ones [4]. It has also led to the sequencing and annotation of many different organisms' genomes and facilitated functional studies outside of the context of model organisms [5]. However, a vast proportion of our understanding of specific biological mechanisms is based on the study of model organisms, mostly due to their lower regulatory complexity and availability of extensive, varied public sequencing datasets. Many computational methods designed for extracting information/features from sequencing data, (e.g. sRNA classification and target prediction) often summarize the data-mining results into rule-based models, derived from experimental observations. corresponds to the smallest overall value of the absolute ratio between Se and PPV since it corresponds to the minimum increase of sensitivity with respect to the loss in PPV (ST3).
The significance of the distribution of properties with respect to the miRNA was calculated using offset χ 2 tests and the contribution of each property was assessed using individual Fisher exact tests [19]. To better understand the contribution of specific nucleotide base-pairs or motifs, the distributions of match pairs (A/U, G/C and G/U), mismatch pairs (A/m, C/m, G/m, U/m) and gap pairs (A/g, C/g, G/g, U/g) were also determined, with the positional significance evaluated using the χ 2 test and the individual Fisher exact tests. Lastly, the distribution of miRNA-mRNA duplex MFE ratios [9, 13] were analysed using Kolmogorov-Smirnov tests.
In addition to the A. thaliana datasets, we exemplify the usage of PAREameters on sRNA and corresponding PARE datasets from A. trichopoda leaf (D4A) and opened female flower (D4B) (GSE41811); G. max leaf (D5) (GSE76636) [23]; O. sativa inflorescence (D6) (GSE18251) [24] and T. aestivum 2.2mm spikes (D7) (GSE36867) [25]. The transcriptome and genome sequences for organisms other than A. thaliana were obtained from EnsemblPlants Release 43 [26], namely, A. Summaries about each dataset, such as the number of raw and unique reads and genome matched reads can be found in ST4. In addition, for the sRNA data, we report the number of known miRNAs present (based on current miRBase [15] annotation) and for the PARE data, we also include the number of transcriptome matching reads.

Evaluation the inferred targeting rules in A. thaliana
First we compare the increase in accuracy of the computationally inferred targeting rules versus the manually inferred Allen et al. rules. Next, we assess the stability of the inferred rules using random subsampling of the input data. The effect of additional parameters, such as size of the input data and the user-defined retain rate, are also discussed.
When comparing the sensitivity of the manually inferred criteria and the Allen et al. rules on the set of A. thaliana validated interactions, the former outperformed the latter by ~15%. However, this increase in performance may be due to the over-fitting of the targeting criteria on the currently known interactions.
In addition, due to the scarcity of validated interactions (either as number of valid interactions or localization of specific modes of action in different cell types [27]), these criteria may not be portable between various organisms or tissues. Therefore, we used the PAREameters tool to infer targeting criteria from the A. thaliana D1, D2 and D3 datasets. The inferred criteria were then utilized by PAREsnip2 for target prediction and the results evaluated and compared to the Allen et al. rules. The resulting parameters reported by the PAREameters pipeline can be found in ST5. The evaluation method used is identical to that of the manually inferred criteria. Specifically, for each dataset, the class of positive (P) data included experimentally validated miRNA-mRNA interactions with HC transcript peaks and corresponding miRNA sequence with abundance ≥5.
The results, presented in We also evaluated the time and memory performance of PAREameters on each dataset. The runtime of the pipeline depends on the size of the input data (sequencing depth of the sRNA and PARE samples and the size of the reference genome). On A. thaliana D1, D2 and D3 datasets, the runtime range was 16 minutes and 52 seconds to 1 hour 4 minutes (this excludes the time taken to build the bowtie index as this is only done once per species) and the memory usage varied between 5GB and 8GB (ST6). The inference component of PAREameters is linear on the size of the sRNA and PARE input data.
Based on the properties of HC miRNA-mRNA duplexes with cleavage signal confirmation in the PARE data, PAREameters extracted targeting criteria that increased the sensitivity and retained precision versus existing fixed criteria when tested against a golden standardthe set of experimentally validated interactions in A. thaliana. To avoid the overfitting of targeting criteria based on characteristics of the input data, we tested the stability of the inferred properties using a cross-validation technique and the set of experimentally validated A. thaliana miRNA-mRNA interactions on the D1, D2 and D3 datasets.
Specifically, we used the HC interactions with corresponding miRNA sequences in each dataset as a starting point. We then randomly split the HC validated interactions in each dataset to form two groups: the training group, containing 75% of the data, and the testing group, which contained the remaining 25%. PAREameters was used to infer parameters on the training set and these were employed by PAREsnip2 for target prediction on the test set. We then calculated the sensitivity and precision of the inferred parameters on the training set and on the test set. The random cross-validation was repeated 50x and the results, displayed in ST7, show that PAREameters is able to infer targeting parameters with a median sensitivity of 77%, [range: 66-81%] and precision 83% [range: 75-100%] when evaluated on the unobserved testing data.
The decrease in sensitivity from our previous analysis likely originates from the fact we are inferring criteria from one set of miRNA-mRNA interactions and testing on a different set of miRNA-mRNA interactions. Whereas previously, we were inferring criteria from the whole set of PAREameters predicted HC miRNA-mRNA interactions. This further supports our hypothesis that miRNAs may have different modes of action or target complementarity requirements and demonstrates that using just one set of fixed criteria is not sufficient when performing miRNA target prediction.
To investigate how increasing the amount of training data may lead to a more accurate representation of inferred targeting criteria, we evaluated the computationally inferred criteria produced by PAREameters on different sized subsets of the experimentally validated interactions contained within the D1 datasets. Starting with 10% of the validated data, followed by increments of 10% until the final value of 90%, we used PAREameters to infer criteria on the data subset and then evaluated those criteria on the remaining unseen data. Analysis on each subset was performed 50 times and the results shown in ST8. On each dataset, increasing the amount of training data resulted in an overall increase in sensitivity. Intriguingly, the increase in training data resulted in a decrease in precision. However, this should not be seen as a negative result, as we've previously stated, the class FP is the set of predicted interaction for which, currently, there is no experimental validation. Indeed, the current class of positive data is almost certainly incomplete, therefore further experimental validation can only increase the sensitivity and precision values for the inferred criteria.
To evaluate how changes to the PAREameters retain rate parameter impacts Se and PPV, we evaluated the computational inferred targeting criteria produced by PAREameters on the D1 dataset with increasing retain rate values. The results of this analysis are shown in ST9 and SF1. Starting with an initial value of 0.5 and with increments of 0.05 thereafter, we recorded the number of validated and non-validated interactions being captured at each value. Next, we computed the ratio between the Se and the PPV; the data-inferred threshold for the retain rate parameter is the value that minimizes this ratio i.e. it is the value for which the increase in Se is minimal with respect to the loss in PPV (ST3). In

Consistency of attribute distributions and inferred criteria across miRNA subsets in A. thaliana
To evaluate the portability of targeting criteria (and distribution of properties) across miRNA subsets we  Figure 2A and 2B, respectively. In Figure 2A thaliana are significant, we performed χ 2 tests of significance using the conserved properties as the expected distribution and the species-specific properties as the observed distribution. Additionally, we use the Fisher's exact test to determine the specific property at each position responsible for the significance of the differences. The results of the significance analysis for the position-specific property distributions are presented in Table 2 significant differences in the proportion of G:U pairs. We also analysed the differences in MFE ratio distributions between conserved and species-specific miRNAs, shown in Figure 2B, and the significance of the differences were evaluated using the Kolmogorov-Smirnov test, which reported a pvalue of 8.57×10 -10 . These results may suggest a higher complementarity requirement between conserved miRNAs and their targets than that of species-specific miRNAs.
To investigate the portability between criteria inferred exclusively on conserved or species-specific miRNA interactions, we evaluated the inferred rules of each set of interactions (all four pairwise combinations: conserved rules on conserved interactions, conserved rules on species-specific interactions and the similar pairs on the species-specific rules), using PAREsnip2. The results, presented in Table 3, show a consistent decrease in sensitivity for both the conserved and speciesspecific miRNAs when inferring criteria on the other subset of miRNA-mRNA interactions. Specifically, a decrease from 82.08% to 65.67% and 76.09% to 55.98% for the conserved and species-specific miRNA-mRNA interactions, respectively. Further investigation into these differences support our previous observation regarding the differences in MFE ratio of conserved and species-specific miRNA interactions, with the inferred values being 0.75 and 0.68, respectively, further supporting our previous observation regarding an increased complementarity requirement for conserved miRNAs. Another intriguing difference between the inferred criteria is an allowed mismatch or G:U pair at position 10 of the species-specific miRNAs.
The differences between the properties of conserved and species-specific miRNA-mRNA interactions highlight the need for customization in the set of criteria used for describing and capturing miRNA-mRNA interactions when conserved or species-specific miRNAs are involved.

Evaluation of miRNA targeting criteria in non-model organisms
Current miRNA targeting rules, inferred on interactions, mostly consisting of conserved miRNAs from A. thaliana [9], have been applied to other species for target prediction [28][29][30][31]. However, to the best of our knowledge, no comprehensive investigation into the suitability of these fixed targeting criteria has been performed in non-model organisms. The characterization of miRNA-mRNA interactions has been facilitated by both the increased complexity of experiments involving non-model plant species and through the analysis of RNA degradation profiles (PARE [3] sequencing and more recently NanoPARE [32]), which despite technical limitations (e.g. sequencing bias) can provide reliable high-throughput pseudo-validation of microRNA-mediated cleavage sites.
To investigate the suitability and portability of the fixed Allen et al. rules on non-model organisms and evaluate the scope for customised, organism-specific rules, we conducted an exploratory analysis using as input the HC degradome-supported miRNA-mRNA interactions reported by PAREameters. We compared the inferred rules for flower and leaf tissues in several organisms to produce a quantitative summary of the variation ranges of thresholds for the selected rules. Table 4 shows these summaries of inferred criteria per organism; Figure 3A illustrates the position-specific distributions of G:U pairs, mismatches and gaps, and Figure 3B  Next to the position-specific properties, the MFE ratio (part of the targeting criteria) was also investigated as a discriminative feature ( Figure 3B); the Kolmogorov-Smirnov test was used to evaluate differences between distributions of different species. The distribution of MFE ratios and results of the statistical test, presented in Table 6, illustrate the differences between monocots and dicots, with significant differences only reported when comparing different groups.
To evaluate the differences in number and identity of predicted miRNA targets when using the Allen et al. and PAREameters inferred criteria on the non-model organisms, we performed target prediction Lastly, we investigated the overlap between the miRNAs and their interactions for each set of criteria, presented in Table 7

DISCUSSION
Larger, more diverse experiments, outside of the context of model organisms rely on computational methods, that extract features from sequencing data, for achieving a comparable sensitivity with wellstudied organisms such as A. thaliana. The recent changes implemented for miRNA classification criteria [6][7][8] have signalled the need for a revised approach for defining their mode of action too.
In this paper, we describe PAREameters, a novel approach and pipeline that enables data-driven The comparison of validated miRNA-mRNA interaction properties between conserved and speciesspecific miRNAs highlighted interesting results. When investigating the features of conserved miRNA interactions, we observed similar patterns to that of Allen et al. [9] regarding complementarity in the core region of the miRNA (2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13) and also at the canonical position 10. This observation is further supported by a recent study of highly conserved miRNAs in N. benthamiana [33], where it was shown that a single mismatch at the 5' end of miR160 significantly diminished target site efficacy, and two or more consecutive mismatches at the 5′ end fully abolished it. Furthermore, the authors highlighted that a single nucleotide mismatch at positions 9 and 10, in addition to combinations of mismatches at positions 9, 10, and 11 led to the complete elimination of the responsiveness of miR164. However, the species-specific miRNAs tended to tolerate more flexibility at these positions.
Through our study, we highlight that targeting criteria inferred on subsets of interactions are less compatible with one another and often lead to a considerable drop in sensitivity. Given the current understanding of the miRNA-mRNA interactions in various species, it is difficult to propose a biological interpretation of these variations, however we can conclude that a customised selection of thresholds may result in a more detailed overview of regulatory interactions and facilitate a more in-depth assessment of the underlying regulatory networks. Furthermore, the differences observed in the flower tissue between monocots and dicots emphasise the usefulness of data-inferred, species/tissue specific thresholds.
The tool is optimized both in runtime and computational resource usage; the analysis of a typical A. thaliana and T. aestivum sample completes in ~30 minutes and 1 day 10 hours, with 6GB and 10GB memory (RAM) requirements, respectively. PAREameters was also implemented as a user-friendly, cross-platform (Windows, Linux and MacOS) application that enables users to analyse sequencing datasets without the need of specialized support or dedicated hardware. These features recommend it for a wide variety of experimental designs and organisms that will enable further understanding of the subtle variations in miRNA-mRNA interactions in different species, tissues and treatments. In addition, the novel data-driven approach may enable new discoveries within the RNA silencing pathways.

AVAILABILITY
PAREameters is available as part of the UEA sRNA Workbench [16]; it can be downloaded from http://srna-workbench.cmp.uea.ac.uk/. The source code has been released on GitHub and is accessible at https://github.com/sRNAworkbenchuea/UEA_sRNA_Workbench/.  distribution. The significance of the differences in the localization of gaps, G:U pairs and mismatches was assessed using offset χ 2 tests and the contribution of individual categories was evaluated using Fisher exact tests. The first position and the 8-10 range, important for the cleavage ability of the miRNA, showed significant/marginal significant differences; in addition, positions 14 and 16 illustrated the divergence in properties between these subsets. The similarities in the distributions of the MFE ratios were evaluated using the Kolmogorov-Smirnov test, which reported a p-value of 8.57×10 -10 , the distributions of MFE ratios were different both in location of the mode and the shape of the distirbutions.
Dataset # miRNAs # V Allen V Inferred V Allen NV Inferred NV Allen Se Inferred Se Allen PPV Inferred PPV Se gain PPV difference