BrumiR: A toolkit for de novo discovery of microRNAs from sRNA-seq data

Abstract MicroRNAs (miRNAs) are small noncoding RNAs that are key players in the regulation of gene expression. In the past decade, with the increasing accessibility of high-throughput sequencing technologies, different methods have been developed to identify miRNAs, most of which rely on preexisting reference genomes. However, when a reference genome is absent or is not of high quality, such identification becomes more difficult. In this context, we developed BrumiR, an algorithm that is able to discover miRNAs directly and exclusively from small RNA (sRNA) sequencing (sRNA-seq) data. We benchmarked BrumiR with datasets encompassing animal and plant species using real and simulated sRNA-seq experiments. The results demonstrate that BrumiR reaches the highest recall for miRNA discovery, while at the same time being much faster and more efficient than the state-of-the-art tools evaluated. The latter allows BrumiR to analyze a large number of sRNA-seq experiments, from plants or animal species. Moreover, BrumiR detects additional information regarding other expressed sequences (sRNAs, isomiRs, etc.), thus maximizing the biological insight gained from sRNA-seq experiments. Additionally, when a reference genome is available, BrumiR provides a new mapping tool (BrumiR2reference) that performs an a posteriori exhaustive search to identify the precursor sequences. Finally, we also provide a machine learning classifier based on a random forest model that evaluates the sequence-derived features to further refine the prediction obtained from the BrumiR-core. The code of BrumiR and all the algorithms that compose the BrumiR toolkit are freely available at https://github.com/camoragaq/BrumiR.

Background: MicroRNAs (miRNAs) are small non-coding RNAs that are key players in the regulation of gene expression. In the last decade, with the increasing accessibility of high-throughput sequencing technologies, different methods have been developed to identify miRNAs, most of which rely on pre-existing reference genomes. However, when a reference genome is absent or is not of high quality, such identification becomes more difficult. Results: In this context, we developed BrumiR, an algorithm that is able to discover miRNAs directly and exclusively from sRNA-seq data. We benchmarked BrumiR with datasets encompassing animal and plant species using real and simulated sRNA-seq experiments. The results demonstrate that BrumiR reaches the highest recall for miRNA discovery, while at the same time being much faster and more efficient than the state-of-the-art tools evaluated. The latter allows BrumiR to analyze a large number of sRNA-seq experiments, from plants or animals species. Moreover, BrumiR detects additional information regarding other expressed sequences (sRNAs, isomiRs, etc.), thus maximizing the biological insight gained from sRNA-seq experiments. Finally, when a reference genome is available, BrumiR provides a new mapping tool (BrumiR2reference) that performs an a posteriori exhaustive search to identify the precursor sequences. Conclusions: In summary, we present a new and versatile method that implements novel algorithmic ideas for the study of miRNAs that complements and extends the currently existing approaches. The code of BrumiR is freely available at https://github.com/camoragaq/BrumiR .

Response to Reviewers:
5th November 2021 Dear Dr Nicole Nogoy, Thank you very much for considering our manuscript for publication in GigaScience. We would like to thank the referees and you for the careful assessment of our manuscript. We have attempted to address all points raised by the referees and hope that the responses are satisfactory. With these revisions, we believe that our manuscript has been substantially improved and hope that it is now suitable for publication in GigaScience. Please find below our point-by-point replies to the reviewers' comments. All changes in the main manuscript and the supplement have been marked in blue font, including all the Supplementary material. We refer to the changes we have made in the manuscript using consecutive line numbering. Yours sincerely, Carol Moraga Reviewer #1: Summary: The authors developed a de novo assembly method, BrumiR, for small RNA sequencing data based on de Bruijin graph algorithm. This tool displayed a relatively high sensitivity in finding miRNAs and helped the authors discover a novel miRNA in A. thaliana roots.
Major comments: 1. Have the authors compared the performance with different seed lengths? Even if the minimal miR length is 18nt in MiRBase 21, seed=18 might not necessarily lead to the best AUC or F score (This might also be related to Comment 4).
We followed the reviewer recommendations and we now compare the performance of BrumiR using different seed lengths of 14, 16, 18, 20, and 22; in both animal and plant datasets. The benchmark results are reported in Table S1. The benchmark shows that the optimal seed size for BrumiR is indeed 14 and not 18 as was previously set by default. The main reason for this improved performance is that shorter seed lengths permit a better handling of sequencing errors (shorter k-mers are less likely to be sequencing errors), and enable a more sensitive clustering of identical miRNAs.
Overall, we observe that the benchmark metrics (Supplementary Table S1, Figure S2) improves in comparison to the observed for the previous seed length (k=18) by 1.67X , 0.97X, 1.46X for precision, recall, and F-score in animal species, and 4.09X, 1.13X, 3X for precision, recall, and F-score in plant species, respectively. In light of the results, we have now set the default seed length to k=14, and we thank the reviewer again for this key suggestion that allows us to further improve the performance of BrumiR. Finally, the code has been updated accordingly and a new release has been published in the Github repository.
The updated text is in page 8, lines 13; page 25, lines 18-22. 2. The authors need to benchmark BrumiR with more existing tools (e.g. those MLbased methods), and to include more genome-free methods (e.g. MiRNAgFree).
In our current benchmark, we included miRDeep2, miR-PREFeR and mirnovo, and a total of 10 animal and plant datasets. miRDeep2 is the most used tool in the field with more than 1530 citations and is considered the reference miRNA prediction tool for animal species. In addition, we included miR-PREFeR that, like miRDeep2, is one of the few tools designed specifically for plant species when a reference genome is available. Complementing these popular reference-based methods, we included mirnovo, which is a genome-free discovery tool based on machine learning. The mirnovo manuscript presents a benchmark comparison with miReader and miRplex which are de novo tools and mirnovo showed better performance than the aforementioned tools. This is one of our main reasons for considering mirnovo as the best tool for performing miRNA discovery without a reference genome. Therefore, we selected the tools that perform best for miRNA discovery, including reference and de novo based.
Nevertheless, we tried to follow the reviewer's suggestion and attempted to add more tools to our benchmark. However, we had difficulties in running the existing tools because most of them are not under active development, do not provide update binaries (standalone versions), or their web-server does not work. For instance, miRplex does not provide recent binaries (the current one, v.01, is from 2012), leading to problems with some old libraries and we could not run it. In another example, the miReader software mirPlex is not in active development (updated for the last time in 2014). We were able to run the standalone version, but after letting it run for more than one month on a single sample, we decided to stop it.
Besides these issues, we succeeded in running the miRanalyzer (genome-reference based), and miRNAgFree (de novo based) tools. The benchmark results are provided in Supplementary Table S5 using a reduced version of the real dataset. The benchmark results show that BrumiR obtained the highest F-score rates, as do miRDeep2 and mirnovo (Supplementary FigureS13). We also observed the poor performance on plant species of most of the tools, which shows the potential of BrumiR, as well as being a tool that can be used for both animal and plant species.
Currently, there is no standard for benchmarking miRNA prediction methods, which means that each new method tries to make the comparison in its own way. For this reason, we tried to make an effort to create a benchmark standard and we developed miRsim to generate benchmarks based on a ground truth known a priori and using the same conditions for all the tools.
Overall, we wanted to benchmark BrumiR in a simple but exhaustive way selecting the top performers for genome-based and genome-free methods. Finally, several miRNA tools do not provide stand-alone versions and therefore prevent extensive benchmarking when only the server is available. Being able to generate a good benchmark remains a challenge in the field of miRNAs where there are no standards.
The updated text is in page 13, lines 17; page 14, lines 1-2; page 36, lines 11-16. 3. It is also interesting to know whether de novo method for mRNA assembly would be useful on the miRNA side. It would be great if the authors were able to compare the performance of BrumiR2reference (without filtering for RFAM) with Trinity in genomeguided mode, by tweaking its seed length to be the same as BrumiR.
We followed the reviewer's recommendation and included two de Bruijn graph de novo transcriptome assemblers, namely Trinity (inchworm), and Velvet (Supplementary Table S7). The genome guide mode of Trinity first aligns the reads to the genome and then a de novo assembly is performed using the inchworm tool for each partition. The genome guide pipeline does not provide access to the seed length of inchworm (default k=25) whose default value exceeds the length of the miRNA sequences. Therefore, we ran it independently and without mapping the reads to the genome, because we thought that in this way the comparison between BrumiR and de novo transcriptome assemblers is more appropriate and allows us to focus on how the candidates are extracted from the de Bruijn graph. To perform the comparison, we used 4 real datasets; including human and Arabidopsis. The seed length and minimum contig length for the transcriptome assemblers were set to k=15 (edge-centric de Bruijn graph, overlap equals 14 nt), and l=18, for both inchworm and Velvet. Then, the contigs longer than 24 nt were filtered out for BrumiR (without filtering for RFAM) and the transcriptome de novo assemblers. Finally, the candidates were matched against miRBase (blast search) and the results of the comparison are presented in Supplementary Table S7. We can observe that the de novo transcriptome assemblers generated on average 40X and 4X more candidates than BrumiR, for Trinity and Velvet, respectively.
In general, the huge number of contigs generated by the transcriptome assemblers, even after filtering them by length, were poorly matched to miRBase entries (1,2% and 36%, indeed). On the other hand, BrumiR matched miRBase entries at a rate of 1 every 2 candidates (52% precision average). As expected, we can conclude that most of the contigs generated by a pure de Bruijn graph transcriptome assembler are poorly related to miRNA sequences. This was expected because they are developed for mRNAseq analysis and do not consider the complexities of the sRNA seq data like BrumiR.
In summary, this experiment showed that BrumiR and all the downstream steps it performs after the De Bruijn graph construction are essential for miRNA discovery.
The updated text is in page 15, lines 9-22; page 16, lines 1-3; page 33, lines 14-22; page 34, lines 1-5. 4. The tool's sensitivity is promising across animal and plant datasets. However, the average precision is quite low, an average precision of 0.3 means a false discovery rate of 0.7. This is not an accepted value for a tool designed to discover novel miRNA. Is there any parameter the author could tweak towards a better performance? For example, is the seed length of 18nt too short to start with? Are there any other sequence features the authors should take into account to boost the performance? Or maybe some post-assembly filtering approaches might be sufficient and helpful.
We followed the reviewer's recommendations and we were able to improve the precision of BrumiR while maintaining a similar sensitivity. The major gain in the precision of BrumiR comes from a reduced seed length, which allows it to better deal with sequencing errors, and from an improved clustering of the reads coming from the same miRNA candidates, plus some code improvements. Overall, our precision for animals and plants increased by 1.67X and 4.09X on average respectively, which led to a combined F-score increased by 1.5X and 3X for animals and plants respectively, significantly reducing the rate of false-positives of our previous version. This can be further reduced when a reference genome is available and BrumiR2Reference is used. Additionally, we also provided a new script to annotate the candidates using the miRBase entries, providing a subset of high confidence BrumiR predictions. The novel candidates can be further refined using additional criteria (such as coverage, presence on several miRNA-seq experiments, etc.) that we describe in Figure 4 and Supplementary FigureS10, in the discovery of novel miRNAs in the Arabidopsis genome using BrumiR. Moreover, we intend to improve our method by implementing a random forest classifier that will enable a better characterization of the novel candidates not present in miRbase. The random forest model learns the sequence features of the miRNA annotated in miRbase and can be used to assign a probability of being a miRNA or not for novel BrumiR candidates for species with or without a reference genome. This feature should be included in a future version of our method. However, at this point we believe that in the absence of a reference genome, BrumiR is able to provide a list of potential miRNA candidates which is not usually possible with the other genome-based prediction tools that have a similar prediction power. We compared BrumiR against the best genome based miRNA discovery tool and we had comparable or better results, even without using a reference genome.
The updated text is in page 8, lines 13 5. Wet-lab validation (e.g. Luciferase assay) for the identified novel miRs will leverage the real-life usefulness of BrumiR. This is extremely important, as the tool showed a high false discovery rate. We agree that functional validation of miRNAs predicted by BrumiR is an important point in order to highlight the usefulness of our tool; and we intend to continue our studies with Arabidopsis' novel miRNA candidates in the near future to provide additional evidence towards the in vivo functions of such candidates in a more specialized manuscript. Unfortunately, this is not within the scope of our study and more specifically the current global pandemic has not helped us finish such functional studies, because this part of the work has been done in Chile where our partner wetlab was closed most of the part of the pandemic and currently is resuming is normal operation.
Even so, we would like to point out that current criteria to validate and annotate miRNAs in plants are based on experimental evidence coming directly from the sequencing libraries and that confirmation by blot of the expression of the miRNA or miRNA* is disallowed, as proposed by Axtell 2018*. What we think could be done in order to improve the sRNA-seq dataset instead would be to enrich the sequencing libraries with functional sRNAs coupled to proteins (as proposed in Grentzinger et al., 2014**). At this point, we believe that we currently fulfill several state-of-the-art criteria for miRNA annotation in plants. We have revised all the criteria exposed in Axtell, 2018* for novel miRNA candidates predicted by BrumiR on the Arabidopsis miRNAseq data. We concentrated our effort on 22 novel miRNA candidates that were expressed in at least 3 replicates, and we found that 4 of them fulfilled at least 7 of 8 criteria, with 2 of them fulfilling all the criteria exposed in Axtell 2018. We have updated this part of the manuscript ( Figure 4, Figure S10, and Table S8, S9, S10) accordingly to reflect these changes and we thank the reviewer for pointing out the need for more comprehensive criteria to reflect the accuracy of BrumiR when predicting new candidates.
(*) Axtell We use the edit distance library to allow a maximum of 2 nt differences in the cluster step of BrumiR. Then, for each cluster, we select the most expressed candidates but a reference is kept for all the cluster members. It is possible to use these clusters to identify miRNA candidates that may be RNA edited but the final output of BrumiR contains only one representative member of each cluster. We have included a flag in the output of BrumiR that reports the size of the cluster for each miRNA candidate. This allows us to explore if a candidate might be RNA edited or not. This is provided in a list as one of the BrumiR output files (*.edit_clusters.ED2.txt). Overall, BrumiR permits a mismatch maximum of two nucleotides when clustering candidates and therefore allows for the identification of candidates with potential RNA edition.
Reviewer #2: The authors here present BrumiR, a de Bruijn-based method to discover miRNAs independently of a reference genome. Today most miRNA discovery and annotation is done by mapping sequenced RNAs to readily available reference genomes and analyzing the mapping profiles. However, there are some uses cases where the genome-free approach is needed (particularly for species that have no reference genome or where the genomes have missing parts); therefore BrumiR could potentially be useful for the community. However, the comparison to existing tools needs to be done in a more careful way.
Major comments: 1-RFAM filtering is not really part of the prediction step, this is rather a filtering step. Therefore, to make a fair comparison with mirnovo (the other genome-free tool), BrumiR should additionally be run without RFAM filtering, and mirnovo should additionally be run using the exact same RFAM filtering.
We used the RFAM database to filter out other kinds of RNAs present in the miRNAseq experiment. Mirnovo calculates a consensus sequence for each cluster (candidate) and then an alignment is performed against the RFAM database to identify candidates matching rRNAs and tRNAs, and these matching candidates are filtered out by mirnovo. The latter step is performed by mirnovo by default and cannot be turned off (there is not a parameter allowing this). In BrumiR, we used the RFAM database to filter RNAs not associated with miRNAs. We built a k-mer database excluding all the RFAM families annotated as miRNAs. All the BrumiR candidates matching this k-mer database are excluded. This step is similar for both tools with the same goal (excluding other RNAs) but using different algorithms (alignment vs k-mer matches). We therefore considered that the comparison that we did is appropriate and fair for both tools. On average, 1% of the BrumiR candidates are filtered-out on this step.
The updated text is in page 9, lines 19-21.
2-it appears that 16-mers from miRBase miRNAs were specifically excluded from the RFAM catalog used for the filtering, which is reasonable. However, the miRNAs from the exact benchmarked species should not be included in the used miRBase 16-mer catalog, to avoid circular reasoning.
Yes, we agree with the reviewer that including miRNAs in this step may generate circular reasoning. This represents a final step of BrumiR where the aim is to remove sequences related to rRNAs and tRNAs, but we do not annotate or guide BrumiR towards previously annotated miRNA sequences present in the RFAM database. To achieve this, the 16-mer database was built from the RFAM database excluding all the families in all the taxa related to miRNA genes (529 families excluded), conserving the most abundant k-mers (more abundant than 5). Moreover, all 16-mers present in the miRBase database were also excluded. In summary, we excluded any reference to known miRNA sequences present in RFAM from the 16-mer database.
The updated text is in page 9, lines 19-21.
3-miRDeep2 software should ideally not be run with default options -this is particularly important since the miRDeep2 performance in this manuscript appears lower than what is reported in other studies (e.g. Friedlander et al. 2012). First, reference mature miRNAs from a related and well-annotated species should be included to support the prediction. Second, a score cut-off should be used that gives a decent signal-to-noise ratio according to the miRDeep2 output overview table (for instance 5:1). Third, all read pre-processing and genome mapping should be performed with the mapper.pl script which is part of the miRDeep2 package.
We agree with the reviewer that the ideal case is to use miRDeep2 according to the recommendations suggested by the developers (Friedländer et al, 2012). For this reason, we first used the script proposed by the developers to map the reads to the reference genome (mapper.pl). Then, we used the miRDeep2 pipeline in a "de novo" mode because we wanted to make a prediction without using previous knowledge, as we did with mirnovo and BrumiR. Finally, we selected all the predictions with a minimum miRDeep2 score of 5.1 (signal-to-noise ratio). Also, the benchmark presented in the miRDeep2 manuscript was performed in a different way, because they used a pool of several samples, and the metrics were computed using all predictions, and for this reason, they are not comparable to our benchmark. As we mentioned above, there is no standard for benchmarking the miRNA prediction methods, which means that each new method tries to make the comparison in its own way and the metrics used could be quite different depending on the design of the benchmark. miRDeep2 presented a good performance in most of the datasets, having the highest precision rate. However, we observed that it is a very conservative method because it does not predict many new candidates, even using the "de novo" mode.
4-it appears that only miRNA-derived sequences were included in the simulated data. In fact, real small RNA-seq data typically contains fragments from other known types of RNA and also sequences from unannotated parts from the genome. Therefore, the authors should use simulated data that also includes samples from RFAM and randomly sampled sequences from the reference genome (for instance 10% of each). Overall, the use of simulated sequence data could be put a bit in the background in this study, since real small RNA-seq data is in fact readily available these days and typically has a structure that is not easy to simulate. Further, there is little reason not to use real data, since the miRNAs in miRBase tend to be reasonably well curated for most species and therefore can function well as a gold standard for benchmarking.
We followed the reviewer's suggestion and we added to our simulated data sequences from the RFAM and the genomes for each of the species included in the benchmark. Now the simulated data include 10% of sequences from RFAM and 10% of random genomic sequences. We have repeated the synthetic benchmark and the new results are reported in the corresponding section. We agree that it is hard to mimic the structure of real miRNA-seq experiments, but the simulated data provide a good overview of the performance of the tools as well as a more controlled way to compare such performance since all the expressed miRNA sequences are known, which may not be the case for real data. Finally, we also included an extensive benchmark for the same species using real miRNA-seq data (Supplementary TableS4) and used the miRbase entries as the gold standard for computing the benchmark metrics as proposed by the reviewer. 6-the authors should either benchmark BrumiR against the genome-free methods miReader and MirPlex, or explain why this comparison is not relevant.
We used mirnovo as a de novo predicted tool. As in the mirnovo paper, the authors compare mirnovo against these tools, miReader and MirPlex, and mirnovo has the best performance, we chose to compare BrumiR only against the best performance tools only in a genome-based and de novo approach, meaning against respectively miRDeep2 and mirnovo, as we explained above we compared BrumiR with 4 different miRNA discovery tools using a reduced version of the real dataset. The results provided in Supplementary TableS5 showed that the highest F-score rates are for BrumiR, miRDeep2, miR-PREFeR and mirnovo (Supplementary FigureS9). We decided to benchmark BrumiR in a simple but exhaustive way selecting the top performers for genome-based and genome-free methods.
Minor comments: -the brief introduction to miRNA biology should be carefully edited by an expert in the field. Currently, very old reviews are being cited (e.g. Bartel 2004), and some of the other references appear to be a bit spurious (e.g. why focus on plant host-pathogen interactions out of the hundreds of established functions of miRNAs?). The excellent review of Dave Bartel from 2018 contains references to numerous milestone studies that the introduction could build on. We included miRNA biology background and we verified our references. Thanks to the reviewers for having pointed this out.
The updated text is in page 3, lines 6-20.
-the authors write on page 2 that genome-based methods struggle with a high rate of false positive prediction, citing [9]. However, this is a mis-reference, since the reference [9] states that methods that rely on _only_ the genome and do not leverage on small RNA-seq data have high false positive rates.
We changed this sentence.
The updated text is in page 4, lines 20-21. The updated text is in page 40, lines 4-6.
2-Since the tool is able to identify novel miRNAs and look also at known ones, they could provide an output file including the read count per miRNA. In addition, since the tool is expected to be ultra-fast (not checked … see above), the differential gene expression analysis could also be implemented.
BrumiR outputs an abundance estimation based on the k-mer counts for each predicted miRNA candidate. Currently, BrumiR quantify each candidate using the total number of k-mers (KC), and the average number of k-mers per base (KM 3-I suggest also implementing a graphical output. A sort of summary in a decorated html page. We followed the reviewer's suggestion and we now provide an R notebook to create an interactive and graphical summary of the output of BrumiR. We do plan to extend these capabilities in the future to further enhance the graphical report (i.e. by providing functions to quickly determine differentially expressed miRNAs). A full example of the R Notebook is provided with the demo dataset.
The updated text is in page 40, lines 5-6.
4-By using BrumiR, authors analyze miRNAs in Arabidopsis during the development, discovering three novel miRNAs. Although bioinformatics evidences indicate that they could be real miRNAs, an experimental validation is required. Indeed, these miRNAs have been detected by BrumiR only. I think that this validation could be easily done because authors directly performed sRNAseq data. In my opinion, this experiment could really improve the manuscript and assess the high performance of BrumiR.
We agree with the reviewer that experimental validation of the candidate miRNAs is an important step and a valuable suggestion for our work. Following the reviewer's suggestion, we have revised all the updated criteria on plant miRNA annotation based on Axtell, 2018* for each novel candidate miRNA predicted by BrumiR using the Arabidopsis thaliana datasets. These criteria are based directly on precursor and miRNA features that can be determined directly from the sRNA-Seq reads.
Considering these criteria, we found that 2 of our miRNA candidates fulfill all of them ( Figure 4, Figure S8, S9, S10). As such, these results highlight the usefulness of BrumiR for predicting new candidate miRNAs even in model species where the miRNA catalogs are highly complete. Although, this last results section of the BrumiR manuscript highlight the power of BrumiR to discover new miRNAs even in highly curated genomes, we believe that the biological insight and subsequent validations may be the focus of a more specialized manuscript rather than the current one where the focus is the new algorithmic ideas that we are presenting.

Findings Background
MicroRNAs (henceforth denoted by miRNAs) are small RNA molecules usually shorter than 25 nucleotides (nt), which have been identi ed as crucial regulators of gene expression mostly at the post-transcriptional level [1]. miRNAs are involved in a wide range of biological processes including cell cycle, di erentiation, apoptosis and disease [2]. They have been the target molecules for a large number of important applications, more particularly in cancer where miRNAs have been shown to play important roles in driving or suppressing tumor spread [3,4]. In plant species, unravelling host-pathogen interactions mediated by miRNAs may shed light on plant development and its relation with the environment, both essential knowledge that can lead to the discovery of new biotechnological products for the agricultural industry [5,6].
Since the rst classi cation and annotation of miRNAs [7,8], accurately identifying them as well as the regulatory networks in which they are involved has proven di cult [9,10]. Accurate prediction of known and novel miRNAs along with their targets is however essential for increasing our understanding of the miRNA biology [3]. Nowadays, a common experimental practice is to identify miRNAs and their expression patterns using next generation sequencing technologies (NGS) [11]. Commonly, NGS experiments are able to generate more than 20 million sRNA-seq reads, thus promoting the development of algorithms to transform and process such data into biological information [12].
Currently, there are two computational strategies for the discovery of miRNAs: 1) genome-based approaches that rely on the mapping of the sRNA-seq reads to a reference genome and subsequent evaluation of the sequences generating the characteristic hairpin structure of miRNA precursors [9]; 2) machine-learning approaches which rely on the biogenesis features extracted from the knowledge on miRNA sequences available in databases such as miRBase [13] and on the analysis of the duplex structure of miRNAs [14]. Genome-based methods, that have been updated at the pace of the evolving NGS technologies, are the most widely used tools in this eld, and their results have populated the public miRNA repositories [12]. Such methods are the natural choice for the study of model species with high quality reference genomes available. However, it has been shown that most of the genomebased tools struggle with a high rate of false positive predictions [9]. Additionally, a critical step of such tools is the use of genome aligners [15,16] to map the sRNA-seq reads to the reference genome. Mapping short (<30 nt) and very similar sequences to a large, complex, and repetitive reference genome is however a di cult and error-prone task [17]. Genome-based methods are thus highly sensitive to the aligner selected as well as to the parameters employed and the thresholds chosen (e.g. number of mismatches allowed) in order to discard mapping artefacts generated from sequencing errors [18]. Furthermore, despite all the advancements in the sequencing technologies and de novo assembly methods, few complete genomes are available today, which is a recurring problem that researchers working on non-model species face [19]. The lack of a high quality reference genome thus reduces the possibilities for discovering novel miRNAs [14]. Genome-based methods such as miRDeep [20], miRDeep2 [21], and miR-PREFeR [22] are included in this group.
On the other hand, new methods such as miReader [23], MirPlex [24], and mirnovo [14], in particular using machine-learning approaches, were speci cally developed as an alternative to discover miRNAs in species without a reference genome. In the case of mirnovo, the initial step involves the clustering of the sRNA-seq reads performing an all-vs-all read comparison that is followed by a subsequent classi cation of the clusters into putative miRNAs using pre-trained models. The performance obtained by such methods on well-annotated species is comparable to those achieved by genome-based methods [9]. However, relying exclusively on annotated miRNAs for training machine learning models may introduce a bias towards the identi cation of well-characterized miRNAs over species-speci c ones [12]. Nonetheless, machine learning methods have demonstrated that it is possible to discover miRNAs using only the sequence information present in the sRNA-seq experiment [14].
There remains however a need to go further in the development of algorithms for nding novel miRNAs in non-model species using only the sequence information. With this purpose in mind, the adoption of a special type of graphs called de Bruijn graphs may be considered. This is a widely used approach for the de novo reconstruction of genome or transcriptome sequences [25]. It therefore appears to be a plausible option for organizing, clustering and assembling the sequence information present in sRNA-seq experiments. However, accommodating the de Bruijn graph approach for the discovery of miRNAs involves the development of new methods to address the speci c characteristics of sRNA-seq data. Indeed, mature miRNA sequences are short (18-24 nt), thus limiting the overlap length for building a de Bruijn graph which in turn impacts the global topology by inducing tangled graph structures. Moreover, miRNAs captured in a sRNA-seq experiment have variable expression, from low (few reads) to highly expressed (thousands of reads), which may induce spurious graph connections that should be removed in order to isolate and detect both types of miRNAs. Finally, the sequencing errors present in sRNA-seq data further induce spurious connections and are harder to detect as compared to genomic data due to the variable expression and the shorter lengths of the miRNAs. Overall, using a de Bruijn graph to analyze sRNA-seq data and extract information from such data seems thus counterintuitive as mature miRNAs are captured full-length by the current NGS technologies. However, a de Bruijn graph has several interesting properties for the discovery of miRNAs, mainly due to the fact that it encodes all the sRNA-seq sequence information at once in a compact and connected representation (graph), without the need to perform an all-vs-all read comparison or mapping to a reference.
In this paper, we present BrumiR, a de novo algorithm based on a de Bruijn graph approach that is able to identify miRNAs directly and exclusively from sRNA-seq data. Unlike other state-of-the-art algorithms, BrumiR does not rely on a reference genome, on the availability of close phylogenetic species, or on conserved sequence information. Instead, BrumiR starts from a de Bruijn graph encoding all the reads and is able to directly identify putative miRNAs on the generated graph. BrumiR also removes sequencing errors and navigates inside the graph detecting putative miRNAs by considering several miRNA biogenesis properties (such as expression, length, topology in the graph). Along with miRNA discovery, BrumiR can also assemble and identify other types of small and long non-coding RNAs expressed within the sequencing data. Finally, when a reference genome is available, BrumiR provides a new mapping tool (BrumiR2reference) that performs an exhaustive search to identify and validate the precursor sequences. We extensively benchmarked BrumiR on animal and plant species using simulated and real datasets. The benchmark results demonstrate that BrumiR is very sensitive, besides being the fastest tool, and its predictions were supported by the characteristic hairpin structure of miRNAs. Finally, we also applied BrumiR to the discovery of miRNAs of Arabidopsis thaliana and identi ed three novel high-con dence miRNAs involved in root development. These putative miRNAs were not discovered before by any other software, thereby showing the potential of using di erent approaches even in the case where high quality genomes are available. The code of BrumiR is freely available at https://github.com/camoragaq/BrumiR.

Building a de Bruijn graph for sRNA-seq data.
BrumiR starts by building a compact de Bruijn graph from the sRNA-seq reads given as input. De Bruijn graphs are a widely used approach in the genome assembly problem [25]. BrumiR uses this graph to organize, detect, and exploit the sequence information of sRNA-seq experiments. BrumiR takes as input sequencing les in FASTA or FASTQ formats. The sequencing data can be cleaned, using a fastq pre-processor [26] (i.e. fastp), to remove adapter sequences and trim low quality bases. BrumiR employs the Bcalm [27] tool to build a de Bruijn graph from the sRNA-seq reads. Bcalm uses a node-centric bi-directed de Bruijn graph where the nodes are k-mers, that is words of length k, and an arc between two nodes if the k -1 su x of one node is equal to the k -1 pre x of the subsequent node, representing an exact overlap of k -1 bases [27]. A critical parameter of any de Bruijn graph approach is the k-mer size [28]. We observed that the length of all mature miRNA sequences stored in the miRBase database (v21) [13] have a minimum value of 18 nt (Supplementary Figure S1). We thus empirically set the k-mer size equal to 18. Bcalm compacts the nodes of the de Bruijn graph into maximal unipaths by gluing all the nodes of the graph with an in-degree and an out-degree equal to one, thus generating the so-called unipath graph [27]. The unipath graph is the starting point of BrumiR ( Figure S2).

Removing sequencing errors from the unipath sRNA-seq graph.
BrumiR deletes from the unipath graph all the nodes that have only one connection (degree equal to 1), known as dead-end paths or tips [29]. Usually, these nodes have a low abundance value associated to them (KM less than or equal to 5, the default parameter).
Moreover, BrumiR deletes isolated nodes (degree equal to 0) having a low abundance; isolated nodes highly expressed are however conserved for further analysis. All these nodes are likely artifacts generated from sequencing errors because they are not deeply expressed in the sRNAs-seq reads [30]. BrumiR iterates this step 3 times in order to prune and clean the unipath graph (polishing).
This operation, called 'tip removal', edits the original unipath graph, and therefore a new unipath graph with a new structure is generated ( Figure 1.2).

An expressed mature miRNA has uniform coverage.
The unipath graph of a set of miRNAs from a sRNA-seq experiment has non-uniform coverage as di erent miRNAs and other elements may be connected in a single big component (Figure 1.1). BrumiR evaluates each connection of the unipath graph to identify those that link two nodes with a large expression di erence. According to the miRNA biogenesis, after a stable miRNA precursor is cleaved by Dicer, among its three products, the miRNA mature sequence is the most abundant and when it is sequenced, it has a uniform expression along its sequence [20]. Thus due to miRNA biogenesis, it is possible to capture the complete miRNA mature sequence having a homogeneous expression [9] directly from the sRNA-seq experiments. BrumiR expects a similar KM value for k-mers originating from the same mature miRNA gene. Accordingly, if we observe two connected nodes that show a big di erence in their abundance values, this connection is deleted and we keep the nodes unconnected. In particular, two unipaths U = [a, b] connected in the graph have a KM value associated to them that represents their coverage from the reads information.
BrumiR scans all the neighbor connections and if the di erence between their KMs is larger than three-fold, the connection is deleted (Ui km /Uj km > 3). In this way, BrumiR de nes a relative threshold that will depend on each unipath neighborhood in the graph. Finally, BrumiR repeats the tips removal step to eliminate new low frequency isolated nodes (Figure 1.4).

miRNAs and other sequences are captured in single connected components.
After the previous steps of BrumiR, a new unipath graph emerges, with a new structure. It is thus necessary to identify and classify the new connected elements within the graph ( Figure 1). A connected component (CC) of a graph is a maximal strongly connected subgraph [31]. BrumiR computes the CCs of the unipath graph, and then each CC is processed independently to identify miRNA candidates as well as to discard other sequences present in the unipath graph.

BrumiR classi es low abundance non-linear topologies as sequencing artefacts.
BrumiR detects topologies that are potentially related to sequencing errors and thus unlikely to be miRNA candidates. The shapes of these topologies were identi ed by visual inspection of several unipath graphs and are described in detail in Supplementary Figure S3). Moreover, we observed that the sequences contained in these topologies were usually redundant and contained in other linear and more expressed CCs. In this way, we are not discarding relevant sequence information. BrumiR removes about 10% of the CCs in this step.

Re-assembling unipaths within each CC.
BrumiR re-assembles all unipaths present in the linear CCs by bundling the nodes with in and out degree equal to 1 into a new  Figure S4).

Re-clustering potential miRNAs.
After grouping unipaths by CCs, BrumiR builds an overlap graph to rescue the missing connections between potential miRNA candidates sharing an overlap with another candidate. First, BrumiR adds all the candidates as nodes of the overlap graph, then an all-vs-all k-mer comparison is performed using exact overlaps of length k=15. Candidates sharing an exact overlap are connected in the overlap graph. Then, the connected components are computed to identify clusters of miRNA candidates, and the most expressed candidate within each component is selected as the representative candidate of the cluster. The representative candidates are compared all-vs-all in a second overlap step that allows a maximum edit distance of 2, which is implemented using the edlib library [32]. BrumiR then builds a second overlap graph, computes again the connected components, and selects the most expressed candidate as the representative of each cluster. The other members of each connected component are classi ed as putative isomiRs and saved in a le for later analysis.

Identifying other expressed RNA sequences.
In sRNA-seq experiments, di erent types of RNAs are expressed, some of which, such as small non-coding RNA elements, may have similar length with miRNAs [33]. The RFAM database [34] is a collection of curated RNA families including three functional classes of RNAs (non-coding, cis-regulatory elements, and self-splicing RNAs), which are classi ed into families according to their secondary structure and sequence information (Covariance Models) [34]. We downloaded 3,017 RNA families present in RFAM (v14.1) and excluded 529 miRNA families [35]. The sequences of 2,488 RFAM families were concatenated (a total of 2,736,549 sequences) and used to build a 16-mer database with the KMC3 k-mer counter tool [36] ("-fm -n100 -k16 -ci5"). All distinct 16-mers with a frequency lower than 5 were excluded, leading to a total of 6,204,556 distinct 16-mers related to RNA elements.

Identifying precursor sequences for BrumiR candidates (BrumiR2reference).
Unlike current state-of-the-art tools that perform miRNA discovery by mapping the sRNA-seq reads to a reference genome, BrumiR generates candidates by operating directly on the sRNA-seq reads. The reduced list of potential BrumiR miRNA candidates permits the computation of a more exhaustive alignment than when mapping directly the sRNA-seq reads to the reference genome.
BrumiR aligns each candidate to the reference genome using an exact alignment method that computes the edit distance [37] between two strings and thus support mismatches, insertions and deletions. The BrumiR2reference tool divides the reference genome in non-overlapping windows of 200bp (adjustable parameter), then the window is indexed using 12-mers and each miRNA candidate is matched in both strands (split at 12-mers). When a 12-mer match is found, an exhaustive alignment is computed between the window and the matching miRNA candidate. The alignment is performed using a fast implementation of Myers' bitvector algorithm [32]. A miRNA candidate is stored as hit if the alignment in the current genomic window has an edit distance less than or equal to 2. After scanning all the genomic windows, the vector of hits is sorted by miRNA-candidate; edit distance (0-2), and alignment sequence coverage. For a single miRNA-candidate, a maximum of 100 genomic locations (best hits) are selected.
BrumiR2reference then builds a potential precursor sequence for each selected hit using a strategy similar to the ones employed by miRDeep2 [21] and Mirinho [38]. BrumiR excises the potential precursor hairpin sequence from the anking genomic coordinates of the reported miRNA candidate hits (mature sequence) in both strands. Potential precursor hairpin sequences of length 110 bp are built for animal species from both strands, while for plant species hairpin sequences of lengths 110, 150, 200, 250 and 300 bp are built from both strands [37]. Secondary structure prediction for all the potential precursor sequences is performed using RNAfold

Data description
Benchmarking BrumiR using simulated sRNA-seq reads.
We simulated synthetic reads from animal and plant species, and compared the results of BrumiR to those obtained with the miRD-eep2 [21] and miR-PREFeR [22] tools. The sRNA-seq reads were simulated using miRsim (https://github.com/camoragaq/miRsim), a tool that we developed speci cally for simulating sRNA-seq reads from a list of known miRNA mature sequences. miRsim is based on wgsim (https://github.com/lh3/wgsim), which is a widely used tool for simulating short Illumina genomic reads. miRsim includes functionalities speci c of sRNA-seq reads such as variable depth/coverage and shorter read lengths. miRNA mature se-quences were obtained from miRBase [13] for animal (High Con dence) and plant species. The animal species that we considered were: Homo sapiens, Mus musculus, Drosophila melanogaster, Danio rerio, and Caenorhabditis elegans, while the following plant species were included: A. thaliana, Oryza sativa, Physcomitrella patens, Zea mays, and Solanum lycopersicum. Supplementary Table S1 provides further details (i.e. number of reads, number of mature miRNAs etc.) for each simulated dataset. miRDeep2 was run on the animal datasets with the default parameters providing the respective reference genome. Similarly, miR-PREFeR was run with the default parameters on the plant datasets. BrumiR was run with the default parameters on both the animal and plant datasets. The list of simulated miRNAs was considered as the ground truth, and precision, recall and F-Score quality metrics were computed to assess the performance of each discovery tool. The benchmark metrics were de ned as follows: where: TP = true positive elements predicted as miRNAs present in the miRBase input list; FP = false positive elements predicted as miRNAs but not present in the miRBase input list; FN = false negative elements not predicted as miRNAs, but that were present in the miRBase input list.
We downloaded publicly available sRNA-seq data for the plant and animal species listed in the synthetic benchmark, and two datasets for each species were included (Supplementary Table S2). Additionally, we included mirnovo [14], a tool that can discover miRNAs without a reference genome. The predictions of BrumiR were benchmarked along with miRDeep2 (v2.0.1.2) [21] and mirnovo for the animal datasets. Similarly, miR-PREFeR [22] replaced miRDeep2 for the plant datasets (Supplementary Table S2).
The stand-alone packages of BrumiR, miRDeep2 and miR-PREFeR were used to discover miRNAs in all datasets. The software mirnovo was run using its web version because the stand-alone package was not available and the developer recommends the use of the web version instead. The miRNA discovery was performed for each sample independently using default parameters for miRDeep2, miR-PREFeR and mirnovo. In particular, we used the scripts provided by miRDeep2 and miR-PREFeR to map the reads to the reference genome, and the predictions for these tools were performed on the resulting alignment les. The mirnovo predictions were done using the animal and plant universal panel respectively, as recommended when the reference genome is not available. BrumiR was run using the command line and parameters provided in the Supplementary Material. Moreover, the predictions of BrumiR were re ned using the BrumiR2reference tool on the available reference genome of the selected species (Supplementary Table S2). Benchmark metrics (precision, recall, and F-Score) were computed as before but considering all the annotated mature sequences present in miRBase (v22.1) [13] as the ground-truth. 3' and 5' adaptors were sequentially ligated to 1 µg of total RNA prior to reverse transcription and library ampli cation by PCR.
Size selection of the sRNA libraries was performed on 6% Novex TBE PAGE Gels (Thermo Fisher Scienti c, cat. EC6265BOX) and puri ed by ethanol precipitation. Both the library size assessment and library quanti cation were carried out in a Fragment Analyzer TM . Finally, the libraries were pooled and sequenced on an Illumina NextSeq 500 platform.
All samples were analyzed with BrumiR separately with default parameters to identify the candidate miRNAs. We further validated the candidates having a putative precursor with a hairpin structure analysis using the BrumiR2reference tool with the reference genome for A. thaliana (GCF_000001735.4_TAIR10.1_genomic.fna). All validated candidate miRNAs were compared with known miRNAs described for A. thaliana (437) present in miRBase (v21). The putative novel miRNAs were curated manually, specifically, we checked the hairpin features, mature sequence alignment position, star sequence in the precursor sequence, mismatches in the seed region, and exact overlap in the antisense miRNA sequence [9]. Then a target analysis was performed using the Araport 11 cDNA library with the plant-speci c psRNATarget algorithm (based on a best expectation score) [42].

Results
BrumiR discovers mature miRNAs directly from the sRNA-seq reads.
The main idea behind BrumiR is that mature miRNAs can be discovered directly from the information contained in the sequenced sRNA-seq reads. To achieve this, BrumiR starts by building a de Bruijn graph (using k-mers of size 18) from the sRNA-seq reads, then compacting all the simple nodes thus leading to the unipath graph [27] ( One feature of the miRNA biogenesis is that after Dicer cleavage, the mature miRNA is the most abundant of the three byproducts and when it is sequenced, it has a uniform expression along its sequence [20]. Therefore, BrumiR expects that the neighbor elements within a particular putative miRNA will have similar expression. BrumiR checks all neighbor connections (arcs), and deletes any connection with a relative expression di erence larger than 3 fold (

BrumiR achieves the highest accuracy on simulated data.
To evaluate the performance of BrumiR, we applied it to discover mature miRNAs on simulated sRNA-seq reads from 10 animal and 10 plant species (Figure 2A). We compared BrumiR to the state-of-the-art genome-based miRNA discovery tools miRDeep2 [21] and miR-PREFeR [22], which were developed speci cally for animal (miRDeep2) and plant (miR-PREFeR) species. For each tested species, we generated two synthetic datasets with di erent error-rates (0.01 and 0.02) using the miRsim tool implemented and provided by the BrumiR toolkit (https://github.com/camoragaq/miRsim), and the high-con dence miRNAs annotated in the miRBase database (see the Methods section). A total of 20 datasets with an average of 11.5 million reads were simulated. The list of simulated miRNAs was considered as the ground truth, and benchmark metrics ( Figure 2C) were computed to assess the performance of BrumiR and of the other software (see the Methods section and Supplementary Table S1).
BrumiR recovered more mature miRNAs than the others, on average 92% (opposed to 41% and 64% for miRDeep2 and miR-PREFeR, respectively), and presented the highest average recall across all the simulated datasets ( Figure 2B). BrumiR recovered more than 90% of the simulated mature miRNAs in 17 of the 20 simulated datasets ( Figure 2B). In particular in the H. sapiens, M. musculus and D. melanogaster datasets, BrumiR recovered three times more candidates than miRDeep2 ( Figure 2B). As concerns precision, BrumiR tended to generate more putative candidates than MiR-PREFeR (median 316 vs 264) and miRDeep2 (median 445 vs 308). The slightly higher number of BrumiR candidates resulted in lower average precision than miRDeep2 (0.59 vs 0.69) and MiR-PREFeR (0.63 vs 0.87). This is due to the fact that BrumiR does not use the hairpin structure lter employed by the other software. If we consider both precision and recall (F-Score), BrumiR was the top performer in 16 of the 20 datasets evaluated ( Figure 2C). With animal species, BrumiR always reached a higher F-score than miRDeep2. With plant species, BrumiR was better or comparable to miR-PREFeR on most datasets, except for Z. mays and P. pattens where miR-PREFeR reached a higher F-Score ( Figure 2C). In terms of computational time, BrumiR was the fastest method. In particular, BrumiR core was on average 30X faster than miRDeep2 and 10X times faster than MiR-PREFeR (Supplementary Table S3). The speed of BrumiR relies on e cient alignment-free and graph-based approaches.
Overall, we demonstrated with simulated data that BrumiR discovers putative mature miRNAs without a reference genome across di erent eukaryotic species achieving the highest accuracy and computational e ciency.

The hairpin structure of mature miRNAs is found in most of the BrumiR candidates.
In order to assess the performance of BrumiR on real data, we collected public datasets for the same plant and animal species evaluated in the synthetic benchmark ( Figure 2A). On average, 15.4 and 18.2 raw million reads were used for the animal and plant datasets (Supplementary Table S2), respectively. The predictions of BrumiR were compared against those of the state-ofthe-art tools encompassing reference and de novo based methods [14,21,22]. In particular, we included mirnovo that similarly to BrumiR can discover mature miRNAs directly from the reads. Before running the tools, low-quality reads were removed using fastp [26] (∼10%, see the Methods section). All the predicted miRNAs for each tool were annotated using the miRBase database to identify known and novel predictions. On average, BrumiR predicted ∼1,000 putative mature miRNAs for the animal species, which was ∼2.7X higher than the miRDeep2 candidates and 1.7X lower than the candidates predicted by mirnovo ( Figure 3A1). For plant species, BrumiR predicted on average ∼1,900 putative mature miRNAs, which was lower than the candidates predicted by mirR-PREFeR [3], and higher than the predictions of mirnovo (301 on average) ( Figure 3A1). A comparison using the miRBase [13] annotated miRNAs revealed that BrumiR shared more candidates with miRDeep2 and miR-PREFeR than with mirnovo ( Figure 3A2).
However, an important fraction (on average more than 70%) of the miRBase-annotated candidates were exclusive to each tool ( Figure 3A2), which summarizes the complexity of miRNA discovery.
Considering miRBase-annotated candidates as the ground-truth, we computed precision, recall and F-Score for all the evaluated tools ( Figure 3B, Methods section). BrumiR achieved an accuracy (F-Score) better (animals) or comparable (plants) to the one obtained by the other software ( Figure 3B3). Moreover, BrumiR consistently reached the highest recall for most of the datasets evaluated ( Figure 3B2). The precision values of BrumiR were slightly lower for some datasets ( Figure 3B1), however, none of the methods performed well on this metric ( Figure 3B3). In particular, miRDeep2 reached the highest precision (∼0.7) on animal species and all methods performed poorly on the evaluated plant species (average precision <0.3). The low precision with plant species may be the product of a low number of entries annotated in miRBase for plants (10.414 vs 38.471 animals) as well as of a higher complexity of plant miRNAs [37]. The BrumiR toolkit also provides a tool to determine the hairpin loop of miRNA precursor sequences, which is the main structural feature of miRNAs [40]. BrumiR2reference maps the mature miRNAs predicted by BrumiR to the reference genome using an exhaustive alignment (see the Methods section), generates precursor sequences, computes their secondary structure, and checks the hairpin structure using a variety of criteria inferred from analyzing more than 30,000 miRBase precursor sequences from animal and plant species (see the Methods section). We used BrumiR2reference as a double validation for all the predicted mature miRNAs generated by BrumiR for the animal and plant datasets ( Figure 3C). On average, BrumiR2reference identi ed a valid precursor sequence having the characteristic hairpin structure for over 70% of the BrumiR candidates ( Figure 3C). In terms of speed, BrumiR core was the fastest tool. BrumiR was on average 120X and 220X times faster than miRDeep2 and miR-PREFeR, respectively (Supplementary Table S4).
Overall, we demonstrated that BrumiR is a competitive tool for discovering mature miRNAs without a reference genome. We showed that it was the most sensitive on most of the datasets tested. The performance of our method was not only faster, but also better or comparable to the state-of-the-art tools. Moreover, we also provide a new mapper approach to be used when a reference genome is available, to further verify if a precursor sequence of the predicted mature miRNA is present in the genome. BrumiR therefore represents a reliable alternative for the discovery of mature miRNAs in model and non-model species with or without a reference genome.

Discovering novel miRNAs from sRNA-seq data of A. thaliana roots using BrumiR.
A. thaliana is one of the best characterized model organisms, and the rst plant species in which miRNAs were cloned and sequenced [44]. To date, 436 mature miRNA sequences are included in the miRBase database. Most of these miRNAs have been identi ed by studies addressing the sRNAome of di erent plant organs [45], cell types [46], or responses to biotic or abiotic stress using sRNA-seq [47,48].
We sequenced sRNA-seq libraries from the roots of A. thaliana after di erent time points during vegetative development (see the Methods section) (Supplementary Figure S8) to demonstrate the potential of BrumiR to discover novel mature miRNAs in a known biological context. BrumiR was run independently for each condition and replicate. The day 5 samples were excluded because of the low number of reads when compared to the other samples (Supplementary Table S5). BrumiR predicted, on average, 1,120 mature miRNAs per sample, which were further re ned to 678 using the BrumiR2reference tool. To take advantage of our experimental design, we considered as a putative miRNA the ones present in the three replicates (core predictions) [49] ( Figure 4A).
Novel miRNAs were identi ed using the following steps: First, predictions were classi ed as known miRNAs by comparing with miRBase (160 known miRNAs out of a total of 436 miRNAs already described for A. thaliana in miRBase). These known miRNAs were put aside to explore the sensitivity of BrumiR in detecting novel putative miRNAs. We then clustered the remaining putative miRNAs into three stages: early, late, and constitutive ( Figure 4B). The days 9, 13 and 17 represent an early stage of the plant development [50]; days 17, 21 and 25 represent a late stage of the plant development [50], and the putative miRNAs expressed in all conditions represent the constitutive category. A total of 25 putative novel miRNAs were identi ed, and a manual curation was carried out using all the information provided by BrumiR. Three curated novel miRNAs ful lling all the recommended criteria to annotate miRNAs in plants [49] were discovered by BrumiR (Supplementary Figure      conditions were sequenced in three technical replicates. BrumiR was used to analyze all sRNA-seq librairies, and conserved predictions by the three replicates was considered as a core by condition. B) Di erent combinations of root growth per day were analyzed together to identify novel putative miRNAs conserved in all conditions. C) miR-8 is discovered as a novel miRNA, supported by the hairpin analysis, and is conserved in all replicates in all conditions. characteristic hairpin structure of plant miRNAs ( Figure 4C). Interestingly, this miRNA locus has not been previously discovered because its mature sequence maps to multiple chromosomes, and is therefore discarded by genome-based tools [17].
In an exploratory analysis to shed light on the potential targets of these novel miRNAs, we conducted an in silico target transcript prediction using the psRNATarget algorithm [42] (Supplementary Table S7). FSD-1 (AT4G25100) was found to be one of the top genes regulated by this novel miRNA miR-8 (Supplementary Table S6). In A. thaliana, FSD-1 encodes a Fe superoxide dismutase enzyme which regulates reactive oxygen species (ROS) levels of chloroplast and cytosol and participates in salt stress tolerance [51]. Moreover, knockout mutants of FSD-1 exibit a lower number of lateral roots, thereby suggesting an important role in root development [52]. FSD-1 is developmentally regulated, abundantly expressed from the 3rd day to the 13th day but signi cantly decreased in the following days, and its di erential accumulation between root zones is related to emerging patterns of lateral roots and hair formation from trichomes [53]. Another predicted mRNA target of miR-8 is PER24 (AT2G39040), which is a peroxidase gene involved in the detoxi cation of ROS in the extracellular and Na+ homeostasis and which plays an important role in the resistance to salinity stress as does the FSD-1 gene [51]. It is plausible to say that these novel miRNAs may be involved in the ne-tuning of lateral root growth in the early stages of development. These results highlight the value of the BrumiR toolkit for discovering novel miRNA candidates with functional impact on the organisms studied, even in the case where high quality genomes are available.

Discussion
In this paper, we introduced and benchmarked the BrumiR toolkit, which was designed for enabling the identi cation of mature miRNAs in model and non-model species with or without a reference genome, encompassing the plant and animal kingdoms. The BrumiR toolkit implements the following algorithms: 1) a new discovery miRNA tool (BrumiR-core), 2) a speci c genome mapper (BrumiR2reference), and 3) a sRNA-seq read simulator (miRsim). We demonstrated that BrumiR is capable of identifying mature miRNAs based only on the sequence information, and generates results that are better or comparable to the state-of-the-art tools on simulated and real datasets. We further tested the usefulness of the BrumiR toolkit for discovering novel miRNAs potentially involved in the regulation of the root development of the extensively annotated A. thaliana genome.
Unlike the state-of-the-art tools, BrumiR starts by encoding the sRNA-seq reads using a de Bruijn graph. This avoids the read mapping stage and the dependency on previous miRNA annotations. It also enables the identi cation of sequencing artifacts. A critical step of genome-based miRNA discovery tools is to identify the precursor sequence when a reference genome is available.
BrumiR introduces a new mapping approach, BrumiR2reference, which scans every possible hairpin precursor in the genome, when such is available, for all the BrumiR predictions. As the hairpin structure is determined using the predicted mature miRNA instead of the reads, this alignment can support mismatches and indels and handles the case of multi-mapped candidates (due to repetitive regions of the genome). Such features distinguish BrumiR from the current genome-based methods.
Discovering miRNAs in non-model species is one of the limitations of the current methods. One exception is mirnovo, which similarly to BrumiR can predict miRNAs using only the sRNA-seq data, and a speci c training set for animal and plant species.
We thus compared its performance to the one of BrumiR. Our results show that mirnovo is very conservative, generating few predictions but with a high accuracy. BrumiR generates a larger number of candidates than mirnovo, most of which are annotated in miRBase but also others that correspond to new candidates. However, the higher number of predictions of BrumiR results in a lower accuracy in some of the datasets evaluated, representing a potential weakness of our method. To address this, we developed the BrumiR2reference tool to re ne the BrumiR predictions, thus reducing the number of false-positive putative miRNAs. We plan to further reduce the false-positive rate of BrumiR by using a random-forest classi er trained on the high con dence mature sequences available in miRBase. The latter will improve the accuracy of BrumiR even in the case when a reference genome is not available. It is important to observe that the miRNA annotations remain incomplete and although miRBase is the main repository for miRNAs, it cannot be considered as the gold-standard for most species [9]. For this reason, the predictions of BrumiR are not based on miRBase in any step of the algorithm. However, miRBase can be used with all due precaution in a posterior analysis to verify the miRNAs inferred in case of not having any reference genome.
In terms of computational resources and usability, BrumiR is the fastest method and provides a stand-alone package for running locally all the analyses. It further generates an output that is compatible with the Bandage software [43], which can be employed to visualize and explore the results of BrumiR in a user-friendly way. Moreover, BrumiR reports other sequences expressed in the sRNA-seq data among which are putative isomiRs and longer non-coding RNAs, thereby providing additional biological insight.
Finally, we tested the e ectiveness of BrumiR on sequenced sRNA-seq libraries from the roots of A. thaliana, and were able to annotate 3 novel putative miRNAs based on the very conservative criteria proposed in [49], showing the potential of it being used alone or in combination with other methods.
In summary, we present a new and versatile method that implements novel algorithmic ideas for the study of miRNAs that complements and extends the currently existing approaches.

Code Availability
The BrumiR code (v1.0) used in this manuscript is freely available at https://github.com/camoragaq/BrumiR, and is open software under the MIT license. BrumiR: A toolkit for de novo discovery of 1 microRNAs from sRNA-seq data. have been developed to identify miRNAs, most of which rely on pre-existing 24 reference genomes. However, when a reference genome is absent or is not of 25 high quality, such identification becomes more difficult. In this context, we developed BrumiR, an algorithm that is able to discover miRNAs directly and 1 exclusively from sRNA-seq data. We benchmarked BrumiR with datasets alternative to discover miRNAs in species without a reference genome. In the 17 case of mirnovo, the initial step involves the clustering of the sRNA-seq reads 18

Funding
performing an all-vs-all read comparison that is followed by a subsequent 19 classification of the clusters into putative miRNAs using pre-trained models. There remains however a need to go further in the development of algorithms 5 for finding novel miRNAs in non-model species using only the sequence 6 information. With this purpose in mind, the adoption of a special type of graphs 7 called de Bruijn graphs may be considered. This is a widely used approach for 8 the de novo reconstruction of genome or transcriptome sequences (Compeau 9 et al., 2011). It therefore appears to be a plausible option for organizing, harder to detect as compared to genomic data due to the variable expression 22 and the shorter lengths of the miRNAs. Overall, using a de Bruijn graph to 23 analyze sRNA-seq data and extract information from such data seems thus counterintuitive as mature miRNAs are captured full-length by the current NGS 1 technologies. However, a de Bruijn graph has several interesting properties for 2 the discovery of miRNAs, mainly due to the fact that it encodes all the sRNA-3 seq sequence information at once in a compact and connected representation 4 (graph), without the need to perform an all-vs-all read comparison or mapping 5 to a reference. 6 In this paper, we present BrumiR, a de novo algorithm based on a de Bruijn 7 graph approach that is able to identify miRNAs directly and exclusively from 8 sRNA-seq data. Unlike other state-of-the-art algorithms, BrumiR does not rely 9 on a reference genome, on the availability of close phylogenetic species, or on 10 conserved sequence information. Instead, BrumiR starts from a de Bruijn graph 11 encoding all the reads and is able to directly identify putative miRNAs on the 12 generated graph. BrumiR also removes sequencing errors and navigates inside  We extensively benchmarked BrumiR on animal and plant species using 21 simulated and real datasets. The benchmark results demonstrate that BrumiR 22 is very sensitive, besides being the fastest tool, and its predictions were 23 supported by the characteristic hairpin structure of miRNAs. Finally, we also applied BrumiR to the discovery of miRNAs of Arabidopsis thaliana and 1 identified three novel high-confidence miRNAs involved in root development. 2 These putative miRNAs were not discovered before by any other software, 3 thereby showing the potential of using different approaches even in the case 4 where high quality genomes are available. The code of BrumiR is freely 5 available at https://github.com/camoragaq/BrumiR.

8
BrumiR discovers mature miRNAs directly from the 9 sRNA-seq reads. 10 The main idea behind BrumiR is that mature miRNAs can be discovered directly 11 from the information contained in the sequenced sRNA-seq reads. To achieve 12  graph construction, BrumiR cleans the graph by removing tips (dead-end 20 nodes) with low expression/abundance (KM < 5), which are usually generated 21 from sequencing errors (Figure 1.2). One feature of the miRNA biogenesis is that after Dicer cleavage, the mature miRNA is the most abundant of the three 1 by-products and when it is sequenced, it has a uniform expression along its 2 sequence (Friedländer et al., 2008). Therefore, BrumiR expects that the 3 neighbor elements within a particular putative miRNA will have similar 4 expression. BrumiR checks all neighbor connections (arcs), and deletes any 5 connection with a relative expression difference larger than 3 fold (Figure 1 Table S2). 20 BrumiR recovered more mature miRNAs than the others, on average 97% 21 (opposed to 58% and 66% for miRDeep2 and miR-PREFeR, respectively), and 22 presented the highest average recall across all the simulated datasets ( Figure 2B). BrumiR recovered more than 90% of the simulated mature miRNAs in 19 1 of the 20 simulated datasets ( Figure 2B). In particular in the H. sapiens and D. 2 melanogaster datasets, BrumiR recovered 1,5X and 2,5X more candidates than 3 MiRDeep2 ( Figure 2B). As concerns precision, BrumiR tended to generate more 4 putative candidates than MiRDeep2 (median 659 vs 332) and less than MiR- datasets, BrumiR reached a higher F-Score in 9 of the 10 datasets ( Figure 2C). 15 In terms of computational time, BrumiR was the fastest method. In particular, 16 BrumiR core was on average 21X faster than miRDeep2 and 6X times faster 17 than MiR-PREFeR (see Table S3). The speed of BrumiR relies on efficient 18

alignment-free and graph-based approaches. 19
Overall, we demonstrated with simulated data that BrumiR discovers putative 20 mature miRNAs without a reference genome across different eukaryotic 21 species achieving the highest accuracy and computational efficiency. Benchmarking metrics for all datasets tested, the error bar indicates the 5 distance between the 2 replicates. 6 7 8 The hairpin structure of mature miRNAs is found in 9 most of the BrumiR candidates. 10 In order to assess the performance of BrumiR on real data, we collected public 11 datasets for the same plant and animal species evaluated in the synthetic 12 benchmark (Figure 2A miRNA discovery tools, we selected the best performer (Supplementary Table  1 S5, Supplementary Figure S9). In particular, we included mirnovo that similarly 2 to BrumiR can discover mature miRNAs directly from the reads. Before running 3 the tools, low-quality reads were removed using fastp (S. Chen et al., 2018) 4 (~10%, see Methods section). All the predicted miRNAs for each tool were 5 annotated using the miRBase database to identify known and novel 6 predictions. On average, BrumiR predicted ~1,000 putative mature miRNAs for 7 the animal species, which was ~2.8X higher than the miRDeep2 candidates and 8 1.7X lower than the candidates predicted by mirnovo ( Figure 3A1). For plant 9 species, BrumiR predicted on average ~1,800 putative mature miRNAs, which 10 was lower than the candidates predicted by mirR-PREFeR (3,248 on average), 11 and higher than the predictions of mirnovo (301 on average) ( Figure 3A1). A 12 comparison using the miRBase (Kozomara & Griffiths-Jones, 2014) annotated 13 miRNAs revealed that BrumiR shared more candidates with miRDeep2 and 14 miR-PREFeR than with mirnovo ( Figure 3A2). However, an important fraction 15 (on average more than 70%) of the miRBase-annotated candidates were 16 exclusive to each tool ( Figure 3A2), which summarizes the complexity of miRNA 17

discovery. 18
Considering miRBase-annotated candidates as the ground-truth, we 19 computed precision, recall and F-Score for all the evaluated tools ( Figure 3B, 20

Method section). BrumiR achieved an accuracy (F-Score) better for animals and 21
plants to the one obtained by the other software ( Figure 3B3). Moreover, 22 some datasets ( Figure 3B1). However, methods based on the reference 1 genome as miRDeep2 has better precision in some datasets because the 2 predictions are more conservative than the de novo methods ( Figure 3B3). In approach for miRNA discovery. We can observe that the de novo transcriptome 12 assemblers generated on average 40X and 4X more candidates than BrumiR, 13 for Trinity and Velvet respectively (Supplementary Table S7). In general, the 14 huge number of contigs generated by the transcriptome assemblers, even after 15 filtering them by length, were poorly matched to the miRBase entries (1,2% and 16 36%,indeed). On the other hand, BrumiR matched the miRBase entries at a 17 rate of 1 every 2 candidates (52% precision average). As expected, we can 18 conclude that most of the contigs generated by a pure de Bruijn graph 19 transcriptome assembler are poorly related to miRNA sequences. This was 20 expected because they are developed for mRNAseq analysis and do not 21 consider the complexities of the sRNA seq data like BrumiR.
In summary, this experiment showed that BrumiR and all the downstream steps 1 it performs after the de Bruijn graph construction are essential for miRNA 2 discovery. 3 The BrumiR toolkit also provides a tool to determine the hairpin loop of miRNA 4 precursor sequences, which is the main structural feature of miRNAs (Roden et 5 al., 2017). BrumiR2reference maps the BrumiR predicted mature miRNA to the 6 reference genome using an exhaustive alignment (See Methods section), 7 generates precursor sequences, computes its secondary structure, and checks 8 the hairpin structure using a variety of criteria inferred from analyzing more than 9 30,000 miRBase precursor sequences from animal and plant species (see 10 Methods section). We used BrumiR2reference as a double validation for all the 11 predicted mature miRNAs generated by BrumiR for the animal and plant 12 datasets ( Figure 3C). On average, BrumiR2reference identified a valid precursor 13 sequence having the characteristic hairpin structure for over 60% of the BrumiR 14 candidates ( Figure 3C). 15 In terms of speed, BrumiR core was the fastest tool. BrumiR was on average 16 19X and 38X times faster than miRDeep2 and miR-PREFeR, respectively (See 17   Table S6). 18 Overall, we demonstrated that BrumiR is a competitive tool for discovering 19 mature miRNAs without a reference genome. We showed that it was the most 20 sensitive on most of the datasets tested. The performance of our method was 21 not only faster, but also better or comparable to the state-of-the-art tools. 22 reference genome is available, to further verify if a precursor sequence of the 1 predicted mature miRNA is present in the genome. BrumiR therefore represents 2 a reliable alternative for the discovery of mature miRNAs in model and non-3 model species with or without a reference genome.  Discovering novel miRNAs from sRNA-seq data of A. 1 thaliana roots using BrumiR.  Table S8). A total of 21 putative novel miRNAs were identified, and a manual 6 curation was carried out revising all the criteria to validate and annotate 7 miRNAs in plants . We discovered two novel miRNAs 8 candidates that fulfill all the recommended criteria to annotate miRNAs in plants 9 ( Figure S10, Table S8). According to the revised criteria, confirmation by blot 10 of the expression of the miRNA or miRNA* is disallowed, and it is suggested 11 that validation of miRNA expression should be based on sRNA-seq reads only. 12 In this way, these two curated novel miRNA candidates are supported directly 13 from the sRNA-seq libraries and are expressed in all replicates in all conditions 14 . 15 One of the curated novel miRNAs candidates (miR-8) is located in Chromosome 16 5 ( Figure 4C), this miRNA locus has not been previously discovered because 17 its mature sequence maps to multiple chromosomes, and is therefore 18 discarded by genome-based tools . 19 In an exploratory analysis to shed light on the potential targets of these novel 20 miRNAs, we conducted an in silico target transcript prediction using the 1 21 algorithm   (Supplementary Table S10). EXO84b (AT5G49830) 22 was found to be one of the top genes regulated by this novel miRNA miR-8 (Supplementary Table S9). In A. thaliana, it has been demonstrated the 1 importance of EXO84b in the development of trackeary elements or vassel 2 xylem system which is essencial for water and nutrient transport of vascular 3 plants (Vukašinović et al., 2017). EXO84b is expressed over all days but 4 significantly abundantly expressed in the last days, and its differential 5 accumulation between root zones is related to emerging patterns of lateral 6 roots and hair formation from trichomes (Dvořák et al., 2020). 7 We have also explored the known miRNAs identified by BrumiR in where we 8 have found in almost all the samples, with a higly expression, the plant miRNAs 9 that would be playing a key role in root specification and development 10

(Couzigou & Combier, 2016). 11
It is plausible to say that these novel and known miRNAs may be involved in 12 the fine-tuning of lateral root growth in the early stages of development. sRNA-seq librairies, and conserved predictions by the three replicates was considered 5 as a core by condition. B) Different combinations of root growth per day were analyzed 6 together to identify novel putative miRNAs conserved in all conditions. C) We 7 discovered 2 candidates as novel miRNAs that fulfill the current criteria to annotate 8 miRNAs in plants. Moreover, they were supported directly from the sRNA-seq libraries 9 and are conserved in all replicates in all conditions. 10

12
In this paper, we introduced and benchmarked the BrumiR toolkit, which was 1) a new discovery miRNA tool (BrumiR-core), 2) a specific genome mapper 1 (BrumiR2ref), and 3) an sRNA-seq read simulator (miRsim). We demonstrated 2 that BrumiR is capable of identifying mature miRNAs based only on the 3 sequence information and generates results that are better or comparable to 4 the state-of-the-art tools on simulated and real datasets. We further tested the 5 usefulness of the BrumiR toolkit for discovering novel miRNAs potentially 6 involved in the regulation of the root development of the extensively annotated 7 A. thaliana genome. 8 Unlike the state-of-the-art tools, BrumiR starts by encoding the sRNA-seq 9 reads using a de Bruijn graph. This avoids the read mapping stage and the 10 dependency on previous miRNA annotations. It also enables the identification 11 of sequencing artifacts. A critical step of genome-based miRNA discovery tools 12 is to identify the precursor sequence when a reference genome is available. 13 BrumiR introduces a new mapping approach, BrumiR2reference, which scans 14 every possible hairpin precursor in the genome, when such is available, for all 15 the BrumiR predictions. As the hairpin structure is determined using the 16 predicted mature miRNA instead of the reads, this alignment can support 17 mismatches and indels and handles the case of multi-mapped candidates (due 18 to repetitive regions of the genome). Such features distinguish BrumiR from the 19 current genome-based methods. 20 Discovering miRNAs in non-model species is one of the limitations of the 21 current methods. One exception is mirnovo, which similarly to BrumiR can 22 predict miRNAs using only the sRNA-seq data, and a specific training set for 23 animal and plant species. We thus compared its performance to the one of BrumiR. Our results show that mirnovo is very conservative, generating few 1 predictions in comparison to BrumiR, this could be due to the low number of 2 entries of plant miRNAs in miRBase because mirnovo approach is based on 3 miRNAs families present in this database. However, miR-PREFeR generates a 4 larger number of candidates in plant species. The higher number of predictions 5 of miR-PREFeR results in lower precision in most of the evaluated datasets, in 6 which BrumiR obtained the highest performance in plant species. In animal 7 species, BrumiR has a lower precision compared to miRDeep2, and in some 8 cases, it generates a large number of candidates, which represents a potential 9 weakness of our method. We examined possible piRNA sequences present in 10 some of the samples (Mus musculus SRR1734817) to see if this high number 11 of candidates was due to wrong predictions, but no relationship was found 12 (Supplementary Table S11). To address the number of candidates generated, 13 we developed the BrumiR2reference tool to refine the BrumiR predictions, thus 14 reducing the number of false-positive putative miRNAs. We plan to further 15 reduce the false-positive rate of BrumiR by using a random-forest classifier 16 trained on the high confidence mature sequences available in miRBase. The 17 latter will improve the accuracy of BrumiR even in the case when a reference 18 genome is not available. It is important to observe that the miRNA annotations 19 remain incomplete and although miRBase is the main repository for miRNAs, it 20 cannot be considered as the gold-standard for most species (many of the  Figure S1). To determine 18 the optimal k-mer size for BrumiR, we compared the performance of BrumiR 19 using different k-mer sizes (14-16-18-20-22). The benchmark shows that the 20 optimal k-mer size for BrumiR is 14 (Supplementary Table S1 Figure S4). 13 14 Removing sequencing errors from the unipath sRNA-15 seq graph. 16 BrumiR deletes from the unipath graph all the nodes that have only one 17 connection (degree equal to 1), known as dead-end paths or tips (Chikhi & Rizk, 18 2013). Usually, these nodes have a low abundance value associated to them 19 (KM less than or equal to 5, the default parameter). Moreover, BrumiR deletes 20 isolated nodes (degree equal to 0) having a low abundance; isolated nodes 21 highly expressed are however conserved for further analysis. All these nodes are likely artifacts generated from sequencing errors because they are not 1 deeply expressed in the sRNAs-seq reads (Deorowicz et al., 2013). BrumiR 2 iterates this step 3 times in order to prune and clean the unipath graph 3 (polishing). This operation, called 'tip removal', edits the original unipath graph, 4 and therefore a new unipath graph with a new structure is generated ( Figure  5 1B). 6 7 An expressed mature miRNA has uniform coverage. 8 The unipath graph of a set of miRNAs from an sRNA-seq experiment has non-9 uniform coverage as different miRNAs and other elements may be connected 10 in a single big component (Figure 1.1). BrumiR evaluates each connection of 11 the unipath graph to identify those that link two nodes with a large expression 12 difference. According to the miRNA biogenesis, after a stable miRNA precursor 13 is cleaved by Dicer, among its three products, the miRNA mature sequence is 14 the most abundant and when it is sequenced, it has a uniform expression along 15 its sequence (Friedländer et al., 2008). Thus due to miRNA biogenesis, it is connections and if the difference between their KMs is larger than three-fold, 2 the connection is deleted (Uikm/Ujkm > 3). In this way, BrumiR defines a relative 3 threshold that will depend on each unipath neighborhood in the graph. Finally, 4 BrumiR repeats the tips removal step to eliminate new low frequency isolated 5 nodes ( Figure 1C). is a maximal strongly connected subgraph (Lewis & Papadimitriou, 1982). 13 BrumiR computes the CCs of the unipath graph, and then each CC is 14 processed independently to identify miRNA candidates as well as to discard 15 other sequences present in the unipath graph. 16 17 BrumiR classifies low abundance non-linear topologies 18 as sequencing artefacts. 19 BrumiR detects topologies that are potentially related to sequencing errors and 20 thus unlikely to be miRNA candidates. The shapes of these topologies were identified by visual inspection of several unipath graphs and are described in 1 detail in Figure S5. Usually they have low KM and are composed of lowly 2 expressed branching nodes with 3, 4 or 5 connections to the principal 3 structures in the graph ( Figure S5). Moreover, we observed that the sequences 4 contained in these topologies were usually redundant and contained in other 5 linear and more expressed CCs. In this way, we are not discarding relevant 6 sequence information. BrumiR removes about 10% of the CCs in this step.     Unlike current state-of-the-art tools that perform miRNA discovery by mapping 22 the sRNA-seq reads to a reference genome, BrumiR generates candidates by operating directly on the sRNA-seq reads. The reduced list of potential BrumiR 1 miRNA candidates permits the computation of a more exhaustive alignment 2 than when mapping directly the sRNA-seq reads to the reference genome. 3 BrumiR aligns each candidate to the reference genome using an exact 4 alignment method that computes the edit distance (Meyers et al., 2008) 5 between two strings and thus support mismatches, insertions and deletions. 6 The BrumiR2reference tool divides the reference genome in non-overlapping 7 windows of 200bp (adjustable parameter), then the window is indexed using 8 12-mers and each miRNA candidate is matched in both strands (split at 12-9 mers). When a 12-mer match is found, an exhaustive alignment is computed 10 between the window and the matching miRNA candidate. The alignment is 11 performed using a fast implementation of Myers' bit-vector algorithm (Šošić & 12 Šikić, 2017). 13 A miRNA candidate is stored as a hit if the alignment in the current genomic 14 window has an edit distance less than or equal to 2. After scanning all the 15 genomic windows, the vector of hits is sorted by miRNA-candidate; edit 16 distance (0-2), and alignment sequence coverage. For a single miRNA-17 candidate, a maximum of 100 genomic locations (best hits) are selected. 18 BrumiR2reference then builds a potential precursor sequence for each selected 19 hit using a strategy similar to the ones employed by miRDeep2 (Friedländer et 20 al., 2012) and Mirinho (Higashi et al., 2015). BrumiR excises the potential for all tools. Then, the contigs longer than 24 nt were filtered out for Trinity and Velvet. For BrumiR, we eliminated the last step using the RFAM database 1  information to filter out other kinds of sRNA 2 sequences, and we used all the predictions to compare to the transcriptome 3 de novo assemblers. Finally, the BrumiR candidates and contigs generated by 4 Trinity and Velvet were mapped against the miRBase database (Blast search). Benchmarking BrumiR using simulated sRNA-seq 7 reads. 8 We simulated synthetic reads from animal and plant species, and compared 9 the results of BrumiR to those obtained with the miRDeep2 (Friedländer et al., Benchmarking BrumiR using real sRNA-seq reads. 8 We downloaded publicly available sRNA-seq data for the plant and animal 9 species listed in the synthetic benchmark, and two datasets for each species 10 were included (Supplementary Table S4). We wanted to benchmark BrumiR in 11 a simple but exhaustive way by selecting the top performer for genome-based 12 and genome-free methods. We benchmarked some of the most used 13 prediction tools in a reduced version of the real dataset, and the results were 14 conclusive to select the best methods (Supplementary Table S5 replaced MiRDeep2 for the plant datasets (Supplementary Table S4). 20 The stand-alone packages of BrumiR, miRDeep2 and miR-PREFeR were used 21 to discover miRNAs in all datasets. The software mirnovo was run using its web 22 version because the stand-alone package was not available and the developer recommends the use of the web version instead. The miRNA discovery was 1 performed for each sample independently using default parameters for 2 MiRDeep2, miR-PREFeR and mirnovo. In particular, we used the scripts 3 provided by miRDeep2 and miR-PREFeR to map the reads to the reference 4 genome, and the predictions for these tools were performed on the resulting 5 alignment files. The mirnovo predictions were done using the animal and plant 6 universal panel respectively, as recommended when the reference genome is 7 not available. BrumiR was run using the command line and parameters 8 provided in the Supplementary Section 1. Moreover, the predictions of BrumiR 9 were refined using the BrumiR2reference tool on the available reference 10 genome of the selected species (Supplementary Table S4). Benchmark metrics 11 (precision, recall, and F-Score) were computed as before but considering all 12 the annotated mature sequences present in miRBase (v22.1) as the ground-13 truth. 14 15 miRNA discovery from Arabidopsis root samples.  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 Dear Dr Nicole Nogoy, Thank you very much for considering our manuscript for publication in GigaScience. We would like to thank the referees and you for the careful assessment of our manuscript. We have attempted to address all points raised by the referees and hope that the responses are satisfactory. With these revisions, we believe that our manuscript has been substantially improved and hope that it is now suitable for publication in GigaScience.
Please find below our point-by-point replies to the reviewers' comments. All changes in the main manuscript and the supplement have been marked in blue font, including all the Supplementary material. We refer to the changes we have made in the manuscript using consecutive line numbering.

Carol Moraga
Reviewer #1: Summary: The authors developed a de novo assembly method, BrumiR, for small RNA sequencing data based on de Bruijin graph algorithm. This tool displayed a relatively high sensitivity in finding miRNAs and helped the authors discover a novel miRNA in A. thaliana roots. Major comments: 1. Have the authors compared the performance with different seed lengths? Even if the minimal miR length is 18nt in MiRBase 21, seed=18 might not necessarily lead to the best AUC or F score (This might also be related to Comment 4).
We followed the reviewer recommendations and we now compare the performance of BrumiR using different seed lengths of 14, 16, 18, 20, and 22; in both animal and plant datasets. The benchmark results are reported in Table S1. The benchmark shows that the optimal seed size for BrumiR is indeed 14 and not 18 as was previously set by default. The main reason for this improved performance is that shorter seed lengths permit a better handling of sequencing errors (shorter kmers are less likely to be sequencing errors), and enable a more sensitive clustering of identical miRNAs. Overall, we observe that the benchmark metrics (Supplementary Table S1, Figure S2) improves in comparison to the observed for the previous seed length (k=18) by 1.67X , 0.97X, 1.46X for precision, recall, and F-score in animal species, and 4.09X, 1.13X, 3X for precision, recall, and F-score in plant species, respectively. In light of the results, we have now set the default seed length to k=14, and we thank the reviewer again for this key suggestion that allows us to further improve the performance of BrumiR. Finally, the code has been updated accordingly and a new release has been published in the Github repository.
2. The authors need to benchmark BrumiR with more existing tools (e.g. those ML-based methods), and to include more genome-free methods (e.g. MiRNAgFree).
In our current benchmark, we included miRDeep2, miR-PREFeR and mirnovo, and a total of 10 animal and plant datasets. miRDeep2 is the most used tool in the field with more than 1530 citations and is considered the reference miRNA prediction tool for animal species. In addition, we included miR-PREFeR that, like miRDeep2, is one of the few tools designed specifically for plant species when a reference genome is available. Complementing these popular reference-based methods, revised cover letter Click here to access/download;Personal Cover;BrumiR-revletter.pdf we included mirnovo, which is a genome-free discovery tool based on machine learning. The mirnovo manuscript presents a benchmark comparison with miReader and miRplex which are de novo tools and mirnovo showed better performance than the aforementioned tools. This is one of our main reasons for considering mirnovo as the best tool for performing miRNA discovery without a reference genome. Therefore, we selected the tools that perform best for miRNA discovery, including reference and de novo based.
Nevertheless, we tried to follow the reviewer's suggestion and attempted to add more tools to our benchmark. However, we had difficulties in running the existing tools because most of them are not under active development, do not provide update binaries (standalone versions), or their webserver does not work. For instance, miRplex does not provide recent binaries (the current one, v.01, is from 2012), leading to problems with some old libraries and we could not run it. In another example, the miReader software mirPlex is not in active development (updated for the last time in 2014). We were able to run the standalone version, but after letting it run for more than one month on a single sample, we decided to stop it.
Besides these issues, we succeeded in running the miRanalyzer (genome-reference based), and miRNAgFree (de novo based) tools. The benchmark results are provided in Supplementary Table  S5 using a reduced version of the real dataset. The benchmark results show that BrumiR obtained the highest F-score rates, as do miRDeep2 and mirnovo (Supplementary FigureS13). We also observed the poor performance on plant species of most of the tools, which shows the potential of BrumiR, as well as being a tool that can be used for both animal and plant species.
Currently, there is no standard for benchmarking miRNA prediction methods, which means that each new method tries to make the comparison in its own way. For this reason, we tried to make an effort to create a benchmark standard and we developed miRsim to generate benchmarks based on a ground truth known a priori and using the same conditions for all the tools.
Overall, we wanted to benchmark BrumiR in a simple but exhaustive way selecting the top performers for genome-based and genome-free methods. Finally, several miRNA tools do not provide stand-alone versions and therefore prevent extensive benchmarking when only the server is available. Being able to generate a good benchmark remains a challenge in the field of miRNAs where there are no standards.
3. It is also interesting to know whether de novo method for mRNA assembly would be useful on the miRNA side. It would be great if the authors were able to compare the performance of BrumiR2reference (without filtering for RFAM) with Trinity in genome-guided mode, by tweaking its seed length to be the same as BrumiR.
We followed the reviewer's recommendation and included two de Bruijn graph de novo transcriptome assemblers, namely Trinity (inchworm), and Velvet (Supplementary Table S7). The genome guide mode of Trinity first aligns the reads to the genome and then a de novo assembly is performed using the inchworm tool for each partition. The genome guide pipeline does not provide access to the seed length of inchworm (default k=25) whose default value exceeds the length of the miRNA sequences. Therefore, we ran it independently and without mapping the reads to the genome, because we thought that in this way the comparison between BrumiR and de novo transcriptome assemblers is more appropriate and allows us to focus on how the candidates are extracted from the de Bruijn graph. To perform the comparison, we used 4 real datasets; including human and Arabidopsis. The seed length and minimum contig length for the transcriptome assemblers were set to k=15 (edge-centric de Bruijn graph, overlap equals 14 nt), and l=18, for both inchworm and Velvet. Then, the contigs longer than 24 nt were filtered out for BrumiR (without filtering for RFAM) and the transcriptome de novo assemblers. Finally, the candidates were matched against miRBase (blast search) and the results of the comparison are presented in Supplementary Table S7. We can observe that the de novo transcriptome assemblers generated on average 40X and 4X more candidates than BrumiR, for Trinity and Velvet, respectively.
In general, the huge number of contigs generated by the transcriptome assemblers, even after filtering them by length, were poorly matched to miRBase entries (1,2% and 36%, indeed). On the other hand, BrumiR matched miRBase entries at a rate of 1 every 2 candidates (52% precision average). As expected, we can conclude that most of the contigs generated by a pure de Bruijn graph transcriptome assembler are poorly related to miRNA sequences. This was expected because they are developed for mRNAseq analysis and do not consider the complexities of the sRNA seq data like BrumiR. In summary, this experiment showed that BrumiR and all the downstream steps it performs after the De Bruijn graph construction are essential for miRNA discovery.
4. The tool's sensitivity is promising across animal and plant datasets. However, the average precision is quite low, an average precision of 0.3 means a false discovery rate of 0.7. This is not an accepted value for a tool designed to discover novel miRNA. Is there any parameter the author could tweak towards a better performance? For example, is the seed length of 18nt too short to start with? Are there any other sequence features the authors should take into account to boost the performance? Or maybe some post-assembly filtering approaches might be sufficient and helpful.
We followed the reviewer's recommendations and we were able to improve the precision of BrumiR while maintaining a similar sensitivity. The major gain in the precision of BrumiR comes from a reduced seed length, which allows it to better deal with sequencing errors, and from an improved clustering of the reads coming from the same miRNA candidates, plus some code improvements. Overall, our precision for animals and plants increased by 1.67X and 4.09X on average respectively, which led to a combined F-score increased by 1.5X and 3X for animals and plants respectively, significantly reducing the rate of false-positives of our previous version. This can be further reduced when a reference genome is available and BrumiR2Reference is used. Additionally, we also provided a new script to annotate the candidates using the miRBase entries, providing a subset of high confidence BrumiR predictions. The novel candidates can be further refined using additional criteria (such as coverage, presence on several miRNA-seq experiments, etc.) that we describe in Figure 4 and Supplementary FigureS10, in the discovery of novel miRNAs in the Arabidopsis genome using BrumiR. Moreover, we intend to improve our method by implementing a random forest classifier that will enable a better characterization of the novel candidates not present in miRbase. The random forest model learns the sequence features of the miRNA annotated in miRbase and can be used to assign a probability of being a miRNA or not for novel BrumiR candidates for species with or without a reference genome. This feature should be included in a future version of our method. However, at this point we believe that in the absence of a reference genome, BrumiR is able to provide a list of potential miRNA candidates which is not usually possible with the other genomebased prediction tools that have a similar prediction power. We compared BrumiR against the best genome based miRNA discovery tool and we had comparable or better results, even without using a reference genome. 5. Wet-lab validation (e.g. Luciferase assay) for the identified novel miRs will leverage the real-life usefulness of BrumiR. This is extremely important, as the tool showed a high false discovery rate.
We agree that functional validation of miRNAs predicted by BrumiR is an important point in order to highlight the usefulness of our tool; and we intend to continue our studies with Arabidopsis' novel miRNA candidates in the near future to provide additional evidence towards the in vivo functions of such candidates in a more specialized manuscript. Unfortunately, this is not within the scope of our study and more specifically the current global pandemic has not helped us finish such functional studies, because this part of the work has been done in Chile where our partner wet-lab was closed most of the part of the pandemic and currently is resuming is normal operation.
Even so, we would like to point out that current criteria to validate and annotate miRNAs in plants are based on experimental evidence coming directly from the sequencing libraries and that confirmation by blot of the expression of the miRNA or miRNA* is disallowed, as proposed by Axtell 2018*. What we think could be done in order to improve the sRNA-seq dataset instead would be to enrich the sequencing libraries with functional sRNAs coupled to proteins (as proposed in Grentzinger et al., 2014**). At this point, we believe that we currently fulfill several state-of-the-art criteria for miRNA annotation in plants. We have revised all the criteria exposed in Axtell, 2018* for novel miRNA candidates predicted by BrumiR on the Arabidopsis miRNA-seq data. We concentrated our effort on 22 novel miRNA candidates that were expressed in at least 3 replicates, and we found that 4 of them fulfilled at least 7 of 8 criteria, with 2 of them fulfilling all the criteria exposed in Axtell 2018. We have updated this part of the manuscript (Figure 4, Figure S10, and Table S8, S9, S10) accordingly to reflect these changes and we thank the reviewer for pointing out the need for more comprehensive criteria to reflect the accuracy of BrumiR when predicting new candidates.  We use the edit distance library to allow a maximum of 2 nt differences in the cluster step of BrumiR. Then, for each cluster, we select the most expressed candidates but a reference is kept for all the cluster members. It is possible to use these clusters to identify miRNA candidates that may be RNA edited but the final output of BrumiR contains only one representative member of each cluster. We have included a flag in the output of BrumiR that reports the size of the cluster for each miRNA candidate. This allows us to explore if a candidate might be RNA edited or not. This is provided in a list as one of the BrumiR output files (*.edit_clusters.ED2.txt). Overall, BrumiR permits a mismatch maximum of two nucleotides when clustering candidates and therefore allows for the identification of candidates with potential RNA edition.
The authors here present BrumiR, a de Bruijn-based method to discover miRNAs independently of a reference genome. Today most miRNA discovery and annotation is done by mapping sequenced RNAs to readily available reference genomes and analyzing the mapping profiles. However, there are some uses cases where the genome-free approach is needed (particularly for species that have no reference genome or where the genomes have missing parts); therefore BrumiR could potentially be useful for the community. However, the comparison to existing tools needs to be done in a more careful way.

Major comments:
1-RFAM filtering is not really part of the prediction step, this is rather a filtering step. Therefore, to make a fair comparison with mirnovo (the other genome-free tool), BrumiR should additionally be run without RFAM filtering, and mirnovo should additionally be run using the exact same RFAM filtering.
We used the RFAM database to filter out other kinds of RNAs present in the miRNA-seq experiment. Mirnovo calculates a consensus sequence for each cluster (candidate) and then an alignment is performed against the RFAM database to identify candidates matching rRNAs and tRNAs, and these matching candidates are filtered out by mirnovo. The latter step is performed by mirnovo by default and cannot be turned off (there is not a parameter allowing this). In BrumiR, we used the RFAM database to filter RNAs not associated with miRNAs. We built a k-mer database excluding all the RFAM families annotated as miRNAs. All the BrumiR candidates matching this kmer database are excluded. This step is similar for both tools with the same goal (excluding other RNAs) but using different algorithms (alignment vs k-mer matches). We therefore considered that the comparison that we did is appropriate and fair for both tools. On average, 1% of the BrumiR candidates are filtered-out on this step.
The updated text is in page 9, lines 19-21.
2-it appears that 16-mers from miRBase miRNAs were specifically excluded from the RFAM catalog used for the filtering, which is reasonable. However, the miRNAs from the exact benchmarked species should not be included in the used miRBase 16-mer catalog, to avoid circular reasoning.
Yes, we agree with the reviewer that including miRNAs in this step may generate circular reasoning. This represents a final step of BrumiR where the aim is to remove sequences related to rRNAs and tRNAs, but we do not annotate or guide BrumiR towards previously annotated miRNA sequences present in the RFAM database. To achieve this, the 16-mer database was built from the RFAM database excluding all the families in all the taxa related to miRNA genes (529 families excluded), conserving the most abundant k-mers (more abundant than 5). Moreover, all 16-mers present in the miRBase database were also excluded. In summary, we excluded any reference to known miRNA sequences present in RFAM from the 16-mer database.
The updated text is in page 9, lines 19-21.
3-miRDeep2 software should ideally not be run with default options -this is particularly important since the miRDeep2 performance in this manuscript appears lower than what is reported in other studies (e.g. Friedlander et al. 2012). First, reference mature miRNAs from a related and wellannotated species should be included to support the prediction. Second, a score cut-off should be used that gives a decent signal-to-noise ratio according to the miRDeep2 output overview table (for instance 5:1). Third, all read pre-processing and genome mapping should be performed with the mapper.pl script which is part of the miRDeep2 package.
We agree with the reviewer that the ideal case is to use miRDeep2 according to the recommendations suggested by the developers (Friedländer et al, 2012). For this reason, we first used the script proposed by the developers to map the reads to the reference genome (mapper.pl). Then, we used the miRDeep2 pipeline in a "de novo" mode because we wanted to make a prediction without using previous knowledge, as we did with mirnovo and BrumiR. Finally, we selected all the predictions with a minimum miRDeep2 score of 5.1 (signal-to-noise ratio). Also, the benchmark presented in the miRDeep2 manuscript was performed in a different way, because they used a pool of several samples, and the metrics were computed using all predictions, and for this reason, they are not comparable to our benchmark. As we mentioned above, there is no standard for benchmarking the miRNA prediction methods, which means that each new method tries to make the comparison in its own way and the metrics used could be quite different depending on the design of the benchmark. miRDeep2 presented a good performance in most of the datasets, having the highest precision rate. However, we observed that it is a very conservative method because it does not predict many new candidates, even using the "de novo" mode.
4-it appears that only miRNA-derived sequences were included in the simulated data. In fact, real small RNA-seq data typically contains fragments from other known types of RNA and also sequences from unannotated parts from the genome. Therefore, the authors should use simulated data that also includes samples from RFAM and randomly sampled sequences from the reference genome (for instance 10% of each). Overall, the use of simulated sequence data could be put a bit in the background in this study, since real small RNA-seq data is in fact readily available these days and typically has a structure that is not easy to simulate. Further, there is little reason not to use real data, since the miRNAs in miRBase tend to be reasonably well curated for most species and therefore can function well as a gold standard for benchmarking.
We followed the reviewer's suggestion and we added to our simulated data sequences from the RFAM and the genomes for each of the species included in the benchmark. Now the simulated data include 10% of sequences from RFAM and 10% of random genomic sequences. We have repeated the synthetic benchmark and the new results are reported in the corresponding section. We agree that it is hard to mimic the structure of real miRNA-seq experiments, but the simulated data provide a good overview of the performance of the tools as well as a more controlled way to compare such performance since all the expressed miRNA sequences are known, which may not be the case for real data. Finally, we also included an extensive benchmark for the same species using real miRNA-seq data (Supplementary TableS4) and used the miRbase entries as the gold standard for computing the benchmark metrics as proposed by the reviewer.
5-precision of BrumiR is in some cases lower than 0.2, for instance for one mouse dataset. From this dataset ~3000 mouse miRNAs are reported -the majority of which are not in miRBase and can reasonably be presumed to be false positives. The authors should comment on why this particular dataset appears to produce so many false positives for BrumiR -could this have to do with the prevalence of piRNAs that the software cannot easily discern from miRNAs? Also, the authors should reflect on what kind of use cases could tolerate these thousands of false positives. Would this be for generating candidates for downstream high-throughput validation?
We took into account the recommendations of Reviewer #1 which led to a considerable improvement in our predictions and improved the benchmarking metrics in general across all the datasets. After we changed the k-mer size to 14, we improved over all the datasets the precision rates of BrumiR (Supplementary TableS1, TableS4). Moreover, to address the question of Reviewer #2 related to the possibility of having piRNAs predicted as putative miRNAs in this dataset, we compared the predictions of BrumiR against the repository of piRNAs annotated for Mus musculus (https://www.pirnadb.org/). We observed that less than 4% of BrumiR candidates have a match with a piRNAs annotated in this database (Supplementary TableS9), a 3% and 10% of predictions of miRDeep2 and mirnovo, respectively, have a match with piRNAs. We can observe that none of the miRNA discovery tools would be capturing piRNAs instead of miRNAs, according to the data presented here.
6-the authors should either benchmark BrumiR against the genome-free methods miReader and MirPlex, or explain why this comparison is not relevant.
We used mirnovo as a de novo predicted tool. As in the mirnovo paper, the authors compare mirnovo against these tools, miReader and MirPlex, and mirnovo has the best performance, we chose to compare BrumiR only against the best performance tools only in a genome-based and de novo approach, meaning against respectively miRDeep2 and mirnovo, as we explained above we compared BrumiR with 4 different miRNA discovery tools using a reduced version of the real dataset. The results provided in Supplementary TableS5 showed that the highest F-score rates are for BrumiR, miRDeep2, miR-PREFeR and mirnovo (Supplementary FigureS9). We decided to benchmark BrumiR in a simple but exhaustive way selecting the top performers for genome-based and genome-free methods.

Minor comments:
-the brief introduction to miRNA biology should be carefully edited by an expert in the field. Currently, very old reviews are being cited (e.g. Bartel 2004), and some of the other references appear to be a bit spurious (e.g. why focus on plant host-pathogen interactions out of the hundreds of established functions of miRNAs?). The excellent review of Dave Bartel from 2018 contains references to numerous milestone studies that the introduction could build on. We included miRNA biology background and we verified our references. Thanks to the reviewers for having pointed this out.
The updated text is in page 3, lines 6-20.
-the authors write on page 2 that genome-based methods struggle with a high rate of false positive prediction, citing [9]. However, this is a mis-reference, since the reference [9] states that methods that rely on _only_ the genome and do not leverage on small RNA-seq data have high false positive rates. We changed this sentence.
The updated text is in page 4, lines 20-21.

Reviewer #3:
The manuscript by Moraga et al. describes BrumiR, a software devoted to the de novo identification of miRNAs from deep sequencing experiments of the RNA fraction at low molecular weight.
In contrast with existing tools, BrumiR is based on de Bruijn graphs, generated directly from raw fastq reads. The performances on simulated and real sequencing data, in terms of precision, recall and F-Score, are very good. In addition, the tool is ultra-fast, enabling the analysis of huge amount of data. I have tried to use BrumiR but I always got a GLIB error. I have tested the script on different Linux and Mac computers but I was not able to fix the GLIB error. It seems that a very recent version of the GLIB library is required. So, unfortunately, I didn't have the possibility to test the program and look at the outputs. Major concerns: 1-I was not able to run the program and, thus, provide a correct revision. In my opinion, the github page should take into account this by providing the minimal software and hardware architecture to run BrumiR. Authors could also include a copy of the output files (by the way, there is a typo in the description of the second output file).
We do thank the reviewer for raising this point and we created a docker container that provides all the needed software to facilitate the execution of BrumiR. The docker image is available on dockerHub at (https://hub.docker.com/repository/docker/camoragaq/BrumiR) and instructions for downloading the image are provided on the Github repository of BrumiR. Additionally, we have created a demo dataset hosted at https://github.com/camoragaq/BrumiR_demo where we have included all the expected output files for the demo dataset.
The updated text is in page 40, lines 4-6.
2-Since the tool is able to identify novel miRNAs and look also at known ones, they could provide an output file including the read count per miRNA. In addition, since the tool is expected to be ultra-fast (not checked … see above), the differential gene expression analysis could also be implemented.
BrumiR outputs an abundance estimation based on the k-mer counts for each predicted miRNA candidate. Currently, BrumiR quantify each candidate using the total number of k-mers (KC), and the average number of k-mers per base (KM). Both values can be used as proxy to access the expression of each miRNA candidate. However, how has been previously shown for transcriptome mRNAseq data the accurate estimation of gene expression based on k-mers require the 3-I suggest also implementing a graphical output. A sort of summary in a decorated html page.
We followed the reviewer's suggestion and we now provide an R notebook to create an interactive and graphical summary of the output of BrumiR. We do plan to extend these capabilities in the future to further enhance the graphical report (i.e. by providing functions to quickly determine differentially expressed miRNAs). A full example of the R Notebook is provided with the demo dataset.
The updated text is in page 40, lines 5-6.
4-By using BrumiR, authors analyze miRNAs in Arabidopsis during the development, discovering three novel miRNAs. Although bioinformatics evidences indicate that they could be real miRNAs, an experimental validation is required. Indeed, these miRNAs have been detected by BrumiR only. I think that this validation could be easily done because authors directly performed sRNAseq data. In my opinion, this experiment could really improve the manuscript and assess the high performance of BrumiR.
We agree with the reviewer that experimental validation of the candidate miRNAs is an important step and a valuable suggestion for our work. Following the reviewer's suggestion, we have revised all the updated criteria on plant miRNA annotation based on Axtell, 2018* for each novel candidate miRNA predicted by BrumiR using the Arabidopsis thaliana datasets. These criteria are based directly on precursor and miRNA features that can be determined directly from the sRNA-Seq reads. Considering these criteria, we found that 2 of our miRNA candidates fulfill all of them ( Figure  4, Figure S8, S9, S10). As such, these results highlight the usefulness of BrumiR for predicting new candidate miRNAs even in model species where the miRNA catalogs are highly complete. Although, this last results section of the BrumiR manuscript highlight the power of BrumiR to discover new miRNAs even in highly curated genomes, we believe that the biological insight and subsequent validations may be the focus of a more specialized manuscript rather than the current one where the focus is the new algorithmic ideas that we are presenting.  Dear Editor, Please find enclosed our manuscript entitled "BrumiR: A toolkit for de novo discovery of microRNAs from sRNA-seq data" which we would like to submit for publication as an article in GigaScience.
Since the first classification and annotation of miRNAs, accurately identifying them has proven difficult. Accurate prediction of known and novel miRNAs is however essential for increasing our understanding of the miRNA biology. In the last decade, with the increasing accessibility of high-throughput sequencing technologies, different methods have been developed to identify miRNAs, but most of them rely exclusively on preexisting reference genomes. However, despite all the advancements in the sequencing technologies and de novo assembly algorithms, few complete genomes are available today, which is a recurrent problem that researchers working on non-model species face. Therefore, the lack of a high-quality reference genome reduces the possibilities for discovering novel miRNAs.
In this article, we introduce the BrumiR toolkit, which is a package composed of three tools; 1) a new discovery miRNA tool (BrumiR-core), 2) a specific genome mapper (Brumir2Reference), and 3) a sRNA-seq read simulator (miRsim). In particular, BrumiRcore is a de novo algorithm based on a de Bruijn graph approach that is able to identify miRNAs directly and exclusively from sRNA-seq data. Unlike other state-of-the-art tools, BrumiR does not rely on a reference genome, on the availability of close phylogenetic species, or on conserved sequence information. Instead, BrumiR starts from a de Bruijn graph encoding all the reads and is able to directly identify putative mature miRNAs on the generated graph. Along with miRNA discovery, BrumiR assembles and identifies other types of small and long non-coding RNAs expressed within the sequencing data. Additionally, when a reference genome is available, BrumiR provides a new mapping tool (BrumiR2reference) that performs an exhaustive search to identify and validate the precursor sequences.
We extensively benchmarked the BrumiR toolkit on animal and plant species using simulated and real datasets. The benchmark results show that BrumiR is very sensitive, is the fastest tool, and its predictions were supported by the characteristic hairpin structure of miRNAs. Finally, we demonstrate the power of BrumiR for discovering novel miRNAs in the model plant Arabidopsis thaliana. We sequenced a total of 18 sRNA-seq libraries from different stages of root development and used the BrumiR toolkit to analyze our data. We annotated three novel miRNAs involved in root development, showing on a real biological situation how BrumiR catches novel information even in the case of highly annotated genomes. The BrumiR toolkit is a versatile method that represents an important contribution to the discovery of miRNAs in species with or without a reference genome. We therefore believe that it is appropriate to the wide audience of GigaScience.