Sebastiano Di Bella, Alessandro La Ferlita, Giovanni Carapezza, Salvatore Alaimo, Antonella Isacchi, Alfredo Ferro, Alfredo Pulvirenti, Roberta Bosotti, A benchmarking of pipelines for detecting ncRNAs from RNA-Seq data, Briefings in Bioinformatics, Volume 21, Issue 6, November 2020, Pages 1987–1998, https://doi.org/10.1093/bib/bbz110
Abstract
Next-Generation Sequencing (NGS) is a high-throughput technology widely applied to genome sequencing and transcriptome profiling. RNA-Seq uses NGS to reveal RNA identities and quantities in a given sample. However, it produces a huge amount of raw data that need to be preprocessed with fast and effective computational methods. RNA-Seq can look at different populations of RNAs, including non-coding RNAs (ncRNAs). Indeed, in the last few years, several pipelines have been developed for ncRNA analysis from RNA-Seq experiments. In this paper, we analyze eight recent pipelines (iSmaRT, iSRAP, miARma-Seq, Oasis 2, SPORTS1.0, sRNAnalyzer, sRNApipe, sRNA workbench) that allow the analysis of more than one ncRNA class rather than a single specific class. Our systematic performance evaluation aims at guiding users to select the appropriate pipeline for processing each ncRNA class, focusing on three key points: (i) accuracy in ncRNA identification, (ii) accuracy in read count estimation and (iii) deployment and ease of use.
Introduction
The advent of massively parallel sequencing technology, known as next-generation sequencing (NGS) [1], has dramatically improved our understanding of the complexity of the molecular mechanisms orchestrating eukaryotic and prokaryotic cell growth and development.
Next-generation sequencing allows entire genomes to be sequenced in a few days and at a reasonable cost. It enables the detection of gene mutations or polymorphisms (e.g., CNVs, SNPs, INDELs, STRs) potentially associated with disease predisposition and supports diagnosis confirmation. Indeed, targeted panel approaches have already been introduced into clinical practice, and their use in routine clinical decision-making is rising. Examples include those proposed by Myriad (https://myriad.com/products-services/hereditary-cancers/myrisk-hereditary-cancer/) or by Foundation Medicine (https://www.foundationmedicine.com/). Next-generation sequencing is also extensively used for individual transcriptome profiling, allowing for the identification of differentially expressed genes, new isoforms, splicing variants or gene rearrangements in specific diseases. Unlike microarrays, which require prior knowledge of the sequence of genes or transcripts, RNA sequencing (RNA-Seq) permits the identification of transcripts by merely detecting the reads mapped on the reference genome [2]. Moreover, this technology can be used for the detection of non-coding RNAs (ncRNAs), namely RNA molecules that do not encode proteins but represent a considerable fraction of the transcriptome [2].
Non-coding RNAs are a heterogeneous class of untranslated RNA molecules. They are involved in many aspects of cell physiology and regulate a broad spectrum of cellular processes, controlling gene expression and contributing to genome organization and stability. Non-coding RNAs can be classified according to their size into small ncRNAs (< 200 nucleotides) and long ncRNAs or lncRNAs (≥ 200 nucleotides) [2,3]. Alternatively, they can be classified according to their function into housekeeping and regulatory ncRNAs [2]. Housekeeping ncRNAs include ribosomal RNA (rRNA), transfer RNA (tRNA), small nuclear RNA (snRNA) and small nucleolar RNA (snoRNA). These ncRNAs are expressed in all cell types and carry out essential functions in eukaryotic cells [2]. On the other hand, regulatory ncRNAs include several classes of small and long molecules, such as microRNA (miRNA), small interfering RNAs (siRNA), Piwi-associated RNAs (piRNA), long non-coding RNAs (lncRNA), circular RNAs (circRNA) and tRNA-derived ncRNAs [2]. This latter group represents a novel class of small regulatory ncRNAs, which derive from pre-tRNA and tRNA processing [4]. They have been reported to have essential roles in different biological processes, such as ribosome biogenesis, retrotransposition, virus infections, apoptosis and cancer pathogenesis [5–13]. Nowadays, ncRNAs are thoroughly studied due to their roles in human health [4,14–19].
With the increasing research interest in ncRNAs, the identification of the different classes of ncRNA has emerged as a critical issue. Indeed, RNA-Seq produces a dramatically higher amount of data than traditional approaches, such as real-time PCR or microarrays, demanding fast and effective computational approaches [20]. Several pipelines have been developed for the analysis of RNA-Seq data [21–26] and miRNAs, such as DSAP [27], miRanalyzer [28], miRExpress [29], miRNAkey [30], iMir [31], CAP-miRSeq [32], mirTools 2.0 [33], sRNAtoolbox [34], miRDeep 2 [35] and MapMi [36]. Some pipelines can process both RNA-Seq and miRNA-Seq [37–40]. Other tools have been implemented to specifically detect piRNAs, such as piPipes [41], PILFER [42], piRNAPredictor [43] and PIANO [44], or lncRNAs, such as UClncR [45]. Nevertheless, several pipelines [46] present restrictions on the input data size and allow only for static workflows. This last issue is a clear limitation since users cannot start the analysis at different stages of the pipeline. Also, some of them have a complicated setup and are not user-friendly. Hence, a systematic evaluation of workflow performance is needed to guide users in the selection of the appropriate pipeline for processing each different class of ncRNAs among all the available methods.
In this paper, we surveyed eight RNA-Seq pipelines that enable the analysis of more than one ncRNA class rather than a single class. We compared their performance by analyzing their detection accuracy on both synthetic and real RNA-Seq data sets. In particular, our analysis was devised to evaluate pipeline performance on three key points: (i) accuracy in ncRNA identification, (ii) accuracy in read count estimation, and (iii) deployment and ease of use.
Materials and methods
Design of synthetic RNA-seq data sets
Two synthetic human RNA-Seq data sets were generated and used for the comparison of the different ncRNAs pipelines. The first data set simulates small non-coding RNA-Seq data and includes miRNAs, piRNAs, snoRNAs, and tRNA-derived ncRNAs, while the second one simulates long RNA-Seq data and covers mRNAs, circRNAs, and lncRNAs.
Synthetic RNA-Seq data were obtained by using Flux Simulator [47]. Briefly, Flux Simulator simulates a transcriptome starting from the genomic sequences of a specific species and the corresponding gene structure annotation. Then, the obtained in silico transcriptome undergoes RT/fragmentation according to the chosen experimental technique. The Flux Simulator pipeline provides optional steps for modeling the final library preparation, involving in silico ligation of adapter sequences, fragment size selection and PCR amplification. Finally, Flux Simulator produces a FASTQ file as output in which the reads associated with each ncRNA are annotated. In this way, it is possible to determine precisely which RNA species are present in the synthetic data set and how many reads are associated with each of them [47]. The sequences of the simulated reads are retrieved from the reference genome at the genomic coordinates listed in the GTF file, which is provided as input together with the reference genome. We chose Flux Simulator to generate RNA-Seq experiments since it comprises explicit models for the processes that determine abundance and distribution of reads according to specified experimental protocols [47].
Specifically, to build the two data sets, we first created custom annotation files (GTF format) for small RNAs and long RNAs, respectively. These files were created by using an R script [48], which randomly selects molecules from the original genomic coordinates present in the following databases: UCSC [49] for genes, miRNAs and snoRNAs; tRFdb, GtRNAdb and Balatti et al. [10,50,51] for tRNA-derived ncRNAs; piRBase [52] for piRNAs; circBase [53] for circRNAs and LNCipedia [54] for lncRNAs.
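The random selection step described above can be sketched as follows. This is a minimal Python illustration, not the authors' R script: the record layout, the function name `sample_to_gtf`, the `custom` source label and the toy coordinates are all assumptions made for the example. Only the nine-column, tab-separated GTF layout is standard.

```python
import random


def sample_to_gtf(records, n, feature, source="custom", seed=0):
    """Randomly pick n annotation records and format them as GTF lines.

    Each record is a dict with chrom, start, end, strand and name keys
    (a hypothetical in-memory layout, not the authors' input format).
    """
    rng = random.Random(seed)        # fixed seed keeps the selection reproducible
    chosen = rng.sample(records, n)  # sampling without replacement
    lines = []
    for rec in sorted(chosen, key=lambda r: (r["chrom"], r["start"])):
        attributes = 'gene_id "{0}"; transcript_id "{0}";'.format(rec["name"])
        # GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
        fields = [rec["chrom"], source, feature, str(rec["start"]), str(rec["end"]),
                  ".", rec["strand"], ".", attributes]
        lines.append("\t".join(fields))
    return lines


# Toy coordinates, invented for illustration only.
toy_records = [
    {"chrom": "chr1", "start": 1000 + 500 * i, "end": 1021 + 500 * i,
     "strand": "+" if i % 2 == 0 else "-", "name": "toy-mir-%d" % i}
    for i in range(10)
]
gtf_lines = sample_to_gtf(toy_records, n=4, feature="exon", seed=42)
print("\n".join(gtf_lines))
```

Fixing the seed makes the selection repeatable, which matters when the resulting annotation is later used as the ground truth for a benchmark.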
This process yielded one GTF file for small RNAs, containing genomic coordinates for 193 miRNAs, 100 snoRNAs, 500 piRNAs and 200 tRNA-derived ncRNAs (equally split into 5′ leader RNAs, tsRNAs, tRF-5 and tRF-3), and one annotation file for long RNAs including coordinates for 100 protein-coding genes (895 exon genomic coordinates), 500 circRNAs and 500 lncRNAs. Flux Simulator, however, was not able to correctly simulate circRNA sequences. A possible explanation is that the circRNA genomic coordinates used for the generation of simulated circRNA reads spanned very large regions, and the genes transcribed within circRNAs were not annotated for introns and exons in circBase. Flux Simulator might thus have generated reads that straddle exons and introns of protein-coding genes, which are not components of circRNA molecules. Therefore, circRNAs were not included in the evaluation.
Gene Transfer Format (GTF) files, together with the human hg19 reference genome, were then submitted to Flux Simulator [47] and synthetic FASTQ files were built. The parameters used for the generation of the two synthetic RNA-seq data sets are reported in Table 1.
Table 1. Flux Simulator parameters used for the generation of the synthetic RNA-Seq data sets
Parameter | Small RNA-seq | Long RNA-seq |
---|---|---|
NB_MOLECULES | 5,000,000 | 5,000,000 |
READ_NUMBER | 15,000,000 | 5,000,000 |
TSS_MEAN | NaN | NaN |
FRAG_SUBSTRATE | RNA | RNA |
POLYA_SCALE | NaN | NaN |
POLYA_SHAPE | NaN | NaN |
PCR_PROBABILITY | 0.5 | 0.5 |
GC_MEAN | NaN | NaN |
FRAG_METHOD | UR | UR |
FRAG_EZ_MOTIF | NlaIII | NlaIII |
READ_LENGTH | 35 | 75 |
PAIRED_END | false | false |
FASTA | YES | YES |
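As an illustration of how the two parameter sets differ, the sketch below renders the Table 1 values as plain key/value text. The parameter names and values are taken from the table; the tab-separated one-entry-per-line layout, the function name `render_parameters` and the omission of path entries (annotation file, genome directory), which the table does not list, are assumptions of this example.

```python
# Values shared by both synthetic data sets (from Table 1).
SHARED = {
    "NB_MOLECULES": "5000000",
    "TSS_MEAN": "NaN",
    "FRAG_SUBSTRATE": "RNA",
    "POLYA_SCALE": "NaN",
    "POLYA_SHAPE": "NaN",
    "PCR_PROBABILITY": "0.5",
    "GC_MEAN": "NaN",
    "FRAG_METHOD": "UR",
    "FRAG_EZ_MOTIF": "NlaIII",
    "PAIRED_END": "false",
    "FASTA": "YES",
}

# Only the read number and read length differ between the two runs.
PER_DATASET = {
    "small": {"READ_NUMBER": "15000000", "READ_LENGTH": "35"},
    "long": {"READ_NUMBER": "5000000", "READ_LENGTH": "75"},
}


def render_parameters(dataset):
    """Return the parameters for one data set as 'KEY<TAB>VALUE' lines."""
    params = dict(SHARED)
    params.update(PER_DATASET[dataset])
    return "\n".join("%s\t%s" % (key, params[key]) for key in sorted(params)) + "\n"


print(render_parameters("small"))
```

Keeping the shared values in one place makes it explicit that the small and long RNA simulations differ only in sequencing depth and read length.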
Table 2. Classes of ncRNAs analyzed by each pipeline
Pipeline | miRNA | piRNA | snoRNA | tRNA-derived | lncRNA | circRNA |
---|---|---|---|---|---|---|
iSRAP | X | X | X | | | |
iSmaRT | X | X | | | | |
sRNApipe | X | X | X | | | |
miARma-Seq | X | | X | | | X |
sRNAnalyzer | X | X | X | | X | |
SPORTS1.0 | X | X | X | X | | |
Oasis 2 | X | X | X | | | |
sRNA workbench | X | | | | | |
Real RNA-Seq data set
In addition to the synthetic data, we used a real small RNA-Seq data set produced using Illumina HiSeq 2500 technology on the MDA-MB-231 breast cancer cell line, obtained from the Sequence Read Archive (SRA) (SRR5689212) [55]. This data set contains RNA molecules shorter than 200 nucleotides, suitable for the evaluation of all types of small ncRNAs considered in this review. A second RNA-Seq data set on the same cell line, obtained from GDC (https://portal.gdc.cancer.gov/legacy-archive/files/0f5ba7d3-6f43-44af-9bbc-f9b4c09bbfeb), was used for lncRNA and circRNA evaluation.
Selected pipelines for the analysis of ncRNAs
The purpose of this review was to evaluate the performance of currently available pipelines that allow the processing of more than one ncRNA class. Thus, we compared eight ncRNA pipelines published between 2015 and 2019, covering the following classes: miRNAs, piRNAs, snoRNAs, tRNA-derived ncRNAs and lncRNAs. Tools or pipelines specifically developed for the analysis of a single ncRNA class were not included in this review. Two of the tested pipelines (sRNApipe and miARma-Seq) also allow for assessing the expression profile of protein-coding genes, which was outside the scope of our study and therefore was not analyzed. In Table 2, the classes of ncRNAs analyzed for each pipeline are reported. A summary of their technical features is provided in Table 3. A brief description of the features of each pipeline is provided below.
Table 3. Technical features of the pipelines
Pipeline | Operating system | Language | Software requirements |
---|---|---|---|
iSmaRT | Ubuntu 16.04 | Python | |
iSRAP | Linux/Mac machine | Ruffus (Python) | |
miARma | Fedora 23, CentOS 7.2, Ubuntu (from 10.4 to 14.04), Debian jessie | Perl, R | |
Oasis 2 | | Java, J2EE, MySQL, Python, R, PHP and JavaScript | Web-based pipeline. JavaScript must be enabled in the browser. |
SPORTS1.0 | Linux system (Tested system: Ubuntu 12.04 and 16.04 LTS) | Perl, R | Perl 5 (Tested version: v5.14.2, v5.22.1). Bowtie. SRA Toolkit. Cutadapt. R (Tested version: 3.2.3, 3.2.5). Reference database |
sRNAnalyzer | | Perl | Python 2.6 or later and Perl 5 or later. Bowtie. Fastx toolkit. Cutadapt |
sRNApipe | Unix system with a Galaxy server | Perl | Perl higher than 5.1 with packages: ‘perl-statistics’, ‘Parallel::ForkManager’, ‘Statistics::R’, ‘Getopt::Long’, ‘String::Random’, ‘File::Copy::Recursive’, ‘Math::CDF’. R higher than 3.1 with libraries ‘plotrix’, ‘bioconductor-sushi’, ‘RColorBrewer’ and ‘ggplot2’. BWA. BedTools. Samtools |
sRNA workbench | Cross-platform tool | Java | JavaFX dependencies |
iSmaRT is a bioinformatics pipeline with a Graphical User Interface (GUI) for the analysis of miRNAs and piRNAs from small RNA-seq data [56]. iSmaRT enables a comprehensive analysis, including quality control, identification of miRNAs and piRNAs expressed in each sample, differential expression analysis, identification of RNA editing events on miRNA, RNA target prediction and Reactome Pathway Analysis (ReactomePA) [56].
iSRAP is a tool provided with a command line interface (CLI) for the profiling of small RNAs (miRNAs, piRNAs and snoRNAs) [57]. A YAML configuration file allows users to define options and optimize small RNA profiling in different data sets [57]. The pipeline can be executed starting from either FASTQ or BAM alignment files. Results are reported as PDF and HTML documents, complete with graphical elements that illustrate the results.
miARma-Seq is a pipeline enabling the identification and differential expression analysis of mRNAs, miRNAs, snoRNAs and circRNAs. It also allows for miRNA target prediction and the analysis of gene ontologies (GO) in any organism with a sequenced genome available [58]. miARma-Seq comes with several preconfigured parameters and can be executed through a CLI [58]. It accepts different input files: raw FASTQ files, BAM alignment files or raw count txt files. Thus, the pipeline can start at different steps. The tool generates PDF documents as output. Boxplots and density plots of normalized and non-normalized data, multidimensional scaling (MDS) plots, principal component analysis (PCA) plots, heatmaps and clustering plots are also provided [58].
Oasis 2 is a web tool for the analysis of miRNAs, piRNAs, snRNAs, snoRNAs and rRNAs [59]. It takes as input FASTQ or compressed FASTQ files. It performs the identification, quantification and differential expression analysis of all the ncRNAs classes mentioned above. The results are reported by means of text tables and plots. Oasis 2 can perform adaptor removal and can be used with several reference genomes [59].
SPORTS1.0 is a CLI pipeline for the identification and quantification of miRNAs, piRNAs, snoRNAs and tRNA-derived ncRNAs [60]. It also allows for the analysis of rRNA small derived RNAs (rsRNA) [60]. SPORTS1.0 can be used with a wide range of species with an available reference genome. It takes as input the following files: SRA data set, FASTQ and FASTA files. The output is provided as txt and PDF files, with annotation details for each sequence, length distribution along with other statistics and figures [60].
sRNAnalyzer is a CLI pipeline with a text-based configuration file for the identification of miRNAs, piRNAs, snoRNAs and lncRNAs [61]. The pipeline allows for processing the reads (upon adaptors removal), performing quality filters, read mapping and counting. It also filters the exogenous RNAs. For the endogenous sequences, sRNAnalyzer uses ‘map and remove’ structure (i.e., only unmapped reads go to the next steps), with a progressive alignment strategy to sequentially map the reads against various databases [61]. sRNAnalyzer can be used with samples from different species by appropriately modifying the configuration files. Currently, configuration files for human, mouse, rat, horse, macaque and plant are available [61]. sRNAnalyzer takes as input FASTQ files and produces as output txt files describing the matches between mapped reads and the reference genome.
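The ‘map and remove’ idea described above can be illustrated with a small Python sketch. It is not sRNAnalyzer code: exact set membership stands in for a real aligner, and the function name `map_and_remove`, the database order and the toy sequences are assumptions chosen for the example.

```python
def map_and_remove(reads, databases):
    """Assign each read to the first database that contains it.

    `databases` is an ordered list of (name, set_of_sequences) pairs.
    Exact set membership stands in for a real aligner here.
    """
    remaining = list(reads)
    assigned = {}
    for name, sequences in databases:
        assigned[name] = [read for read in remaining if read in sequences]
        # Only the reads that did not map move on to the next database.
        remaining = [read for read in remaining if read not in sequences]
    return assigned, remaining


# Toy reads and databases, invented for illustration only.
reads = ["AAAA", "CCCC", "GGGG", "TTTT", "AAAA"]
databases = [
    ("miRNA", {"AAAA"}),
    ("piRNA", {"AAAA", "CCCC"}),  # AAAA is listed here too, but the miRNA step consumes it
    ("snoRNA", {"GGGG"}),
]
assigned, unmapped = map_and_remove(reads, databases)
print(assigned)
print(unmapped)
```

Because a read is removed as soon as it maps, the order of the databases decides which class an ambiguous read is counted under.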
sRNApipe is a web-based pipeline, available on Galaxy, which performs small RNA mapping, counting, normalization and analysis of signatures for ping-pong amplification in the case of piRNAs [62]. The pipeline allows for the identification of mRNAs, transposable elements, miRNAs, piRNAs, snRNAs, snoRNAs, rRNAs and tRNAs. sRNApipe takes as input single-end sequencing data in FASTQ format (Phred +33) with no adaptors and a list of FASTA reference files such as genome, mRNAs, transposable elements, rRNAs, tRNAs, snRNAs/snoRNAs and miRNAs [62]. In the 1.0 version only, the pipeline can be run without rRNAs, tRNAs and snRNAs/snoRNAs reference files. For the analysis of small ncRNAs in the synthetic data set, the maximum read length, which by default is 29 nt, was set to 35 nt.
sRNA workbench is a pipeline reported to allow for the analysis of miRNAs and other small RNAs (sRNAs). It performs identification, quantification, normalization and differential expression analysis of miRNAs. The mapping of miRNAs and sRNA loci on the reference genome is also possible [63].
Pipeline comparison
We drew scatterplots and calculated the R² on the true positives (TP) to compare the ncRNA expression profiles identified by each pipeline with the real RNA expression values included in the RNA-Seq data set.
Next, to compare the read count estimation among the different tools, we computed the Pearson correlation matrix considering, for each ncRNA type, the subset shared by all pipelines, as follows:
$$\rho_{A,B}=\frac{\sigma_{x_A,x_B}}{\sigma_{x_A}\sigma_{x_B}}$$
where $x_A$ and $x_B$ are the vectors of the expression values of ncRNAs identified by pipelines A and B, respectively, $\sigma_{x_A,x_B}$ is the covariance between the vectors $x_A$ and $x_B$, and $\sigma_{x_A}$ and $\sigma_{x_B}$ are the standard deviations of the two vectors.
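A minimal Python sketch of this computation is given below. It is an illustration only (the study itself used R): the pipeline names A, B and C, the read counts and the function names are hypothetical. The matrix is restricted to the ncRNAs reported by every pipeline, as described above.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors of covariance and variances cancel out.
    # Note: a constant vector (zero variance) would cause a division by zero.
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(counts):
    """counts maps pipeline name -> {ncRNA id: read count}."""
    # Keep only the ncRNAs identified by every pipeline.
    shared = sorted(set.intersection(*(set(c) for c in counts.values())))
    names = sorted(counts)
    vectors = {p: [counts[p][i] for i in shared] for p in names}
    matrix = {a: {b: pearson(vectors[a], vectors[b]) for b in names} for a in names}
    return matrix, shared


# Toy read counts, invented for illustration only.
counts = {
    "A": {"m1": 10, "m2": 20, "m3": 30, "m4": 5},
    "B": {"m1": 20, "m2": 40, "m3": 60},
    "C": {"m1": 30, "m2": 20, "m3": 10, "m5": 7},
}
matrix, shared = correlation_matrix(counts)
print(shared)
print(matrix["A"]["B"], matrix["A"]["C"])
```

In the toy data, m4 and m5 are dropped because not every pipeline reports them, so each correlation is computed on vectors of equal length.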
Statistical analysis was performed using R (version 3.5.2) [48]; scatterplots, Jaccard index and Pearson correlation matrix plots were generated using the ggplot2 library [64].
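The Jaccard index is named here without a formula. A minimal sketch of its standard definition (size of the intersection divided by size of the union) follows. Applying it to the sets of ncRNAs identified by two pipelines is an assumption of this example, as are the function name, the toy identifiers and the convention of returning 1.0 for two empty sets.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    # Convention chosen for this sketch: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


# Toy identifier sets, invented for illustration only.
pipeline_a = {"m1", "m2", "m3"}
pipeline_b = {"m2", "m3", "m4"}
print(jaccard(pipeline_a, pipeline_b))
```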
Results
The performance of eight ncRNA pipelines enabling the processing of RNA-Seq data, published between 2015 and 2019, was compared. In particular, we evaluated the ease of installation and use together with result accuracy.
Installation and use
Ease of installation and use, together with flexibility, are crucial characteristics of bioinformatics pipelines, and they become even more important when an application is distributed to non-expert users. Thus, we evaluated a number of parameters that might influence user experience: (i) setup process, (ii) amount and quality of the documentation, (iii) presence of a GUI, (iv) possibility of using different input file formats, (v) possibility of analyzing more than one class of ncRNAs in a single run, (vi) pipeline flexibility, and (vii) output file formats (txt, pdf, image, etc.). In Table 4, we report a schematic evaluation of the main features of each pipeline. Please refer to supplementary Table S1 for a detailed description of the criteria used for feature evaluation.
Table 4. Schematic evaluation of the main features of each pipeline
Feature | iSmaRT | iSRAP | miARma-Seq | Oasis 2 | SPORTS1.0 | sRNAnalyzer | sRNApipe | sRNA workbench |
---|---|---|---|---|---|---|---|---|
Installation | ++ | − | ++ | ++ | − | + | + | ++ |
Documentation | ++ | ++ | ++ | ++ | + | + | + | ++ |
GUI | ++ | − | − | ++ | − | − | + | ++ |
Different Input Types | − | + | ++ | − | + | − | − | − |
Report generation | ++ | ++ | ++ | ++ | ++ | + | ++ | − |
Multi-ncRNAs in single analysis | − | ++ | − | ++ | + | ++ | + | − |
Flexibility | − | + | ++ | − | − | − | − | − |
Usability/configuration | ++ | + | ++ | ++ | − | + | ++ | ++ |
Precision, sensitivity and F-measure values obtained for each ncRNA pipeline by using our synthetic RNA-seq data set
ncRNA class | Pipeline | Precision | Sensitivity | F-measure |
---|---|---|---|---|
miRNA | iSmaRT | 0.76 | 0.98 | 0.85 |
miRNA | iSRAP | 0.84 | 1.00 | 0.91 |
miRNA | miARma | 0.81 | 0.96 | 0.88 |
miRNA | Oasis 2 | 0.36 | 0.92 | 0.52 |
miRNA | SPORTS1.0 | 0.90 | 0.97 | 0.93 |
miRNA | sRNAnalyzer | 0.18 | 0.98 | 0.30 |
miRNA | sRNApipe | 0.86 | 0.92 | 0.89 |
miRNA | sRNA workbench | 0.82 | 0.79 | 0.81 |
piRNA | iSmaRT | 0.37 | 0.73 | 0.49 |
piRNA | iSRAP | 0.17 | 1.00 | 0.30 |
piRNA | Oasis 2 | 0.19 | 0.72 | 0.30 |
piRNA | sRNAnalyzer | 0.71 | 0.89 | 0.79 |
snoRNA | iSRAP | 0.52 | 1.00 | 0.68 |
snoRNA | miARma | 0.58 | 0.99 | 0.73 |
snoRNA | Oasis 2 | 0.37 | 0.65 | 0.47 |
snoRNA | sRNAnalyzer | 0.81 | 0.95 | 0.87 |
lncRNA | sRNAnalyzer | 0.25 | 0.99 | 0.40 |
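The three metrics in the table above can be computed from the set of ncRNAs a pipeline reports and the set actually simulated. The sketch below uses the standard definitions (precision = TP/(TP+FP), sensitivity = TP/(TP+FN), F-measure as their harmonic mean). These are assumed here because the formulas are not spelled out in this section, although the tabulated values are consistent with the harmonic mean: for example, a precision of 0.84 and a sensitivity of 1.00 give 0.91. The function names and toy identifiers are hypothetical.

```python
def f_measure(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (0.0 when both are zero)."""
    total = precision + sensitivity
    return 2 * precision * sensitivity / total if total else 0.0


def detection_metrics(detected, truth):
    """Return (precision, sensitivity, F-measure) for a set of detected ncRNAs."""
    detected, truth = set(detected), set(truth)
    tp = len(detected & truth)  # simulated and reported
    fp = len(detected - truth)  # reported but never simulated
    fn = len(truth - detected)  # simulated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity, f_measure(precision, sensitivity)


# Toy identifiers, invented for illustration only.
truth = {"t1", "t2", "t3", "t4"}
detected = {"t1", "t2", "t3", "x1", "x2"}
precision, sensitivity, f_score = detection_metrics(detected, truth)
print(precision, sensitivity, round(f_score, 2))
```

A pipeline that reports many ncRNAs absent from the simulation keeps a high sensitivity but loses precision, which is the pattern seen in several rows of the table.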

Scatterplots reporting the R² calculated between the expression profiles of true-positive (TP) miRNAs obtained by each ncRNA pipeline and the real expression of miRNAs in the synthetic small RNA-Seq data set.

Scatterplots reporting the R² calculated between the expression profiles of true-positive (TP) piRNAs obtained by each ncRNA pipeline and the real expression of piRNAs in the synthetic small RNA-Seq data set.

Scatterplots reporting the R² calculated between the expression profiles of true-positive (TP) snoRNAs obtained by each ncRNA pipeline and the real expression of snoRNAs in the synthetic small RNA-Seq data set.
iSmaRT is provided with comprehensive documentation, including several examples, which guides the user through each step of the installation process and pipeline use. A single file is provided for installation and configuration. The tool comes with a user-friendly GUI, developed in Python. iSmaRT takes as input FASTQ files only and does not allow for starting the analysis from alignment files. Different ncRNA classes cannot be analyzed together in a single run. Output is provided as txt files and tiff plots.
iSRAP requires manual installation of dependencies. The documentation is adequate to describe the tool usage. The configuration file is well structured, providing users with the flexibility of selecting at which step initiating the analysis. iSRAP takes as input both FASTQ and BAM files. The output is organized in independent folders, one for each step, containing txt and pdf files. iSRAP is a CLI tool; therefore, no GUI is provided.
miARma-Seq comes with a modular configuration file, permitting users to select which step performs. Indeed, it accepts FASTQ, BAM and txt files with raw counts. miARma-Seq guide is organized as a tutorial, with an exhaustive documentation covering the major features of the pipeline. Although miARma-Seq can identify miRNAs, snoRNAs and circRNAs, it is not able to identify circRNAs together with miRNAs and snoRNAs in a single run. It is not equipped with a GUI.
Oasis 2 is a web-based tool. No installation of the software or of dependencies is required. It is provided with a GUI which makes the tool very user friendly. The setting of few parameters, such as the reference genome and specification of which adaptors have to be removed, is required. Oasis 2 user guide is provided as a video tutorial or PDF file. The output is organized in several folders. Results are reported as txt tables and plots. Oasis 2 can process simultaneously several classes of ncRNAs (miRNAs, piRNAs and snoRNAs) in a single run.
SPORTS1.0 has no setup scripts; therefore, all the parameters must be passed directly in the CLI. The provided documentation lists all the dependencies and the configuration necessary for users' machine setup. Outputs are provided as pdf documents for each ncRNAs class which can be analyzed in the same run. Moreover, a txt file summarizes the result. SPORTS has no GUI.
sRNAnalyzer comes with a YAML configuration file, which permits the users to customize pre-processing and alignment options. A separate configuration file lists the paths for internal database setup. The documentation is modest and only lists the required dependencies for the setup. The output is organized in profiles and features txt files. sRNAnalyzer has no GUI.
sRNApipe requires a Galaxy Server installed on the user's machine. Galaxy handles the installation of dependencies. The documentation is exhaustive regarding the pipeline usage; however, it is based on an old release. sRNApipe is very easy to use: users only need to upload FASTQ and reference FASTA files. The output is organized in html format with text and plots.
sRNA workbench can be used through both a CLI and a GUI. It is provided with comprehensive documentation (PDF manuals and video tutorials), available through the website (http://srna-workbench.cmp.uea.ac.uk/). sRNA workbench has been developed in Java. It requires the installation of a single dependency (Java FX). The tool can be launched from a jar file. sRNA workbench requires FASTQ or FASTA input files. It is not possible to start the analysis at different steps of the pipeline. Although the sRNA workbench website claims that the tool can analyze sRNAs, we only found pipelines for the identification and quantification of miRNAs. The mapping of miRNAs and sRNA loci on the reference genome is also possible.
For user convenience, we have summarized the main advantages and limitations of each pipeline in a schematic table (see supplementary Table S2).
Pipeline accuracy on synthetic data sets
The identification of the correct RNA molecules and the quantification of their expression are among the crucial tasks that an RNA-Seq analysis pipeline should accomplish. To test the ability of the different pipelines to recover ncRNAs, we prepared two synthetic FASTQ files using Flux Simulator [47]. Specifically, we built a FASTQ file simulating small RNA-Seq data (miRNAs, snoRNAs, piRNAs and tRNA-derived ncRNAs) and a FASTQ file for long RNAs (mRNAs and lncRNAs).
For each ncRNA class, we assessed the ability of the pipelines to identify the correct RNA molecules in terms of TP, TN, FP, FN, Sensitivity, Precision, and F-measure. To determine the accuracy of the expression profile measurement, we computed the R2 coefficient between the real counts and the pipeline-quantified ones. For each tool, we reported scatterplots to visualize the relationship between real counts and predicted ones.
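As an illustration of these metrics, the following minimal Python sketch computes Precision, Sensitivity and F-measure from the sets of simulated and detected ncRNAs, and an R2 value on the TPs. It is not the code used in this study: the function names and the toy identifiers are hypothetical, and R2 is assumed here to be the squared Pearson correlation between real and estimated counts, which the text does not state explicitly.

```python
def f_measure(precision, sensitivity):
    """Harmonic mean of Precision and Sensitivity (0.0 if both are zero)."""
    total = precision + sensitivity
    return 2 * precision * sensitivity / total if total else 0.0


def detection_metrics(truth, predicted):
    """Compare the IDs simulated in the data set with the IDs a pipeline reports."""
    truth, predicted = set(truth), set(predicted)
    tp = len(truth & predicted)   # simulated and detected
    fp = len(predicted - truth)   # detected but never simulated
    fn = len(truth - predicted)   # simulated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return {
        "TP": tp, "FP": fp, "FN": fn,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure(precision, sensitivity),
    }


def r_squared(true_counts, estimated_counts):
    """Squared Pearson correlation over the TPs, i.e. IDs present in both dicts.

    Assumption: R2 is taken as the squared Pearson coefficient.
    """
    shared = sorted(set(true_counts) & set(estimated_counts))
    if len(shared) < 2:
        raise ValueError("at least two shared ncRNAs are required")
    xs = [true_counts[k] for k in shared]
    ys = [estimated_counts[k] for k in shared]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("counts have zero variance")
    return sxy * sxy / (sxx * syy)


if __name__ == "__main__":
    # Toy data with hypothetical identifiers, not taken from the benchmark.
    simulated = {"ncRNA_a": 100, "ncRNA_b": 50, "ncRNA_c": 10, "ncRNA_d": 5}
    detected = {"ncRNA_a": 90, "ncRNA_b": 55, "ncRNA_c": 12, "ncRNA_x": 40}
    print(detection_metrics(simulated, detected))
    print(r_squared(simulated, detected))
```

With this definition of the F-measure, the Precision and Sensitivity reported for SPORTS1.0 on miRNAs (0.90 and 0.97) give 0.93, in agreement with Table 5.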
miRNAs. Our synthetic small RNA-seq data set comprises 193 miRNA sequences, which were analyzed by all eight pipelines. Using this data set, we calculated Sensitivity, Precision and F-measure values for each pipeline, as summarized in Table 5. The scatterplots and the R2 computed on the TPs between miRNAs expression values identified by each pipeline and the real counts present in the simulated data set are reported in Figure 1. SPORTS1.0 is the pipeline accomplishing the best performance in detecting miRNAs, followed by iSRAP, sRNApipe, miARma-Seq, iSmaRT, sRNA workbench, Oasis 2 and sRNAnalyzer (Table 5), while sRNA workbench (R2 = 0.96), SPORTS1.0 (R2 = 0.96) and iSRAP (R2 = 0.96) are the most accurate tools in read count estimation, followed by sRNApipe (R2 = 0.94), miARma-Seq (R2 = 0.94), iSmaRT (R2 = 0.58), sRNAnalyzer (R2 = 0.52) and Oasis 2 (R2 = 0.38) (Figure 1).
It is noteworthy that all pipelines share a common set of FP miRNAs. Although these false positives are supported by a high number of read counts, there are specific biological reasons that can explain these unexpected results. Indeed, miR-4521 has recently been re-annotated as a tRNA-derived small RNA (tsRNA), specifically ts-101 [9]. ts-101 is present in our synthetic small RNA-seq data set, so it was correctly detected by sRNAnalyzer, SPORTS1.0, Oasis 2, iSmaRT, and sRNApipe, although annotated as an miRNA. Similarly, miR-3182 and miR-6516 were identified with a significant number of counts by iSmaRT, miARma-Seq, Oasis 2, SPORTS1.0 and sRNAnalyzer because they share sequence similarity with the tRNA-fragment (tRF) tRFdb-5026a (reported in the tRFdb database [50]) and the snoRNA ACA 47, respectively, both contained in our simulated data set. In addition, miR-214, miR-522, miR-550b and miR-103b were also identified as FP miRNAs with high counts. This artifact could be explained by performing a blastn analysis (version 2.6.0+ [65]). These miRNAs showed sequence identity (≥ 91%) with the following miRNAs present in our data set: miR-3120, miR-519a, miR-550a and miR-103a, respectively.
piRNAs. To determine the accuracy of the pipelines for piRNAs detection, we evaluated their ability to detect the 500 different piRNA sequences contained in our synthetic data set. Six of eight tested pipelines (iSmaRT, iSRAP, Oasis 2, SPORTS1.0, sRNAnalyzer and sRNApipe) are reported to allow for the identification of piRNAs. However, SPORTS1.0 does not annotate piRNA sequences, thus, it just reports the total number of piRNA mapped reads without detailing the results. sRNApipe instead identifies the piRNA sequences mapping on transposable elements (TE) and protein-coding genes only, and it does not report which are the identified molecules in the output. Therefore, we could only consider four of six ncRNAs pipelines for piRNA comparison. The pipeline showing the best performance in terms of Sensitivity, Precision and F-measure in piRNAs detection is sRNAnalyzer, followed by iSmaRT, iSRAP and Oasis 2 (Table 5). In terms of read count accuracy, the best performing one is iSRAP (R2 = 0.83), followed by iSmaRT (R2 = 0.66), Oasis 2 (R2 = 0.28) and sRNAnalyzer (R2 = 0.08) (Figure 2).
snoRNAs. To determine the accuracy in snoRNA detection, we used 100 different snoRNA sequences present in our synthetic data set. Six of eight tested pipelines (iSRAP, miARma-Seq, Oasis 2, SPORTS1.0, sRNAnalyzer, sRNApipe) are claimed to detect snoRNAs. However, in this case too, SPORTS1.0 does not annotate snoRNA molecules and only provides the total number of snoRNA-mapped reads, which does not allow for a comparison, while sRNApipe could not detect any of the 100 snoRNAs present in the synthetic data set. Thus, we could calculate the statistics for iSRAP, miARma-Seq, Oasis 2 and sRNAnalyzer only. The pipeline with the best performance in terms of Sensitivity, Precision and F-measure is sRNAnalyzer, followed by miARma-Seq, iSRAP and Oasis 2 (Table 5), while for read count estimation, the best is iSRAP (R2 = 0.81) followed by miARma-Seq (R2 = 0.73), sRNAnalyzer (R2 = 0.50) and Oasis 2 (R2 = 0.04) (Figure 3).
tRNA-derived ncRNAs. SPORTS1.0 is the only pipeline reported to allow for the analysis of tRNA-derived ncRNAs. However, although SPORTS1.0 can identify reads mapped on tRNA genes, it does not annotate their specific type and just reports the number of mapped reads for each tRNA gene. tRNA-derived ncRNAs are typically classified according to their origin within the tRNA gene and belong to two main classes: (i) tRNA-derived small RNAs (tsRNAs), arising from pre-tRNA; (ii) stress-induced tRNA fragments (tiRNAs) and tRNA-derived fragments (tRFs), deriving from mature tRNA [4]. tRFs can be further classified into tRF-5 and tRF-3 according to the ribonuclease cleavage site within the mature tRNA D-loop or T-loop, respectively [4]. This annotation is commonly used in tRNA-derived ncRNA databases, such as tRFdb [50] and MINTbase [66]. We thus used this annotation for our small RNA-seq simulated data set. Specifically, we selected 50 5′ leader tsRNAs, 50 3′ trailer tsRNAs, 50 tRF-5 and 50 tRF-3. Because SPORTS1.0 reports only per-gene read counts, it was not possible to establish which tRNA-derived ncRNAs present in our synthetic data set were detected by the pipeline, and an accurate performance evaluation could not be performed.
lncRNAs. Among the tested pipelines, sRNAnalyzer is the only one reported to analyze lncRNAs. To evaluate its performance, we selected 500 different lncRNA sequences from our synthetic long RNA-seq data set. sRNAnalyzer identified lncRNAs with high Sensitivity (0.99) and low precision (0.25) (Table 5). sRNAnalyzer can efficiently estimate the lncRNAs expression profile (R2 = 0.96) (Figure 4).

Figure 4. Scatterplot reporting the R2 calculated between the expression profile of TP lncRNAs obtained by sRNAnalyzer and the real expression of lncRNAs in the synthetic long RNA-Seq data set.
Pipeline similarity and correlation on real data sets
To go beyond the limited complexity of the RNA species present in a synthetic data set, we also performed a comparative analysis using real data. Specifically, we selected a small RNA-Seq data set retrieved from Sequence Read Archive (SRA) (SRR5689212), which belongs to a breast cancer cell line (MDA-MB-231) of the NCI-60 panel [55]. This data set covers RNA molecules shorter than 200 nucleotides, thus including all the small ncRNAs analyzed in this study. In addition, to cover lncRNAs and circRNAs, we used another RNAseq data set from GDC (https://portal.gdc.cancer.gov/legacy-archive/files/0f5ba7d3-6f43-44af-9bbc-f9b4c09bbfeb), covering the same breast cancer cell line.
To evaluate similarities and differences in ncRNA identification among the different pipelines, we used the Jaccard similarity coefficient between each pair of pipelines for all small ncRNA classes assessed in this review. Next, we calculated the Pearson correlation matrix on the common small ncRNAs identified by each pipeline to establish their ability to estimate read counts.
miRNAs. Concerning miRNA identification, we observed a high similarity between SPORTS1.0 and sRNApipe (J = 0.94), iSRAP and miARma-Seq (J = 0.9) and, to a lesser extent, SPORTS1.0 and sRNA workbench (J = 0.8), sRNApipe and sRNA workbench (J = 0.78), iSmaRT and sRNApipe (J = 0.77) and iSmaRT and SPORTS1.0 (J = 0.77) (see Jaccard similarities in Figure 5). The Pearson correlation matrix calculated among the miRNAs identified by all the pipelines also showed high correlations in read count estimation for all tools (Supplementary Figure S3).
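The two comparisons described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function names and the toy pipeline labels are hypothetical, and the Pearson correlation is computed here on the ncRNAs reported by all pipelines, which is one possible reading of "common" (restricting to the ncRNAs shared by each pair would be another).

```python
from itertools import combinations
from math import sqrt


def jaccard(a, b):
    """Jaccard coefficient J = |A ∩ B| / |A ∪ B| between two sets of ncRNA IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(detected):
    """detected: {pipeline: iterable of IDs} -> {(pipeline1, pipeline2): J}."""
    return {
        (p, q): jaccard(detected[p], detected[q])
        for p, q in combinations(sorted(detected), 2)
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long count vectors."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("counts have zero variance")
    return sxy / sqrt(sxx * syy)


def pairwise_pearson(counts):
    """counts: {pipeline: {ID: read count}} -> {(pipeline1, pipeline2): r}.

    Assumption: only ncRNAs reported by every pipeline are compared.
    """
    common = sorted(set.intersection(*(set(c) for c in counts.values())))
    if len(common) < 2:
        raise ValueError("at least two common ncRNAs are required")
    return {
        (p, q): pearson([counts[p][k] for k in common],
                        [counts[q][k] for k in common])
        for p, q in combinations(sorted(counts), 2)
    }


if __name__ == "__main__":
    # Toy data with hypothetical pipeline labels and identifiers.
    counts = {
        "pipeline_A": {"x1": 10, "x2": 200, "x3": 35},
        "pipeline_B": {"x2": 180, "x3": 40, "x1": 12, "x4": 7},
    }
    print(pairwise_jaccard(counts))   # dict keys serve as the detected IDs
    print(pairwise_pearson(counts))
```

Because the Jaccard coefficient penalizes every ncRNA reported by only one of the two pipelines, tools that call many FPs obtain low J values even when their counts on the shared ncRNAs correlate well.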

Figure 5. Jaccard similarity matrices calculated among the different tools for miRNAs, piRNAs and snoRNAs identification in the real small RNA-Seq data set.
piRNAs. Concerning piRNAs identification, we observed generally low similarities among the tools, possibly due to the high numbers of FP piRNAs identified by each method. iSmaRT and iSRAP are the most similar ones, although at a low level (J = 0.31) (Figure 5). Concerning piRNAs counts estimation, iSRAP, iSmaRT and sRNAnalyzer show high correlations, while Oasis 2 seems to be less consistent with the others (Figure S4).
snoRNAs. Concerning snoRNAs identification and quantification, iSRAP, miARma-Seq and sRNAnalyzer show very high similarities and correlation. On the other hand, Oasis 2 seems to be less consistent with the other tools both in terms of similarity and read count estimation (Figure 5 and Figure S5).
lncRNAs could be analyzed only with sRNAnalyzer, while circRNAs could be evaluated only with miARma-Seq; therefore, comparative statistics could not be calculated. Nevertheless, we executed both pipelines on the real RNA-Seq data set obtained from GDC and identified 11,715 lncRNAs and 819 circRNAs using sRNAnalyzer and miARma-Seq, respectively.
Discussion
In this study, we performed a systematic comparative performance analysis of the most recent pipelines currently available for the identification of several types of ncRNAs. Specifically, we selected eight recent ncRNAs pipelines which allow the analysis of more than one single ncRNA class. Our aim was to provide guidelines supporting the choice of the most appropriate analysis workflows to the researchers interested in the analysis of ncRNAs.
We evaluated pipeline performances in terms of: (i) ease of installation, (ii) usage, (iii) accuracy in the identification of different ncRNA classes (miRNAs, piRNAs, snoRNAs, tRNA-derived ncRNAs, and lncRNAs), and (iv) accuracy in expression quantification. For this purpose, we built two ad hoc synthetic RNA-Seq data sets, using Flux Simulator [47], enabling pipeline comparison in controlled experiments. The first RNA-Seq data set contains small RNAs, such as miRNAs, piRNAs, snoRNAs and tRNA-derived ncRNAs, while the second one contains long RNAs, covering mRNAs and lncRNAs. Since synthetic data sets might not reflect the complexity of real RNA-seq data sets in terms of RNA species diversity, we also evaluated pipeline performances on real data sets by using publicly available RNA-seq data generated from a breast cancer cell line.
We considered the usability of each pipeline by evaluating setup simplicity, the type of documentation provided, GUI availability, and input and output file formats. Since these aspects are crucial for the widespread adoption of the tools, we highlighted the pros and cons of each pipeline. iSmaRT, miARma-Seq and sRNA workbench are easy to install and come with detailed and exhaustive documentation. iSmaRT, Oasis 2, sRNApipe and sRNA workbench are provided with a GUI, which facilitates their use by inexperienced users, while in SPORTS1.0, all parameters must be passed through the command line, thus limiting its use to more expert users. miARma-Seq and iSRAP are very flexible because they allow the analysis to start at different steps of the pipeline. sRNAnalyzer, iSRAP, Oasis 2, sRNApipe, miARma-Seq and SPORTS1.0 enable the analysis of more than one ncRNA class simultaneously, while iSmaRT can analyze one ncRNA class at a time. iSRAP and SPORTS1.0 require a complex installation process, and iSmaRT, SPORTS1.0 and sRNApipe are not very flexible, with a rigid workflow.
We used Precision, Sensitivity and F-measure to evaluate tool performances on the synthetic data sets. To ascertain count estimation accuracy, we considered TP identifications and compared the estimated counts with the ones included in the simulated data sets. Although no single pipeline excels in the detection of all the different ncRNA classes, each of them appears to be more suitable for the detection of one or another ncRNA type. Overall, iSRAP shows the highest R2 values for several small ncRNA classes (miRNAs = 0.96, piRNAs = 0.83, snoRNAs = 0.81). SPORTS1.0 is the tool showing the best performance in miRNA identification, while sRNAnalyzer outperforms the other pipelines in piRNA and snoRNA detection. Concerning lncRNAs, sRNAnalyzer is the only pipeline enabling their analysis.
SPORTS1.0 is reported to detect different small ncRNA classes (miRNAs, piRNAs, snoRNAs, tRNA-derived ncRNAs, rsRNA) [60]. Although the tool shows excellent performances on miRNAs, it was not possible to evaluate its ability in detecting piRNAs and snoRNAs, because for these two classes of ncRNAs, it currently reports the total number of mapped reads only, without retrieving the specific sequences of the identified piRNAs and snoRNAs. However, we believe that this feature could be easily enhanced in future SPORTS releases.
We highlighted a similar concern in SPORTS1.0 for the detection of tRNA-derived ncRNAs, which represent a very heterogeneous class of ncRNAs deriving from the processing of pre-tRNAs and mature tRNAs. They are classified according to their original location within the tRNA: (i) tRNA-derived small RNAs (tsRNAs), deriving from pre-tRNA; (ii) stress-induced tRNA fragments (tiRNAs) and tRNA-derived fragments (tRFs), coming from mature tRNA [4]. Moreover, tRFs can be further divided into tRF-5 and tRF-3 according to the ribonuclease cleavage site (tRNA D-loop and T-loop, respectively) [4]. We used this annotation, which is extensively used in several databases, such as tRFdb [50] and MINTbase [66], for the implementation of the simulated data set. Although SPORTS1.0 identified several reads mapping to tRNA genes, it does not annotate the types of tRNA-derived ncRNAs. Therefore, it was not possible to identify which tRNA-derived ncRNAs were detected. Several alternative tools have now been developed to detect and count tRNA-derived ncRNAs from RNA-seq data sets, such as tRF2Cancer [67] and MINTmap [68]. They specialize in the analysis of tRNA-derived ncRNAs only and do not analyze other classes of ncRNAs, so they were out of the scope of this review.
sRNApipe is reported to allow for the identification of miRNAs, piRNAs and snoRNAs; however, in our hands, it retrieved only miRNAs and several piRNAs mapped on TEs and protein-coding genes, without reporting which piRNAs were identified, and none of the snoRNAs present in our synthetic small RNA-Seq data set.
All the pipelines were also evaluated on real RNA-seq data. Also in this case, the analysis highlighted how the performances of the pipelines are strictly related to the class of processed ncRNA.
For miRNAs, in both the synthetic and real data sets, all pipelines generally show good performances both in terms of miRNA identification and read count estimation, suggesting that for miRNA analysis, the pipelines have reached good accuracy levels.
On the other hand, for piRNA identification, the pipelines generally showed lower performance. This is probably due to the high numbers of FPs identified by the different pipelines. Similarities in terms of piRNA detection are indeed quite low among all four pipelines tested. The most similar pair, iSmaRT and iSRAP, reaches only a low similarity level (J = 0.31), suggesting that additional refinements are needed for the identification of piRNA species. However, considering the accuracy in count estimation, iSmaRT, iSRAP and sRNAnalyzer yield similar performances. Oasis 2 seems to be less consistent compared to the other tools, similarly to what was observed in the synthetic data set.
Regarding snoRNAs analysis, iSRAP, miARma-Seq and sRNAnalyzer show similar behavior. Read counts estimation suggests that these tools are accurate in the identification and quantification of snoRNAs, with Oasis 2 being less similar to the other pipelines.
Conclusions
With the present work, we highlighted the main features and the critical issues of the currently available pipelines for ncRNA data analysis, with the aim of assisting researchers in choosing the most suitable workflow for the analysis of different ncRNA classes. From this analysis, it clearly emerged that no single tool excels in evaluating all ncRNA species together, but each of them is more specialized in the detection of a specific class.
Future perspectives
Pipelines for the comprehensive analysis of ncRNAs are still lacking, and they are desirable for enabling a more comprehensive analysis of these heterogeneous and important classes of molecules.
Several pipelines have already been developed to identify ncRNAs from RNA-Seq data set. However, many of them present restrictions on the input data size, static workflows, complicated setup and usage.
Eight pipelines were tested in order to evaluate the following three key points: (i) accuracy in ncRNAs identification, (ii) accuracy in read count estimation, and (iii) deployment and ease of use.
From the analysis, it emerged that no single tool excels in analyzing all the different ncRNA species together, but each tool is specialized in the detection of a specific class.
This tool comparison can help researchers in the evaluation of pros and cons of each pipeline and guide them to the choice of the best method for the analysis of the classes of ncRNA they are more interested in.
Additional effort should be made to improve the usability of ncRNA pipelines in order to facilitate their use by the scientific community.
Funding
ALF is supported by the PhD fellowship on Complex Systems for Physical, Socio-economic and Life Sciences funded by the Italian MIUR ‘PON RI FSE-FESR 2014-2020’. SA, AF and AP have been partially supported by the research project ‘Marcatori molecolari e clinico-strumentali precoci, nelle patologie metaboliche e cronico-degenerative’ funded by the Department of Clinical and Experimental Medicine of University of Catania. AP has been also partially supported by the Italian MIUR FFABR grant.
Sebastiano Di Bella is a bioinformatician at Nerviano Medical Sciences. He works on the development of NGS pipelines, in particular on ncRNAs, and computational analysis for biomarkers identification in oncology.
Alessandro La Ferlita is a PhD student in Complex systems for physical socio-economics and life sciences at University of Catania. He works on RNA computational biology, NGS data analysis and development of NGS pipelines.
Giovanni Carapezza is a bioinformatician at Nerviano Medical Sciences. He works on NGS data analysis, pipeline development and on the implementation of methodologies for the identification of biomarkers in oncology.
Salvatore Alaimo is a Researcher in Computer Science at University of Catania. He works on computational pathway analysis and simulation, NGS analysis, methodologies for precision medicine and their translational applications.
Antonella Isacchi is Director of Biotechnology and Kinase Platform Coordinator at Nerviano Medical Sciences. She has more than 20 years’ experience in research and development of novel anticancer drugs and biomarkers.
Alfredo Ferro is a Full Professor in Computer Science at University of Catania. He works on methods and models for computational biology, biomedicine and personalized medicine.
Alfredo Pulvirenti is an Associate Professor of Computer Science at University of Catania. He studies methods for pathway and biological network analysis, drug repositioning, ncRNAs, subgraph matching and motif finding.
Roberta Bosotti is head of Genomics and Bioinformatics group at Nerviano Medical Sciences. Her research interests include application of NGS to translational medicine, biomarker identification and drug repositioning in oncology.