Abstract

Motivation

The ubiquitous abundance of circular RNAs (circRNAs) has been revealed by performing high-throughput sequencing in a variety of eukaryotes. circRNAs are related to some diseases, such as cancer in which they act as oncogenes or tumor-suppressors and, therefore, have the potential to be used as biomarkers or therapeutic targets. Accurate and rapid detection of circRNAs from short reads remains computationally challenging. This is due to the fact that identifying chimeric reads, which is essential for finding back-splice junctions, is a complex process. The sensitivity of discovery methods, to a high degree, relies on the underlying mapper that is used for finding chimeric reads. Furthermore, all the available circRNA discovery pipelines are resource intensive.

Results

We introduce CircMiner, a novel stand-alone circRNA detection method that rapidly identifies and filters out linear RNA sequencing reads and detects back-splice junctions. CircMiner employs a rapid pseudo-alignment technique to identify linear reads that originate from transcripts, genes or the genome. CircMiner further processes the remaining reads to identify the back-splice junctions and detect circRNAs with single-nucleotide resolution. We evaluated the efficacy of CircMiner using simulated datasets generated from known back-splice junctions and showed that CircMiner has superior accuracy and speed compared to the existing circRNA detection tools. Additionally, on two RNase R treated cell line datasets, CircMiner was able to detect most of the consistent, high-confidence circRNAs compared to untreated samples of the same cell line.

Availability and implementation

CircMiner is implemented in C++ and is available online at https://github.com/vpc-ccg/circminer.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Circular RNA (circRNA) is formed when a downstream donor 3′ splice site is covalently joined to an upstream 5′ splice site by a 3′–5′ phosphodiester bond. This event, termed back-splicing, appears to involve the same splicing signals and machinery as canonical mRNA splicing (Starke et al., 2015). circRNAs are frequently transcribed from the same genes as mRNA, and their formation competes with the synthesis of mature mRNA (Ashwal-Fluss et al., 2014). The existence of circRNAs in eukaryotic cells was first reported in 1979 (Hsu and Coca-Prados, 1979). However, they did not garner much attention until 2012, when they were found to be abundantly expressed across the human and mouse genomes (Jeck et al., 2013; Salzman et al., 2012). It was subsequently determined that circRNA expression extended to essentially all eukaryotes (Wang et al., 2014).

The pervasive expression of circRNAs across the eukaryotic domain indicates that they were either conserved over a billion years of evolution or evolved independently in multiple kingdoms, both of which suggest a functional role (Wang et al., 2014). Extensive efforts have since been made to understand the characteristics, function and possible applications of circRNAs (Lei et al., 2019; Memczak et al., 2013). Exon–intron circRNAs, which retain introns, have been observed to modulate the expression of their parent genes (Li et al., 2015b). circRNAs may also be significant in cancer, as some expressed circRNAs have oncogenic or tumor-suppressor functions (Kristensen et al., 2018). For example, in patients with prostate cancer, the level of circRNA expression may predict tumor progression and prognosis (Chen et al., 2019). Fusion circRNAs, transcribed from exons originating from distinct genes, have been found to promote oncogenesis and confer resistance to therapy (Guarnerio et al., 2016).

Back-splicing is far less efficient than canonical splicing, providing a probable explanation for the low steady-state levels at which most circRNAs are expressed (Guo et al., 2014; Zhang et al., 2016b). Nonetheless, circRNAs are more stable than their linear counterparts and accumulate within cells, which allows widespread detection (Zhang et al., 2016b). Moreover, circRNAs are found to be abundant and stable in human blood exosomes (Li et al., 2015a) and circulating blood cells (Maass et al., 2017; Memczak et al., 2015). As a result of their stable structure, they have a longer half-life in cell-free samples (such as blood and urine). This generates an opportunity for circRNAs to be used as cancer biomarkers from non-invasive liquid biopsies (de Fraipont et al., 2019; Li et al., 2015a; Memczak et al., 2015).

Despite their abundant expression, circRNAs were largely overlooked (prior to 2012) due to the selection for poly(A) tails in the preparation of most RNA sequencing (RNA-Seq) libraries, leading to circRNA depletion (Szabo and Salzman, 2016). Nowadays, there is a much greater emphasis on biochemical protocols such as ribosomal RNA (rRNA) depletion and poly(A) depletion in non-coding RNA studies to preserve circRNAs. The discovery of widespread circRNAs also revealed shortcomings in existing methods, given some circRNAs would inevitably survive poly(A) selection (Szabo and Salzman, 2016). This spurred the development of novel computational detection tools aimed specifically at back-splice junction detection including CIRI (Gao et al., 2015)/CIRI2 (Gao et al., 2018), CIRCexplorer (Zhang et al., 2014)/CIRCexplorer2 (Zhang et al., 2016a), KNIFE (Szabo et al., 2015), circRNA_finder (Westholm et al., 2014), find_circ (Memczak et al., 2013), DCC (Cheng et al., 2016), PTESFinder (Izuogu et al., 2016), NCLscan (Chuang et al., 2016), MapSplice (Wang et al., 2010), segemehl (Hoffmann et al., 2014) and CircMarker (Li et al., 2018). However, comparisons show their output diverges dramatically with no tool having a clear advantage (Hansen et al., 2016; Zeng et al., 2017). These tools can be classified into two categories based on the way they determine chimeric reads: (i) candidate-based discovery tools (e.g. CIRCexplorer) require existing gene annotations that contain candidate junctions and (ii) de novo (segment-based) methods (e.g. segemehl) can detect circRNAs without using gene annotation and are able to identify unannotated splice sites (Chen et al., 2015; Jeck and Sharpless, 2014).

All the available circRNA detection tools depend on general-purpose high-throughput sequencing (HTS) mappers such as BWA and STAR with segemehl as the only exception. Methods from both categories require significant computational resources (time and memory). Moreover, the detection of chimeric reads (e.g. back-splice junction reads) from RNA-Seq is challenging and has a dramatic effect on the sensitivity of the circRNA detection methods. A general-purpose mapper, such as STAR, is mainly designed to map linear mRNA reads back to transcriptome because most of the analysis involving RNA-Seq relies on expression estimation. In recent years, a new generation of mappers, such as Kallisto (Bray et al., 2016) and RapMap (Srivastava et al., 2016), have been developed that apply pseudo-alignment and quasi-mapping techniques on the transcriptome, respectively. They have a higher speed than STAR as they only focus on transcriptome compatible reads. However, these tools to this date are not splice-aware, i.e. they only locate reads on the transcriptome and are unable to identify reads on the genome.

We introduce CircMiner, which is designed to provide higher sensitivity in the detection of circRNAs while reducing computational resource requirements, and which does not depend on other HTS mappers. CircMiner is a novel candidate-based circRNA detection method employing splice-aware pseudo-alignment to quickly filter linear reads and report chimeric reads potentially related to aberrations, such as back-splicing. CircMiner’s detection module examines the reported chimeric reads and identifies the reads that can generate back-splice junctions. Finally, it identifies the high-confidence back-splice junctions and reports them. We evaluated CircMiner on both simulated and real datasets. On simulated datasets, CircMiner achieves the best time, recall and F1 score while providing a similar precision to the other discovery tools. We further evaluated the performance of these tools on two RNase R treated cell-line datasets. RNase R is used to digest linear mRNAs; thus, the treated sample will have an abundance of circRNAs. In this comparison, CircMiner was able to detect most of the consistent circRNAs between treated and untreated samples. This suggests that CircMiner has a lower false positive rate compared to most of the other existing tools. It is worth mentioning that CircMiner can report full alignments upon user request, at the cost of additional processing time.

2 Materials and methods

CircMiner has two main modules: (i) pseudo-alignment and linear read filtration module and (ii) back-splice junction detection module. CircMiner first rapidly filters those reads which fully map to linear transcripts or the genome using the pseudo-alignment module. The remaining partially mapped reads would be further analyzed using the back-splice junction detection module to extract the potential back-splice junction reads. For each such read, CircMiner maps its unmapped part to the surrounding loci within the same transcript and determines the breakpoints of the back-splice junctions. The back-splice junction reads are then grouped together based on their breakpoints within the same transcripts. Finally, circRNAs are reported along with the breakpoints, transcript information and supporting reads spanning the back-splice junction. These steps are illustrated in Supplementary Figure S1. Furthermore, Figure 1 depicts the overview of the read processing for linear and circular reads.

Fig. 1. Back-splicing versus canonical splicing. Linear reads (on the left) are eliminated in the pseudo-alignment stage and back-splice junction reads (on the right) are further processed in the circRNA detection stage

2.1 Rapid splice-aware read filtration

CircMiner starts by rapidly discarding reads that do not provide any signal for aberrations, i.e. reads that originate from known transcripts or the reference genome and therefore, could be mapped using splice-aware mappers. For a given read, CircMiner partitions it into non-overlapping segments, also known as seeds. However, this restriction can easily be loosened to consider overlapping seeds. For each seed, CircMiner queries exact matching positions on the genome from its index. Next, CircMiner tries to find the compatible chains of these genomic locations on the transcriptome and genome. Note that properly mapped reads can be identified if the relative order and distances of the seeds on the read are consistent with the order and distances on the transcriptome or genome. Finally, CircMiner goes through each compatible chain and tries to extend the chain to full alignment. Reads whose chains are successfully extended to full alignments are eliminated from further processing. In the following subsections, we will provide details of read filtration module.

2.1.1 Indexing

CircMiner builds a hybrid index that consists of two parts: (i) genome index and (ii) transcriptome index. For the genome index, CircMiner uses a modified version of mrsFAST-Ultra’s (Hach et al., 2014) index, which is a hash table of k-mer prefixes. Each entry of the table, keyed by a prefix of length p (a p-mer), keeps a sorted list of the genomic k-mers (and their coordinates) that start with that prefix. Querying a given k-mer from this index requires $O(\log|P|)$ time, where P is the set of locations on the genome whose k-mers share the same p-mer prefix as the query. This query is done in two steps: (i) an $O(1)$ lookup of the p-mer prefix is performed on the hash table, which returns P sorted lexicographically by sequence content and then by genomic location; and (ii) a binary search requiring $O(\log|P|)$ time on the sequences returns all the matches. In practice, p is smaller than k but large enough that memory usage for the index is reasonable for contemporary computers. For example, if k is 19 bp, then a prefix of size 14 is reasonable for generating a fast lookup table with reasonable memory usage. More details about the genome index are provided in Supplementary Material.
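The two-step query described above can be illustrated with a small sketch. This is not CircMiner's C++ implementation: the function names `build_index` and `query` are hypothetical, the genome is a single in-memory string, and a Python dictionary stands in for the prefix hash table.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(genome, k, p):
    """Map each p-mer prefix to its k-mers, sorted by sequence then position."""
    table = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        kmer = genome[pos:pos + k]
        table[kmer[:p]].append((kmer, pos))
    for bucket in table.values():
        bucket.sort()  # lexicographic by k-mer, ties broken by coordinate
    return table


def query(table, kmer, p):
    """Step (i): O(1) bucket lookup. Step (ii): binary search in the bucket."""
    bucket = table.get(kmer[:p], [])
    lo = bisect_left(bucket, (kmer, -1))
    hi = bisect_right(bucket, (kmer, float("inf")))
    return [pos for _, pos in bucket[lo:hi]]


index = build_index("ACGTACGTAC", k=4, p=2)
print(query(index, "ACGT", p=2))
```

With k=19 and p=14 as in the example from the text, each bucket holds only the k-mers that share a 14-base prefix, so the binary search runs over a short list.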

To carry out pseudo-alignment at the transcriptome level, CircMiner uses a secondary index to keep the gene model and transcriptome information from an annotation file (e.g. GTF file). CircMiner builds an interval tree from the exons of the transcripts that are obtained from the gene model. The interval tree is queried to see if a given genomic interval (e.g. k-mer) is on a gene and if so, to obtain all the transcripts that overlap with this genomic location. Each search query on the interval tree requires  O(log|T|) time with |T| being the number of segments in the interval tree T. Note that many k-mers fall in intergenic regions, and thus there is no need to query them. To reduce the number of interval tree lookups, CircMiner augments the interval tree with a simple, compact yet powerful bitwise array that corresponds to all the intergenic locations on the genome. If a location on the genome is intergenic, the value of that location in the array will be 0. In practice, the array reduces the total number of lookups on the interval tree almost 3-fold.
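A minimal sketch of the bit-array pre-check follows, under simplifying assumptions: the class name `ExonLookup` and its methods are hypothetical, all coordinates are half-open intervals on a single chromosome, and a linear scan stands in for the interval tree query. The counter `tree_lookups` only illustrates how intergenic queries bypass the slower lookup.

```python
class ExonLookup:
    """Bit array of genic positions placed in front of an exon lookup."""

    def __init__(self, genome_length, genes, exons):
        # genes: list of (start, end); exons: list of (start, end, transcript_id).
        # All intervals are half-open [start, end) on one chromosome.
        self.exons = list(exons)
        self.bits = bytearray((genome_length + 7) // 8)  # bit 0 = intergenic
        for start, end in genes:
            for pos in range(start, end):
                self.bits[pos >> 3] |= 1 << (pos & 7)
        self.tree_lookups = 0  # counts how often the slower lookup runs

    def is_genic(self, pos):
        return bool(self.bits[pos >> 3] & (1 << (pos & 7)))

    def transcripts_at(self, start, end):
        # Cheap pre-check: skip the lookup for fully intergenic intervals.
        if not any(self.is_genic(p) for p in range(start, end)):
            return set()
        self.tree_lookups += 1
        # A linear scan stands in for the interval tree query here.
        return {t for s, e, t in self.exons if s < end and start < e}


lookup = ExonLookup(100, genes=[(10, 50)],
                    exons=[(10, 20, "T1"), (30, 50, "T1"), (30, 40, "T2")])
print(lookup.transcripts_at(32, 36), lookup.transcripts_at(60, 70))
```

One bit per genomic position keeps the array compact, and the check costs a few bit operations compared with an $O(\log|T|)$ tree traversal.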

2.1.2 Optimal k-mer chaining

CircMiner partitions a given read $r$ into non-overlapping k-mers with starting positions $S=(s_i)_{i=1}^{n}$, where $n=\lfloor |r|/k \rfloor$, $s_i$ is the starting position of the $i$-th k-mer on the read and $s_i<s_{i+1}$. For the seed starting at $s_i$, the set of its exact matches on the genome, $L_i=\{l_{i1},l_{i2},\ldots,l_{iz}\}$, is extracted from the index, where $l_{ij}$ is the starting position of its $j$-th match. A valid chaining $C=\{S',\mathcal{L}'\}$ of $r$ can therefore be defined by: (i) $S'=(s'_i)_{i=1}^{m}$, a sublist of $S$ containing $m$ seeds ($m\le n$) and (ii) $\mathcal{L}'=\{l_{a_1},\ldots,l_{a_m}\}$, where $l_{a_j}$ is an exact match of the $j$-th seed of $S'$ and the matches preserve the relative order of the seeds on the same chromosome. CircMiner scores a chain $C$ as follows:

$$\mathrm{Score}(C)=\alpha\times m\times k-\beta\times\sum_{i=1}^{m-1}\left|\mathrm{Dist}(l_{a_{i+1}},l_{a_i})-(s'_{i+1}-s'_i)\right|$$

where $\alpha$ and $\beta$ are reward and penalty constant coefficients in the chaining algorithm and $\mathrm{Dist}(l_{a_{i+1}},l_{a_i})$ is the distance between the starting positions of two consecutive k-mers of $S'$ on the transcriptome or genome.

In the Score function, $m\times k$ is the total length of matched bases from the read to the reference, which is used as a reward for chaining to encourage building longer chains. On the other hand, the penalty is the aggregated difference between the seeds’ distances on the read and the distances of their corresponding matching locations on the genome or transcriptome. For calculating Dist between two consecutive seeds from $S'$, when both seeds are fully located within the exons of the same transcript, CircMiner discards the introns that fall into $[l_{a_i},l_{a_{i+1}}]$ to get the accurate distance on the transcriptome. If multiple transcripts contain the seeds, the transcript resulting in the smallest distance is considered. For intronic and intergenic seed locations, genomic distance is used in the Dist function. In both cases, these differences are most likely representative of indels. Since we do not expect too many indels, only a few base pair differences are allowed to account for small indels that happen between two consecutive seeds. More specifically, a user-defined parameter limits the maximum allowed difference. If the difference is greater than this threshold, Dist is set to $\infty$.

In the chaining process, some seeds are ignored due to having too many seed locations. Seed limit (SL) is a user-defined parameter in CircMiner, indicating the maximum number of locations a seed is allowed to have. Any seed whose number of occurrences on the reference genome is greater than SL will be discarded from the chaining stage. Changing the value of SL can affect the pseudo-alignment module’s power in filtering linear reads as well as CircMiner’s running time. A practical value for this parameter is empirically chosen to be 500 (see Section 3.1 for more details). CircMiner calculates scores for all chains of $r$ using a dynamic programming approach. The top $t$ chains ($t=30$ by default) with the highest chaining score are stored in $Q_f$ for further processing in the next steps. CircMiner repeats the same procedure for the reverse complement of the read and stores the resulting chains in $Q_r$.
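The chaining step can be sketched as a quadratic dynamic program over seed hits. This is an illustrative simplification rather than CircMiner's algorithm as implemented: the function name `best_chain` and the parameter `max_diff` are hypothetical, Dist is taken as plain genomic distance on one chromosome (no intron removal), and only the single best chain is returned instead of the top $t$.

```python
def best_chain(seeds, k, alpha=1.0, beta=1.0, max_diff=5, seed_limit=500):
    """seeds: list of (read_start, [genomic_starts]) ordered by read_start.

    Returns (score, chain) where chain is a list of (read_start, genomic_start).
    """
    # Seeds with more than seed_limit (SL) locations are ignored.
    hits = [(s, l) for s, locs in seeds if len(locs) <= seed_limit for l in locs]
    if not hits:
        return 0.0, []
    score = [alpha * k] * len(hits)  # a chain of one seed earns alpha * k
    prev = [None] * len(hits)
    for j, (sj, lj) in enumerate(hits):
        for i in range(j):
            si, li = hits[i]
            if si >= sj or li >= lj:
                continue  # order on read and reference must agree
            diff = abs((lj - li) - (sj - si))
            if diff > max_diff:
                continue  # corresponds to Dist being set to infinity
            cand = score[i] + alpha * k - beta * diff
            if cand > score[j]:
                score[j], prev[j] = cand, i
    end = max(range(len(hits)), key=score.__getitem__)
    best = score[end]
    chain = []
    while end is not None:
        chain.append(hits[end])
        end = prev[end]
    return best, chain[::-1]


seeds = [(0, [100, 500]), (4, [104]), (8, [109, 900])]
print(best_chain(seeds, k=4))
```

In this toy input the chain through positions 100, 104 and 109 covers three seeds and pays a penalty for a single-base gap difference, while the hits at 500 and 900 cannot be linked within `max_diff`.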

2.1.3 Extension of high-scoring chains

In general, popular short read aligners (genomic or transcriptomic) obtain a potential alignment of a read by finding a substring on the genome or transcriptome, which has the smallest edit distance that is less than a user-defined threshold e. In doing so, they sometimes utilize heuristics to clip the ends of the reads (low-quality bases or unmappable in close vicinity) and generate soft clipped reads.

Since k-mer chaining narrows down such potential mapping locations, CircMiner can efficiently check if a chain can be extended into a full or clipped alignment by (i) extending the missing terminal k-mers using a prefix or suffix alignment; and (ii) using local banded alignment for the missing middle k-mers in $S\setminus S'$. Note that CircMiner will extract the corresponding transcript segments for the extension step if the chain is located on a transcript.

For any dataset that contains only single-end reads, any read that is not fully extended within a user-defined error threshold will be considered for further processing. If the dataset contains paired-end reads, the extension is modified in order to take advantage of pair information. Consider $Q_1=Q_{1f}\cup Q_{1r}$ and $Q_2=Q_{2f}\cup Q_{2r}$ to be the top candidate chains for mate one and mate two, respectively. CircMiner favors transcriptome mappings over genomic mappings. In doing so, it considers all pairs of chains, one from $Q_1$ and one from $Q_2$, where one of the chains is on the forward strand (F) and the other on the reverse strand (R). CircMiner then classifies the pairs with respect to the following priorities: (i) transcript-compatible: all seeds from both mates are located on the same transcript; (ii) gene-compatible: all seeds are located on different transcripts of the same gene; (iii) fusion: the seeds of the two mates are on transcripts of two different genes; and (iv) genome-compatible: the seeds are located in intergenic regions, and the distance from the leftmost seed to the rightmost seed is not greater than a user-defined distance D (default value: 20 000 bp). If a pair cannot be placed into any of the above categories, it will be discarded. For the remainder of the pairs, the extension will be performed within the selected region (e.g. transcript-compatible pairs will only be extended within transcript boundaries).
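The priority order above can be sketched as a small classifier. The `Chain` record and `classify_pair` function are hypothetical names, and two simplifications are assumed: both chains lie on the same chromosome, and any pair that is not placed in categories (i) to (iii) is tested only against the span limit D, without separately checking that every seed is intergenic.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    strand: str                        # "F" or "R"
    start: int                         # leftmost seed position on the genome
    end: int                           # rightmost seed position on the genome
    transcripts: frozenset = frozenset()
    genes: frozenset = frozenset()


def classify_pair(c1, c2, max_span=20_000):
    """Return the highest-priority category for a pair of mate chains."""
    if c1.strand == c2.strand:
        return None                    # one mate must be F and the other R
    if c1.transcripts & c2.transcripts:
        return "transcript-compatible"
    if c1.genes & c2.genes:
        return "gene-compatible"
    if c1.genes and c2.genes:
        return "fusion"
    if max(c1.end, c2.end) - min(c1.start, c2.start) <= max_span:
        return "genome-compatible"
    return None                        # pair is discarded


a = Chain("F", 1000, 1100, frozenset({"T1"}), frozenset({"G1"}))
b = Chain("R", 1400, 1500, frozenset({"T1", "T2"}), frozenset({"G1"}))
print(classify_pair(a, b))
```

Checking the categories in this fixed order is what makes transcriptome mappings take precedence over genomic ones.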

2.1.4 Extraction of back-splice signals

A back-splice junction read does not have a (co-)linear chain that covers the full length of the read. Thus, it may have one or two chains in the top scoring chains that represent suffix and/or prefix of read. The goal of this stage is to identify and extract these reads where (i) one mate is fully extended and the other mate has a full prefix or suffix extension on the transcript; or (ii) both mates have a full prefix or suffix extension (Supplementary Figure S2). Specifically, any read that is fully extended on a transcript and has a proper orientation (FR mapping) will be marked as concordant and removed from further processing. A fully extended read on a transcript with incorrect orientation (RF mapping) is a potential non-spanning (not covering the breakpoint) supporting read for back-splice junctions and is retained for circRNA refinement. If a read has a partial extension on the transcript that is a prefix or suffix of the read, then that read has a potential back-splice signal and will be marked for the next stage. Finally, if there is no full or partial extension available for the read on any transcript, but it has a full extension on the categories (ii), (iii) and (iv), then it will be discarded from further processing. Note that, the information for the discarded reads, including their mapping information is recorded and can be extended to full alignment information upon user request and at an additional cost.
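The decision rules of this stage can be summarized in a short sketch. The function `triage` and its string labels are hypothetical, and the per-read evidence is reduced to three inputs. Cases the text does not spell out (for example, a read with no extension anywhere) are returned as "unspecified" rather than guessed.

```python
def triage(transcript_extension, orientation, full_elsewhere=False):
    """Decide what happens to a read pair after chain extension.

    transcript_extension: "full", "partial" (prefix or suffix only) or "none",
        describing the best extension on a transcript.
    orientation: "FR" (proper) or "RF" (reversed mate order).
    full_elsewhere: True if the read extends fully under the gene-compatible,
        fusion or genome-compatible categories.
    """
    if transcript_extension == "full":
        if orientation == "FR":
            return "concordant"            # removed from further processing
        return "non-spanning-support"      # kept for circRNA refinement
    if transcript_extension == "partial":
        return "back-splice-candidate"     # passed to the detection module
    if full_elsewhere:
        return "discarded"
    return "unspecified"


print(triage("full", "RF"), triage("partial", "FR"))
```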

Fig. 2. Recall and precision of read mapping by CircMiner for different values (100, 500, 1000, 5000) of the SL parameter

2.2 CircRNA detection

Due to the low abundance of circRNAs, usually <5% of the reads in an untreated RNA-Seq dataset remain as potential back-splice junction candidates (Salzman et al., 2013; Szabo and Salzman, 2016). In order to detect circRNA breakpoints, for each candidate back-splice junction read, CircMiner examines if the unmapped suffix (or prefix) of the read can be aligned upstream (or downstream) on the same transcript. CircMiner groups all the candidate back-splice junction reads from the same gene and processes them together as follows. For any gene with at least one back-splice junction read, it creates a hash table with a smaller k-mer size (8 bp) for higher sensitivity. For simplicity, consider the unmapped substring of the read as U and the mapped substring as M. CircMiner extracts overlapping seeds (shifted by 3 bp) of size k from U, and chains the seeds. It then performs extension on the top scoring chains (top five chains) as described in the previous section to obtain all the mappings of U. CircMiner keeps the mappings of U with minimum edit distance. If the total combined edit distance of the mappings of U and M is below the error threshold (e), CircMiner evaluates the mappings of U and M. If the mappings of U and M form a (co-)linear mapping of the read, the read is discarded. Otherwise, if the mappings of U and M form a back-splice junction read, the breakpoint is recorded for the next stage of clustering. Note that if there are multiple mappings of U with the same edit distance, CircMiner selects the formation that is consistent with the splice sites of the transcript. Finally, the back-splice junction mappings are clustered based on their breakpoints. For each circRNA, the genomic location of the detected breakpoints along with the supporting back-splice junction reads and the transcript information is reported.
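A minimal sketch of the final decision and clustering follows. The function names, the half-open forward-strand coordinates and the default `max_edits=4` are assumptions made for illustration. The sketch treats the error threshold as inclusive and does not model the splice-site tie-breaking described above.

```python
from collections import defaultdict


def back_splice_breakpoint(prefix_map, suffix_map):
    """Each argument is the (start, end) mapping of one part of the read.

    prefix_map belongs to the 5' part of the read, suffix_map to the 3' part.
    Returns (circ_start, circ_end) for a back-splice arrangement, else None.
    """
    (p_start, p_end), (s_start, s_end) = prefix_map, suffix_map
    if s_start >= p_end:
        return None                    # co-linear: suffix lies downstream
    if s_start < p_start and s_end < p_end:
        return (s_start, p_end)        # suffix maps upstream of the prefix
    return None                        # ambiguous arrangement, not reported


def cluster_junctions(reads, max_edits=4):
    """reads: {read_id: (prefix_map, suffix_map, combined_edit_distance)}."""
    clusters = defaultdict(list)
    for read_id, (prefix_map, suffix_map, edits) in reads.items():
        if edits > max_edits:
            continue                   # combined edit distance too high
        breakpoint_pair = back_splice_breakpoint(prefix_map, suffix_map)
        if breakpoint_pair is not None:
            clusters[breakpoint_pair].append(read_id)
    return dict(clusters)


reads = {
    "r1": ((900, 950), (200, 250), 1),
    "r2": ((910, 950), (200, 260), 0),
    "r3": ((100, 150), (300, 350), 0),   # co-linear
    "r4": ((900, 950), (200, 250), 9),   # too many edits
}
print(cluster_junctions(reads))
```

Reads r1 and r2 share the same pair of breakpoints, so they end up in one cluster that would be reported as a single circRNA with two supporting reads.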

3 Results

To evaluate the efficacy of the two modules in CircMiner, we designed two sets of experiments. First, we simulated datasets containing only linear mRNA reads. We used these datasets to calculate the positive predictive value (PPV, precision) and sensitivity (recall) of the pseudo-alignment module against the state-of-the-art splice-aware mapper, STAR v2.6.0a (Dobin et al., 2013). In the next experiment, we used simulated and real datasets to compare the circRNA detection module of CircMiner with several popular circRNA detection tools. All experiments were performed on a server running CentOS 7 equipped with 64 CPU cores (Intel® Xeon® Gold 6130 CPU @ 2.10 GHz) with two threads per core and 720 GB RAM. Note that, in all the experiments, Ensembl version GRCh38.90 is used as the human genome reference and gene annotation.

3.1 Robustness of linear read filtration

We first simulated RNA-Seq datasets containing linear mRNA reads to assess PPV and recall of the pseudo-alignment and linear read filtration module. To simulate realistic datasets, we downloaded three RNA-Seq datasets from the NCBI Sequence Read Archive (accession numbers: SRR3146803, SRR3146859 and SRR3146914), and obtained their expression profiles using kallisto v0.44.0 (Bray et al., 2016). We excluded low-abundance transcripts from these expression profiles. Finally, we generated five paired-end datasets (2 × 100 bp reads with an insert size of 350 ± 75 bp) per profile using ART Illumina simulator v2.5.8 (Huang et al., 2012) and the Illumina HiSeq 2000 error model. Note that the number of reads simulated per isoform is taken from the expression profile. We included multiple datasets for each gene expression profile in order to increase the diversity of junction reads so that we can test the robustness of our splice-aware pseudo-alignment module.

In these datasets, the origin of all the reads is known; thus, we can evaluate the precision (PPV) and recall of the pseudo-alignment module. As mentioned in Section 2, the SL parameter has an effect on the PPV and recall of CircMiner. In the first experiment, we ran CircMiner with four different values of SL (100, 500, 1000 and 5000) and calculated (i) time, and (ii) PPV and recall with respect to the ground truth.

Our pseudo-alignment module is able to map 99% of the reads in all three datasets as demonstrated in Supplementary Figure S3. A read is considered to be correctly mapped if the start and end positions of its mapping are within a few base pairs (10 bp) of the actual ground truth provided by ART. Table 1 and Figure 2 show the required time, precision and recall of CircMiner with different values of SL. For all datasets, the average value over five replicates is shown in Table 1 and Figure 2. The standard deviations of precision and recall across five replicates are <0.2% for all SL values, suggesting the robustness of our pseudo-alignment module in detecting coordinates of aligned reads. As SL increases so does the total running time of CircMiner, and although it improves precision, recall and F1 score, the effect is not considerable. In addition, precision drops slightly in some cases when greater SL values are selected. We selected SL=500 as default value in CircMiner since it maintains a good balance between speed and mapping accuracy. This value is used in the rest of experiments explained in Section 3.

Fig. 3. Precision and recall of CircMiner and other existing circRNA detection methods on simulated datasets. C1 and DC1 on the left. C2 and DC2 on the right. An arrow connects positive control dataset results to diluted dataset results. F1 score and detailed information are presented in Supplementary Table S4

Table 1. Time comparison of CircMiner with different SL values and STAR

| Sample | Read count | SL=100 | SL=500 | SL=1000 | SL=5000 |
|---|---|---|---|---|---|
| S1 | 23 460 612 | 20:46 | 21:49 | 23:30 | 28:14 |
| S2 | 25 944 090 | 23:29 | 24:53 | 26:44 | 33:41 |
| S3 | 33 138 471 | 23:51 | 24:57 | 25:24 | 29:01 |

Note: Time format: (mm:ss).


We then compared CircMiner’s results with the state-of-the-art splice-aware RNA-Seq mapper, STAR. In Table 2, we report the time usage, memory usage, precision, recall and F1 score of STAR versus CircMiner on all three simulated datasets. As shown in Table 2, CircMiner requires ∼1/2 of the time and ∼2/5 of the memory of STAR while the accuracy is similar: CircMiner has a slightly better F1 score in two of the three datasets and STAR has a slightly better F1 score in the remaining one (S1). Although CircMiner is not designed to be a full RNA-Seq mapper, the results show that it is comparable to STAR. The lower accuracy in S1 can be explained by multi-mapping. Since CircMiner does not have all the features of a typical mapper and only reports one mapping location per read, it is possible that for multi-mapping reads, CircMiner reports one of the secondary mappings that have identical error and soft-clip values. To test the multi-mapping hypothesis, we compared our pseudo-alignment results with every multi-mapping location reported by STAR. The new recall, precision and F1 score are shown in Supplementary Table S1. The pseudo-alignment module’s recall and precision in dataset S1 go up by >2.5% when we also consider alternate locations of multi-mapped reads as true positives. This indicates that for a mappable read, even though CircMiner might not perfectly locate its true origin, in many cases it already provides the best result other splice-aware mappers can achieve. The feature for reporting mapping locations is disabled in CircMiner by default because it increases running time and is not needed when the user is only interested in circRNA detection. However, the user can enable this feature.

Table 2. Performance and time comparison between CircMiner (SL=500) and STAR on linear simulation datasets

| Dataset | Metric | CircMiner | STAR |
|---|---|---|---|
| S1 | Recall^a (%) | 95.15 | 97.23* |
| S1 | Precision^b (%) | 96.19 | 97.27* |
| S1 | F1 score^c | 0.96 | 0.97* |
| S1 | Time^d (mm:ss) | 21:06* | 41:01 |
| S1 | Memory^d (GB) | 11.31* | 27.76 |
| S2 | Recall (%) | 97.84* | 97.43 |
| S2 | Precision (%) | 98.96* | 97.53 |
| S2 | F1 score | 0.98* | 0.97 |
| S2 | Time (mm:ss) | 24:30* | 44:01 |
| S2 | Memory (GB) | 11.31* | 27.76 |
| S3 | Recall (%) | 97.40* | 96.23 |
| S3 | Precision (%) | 98.46* | 96.28 |
| S3 | F1 score | 0.98* | 0.96 |
| S3 | Time (mm:ss) | 24:57* | 54:52 |
| S3 | Memory (GB) | 11.31* | 27.76 |

a. Recall is defined as number of correctly mapped reads/read count.

b. Precision is defined as number of correctly mapped reads/total number of mapped reads.

c. F1 score is calculated as $2\times\frac{\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$.

d. Running time and memory usage are measured using the time -v Unix command.

Note: In each row, the better value (higher recall, precision and F1 score; lower time and memory usage) is marked with an asterisk.

CircMiner is around 2 times faster and 2.5 times more memory efficient compared to STAR. It also achieves higher recall, precision and F1 score in two out of three samples.
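The F1 scores and the speed and memory ratios quoted for Table 2 can be re-derived from the reported recall, precision, time and memory values. The short sketch below does this arithmetic; the helper names `f1_score` and `to_seconds` are illustrative and not part of any tool.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    p, r = precision / 100.0, recall / 100.0
    return 2 * p * r / (p + r)


def to_seconds(mm_ss):
    """Convert a 'mm:ss' string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return 60 * int(minutes) + int(seconds)


# (recall %, precision %) as reported in Table 2.
table2 = {
    "S1": {"CircMiner": (95.15, 96.19), "STAR": (97.23, 97.27)},
    "S2": {"CircMiner": (97.84, 98.96), "STAR": (97.43, 97.53)},
    "S3": {"CircMiner": (97.40, 98.46), "STAR": (96.23, 96.28)},
}

for dataset, tools in table2.items():
    for tool, (recall, precision) in tools.items():
        print(dataset, tool, round(f1_score(precision, recall), 2))

# Time and memory ratios for S1 (STAR relative to CircMiner).
print(to_seconds("41:01") / to_seconds("21:06"), 27.76 / 11.31)
```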


The most important task of this module is to find and keep the reads that can show signals for a back-splice junction. Note that these datasets do not contain any circRNA reads, and ideally, all simulated reads should be filtered out by the pseudo-alignment module. circRNA detection tools that rely on STAR use the chimeric reads that it reports. Thus, we extracted and calculated the portion of the reads reported as chimeric by STAR and CircMiner. Both tools report <0.5% of the reads as chimeric for further processing in all datasets. More details are provided in Supplementary Table S2. The proportion of chimeric reads reported by CircMiner is higher. The reason behind this is that CircMiner is more sensitive in detecting candidate back-splice junction reads (see Supplementary Table S6); thus, it keeps any read with a partial mapping on a single transcript. However, some of them may be the result of exon skipping or intron retention events. This is not an issue for STAR because it tries to address such events during the full mapping.

3.2 Simulation of circRNA data

To simulate circRNA reads, we used a modified version of CIRI-simulator (Gao et al., 2015) that accepts a set of back-splice junctions, a gene annotation file and a reference as input and generates RNA-Seq reads that support the given circRNAs and their junctions. To obtain realistic circRNA back-splice junctions, we utilized CIRCpedia v2 (Dong et al., 2018), a circRNA annotation database covering various tissues and cell types from different species. To evaluate the detection module, we passed the following two junction sets to CIRI-simulator: (i) 110 128 human back-splice junctions from CIRCpedia that are consistent with the Ensembl GRCh38.90 annotation and (ii) 1000 back-splice junctions randomly selected from set (i). CIRI-simulator then generated two paired-end datasets, C1 and C2, with 4 585 051 and 39 964 reads, respectively (2 × 101 bp reads with insert size 350 ± 75 and a 1% sequencing error rate). We confirmed that each back-splice junction is supported by at least two reads in both datasets. These datasets are denoted positive controls because they contain only reads from circRNAs.

Furthermore, to better mimic real RNA-Seq datasets, in which linear and circRNA reads exist side by side, we diluted each simulated dataset with a background set of simulated linear mRNA reads. The dilution ratios for samples DC1 and DC2 are 1:19 and 1:99, corresponding to 5% and 1% circRNA reads in the diluted samples (Table 3). The background sets for both samples were generated using ART v2.5.8, based on the gene expression profile of SRR3146803, as 2 × 101 bp reads with insert size 350 ± 75 under the Illumina HiSeq 2000 error model. We combined the background sets with the positive controls to obtain the final diluted datasets.
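The arithmetic behind the dilution ratios can be illustrated with a short Python sketch. The helper names `background_reads_needed` and `circ_percentage` are hypothetical and were introduced only for this example; the read counts in the last two lines are the totals listed in Table 3.

```python
def background_reads_needed(circ_reads, circ_parts, linear_parts):
    """Linear reads to mix in for a nominal circ:linear dilution ratio."""
    return circ_reads * linear_parts // circ_parts


def circ_percentage(circ_reads, total_reads):
    """Percentage of circRNA reads in a diluted sample."""
    return 100.0 * circ_reads / total_reads


# Nominal ratios: 1:19 and 1:99 correspond to 5% and 1% circRNA reads.
print(circ_percentage(1, 1 + 19))
print(circ_percentage(1, 1 + 99))

# Read counts from Table 3 (positive control reads / diluted sample reads).
print(round(circ_percentage(4_585_051, 92_261_457), 2))  # C1 in DC1
print(round(circ_percentage(39_964, 4_078_974), 2))      # C2 in DC2
```

With the Table 3 counts, the circRNA fractions come out at about 4.97% for DC1 and 0.98% for DC2, close to the nominal 5% and 1%.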

Table 3.

CircRNA and read count in simulation datasets

| Name | Number of circRNAs | Number of reads | Type |
| --- | --- | --- | --- |
| C1 | 110 128 | 4 585 051 | Positive control |
| DC1 | 110 128 | 92 261 457 | Diluted of C1 (5%) |
| C2 | 1000 | 39 964 | Positive control |
| DC2 | 1000 | 4 078 974 | Diluted of C2 (1%) |

Note: The number of embedded circRNAs and reads per positive control and diluted sample for each of the simulation sets.

Using the positive controls and diluted datasets, we benchmarked the circRNA detection performance of CircMiner, CIRI2 (Gao et al., 2018), KNIFE (Szabo et al., 2015), CIRCexplorer (Zhang et al., 2014), CIRCexplorer2 (Zhang et al., 2016a), circRNA_finder (Westholm et al., 2014), find_circ (Memczak et al., 2013), DCC (Cheng et al., 2016), PTESFinder (Izuogu et al., 2016), NCLscan (Chuang et al., 2016), segemehl (Hoffmann et al., 2014) and CircMarker (Li et al., 2018). CIRCexplorer, CIRI and KNIFE are considered the best-performing circRNA detection tools that work on top of existing mappers (Zeng et al., 2017). segemehl is a multi-split mapper that also supports back-splice junction detection. Finally, the recently published CircMarker was selected as one of the first k-mer counting circRNA detection methods.

Figure 3 illustrates the precision and recall of CircMiner and the other methods on the simulated C1/DC1 and C2/DC2 datasets. For each method, an arrow connects the precision and recall on the positive control to those on the diluted dataset. All methods except CircMiner, CIRCexplorer, PTESFinder and NCLscan showed a drop in precision on the diluted dataset, and the drop was largest for segemehl. Supplementary Table S3 shows that CircMiner and CIRCexplorer2 report no false positives on both negative datasets, which contain only simulated linear mRNA reads. CIRCexplorer and PTESFinder also performed well on the negative datasets, with at most one false positive reported per dataset. This suggests that these methods are likely to have a lower false positive rate on real datasets, where both circRNA and mRNA are present. Among the robust methods, CircMiner has the best recall, which is >13% higher than that of CIRCexplorer and 3% higher than that of CIRCexplorer2. Overall, CircMiner is the best-performing method on the simulated data: it has the highest recall and F1 score and detects the largest number of circRNAs with high precision.
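For readers who wish to reproduce this kind of comparison, the following Python sketch shows one way to score a tool's calls against the simulated back-splice junctions. It is a minimal illustration, not the evaluation script used in this study: the function name `evaluate_calls`, the representation of a junction as a `(chrom, start, end)` tuple and the requirement of an exact coordinate match are assumptions. The minimum of two supporting reads follows the filter applied in our comparisons.

```python
def evaluate_calls(truth, calls, min_support=2):
    """Score reported back-splice junctions against a truth set.

    truth: set of (chrom, start, end) tuples for the simulated junctions
    calls: dict mapping (chrom, start, end) -> number of supporting reads
    Returns (tp, fp, fn, precision, recall).
    """
    # Keep only calls with enough supporting back-splice junction reads.
    kept = {bsj for bsj, support in calls.items() if support >= min_support}
    tp = len(kept & truth)   # reported and simulated
    fp = len(kept - truth)   # reported but not simulated
    fn = len(truth - kept)   # simulated but missed (or filtered out)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(truth) if truth else 0.0
    return tp, fp, fn, precision, recall


# Hypothetical junctions and support counts, for illustration only.
truth = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90)}
calls = {("chr1", 100, 200): 5, ("chr1", 300, 400): 1, ("chr3", 10, 20): 3}
print(evaluate_calls(truth, calls))
```

In this toy example the junction at chr1:300-400 is discarded because it has a single supporting read, so it counts as a false negative rather than a true positive.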

CircMiner also outperforms state-of-the-art methods in running time. Figure 4 compares running time and memory usage on the diluted samples (DC1 and DC2) and shows that CircMiner runs faster than the other circRNA detection tools on both datasets. CircMiner was two times faster than CIRCexplorer, the second fastest tool on the simulated data. Both steps of the method contribute to this speed: the pseudo-alignment step rapidly filters out the linear reads, and the detection step processes only the marked split reads left over from the previous step, producing the final circRNA report after removing false positives. In addition, CircMiner requires 11 GB of memory on all datasets, regardless of read depth. On the smaller dataset (DC2), find_circ has the lowest memory consumption. CIRI, PTESFinder and CircMarker also perform well on DC2 in terms of memory usage; however, their memory usage rises on the dataset with higher read depth. CircMiner’s memory consumption is stable because it loads only the index into memory and is therefore unaffected by the size of the sequencing input file. As shown in Figure 4, on the high read depth dataset (DC1), CircMiner is among the top three memory-efficient tools, which indicates that it can run on a personal computer even for large datasets.

Fig. 4.

Running CPU time and memory usage comparison of CircMiner and other methods on diluted datasets (DC1, DC2). CIRCexplorer2 and NCLscan did not finish within 100-h time limit so they were tested using 10 cores. Other methods were tested using a single core. CircMiner is the fastest circRNA detection method in both datasets. In terms of memory consumption, find_circ has the best performance. CIRI, PTESFinder and CircMarker are also among the memory-efficient tools on smaller dataset (DC2); however, they do not scale well with the size of the dataset as in DC1 they require high memory. So, in the larger dataset (DC1), find_circ, CIRCexplorer2 and CircMiner show the best memory consumption.

3.3 Real data

Our experiment on real data uses two pairs of 2 × 101 bp Illumina RNA-Seq datasets from two cell lines (HeLa and Hs68) from previous studies (Gao et al., 2015; Jeck et al., 2013). For each cell line, one dataset is a typical RNA-Seq sample sequenced using an rRNA depletion protocol, and the other is a treated sample in which circRNAs are enriched after most linear RNA molecules are digested and removed by the RNase R enzyme. We denote the former by HeLa RNase R− and Hs68 RNase R−, and the latter by HeLa RNase R+ and Hs68 RNase R+, respectively. More information regarding the real data is provided in Table 4. Unlike the simulation datasets, there is no ground truth of expressed circRNAs for these two cell lines. Therefore, our evaluation of each tool is mainly based on the agreement rate of detected back-splice junctions before and after enrichment for each cell line, considering that most circRNAs expressed in a cell line are expected to remain in its treated sample.

Table 4.

Information about 2 pairs of 2 × 101 bp RNA-Seq data from cell lines

| Cell line | RNase R− SRA accession | RNase R− # reads | RNase R+ SRA accession | RNase R+ # reads |
| --- | --- | --- | --- | --- |
| HeLa | SRR1637089, SRR1637090 | 80 618 760 | SRR1636985, SRR1636986 | 36 815 458 |
| Hs68 | SRR444975 | 206 362 733 | SRR445016 | 199 922 486 |

Note: SRA accession number and read count per real sample.

For each tool, a circRNA detected in the RNase R− dataset is considered a not-depleted call if its normalized support (the number of back-splice junction reads divided by the total number of reads in the dataset) does not decrease in the treated RNase R+ dataset. Moreover, we consider a circRNA an enriched call if its normalized support increases at least fivefold in the treated RNase R+ dataset compared to the RNase R− dataset. As in the simulation experiment, only reported circRNAs with at least two supporting back-splice junction reads are considered in the comparison. Table 5 shows the number of calls from both treated and untreated datasets by each tool on the Hs68 and HeLa cell lines. The number of not-depleted calls and their percentage (not-depleted calls / RNase R− calls) are also provided in this table. In addition, we calculated the number of not-depleted as well as enriched calls among the top 10 and top 100 most abundant calls of each tool in the untreated RNase R− dataset. Even though true circRNAs are not always enriched and detectable in the treated sample, we expect a robust tool to report mostly enriched calls among its top high-confidence results in the treated datasets.

CircMiner has the highest not-depleted percentage in the Hs68 dataset, which indicates the lowest false positive rate. In the HeLa cell line, CircMiner has the second best not-depleted percentage after CIRCexplorer. In the top 100 calls of the Hs68 dataset, CircMiner shows the best results, and in the top 10 calls, CircMiner and NCLscan share the best performance among all methods. CIRCexplorer, CircMiner, KNIFE and CIRI have close results in the top 10 and top 100 calls of the HeLa dataset.

We further investigated the 11 enriched circRNAs among the top 100 calls of CIRCexplorer in the HeLa dataset. This higher count is explained by the fact that CIRCexplorer generally reports lower support for its circRNAs, so they have a higher chance of reaching a normalized fivefold increase in support after RNase R treatment. Other methods usually report these 11 circRNAs in their top 100 calls as not-depleted. For example, CIRCexplorer2 (its successor), CircMiner and CIRI all report high support for their calls and contain all 11 events as not-depleted in their top 100 calls. Nonetheless, the normalized ratio of support does not always reach fivefold: CIRCexplorer2, for instance, reports only 4 of the 11 as enriched, and CircMiner reports 6. Comparing support before and after treatment revealed that the support reported by CIRCexplorer on the post-treatment dataset is even smaller than the pre-treatment support reported by CIRCexplorer2, CircMiner and CIRI. More details are provided in Supplementary Table S7.
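The not-depleted and enriched criteria can be expressed compactly in Python. This is a hedged sketch rather than the script used for Table 5: the function name `classify_call` is hypothetical, and applying the 0.05 tolerance mentioned in the note to Table 5 directly to the fold change is an assumption about how that threshold is used. The library sizes in the usage lines are the HeLa read counts from Table 4, while the support counts are invented for illustration.

```python
def classify_call(support_minus, total_minus, support_plus, total_plus, tol=0.05):
    """Classify a circRNA call detected in the RNase R- dataset.

    support_*: back-splice junction read counts before (-) and after (+) treatment
    total_*:   total read counts of the two datasets
    tol:       tolerance on the fold change (assumed usage, see lead-in)
    Returns (not_depleted, enriched).
    """
    if support_minus <= 0:
        raise ValueError("the call must be supported in the RNase R- dataset")
    norm_minus = support_minus / total_minus
    norm_plus = support_plus / total_plus
    fold = norm_plus / norm_minus
    not_depleted = fold >= 1.0 - tol  # normalized support did not decrease
    enriched = fold >= 5.0 - tol      # at least a fivefold increase
    return not_depleted, enriched


# HeLa library sizes from Table 4; support counts are hypothetical.
HELA_MINUS, HELA_PLUS = 80_618_760, 36_815_458
print(classify_call(10, HELA_MINUS, 10, HELA_PLUS))  # same raw support
print(classify_call(10, HELA_MINUS, 30, HELA_PLUS))  # threefold raw support
print(classify_call(10, HELA_MINUS, 2, HELA_PLUS))   # support drops sharply
```

Because the HeLa RNase R+ library is less than half the size of the RNase R− library, an unchanged raw read count already corresponds to roughly a twofold increase in normalized support, which is why normalization matters for this comparison.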

Table 5.

Detailed comparison of circRNA detection tools on Hs68 and HeLa cell lines in terms of detected not-depleted and enriched circRNAs

| Cell line | Tool | RNase R− calls | RNase R+ calls | Not-depleted | Not-depleted (%) | Top 10 not depleted (enriched) | Top 100 not depleted (enriched) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Hs68 | CircMiner | 4907 | 26 411 | 3459 | 70.49 | 8(6) | 80(62) |
| | CIRCexplorer | 2919 | 21 878 | 1992 | 68.24 | 7(5) | 78(56) |
| | CIRCexplorer2 | 5439 | 31 913 | 3566 | 65.56 | 7(6) | 80(60) |
| | KNIFE | 4120 | 23 014 | 2798 | 67.91 | 7(5) | 77(60) |
| | CIRI | 6074 | 33 748 | 3831 | 63.07 | 6(5) | 76(61) |
| | segemehl | 67 650 | 150 788 | 11 063 | 16.35 | 0(0) | 6(1) |
| | CircMarker | 5789 | 24 433 | 3159 | 54.57 | 4(3) | 67(44) |
| | circRNA_finder | 1867 | 15 358 | 1088 | 58.28 | 6(5) | 65(51) |
| | DCC | 4476 | 26 812 | 2616 | 58.45 | 5(4) | 67(54) |
| | find_circ | 5068 | 27 208 | 2773 | 54.72 | 4(3) | 59(44) |
| | PTESFinder | 2057 | 12 535 | 1292 | 62.81 | 7(5) | 80(62) |
| | NCLscan | 1404 | 11 309 | 904 | 64.39 | 8(6) | 75(61) |
| HeLa | CircMiner | 6117 | 7425 | 2990 | 48.88 | 8(0) | 75(7) |
| | CIRCexplorer | 2952 | 4748 | 1489 | 50.44 | 9(0) | 76(11) |
| | CIRCexplorer2 | 7691 | 8163 | 2950 | 38.36 | 5(0) | 70(6) |
| | KNIFE | 5068 | 6157 | 2348 | 46.33 | 8(1) | 75(8) |
| | CIRI | 7303 | 8949 | 3390 | 46.42 | 7(0) | 75(7) |
| | segemehl | 14 479 | 21 403 | 2579 | 17.81 | 1(0) | 31(1) |
| | CircMarker | 7332 | 6996 | 2647 | 36.10 | 1(0) | 49(2) |
| | circRNA_finder | 2471 | 3017 | 912 | 36.91 | 3(0) | 41(0) |
| | DCC | 5345 | 6253 | 2233 | 41.78 | 4(0) | 69(7) |
| | find_circ | 7017 | 6715 | 2362 | 33.66 | 6(0) | 50(3) |
| | PTESFinder | 2963 | 3097 | 992 | 33.48 | 6(0) | 67(5) |
| | NCLscan | 2235 | 2998 | 992 | 44.38 | 5(1) | 70(9) |

Note: Only circRNA calls with at least two supporting back-splice junction reads are considered. The number and ratio of not-depleted circRNAs are reported, as well as the number of not-depleted and enriched circRNAs among the top 10 and top 100 calls with the highest support in the RNase R− dataset. A not-depleted circRNA is one whose normalized support value in RNase R− does not decrease after treatment in the RNase R+ dataset. The normalized support value of an enriched circRNA increases at least fivefold from RNase R− to RNase R+. Since normalized ratios are floating-point numbers, a threshold of 0.05 is used when counting not-depleted and enriched circRNAs to mitigate floating-point error.

In addition, CircMiner provides significant advantages in terms of total running time and memory consumption over the other tested methods. Total running time, which includes the mapping and circRNA detection stages for all tools, is shown in Supplementary Figure S4. On all the real datasets, CircMiner has the lowest running time, which is a fraction of the time used by some other methods. CircMiner’s memory consumption is also among the best three methods and remains stable, that is, it does not rise as the size of the input grows.

4 Discussion

CircMiner is a fast and sensitive tool for circRNA detection from RNA-Seq data. Here, we demonstrated that CircMiner is faster and more memory-efficient than other popular circRNA detection tools. Unlike many circRNA detection tools, CircMiner is not an extension on top of a splice-aware mapper but was developed as a stand-alone method. Its pseudo-alignment module is fast: it pseudo-aligns 1 million 2 × 100 bp RNA-Seq reads to the human genome in a minute on average. The read filtration step of CircMiner reports any read that cannot be mapped to the transcriptome in a linear fashion. These chimeric reads, which CircMiner selects quickly, include back-splice junction reads and fusion reads. The candidate fusion reads marked by CircMiner are the subset of reads that are potentially related to gene fusion events, and they can be fed directly into a fusion caller to report any potential gene fusions. The accuracy of CircMiner’s read mapping is comparable with that of state-of-the-art splice-aware mappers such as STAR. The detection step is also quick, since only a small portion of the reads remains for further processing of back-splice junctions; this step usually takes a few minutes for a dataset with tens of millions of reads. CircMiner achieved the best recall and F1 score on simulated samples. It also performed very well in detecting circRNAs from RNase R treated real samples in concordance with the non-treated sample from the same cell line.

Although CircMiner performs well in back-splice junction detection, it is not designed for full-length circRNA reconstruction. circRNAs usually contain only a few exons and therefore usually have short sequences, so long reads might be useful for capturing the full length of circRNAs. Designing a method that uses long reads for full-length reconstruction of circRNAs could be impactful for further analysis of circRNAs, especially in cancer patients. One future direction is to expand CircMiner’s ability to detect fusion circRNAs. The existence of fusion circRNAs has been reported in previous studies (Guarnerio et al., 2016). However, detecting this class of circRNAs is challenging because of the high potential for false positives. This is partly due to homologous genes (You and Conrad, 2016) and requires careful design and consideration.

Acknowledgements

We would like to thank Hossein Sharifi-Noghabi, Tunc Morova, Baraa Orabi and Fatih Karaoglanoglu for their valuable suggestions during the preparation of the manuscript.

Funding

This work is funded in part by the Natural Sciences and Engineering Research Council (NSERC) discovery grant (to F.H.) and NSERC CREATE program (to H.A.).

Conflict of Interest: none declared.

References

Ashwal-Fluss R. et al. (2014) circRNA biogenesis competes with pre-mRNA splicing. Mol. Cell, 56, 55–66.

Bray N.L. et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol., 34, 525–527.

Chen I. et al. (2015) Biogenesis, identification, and function of exonic circular RNAs. Wiley Interdiscip. Rev. RNA, 6, 563–579.

Chen S. et al. (2019) Widespread and functional RNA circularization in localized prostate cancer. Cell, 176, 831–843.e22.

Cheng J. et al. (2016) Specific identification and quantification of circular RNAs from sequencing data. Bioinformatics, 32, 1094–1096.

Chuang T.-J. et al. (2016) NCLscan: accurate identification of non-co-linear transcripts (fusion, trans-splicing and circular RNA) with a good balance between sensitivity and precision. Nucleic Acids Res., 44, e29.

de Fraipont F. et al. (2019) Circular RNAs and RNA splice variants as biomarkers for prognosis and therapeutic response in the liquid biopsies of lung cancer patients. Front. Genet., 10, 390.

Dobin A. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 15–21.

Dong R. et al. (2018) CIRCpedia v2: an updated database for comprehensive circular RNA annotation and expression comparison. Genomics Proteomics Bioinformatics, 16, 226–233.

Gao Y. et al. (2015) CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biol., 16, 4.

Gao Y. et al. (2018) Circular RNA identification based on multiple seed matching. Brief. Bioinform., 19, 803–810.

Guarnerio J. et al. (2016) Oncogenic role of fusion-circRNAs derived from cancer-associated chromosomal translocations. Cell, 165, 289–302.

Guo J.U. et al. (2014) Expanded identification and characterization of mammalian circular RNAs. Genome Biol., 15, 409.

Hach F. et al. (2014) mrsFAST-ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Res., 42, W494–W500.

Hansen T.B. et al. (2016) Comparison of circular RNA prediction tools. Nucleic Acids Res., 44, e58.

Hoffmann S. et al. (2014) A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection. Genome Biol., 15, R34.

Hsu M.T., Coca-Prados M. (1979) Electron microscopic evidence for the circular form of RNA in the cytoplasm of eukaryotic cells. Nature, 280, 339–340.

Huang W. et al. (2012) ART: a next-generation sequencing read simulator. Bioinformatics, 28, 593–594.

Izuogu O.G. et al. (2016) Ptesfinder: a computational method to identify post-transcriptional exon shuffling (PTES) events. BMC Bioinformatics, 17, 31.

Jeck W.R., Sharpless N.E. (2014) Detecting and characterizing circular RNAs. Nat. Biotechnol., 32, 453–461.

Jeck W.R. et al. (2013) Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA, 19, 141–157.

Kristensen L.S. et al. (2018) Circular RNAs in cancer: opportunities and challenges in the field. Oncogene, 37, 555–565.

Lei B. et al. (2019) Circular RNA: a novel biomarker and therapeutic target for human cancers. Int. J. Med. Sci., 16, 292–301.

Li Y. et al. (2015a) Circular RNA is enriched and stable in exosomes: a promising biomarker for cancer diagnosis. Cell Res., 25, 981–984.

Li Z. et al. (2015b) Exon-intron circular RNAs regulate transcription in the nucleus. Nat. Struct. Mol. Biol., 22, 256–264.

Li X. et al. (2018) CircMarker: a fast and accurate algorithm for circular RNA detection. BMC Genomics, 19, 572.

Maass P.G. et al. (2017) A map of human circular RNAs in clinically relevant tissues. J. Mol. Med., 95, 1179–1189.

Memczak S. et al. (2013) Circular RNAs are a large class of animal RNAs with regulatory potency. Nature, 495, 333–338.

Memczak S. et al. (2015) Identification and characterization of circular RNAs as a new class of putative biomarkers in human blood. PLoS One, 10, e0141214.

Salzman J. et al. (2012) Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One, 7, e30733.

Salzman J. et al. (2013) Cell-type specific features of circular RNA expression. PLoS Genet., 9, e1003777.

Srivastava A. et al. (2016) RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics, 32, i192–i200.

Starke S. et al. (2015) Exon circularization requires canonical splice signals. Cell Rep., 10, 103–111.

Szabo L., Salzman J. (2016) Detecting circular RNAs: bioinformatic and experimental challenges. Nat. Rev. Genet., 17, 679–692.

Szabo L. et al. (2015) Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol., 16, 126.

Wang K. et al. (2010) MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res., 38, e178.

Wang P.L. et al. (2014) Circular RNA is expressed across the eukaryotic tree of life. PLoS One, 9, e90859.

Westholm J.O. et al. (2014) Genome-wide analysis of drosophila circular RNAs reveals their structural and sequence properties and age-dependent neural accumulation. Cell Rep., 9, 1966–1980.

You X., Conrad T.O. (2016) ACFS: accurate circRNA identification and quantification from RNA-seq data. Sci. Rep., 6, 1–11.

Zeng X. et al. (2017) A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput. Biol., 13, e1005420.

Zhang X.-O. et al. (2014) Complementary sequence-mediated exon circularization. Cell, 159, 134–147.

Zhang X.-O. et al. (2016a) Diverse alternative back-splicing and alternative splicing landscape of circular RNAs. Genome Res., 26, 1277–1287.

Zhang Y. et al. (2016b) The biogenesis of nascent circular RNAs. Cell Rep., 15, 611–624.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Arne Elofsson

Supplementary data