A combined computational pipeline to detect circular RNAs in human cancer cells under hypoxic stress

Abstract Hypoxia is associated with several diseases, including cancer. Cells that are deprived of adequate oxygen supply trigger transcriptional and post-transcriptional responses, which control cellular pathways such as angiogenesis, proliferation, and metabolic adaptation. Circular RNAs (circRNAs) are a novel class of mainly non-coding RNAs, which have been implicated in multiple cancers and attract increasing attention as potential biomarkers. Here, we characterize the circRNA signatures of three different cancer cell lines from cervical (HeLa), breast (MCF-7), and lung (A549) cancer under hypoxia. In order to reliably detect circRNAs, we integrate available tools with custom approaches for quantification and statistical analysis. Using this consolidated computational pipeline, we identify ~12000 circRNAs in the three cancer cell lines. Their molecular characteristics point to an involvement of complementary RNA sequences as well as trans-acting factors in circRNA biogenesis, such as the RNA-binding protein HNRNPC. Notably, we detect a number of circRNAs that are more abundant than their linear counterparts. In addition, 64 circRNAs significantly change in abundance upon hypoxia, in most cases in a cell type-specific manner. In summary, we present a comparative circRNA profiling in human cancer cell lines, which promises novel insights into the biogenesis and function of circRNAs under hypoxic stress.


Supplementary Note 1 Consolidation of approaches to identify circRNAs from rRNA-depleted RNA-Seq data
Several computational tools are available to detect circRNAs from RNA-Seq data (reviewed in (Szabo and Salzman, 2016;Gao and Zhao, 2018). Two recent studies reported a small overlap between predictions when different tools were benchmarked on the same datasets (Hansen et al., 2015;Zeng et al., 2017), suggesting that a combination of multiple algorithms is advisable to obtain a more reliable list of candidate circRNAs. In this study, we sought to apply two established algorithms, find_circ (Memczak et al., 2013) and CIRCexplorer (Zhang et al., 2014) in order to comprehensively profile the circRNA repertoire of different cancer cell lines.
We extensively characterized the predicted circRNA sets to identify the specific strengths and weaknesses of either algorithm. In particular, we tested the different algorithms by find_circ and CIRCexplorer on a single sample from HeLa RNA-Seq data. As described in (Memczak et al., 2013), we filtered find_circ output demanding unique alignments, unambiguous breakpoint, GU/AG splice site signal and a maximum genomic distance of 100 kb between the back-splice sites, obtaining a total of 3,752 circRNAs. Then, we applied a further filter on counted reads, demanding a minimum of two distinct reads supporting the back-splice sites, thus reducing the number of circRNAs to 1,118. Only 37% of them could be detected in at least two replicates with the same algorithm, suggesting a certain level of noise in the find_circ predictions. This is also reflected in a larger fraction of false-positive predictions (see below, Supplementary Figure 2F). The CIRCexplorer output included 3,343 circRNAs, that were further filtered demanding a minimum of two reads supporting the back-splice sites, to get a final list of 2,269 circRNAs (2,193 exonic circRNAs, 76 ciRNAs). 68% of them could be detected in at least one of the other replicates with the same algorithm.
When we compared the final outcomes of find_circ and CIRCexplorer, the two algorithms agreed on the prediction of 903 circRNAs, while further 285 and 1,366 circRNAs were reported exclusively by find_circ and CIRCexplorer, respectively (Supplementary Figure 2A). In addition, for circRNAs detected by both tools, the detected number of reads spanning the back-splicing junctions was highly correlated, but not completely consistent (Supplementary Figure 2B).
Next, we investigated the underlying reasons of the discrepancy in the amount of detected circRNAs. Examining the number of reads supporting the back-splicing junctions, circRNAs detected by a single tool were significantly less abundant than circRNAs identified by both algorithms. However, several circRNAs resulted to be particularly abundant (Supplementary Figure 2C). First, we looked at circRNAs detected only by find_circ (n = 285), finding that 191 (67%) were included in the chimeric alignments by STAR but not reported in CIRCexplorer output, mostly because they originated from unannotated junctions (n = 133).
Then, we investigated circRNAs detected only by CIRCexplorer (n = 1,366) and found that 26 (2%) circRNAs spanned a genomic region larger than 100 kb. Moreover, 24 of them were initially detected by find_circ but removed by the genomic distance filter. We also noticed that 85 circRNAs (6%) showed noncanonical splice site motifs, even though they originated from annotated splice sites. Finally, albeit HeLa cells were taken from a female patient, four circRNAs were annotated on chromosome Y. These circRNAs resided in the pseudoautosomal regions (PAR) which are homologous between chromosomes X and Y.
While the above-mentioned filters only slightly influenced the overlap between results from find_circ and CIRCexplorer, we found that the threshold placed on supporting back-splice reads drastically impacted on the final outcome. find_circ provides two different measures of supporting back-splice reads, counting either all back-splice reads, thus including PCR duplicates, or just unique back-splice reads, which are distinct in their sequence and hence likely arise from independent reverse transcription events. find_circ's expression filter is based on unique back-splice reads to avoid putative PCR artifacts. On the opposite, CIRCexplorer does not count unique back-splice reads. As a consequence, 68% (n = 936) of the CIRC-explorer-only circRNAs were also initially reported by find_circ first but then excluded as putative PCR duplicates as they were supported by only a single unique backsplice read. Consistently, when we compared the CIRCexplorer output to circRNAs found with find_circ when demanding a minimum of two any reads, independently on their sequence (n = 2,513), the overlap between both tools largely increased.
Based on these results, we established a new pipeline to unify the results of both tools and thus obtain a comprehensive catalogue of circRNAs from our RNA-Seq datasets ( Figure 1A).
For each cell type, sequencing reads of the different conditions were merged and mapped to the human genome (version GRCh38/hg38) with Bowtie2 (Langmead and Salzberg, 2012) and STAR (Dobin et al., 2012). Unmapped reads from Bowtie2 were used to detect circRNAs with find_circ as described in (Memczak et al., 2013). Chimeric junctions resulting from STAR alignment were fed into CIRCexplorer. Although keeping only circRNAs that are detected by both tools would increase the reliability of the candidates, this would leave out abundant circRNAs that are missed by one of the two algorithms due to its specific criteria.
For instance, the abundant and hypoxia-regulated circZNF292 (Boeckel et al., 2015) originates from a back-splice site within an intron, and it is thus not identifiable by CIRCexplorer. In a next step, we united the circRNAs identified by either algorithm and then systematically filtered out the detection artifacts described above. In particular, we removed circRNAs with a genomic distance between back-splice sites of more than 100 kb. Regarding the sequence at predicted back-splice sites, we kept circRNAs with canonical splice site motifs (GU/AG) as well as with GC/AG pairs, which represent the most frequent noncanonical splice site motifs (Burset et al., 2000). We finally excluded circRNAs with back-splice junctions that were predicted to reside in non-overlapping genes. For the three analyzed cancer cell lines, this procedure yielded a total of 12,006 predicted circRNAs.
In order to obtain consistent back-splice read count estimates, we used a custom script to recount back-splice reads from STAR's chimeric alignments. We discriminated unique backsplice reads from putative PCR artifacts based on the mapping position rather than read sequence. To detect a circRNA as present in a given cell line, we demanded a minimum of two unique back-splice reads supporting the back-splice junction in at least one sample. For the abundance estimates, unique and non-unique back-splice reads were taken into account.
In order to evaluate the sensitivity and specificity of our pipeline, we tested it on a published dataset from human Hs68 cells, in which circRNAs had been specifically enriched by RNase R digestion followed by RNA-Seq (Jeck et al., 2013). Similar to previous publications (Hansen et al., 2015;Wang et al., 2017;Zeng et al., 2017;Hansen, 2018), we predicted circRNAs in the rRNA-depleted total RNA-Seq data ("total") and compared their back-splice read counts to the levels in the RNase R-treated RNA-Seq data ("RNase R"). In order to compare the performance of our pipeline to the isolated tools, we also ran CIRCexplorer and find_circ separately. For comparability, we applied the same filter criteria for both tools on supporting back-splice reads, i.e. at least two supporting back-splice reads (unique or nonunique) in one of the total RNA samples to keep a circRNA prediction, following the standard CIRCexplorer approach. For find_circ, we additionally tested the detection performance requiring at least two unique supporting back-splice reads in one of the total RNA samples, as suggested in the original publication (Memczak et al., 2013) and proposed in our combined pipeline. For quantification of the circRNAs, raw back-splice read counts were normalized to sequencing depth, and replicates were combined by averaging the normalized counts. As in the previous publications, the derived fold-enrichment values ( ) were subdivided into depleted and RNase R-resistant circRNAs ( , respectively). These were further stratified into <5-fold reduction ( ; strong depletion), 5-1-fold reduction ( ; modest depletion), 1-5-fold increase ( ; modest enrichment), and >5-fold increase ( ; strong enrichment).
RNase R-resistant circRNAs were considered as true-positives in the following evaluation.
We found that our pipeline detected the RNase R-resistant circRNAs with 80% precision (Supplementary Figure 2F). The precision was higher than for find_circ with either settings (66% and 70%) and comparable to CIRCexplorer (80%). Individually run, both tools predicted more circRNAs, reflected in slightly higher absolute numbers of true-and falsepositives. In the case of CIRCexplorer, the higher numbers most likely result from the more lenient requirement of any two back-splice reads (compared at least two unique back-splice reads in our pipeline). In essence, our pipeline combines reliable circRNA predictions from both tools, thereby extending the scope of CIRCexplorer to intronic back-splice sites and higher stringency on unique back-splice reads, while maintaining high precision. Altogether, we conclude that our combined pipeline is well suited to reliably predict circRNAs with high precision.

Supplementary Tables
Supplementary Table 1. Overview of RNA-Seq datasets used in this study.

Supplementary Figures
Supplementary Figure 1 ) for hypoxia-regulated protein-coding genes. Circle diameter corresponds to gene ratio, which reflects the number of regulated genes with a given GO term relative to the total number of regulated genes and the total number of annotated genes associated with this GO term. Circle color indicates significance (hypergeometric test followed by Benjamini-Hochberg correction). Only GO terms with adjusted P value/ q value < 0.05 are shown. (D) Hypoxia-induced alternative splicing is highly divergent between the three cell lines. Venn diagram compares cassette exons with significant difference in inclusion upon hypoxia in the three cell lines ('percent spliced-in', ΔPSI > 10%, false discovery rate < 5%).

Supplementary Figure 2. Characterization of overlaps and discrepancies in circRNA predictions by find_circ and CIRCexplorer. (A)
Venn diagram compares circRNAs predicted by CIRCexplorer and find_circ from RNA-Seq of a representative HeLa sample (normoxia, replicate 1). (B) Comparison of back-splice read counts for circRNAs detected by both CIRCexplorer and find_circ. (C) Box plot shows circRNA quantification by CIRCexplorer and find_circ, discriminating between circRNAs that are in common or predicted by only a single tool. The latter are usually supported by less back-splice reads, with notable exceptions (labeled). (D) Characterization of 285 circRNAs that are only predicted by find_circ (find_circ only). 33% of these circRNAs could not be detected by CIRCexplorer as the respective back-splice junctions were not present in the chimeric alignments from STAR.
(E) Characterization of 1,366 circRNAs that were only predicted by CIRCexplorer (CIRCexplorer only). Criteria like genomic distance of the predicted back-splice sites, splice site motifs, gene annotation, custom adjustments of alignments by CIRCexplorer and supporting back-splice reads contribute to different extent to the discrepancy between both tools. For the remaining 328 circRNAs, the reason remained unclear. Altogether, the results highlight the relevance of filtering based on unique back-splice reads, as done in find_circ pipeline. (F) Evaluation of circRNA predictions from our pipeline compared to CIRCexplorer and find_circ (with two settings, see Supplementary Note 1) based on published RNase Rtreated RNA-Seq data. Predicted circRNAs were categorized into depleted or RNase Rresistant based on fold-enrichment of back-splice read counts in total over RNase R-treated RNA-Seq data. The pipeline performed as good as CIRCexplorer and better than find_circ, both for RNase R-resistant circRNAs in general as well as for highly enriched circRNAs (>5fold). (G) Validation of circularity for the exonic ZNF292 isoform in HeLa cells, similarly to Figure 1D. RT-PCR products using divergent oligonucleotides after polyA(+) selection or RNase R treatment. Oligonucleotides amplifying the linear PLOD2 transcript were used as control.  (GENCODE v24), distinguishing between exons from multi-exon and single-exon genes. (E) A substantial number of host genes produce more than one circRNA isoform. Histogram summarizes number of circRNAs per host gene. We note that the frequency of alternative back-splicing is likely underestimated, as our analysis is solely based on back-splice junctions and does not consider possible internal alternative splicing events. (F) Most host genes produce few predominant circRNA isoforms. Boxplot compares the relative abundance of circRNA isoforms produced from a given gene to the expected frequency based on equal proportions. Genes were stratified by the number of associated circRNA isoforms, grouping genes with more than ten circRNA isoforms. Blue line and dots represent the expected relative abundance computed from total circRNA isoforms from a given gene (