Patrick Sorn, Christoph Holtsträter, Martin Löwer, Ugur Sahin, David Weber, ArtiFuse—computational validation of fusion gene detection tools without relying on simulated reads, Bioinformatics, Volume 36, Issue 2, January 2020, Pages 373–379, https://doi.org/10.1093/bioinformatics/btz613
Abstract
Gene fusions are an important class of transcriptional variants that can influence cancer development and can be predicted from RNA sequencing (RNA-seq) data by multiple existing tools. However, the real-world performance of these tools is unclear due to the lack of known positive and negative events, especially with regard to fusion genes in individual samples. Often simulated reads are used, but these cannot account for all technical biases in RNA-seq data generated from real samples.
Here, we present ArtiFuse, a novel approach that simulates fusion genes by modifying the genomic reference sequence and can therefore be applied to any RNA-seq dataset without the need for any simulated reads. We demonstrate our approach on eight RNA-seq datasets for three fusion gene prediction tools: average recall values peak for all three tools between 0.4 and 0.56 for high-quality and high-coverage datasets. As ArtiFuse affords total control over involved genes and breakpoint position, we also assessed performance with regard to gene-related properties, showing a drop in recall for low-expressed genes in high-coverage samples and for genes with co-expressed paralogues. Overall tool performance assessed from ArtiFusions is lower than previously reported estimates on simulated reads. Because it uses real RNA-seq datasets, we believe that ArtiFuse provides a more realistic benchmark that can be used to develop more accurate fusion gene prediction tools for application in clinical settings.
ArtiFuse is implemented in Python. The source code and documentation are available at https://github.com/TRON-Bioinformatics/ArtiFusion.
Supplementary data are available at Bioinformatics online.
1 Introduction
Cancer is a class of diseases caused by changes in the genome (i.e. mutations) of individual cells (Hanahan and Weinberg, 2000) allowing those cells to proliferate in an uncontrolled manner and eventually invade other body parts. Among the genomic changes, which include point mutations and small insertions and deletions, are also larger structural variants, which are characterized by chromosomal rearrangements and can result in so-called fusion genes. Here, the open reading frames (ORFs) of two genes are merged, giving rise to a fused gene and possibly a fusion protein, which might have new properties enhancing the cancerous behavior of the mutant cells. A well-known historical example is the BCR-ABL fusion (Nowell and Hungerford, 1960), caused by a translocation between chromosomes 22 and 9, the so-called Philadelphia chromosome. Discovery of the BCR-ABL fusion culminated in the development of the tyrosine-kinase inhibitor Imatinib, which at the time was considered a ‘magic bullet’ against cancer (Deininger et al., 2005), pioneering the field of stratified targeted medicine (Britten et al., 2013). Today, Imatinib and its successors are first-line treatment choices for BCR-ABL-positive leukemia patients. Another example of a recurring fusion gene for which targeted therapies are available is the EML4-ALK fusion in non-small-cell lung cancer (Solomon et al., 2014).
Parallel to the clinical development of drugs targeting specific fusion genes, the now-widespread use of next-generation sequencing (NGS) and RNA sequencing (RNA-seq) has enabled the high-throughput screening of patient cohorts (Hu et al., 2018) and the discovery of diverse fusion genes (Gao et al., 2018). However, reliable detection of fusion genes using NGS data remains a challenge.
A recent study compared different fusion gene prediction tools and evaluated their performance for both simulated NGS datasets (including simulated fusion transcripts) and real datasets with 44 previously validated fusion events (Kumar et al., 2016). On simulated data, eight out of nine tools showed high recall values between 44% and 84%, with positive predictive values (PPV) reaching up to 100%. However, for the real datasets, observed recall values were much lower, with 6 out of 12 tools failing to report any previously validated fusion gene, and sensitivities ranging from 2% to 70% for the remaining tools. At the same time, the overall number of predicted fusion events was often very high, which is indicative of a much lower PPV for real data.
Simulated data are not able to reflect the challenges of analyzing real NGS data. Moreover, prediction results containing validated fusion genes are relatively rare. To date, only two of three tools for fusion gene prediction that were used in a recent pan-cancer screening (Gao et al., 2018) were tested on real RNA-seq samples. Of the predicted fusion genes, only 24 of 787 predicted by STAR-Fusion (Haas et al., 2017) and 25 of 489 predicted by EricScript (Benelli et al., 2012) were validated (Edgren et al., 2011).
Algorithms for simulating short-read NGS data try to faithfully recreate error sources for short reads (Bruno et al., 2013; Tan et al., 2015); however, this is limited to error sources that are known and can be modeled. Recent advances in sequencing technology, e.g. SMRT sequencing (Levene et al., 2003), nanopore sequencing (Norris et al., 2016) or technology upgrades such as the Illumina NovaSeq, are not supported without updates to the error models. Furthermore, experimental conditions (e.g. polyA enrichment versus exome capture RNA-seq) and features of the input material (e.g. formalin-fixed versus fresh frozen) may not be modeled. Thus, simulated read datasets cannot represent the full complexity of RNA-seq data from real biological samples.
Here, we present a complementary approach to the problem of testing software tools for the prediction of fusion genes from NGS data. Most available tools rely on a well-annotated reference genome and aim to find read groups showing non-regular, inconsistent mapping patterns (split or junction reads and discordant or spanning read pairs). These may indicate the presence of a fusion gene. As an alternative to the simulation of such reads, we propose modifications of the reference genome sequence and the respective annotation that make wild-type genes appear as fusion genes. As a result of this modification, reads aligned against such a reference will exhibit the non-regular, inconsistent mapping pattern that is expected for reads originating from fusion genes. By creating these artificial fusion events (ArtiFusions), one can precisely control the number of known fusion transcripts that could result from an analysis of an RNA-seq dataset using the modified reference genome. The main advantage is that any available RNA-seq dataset can now be used for such analyses. The wealth of publicly available data from repositories like the SRA, TCGA or ENA allows testing of many real-life variabilities in experimental protocol, sequencing protocol or sample type and origin, without the need to rely on purely simulated reads or a limited number of validated real fusion genes.
2 Materials and methods
2.1 Implementation
ArtiFuse requires as input a reference genome assembly sequence in FASTA format and a gene model in BED format. The BED file is a more compact representation of the information contained in a gene model file in general transfer format (GTF) and can, for example, be generated from the GTF file using the tool ‘gtfToGenePred’ (hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/). Both the FASTA and the GTF files are available from the Ensembl website for many species (https://www.ensembl.org/info/data/ftp/index.html).
Furthermore, a comma-separated value file listing pairs of genes and relative breakpoint positions to define artificial fusion breakpoints is necessary as input. Contrary to existing approaches, ArtiFuse introduces these artificial breakpoints by modifying the reference genome sequence. Thereby a modified FASTA file containing an altered genome sequence according to the defined gene pairs and breakpoints is generated (Fig. 1), which can then be used by the fusion gene prediction tools.

Fig. 1. Modifying the reference genome sequence makes wild-type transcripts appear like fusions during alignment. (A) Different mRNAs, including one derived from a fusion gene (indicated by gray and red), are shown. Differences between wild-type and fusion genes occur during data analyses when paired reads are aligned to the reference sequence (indicated as Gene A–D). For wild-type genes, all reads map consistently across the entire ORF (Gene A and B), but for fusion genes inconsistencies occur at the breakpoints (Gene C and D). (B) Mapping of reads to fusion gene breakpoints is shown in more detail: on the wild-type reference sequence, reads at the breakpoint map partially (split reads) and read pairs map discordantly to two loci (discordant pairs). These mapping inconsistencies are used to predict fusion genes, which, if predicted correctly, can resolve these as spanning pairs and junction reads. (C) To introduce the same mapping inconsistencies that identify fusion genes into the reference sequence, the sequence downstream of a defined breakpoint in Gene A is copied downstream of a defined breakpoint in Gene B and replaced in Gene A with ‘N’s. When mapping to this modified reference, fusion gene prediction tools will identify a fusion event from Gene A–B at the defined breakpoints. We call these artificially introduced events ArtiFusions.
Generation of ArtiFusions starts by splitting the sequence of the ORF from Gene A at the exon boundary that is closest to a defined relative breakpoint (Fig. 1C). The sequence upstream of that position remains unchanged, while the rest of the sequence is introduced downstream of an exon boundary into the locus of Gene B. The copied sequence in Gene A is overwritten with ‘N’s to prevent mapping. When transcriptome sequencing reads from Gene A are mapped to the altered reference sequence, the same mapping inconsistencies that identify fusion genes, such as split reads and discordant read pairs, will occur at the artificially introduced breakpoint (Fig. 1C).
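The procedure described above can be illustrated with a minimal sketch on a toy sequence. It is not the ArtiFuse implementation: the function names, the dictionary layout of the gene records, the use of 0-based coordinates on a single forward-strand sequence, the interpretation of the relative breakpoint as a fraction of gene length from the gene start, and the choice to insert (rather than overwrite) sequence in Gene B are all assumptions made for illustration.

```python
def nearest_exon_boundary(exon_ends, gene_start, gene_end, rel_pos):
    """Return the exon boundary closest to a relative position (0..1) in the gene."""
    target = gene_start + rel_pos * (gene_end - gene_start)
    return min(exon_ends, key=lambda pos: abs(pos - target))


def make_artifusion(genome, gene_a, gene_b, rel_a, rel_b):
    """Move the part of gene A behind its breakpoint into the locus of gene B.

    Toy model (assumption): one forward-strand sequence, no strand handling.
    """
    a_break = nearest_exon_boundary(
        gene_a["exon_ends"], gene_a["start"], gene_a["end"], rel_a)
    b_break = nearest_exon_boundary(
        gene_b["exon_ends"], gene_b["start"], gene_b["end"], rel_b)

    moved = genome[a_break:gene_a["end"]]
    # Overwrite the moved part of gene A with 'N' so reads cannot map there.
    # Masking keeps the sequence length, so b_break is still a valid coordinate.
    masked = genome[:a_break] + "N" * len(moved) + genome[gene_a["end"]:]
    # Insert the moved sequence directly after the chosen boundary in gene B.
    modified = masked[:b_break] + moved + masked[b_break:]
    return modified, a_break, b_break


# Hypothetical toy data: gene A (0-12), a spacer, gene B (16-28).
genome = "AAAACCCCGGGG" + "TTTT" + "CATCATTAGTAG"
gene_a = {"start": 0, "end": 12, "exon_ends": [4, 8]}
gene_b = {"start": 16, "end": 28, "exon_ends": [22]}

modified, a_break, b_break = make_artifusion(genome, gene_a, gene_b, 0.6, 0.5)
print(modified)  # AAAACCCCNNNNTTTTCATCATGGGGTAGTAG
```

In this toy case, reads from the 3' part of gene A ("GGGG") can only align inside the locus of gene B, which is the mapping inconsistency a fusion detection tool is expected to pick up.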
With ArtiFuse it is possible to generate numerous artificial breakpoints within one reference genome. All introduced changes, including the modified sequences and the resulting artificial breakpoints as genomic positions, are listed in a summary file. These ArtiFusions can then serve as positive controls when testing fusion gene detection algorithms. As each one inherits all properties of the gene defined as Gene A, it is possible to control expression level, known orthologues and all other gene-related properties of the ArtiFusion by selecting the gene accordingly, which allows for numerous test scenarios. All artificial breakpoints will match known exon boundaries.
ArtiFuse is implemented in Python. The source code and documentation are available at https://github.com/TRON-Bioinformatics/ArtiFusion.
2.2 Selection of candidate genes and generation of ArtiFusions
To demonstrate the utility of ArtiFuse in comparing fusion detection tools, we created ArtiFusions of several gene pairs. Candidate genes were selected based on parameters in in-house generated RNA-seq data from the MCF7 cell line: expression level, length and known orthologues. To this end, expression was determined from eight RNA-seq datasets using the STAR RNA-seq aligner (Dobin et al., 2013) and subsequent read counting and normalization. The median expression for each gene was calculated to obtain an applicable value across all samples. Sequence identity scores for known orthologues were obtained from Ensembl using biomaRt (Durinck et al., 2009). Using this information, three major parameters were explored for their influence on fusion gene detection algorithms: expression, sequence identity and breakpoint position with regard to the 3’ end of the transcript.
To test the influence of gene expression level on fusion gene detection, we grouped gene pairs into three expression ranges according to the expression of the defining first gene: 0–1 FPKM (low), 1–10 FPKM (medium) and 10–100 FPKM (high). The influence of expressed homologs was tested by selecting genes with 50% or >90% identity to at least one other highly expressed gene (>10 FPKM). The influence of breakpoint position was tested by selecting genes with medium expression (1–10 FPKM) and varying the breakpoint position between 10%, 50% and 90% relative distance to the 3’ end. In total, we selected 325 gene pairs to generate ArtiFusions. ArtiFuse recognizes when the underlying gene model prevents breakpoint generation at the desired position and reports such cases in the summary file. Here, 117 gene pairs were excluded and ArtiFusions were created for the remaining 208 gene pairs (Supplementary Tables S1–S4).
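The expression-based grouping can be sketched as follows. This is an illustrative sketch only: the function name, the example FPKM values and the treatment of the class boundaries (values of exactly 1 or 10 FPKM are assigned to the higher class) are assumptions, since the text does not state how boundary values were handled.

```python
from statistics import median


def expression_class(fpkm_per_sample):
    """Assign a gene to an expression class from its median FPKM across samples."""
    value = median(fpkm_per_sample)
    if 0 <= value < 1:
        return "low"
    if 1 <= value < 10:
        return "medium"
    if 10 <= value <= 100:
        return "high"
    return None  # outside the ranges considered in this study


# Hypothetical FPKM values of three genes in eight samples.
print(expression_class([0.1, 0.3, 0.2, 0.8, 0.5, 0.4, 0.9, 0.6]))
print(expression_class([2.0, 3.5, 1.2, 8.0, 4.4, 5.1, 6.3, 2.9]))
print(expression_class([15, 22, 48, 90, 33, 27, 61, 12]))

# Bookkeeping of the selection: selected pairs minus excluded pairs.
created = 325 - 117
```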
2.3 Evaluation of fusion gene detection tools against ArtiFusions
The fusion gene detection tools InFusion (InFusion-build-20-02-17), MapSplice2 V.2.2.1 and SoapFuse V.1.27 were tested using ArtiFusions (Jia et al., 2013; Okonechnikov et al., 2016; Wang et al., 2010). For each tool, a custom reference was generated using the modified ArtiFusion genome FASTA file. All databases were generated using default options. Fusion detection with SoapFuse and MapSplice2 was performed using default parameters. InFusion was run with the ‘minimum proportion of unique split alignments supporting the fusion’ (--min-unique-alignment-rate2) set to 0, the minimum number of candidates originating from unique alignments (--min-unique-split-reads) set to 0, and non-coding regions allowed to form fusions (--allow-non-coding), as suggested by the author.
2.4 Read pileup
To determine the number of spanning and junction reads expected for an ArtiFusion event, we performed a read pileup after STAR alignment to the unaltered wild-type reference sequence and counted the supporting reads using a custom Python script. Supporting reads include read pairs that encompass the fusion breakpoint (spanning read pairs) and reads that overlap the breakpoint by at least 10 bp (junction reads).
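The two classes of supporting reads can be illustrated with a small sketch. It is not the custom script used in this study; the function names and the interval representation (half-open start/end coordinates on the wild-type reference) are hypothetical, and "overlapping the breakpoint by at least 10 bp" is interpreted here as at least 10 aligned bases on each side of the breakpoint.

```python
def is_junction_read(start, end, breakpoint, min_overlap=10):
    """A read crossing the breakpoint with at least min_overlap bases on each side."""
    return breakpoint - start >= min_overlap and end - breakpoint >= min_overlap


def is_spanning_pair(mate1, mate2, breakpoint):
    """A pair whose mates lie entirely on opposite sides of the breakpoint."""
    left, right = sorted([mate1, mate2])
    return left[1] <= breakpoint <= right[0]


def count_support(pairs, breakpoint, min_overlap=10):
    """Count junction reads and spanning pairs for one breakpoint position."""
    junction = spanning = 0
    for mate1, mate2 in pairs:
        crossing = sum(
            is_junction_read(s, e, breakpoint, min_overlap) for s, e in (mate1, mate2))
        junction += crossing
        if crossing == 0 and is_spanning_pair(mate1, mate2, breakpoint):
            spanning += 1
    return junction, spanning


# Hypothetical read pairs around a breakpoint at position 200.
pairs = [
    ((100, 150), (300, 350)),  # mates on both sides: spanning pair
    ((180, 230), (320, 370)),  # first mate crosses by 20 and 30 bp: junction read
    ((195, 245), (300, 350)),  # only 5 bp upstream of the breakpoint: not counted
    ((50, 100), (120, 170)),   # both mates upstream: no support
]
junction, spanning = count_support(pairs, breakpoint=200)
print(junction, spanning)  # 1 1
```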
2.5 In-house sequencing of MCF7 cell line
Samples were prepared using the ‘TruSeq Stranded mRNA LT Sample Preparation Kit’ from Illumina (San Diego, CA, USA). Poly(A)-positive mRNA transcripts were isolated from total RNA by binding to magnetic oligo(dT) beads (mRNA purification beads). After elution from the beads and fragmentation, mRNA was reverse transcribed during first- and second-strand synthesis, resulting in blunt-ended cDNA. An adenosine overhang was added to the 3’ ends (A-tailing step) to facilitate ligation of Illumina’s TruSeq RNA T-overhang adapter molecules. These adapters have individual, dedicated index sequences and add the Illumina-specific sequences that are necessary for amplification, flow cell hybridization and sequencing. A polymerase chain reaction (PCR) based amplification step enriched the final product. Double-stranded cDNA sequencing libraries were further checked for quality and quantity using Qubit and Bioanalyzer 2100. Libraries MK47 and MK62 were sequenced 4-plex on Illumina’s HiSeq2500 platform (2 × 50 bp), resulting in ∼2 and ∼11 million reads, respectively. Samples L1 and L2 were sequenced 2-plex on Illumina’s HiSeq4000 platform (2 × 100 bp), which resulted in ∼162 and ∼217 million reads, respectively. The raw sequencing data have been deposited in the NCBI Sequence Read Archive [SRA: SRP199155].
2.6 ART-simulated transcriptome
Human transcriptome simulation was performed with 100-, 200- and 500-fold coverage using ART (Huang et al., 2012). Version MountRainier-2016-06-05 was used with the following parameters: -p (paired-end read simulation), -l 50 (read length of each mate), -f [100, 200, 500] (fold coverage), -m 200 (mean fragment length), -s 10 (standard deviation of the DNA/RNA fragment size for paired-end simulations) and -rs 1557736317 (random seed). The simulated transcriptomes were analyzed for ArtiFusions in the same way as the real datasets.
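For reproducibility, the listed options can be assembled into one command line per coverage, as in the sketch below. Only the options named above are taken from the text (with -p selecting paired-end simulation in ART); the executable name `art_illumina`, the input and output options `-i` and `-o`, and the file names are assumptions added for illustration, and any further options used in the original runs are not stated. The commands are only built here, not executed.

```python
def art_command(transcriptome_fasta, out_prefix, fold_coverage):
    """Build an ART argument list from the parameters listed in the text."""
    return [
        "art_illumina",
        "-i", transcriptome_fasta,   # assumption: input transcript sequences
        "-o", out_prefix,            # assumption: prefix of the output files
        "-p",                        # paired-end read simulation
        "-l", "50",                  # read length of each mate
        "-f", str(fold_coverage),    # fold coverage
        "-m", "200",                 # mean fragment length
        "-s", "10",                  # standard deviation of fragment size
        "-rs", "1557736317",         # random seed
    ]


# Hypothetical file names; one command per simulated coverage.
commands = [
    art_command("transcriptome.fa", f"sim_{cov}x_", cov) for cov in (100, 200, 500)
]
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` once the paths have been adapted to the local setup.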
3 Results
3.1 Performance estimate of fusion gene detection tools based on ArtiFusions
We tested the performance of three fusion gene detection algorithms using our ArtiFusion approach: InFusion, MapSplice and SoapFuse (Jia et al., 2013; Okonechnikov et al., 2016; Wang et al., 2010). Using eight published and in-house-generated MCF7 cell-line RNA-seq datasets of varying read lengths and sequencing depths, we investigated the influence of a wide range of real, non-simulated datasets (Supplementary Table S1 and Fig. 2) on fusion gene detection. Furthermore, we created ArtiFusions in genes with different expression levels to determine sensitivity as a function of this parameter (Supplementary Tables S2–S4).

Fig. 2. Fusion gene detection as estimated from ArtiFusions depends on sequencing depth and expression level. (A) ArtiFusions were introduced in genes with varying expression as determined by real RNA-seq data: low (0–1 FPKM, n = 27), medium (1–10 FPKM, n = 31) and high (10–100 FPKM, n = 45). Three fusion detection tools (InFusion, MapSplice and SoapFuse) were used to identify ArtiFusions in eight different MCF7 sequencing samples with varying read lengths and sequencing depths. The recall values for all tool-sample combinations are shown as the ratio of correctly predicted ArtiFusions versus all introduced ArtiFusions. (B) Using the same set of ArtiFusions (n = 103), recall values were identified from three simulated transcriptomes with 100-, 200- and 500-fold coverage. (C) The percentage of correctly identified ArtiFusions within the top 50 predicted fusion genes according to junction reads is shown. (D) The overall recall values based on all ArtiFusions independent of the expression level (n = 103) versus the total number of all predicted events for each sample and tool are shown. Sample coverages are indicated by dot size.
We first evaluated the sensitivity of the tools by asking how many of the ArtiFusions were identified as fusion events. As expected, the recall values of all three tools increased with sequencing depth for all analyzed expression levels (Fig. 2A). All tools achieved their best recall values for medium and highly expressed ArtiFusions: InFusion and SoapFuse achieved recall values of up to 0.7–0.8, while MapSplice achieved a maximum recall of 0.6. For medium and highly expressed genes, sensitivity reached a plateau at a sequencing depth of about 20–30 million reads and did not improve further in samples sequenced at almost 10 times greater depth. Detection of low-expressed ArtiFusions was far less sensitive; here, deeper sequencing helped, but recall reached 0.37 at best. This indicates that deeper sequencing provides added benefit only for low-expressed fusion genes. MapSplice showed lower recall values in all tested scenarios (0.1–0.4) compared to InFusion (0.3–0.56) and SoapFuse (0.23–0.54).
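Recall, defined here as the ratio of correctly predicted ArtiFusions to all introduced ArtiFusions, can be computed as in the following minimal sketch. Matching a prediction to an ArtiFusion by its ordered gene pair is an assumption made for illustration (a comparison of breakpoint coordinates would work analogously), and the gene names are placeholders.

```python
def recall(introduced, predicted):
    """Fraction of introduced ArtiFusions that appear among the predictions."""
    introduced = set(introduced)
    if not introduced:
        raise ValueError("no ArtiFusions were introduced")
    return len(introduced & set(predicted)) / len(introduced)


# Placeholder gene pairs: four ArtiFusions, of which two are recovered.
introduced = [("GENE1", "GENE2"), ("GENE3", "GENE4"),
              ("GENE5", "GENE6"), ("GENE7", "GENE8")]
predicted = [("GENE1", "GENE2"), ("GENE5", "GENE6"), ("GENE9", "GENE10")]

print(recall(introduced, predicted))  # 0.5
```

The extra prediction ("GENE9", "GENE10") does not affect recall; whether it is a false positive or a true fusion present in the sample cannot be decided from the ArtiFusions alone.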
Using ART (Huang et al., 2012), we simulated transcriptomes with three different coverages (100-, 200- and 500-fold). When using simulated data (in contrast to real data), we observed much higher recall values (between 0.88 and 0.97) for detection of the same set of ArtiFusions as before (Fig. 2B). Here, we did not differentiate ArtiFusions according to expression level, as all genes are covered equally by the simulated transcriptomes. For the simulated data, we observed no major difference between the three tested tools.
We calculated the fraction of correctly identified ArtiFusions among the 50 highest-ranking predicted fusion genes according to reported junction reads for each tool (Fig. 2C). Here, InFusion shows lower values (10–26%), while MapSplice (20–32%) and SoapFuse (18–37%) perform similarly in most cases. We also compared recall with the total number of predicted fusions; here, a striking correlation is evident for all prediction results (Fig. 2D).
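The top-50 measure can be sketched as follows, assuming that each prediction carries a gene pair and a reported junction read count; the record layout, the function name and the gene names are hypothetical. If a tool reports fewer than 50 events, this sketch divides by the number of reported events, which is an assumption.

```python
def top_n_artifusion_fraction(predictions, artifusions, n=50):
    """Fraction of ArtiFusions among the n predictions with most junction reads."""
    ranked = sorted(predictions, key=lambda p: p["junction_reads"], reverse=True)[:n]
    if not ranked:
        return 0.0
    hits = sum(1 for p in ranked if p["pair"] in artifusions)
    return hits / len(ranked)


# Placeholder predictions of one tool in one sample.
artifusions = {("GENE1", "GENE2"), ("GENE3", "GENE4")}
predictions = [
    {"pair": ("GENE1", "GENE2"), "junction_reads": 120},
    {"pair": ("GENE9", "GENE10"), "junction_reads": 80},
    {"pair": ("GENE3", "GENE4"), "junction_reads": 5},
    {"pair": ("GENE11", "GENE12"), "junction_reads": 40},
]

print(top_n_artifusion_fraction(predictions, artifusions, n=2))  # 0.5
```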
Next, we analyzed the effect of read length. The three samples Sun, Daemen and Encode have comparable sequencing depths but different read lengths of 50, 77 and 100 bp, respectively. However, we did not observe any clear performance advantage for samples with longer reads (Fig. 2A), which might indicate that the fusion detection tools we applied in this study do not take full advantage of longer reads.
3.2 Recovery of junction reads and spanning pairs
In general, detection tools identify fusion events from reads that map inconsistently to the unmodified wild-type reference genome close to a potential breakpoint position. In this process, reads (junction reads) and read pairs (spanning pairs) that span the breakpoint are commonly used as quantitative support for fusion events. The ArtiFusion approach allows computation of the expected number of reads that support an artificially created fusion. We computed the expected number of junction reads and spanning pairs from read pileup of the reads mapped to the unaltered reference genome at the position of the introduced ArtiFusion breakpoints. By considering only correctly identified breakpoints, we compared the expected read counts to the number of junction reads and spanning pairs reported by each tool (Fig. 3 and Supplementary Fig. S1).

Fig. 3. Recovery of junction and spanning reads introduced as ArtiFusions is highly accurate independent of expression level and coverage. (A and B) The reported number of junction reads for correctly predicted ArtiFusions was compared to the expected number of reads at the corresponding position, assessed by pileup after alignment to the wild-type reference. Here, reported junction reads for all three tools correlate with the expected number in the low-coverage sample (2 million reads), but correlation is lower in high-coverage samples (217 million reads). (C and D) The reported number of spanning pairs was compared to the number of reads as assessed by the pileup. Please note that the number of spanning pairs is not expected to be identical, only proportional.
For all tools tested, the reported junction reads correlated well with the expected number, especially in samples with lower sequencing depth. Furthermore, the reported junction reads vary minimally between the three tools (Fig. 3A and B).
For spanning read pairs, we observed a high correlation between expected and reported read counts. However, the number of spanning read pairs reported by the tools varied by up to one order of magnitude (Fig. 3C and D and Supplementary Fig. S2). Furthermore, the correlations between expected and observed spanning pairs were overall weaker than those for junction reads, especially in samples with low sequencing depth (Fig. 3C). Here, the absolute number of reads seems too low for accurate quantification by the tested tools.
MapSplice reports on average 10-fold more spanning pairs for detected ArtiFusions than InFusion, with SoapFuse in between. MapSplice overestimates the actual number of spanning pairs, indicating that this tool uses less stringent alignment criteria. However, MapSplice predicts fewer fusion events overall (Fig. 2D), indicating that it has more stringent cutoff criteria with regard to the minimal number of spanning pairs.
3.3 Influence of paralog expression and relative breakpoint position on detection
Sequence similarity between expressed paralogous genes might complicate the detection of fusion genes through ambiguous assignment of reads (or parts of reads) to genes during the sequence alignment process. To investigate this effect, we introduced ArtiFusions in genes with expressed paralogues of varying sequence identity and compared their detection to genes without paralogues. We observed reduced recall values for InFusion and SoapFuse for paralogues of very high identity (Fig. 4A). It is likely that under these conditions InFusion and SoapFuse misalign reads, obscuring the prediction. MapSplice, on the other hand, appears to gain supportive reads, leading to an increased recall value. Interestingly, when we checked whether any predicted fusion genes involved the known paralogous genes, we did not find any. This indicates that expression of genes with high identity does not affect precision.

Fig. 4. Highly homologous genes, but not relative breakpoint position, impact the sensitivity of fusion gene detection as estimated from ArtiFusions. (A) ArtiFusions were introduced in genes with moderate expression (1–10 FPKM) and either no known homologs, homologs with ∼50% identity (n = 16) or homologs with >90% identity (n = 19). Here, only genes with expressed homologs (>1 FPKM) were considered, to evaluate the effect of similar sequences on fusion gene detection. The recall values for all tool-sample combinations are shown as the ratio of correctly predicted ArtiFusions versus all introduced ArtiFusions for the respective test cases. (B) ArtiFusions were introduced in genes with moderate expression (1–10 FPKM) at different relative positions on the transcript: at 10% distance from the 3’ end (Position 3’, n = 16), at 50% distance from the 3’ end (Position middle, n = 15) and at 90% distance from the 3’ end (Position 5’, n = 11). Shown are the recall values for all tool-sample combinations as the ratio of correctly predicted ArtiFusions versus all introduced ArtiFusions for the respective test cases.
We also investigated whether the relative position of the breakpoint on the transcript affects fusion gene detection: to this end, ArtiFusion breakpoints were introduced at 10%, 50% and 90% transcript length. We observed slightly reduced recall values for breakpoints at the 5’ end only for samples with very low sequencing depths (Fig. 4B). In these low-coverage samples, a potential 3’ bias due to the isolation of poly-adenylated RNA might have a more pronounced effect than in higher-coverage samples, which also have sufficient coverage away from the 3’ end. Taken together, we conclude that the sensitivity of fusion gene detection using ArtiFusion-modified reference genomes is most strongly affected by high sequence similarity among expressed homologs and low sequencing depth.
4 Discussion
One of the main difficulties in testing fusion detection algorithms is the lack of sufficiently validated fusion genes that can serve as positive controls to accurately assess performance. Previously, simulated reads were used to assess performance (Bruno et al., 2013; Tan et al., 2015). However, read simulation tools are limited in simulating the complexity of real samples and cannot account for all biases and errors found in real-world datasets. While those tools can be of use, they cannot correctly assess true performance on real-world samples (Kumar et al., 2016). A recently published approach to assess prediction performance for structural variants from whole-genome sequencing data aims to circumvent the limitations of synthetic reads by modifying alignment data of real samples, demonstrating the need for novel methodologies (Lee et al., 2018). However, in contrast to ArtiFuse, it still relies on read manipulation.
Existing tools like SimFuse and FuSim combine sequences from two different genes into one simulated fusion gene, thereby allowing for simulation of reads that cover that simulated fusion gene (Bruno et al., 2013; Tan et al., 2015). With ArtiFuse, we do the opposite: we split the sequence of one gene across two different gene loci in order to make it appear like a fusion gene. This way, bona fide reads of any expressed gene are forced to map discordantly and as split reads at the ArtiFusion breakpoints, the same way reads from a real fusion gene would map. The reads required for detection of ArtiFusions exist in any real dataset as long as the gene is expressed. In contrast, the reads for fusion genes simulated with SimFuse and FuSim cannot exist in real datasets and therefore require simulation. The concept of ArtiFuse, which is to pretend that bona fide genes are fusion genes by changing the reference sequence, is to the best of our knowledge completely novel.
We used ArtiFuse to introduce over 200 ArtiFusions into the human reference genome and tested them in eight real datasets of varying sequencing quality and depth with three different fusion gene detection tools, generating almost 5000 data points in total. We found that for medium and highly expressed fusion genes the recall value never exceeds a range of 0.7–0.8, even in very high-coverage samples, and drops to values between 0.1 and 0.2 for lower expressed genes and samples with low coverage. Considering all events, only one in two ArtiFusions was predicted, and only in samples with sufficient sequencing depth. This is in contrast to the higher recall values observed when using simulated transcriptomes. The equal coverage of the simulated transcriptomes gives an advantage when analyzing low-expressed ArtiFusions. However, this cannot completely explain the observed overall performance gain, because the coverage of the simulated transcriptomes (100–500-fold) was selected to be lower than the coverage for highly expressed genes in high-coverage samples, as evidenced by the higher number of junction reads there (up to several 1000 junction reads; Fig. 3B). This demonstrates the importance of using real datasets when evaluating performance for fusion gene prediction tools.
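The figure of almost 5000 data points follows from the numbers reported in the methods, assuming that every combination of ArtiFusion, dataset and tool counts as one data point (this counting convention is an assumption; the variable names are illustrative):

```python
artifusions = 325 - 117  # selected gene pairs minus excluded gene pairs
datasets = 8             # real MCF7 RNA-seq datasets
tools = 3                # InFusion, MapSplice and SoapFuse

datapoints = artifusions * datasets * tools
print(artifusions, datapoints)  # 208 4992
```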
ArtiFusions are generated within the genomic reference and affect all reads mapping to a given locus. Therefore, it is not possible to simulate fusion genes in a heterozygous situation, with the fusion gene being expressed from one allele and the wild-type variant being expressed from the other allele. Additionally, alternative splicing can further complicate prediction of real fusion genes. It would also be possible to simulate such fusion genes by read simulation, with all the previously described limitations. Using ArtiFusions, it is conceivable to mimic heterozygosity and alternative splicing by careful placement of the ArtiFusion breakpoint at an alternatively spliced exon junction. However, this requires precise knowledge of the splicing patterns that occur in the analyzed dataset. Therefore, it is possible that prediction performance is still overestimated for these situations.
We challenged the fusion detection tools by evaluating their performance for different expression levels, different levels of sequence identity and different breakpoint positions. Notably, the ArtiFusion approach allows evaluating such features while simultaneously taking sample-intrinsic characteristics into account, such as coverage, read length and potentially unknown technical biases from sequencing. This allowed a much more detailed and realistic comparison between tools than would be possible with simulated reads.
The omission of synthetic reads is the main advantage of ArtiFuse, allowing for even more complex scenarios than tested here, for example, accurately assessing the performance of detection algorithms in RNA-seq data derived from formalin-fixed versus fresh frozen samples. Such differences and complex technical biases cannot be easily simulated. ArtiFusions offer a very simple and elegant solution to such questions, as real samples can always be used with all their intrinsic properties and technical biases. Furthermore, simulation of synthetic reads is a computationally expensive process that has to be repeated for each sample. The generation of ArtiFusions, however, requires only limited computational resources to modify the reference sequence, and the same ArtiFusions can subsequently be used to test multiple samples.
Another advantage of ArtiFuse is that candidate ArtiFusions can be selected on any gene property, such as expression level in FPKM, allowing for a much more meaningful level of control. Expected junction and spanning reads can be determined after wild-type mapping and used to estimate recovery rates of detection tools, to see whether supportive reads are lost or overestimated. While this is also possible using simulated reads, it is much harder to define meaningful values for a fusion gene. The user has to define the actual number of junction and spanning reads, taking into account coverage, read length and fragment size of the simulated sample, which is much more complex than simply selecting an actual gene with the desired property, as with ArtiFuse.
The recall value is a key metric that is often missing for real datasets. While it is possible to determine the precision by validating predicted events with independent methods, one can often not test for missed events in real datasets. ArtiFuse overcomes that limitation. However, it is not possible to directly determine the precision with ArtiFusions, as additional true fusion genes might exist in the dataset. Simulated data, on the other hand, offer control over all simulated events and can be used to determine both precision and recall, but might not represent real-world performance for either value. Therefore, a combined approach using simulated data, ArtiFusions and validation data on a subset of fusions probably provides the most realistic evaluation of fusion gene prediction tools.
We believe that ArtiFuse provides a very clean approach to estimate the performance of fusion gene prediction tools, especially in the context of varying sample qualities with different technical biases. ArtiFuse is available as open-source software on GitHub (https://github.com/TRON-Bioinformatics/ArtiFusion).
Acknowledgements
We thank Dr Karen Chu for her comments on the manuscript. Furthermore, we would like to thank Dr Jonas Ibn-Salem and Martin Suchan for great scientific discussions on the matter of fusion genes.
Conflict of Interest: none declared.
References