VaDiR: an integrated approach to Variant Detection in RNA

Abstract Background Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered. But the cost associated with it is currently limiting its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. Results We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA(VaDiR) that integrates 3 variant callers, namely: SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Conclusions Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.


Dr. Jeremy Chien
Abstract: Background Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole genome sequencing, variations in coding and non-coding sequences can be discovered. But the cost associated with it is currently limiting its general use in research. Whole exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue.

Results
We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called "VaDiR: Variant Detection in RNA" that integrate three variant callers, namely: SNPiR, RVBoost and MuTect2. The combination of all three methods, which we called Tier1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed higher rate of mutation discovery in genes that are expressed at higher levels.

Conclusions
Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing data sets.
Have you included all the information requested in your manuscript?

Resources
A description of all resources used, including antibodies, cell lines, animals and software tools, with enough information to allow them to be uniquely identified, should be included in the Methods section. Authors are strongly encouraged to cite Research Resource Identifiers (RRIDs) for antibodies, model organisms and tools, where possible.
Have you included the information requested as detailed in our Minimum Standards Reporting Checklist?
Yes Availability of data and materials All datasets and code on which the conclusions of the paper rely must be either included in your submission or deposited in publicly available repositories (where available and ethically appropriate), referencing such data using a unique identifier in the references and in the "Availability of Data and Materials" section of your manuscript.
Have you have met the above requirement as detailed in our Minimum Standards Reporting Checklist?

Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation
Background Next-generation sequencing has enabled the discovery of novel variants in genetic sequences. However, even though the cost of sequencing has decreased in recent years, whole genome sequencing (WGS) can still be prohibitively expensive in many cases [1]. Sequencing only exonic regions of the genome helps reduce cost, and multiple tools (such as MuTect2 provided by GATK [2], MuSE [3], SomaticSniper [4] and VarScan2 [5]) have been developed for somatic variant discovery using whole exome sequencing (WES) data, and the performance of these tools was recently evaluated [6]. Still, the reagents used to capture exonic regions are costly and produce uneven coverage across the genome due to capture rate biases [7,8], and only a fraction of the genes in an exome are actually expressed in any given cell [9]. For diseases like cancer, mutations in expressed regions are of greater interest than in non-exonic or unexpressed exonic regions because they are more likely to affect cellular function directly. The transcriptome is therefore an attractive subject of research in cancer and other human pathologies, and some of the cancer genes, such as FOXL2 in granulosa-cell tumors [10] and ARID1A in clear cell carcinomas of the ovary [11], were initially discovered through transcriptome sequencing. The calling of variants with sequencing data from transcriptome (RNA-seq) is more challenging because of the splice junctions. Tools like RVBoost [12], SNPiR [13] or GATK Haplotypecaller are created to address this problem. Somatic variant calling from RNA is more difficult because of RNA processing like RNAediting, allele-specific expression, variable levels of gene expression, and the heterogeneity of tumors which .tex file Click here to download Manuscript VaDiR_bmc_new_revised_jc.tex leads to low variant frequencies of some mutations [14]. Tools such as RVBoost, SNPiR, and GATK Haplotypecaller can be used to perform germline variant calling from RNA, but their performance and limitations for somatic variant calling have not been studied previously. Nonetheless, these approaches have the potential to provide an orthogonal method to validate DNA sequence variations by complementing the analysis with RNA sequence analysis. Additional challenges include the determination of detected mutations either as germline or somatic. In tumor tissues, somatic mutations differ from the germline variations of a patient that are different from the reference genome. To detect somatic sequence variations, it is necessary to compare DNA sequences from normal tissue, such as blood, to DNA or RNA sequences from tumor tissue. If germline sequence variations are not filtered out, it would be difficult to assign detected variations as either somatic or germline. Additionally, it would be improper to assign a variant discovered in the tumor tissue as a somatic mutation when this particular position has no sufficient coverage in germline sequencing.
It should be noted that the integrated approach used by RADIA [15], that combines the somatic variant sequence analysis from tumor DNA and RNA sequencing, allows the discovery of DNA sequence variations in expressed genes and a better characterization of the effect of mutations on gene expression and phenotypic alterations. However, its use of WES of tumor tissue introduces additional cost. RADIA uses the tumor DNA and normal DNA sequencing data sets in the main analysis, and RNA sequence analysis is used as an orthogonal supplement. DNA sequence variations are considered as the ground truth, and RNA variants not supported by DNA sequencing were rejected as false-positives. Although somatic variants discovered only by RNA sequencing have the potential of being false-positives, some of these variants may represent missed calls from tumor DNA sequencing or RNAediting sites that have not been annotated. A detailed comparison of somatic DNA and RNA variants from different tools will provide us with more precise processing and discovery of sequence variations from RNA and DNA sequencing.
In this study, following the recommendation and practices that are widely adopted in the field of bioinformatics [16,17], we chose a validated dataset to perform a detailed comparison of somatic DNA and somatic RNA sequence variations from 21 pairs of whole exome and mRNA sequencing from ovarian cancer genomes. We formulated an approach to utilize three publicly available tools, namely MuTect2, RVboost and SNPiR for variant discovery from RNA sequencing. We evaluated the performance of each tool and established the best combination of these tools that enables discovery of variants from RNA sequence with high precision and recall. We showed that most of the variants which would be classified as false-positives or false-negatives can be explained by biological characteristics. In addition, we investigated the performance of our workflow on artificially spiked variants in coding regions of mRNA sequencing data and we compared the performance of VaDiR to RADIA. Finally, we showed the performance of our workflow on a biologically relevant study: the comparison of somatic variants in high-grade serous carcinomas collected from patients with chemotherapy-resistant or -sensitive ovarian cancer.

DATA DESCRIPTION
Twenty one samples of ovarian serous cystadenocarcinoma from The Cancer Genome Atlas (TCGA) were divided into two groups: 11 cases that were sensitive to the cancer treatment and 10 cases that were resistant. Sensitive cases had a progression-free survival of more than 18 months, and resistant cases had progressionfree survival of less than 12 months. The clinical data for the patients were retrieved from cBioPortal ( [18][19][20]), and the Illumina sequence files for tumor RNA and normal blood DNA were retrieved from cghub [21] and gdc [22] (Supplementary Table 1). Whole exome sequencing and mRNA sequencing datasets were available from each patient.
Additional data used for the artificial spiking of variants (see section "Detection of artificial spiked variants") were provided by Dr. Andrea Mariani and came from three different tumor samples from a patient with serous ovarian carcinoma.

ANALYSIS
Performance characteristics of each method and different combinations of two or more methods To describe the performance characteristics of each method, we use recall and precision metrics instead of sensitivity and specificity because we are interested in variant calls only. Specificity is not a relevant measure because it includes all true negative calls which are in millions. We performed variant calling using RVboost, SNPiR, and MuTect2 separately. Each caller alone calls many variants which are not validated by DNA somatic variants (discordant calls), while SNPiR calls the most variants ( Figure 1(A)). Mutect2 provides the least amount of variant calls not supported by DNA sequencing compared to the other two methods. However, only 10% of variant calls made by Mu-tect2 was supported by DNA sequencing. These results indicate that any single caller is not adequate in discovering variants with high precision. Therefore, we  Figure 1(B), Supplementary Table 2). The combination of Tier1 with mutations called by Mutect2 and SNPiR (hereafter referred to as Tier2) leads to a higher recall (11.3%) while the precision is still in a moderate range (41.5%). For the following analysis, we concentrated only on Tier1.

Effect of weighted features
Additionally, we performed a weighted average of three callers with the goal of decreasing the number of false positive (FP) and false negative (FN) calls. Specifically, we investigated the effect of different weights on the evalue, which was defined as the sum of FP and FN. The weights on each of the callers were systematically varied from 0 to 1 in increments of 0.1. Evalues were calculated for each weighted combination and the optimal weights were defined as those that resulted in the smallest evalue. The consensus call of all three callers (Tier1) is denoted in blue ( Figure 2). Our results demonstrate that many different combinations of weights produce similar evalues as compared to the consensus call of all three callers ( Figure 2, Supplementary Figure 1), suggesting that no improvement in performance was gained by weighted average approach. Similarly, no appreciable gain in performance was noted when we considered the variant allele frequency (vaf ) in the estimation of the weights (Supplementary Figure 2). Thus, taken collectively, our results showed little to no benefit in using weighted features.
Performance of a combined calling method A total of 634 somatic mutations were called from 21 tumor samples. 516 mutations were concordant and 116 were discordant with mutation calls made from DNA (see Supplementary Table 2). To get a ground truth of variants which could have been called by RNA and were called in tumor DNA, we filtered out all DNA variant calls which have a read depth below 10 in RNA. With this filtering, we found a total of 515 variants which were called at the RNA level, while 452 of them are concordant (true-positive) and 63 discordant (false-positive) ( Table 1). 1,779 of the 10,361 variants called by DNA callers have read depth greater than ten at the RNA level, and 1,327 of them were missed by RNA calling (74.5% false-negative rate).
Variants not found in RNA To understand why variant calls from RNA sequencing missed a large majority of variant calls observed by DNA sequencing, we checked the properties of variants missed by RNA callers. From the 10,361 somatic variants called by at least two DNA variant callers, 9,845 were missed by Tier1. Out of them 8,517 (86.5%) were missed because these variants reside in genes that are not expressed (4,628) or expressed less abundantly (3,890) (Supplementary Figure 3). For the mutations in genes with high transcript abundance, 474 (4.8%) were missed because these variants were not in exonic regions. The effect of transcript abundance on variants discovered from RNA-seq could also be observed in the percentage of concordant calls: 516 (24.7%) of the expressed mutations called in DNA in exonic regions were called by Tier1 (Figure 3 (A)) but when the expression is higher (DP>10) 34.6% (452 out of 1,305 mutations) of the somatic mutations were called. This result confirms that an important factor in RNA-seq variant calling is the expression level.
Among the mutations found by DNA callers but missed by Tier1 from highly expressed genes (DP>10), 531 (5.4%) of the mutations had variant allele fraction (VAF) < 0.20 in tumor DNA, while 141 of them had a VAF = 0 ( Figure 3 (B), Supplementary Figure 4), which can be explained through missed indels and that we accepted only reads with a high quality value in the discovery of the DP of all variants. Additionally, 724 (7.4%) of the missed mutations had VAF < 0.20 in tumor RNA, while 493 of them had a VAF = 0 in tumor RNA. This result confirms that one of the limitations of RNA-based variant calling methods is that they are highly dependent on the VAF. Figure 3 (B) shows that VAF of missed variant is significantly lower than VAF of called variants both at the DNA and RNA levels (p-value < 0.0001). Moreover, the difference is much greater between VAF of called variants and missed variants at the RNA levels, suggesting that many of the missed variants at the RNA level may be the result of mutations present in small fraction of tumor cells and the lower expression of mutated transcripts.
From the variants with high expression and high VAF, thirty one mutations were not called by any of the callers. Ninety six mutations were filtered out by at least one of the callers because of potential evidence of germline variants or because the realigning step with PBLAT shows that these variants could come from mismapping. Most of the missed variants with low VAF are called by MuTect2 or SNPiR alone or Mu-Tect2 and SNPiR together (Figure 3 (C)). It is not clear if these missed variants are false-negatives, i.e, true variants missed by VADiR, or if they are falsepositives made by DNA callers. Given that many of the missed variant calls (not found by VaDiR) are the result of PBLAT step in VaDiR to eliminate mis-mapped reads and this step is not used in DNA callers, it is possible that some of the calls missed by VaDiR are true negatives that are incorrectly called by DNA callers.
Variants not found in DNA The differences in coverage or VAF between DNA and RNA datasets could also contribute to discordant calls. Therefore, we checked those attributes at discordant sites. From all 116 discordant mutations called by Tier1, 53 (45.7%) had a read depth (DP) of uniquely mapping reads under 10 at RNA level and seventeen (15.7%) had a read depth under 10 at DNA level (Supplementary Figure 5). Another 22 (19.0%) mutations had VAF > zero at DNA level, indicating that these low-level DNA variants were missed by DNA-based callers used by TCGA. Twenty three variants with VAF=0 at DNA level but high DP in germline DNA, tumor DNA and tumor RNA were mostly either A>G or C>T (Supplementary Figure 6). Those variants were found at 12 different positions, of which one variant (chr3:58141791 A>G [FLNB:p.M2324V]) is found in 4 different samples and another (chr20:10285837 C>T) in 9 different samples. These likely represent unannotated RNA-editing sites [23][24][25].
Because we observed differences in the VAF at the discordant sites, we next expanded the analysis to all sites. Interestingly, we observed a weak correlation of VAF between tumor DNA and tumor RNA at positions with DP>0 for tumor DNA and RNA ( Figure  4 (A)). When we limit the analysis to positions with DP>10 for tumor DNA (Figure 4 (B)) or tumor and normal DNA (Figure 4 (C)), we also observed a weak correlation. Finally, when we limit the analysis to positions with DP>10 for tumor DNA and RNA and normal DNA, we observed a strong correlation of 0.74 of variant allele fraction between RNA and DNA (Figure 4 (D)). Only four mutations had VAF around 0.50 at DNA level but 1.0 at RNA level which suggests that these are imprinted genes. These results suggest that VAF in abundant transcripts are strongly correlated with VAF at DNA level. Therefore, VAF obtained from RNA-sequencing may be used as a substitute for DNA VAF for subclone phylogenetic analysis. As shown by McPherson et al. [26] subclonal phylogenetics can use limited/targeted sequencing to identify subclones.

Detection of artificial spiked variants
To further assess the performance of RNA-based callers, we used BamSurgeon and spiked-in 200 artificial RNA sequence variants at varying variant fractions in transcriptomes from three samples of two different tumor sites from one patient. From the 200 simulated variant positions, 120 were actually spiked in because failed positions have too low read depth even if the positions for spiking were obtained from expressed genes.
On average 71% of all spiked-in variants were found by each caller alone. The combination of all three callers leads to a calling of around 50% of all spiked-in mutations (Table 2, Supplementary Figure 7). By using Tier2, we were able to call 60% of all spiked-in mutations. 55.6% of the mutations missed by Tier1 but called by at least one caller are not in coding regions (Table 3). From the remaining missed variants, 15.7% have a variant allele fraction of less than 0.2 and 6.1% have high variant allele fraction but have a DP<10 in DNA.

Comparison between RADIA and VaDiR
Since RADIA performs function similar to our workflow VaDiR, we compared the performance differences between RADIA and VaDiR. RADIA uses DNA variant calling as the primary method and use RNA variant calling as a supplement. All somatic variants called by RADIA are supported by DNA-level evidence and RNA-only variants are not called by RADIA. Therefore, we limited our comparison to variants that are found at both RNA and DNA levels by RADIA and VaDiR. A total of 308 mutations were called by either RADIA or VaDiR or both in six samples. Of these, 175 mutations were called by both methods, 12 mutations were called by VaDiR only, and 121 mutations were called by RADIA only, while VAF of variants missed by VaDiR are significantly lower than VAF of variants missed by RADIA (Supplementary Figure 8). From these 121 mutations, 40 (33.1%) had a read depth below 10 in RNA. 52 (43.0%) mutations, with a read depth over 10, had VAF below 0.20. This shows again the limitation of method based only on RNA. Six of the remaining 29 variants were in non-exonic regions and would not be called by our method.
Ovarian cancer: resistant vs. sensitive Since variant calling from RNA-seq provides both mutational status and gene expression, the number of mutations found by RNA-seq may be associated with pathologic or clinical phenotypes. In contrast, the total number of mutations found at the DNA level may not be associated with pathologic or clinical phenotype because it may be confounded by potentially nonrelevant mutations in non-coding region or in genes that are not expressed. To determine if variant calling from RNA-sequencing may provide novel insights into clinical phenotype, we characterized the number of mutations in expressed genes from RNA-seq obtained from 10 chemotherapy-resistant and 11 chemotherapysensitive ovarian carcinomas. We considered concordant mutations only (those found by both RNA-and DNA-based callers) for the analysis. The results indicate that concordant rate is higher for Tier1 mu -1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 tations compared to Tier2 mutations although total number of mutations are higher in Tier2 (Figure 5 (A)). We observed higher amount of mutations in chemotherapy-sensitive ovarian carcinomas compared to chemotherapy-resistant counterparts ( Figure  5 (A)). This result is consistent with previous studies indicating that sensitive tumor samples have a higher mutation rate in ovarian cancer [27]. In these samples, number of mutations was significantly higher at either DNA (pV alue = 0.017 [Two Sample t-test, t = −2.3474, df = 19]) or RNA (pV alue = 0.03 [Two Sample t-test, t = −2.605, df = 19]) levels in sensitive carcinomas compared to resistant carcinoma samples ( Figure 5 (B)).
We next focus our analysis to variants that produce nonsynonymous mutations because they are more likely to contribute to a change in phenotype and the divergent evolution of tumor subclones. If a tumor sample is predominantly represented by a tumor subclone, VAF of nonsynonymous SNVs in that subclone will provide the largest fraction of mutations, and thus higher fractions of VAF in nonsynonymous SNVs is expected. On the other hand, if the tumor sample is represented by multiple tumor subclones, each containing subclone-specific mutations, nonsynonymous SNVs will be found at low levels in this tumor. Therefore, VAF of nonsynonymous mutations may represent clonal heterogeneity. Results, shown in Supplementary Figure 9, indicate that differences in VAF be-

DISCUSSION
In addition to the consensus calling of variants by three methods, we tested weighted combinations of the three methods with and without considering the vaf [28]. We didn't see any improvements in the numbers of true-positive variants, false-negative variants and false-positive variants . Therefore, the approach that uses weighted average features is not implemented in our tool. However, our workflow provides the possibility of combining calls from any or all callers for further refinement or for adapting to the need of users.
With our approach, we were able to call variants with high precision. Only a small fraction of the variants which are called in RNA but not in DNA are likely false positives. The remaining discordant variants are either RNA-editing sites or are missed by DNA callers. Most of the variants called in DNA but missed by VaDiR are not in coding regions or are not expressed. We also missed many variants that have low VAF. Those are called by none of the callers, MuTect2 only, or SNPiR only. These mutations are observed at low VAFs in tumor DNA, and therefore they likely represent mutations from small subsets of tumor subclones. Finally, our approach missed approximately 15% of variants (127/853) with a high DP and a high VAF. Among the 127, 96 mutations were called by at least one method, indicating that consensus calling is too stringent or that parameters for one of the callers is not optimal. Those data are confirmed by the artificial spiked-in variants where only variants with high VAF could be called by all three callers.
The comparison to RADIA shows that VaDiR misses mainly low-frequency RNA variants while RADIA misses some high-frequency RNA variants. This result confirms the limitation of calling variants only from RNA, but it also shows that VaDiR can be used to call a great number of somatic variants without the need for tumor whole exome sequencing. It should be noted that current workflow is not completely independent of DNA sequencing since we use germline DNA sequencing to filter out germline variants. However, if the goal is to discover variants in RNA sequencing, VaDiR workflow can be modified to use MuTect2 without germline DNA and to leave out the last filtering step for DP and VAF values in germline DNA. VaDiR may be suitable for tiered studies where VaDiR can be used in the initial step to identify common variants from RNA sequencing datasets, and these candidate mutations can be confirmed by targeted DNA sequencing in a larger cohort to uncover biologically relevant somatic mutations for a specific cancer type. By focusing the initial variant discovery to expressed genes in diseased samples, follow-up validation sequencing efforts can be more targeted to limited regions of interest, thereby lowering the total cost of these genomic studies.
We were also able to find new possible RNA-editing sites, which should be investigated in future studies. Therefore, our workflow provides new capabilities that are missing in existing approaches and can be used to gain novel insight into disease phenotype. Our main concern in future studies would be to increase the number of concordant variant calls by adjustment of the filtering steps from SNPiR and RVboost and to investigate the reasons for missed somatic variants with high VAFs. Future work will also include efforts to make this tool available through a web-server for the detection of somatic variants in RNAseq .   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 64 65
To validate the results that we obtained from RNA, we used somatic variants from DNA called by any two of the variant callers MuSE, MuTect2, Somatic-Sniper, and VarScan. We retrieved the corresponding VCF files from GDC [22].
We implemented SNPiR with the following modifications: In the file BLAT candidates.pl at line 94, the developers incorrectly handled the information in the CIGAR-string of hardclipped reads, that resulted in a faulty shift in the base position. We corrected the code to handle CIGAR-strings correctly. This modification was necessary because our workflow differs from the SNPiR workflow in that we use hard-clipped reads. At the same location, we also added an optimization to avoid searching through more base positions than necessary. Further, we changed the filter to use PBLAT instead of BLAT, so we could utilize additional CPU threads to improve execution time. We made similar changes in the file filter mismatch first6bp.pl at line 84. In addition, we optimized the search algorithm in filter intron near splicejuncts.pl by skipping exons and genes that do not contain a given variant position (which also introduced the requirement that SNPiR's gene annotation table be sorted by position) and moderately improve code for readability. Finally, we modified convertVCF.sh to filter out any variant whose read depth (DP) value was zero, in order to prevent division-by-zero errors that occurred with our dataset. Rather than replacing the original SNPiR files in our distribution, we have included both versions and prefixed our file names with "revised ".
For comparison with our method, we implemented RADIA with the following modification: During BLAT filtering, RADIA also incorrectly handled the hardclipped reads. We corrected the code for the same reasons as described for the SNPiR implementation.
For creation of the figures, the R package ggplot2 [39] was used.

Aligning sequences
The procedure for the alignment to the reference genome followed GATK Best Practices [40,41] (Figure   6). For RNA-seq, we used the STAR aligner in 2-pass mode with the parameters implemented by ENCODE project. The resulting aligned reads were processed to add read groups, sort, mark duplicates, split reads that spanned splice junctions, create an index, realign around known indels, reassign mapping qualities, and recalibrate base quality scores.
For DNA, we used the BWA-MEM aligner with the same reference genome. The resulting aligned reads were processed to add read groups, sort, mark duplicates, create an index, realign around known indels, reassign mapping qualities, and recalibrate base quality scores.

Calling variants
A refined BAM file for each sample is then used to process the variant calling. Three different methods for calling are used: RVboost, SNPiR, and MuTect2. The first two methods are for germline variants in RNA and the last method is for somatic variants in DNA. None of these methods is for somatic variant calling in RNA. RVboost and SNPiR use the same variant caller, UnifiedGenotyper from GATK, but different filtering procedures. RVboost filters variants using a statistical learning method called boosting, whereas SNPiR uses hard filtering in 7 steps (Supplementary Table 4). To adapt MuTect2's results for RNA, we implemented three of SNPiR's hard-filtering steps. RVboost and SNPiR only need the refined RNA BAM file from the tumor tissue. MuTect2 needs both the refined RNA BAM from the tumor tissue and the refined DNA BAM from normal tissue.

Filtering somatic variants by caller intersection and additional hard filters
In addition to the filtering procedures of the variant callers themselves, we further filtered our results by taking an intersection of vcf files from the three callers. We restricted our final, combined callset to the variants called by all three methods (Tier 1) or supplemented by variants called by MuTect2 and SNPiR (Tier2). We also applied our own hard filters, only accepting variants with a read depth (DP) of at least five and a VAF of less than 3% in uniquely mapping reads (Mapping quality of at least 40) in the normal DNA at the corresponding position.

Weighting of Features
For the performance of different weighted combinations of the three callers, namely SNPiR (s), Rvboost (r) and MuTect2 (m), we performed two experiments using all variants in coding regions that have a read depth DP > 10 in RNAseq. The weight w i of caller i was calculated as follow: w i =  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 values v i ranged from 0 to 1 in increments of 0.1. To find the best weighted combination, we determined an evalue, which is calculated as sum of all false-negative and all false-positive variants. Next, we calculated the area under the precision-recall curve (AUPRC), sensitivity/recall, specificity and precision. Experiment 1 (Supplementary Table 5): For all variants called by at least one caller, we calculated the weighted score s as follows: (1) or no call (0) made by caller i. We then identified the optimal threshold of s that provides the lowest evalue. This was done for each weighted combination of callers. Experiment 2 (Supplementary Table 6): We calculated the evalue for each weighted combination to determine the optimal threshold for which the variant is called multiplied by the variant allele frequency (vaf ) and adjusted to the dynamic range of the callers as follow: The threshold is a value of s between -3 and 3 at which the lowest evalue is achieved.

Processing artificial spiked variants
We used BAMSURGEON to spike in 200 variants in coding regions of two ovarian tumor samples, such that each sample had a different random frequency of spiked-in variants. The samples were then processed by VaDiR.
Processing samples with RADIA Six samples from TCGA, three from resistant patients and three from sensitive patients, were processed with RADIA. This analysis required three BAM files from each sample: one from normal blood DNA, one from tumor DNA, and one from tumor RNA. We followed the instructions provided by RADIA for filtering. We used all possible filters provided by RADIA.

AVAILABILITY AND REQUIREMENTS
• Project name: somatic VaDiR • Project home page: e.g. http://to.be.added.later  Ethics approved and consent to participate The datasets were obtained from the Cancer Genome Atlas, and the use of data was approved under the Project #4017 at dbGaP.

Competing interests
The authors declare that they have no competing interests.         Your PDF file "VaDiR_bmc_new_revised_jc.pdf" cannot be opened and processed. Please see the common list of problems, and suggested resolutions below.