A deep-learning-based RNA-seq germline variant caller

Abstract Summary RNA sequencing (RNA-seq) can be applied to diverse tasks including quantifying gene expression, discovering quantitative trait loci and identifying gene fusion events. Although RNA-seq can detect germline variants, the complexities of variable transcript abundance, target capture and amplification introduce challenging sources of error. Here, we extend DeepVariant, a deep-learning-based variant caller, to learn and account for the unique challenges presented by RNA-seq data. Our DeepVariant RNA-seq model produces highly accurate variant calls from RNA-sequencing data, and outperforms existing approaches such as Platypus and GATK. We examine factors that influence accuracy, how our model addresses RNA editing events and how additional thresholding can be used to facilitate our models’ use in a production pipeline. Supplementary information Supplementary data are available at Bioinformatics Advances online.


Introduction
RNA-seq is a widely used method for transcriptome analysis. The technology is often used to study gene expression in a variety of contexts. Expression can be examined over time (Zhang et al., 2014), in response to disease or environmental changes (Kakumanu et al., 2012;Ren et al., 2012), or to examine differences across cell types, tissues or species (Gerstein et al., 2014;The GTEx Consortium, 2015;Villani et al., 2017;Zhang et al., 2022). In addition to gene expression, RNA-seq can be used for identifying quantitative trait loci (QTL) (Vigorito et al., 2021), quantifying isoform and allelespecific expression (Li and Dewey, 2011;Miao et al., 2018;Raghupathy et al., 2018), detecting gene fusion events (Haas et al., 2017), and in identification of RNA-editing events (Bahn et al., 2012;Lo Giudice et al., 2020). Interestingly, it is also possible to use RNA-seq to perform germline variant calling (Brouard et al., 2019). However, although germline variant calling with RNA-seq data could be beneficial in many studies, it is an underutilized application of RNA sequencing (Curry et al., 2021).
The lack of studies using RNA-seq for germline variant calling is likely driven by a number of challenges not present with traditional DNA sequencing (DNA-seq) approaches. For example, alignment coverage, a proxy for gene expression, exhibits more dynamic range in RNA-seq than DNA-seq. Coverage can range from little to none to excessively high levels. Low coverage can reduce or eliminate evidence of germline variants and complicate genotype classification (Kukurba and Montgomery, 2015;Sims et al., 2014). Additionally, RNA splicing, allele-specific expression and RNA editing can skew allele frequencies, introduce artifacts or remove variant signals (Guo et al., 2017;Jehl et al., 2021). Sample preparation can also introduce errors. For example, PCR can skew allele frequencies due to amplification bias (Parekh et al., 2016). Finally, contamination by RNAse enzymes can degrade RNA, and RNA-seq often requires an additional reverse transcription step to produce a cDNA library that can introduce errors (Schroeder et al., 2006). These issues can be challenging to address in clinical and research settings.
Despite these challenges, calling germline variants from RNAseq data provide a number of advantages over traditional DNA-seq approaches. Foremost is an economic consideration: RNA-seq can provide germline variant calls where one might otherwise sequence an additional DNA sample (Ozsolak and Milos, 2011). This is especially useful when characterizing coding variation, which is captured with most RNA-seq approaches, providing a rapid and costeffective way to interrogate coding regions. Additionally, the versatility of RNA-seq allows germline variation data to be combined with information about transcription and enables QTL discovery (Sun and Hu, 2013).
Existing RNA-seq variant callers use statistical or algorithmic approaches (DePristo et al., 2011;Oikkonen and Lise, 2017). However, these methods are challenging to implement given the numerous additional sources of error and uncertainty in RNA-seq data, and the complex interactions between them. In contrast, deeplearning approaches are capable of learning distinct patterns, extracted from sequence data, that are associated with the validity of candidate variants. In recent years, deep-learning approaches have been shown to be highly competitive and often exceed the performance of statistical and algorithmic approaches (Olson et al., 2022). Here, we introduce a new method for performing variant calling on RNA-seq data. We adapt DeepVariant, a deep-learningbased variant caller, for use in calling germline variants from RNA-seq data. Our best performing DeepVariant RNA-seq model is trained using data from the Genotype-Tissue Expression (GTEx) consortium (The GTEx Consortium, 2015, and can overcome challenges associated with RNA-seq variant calling and accurately classify germline variants. Our GTEx-based DeepVariant RNA-seq model outperforms existing approaches, achieving an F1 score of 0.933 on coding sequences (CDS) and 0.921 on exonic sequences.
The DeepVariant RNA-seq model and code are available under an open-source license at https://github.com/google/deepvariant.

Software implementation
DeepVariant is a deep-learning-based variant caller that uses a convolutional neural network to classify the genotype of a position in the genome by learning how read data correlate with real and spurious variants (Poplin et al., 2018). To classify variants, DeepVariant first scans a BAM file to identify evidence of SNP or INDEL variation. For example, a candidate variant SNP is identified when two or more bases differ from the reference at a particular site. When candidate sites are identified, DeepVariant will proceed to generate an example, or model input. Examples are constructed using read pileups by extracting a series of sequence features into channels. Each channel represents a sequence feature such as base, base quality or mapping quality. These channels are stacked to form a model input for training or inference. Following training, DeepVariant can use the resulting model to classify candidate variants as being homozygous-REF, heterozygous or homozygous-ALT.
We adapted the data preprocessing stage of DeepVariant to allow for processing of RNA-seq alignment data (BAMs). RNA-seq BAMs, in contrast to DNA-seq BAMs, contain SKIP CIGAR operations which are used to represent skipped regions ( Supplementary  Fig. S1) in the reference genome. These are generally caused by splicing events, although structural variation and mismapping of read segments are other possible explanations. Previously, DeepVariant would incorporate these SKIP operations during the local realignment step, resulting in a substantial increase in the computation time. To address this issue, we introduced a new flag (-split_skip_reads) that splits reads into multiple segments by removing their SKIP operations. This change allowed DeepVariant to process RNA-seq data efficiently ( Supplementary Fig. S2). With these changes in place, model training and inference operate similar to existing approaches that use DNA-seq.

Model training and evaluation
Once we updated DeepVariant to process RNA-seq data, we trained models using two datasets. For the first model, we used samples from the Genome in a Bottle consortium (GIAB) (Poplin et al., 2018;Wagner et al., 2022;Zook et al., 2016). The GIAB consortium has developed a high quality set of genotypes for seven human cell lines (Wagner et al., 2022), derived from two sets of trios and a singleton. These genotypes can be used as training labels or for benchmarking purposes. To train the DeepVariant RNA-seq GIAB model ('DV RNA-seq [GIAB]'), we extracted RNA from cell pellets of lymphoblastoid lines HG001 (GM12878; 10 replicates), and HG002 (GM26105; three replicates) obtained from the Coriell institute. This RNA was sequenced with the Tempus xT assay, which includes exome-capture regions that cover the exons of 19 433 human genes (34 Mb target region of the human genome) (Beaubier et al., 2019). Sequenced reads were aligned using STAR v2.5.4. The resulting BAMs were used for training, using data from chr2-19; chr21 and chr22 were used for hyperparameter optimization, with chr1 and chr20 held out for testing. For training we used the variants from the truth set from Genome-in-a-Bottle v4.1 and restricting to high confidence regions(Cleary_undated-zh). The GIAB dataset allowed us to train a model using high-quality labels.
We used RNA-seq data from GTEx V8 (phs000424.v8.p2) to train our second model ('DV RNA-seq [GTEx]'). The GTEx consortium has extensively characterized human tissue samples from postmortem donors, and provides whole-genome DNA-seq and RNAseq from a large collection of tissues. RNA-seq data from the GTEx project was sequenced using the Illumina TruSeq non-stranded polyAþ selection protocol. We assigned GTEx donors to a train or test group, and randomly selected 1000 RNA-seq samples from our train donor group (526 unique donors across 41 tissue types; Supplementary Table S1). Similarly, we randomly selected 200 RNA-seq samples for testing (74 unique donors across 36 tissue types; Supplementary Table S1). The GTEx dataset allowed us to train a model with a large number and variety of samples. Both models were trained using Inception V3, a convolutional neural network model (Szegedy et al., 2015). Parameters are detailed in Supplementary Table S2.
To derive a truth dataset for training the GTEx model, we ran DeepVariant's highly accurate WGS model on GTEx DNA-seq samples and used the resulting genotypes as labels for training and testing our RNA-seq model. We refer to this training data as a 'silver truth' dataset because our labels are unlikely to be as accurate as GIAB for training. Nevertheless, this approach generated over 80.4 million training examples, 314Â more than was used for the GIAB dataset, and provided a set of training examples with more diversity across tissue and ancestry. We also partitioned the genomes in our silver truth dataset: chr1-19 were used for training, chr21 and chr22 were used for tuning, and chr20 was held out for testing. We performed training by subsetting on exonic regions, and by warm starting from an existing whole-exome sequence trained DeepVariant model. A summary table of the datasets for both the GTEx and GIAB models is provided in Supplementary Table S3. We used prealigned BAMs and CRAMs provided by the GTEx consortium. The V8 release RNA-seq BAMs aligned using STAR v2.5.3a, and DNAseq BAMs aligned using BWA (v0.7.13-r1126).
Separately, we also sequenced HG002 (GM24385, GM26105, GM27730) and HG005 (GM26107) on the NovaSeq platform using an mRNA poly-A-based library preparation. These samples were processed using the nf-core/rnaseq pipeline (Ewels et al., 2020) (10.5281/zenodo.1400710). We used HG002 with our GTEx model to establish filtering cutoff thresholds. HG005, which was held out from training for both GIAB and GTEx models, was used to compare performance of both models. Benchmarking was performed using GIAB Truth set v4.2.1. We used hap.py with the RTG vcfeval engine to perform benchmarking (https://github.com/Illumina/hap. py). Variants were required to have a minimum depth of 3Â in the RNA-seq sample in order to be considered.

DeepVariant RNA-seq produces accurate germline variant calls
We trained two DeepVariant RNA-seq models: 'DV RNA-seq [GTEx]' and 'DV RNA-seq [GIAB]'. We compared the performance of the GTEx and GIAB models using HG005 (GM26107; poly-A capture), which was held out from training for both models. Performance was stratified across CDS, exon, transcript, gene, highconfidence GIAB regions (GIAB-hc), chr20 (which was held out from all training data) and regions with a minimum of 3Â depth. Our initial analysis here examines F1 scores as they are sensitive to both precision and recall and enable comparison of models using a single value. Our subsequent analyses consider precision and sensitivity separately. The GTEx model outperforms the GIAB model, except for INDEL exon regions (Table 1). We chose to focus most of our subsequent analysis on DV RNA-seq [GTEx] given that it outperformed the GIAB model and fits a desirable use case of calling SNPs in CDS regions.

Comparison of DeepVariant RNA-seq to other open source callers
We compared the performance of our GTEx and GIAB models with several other RNA-seq variant calling approaches using our GTExderived 'silver truth' dataset. Our comparison includes the warm-start model trained on whole-exome data and used for training (DV WES; DeepVariant v1.1.0) to see how much training on RNA-seq data improved our model. We also tested GATK HaplotypeCaller (v4.2.4.0) (DePristo et al., 2011) based on a RNA-seq workflow developed by the GATK team. Finally, we tested Platypus (v0.8.1.2) (Rimmer et al., 2014). Platypus requires the BAM input to be preprocessed with a tool called Opossum (v1.0), which performs filtering and processing of RNA-seq BAMs to optimize them for RNA-seq variant calling (Oikkonen and Lise, 2017). We also tested other callers with and without Opossum-preprocessed BAMs.
We used our GTEx test data to compare performance across variant callers. Similar to our model comparison analysis, we stratified across several regions intersected at a 3Â minimum coverage. We observe that DV RNA-seq [GTEx] achieves the highest median F1 score across all region stratifications (Table 2), performing the best overall in CDS regions (Fig. 1). We observe the same trend with INDELs as well ( Supplementary Fig. S3). Interestingly, we found that preprocessing with Opossum lowered the performance of DeepVariant, likely because the alignment data were substantially modified from our training dataset. However, our modifications to DeepVariant obviate the need for preprocessing while still resulting in more accurate results.
We also evaluated the performance on chr20 that was held out from training and again found DV RNA-seq [GTEx] to outperform alternative variant callers (Supplementary Table S3).

Factors impacting RNA-seq model performance
Next, we investigated factors that impact DV RNA-seq [GTEx] performance, focusing on the use case of calling SNPs in CDS regions. We examined F1 scores after restricting variant calling to sites at a series of coverage thresholds in CDS regions ( Fig. 2A). For each threshold, we indicate the relative precision and sensitivity.
Assuming variants are detectable at a minimum depth of 3Â, we asked what proportion of these detectable variants remain across minimum coverage thresholds ( Supplementary Fig. S4). Although our examination here will be dependent on coverage for a given sample, our analysis suggests that a modest coverage threshold of We also examined the correlation between the RNA integrity number (RIN) and RNA-seq sample accuracy. An RIN score can be calculated for a given RNA sample using an algorithm that examines the ratio of 28S and 18S ribosomal RNA. RIN scores range from 1 (totally degraded) to 10 (intact) (Schroeder et al., 2006), and are used as an indicator of RNA quality. We observe a weak correlation between RIN and CDS SNP F1 scores (R ¼ 0.26; P ¼ 0.00015; Fig. 2B), suggesting that DeepVariant RNA-seq can be used on samples with low RIN scores. We also observe a weak correlation for INDELs as well (R ¼ 0.16, P ¼ 0.021; Supplementary Fig. S5).
Tissue type can strongly influence RNA-seq data. We examined how GTEx tissues impacted DeepVariant RNA-seq performance in CDS regions. We observed the highest F1 score with skin and the lowest performance in blood (Fig. 2C). However, performance for SNPs in CDS regions was relatively consistent across tissues. One major factor that may drive differences in performance across tissue is the proportion of abundant versus rare transcripts, which in turn affects the coverage at variant positions.
Unlike DNA, RNA-seq reads possess splicing boundaries. We reasoned that error rates might increase near exon boundaries, where read context around candidate variants will drop off and make accurate variant classification more difficult. Unsurprisingly, we observe this effect across several RNA-seq variant calling methods (Fig. 3). However, DV RNA-seq [GTEx] appears to largely eliminate this effect for false-positive (FP) SNP variants at exon boundaries, and achieves low false-negative (FN) levels as well.

Our model learns to ignore RNA-editing events
RNA is subject to editing by ADAR (adenosine deaminases acting on RNA), which can result in Adenosine-to-inosine (A-to-I) changes (Walkley and Li, 2017). These A-to-I changes are observed in RNAseq datasets as apparent A-to-G and T-to-C changes which do not reflect real variants in the DNA, and can increase the germline FP rate. We sought to examine how our GTEx-based model treated these types of events. To do this, we annotated variants that belonged to an RNA editing database called REDIportal (Mansi et al., 2021). REDIportal has identified and annotated genomic positions where RNA editing has been observed using GTEx data.
We asked how the variants compared with the label dataset based on the base change observed and whether they were present in the REDIportal database. Interestingly, we observe that the DV RNA-seq [GTEx] model has a large reduction in the number of FPs at REDIportal sites compared to the DV WES model. In Figure 4, we show a representative example of this for GTEx sample GTEX-ZZ64-1126-SM-5GZXY. We also observe this phenomenon in an aggregate analysis (Supplementary Fig. S6). The REDIportal FP panel shows a considerable reduction in the FP rate at sites where RNA editing has been observed. The reduction in FPs does appear to slightly increase the FN rate at these sites. However, we observe an overall reduction in errors, suggesting that our model learns to ignore RNA-edited sites. We suspect that our model picks up on sequence context and allele frequency differences at these sites as a basis for filtering them out.

Selecting a cutoff for practical implementation
When implementing a variant caller into a production analysis pipeline, it is convenient to select a cutoff based on quality scores to maintain a desired true positive rate (TPR) or false discovery rate (FDR) based on the application. For example, the 1000 Genomes Project FDR target was 5% (although ultimately the project delivered calls at 1% FDR) (Siva, 2008). In this example, we will similarly illustrate a threshold that favors precision over sensitivity.
Variability between samples is expected due to both random sampling during the sequencing process and variation among samples. Therefore, cutoff scores should be established by examining multiple replicates or samples using validated variants. We appropriate cutoff score that would maintain an FDR of 1.5% while retaining high sensitivity.
We considered two quality scores to use for setting a cutoff: Genotype quality (GQ) and quality (QUAL  . We filtered tissues with >5 samples in our test set probability of a variant call being incorrect, whereas QUAL describes the probability a given site is not a variant. We found GQ worked well for both SNPs and INDELs for thresholding in comparison to QUAL scores, and decided on a cutoff of GQ ! 18 (GQ: Fig. 5; QUAL: Supplementary Fig. S6). This threshold achieves an FDR of <1.5% for both SNPs and INDELs while maintaining high sensitivity for SNPs, and modest sensitivity INDELs. Finally, we evaluated the GQ ! 18 threshold using an LCL line from HG005 (GM26107; poly-A selected) to see how it would perform against unfiltered models and other RNA-seq variant callers. We again examined performance in high-confidence GIAB regions intersected with CDS regions. We find that DV RNA-seq [GTEx] GQ ! 18 is able to achieve a precision of 0.998 on SNPs and 0.989 on INDELs with similar levels of sensitivity we observed when deriving our cutoff (Table 3).

Runtime
We examined the runtime performance of our DeepVariant RNAseq models against existing approaches ( Supplementary Fig. S8). All runtimes were measured using GTEx RNA-seq BAM files on Google Cloud n1-standard-64 virtual machines (64 cores, 240 Gb ram). Using Opossum preprocessed bams generally reduced the required runtime of variant callers, but would add an additional runtime of 15.6 min to preprocess BAM files. DeepVariant RNA-seq models are faster than GATK (Table 4). Although Platypus runs faster than DeepVariant, running Platypus requires preprocessing with Opossum. The total runtime of Platypus þ Opossum results in a similar runtime to DeepVariant RNA-seq models without Opossum preprocessing (Table 4).

Discussion
This study introduces DeepVariant RNA-seq, an extension of DeepVariant that enables germline variant calling using Illumina short-read RNA-seq data. To the best of our knowledge, this is the first deep-learning-based RNA-seq germline variant caller. However, an existing tool called SmartRNASeqCaller (Bosio et al., 2019) is a machine-learning-based post-processing pipeline that operates on variants called with germline RNA-seq variant callers such as GATK. DeepVariant RNA-seq can operate in a standalone manner, without requiring pre-or postprocessing.
Additionally, we have demonstrated that a 'silver-truth set' can be used to generate highly accurate models when used appropriately. The GTEx-based model was trained using label variants derived from the DeepVariant whole-genome sequencing (DV WGS). Note that we only use a fraction of these labels for training where we observe adequate RNA-seq coverage. Although these labels are likely of poorer quality than GIAB-based truth sets, we observed that this approach allowed for a substantially larger and more diverse training dataset than the GIAB-based model. This approach ultimately yielded improved performance compared with the GIAB-based training approach, suggesting that a silver truth may be appropriate in cases which can dramatically expand gold standard datasets of limited size.
Our examination of RNA editing events suggests our model learns to ignore these types of events, because it is trained on germline variant calls only. Conversely, this suggests that we could train a model capable of calling these RNA-editing events by configuring our model to produce an additional output (e.g. probability of a site being an RNA editing event). Such functionality could integrate germline and RNA-editing events into a single variant call format file, and allow for investigations of RNA editing and germline variant interactions. Consider the result when a heterozygous variant occurs adjacent to an RNA editing event within a codon. For example, [heterozygous genotype: T/C] þ [RNA edit: A/G] þ [C] could produce four different protein isoforms: UAC (Tyr), UGC (Cys), CAC (His) and CGC (Arg). Although CDS editing is rare, this extension of our work would allow for these types of events to be identified, in addition to long-range combinatorial effects of genotype and RNA edits. We expect our models to be sensitive to sample type, library preparation and/or bioinformatic methods. For example, we observed considerable differences in terms of performance when Opossum was applied to preprocess RNA-seq bams. Similarly, we expect using alternative RNA-seq aligners (other than STAR) may impact results. We will note, however, that DeepVariant performs a realignment step that may mitigate this effect somewhat.
By releasing the DeepVariant RNA-seq models, we hope to provide reliable methods to add RNA-seq variant detection to typical RNA analysis use cases. We welcome community feedback to further enhance DeepVariant RNA-seq.  Note: The duration required to run opossum, the variant caller and the total duration are listed.

Software and data availability
The DeepVariant RNA-seq GTEx model and documentation are available at https://github.com/google/deepvariant. See the RNA-seq case study for a tutorial on how to use the RNA-seq model. URLs for GIAB RNA-seq FASTQs and BAMs are listed in Supplementary  Table S5.

Funding
This work was supported by Google LLC and Tempus.