NPSV-deep: a deep learning method for genotyping structural variants in short read genome sequencing data

Abstract Motivation Structural variants (SVs) play a causal role in numerous diseases but can be difficult to detect and accurately genotype (determine zygosity) with short-read genome sequencing data (SRS). Improving SV genotyping accuracy in SRS data, particularly for the many SVs first detected with long-read sequencing, will improve our understanding of genetic variation. Results NPSV-deep is a deep learning-based approach for genotyping previously reported insertion and deletion SVs that recasts this task as an image similarity problem. NPSV-deep predicts the SV genotype based on the similarity between pileup images generated from the actual SRS data and matching SRS simulations. We show that NPSV-deep consistently matches or improves upon the state-of-the-art for SV genotyping accuracy across different SV call sets, samples and variant types, including a 25% reduction in genotyping errors for the Genome-in-a-Bottle (GIAB) high-confidence SVs. NPSV-deep is not limited to the SVs as described; it improves deletion genotyping concordance a further 1.5 percentage points for GIAB SVs (92%) by automatically correcting imprecise/incorrectly described SVs. Availability and implementation Python/C++ source code and pre-trained models are freely available at https://github.com/mlinderm/npsv2.


Image generation
The NPSV-deep image representations (Figure 1A) are 9-channel tensors incorporating the alignment, allele, insert size, quality and haplotype information described below and in Supplemental Table 1. The image region is centered on and spans the entire event plus a flanking region. During image generation, each pixel is a single sequenced base. Reads are rendered in sorted order of genomic coordinates. Due to the large number of reads spanning larger SVs, the pileup images are generated as coverage histograms, i.e., horizontally adjacent pixels may be derived from different reads. If the coverage would exceed the maximum image height, reads are randomly downsampled so that the total coverage is less than the number of image pixels. The resulting variable-width image (SV width plus flanks) is compressed to a fixed size (default of 100×300).
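The rendering, downsampling and resizing steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: `render_pileup`, its read representation and the block-averaging resize are assumptions; only a single channel is populated.

```python
import numpy as np

def render_pileup(reads, region_width, out_shape=(100, 300), channels=9, rng=None):
    """Hypothetical sketch of NPSV-deep-style pileup image generation.

    reads: list of (start, length, value) tuples in image-local coordinates.
    Reads are rendered one row each in the given (coordinate-sorted) order;
    if coverage exceeds the image height, reads are randomly downsampled, and
    the variable-width canvas is compressed to a fixed size at the end.
    """
    rng = rng or np.random.default_rng(0)
    height, width = out_shape
    # Randomly downsample reads if coverage would exceed the image height
    if len(reads) > height:
        keep = sorted(rng.choice(len(reads), size=height, replace=False))
        reads = [reads[i] for i in keep]
    # Variable-width canvas: one pixel per sequenced base
    img = np.zeros((height, region_width, channels), dtype=np.float32)
    for row, (start, length, value) in enumerate(reads):
        img[row, start:start + length, 0] = value  # channel 0: base alignment
    # Compress to a fixed width by averaging column blocks (a stand-in for
    # the actual image resize)
    cols = np.array_split(np.arange(region_width), width)
    return np.stack([img[:, c, :].mean(axis=1) for c in cols], axis=1)
```

The block average preserves the coverage-histogram interpretation: a compressed pixel reflects the mean signal of the bases it spans.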
Base alignment, e.g., match, mismatch, soft clip, is derived from the CIGAR string of the original alignments. Reads are assigned to the reference or alternate allele by realigning the read to the reference and alternate sequence (derived from the putative SV) using BWA 1 (via SeqLib 2 ) and a scoring metric adapted from svviz2 3 . Only reads originally aligned within some flanking distance of the image region are realigned. We haplotag reads with WhatsHap 4 and the phased input SNVs (provided as an optional input) to identify SV features observed only on a single haplotype (indicative of a heterozygous genotype) 5,6 . We compute reference and alternate insert size z-scores from the SV length and an estimate of the insert size distribution. Since the fragment insert could span the entire image region and, thus, not be represented by any bases in the image, we include "insert" pixels that encode the insert size for spanning fragments with reads outside the image region. Enumerated values, e.g., strand, are mapped to fixed pixel values, while continuous values, such as quality scores, are linearly mapped to pixel ranges. Pixels with all zeros (black in the composite images in Figure 1) indicate the absence of a sequenced base.
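The insert size z-scores can be illustrated with a small sketch. The function name and exact formulation are assumptions; the idea is that a deletion shortens the apparent insert by the SV length, so a spanning fragment is scored against both allele hypotheses:

```python
def insert_size_zscores(observed_insert, sv_length, mean, sd):
    """Hypothetical sketch: z-scores of a spanning fragment's insert size
    under the reference allele and under a deletion allele (which shortens
    the expected span by the SV length)."""
    z_ref = (observed_insert - mean) / sd
    z_alt = (observed_insert - sv_length - mean) / sd
    return z_ref, z_alt
```

For example, with a 500 bp mean insert (SD 100), a 1320 bp fragment spanning an 822 bp deletion is a strong outlier under the reference allele but entirely typical under the deletion allele.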
The same image generation procedure is applied to the actual and simulated SRS data. Similar to NPSV 7 , the simulated data is generated with the ART SRS simulator 8 configured to match the original sequencing data (sequencer error model, read length, coverage, insert size distribution) and aligned with the same pipeline as the actual data. In this evaluation we align reads with BWA-MEM 1 and mark duplicates with samblaster 9 to mimic the BCBio pipeline applied to the actual data 10 . To enable haplotagging, we insert phased SNVs into the simulated haplotypes (no relationship between the SV and SNV alleles is assumed). A preprocessing step computes the mean, per-chromosome coverage and insert size distribution with bedtools 11 , SAMtools 12 and indexcov 13 to inform the SRS simulation and image generation.
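The insert-size portion of the preprocessing step can be sketched as below. This is a stand-in for the bedtools/SAMtools/indexcov-based pipeline, not the actual code; the function name and the use of raw template lengths are assumptions:

```python
import statistics

def insert_size_stats(template_lengths):
    """Hypothetical sketch: estimate the insert-size distribution (mean and
    standard deviation of absolute template lengths, e.g., TLEN values of
    properly paired reads) used to configure the ART simulation and the
    insert-size image channels."""
    sizes = [abs(t) for t in template_lengths if t != 0]
    return statistics.mean(sizes), statistics.stdev(sizes)
```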

Neural network architecture and training
Figure 1B shows the genotyping model. NPSV-deep implements a paired network to transform the paired images (100×300×9) of actual and simulated data using identical DNNs into a joint embedding space (d = 512) 14 . The genotype label is determined from the label of the simulated data most similar to the actual data (smallest Euclidean distance). For multi-allelic SVs, the genotype is determined from the smallest distance across all possible combinations of alleles and genotypes. The neural network, training and inference were implemented with Tensorflow v2.8 15 .
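The similarity-based inference step can be sketched as follows. This is a minimal illustration with a hypothetical helper, not the actual API: given the embedding of the actual pileup image and one embedding per simulated genotype, the call is the genotype with the smallest Euclidean distance:

```python
import numpy as np

def predict_genotype(actual_emb, sim_embs):
    """Hypothetical sketch of the inference step: sim_embs maps each candidate
    genotype label to the embedding of its simulated pileup image; the
    predicted genotype minimizes Euclidean distance to the actual image's
    embedding in the joint space."""
    dists = {gt: float(np.linalg.norm(actual_emb - np.asarray(emb)))
             for gt, emb in sim_embs.items()}
    return min(dists, key=dists.get), dists
```

For multi-allelic SVs the same minimum would simply be taken over embeddings simulated for every allele/genotype combination.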
The DNN backbone is shown in the Figure 1A inset. The backbone is the Inception V3 image classification network with the original last layer removed, connected to a single fully connected projection layer 16 . We implement a standard training loop using the Adam optimizer and a fixed learning rate (0.004). The gradients for the paired backbone networks are additive due to the shared weights 14 . Due to the structural differences between the pileup images and the ImageNet dataset used to originally train the Inception V3 model, we do not implement transfer learning and instead train the entire DNN (including Inception V3) from random initial weights.
We train the genotyping model using a 30-sample subset of the HGSVC2 sequence-resolved phased SV and SNV call set, with the NA12878 and HG002/NA24385 samples and their private SVs excluded 17 . The SRS data is sourced from the high-coverage 1000 Genomes dataset 18 . The multi-sample HGSVC2 call set contains multiple similar and overlapping SVs in some regions. To reduce instances of putative homozygous reference SVs partially overlapping non-reference SVs, we filtered the training call set to isolate non-reference SVs. For each sample, we grouped overlapping SVs. Groups with only reference SVs were retained as-is; groups with a single non-reference SV were collapsed to that non-reference SV; groups with multiple non-reference SVs were discarded. The resulting training call sets contained 28,555-30,794 DEL and 40,269-43,003 INS SVs per sample.
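The group-level filter described above reduces to a few lines. This is an illustrative sketch (the function name and `(sv_id, is_nonref)` representation are assumptions, not the actual code):

```python
def filter_training_groups(groups):
    """Hypothetical sketch of the training-set filter. Each group is a list
    of (sv_id, is_nonref) tuples for overlapping SVs in one sample:
      - all-reference groups are retained as-is,
      - groups with exactly one non-reference SV collapse to that SV,
      - groups with multiple non-reference SVs are discarded."""
    kept = []
    for group in groups:
        nonref = [sv_id for sv_id, is_nonref in group if is_nonref]
        if not nonref:
            kept.extend(sv_id for sv_id, _ in group)  # retained as-is
        elif len(nonref) == 1:
            kept.append(nonref[0])                    # collapse to the non-ref SV
        # else: multiple non-reference SVs, group discarded
    return kept
```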
We trained a cohort of models for each SV type with a range of hyperparameters: 1, 3 or 5 simulated replicates; 10 or 20 epochs; and with/without random simulated depth augmentation (randomly setting the simulated depth to between 50% and 100% of the actual depth). All models are trained to fixed endpoints without early stopping or other validation criteria. The final ensemble of five models for each variant type (DEL, INS) was selected to achieve the best combined genotyping accuracy on the Syndip validation dataset 19 . To validate model precision, we augmented the Syndip dataset with putative false positive SVs called by Lumpy 20 and Manta 21 in the Syndip SRS data.
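The ensemble selection can be sketched as a search over model subsets. This is a simplified stand-in: here "combined" accuracy is naively approximated as the mean of the member accuracies, whereas the actual criterion is the ensemble's genotyping accuracy on Syndip; the function name is an assumption:

```python
from itertools import combinations

def select_ensemble(model_scores, size=5):
    """Hypothetical sketch: from the trained cohort (name -> validation
    accuracy), choose the subset of `size` models with the best combined
    score, approximated here as the mean of member accuracies."""
    best = max(combinations(model_scores.items(), size),
               key=lambda combo: sum(score for _, score in combo) / size)
    return [name for name, _ in best]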

SV refining
Figure 4A shows the SV refining workflow. For each DEL SV overlapping a repetitive region (the UCSC simple repeats track in this evaluation), the SV proposer generates alternative SV descriptions at all possible starting positions in the union of that region and any "widened" region provided in the VCF 22 . The read support filter optionally excludes any of the resulting proposed SVs that induce unique k-mers (k = 31) not present in the SRS data. We genotype all SVs (original and proposed) using NPSV-deep. To minimize the noise observed in Figure 4C, we smooth the distances for each set of proposed SVs (original SV and size) with an exponentially weighted moving average. The SV refiner groups all proposed SVs within some distance (default of 1000 bp). For each group, we rank the proposed SVs by the minimum non-reference distance (smallest distance between actual and simulated data for heterozygous and homozygous alternate genotypes). If the smallest non-reference distance is less than that of its corresponding original SV, i.e., the actual and simulated data are more similar for the proposed alternative SV than for the original SV, the proposal is chosen as the "best alternate" for the group. For each original SV in the group, we update its genotype if one of its proposals was selected as the best alternate. If another SV's proposal was selected as the best alternate, we set the genotype to homozygous reference under the assumption that there is only one non-reference SV per region. All SVs without a best alternate remain unchanged.
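The smoothing and best-alternate steps can be sketched as below. Both helpers are hypothetical illustrations: the smoothing weight `alpha` is an assumption (the actual parameter is not stated), and proposal distances are represented as a simple name-to-distance mapping:

```python
def smooth_distances(dists, alpha=0.1):
    """Hypothetical sketch: exponentially weighted moving average over the
    per-proposal non-reference distances, ordered by proposed position."""
    out, acc = [], dists[0]
    for d in dists:
        acc = alpha * d + (1 - alpha) * acc
        out.append(acc)
    return out

def choose_best_alternate(original_dist, proposal_dists):
    """Hypothetical sketch: return the proposal whose minimum non-reference
    distance beats the original SV's, or None if the original description
    remains the best fit (in which case all SVs stay unchanged)."""
    best = min(proposal_dists, key=proposal_dists.get)
    return best if proposal_dists[best] < original_dist else None
```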

Evaluation
As noted in the main text, we evaluated the proposed and comparison genotypers using the Genome in a Bottle (GIAB) v0.6 SV call set in HG002/NA24385 (GRCh37) and the Polaris 2.1 and HGSVC2 freeze 3 SV call sets in NA12878 (GRCh38) 17,23,24 , following a previously reported approach 7 . SVs smaller than 50 bp or larger than 15 Mbp, SVs filtered in the upstream dataset (except for "LongReadHomRef" SVs in GIAB, described as "Long reads supported homozygous reference for all individuals"), and SVs with missing genotypes are excluded. GIAB SVs outside Tier 1 and Tier 2 regions were also excluded. Supplemental Table 2 shows the breakdown of genotypes in each dataset.
We executed all tools with the same input VCFs and SRS data using the default parameters. VCF FILTER annotations reduce concordance (since such SVs are treated as no-calls) and so are ignored. The default parameters may not represent the best possible configuration for each tool. Nor does this evaluation utilize all the capabilities of the comparison tools, which may support variant types that NPSV-deep does not, provide additional visualization features, or be optimized for population-scale genotyping. See the previously reported evaluation for additional descriptions of any tool-specific configuration 7 .
We calculated genotyping accuracy using Truvari, modified as described previously 7,25 . Genotype concordance is calculated as the fraction of matching homozygous reference, heterozygous and homozygous alternate genotypes among all genotyped SVs in the truth set. Non-reference concordance treats heterozygous and homozygous alternate genotypes as equivalent. "No calls", i.e., "./." genotypes, in the test set are treated as incorrectly genotyped. In the discovery context (using the output of SV discovery tools as the input SVs for genotyping), false positive SVs, i.e., variants not present in the truth set, but genotyped as homozygous reference in the test set are treated as correctly genotyped, since they are correctly indicated as not present in the individual. In the genotyping context, Truvari is configured to match SVs within 20 bp, of any allele similarity, with 60% overlap, while in the discovery context Truvari is configured to match SVs within 2000 bp, of any allele similarity, with 70% overlap 23 . Mendelian errors were identified in autosomal regions using BCFTools 26 . To identify offset SVs, we matched GIAB Tier 1 DEL SVs to SVs called by PBSV 2.21 27 in PacBio CCS reads (obtained from the GIAB FTP repository) using Truvari configured to match SVs within 2000 bp, with 70% allele and size similarity, and computed the maximum distance between the GIAB and PBSV breakpoints.

As a result of the repetitive context for this SV, no alternate spanning pairs or alternate allele reads were identified. However, as observed in the individual channels, the evidence is consistent with a homozygous alternate genotype, and not homozygous reference as might otherwise be expected.

Supplemental Figure 5: Genotyping accuracy incorporating SV refining for HGSVC2 SVs in NA12878 using the Platinum Genomes SRS data. The "npsv-deep (refine)" workflow treats similar or overlapping SVs as "proposals", applying the same refining algorithm to select the most likely SV for each group (predicting all others to be homozygous reference). The best concordance for each metric is highlighted with a black outline.
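The two concordance metrics used in the evaluation can be illustrated with a small sketch (a hypothetical helper, not the modified Truvari code; genotypes are represented as simple "0/0"/"0/1"/"1/1"/"./." strings):

```python
def concordance(truth, test, non_ref=False):
    """Hypothetical sketch of the concordance metrics: genotype concordance
    requires an exact genotype match; non-reference concordance additionally
    treats 0/1 and 1/1 as equivalent. No-calls ("./.") in the test set are
    always counted as incorrectly genotyped."""
    def norm(gt):
        return "ALT" if non_ref and gt in ("0/1", "1/1") else gt
    matches = sum(t != "./." and norm(t) == norm(g)
                  for g, t in zip(truth, test))
    return matches / len(truth)
```

For example, a heterozygous truth SV called homozygous alternate counts as an error for genotype concordance but as correct for non-reference concordance.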

Supplemental
Supplemental Table 3

Supplemental Figure 1 :
Image generation for an 822 bp homozygous alternate (1/1) DEL in HG002. The original SRS data is shown in an IGV 28 screenshot in the upper left. This SV is a deletion of a single copy of a repeat at 12:22129565-22130387 (GRCh37). Composite pileup coverage images generated by NPSV-deep for the actual and simulated SRS data are shown at the top right. The individual image channels are shown at the bottom.

Supplemental Figure 3 :
Genotype concordance for SVs stratified by size and whether the SV overlaps a tandem repeat (TR) > 100 bp. The background bars show the counts of SVs of different sizes and repetitive contexts in the GIAB call set. An expanded version of Figure 3D.

Table 1 :
NPSV-deep image channels. Enumerated values, e.g., strand, are mapped to fixed values, while continuous values, such as quality scores, are linearly mapped to ranges.

Table 3 :
Exact (non-reference) genotype concordance for GIAB v0.6 SVs, including "LongReadHomRef" SVs (where "long reads supported homozygous reference for all individuals"), in Tier 1 regions, and in Tier 1 and Tier 2 SVs combined. Best performance is bolded.

Table 4 :
Exact (non-reference) genotype concordance for Polaris 2.1 and HGSVC2 call sets in NA12878 using Platinum Genomes SRS data. Best performance is bolded.

Table 5 :
Exact (non-reference) genotype concordance for just the private SVs excluded from the training set and for all SVs in the HGSVC2 call set for NA12878 using Platinum Genomes SRS data. The "All" columns are copied from Supplemental Table 4. Best performance is bolded.

Table 6 :
Execution time and memory usage for Genome-in-a-Bottle Tier 1 and Tier 2 SVs on a 36-core compute node (dual 18-core Intel Xeon 6140 2.3 GHz CPUs). Execution time is determined with the time utility, and memory usage is the maximum resident set size reported by the SLURM cluster job queueing system. We ran tools that support parallel execution for a single sample on all 36 cores (using GNU Parallel 29 and the start/end range parameters to parallelize svviz2).