GenomeWarp: an alignment-based variant coordinate transformation

Abstract Summary Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome. Availability and implementation GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp. Supplementary information Supplementary data are available at Bioinformatics online.


Supplementary
: Non-overlapping source assembly regions can produce overlapping target assembly regions. Matching sequences on the positive strand are shown in blue. Matching sequences on the negative strand are shown in orange. A single target assembly region that is mapped from multiple source regions is highlighted in red in both the source and target assemblies: base pairs 2-12 in the source genome are identical to base pairs 3-13 in the target genome on the same strand, while base pairs 13-23 in the source genome are identical to base pairs 11-21 in the target genome on the opposite strand. Since the three target genome base pairs 11-13 are linked to multiple distinct source genome regions, the regions are omitted from further consideration by GenomeWarp. The chain diagram uses the UCSC Genome Browser Chain representation .
For each confidently called region in the source genome: 1. Identify all variants within the region. 2. Create haplotypes based on both the region sequence in the source genome and the variants within the region. 3. Align the haplotypes to the region sequence in the target genome. 4. Emit variants in the target genome based on the haplotype alignments. 5. Emit the target genome region coordinates as a confidently called region. Figure S3: Pseudocode for the general algorithm to perform variant conversion between assemblies. A GenomeWarp implementation based on the above pseudocode could handle all possible interactions between variants and reference genome differences. However, because human genome assemblies are quite similar in mapped sequence content ( Supplementary Table S1 ), GenomeWarp version 1.2 relies on a set of simpler heuristics ( Supplementary Table S2 ) that account for nearly all regions seen in practice. Supplementary Table S1: Distribution of HG001 GiaB v3.3.2 GRCh37 source SNPs and biallelic indels classified by region type when transforming to GRCh38. Of the 3,775,119 total variants in the GiaB HG001 benchmarking callset on GRCh37, 3,258,078 are SNPs and 478,217 are biallelic indels. These are classified into different region types based on the base pair composition of the homologous regions of the source and target genome assemblies. "Identical" indicates the same number and order of nucleotides; "Mismatch" indicates the same number of nucleotides in the source and target genomes but at least one mismatch between the sequences; "Alignment" indicates the number of nucleotides in the source and target genomes differ. "Unmapped" indicates source variants either not wholly within benchmarking source regions or that could not be uniquely mapped to the target genome assembly with all bases remapping. "+" and "-" indicate that the region maps to GRCh38 on the forward and reverse strand, respectively.  Figure 1C).  Figure 1A). In other cases, the reference allele is changed to reflect the target genome base pair(s).

Yes Yes No
Indel and reference sequence changes can interact in complex ways to affect variant representations. See Figure  1D.  Table S1 ).

Supplementary Note
Comparison of GenomeWarp to other tools for transforming HG001 The National Institute of Standards and Technology hosts the Genome in a Bottle (GiaB) Consortium, which aims to provide reference genome materials for benchmarking purposes (Zook et al., 2014;Zook et al., 2016, Krusche et al, 2019 ). To evaluate the performance of GenomeWarp, the pilot genome HG001, benchmarking callset version v3.3.2 was transformed from the GRCh37 genome assembly to GRCh38 (all source data available from the links in Supplementary Table S4 ). Input files were modified to use consistent contig naming (e.g. "chr1" rather than "1"), then GenomeWarp v1.1.0 was run with its default parameters.
The benchmarking regions in the GRCh37 assembly span 2,575,064,465 base pairs (bp) and the GRCh37 VCF contains 3,775,119 variants, of which 3,690,061 are contained within a benchmarking region. Upon transformation to GRCh38, the benchmarking regions span 2,570,662,227 bp (99.8% of the source assembly) and the GRCh38 VCF contains 3,676,373 variants (99.6% as many variants as contained within benchmarking regions in the source assembly). The subset of region type transformations supported in GenomeWarp v1.1.0 are sufficient to capture this large number of input variants owing to the large fraction of regions containing identical base pairs on the same strand in the human genome assemblies ( Supplementary Table S1 ).
The GiaB consortium directly created an HG001 v3.3.2 version of variants and benchmarking regions on GRCh38 as well as GRCh37. These GRCh38 data were used as a ground truth set to evaluate variants generated by lifting over the GRCh37 variants and regions using GenomeWarp and other lift over alternatives. It is important to note that differences between the lifted over GRCh37 variants and the native GRCh38 variants may arise due to multiple factors: 1) errors in lifting over variants between assemblies, and 2) callset differences based on the mapping and variant calling workflows performed on each assembly. While the latter source of errors is unavoidable, we expect their contributions to be relatively small owing to the similar sequence content of GRCh37 and GRCh38.
The alternative lift over tools analyzed were CrossMap (Zhao et al., 2014) Table S3 ). For indels, GenomeWarp reduces false negatives 5-fold and reduces false positives 55-to 60-fold ( Supplementary Table S3 ). The increase in, and similar total number of, false positive SNP calls for both GenomeWarp and Picard LiftoverVcf RECOVER_SWAPPED_REF_ALT=true compared to Picard LiftoverVcf without handling reference sequence changes indicates that those false calls are likely callset differences between GRCh37 and GRCh38 rather than logical errors introduced by the lift over algorithms, as the code to transform reference and alternate bases is only supported for biallelic SNPs by Picard and is very straightforward (see https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/vcf/LiftoverVcf.java#L 502-L514 ).

Differentiating between reference callset differences and GenomeWarp transformation errors
As illustrated in the above section, false positive and negative variants that arise when comparing a callset initially made on GRCh37 and transformed to GRCh38 to a callset made directly on GRCh38 may be due to either inherent callset differences or to errors in the transformation process. To understand the relative contributions of each, we took the GiaB benchmark callset on GRCh38, transformed it to GRCh37 using GenomeWarp, transformed that callset back to GRCh38 using GenomeWarp, and then compared the resulting callset to the original GiaB benchmark callset on GRCh38 using hap.py ( Supplementary Table S5 ).
For both SNPs and indels, nearly all errors are false negatives (99.57% and 99.97%, respectively). This is expected owing to the conservative nature of the GenomeWarp algorithm; variants that cannot be confidently transformed are instead omitted along with the corresponding region. The presence of the 44 false positive variants was surprising given this design decision.
We investigated these cases in detail and in each case found that the error was due to the entire region surrounding a variant being mapped to a different location in the genome. For example, the region GRCh38.chr1:144515866-144515876 maps to GRCh37.chr1:144542180-144542190, but GRCh37.chr1:144542180-144542190 maps to GRCh38.chr1:149154796-149154807. When a variant is present in one of the GRCh38 regions but not the other a false variant is introduced. This error mode is due to the LiftOver chains not enforcing an injective mapping (visual example in Supplementary Figure S2 ) and is irreducible error for any tool relying on the accuracy of LiftOver chains as part of the transformation algorithm.  Table S6 ). The loss of recall can be attributed at least in part to the conservativeness of the GenomeWarp conversion algorithm as described in the above analysis on GenomeWarp transformation losses. The loss of precision is attributable to differences in the callsets themselves, as the above analysis of GenomeWarp transformation losses shows that very few false positives are introduced by the transformation algorithm, and are consistent with the difference in total calls in the GiaB callsets on the two assemblies. Finally, the transformed callset was evaluated against the callset generated using DeepVariant directly on GRCh38 and filtering non-variant regions to those with a minimum GQ of 20. The resulting SNP F1 of 0.968662 and indel F1 of 0.972409 are substantially lower than the comparison to the GiaB benchmarking callset on GRCh38, showing that genomic regions outside the GiaB benchmark regions are enriched for differences between the assemblies that lead to both variant calling diversity and fewer successful liftover operations.