Tackling reference bias in genotyping by using founder sequences with PanVC 3

Abstract Summary Overcoming reference bias and calling insertions and deletions are major challenges in genotyping. We present PanVC 3, a set of software that can be utilized as part of various variant calling workflows. We show that, by incorporating known genetic variants to a set of founder sequences to which reads are aligned, reference bias is reduced and precision of calling insertions and deletions is improved. Availability and implementation PanVC 3 and its source code are freely available at https://github.com/tsnorri/panvc3 and at https://anaconda.org/tsnorri/panvc3 under the MIT licence. The experiment scripts are available at https://github.com/algbio/panvc3-experiments.


Standard reference genome G A T T A A T --T C
Variants relative to standard ref.

G A T G G C A T A T C G A T T A C T --T C G A T G G C T --T C G A T T G C A T A G C G T T T A C T A A T C NGS Reads T T T C G A T G C T A A A C T A Segmentation G A T T A A T --T C G A T G G C A T A T C G A T T A C T --T C G A T G G C T --T C G A T T G C A T A G C G T T T A C T A A T C Founder reconstruction G A T T A A T --T C G A T G G C A T A G C G T T T A C T A A T C
Read alignment to founder sequences

T T -T C G A T T A A T --T C G A T G G A T G G C A T A G C C T A A A C T A G T T T A C T A A T C Alignment projection T T -T C C T A A G A T G A C T A G A T T A A T --T C Mapping quality recalculation T T -T C C T A A G A T G A C T A G A T T A A T --T C
Variant calling (standard reference) 1 1 1 1 0 0 0 0 0 0 0 G T T T A C T A A T C 0 0 0 0 1 2 2 2 1 0 0

Ad hoc reference genome G A T G A C T A A T C
Read alignment to ad hoc reference

T T T C C T A A G A T G A C T A G A T G A C T A A T C Variant calling (ad hoc reference) T T G A T G A C T A A T C Projection to standard reference T T G A T G A C T A A T C G A T T A A T --T C
Supplemental Figure 1: Example genotyping workflows that utilise the new and, for comparison, the previous version of PanVC (respectively "PanVC 3" and "PanVC 2").The inputs are shown on top left, the common phases on the top, and the phases specific to either version respectively at the bottom right and left.The alignments have been chosen for demonstration purposes and are not necessarily optimal.A key difference is that due to projecting the alignments to the reference sequence co-ordinates, only one alignment phase is necessary in the new version, whereas in PanVC 2 the reads were realigned to the ad hoc reference constructed using the initial alignments.Due to the differences in design, the workflows can predict different variants.

Co-ordinate Transformation Data Structure
To project co-ordinates between aligned sequences, the following simple data structure is used.For each sequence of n characters, a bit vector B of n elements is built.The element B[i] is set to 0 if the i-th character of the sequence is a gap, and otherwise to 1. Suppose the co-ordinate k of some source sequence is to be projected to a destination sequence that has been aligned to the source.Also, suppose the corresponding bit vectors are respectively called B S and B D .We first retrieve the position of the k-th one in B S using the select operation, thus converting the co-ordinate to that of the multiple sequence alignment.To retrieve the position in the destination sequence, we count the ones in B D up to that position with the rank operation.Using suitable data structures, this can be done in constant time after preprocessing the bit vectors in linear time (Babenko et al., 2015).To save space, a compressed representation of the bit vectors can be used (Navarro and Providel, 2012).

Alignment Projection Algorithm
The run-length encoded sequences of edit operations that read aligners typically use to report the corresponding parts of an aligned read and the reference sequence are known as CIGAR strings (The SAM/BAM Format Specification Working Group, 2022).To rewrite the edit operations, the sequence of operations is processed from beginning to end.Insertions in the read are initially added to the rewritten sequence of edit operations without modifications.Other types of operations are said to consume reference, that is, indicate that the alignment steps along the source sequence.To handle such an operation, the projected co-ordinate of the current source sequence character is first compared to that of the previous character.If there are additional characters in the destination sequence between these positions, an equivalent number of deletion operations is added to the alignment.Next, the alignment operation is rewritten according to the rules listed in Supplemental Table 1.After all the operations have been rewritten, the resulting sequence of edit operations is checked for maximal runs of insertion and deletion operations that contain at least one of each type, and the corresponding runs of characters in the read and the destination sequence are identified.Finally, the aforementioned parts of the read and the destination sequence are realigned and the rewritten sequence of edit operations is modified accordingly.
Similarly to projecting the co-ordinates, the rank operation can be used to determine the number of characters in the destination sequence between a pair of multiple sequence alignment co-ordinates in constant time.Hence, the linear time complexity with respect to the number of operations is achieved, not taking into account the time required to realign parts of the corresponding read.

Reference Bias Experiment
We produced simulated reads from NA12878's chromosome 1.To this end, we generated predicted haplotype sequences from the heterozygous variants in the Genome-in-a-Bottle project v4.2.1 small variant call set (Wagner et al., 2022) and the hs37d5 reference sequence and treated the variants as phased.We then used Mason (Holtgrewe, 2010) to generate 30 million paired-ended reads for each haplotype, resulting in a mean coverage of 24.3 per haplotype with 90.38 % of all bases covered.The reads were then aligned with each of the tested workflows.
In case of Reference Flow, VG-MAP and Giraffe, the phase 3 variant data for chromosome 1 from the 1000 Genomes project (The 1000Genomes Project Consortium et al., 2015) was used as indexing input, having first removed NA12878.In the case of PanVC 2, a set of 3 founder sequences generated from said variant data with the maximum distance between subgraphs set to 2 was used as indexing input.These parameters were chosen since they yielded good results in our earlier experiments with PanVC 2.
In the case of PanVC 3, a set of 11 founder sequences including the reference sequence was used.The founder sequences were generated from the aforementioned variant data.A post-processing step was also Supplemental Table 1: Rewriting rules for the edit operations in alignment projection.The middle columns ("Destination has gap" and "Match") respectively indicate whether the destination sequence has a gap character, and whether the character in the read matches that in the destination sequence.tested where the mapping qualities were recalculated and only the alignment with the best mapping quality for each read was kept.

Computational requirements
We benchmarked the indexing and alignment steps of the workflows that utilised Bowtie 2, VG-MAP, VG-Giraffe, PanVC 2, and PanVC 3 with Snakemake's benchmark feature.To evaluate memory consumption, we measured the maximum resident set size for each step.Additionally, we measured the wall clock time to run each workflow.The latter benchmark can be considered approximate, as the server used to test the workflows is shared and the I/O operations are taken into account.
The machine used ran Ubuntu Linux version 14.04.5 LTS and had 172 GB of RAM and 40 Intel Xeon (Skylake, IBRS) processors, 2.4 GHz each.Bowtie 2, vg, and Samtools were run parallelised; Bowtie 2 and vg used 32 threads.The results are shown in Supplemental Figures 3 and 4.
The results indicate that generating the founder sequences does not considerably increase the time required for generating the index with the inputs of this experiment.On the other hand, projecting the alignments does increase the time needed for aligning the reads substantially.The reason for this may be that the projection tool is relatively new and has not yet been parallelised.On the other hand, projecting the alignments as well as recalculating and maximising the mapping quality require very little memory.We also note that with the current version, parallelisation of the alignment projection could be achieved by splitting the input (e.g. by chromosome or even arbitrarily).

Structural Variant Calling Experiment
Sequencing data for NA24385 from the Human Pangenome Reference Consortium (Zook et al., 2016) were used for testing structural variant calling accuracy.The data consisted of paired-end 150 bp Illumina reads with approximately 450 bp fragment size, downsampled to approximately 30x coverage.The reads were aligned with Bowtie 2 (Langmead and Salzberg, 2012) and Manta (Chen et al., 2015) was used for variant calling.Only variants in chromosome 1 were considered; to this end, Manta's --region option was used to limit variant calling to the chromosome in question.Finally, the variants were evaluated with Truvari (English et al., 2022) against the Genome-in-a-Bottle v0.6 Tier 1 structural variant benchmark set (Zook et al., 2020), considering only the regions listed with the data set that were located in chromosome 1, and variants with the FILTER field set to PASS.
For Reference Flow, VG-MAP and Giraffe, the phase 3 variant data for chromosome 1 from the 1000 Genomes project (The 1000Genomes Project Consortium et al., 2015) was used as indexing input.In the case of PanVC 3, the founder sequences were generated from the same variant data.In the baseline workflow, Supplemental Figure 2: Proportions by the difference of the lengths of the alternative allele and the reference allele in the reference bias experiment (continued).
Supplemental Figure 3: Measured wall clock times spent on executing selected workflows in the reference bias experiment.In case of PanVC 2 and 3, some steps are shown separately.
Supplemental Figure 4: Maximum resident set sizes when executing selected workflows in the reference bias experiment.In the case of PanVC 2 and 3, some steps are shown separately.
Supplemental Table 3: Precision and recall as reported by Truvari and the called variant counts in the structural variant calling experiment, considering all regions of chromosome 1 and only variants that passed all filters.MM indicates that the alignments were filtered by maximising the mapping quality while MR indicates that no filtering was done.The rows have been sorted in descending order by precision.

Workflow
PanVC Bowtie 2's default mode in which it reports the best alignment for each read was used.In the workflows which used PanVC 3, Bowtie 2 was asked to search for up to a number of valid alignments equal to the number of generated founder sequences plus one for each read.
In addition to Manta, we tested GRIDSS (Cameron et al., 2021) with all tested aligners.While the variant caller reported several thousands of variants with Bowtie 2 and PanVC 3, the precision and the recall reported by Truvari were zero.It is possible that the variants could not be evaluated by Truvari (e.g.due to breakends).

Take-One-Out Experiment with a Human Chromosome
We tested the current and the previous version of PanVC with the workflows shown in Supplemental Figure 1.The variant projection algorithm of the previous version of PanVC ("PanVC 2") was improved to detect more cases where comparing the ad hoc reference to the standard reference indicated opposite results to those in the called variants relative to the ad hoc reference.One such situation occurred when a subsequence had been deleted from the ad hoc reference but the called variants suggested that at least one chromosome copy contained the deleted sequence.
Supplemental Table 4: Precision and recall as reported by Truvari and the called variant counts in the structural variant calling experiment, considering confident regions of chromosome 1 and only variants that passed all filters.MM indicates that the alignments were filtered by maximising the mapping quality while MR indicates that no filtering was done.

Further Considerations
While the idea of including additional sequences to the index used as part of read alignment is not new, our method allows the alignment information to be retrieved in an efficient manner regardless of the length or the number of the alternative sequences.Our tools are scalable in the sense that improvements to read alignment could be achieved with a relatively small number of founder sequences.The phases where the alignments are projected and their mapping qualities are recalculated can also be done in parallel per chromosome.Another option in the projection algorithm is to partition the input arbitrarily and process the partitions in parallel.
The mapping quality recalculation algorithm needs all the alignments of a given read as its inputs, but with a partitioning that takes the read names into account the same parallelisation method could be applied.We expected that by adding founder sequences to the index, the measured variant calling results would converge to some value.Instead, we noticed that better results could sometimes be achieved with a smaller number of founder sequences, similar to our experiments with the previous version of PanVC.Our previous hypothesis was that the algorithm for determining the heaviest path used in the previous version could be the cause, but since the new version does not calculate the heaviest path, there seems to be some other reason at least in this case.We did not yet examine the issue but one possible reason is that the sets of founder sequences used as indexing inputs differ in the way the founder segments have been recombined, which then causes the reads to be aligned in another way.
As the experiments indicate that structural variant calling results did improve, a worthwhile further development of our tools is to add means for including structural variants to founder sequences.The resulting multiple sequence alignment will then contain long regions of gaps but it is only needed as input for constructing the co-ordinate conversion data structure.
(a) Results for the Bowtie 2 and PanVC 2 workflows.Supplemental Figure 2: Proportions R R+A ("balances") by the difference of the lengths of the alternative allele and the reference allele in the reference bias experiment.The proportion is shown on the Y axis, the length difference on the X axis, and the value n shown with the fill colour indicates the number of alignments in each equivalence class.The median balance is shown with a horizontal line inside each box, and the lower and upper hinges correspond to the first and third quartiles.The whiskers show the smallest and the largest value such that the difference to respectively the first and the third quartile is at most 1.5 times the distance between those quartiles; more distant values (outliers) have been plotted individually with dots.The tested workflow is shown on the right of each row.(b) Results for the VG-MAP and Giraffe workflows.

Table 2 :
Overall mean absolute error values of the ratios calculated in the reference bias experiment such that the lengths of the reference allele and the alternative allele differ by at most 25 nucleotides.The values have been calculated with the formula 1 n n 1 |x i − 0.5| where x i are the proportions R R+A and R and A are the counts of reads that support respectively the reference allele and the correct alternative allele.MM indicates that the alignments were filtered by maximising the mapping quality while MR indicates that no filtering was done.
The rows have been sorted in descending order by precision.