Efficient short read mapping to a pangenome that is represented by a graph of ED strings

Abstract Motivation A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a DNA read in a reference sequence, typically a genome. Using a pangenome instead of a (linear) reference genome can, e.g. reduce mapping bias, the tendency to incorrectly map sequences that differ from the reference genome. Mapping reads to a graph, however, is more complex and needs more resources than mapping to a reference genome. Reducing the complexity of the graph by encoding simple variations like SNPs in a simple way can accelerate read mapping and reduce the memory requirements at the same time. Results We introduce graphs based on elastic-degenerate strings (ED strings, EDS) and the linearized form of these EDS graphs as a new representation for pangenomes. In this representation, small variations are encoded directly in the sequence. Structural variations are encoded in a graph structure. This reduces the size of the representation in comparison to sequence graphs. In the linearized form, mapping techniques that are known from ordinary strings can be applied with appropriate adjustments. Since most variations are expressed directly in the sequence, the mapping process rarely has to take edges of the EDS graph into account. We developed a prototypical software tool GED-MAP that uses this representation together with a minimizer index to map short reads to the pangenome. Our experiments show that the new method works on a whole human genome scale, taking structural variants properly into account. The advantage of GED-MAP, compared with other pangenomic short read mappers, is that the new representation allows for a simple indexing method. This makes GED-MAP fast and memory efficient. Availability and implementation Sources are available at: https://github.com/thomas-buechler-ulm/gedmap.


Program calls in our experiments
In all program calls in this section, the common in-and output files are called FA, VCF, FQ and OUT.

GED-MAP
In the following, GED is the string ofĜ. GED.adj stores the adjacency list and GED.2fa is the bit-vector used to transform a coordinate inĜ to a coordinate in the reference. MINI is the minimizer index. The default parameters for the minimizer are w = 5 and k = 20. Likewise by default a trimming that removes all minimizers with more than 1000 occurences is applied. The flag 'rc' enables search for reverse complement and 'tc' sets the number of used threads. The parameters 'mac' and 'mat' are giving a limit to how often the dynamic programming algorithm will be started and executed completely. The parameter 'd' gives the maximum distance for the alignment. These parameters were set differently on different read sets. g e d m a p p a r s e FA VCF GED g e d m a p i n d e x GED -o MINI -a GED . adj -2 fa GED . e d s 2 f a g e d m a p a l i g n GED FQ MINI -o OUT -2 fa GED . e d s 2 f a -a GED . adj -rc -tc 8 ( -d 1 5 / 1 0 0 -mat 1 0 / 1 0 0 0 )

HISAT2
We used version 2.2.1. The index was generated with default commands. EXT contains the extracted variation and IDX is the prefix for all index files. HISAT2 offers many parameters to customize the mapping process.
With the chosen parameter setting, the internal scoring function equals the edit distance.
h i s a t 2 _ e x t r a c t _ s n p s _ h a p l o t y p e s _ V C F . py -non -rs FA VCF EXT hisat2 -b u i l d -f --large -i n d e x --snp EXT . snp --h a p l o t y p e EXT . h a p l o t y p e FA IDX h i s a t 2 -q -x IDX -U FQ -S OUT --t h r e a d s 8 --s e n s i t i v e --no -s o f t c l i p --np 0 -k 1

VG Giraffe
We used version v1.37.0-11 'Monchio' of the vg-toolkit. In the following commands, VG is the variation graph and XG is a more space-efficient representation of VG. GFA is the graph in GFA format. IDX is the prefix of the indexes used for the mapping process. These indexes include the GBWT, the corresponding GBWT graph, a minimizer index, and a distance index. We then mapped FQ to the index with default parameters and converted the output to SAM format.

Minimap2
We used verion 2.24 of Minimap2. In the following commands, IDX is the minimizer index.

Layout of the minimizer index
Let {ω 1 , ω 2 , . . . , ω (4 k ) } be the set of k-mers over the alphabet {A, C, G, T }. We denote the k-mer with lexicographic rank j by ω j . Further, let M (ω j ) be the set of positions at which ω j occurs as minimizer in the EDS graph. For example in Figure 2 in the main document M (AGA) = {4, 17} and M (AAA) = ∅. The purpose of the index is to determine M (ω j ) given a k-mer ω j . In the following let n be the number of k-mers w such that M (ω) ̸ = ∅ and m be the number of k-mers with |M (ω)| = 1. Our index layout is similar to SSHash [A] and also exploits the facts that for high enough values of k, firstly n is magnitudes smaller than 4 k (the index is sparse) and secondly m is not much smaller than n (non-empty sets M (ω) are often singleton).
We use a bit array member of size 4 k to indicate if a k-mer is present in the index, i.e. member [j] = 1 ⇔ |M (ω j )| > 0. This array is compressed using the Elias-Fano encoding. We further enhance member with additional data to support rank queries (rank (member , x) := x i=1 member [i]) in constant time. With rank queries we can map every k-mer, which is present in the index, to a value in {1, . . . , n}. We define h : {1, . . . , 4 k } → {1, . . . , n} with h : j → rank (member , j)). (So the function h corresponds to the Minimal Perfect Hash Function in [A]. Additionally member allows us to answer membership queries in constant time. ) We use a second bit array singleton of size n to indicate if the set The array singleton is also enhanced with a rank support. We To be able to report the elements of the non-singleton sets, we need an additional vector start (similar to Sizes in For instance, the arrays for the index shown in Figure  The index used for the experiments requires 5.9 GB storage. This devides in 3.4 GB for pos S , 0.7 GB for pos M , 1.5 GB for member and 0.2 GB for start and 0.1 GB for singleton.
Evaluation of the performance of the minimizer index for different values of k and w The two main parameters for the index are the length k of a k-mer and the window size w. If k or w is high, there are fewer positions per minimizer in the index. Furthermore, higher values of k or w are reducing the number of minimizers per pattern. This results in a low number of positions per pattern, which leads to a higher mapping speed. When the number of positions per pattern is low, however, the algorithm may not be able to place the pattern correctly and hence the accuracy is lower.
To evaluate this effect, we built the index for the whole human genome with different parameters of k and w and mapped 1 million generated samples using the index. We measured the following values: • the average number of positions, loaded from the index, per sample • the mapping speed • the average number of 'correct' positions (positions that are part of the correct alignment), loaded from the index, per sample. • the accuracy, i.e. the fraction of correctly placed samples • the size of the index As described in the main document, we applied a trimming, which removed all k-mers with more than 1000 occurrences from the index. Our evaluation is shown for the index before and after the trimming.  On the one hand, to be able to find the correct alignment, we would like to have a high number of 'correct' positions. On the other hand, to be fast, we would like to have a low number of positions per read. Figure 2 is showing that the number of positions per read decreases exponentially (note the semi logarithmic scale), when increasing k or w. The throughput behaves inversely proportional to the number of positions until it reaches a limit approximately at 2 -3 ×10 5 patterns per second. To be able to place the pattern correctly, the number of 'correct' positions is more relevant. This is shown in Figure 3 along with the accuracy of the mapping process. One can see that the number of 'correct' positions decreases slowly as k increases. Furthermore, we see that the trimming remarkably lowers the The index size is shown in Figure 4. A base line for the index size is the space needed to store the positions (in the arrays pos S and pos M ). That strongly depends on w and is quite independent of k. An increase of k mainly increases the space needed to store the array member .
For a good choice of the values for k and w as well as the trimming, one has to consider a space-time-accuracy trade off. A higher space consumption comes with a higher accuracy and a lower mapping speed.
For our experiments we choose w = 20 and k = 5. With these values the minimizer index has a size of 5.9 GB and stores positions for over 879 million k-mers. That is about 6.7 bytes per k-mer. (We already need 4 bytes to store a single position of a k-mer.) The trimming removed about 1.5 · 10 4 k-mers with a total of over 51 million positions.

Applying a non lexicographic hash function counters the loss of alternative alleles
At a fixed position in the EDS graph, different k-mers may start if there is a branch right behind this position. Instead of considering all k-mers of such a position in the minimizer index, we focus on the k-mer with the minimum hash value. Hence the k-mer at this position does only represent one alternative.
When using the lexicographic rank as the hash value of a k-mer (as done in the examples of this article), all k-mers, which start right before a branch, will represent the lexicographic smallest alternative and so the lexicographic larger alternatives will not be covered properly; see Figure  5. There are three minimizers that cover the variant (G|T) and all of them cover the alternative G. Hence the alternative T is not present in the index. This effect can be countered by applying a more random-like hash function, as done in the actual implementation. If the hash values behave like random values, an alternative of a variant will be covered with each minimizer that overlaps this variant. Suppose the hash function maps the k-mers in the example to their lexicographic rank, except ATCC is mapped to a lower hash value than AGCC. Then the second minimizer would be ATCC instead of AGCC and both alternatives are covered in the index.
This does not guarantee that all alleles are present in the index. Nevertheless, our experiments show that, with a suitable choice of parameters, there are in practice enough matching minimizer to be able to place a pattern correctly. As one can see in Figure 3, there are on average almost 20 matching minimizers between a pattern and the EDS with the parameters of our experiments.

Comparison of the graph based tools on the linear reference
We repeated the experiments of Section 3.3 on the linear reference. That is, we built the indexes for GED-MAP, VG-Giraffe and HISAT2 without the variant-information (Table 1) and then executed the mapping on the same input as in the original experiments ( Figure 6). The most remarkable observation in Table 1 is that, without variations, the index construction of HISAT2 is fast and memory efficient.  6. Evaluation of the accuracy, throughput, and space consumption of the mapping process. This figure shows the same evaluation as Figure 3 in the main document, but without using variants in the graph based tools.