Abstract

Motivation

Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.

Results

We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.

Availability and implementation

The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.

1 Introduction

Sequencing databases are growing exponentially in size (Katz et al. 2021). In recent years, ‘sequence graphs’, in particular ‘De Bruijn graphs’ (DBGs), have become increasingly prominent models for representing and indexing large collections of sequencing data (Marchet et al. 2021), enabling improvements in both the scale and the accuracy of many biological analysis tasks, including genotyping (Hickey et al. 2020, Sirén et al. 2021), variant calling (Colquhoun et al. 2021, Krannich et al. 2021), and sequence search (Karasikov et al. 2020, Sirén et al. 2021).

Established search methods are designed for databases of assembled genomes (Morgulis et al. 2006). However, a large fraction of sequencing data deposited in archives like the Sequence Read Archive (SRA) or the European Nucleotide Archive have not yet been assembled (Harrison et al. 2020). This is because assembly requires expensive compute and human labour resources, done by first representing the overlaps between reads as an ‘assembly graph’ (e.g. a DBG), then extensively cleaning the graph (often requiring manual intervention), and finally assembling contiguous sequences (contigs) by graph traversal (Bankevich et al. 2022). Since genome assembly strives for long, high-quality contigs (Rahman and Medvedev 2022), the cleaning may discard a significant amount of signal from the sample (Sirén 2017, Rhie et al. 2021) with no guarantee that there will be no misassemblies among the final contigs (Rahman and Medvedev 2022).

To avoid these signal reduction and misassembly issues, a common way to compress and index a collection of unassembled read sets for search queries is to first construct and only lightly clean an assembly graph for each read set, then merge the graphs into a ‘joint assembly graph’ (Pandey et al. 2018, Karasikov et al. 2020). For indexing diverse sequencing datasets, light cleaning is still crucial to reduce the accumulation of noise when indexing diverse collections containing hundreds of thousands of samples (Karasikov et al. 2020). DBG-based indexing tools encode metadata as ‘graph annotations’, a key-value store associating each node with one or more metadata tracks, such as sample ‘labels’ (Iqbal et al. 2012, Almodaresi et al. 2018, Garrison et al. 2018, Pandey et al. 2018, Turner et al. 2018, Holley and Melsted 2020, Karasikov et al. 2020, 2022, Marchet et al. 2020, Fan et al. 2024). Similar to these previous works, we use accession IDs as labels to associate nodes back to their original database entries.

A key search task on these indexes is ‘sequence-to-graph alignment’, a generalization of pairwise sequence-to-sequence alignment (i.e. computing the maximum similarity ‘score’ between a ‘query’ and a ‘target’ sequence). In this setting, the target sequences are the spellings of contiguous walks (Section 2.1) on the sequence graph (Lee et al. 2002, Dvorkina et al. 2020, Ivanov et al. 2020, 2022, Li et al. 2020, Rautiainen and Marschall 2020, Schulz et al. 2021, Chandra and Jain 2023a, Joudaki et al. 2023, Ma et al. 2023, Rajput et al. 2023). Many of these tools follow a three-step ‘seed-chain-extend’ search paradigm (Section 2.2). This involves extracting and ‘anchoring’ query ‘seeds’ to the graph, using a ‘colinear chaining’ algorithm to construct anchor chains, and then ‘extending’ the chains via a search in the graph to form alignments.

1.1 Challenges when aligning to annotated DBGs

A large proportion of read sets in the SRA are sequenced at low depth (The fungi SRA samples indexed by Karasikov et al. (2020) have a median (mean) k-mer multiplicity of 9 (10.5); Supplementary Fig. 2.), producing heavily fragmented assembly graphs (Rhie et al. 2021). For metagenomics samples in particular, the constituent organisms are often sequenced at or below 1× coverage (Danko et al. 2021). Although the light cleaning applied to assembly graphs ensures that exact seed matches can be found (Karasikov et al. 2020), the long contiguous walks required for high-scoring (i.e. high-precision) alignments are often not present because of high graph fragmentation. This results in low recall, particularly for long reads, because the short alignments that can be found are not reported to maintain search precision.

Current alignment approaches for sequence graphs have limited support for fragmented graphs. The first approach is ‘label-free’ alignment, which ignores annotations during alignment. When applied to an annotated DBG representing a diverse cohort, these tools (Limasset et al. 2016, Garrison et al. 2018, Heydari et al. 2018, Dvorkina et al. 2020, Ivanov et al. 2020, 2022, Rautiainen and Marschall 2020) can meander search through a large search space inflated by biologically irrelevant sample combinations (Sirén et al. 2021, Karasikov et al. 2022). For this reason, these tools primarily target single-species pangenomes that often satisfy the assumption that all walks are biologically relevant. If this assumption holds, then these methods can overcome fragmentation in an individual sample’s assembly graph by combining sequence information from multiple samples since such contiguous walks are present in the joint graph. A second approach aligns queries to walks where all nodes in the walk share one (single-label) or more (label-consistent) common label(s) (Luhmann et al. 2021, Schulz et al. 2021). These tools suffer from low recall on fragmented assembly graphs, and so, this property also applies to joint annotated graphs because sample combinations cannot compensate for fragmentation. However, these tools do not suffer from search space inflation because all walks are biologically relevant. This is why these tools are applied to high-quality contiguous graphs indexing reference genomes or high-depth sequencing samples (Luhmann et al. 2021, Schulz et al. 2021). A third, recently emerging intermediate approach is ‘haplotype-aware’ alignment, which either aligns in a label-free fashion and scores recombinations afterwards (Rosen et al. 2017) or combines samples to a restricted degree by penalizing each combination during alignment (Avila et al. 2022, Chandra and Jain 2023b). These alignment strategies do not consider similarity (hence, biological relevance) when scoring a sample change and have so far only been applied to single-species pangenomes.

A property shared by all of these approaches is that a discontinuity in an individual assembly graph that does not overlap with another assembly graph will propagate to the joint DBG, limiting the joint graph’s contiguity. This stems from the approaches’ shared definition of alignment: the alignment target must be a contiguous walk.

1.2 Contribution: label-guided sequence-to-DBG alignment

Our goal in this work is to develop an alignment approach that can produce long accurate alignments to collections of low-depth sequencing samples represented by fragmented annotated DBGs. To this end, we propose a new alignment strategy called ‘multi-label alignment’ (MLA) designed for annotated DBGs. The strategy extends alignment scoring models with two key new operations: (i) ‘Label Change’ (LC) and (ii) ‘Node Length Change’ (NC). The label change operation penalizes traversals that change from one sample to a dissimilar sample, thus enhancing local single-character similarity scores with more informative global similarity. The node length change operation dynamically adjusts the node length, thus using shorter length nodes as proxies for missing nodes to locally improve graph connectivity (i.e. reducing fragmentation).

Efficiently implementing these operations must overcome several computational challenges. First, efficient anchor chaining relies on having a small number of anchors per query (Jain et al. 2022), an assumption that is easily violated in a multi-label setting. Second, sequence-to-graph alignment decision problems (i.e. does there exist an alignment) that allow for DBG edits is shown to be NP-complete (Gibney et al. 2022), necessitating heuristics to prevent excessive use of node length change operations.

To address these challenges, we implement MLA in a fast, approximate, yet accurate two-step approach: The first step is a new single-label seed-chain-extend aligner for annotated DBGs called SCA (Fig. 1.13), meant to reduce the computational burden of multi-label chaining by first performing single-label chaining with a traditional scoring model and extending the top chains among all labels to provide a preliminary alignment pool. SCA, adapts recent improvements in chaining to a DBG setting. The second step is a multi-label chainer called MLC (Fig. 1.4) that incorporates our novel multi-label scoring operations into its chain scoring. It extracts anchors from the alignments provided by SCA, resulting in a much smaller curated anchor set on which we apply our more expensive operations. It then performs multi-label chaining on these anchors and stitches the multi-label chains into alignments using fragments from SCAs alignments.

Computing multi-label alignments of a query sequence to an annotated De Bruijn graph. Each colour represents a label in the graph annotation, with some nodes (rounded boxes) having multiple labels. We first (1) extract seeds (half-rounded boxes, shaded seeds have no match) of length l≤k from the query sequence (in this example, k=4, l=3) and anchor the seeds to the graph, where each anchor (normal boxes) matches a seed to a node and a label (each bucket represents the node-label pairs to which a seed matches). Then, we (2) construct single-label chains (indicated by green lines) and extend them along single-label walks into alignments using SCA (note that a single-label walk does not exist from GGCA to CACC, so the yellow anchor GCA cannot be chained to ACC). We then (3) extract anchors from this alignment pool (reducing the anchor pool) and form multi-label chains using MLC, which allows label changes at matching l-mers. We (4) connect anchors using single-label alignment segments to form multi-label alignments.
Figure 1.

Computing multi-label alignments of a query sequence to an annotated De Bruijn graph. Each colour represents a label in the graph annotation, with some nodes (rounded boxes) having multiple labels. We first (1) extract seeds (half-rounded boxes, shaded seeds have no match) of length lk from the query sequence (in this example, k=4, l=3) and anchor the seeds to the graph, where each anchor (normal boxes) matches a seed to a node and a label (each bucket represents the node-label pairs to which a seed matches). Then, we (2) construct single-label chains (indicated by green lines) and extend them along single-label walks into alignments using SCA (note that a single-label walk does not exist from GGCA to CACC, so the yellow anchor GCA cannot be chained to ACC). We then (3) extract anchors from this alignment pool (reducing the anchor pool) and form multi-label chains using MLC, which allows label changes at matching l-mers. We (4) connect anchors using single-label alignment segments to form multi-label alignments.

Label change scores measure sample similarity. ΔLC measures the probability that a node with an orange label is also present in the blue sample but was not observed in that sample. We score a change to the orange label much higher because of the large overlap between the orange and blue k-mer sets.
Figure 2.

Label change scores measure sample similarity. ΔLC measures the probability that a node with an orange label is also present in the blue sample but was not observed in that sample. We score a change to the orange label much higher because of the large overlap between the orange and blue k-mer sets.

Traversal in a variable-order DBG overcomes graph disconnects to spell the sequence ATCTGGAGCT. Traversing from TCTG to its suffix TG allows for traversal to TGGA, so the red nodes compensate for the disconnect in the blue subgraph. In MLA, nodes of length l<k act as proxies for k-mers. In this graph, TGG stands in for CTGG. In our model, nodes of length <k may only traverse to longer nodes (e.g. TGG cannot traverse to GG). The green line represents the path taken to spell the sequence above and the orange arrows represent node length-changing traversals.
Figure 3.

Traversal in a variable-order DBG overcomes graph disconnects to spell the sequence ATCTGGAGCT. Traversing from TCTG to its suffix TG allows for traversal to TGGA, so the red nodes compensate for the disconnect in the blue subgraph. In MLA, nodes of length l<k act as proxies for k-mers. In this graph, TGG stands in for CTGG. In our model, nodes of length <k may only traverse to longer nodes (e.g. TGG cannot traverse to GG). The green line represents the path taken to spell the sequence above and the orange arrows represent node length-changing traversals.

Coverage of each read’s best alignment. Coverage is the percentage of query characters covered by an alignment. SCA+LC uses our ΔLC, but not does allow node length changes. We report median coverages above each distribution.
Figure 4.

Coverage of each read’s best alignment. Coverage is the percentage of query characters covered by an alignment. SCA+LC uses our ΔLC, but not does allow node length changes. We report median coverages above each distribution.

Overall, we show in this work how our fast approximate implementation of MLA produces substantially longer and more accurate alignments compared to state-of-the-art aligners.

2 Preliminaries and background

2.1 Notation and definitions

A ‘string’ is a finite sequence of characters drawn from an alphabet Σ. Σk denotes the set of all strings of length k (k-mers). For a string s=s[1]s[2]s[l] of length |s|=l, with indices 1i<jl, we denote a substring by s[i:j]:=s[i]s[j]. The set of all k-mers extracted from a string set S is denoted by K(S,k):=sSi=1|s|k+1{s[i:i+k1]}.

A ‘node-centric’ DBG of order k representing S has the nodes V:=K(S,k) and implicit edges E(V):={(v1,v2)V2:v1[2:k]=v2[1:k1]} (Chikhi and Rizk 2013). A ‘walk’ is a sequence of nodes W=(v1,,vm) where each (vi,vi+1)E. The ‘spelling’ of W is the string T=v1v2[k]vm[k]. A ‘walk cover’ is a set of walks s.t. each node is visited by at least one walk. The ‘width’ of a graph is the number of walks in a minimal walk cover. An ‘annotated DBG’ has an auxiliary set of string labels L and an annotation A:V2L associating each node with a label set. W is ‘label-consistent’ if i=1mA(vi). V:L2V fetches all nodes with a given label.

We denote a query string by QΣ|Q|. A ‘sequence-to-graph alignment’ of Q to the ‘target string’ T along Wa is a tuple a=(Qa,Ta,Ea,Wa), where Qa is a substring of Q, Ta is a substring of T, and Ea is a sequence of ‘edit operations’ transforming Ta to Qa. These operations are in {match, mismatch, insertion open, insertion extension, deletion open, deletion extension}. Each operation has a ‘score’, denoted by Δ=, Δ, ΔIO, ΔIE, ΔDO, and ΔDE, respectively. Only Δ= is positive, all other scores are negative. The score of a, denoted by ΔS(a), is the sum of all edit operation scores, where a higher score indicates greater similarity. a is label-consistent if Wa is label-consistent. Given two alignments a1,a2 with respective substrings Q[i1:i1+l11],Q[i2:i2+l21] s.t. i1<i2, we define O(a1,a2):=min{l2,i1+l1i2}. a1 and a2 ‘overlap’ if O(a1,a2)>0 and are ‘disjoined’ by a ‘gap’ of length −O(a1,a2) otherwise. For a gap length lN+, we denote the scoring model’s ‘gap penalty’ by ΔG(l). In this work, we assume affine scoring (e.g. ΔG(l):=ΔIO+(l1)ΔIE for an insertion).

2.2 Sequence-to-graph alignment with seed-extend-chain

Modern approximate aligners predominantly use the ‘seed-chain-extend’ paradigm involving (i) ‘seed anchoring’, (ii) ‘colinear chaining’, and (iii) ‘anchor extension’ (Shaw and Yu 2023). A ‘seed’ is a query substring. An ‘anchor’ is a tuple of a seed and a graph walk spelling a superstring of the seed, where each node in the walk contributes at least one character to the spelling. A ‘chain’ is a sequence of anchors s.t. any two consecutive anchors are in order in both the query and the target, meaning that there exists a walk connecting their nodes (Mäkinen et al. 2019). Some works determine a traversal distance between nodes on-the-fly by traversing local neighbourhoods around nodes (Dvorkina et al. 2020, Li et al. 2020). More efficient strategies require a decomposition of the graph, typically into subgraphs (Chang et al. 2020, Rautiainen and Marschall 2020) or a path/walk cover (Mäkinen et al. 2019, Chandra and Jain 2023a, Ma et al. 2023, Rajput et al. 2023). After chaining, anchor extension searches the graph forwards and backwards from the ends of each anchor to find high-scoring walks.

3 Methods

3.1 General alignment workflow

Given a query Q, we find and anchor seeds, compute label-consistent alignments using SCA, then chain these alignments together into multi-label alignments using MLC. First, given a user-set seed length lk, we extract l-mer seeds from Q (Fig. 1.1) and anchor them by fetching all graph nodes with matching l-length suffixes and their associated labels (Fig. 1.2). An anchor α(i,l,v,) is a tuple in A:=N×N×V×L anchoring the seed Q[i:i+l1] to a node v, with an associated label A(v). Afterwards, we find the top-scoring single-label chains (Fig. 1.3, Section 3.4). We extend each chain into a single-label alignment by using global alignment to connect consecutive anchors and ends-free extension from the first and last anchor in the chain. Given the resulting alignment pool, we extract all l-mer anchors and compute multi-label chains (Fig. 1.4, Section 3.5), incorporating label change (ΔLC) and node length change operations (ΔNC). We construct multi-label alignments from the top chains using segments from the label-consistent alignments.

3.2 Deriving a scoring model for novel alignment operations

We now detail the scoring model for our new operations for sequence-to-graph alignment: ΔLC and ΔNC. For all constants defined in this section, refer to Section 4.3 for the values we use in our experiments. We define and extend two probabilistic graphical models for sequence-to-sequence alignment: a ‘target model’ and a ‘null model’, based on the probabilistic models presented by Frith (2019). Briefly, a model represents all possible mutations of an underlying target sequence, where a walk in a model emits edit operations that generate a query sequence (detailed in Supplementary Section S2.2). The score of any edit operation is the log probability ratio of the operation’s corresponding transition probabilities in the target and the null model. The alignment score is the sum of these log ratios (i.e. the query’s log-likelihood ratio). These sequence-to-sequence alignment models trivially induce analogous models for sequence-to-graph alignment by treating the spelling of each walk in a sequence graph as a separate target sequence and, hence, a pair of target and null graphical models.

3.2.1 Deriving and computing label-change scores

For simplicity, suppose that we traverse along an edge (v1,v2) s.t. A(v1)={1} and A(v2)={2}, where 12. We denote this ‘label change’ from 1 to 2 as 12 and score this event using the probability that v2 has label 1 conditioned on v2 having the label 2. Intuitively, this is the probability that the k-mer v2 observed in the sample with label 2 is also present in the sample with label 1, but was not observed (Fig. 2). To formulate this precisely, we extend our alignment models so that each graph-traversing operation emits a label change with transition probability Pr(12). Thus, the ‘label-change score’ is:
(1)
where λLC is a user-set scaling constant. For the null model, we assume no relationship between 1 and 2, so Pr0(12):=Pr(2), where Pr() is the empirical probability of a label : Pr():=|V()||V|. For this new model to be reducible to a label-free setting, we require that ΔLC()=0 (i.e. only score if the label changes) and that ΔLC(12)0 (i.e. no label change increases the score). We satisfy these requirements if Pr(12):=Pr(vV(1)V(2)), simplifying Equation (1) to:
(2)
Due to the log in the definition of ΔLC and the use of integers for alignment scores, we only require order-of-magnitude accuracy when computing Pr(1|2). So, we approximate Pr(1|2) using HyperLogLog++ counters (Heule et al. 2013) of V(1) and V(2), respectively, and the precomputed values |V(1)| and |V(2)|. We use a false-positive probability of 0.05 for the counters and compute
(3)

3.2.2 Deriving penalties for node length changes

Since low sequencing depth can produce disconnected graphs, one way to compensate for this is to allow for node insertions into the graph during the search. In this section, we describe how we use dynamic changes of the underlying DBGs order k during alignment as a more tractable proxy for inserting nodes.

Although MLC only utilizes node length change operations during chaining, we nonetheless incorporate this operation into our scoring model to derive its scoring function. For this, we switch our graph from a DBG of fixed order k to a variable-order DBG of maximum order k (Boucher et al. 2015).

Given a node v of length k with spelling s and a suffix length 1l<k, a node length change is the traversal from v to the node with spelling s[kl+1:k]. In our model, nodes with length l are proxies for missing k-mers (Fig. 3). However, we need to define penalties for changing the node length to prevent degenerate cases, such as searches along walks where every node has a short length (e.g. 1 or 2) that exist for every sequence over the alphabet.

To avoid this case, and to ensure that the model reduces to standard sequence-to-graph alignment on fixed-k DBGs, our scoring model does not penalize traversals that increase the node length by 1 or maintain the node length at k. Otherwise, we penalize traversal from a node of length k to one of length 1l<k with a score ΔNC(kl) and disallow all other node length-changing traversals. Consolidating these rules,
(4)
where ΔJ<0 is a user-set constant.

3.3 Local colinear chaining on assembly DBGs

We now describe the modular chaining algorithm used by both SCA and MLC. Given anchors sorted by increasing end position (i.e. i+l), we perform local chaining using Algorithm 1, based on the more practical alternative algorithm implementation by Jain et al. (2022) (We fall back to the chaining algorithm from minimap2 (Li, 2018) with affine gap scoring if there are more than |Q| anchors (Section 3.6).). Their algorithm minimizes a nonnegative chain cost rather than maximizing an integer score, so we use the equations by Eizenga and Paten (2022) to convert scores into costs when evaluating our termination condition. This ‘forward pass’ computes optimal chaining scores. We reconstruct chains by ‘backtracking’, ensuring that we incorporate each anchor into, at most, one chain.

The helper function connect:A×AZ approximates the score of a global alignment connecting anchor α1 to α2, with different implementations for SCA and MLC. Note that this score does not include α1 since its score is already included in the score vector ΔS when computing updates.

Algorithm 1

Computing local colinear chaining scores.

Input: A sequence of anchors (α(i1,l1,v1,1),,α(in,ln,vn,n)) s.t. x<yix+lxiy+ly. An initial guess of the maximum distance between anchors bN+ and a scaling factor b2>1.

Output:ΔSZn s.t. ΔS[j] is the best chaining score from any subsequence of (α(i1,l1,v1,1),,α(ij,lj,vj,j)) ending with α(ij,lj,vj,j).

1: ΔS[Δ=·l1,,Δ=·ln]

2: repeat

3:   F1

4:   forL1 to ndo

5:    Fmin{j{F,,L}:iL+lL(ij+lj)b}

6:    forjF to L s.t. ij<iL and ij+lj<iL+lLdo

7:     sΔS[j]+connect(α(ij,lj,vj,j),α(iL,lL,vL,L))

8:     ΔS[L]max{ΔS[L],s}

9:    end for

10:     end for

11:   blastb

12:   bb·b2 Increase max. distance between anchors

13:   sbestmax{ΔS[1],,ΔS[n]} Find the best score.

14:   cbest2(|Q|sbestΔ=) Convert score into a cost.

15: untilcbest>blast

16: returnΔS

3.4 SCA: single-label seed-chain-extend alignment

SCA implements colinear chaining and anchor extension algorithms for label-consistent alignment. After seed anchoring, we merge the anchors into maximal unique matches (MUMs). We then group the anchors by label and perform separate forward passes of the chaining algorithm for each group. Afterwards, we select the top ρ|Q| chains (with ties) among all labels and backtrack to construct these chains. ρ is a user-set chain density (Section 4.3).

To compute the connection score between two anchors α1 and α2, we sum a match score for the additional characters introduced by α2 to a gap penalty based on the absolute difference between (i) the difference of α1 and α2’s seed end positions in the query, and (ii) the traversal distance between their nodes in the graph (Supplementary Algorithm S1). To quickly estimate a traversal distance between two nodes along a label-consistent walk, we use walk covers of the subgraphs representing each label. A traversal distance is known if there exists a walk from α1 to α2 in the cover.

For each chain, we connect consecutive anchors using global alignment, then extend the first anchor backwards and the last anchor forwards using ends-free extension. For global alignment, we use TCG-Aligner (Karasikov et al. 2022) to ensure that each connecting walk is represented by the cover. For ends-free extension, we modify TCG-Aligners extender to restrict traversal to label-consistent walks. To further reduce the number of extensions, we discard chains that overlap with already-completed alignments. The result is a pool of label-consistent alignments.

3.5 MLC: multi-label colinear chaining

Given a pool of label-consistent alignments, we extract anchors from these alignments and construct multi-label chains (Fig. 1.4). Unlike the colinear chaining method in SCA, we have access to global alignment scores for connecting any in-order pair of anchors extracted from the same label-consistent alignment. We leverage this to define an anchor connection score that ensures that MLCs chaining scores equal the final MLA scores.

One property of MLCs scoring is that, given a chain of anchors from the same alignment and label, we only need the anchors at the beginning and end of the chain to compute the chain score. So, as a preprocessing step, we discard any anchor α(ij,lj,vj,j) if it is the only anchor at position ij and if another anchor α(ij+1,lj+1,vj+1,j) exists from the same alignment.

After the forward pass, we reconstruct the highest scoring chain for each label from the pool of label-consistent alignments. We construct a multi-label alignment from each chain by connecting consecutive anchors using alignment segments from the pool.

3.5.1 Multi-label anchor connection scoring

We now detail MLCs anchor connection scoring (Supplementary Algorithm S2). Suppose we have anchors α(ij,lj,vj,j) and α(iL,lL,vL,L) from alignments ax and ay, respectively. We consider three cases for the update score: (i) extending along the same alignment (i.e. x=y), (ii) connecting disjoint alignments (i.e. O(ax,ay)0), and (iii) connecting overlapping alignments (i.e. O(ax,ay)>0).

  • If x=y, then the anchor connection score is the score of ay’s segment up to the longer query substring ending at iL+l (i.e. ΔS(ay[:iL+l])ΔS(ay[:ij+l])).

  • If ax and ay cover disjoint regions of Q, then the connection score includes the sum of ax’s remaining score, the score of ay’s segment up until iL+l, the label change score, and the gap penalty for inserting extra query characters:
    (5)

    To differentiate this case from (iii), we insert a sentinel $ character into the target spelling with an incurred ΔDO score.

  • If ax and ay cover overlapping regions of Q, assume that the current alignment segment of ax up until ij+l is of length k and that |ay[iL:]|k (meant to avoid ambiguity when spelling a walk). We also only consider overlaps of length <k1 (i.e. one cannot traverse from ax to ay with a single step without modifying the graph) since we assume that longer overlaps would have been discovered otherwise via normal graph traversal during anchor extension. We use a two-step procedure to try to connect the two anchors. First, we try to find an intermediate anchor α(ij,lj,vj,L) from ay s.t. ij=ij and hop to that node. Then we continue the traversal along ay towards vj to complete the connection. If such a node exists, the connection score is
    (6)

Using Fig. 3 as an example, one such intermediate anchor is the node AATG sharing a 2-mer suffix with the node TCTG.

Note that this score does not depend on the length of vj and vj’s longest common suffix if vjvj. We motivate this design choice with the following observation: If there is a single character mismatch between position kl+1 (where l need not equal the seed length l) in a node v’s spelling and position kl in another node v’s spelling that prevents the nodes from being adjacent (e.g. v spells GATGC and v spells ACGCT for k=5 and l=3), then we require l node insertions to create a path to v from a predecessor of v, despite the ground truth that it originated from a single substitution (for our example, given appropriate characters c,dΣ, these new nodes would spell cdGAC, dGACG, and GACGC). Thus, defining the connection score as a function of kl instead of kl would induce unequal scores for each value of l, even though all of these edit events are equally likely.

3.6 Time and space complexity

Suppose we have n sorted anchors (incurring a worst-case complexity of O(nlogn) if the seeder does not produce sorted anchors). Let C denote the time to execute a connect function. If n|Q|, then Algorithm 1 has an average-case time complexity in O(c*|Q|C), where c* is the minimum chain cost (Jain et al. 2022). If n>|Q|, then we fall back to the chaining algorithm from minimap2 (Li 2018) with a worst-case time complexity in O(bnC) for the forward pass, where b is a user-set bandwidth parameter. Although we perform ρ|Q| backtracking procedures, since each anchor is incorporated into at most one chain, we terminate a procedure as soon as we reach an already-chained anchor. So, backtracking takes worst-case O(n) time. Alongside the n anchors, we store Θ(1) information per anchor for the chaining scores and backtracking information, so the total space complexity is in Θ(n).

Consider SCAs connect. Suppose that Q maps to L labels, with corresponding anchor counts n1,,nL, walk cover sizes W1,,WL, and optimal chaining costs c1*,,cL*. Since MUMs generally satisfy ni|Q| (Jain et al. 2022), the average-case time complexity of SCAs forward passes is O(|Q|i=1Lci*Wi). After extension, we extract m anchors from the resulting alignments.

Considering MLC, it is clear that CMLAO(1). Since we chain anchors from multiple labels in a single forward pass, we can no longer assume that m|Q| will generally hold. Thus, multi-label chaining has an average-case time complexity in O(c*|Q|) when m|Q| and worst-case time complexity in O(bm) otherwise. Splicing alignment segments to convert a multi-label chain into an alignment that spells T runs in worst-case O(|T|) time.

4 Evaluation methodology

We compare our methods to GraphAligner (Rautiainen and Marschall 2020), a state-of-the-art tool for traditional sequence-to-graph alignment to DBGs, and PLAST (Schulz et al. 2021), a BLAST-like tool for label-consistent alignment to annotated DBGs. We implement two additional baselines: SCA+LC100 implements multi-label chaining with ΔLC=100 (plotted in Supplementary Fig. S3), while SCA+LC implements our ΔLC but disallows node length changes. Following this terminology, we denote our full MLA method of SCA+MLC as SCA+LC+NC. We evaluate all tools on a simulated joint assembly graph indexing 3042 fungi genomes from GenBank (Sayers et al. 2022).

4.1 Simulating assembly graphs and query sequences

From each genome, we simulate a 10×-coverage Illumina HiSeq-type read set using ART v2.5.8 (Huang et al. 2011) and construct a cleaned assembly DBG of order k=31 using the procedure in MetaGraph (Karasikov et al. 2020). Briefly, this involves pruning graph tips of length <2k and discarding any unitig whose median k-mer count is below a noise count threshold computed from the sample’s k-mer spectrum. We then merge these graphs into an annotated DBG with nodes labelled by their source accession IDs. To merge the graphs, we compute walk covers for each graph and construct the joint graph from these sequences.

We preprocess the graph and generate walks using the procedure described by Rajput et al. (2023). To reduce the final representation size, we refrain from maintaining a matrix of traversal distances from each node to each stored walk. Instead, we augment each cover with additional walks spanning the edges discarded during preprocessing. We losslessly encode the walks as coordinate graph annotations (Karasikov et al. 2022). We use the GFA representation of the joint graph to interface with GraphAligner and the walk covers to construct the PLAST index.

We simulate query reads from the same genomes with a different random seed, using ART for Illumina HiSeq-type reads and pbsim3 (Ono et al. 2022) for PacBio Sequel CLR-type, PacBio Sequel HiFi-type, and ONT-type reads. We generate HiFi reads from simulated 10-pass Sequel subreads using the PacBio ccs tool. We form a query set for each read type by drawing 100 random reads from the pool of reads aggregated from all genomes.

4.2 Recall-, coverage-, and taxonomy-based measures

Alongside read mapping quality, we measure the accuracy of the retrieved labels w.r.t. the ground-truth genomes using ‘taxonomic profiles’ since our query reads are simulated from known genomes.

Taxonomic classification accuracy is highly dependent on which alignments are reported in a read mapping. So, we first find an appropriate score cut-off to determine which alignments to report for each read. Since our test reads are of different lengths, we divide each score by the length of its corresponding query to get a ‘relative score’. Given the alignments of a read sorted by decreasing relative score and a cut-off, we select alignments by greedily picking a disjoint subset whose relative scores are above the cut-off. We vary the cut-off from 0.0 to 1.0 in steps of 0.02. For each cut-off, we evaluate the (i) recall and the (ii) mean taxonomic profile error. We measure taxonomic profile error using the WGSUniFrac error of the profile relative to the ground-truth profile (Wei and Koslicki 2022), a measure of the fraction of the taxonomic tree traversal distance that differs between the two profiles. For easier interpretation, we define the ‘UniFrac accuracy’ as 1.0WGSUniFracerror. Since taxonomic IDs are not available for all strains, we generate a custom taxonomic tree by augmenting the NCBI Taxonomy with a new leaf node for each GenBank accession ID. The ground-truth profile of a read states that its ground-truth accession is 100% abundant, whereas the profiles computed from alignments weight each accession by the fraction of query characters covered by alignments to that accession.

4.3 Experimental setup and code availability

We performed all experiments on an AMD EPYC-Rome processor from ETH Zurich’s high-performance compute systems using a single thread and a seed size of l=19, with default parameters for all tools except for the following: For scoring, we set Δ==1 and Δ=ΔIO=ΔIE=ΔDO=ΔDE=1. For SCA and MLC, we use a chain density of ρ=0.01 and chaining parameters b=400 and b2=4. In MLC, we set the label-change score scaling factor to λLC=Δ= and the node length change scaling factor to ΔJ=ΔDE. We motivate these choices in Supplementary Section S2.4. We use the BOSS representation to index our DBGs, which natively supports the subset of variable-order DBG operations used for node length changes (Bowe et al. 2012). We implemented our methods within the GPLv3-licensed MetaGraph framework (Karasikov et al. 2020) hosted at https://github.com/ratschlab/metagraph.

5 Results and discussion

5.1 Assembly graphs are much wider than pangenomes

Our simulations produced read sets with a median (mean) of 700 470 (1 768 812) k-mers per sample. These resulted in graphs with a median (mean) size of 48 660 (122 826) k-mers (Supplementary Fig. S1). After merging these assembly graphs, the annotated DBG contains 187 662 586 k-mers and its index consumes 179 MB. We report execution statistics for graph indexing in Supplementary Section S2.5.

We observe that unlike the pangenome graphs explored in previous chaining works with widths ranging from 1 to 60 (Chandra and Jain 2023a, Ma et al. 2023, Rajput et al. 2023), our assembly graphs are much wider, with a median (mean) width of 221 (542.8). These observations corroborate our choice to refrain from encoding a full node-to-walk distance matrix for SCA.

5.2 Multi-label alignments are substantially longer, computed at competitive execution times

For all read types, we observe that SCA+LC+NC produces the longest alignments (Fig. 4), with median query coverage increases of 47.4% for HiFi, 46.7% for CLR, and 45.5% for ONT reads, respectively, relative to the next-best state-of-the-art aligner. All methods have median coverages of 100% for Illumina reads, with the smallest interquartile range observed for SCA+LC+NC. Our baselines SCA+LC100 and SCA+LC achieve similar or slightly improved performance relative to SCA, but far below SCA+LC+NC. PLAST has the highest third quartile coverage for PacBio reads among all tools and higher median coverage on HiFi reads than the most comparable tool SCA. However, its execution time is 280–1200× slower than SCA due to its default behaviour of extending a large number (200) of high-scoring matches per query (Fig. 7).

5.3 Multi-label alignment maintains the best classification accuracy at most read mapping recall levels

Sweeping through different relative score cut-offs, we observe that the full MLA (SCA+LC+NC) method has the greatest recall on error-prone reads for all cut-offs and the greatest recall for HiFi reads at cut-offs below 0.6 (Fig. 5). All tools perform similarly on Illumina reads. When comparing taxonomic profiles, SCA+LC+NC has the greatest UniFrac accuracy for long reads at recall values above 20% for HiFi and ONT reads and above 50% for CLR reads (Fig. 6). All tools perform similarly on Illumina reads. The areas under the mean UniFrac Accuracy-Recall curves (AUARCs) for Illumina reads are 0.930 for PLAST and 0.895 for our methods (Fig. 7). For all long-read types, SCA+LC+NC has the highest AUARCs, ranging from 0.852 to 0.882, compared to SCA (0.806–0.835), SCA+LC100 (0.811–0.844), SCA+LC (0.815–0.845), and PLAST (0.548–0.677). We infer that MLA improves accuracy by chaining together alignments to related samples.

Read mapping recall for different relative score cut-offs. For each relative score (i.e. score scaled by query length) cut-off, the recall is the fraction of reads with an alignment scoring at least at that cut-off.
Figure 5.

Read mapping recall for different relative score cut-offs. For each relative score (i.e. score scaled by query length) cut-off, the recall is the fraction of reads with an alignment scoring at least at that cut-off.

Mean UniFrac accuracy of all mapped reads at different read mapping recall levels. UniFrac accuracy measures the similarity between a read’s taxonomic profile (induced from the labels of all reported alignments) and its ground-truth profile, similar to precision. Shading indicates 95% confidence intervals of the mean estimated from 1000 bootstrap samples. Based on our interpretation of WGSUniFrac error values (detailed in Supplementary Section S2.3), each y-axis grid line corresponds to the midpoint value of a taxonomic rank (accession, strain, species, genus, etc.). The y-axis is scaled to better emphasize the upper range of UniFrac accuracy values, corresponding to accuracy at lower taxonomic ranks. We exclude GraphAligner since it does not consider labels.
Figure 6.

Mean UniFrac accuracy of all mapped reads at different read mapping recall levels. UniFrac accuracy measures the similarity between a read’s taxonomic profile (induced from the labels of all reported alignments) and its ground-truth profile, similar to precision. Shading indicates 95% confidence intervals of the mean estimated from 1000 bootstrap samples. Based on our interpretation of WGSUniFrac error values (detailed in Supplementary Section S2.3), each y-axis grid line corresponds to the midpoint value of a taxonomic rank (accession, strain, species, genus, etc.). The y-axis is scaled to better emphasize the upper range of UniFrac accuracy values, corresponding to accuracy at lower taxonomic ranks. We exclude GraphAligner since it does not consider labels.

Comparison of performance measures for each alignment tool. AUARC is the area under the UniFrac Accuracy–Recall curve (Fig. 6), where higher values indicate greater accuracy. Error bars indicate 95% confidence intervals calculated from 1000 bootstrap samples. See Supplementary Table S3 for the numbers plotted here.
Figure 7.

Comparison of performance measures for each alignment tool. AUARC is the area under the UniFrac Accuracy–Recall curve (Fig. 6), where higher values indicate greater accuracy. Error bars indicate 95% confidence intervals calculated from 1000 bootstrap samples. See Supplementary Table S3 for the numbers plotted here.

5.4 Overall performance

Overall, multi-label alignments are substantially longer than alignments produced by other tools, while maintaining competitive execution times and the lowest RAM usage (Fig. 7). We decompose our methods’ execution times into their different components in Supplementary Table S4. For Illumina reads, SCA and SCA+LC+NC are substantially faster than all other tools. GraphAligner is the fastest tool for long reads, while SCA and SCA+LC+NC are comparable.

6 Conclusions

In this work, we presented the MLA strategy, a novel alignment scoring model that compensates for disconnects in low-depth assembly graphs by combining sequence information from multiple related samples and improving graph connectivity during alignment. We implement MLA on annotated DBGs with a two-step process: SCA computes label-consistent alignments and MLC computes multi-label alignments from these label-consistent alignments.

Since SCA does not encode a traversal distance matrix from each node to each walk encoded by the graph annotation, providing such a matrix can potentially increase label-consistent alignment lengths and consequently provide a stronger basis for MLC. Our observation that PLAST produces longer alignments than SCA on low-error reads provides further evidence that a more rich traversal distance index can improve SCAs alignment performance. Despite the large widths of assembly graphs, we expect this matrix to be sparse and, hence, easily compressible.

Another possible extension is merging SCA and MLC into a single holistic seed-chain-extend procedure. We maintained these two steps in this work to provide a small number of anchors to each chaining run. One can explore how well our current approach approximates this unified approach and how to approximate an ends-free extension incorporating label and node length changes. In this context, there are a few possibilities for implementing the node length change operation into chain scoring. These include using the current procedure of finding intermediate suffix-sharing nodes and a more sensitive, but daunting approach of representing walk covers for all desirable node length values l<k.

Although our algorithms are implemented within the MetaGraph framework, the concepts from our methods can readily be applied in other pangenome graph frameworks and potentially see more widespread use. These methods make unassembled read sets a more powerful resource for bioinformatics research.

Acknowledgements

The authors would like to thank Ragnar Groot Koerkamp, Maximilian Mordig, Mohammed Alser, Mario Stanke, Inanc Birol, and anonymous reviewers for their helpful discussions and feedback.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

H.M. and M.K. were partially funded as part of Swiss National Research Programme (NRP) 75 ‘Big Data’ by the SNSF grant number 167331. A.K. is partially funded by the LOOP Zurich and the Monique Dornonville de la Cour Foundation to G.R. H.M., M.K., and A.K. were also partially funded by ETH core funding (to G.R.). H.M. is also partially funded by the Personalized Health and Related Technologies (PHRT) Transition Postdoc Fellowship Project Number 2021/453.

Data availability

The data (including accession IDs for all reference genomes), scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.

References

Almodaresi
F
,
Sarkar
H
,
Srivastava
A
et al.
A space and time-efficient index for the compacted colored De Bruijn graph
.
Bioinformatics
2018
;
34
:
i169
77
. https://doi.org/10.1093/bioinformatics/bty292

Avila
J
,
Bonizzoni
P
,
Ciccolella
S
et al. RecGraph: adding recombinations to sequence-to-graph alignments. bioRxiv, https://doi.org/10.1101/2022.10.27.513962,
2022
, preprint: not peer reviewed.

Bankevich
A
,
Bzikadze
AV
,
Kolmogorov
M
et al.
Multiplex De Bruijn graphs enable genome assembly from long, high-fidelity reads
.
Nat Biotechnol
2022
;
40
:
1075
81
. https://doi.org/10.1038/s41587-022-01220-6

Boucher
C
,
Bowe
A
,
Gagie
T
et al. Variable-order De Bruijn graphs. In: DCC ‘15: Proceedings of the 2015 Data Compression Conference, Snowbird, Utah, USA. Washington, DC, USA: IEEE Computer Society.
383
92
.
2015
. https://doi.org/10.1109/DCC.2015.70

Bowe
A
,
Onodera
T
,
Sadakane
K
,
Shibuya
T.
Succinct De Bruijn graphs. In:
Raphael
B
,
Tang
J
(eds),
Algorithms in Bioinformatics
.
Berlin, Heidelberg
:
Springer Berlin Heidelberg
,
2012
,
225
35
.

Chandra
G
,
Jain
C.
Gap-sensitive colinear chaining algorithms for acyclic pangenome graphs
.
J Comput Biol
2023a
;
30
:
1182
97
. https://doi.org/10.1089/cmb.2023.0186

Chandra
G
,
Jain
C.
Haplotype-aware sequence-to-graph alignment. bioRxiv, https://doi.org/10.1101/2023.11.15.566493,
2023b
, preprint: not peer reviewed.

Chang
X
,
Eizenga
J
,
Novak
AM
et al.
Distance indexing and seed clustering in sequence graphs
.
Bioinformatics
2020
;
36
:
i146
53
. https://doi.org/10.1093/bioinformatics/btaa446

Chikhi
R
,
Rizk
G.
Space-efficient and exact De Bruijn graph representation based on a Bloom filter
.
Algorithms Mol Biol
2013
;
8
:
22
. https://doi.org/10.1186/1748-7188-8-22

Colquhoun
RM
,
Hall
MB
,
Lima
L
et al.
Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs
.
Genome Biol
2021
;
22
:
267
. https://doi.org/10.1186/s13059-021-02473-1

Danko
D
,
Bezdan
D
,
Afshin
EE
et al.
A global metagenomic map of urban microbiomes and antimicrobial resistance
.
Cell
2021
;
184
:
3376
93.e17
. https://doi.org/10.1016/j.cell.2021.05.002.

Dvorkina
T
,
Antipov
D
,
Korobeynikov
A
et al.
SPAligner: alignment of long diverged molecular sequences to assembly graphs
.
BMC Bioinformatics
2020
;
21
:
306
. https://doi.org/10.1186/s12859-020-03590-7

Eizenga
JM
,
Paten
B.
Improving the time and space complexity of the WFA algorithm and generalizing its scoring. bioRxiv, https://doi.org/10.1101/2022.01.12.476087,
2022
, preprint: not peer reviewed.

Fan
J
,
Khan
J
,
Singh
NP
et al.
Fulgor: a fast and compact k-mer index for large-scale matching and color queries
.
Algorithms Mol Biol
2024
;
19
:
3
. https://doi.org/10.1186/s13015-024-00251-9

Frith
MC.
How sequence alignment scores correspond to probability models
.
Bioinformatics
2019
;
36
:
408
15
. https://doi.org/10.1093/bioinformatics/btz576

Garrison
E
,
Sirén
J
,
Novak
AM
et al.
Variation graph toolkit improves read mapping by representing genetic variation in the reference
.
Nat Biotechnol
2018
;
36
:
875
9
. https://doi.org/10.1038/nbt.4227

Gibney
D
,
Thankachan
SV
,
Aluru
S.
On the hardness of sequence alignment on De Bruijn graphs
.
J Comput Biol
2022
;
29
:
1377
96
. https://doi.org/10.1089/cmb.2022.0411

Harrison
PW
,
Ahamed
A
,
Aslam
R
et al.
The European Nucleotide Archive in 2020
.
Nucleic Acids Res
2020
;
49
:
D82
5
. https://doi.org/10.1093/nar/gkaa1028

Heule
S
,
Nunkesser
M
,
Hall
A.
Hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology, Genoa, Italy. New York, NY, USA: Association for Computing Machinery.
683
92
.
2013
. https://doi.org/10.1145/2452376.2452456

Heydari
M
,
Miclotte
G
,
Van de Peer
Y
et al.
BrownieAligner: accurate alignment of Illumina sequencing data to De Bruijn graphs
.
BMC Bioinformatics
2018
;
19
:
311
. https://doi.org/10.1186/s12859-018-2319-7

Hickey
G
,
Heller
D
,
Monlong
J
et al.
Genotyping structural variants in pangenome graphs using the vg toolkit
.
Genome Biol
2020
;
21
:
35
. https://doi.org/10.1186/s13059-020-1941-7

Holley
G
,
Melsted
P.
Bifrost: highly parallel construction and indexing of colored and compacted De Bruijn graphs
.
Genome Biol
2020
;
21
:
249
. https://doi.org/10.1186/s13059-020-02135-8

Huang
W
,
Li
L
,
Myers
JR
et al.
ART: a next-generation sequencing read simulator
.
Bioinformatics
2011
;
28
:
593
4
. https://doi.org/10.1093/bioinformatics/btr708

Iqbal
Z
,
Caccamo
M
,
Turner
I
et al.
De novo assembly and genotyping of variants using colored De Bruijn graphs
.
Nat Genet
2012
;
44
:
226
32
. https://doi.org/10.1038/ng.1028

Ivanov
P
,
Bichsel
B
,
Mustafa
H
et al. AStarix: fast and optimal sequence-to-graph alignment. In: Research in Computational Molecular Biology, Padua, Italy. Cham, Switzerland: Springer.
104
19
.
2020
.

Ivanov
P
,
Bichsel
B
,
Vechev
M.
Fast and optimal sequence-to-graph alignment guided by seeds. In: Research in Computational Molecular Biology, San Diego, CA, USA. Cham, Switzerland: Springer.
306
25
.
2022
.

Jain
C
,
Gibney
D
,
Thankachan
SV.
Algorithms for colinear chaining with overlaps and gap costs
.
J Comput Biol
2022
;
29
:
1237
51
. https://doi.org/10.1089/cmb.2022.0266

Joudaki
A
,
Meterez
A
,
Mustafa
H
et al.
Aligning distant sequences to graphs using long seed sketches
.
Genome Res
2023
;
33
:
1208
17
. https://doi.org/10.1101/gr.277659.123

Karasikov
M
,
Mustafa
H
,
Danciu
D
et al. MetaGraph: indexing and analysing nucleotide archives at Petabase-scale. bioRxiv, https://doi.org/10.1101/2020.10.01.322164,
2020
, preprint: not peer reviewed.

Karasikov
M
,
Mustafa
H
,
Rätsch
G
et al.
Lossless indexing with counting De Bruijn graphs
.
Genome Res
2022
;
32
:
1754
64
. https://doi.org/10.1101/gr.276607.122

Katz
K
,
Shutov
O
,
Lapoint
R
et al.
The Sequence Read Archive: a decade more of explosive growth
.
Nucleic Acids Res
2021
;
50
:
D387
90
. https://doi.org/10.1093/nar/gkab1053

Krannich
T
,
White
WTJ
,
Niehus
S
et al.
Population-scale detection of non-reference sequence variants using colored De Bruijn graphs
.
Bioinformatics
2021
;
38
:
604
11
. https://doi.org/10.1093/bioinformatics/btab749

Lee
C
,
Grasso
C
,
Sharlow
MF.
Multiple sequence alignment using partial order graphs
.
Bioinformatics
2002
;
18
:
452
64
. https://doi.org/10.1093/bioinformatics/18.3.452

Li
H.
Minimap2: pairwise alignment for nucleotide sequences
.
Bioinformatics
2018
;
34
:
3094
100
. https://doi.org/10.1093/bioinformatics/bty191

Li
H
,
Feng
X
,
Chu
C.
The design and construction of reference pangenome graphs with minigraph
.
Genome Biol
2020
;
21
:
265
10
. https://doi.org/10.1186/s13059-020-02168-z

Limasset
A
,
Cazaux
B
,
Rivals
E
et al.
Read mapping on De Bruijn graphs
.
BMC Bioinformatics
2016
;
17
:
237
. https://doi.org/10.1186/s12859-016-1103-9

Luhmann
N
,
Holley
G
,
Achtman
M.
BlastFrost: fast querying of 100,000s of bacterial genomes in Bifrost graphs
.
Genome Biol
2021
;
22
:
30
. https://doi.org/10.1186/s13059-020-02237-3

Ma
J
,
Cáceres
M
,
Salmela
L
et al.
Chaining for accurate alignment of erroneous long reads to acyclic variation graphs
.
Bioinformatics
2023
;
39
:
btad460
. https://doi.org/10.1093/bioinformatics/btad460

Mäkinen
V
,
Tomescu
AI
,
Kuosmanen
A
et al.
Sparse dynamic programming on DAGs with small width
.
ACM Trans Algorithms
2019
;
15
:
1
21
. https://doi.org/10.1145/3301312

Marchet
C
,
Boucher
C
,
Puglisi
SJ
et al.
Data structures based on k-mers for querying large collections of sequencing data sets
.
Genome Res
2021
;
31
:
1
12
. https://doi.org/10.1101/gr.260604.119

Marchet
C
,
Iqbal
Z
,
Gautheret
D
et al.
REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
.
Bioinformatics
2020
;
36
:
i177
85
. https://doi.org/10.1093/bioinformatics/btaa487

Morgulis
A
,
Gertz
EM
,
Schäffer
AA
et al.
A fast and symmetric dust implementation to mask low-complexity DNA sequences
.
J Comput Biol
2006
;
13
:
1028
40
. https://doi.org/10.1089/cmb.2006.13.1028

Ono
Y
,
Hamada
M
,
Asai
K.
PBSIM3: a simulator for all types of PacBio and ONT long reads
.
NAR Genom Bioinform
2022
;
4
:
lqac092
. https://doi.org/10.1093/nargab/lqac092

Pandey
P
,
Almodaresi
F
,
Bender
MA
et al.
Mantis: a fast, small, and exact large-scale sequence-search index
.
Cell Syst
2018
;
7
:
201
7.e4
. https://doi.org/10.1016/j.cels.2018.05.021

Rahman
A
,
Medvedev
P.
Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs
.
Genome Res
2022
;
32
:
1746
53
. https://doi.org/10.1101/gr.276601.122

Rajput
J
,
Chandra
G
,
Jain
C.
Co-linear chaining on pangenome graphs. In: 23rd International Workshop on Algorithms in Bioinformatics, Volume 273 of Leibniz International Proceedings in Informatics, Houston, TX, USA. Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
12:1
12:18
.
2023
. https://doi.org/10.4230/LIPIcs.WABI.2023.12

Rautiainen
M
,
Marschall
T.
GraphAligner: rapid and versatile sequence-to-graph alignment
.
Genome Biol
2020
;
21
:
253
. https://doi.org/10.1186/s13059-020-02157-2

Rhie
A
,
McCarthy
SA
,
Fedrigo
O
et al.
Towards complete and error-free genome assemblies of all vertebrate species
.
Nature
2021
;
592
:
737
46
. https://doi.org/10.1038/s41586-021-03451-0

Rosen
Y
,
Eizenga
J
,
Paten
B.
Modelling haplotypes with respect to reference cohort variation graphs
.
Bioinformatics
2017
;
33
:
i118
23
. https://doi.org/10.1093/bioinformatics/btx236

Sayers
EW
,
Bolton
EE
,
Brister
JR
et al.
Database resources of the National Center for Biotechnology Information in 2023
.
Nucleic Acids Res
2022
;
51
:
D29
38
. https://doi.org/10.1093/nar/gkac1032

Schulz
T
,
Wittler
R
,
Rahmann
S
et al.
Detecting high-scoring local alignments in pangenome graphs
.
Bioinformatics
2021
;
37
:
2266
74
. https://doi.org/10.1093/bioinformatics/btab077

Shaw
J
,
Yu
YW.
Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic
.
Genome Res
2023
;
33
:
1175
87
. https://doi.org/10.1101/gr.277637.122

Sirén
J.
Indexing variation graphs. In: Proceedings of the Meeting on Algorithm Engineering and Experiments, Barcelona, Spain. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.
13
27
.
2017
. https://doi.org/10.1137/1.9781611974768.2

Sirén
J
,
Monlong
J
,
Chang
X
et al.
Pangenomics enables genotyping of known structural variants in 5202 diverse genomes
.
Science
2021
;
374
:
abg8871
. https://doi.org/10.1126/science.abg8871

Turner
I
,
Garimella
KV
,
Iqbal
Z
et al.
Integrating long-range connectivity information into De Bruijn graphs
.
Bioinformatics
2018
;
34
:
2556
65
. https://doi.org/10.1093/bioinformatics/bty157

Wei
W
,
Koslicki
D.
WGSUniFrac: applying UniFrac metric to whole genome shotgun data. In: 22nd International Workshop on Algorithms in Bioinformatics, Volume 242 of Leibniz International Proceedings in Informatics, Potsdam, Germany. Dagstuhl, Germany: Schloss Dagstuhl -- Leibniz-Zentrum für Informatik.
15:1
15:22
.
2022
. https://doi.org/10.4230/LIPIcs.WABI.2022.15

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data