Integrating long-range connectivity information into de Bruijn graphs

Abstract Motivation The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data. Results We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes. Availability and implementation Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex. Supplementary information Supplementary data are available at Bioinformatics online.


Theoretical proofs
In this section we prove some properties of linked de Bruijn graphs. We make the assumption that the graph is constructed from input sequences without reading errors or coverage gaps and that reads satisfy Ukkonen's condition wherein the read length L is at least one base longer than the longest interleaved or triple repeat (Ukkonen et al., 1992, Bresler et al., 2013. The Linked de Bruijn graph is defined as LG(k) = (V, E, L) where k is the de Bruijn graph parameter, and V and E are defined as in a de Bruijn graph. Vertices V is the set of k-mer-keys.

Notation
A k-mer is represented by a vertex, orientation -tuple called an oriented-vertex.
Traversal means moving through oriented vertices and along edges between vertices, using link information. v a v b v c means we start walking at vertex v a reach vertex v v then continue traversal to reach vertex v c , each with an unambiguous route. In other words if we start at v a there is only one valid path to follow and it reaches v b and so on to v c . v v v y means we cannot traverse unambiguously to vertex v y if we start at v x . We can therefore make the following statements: In Supplementary Figure 1, we can see that it is possible to traverse unambiguously from . This is because we hit a fork (bifurcation) in the graph that cannot be resolved.
An oriented vertex in the graph represents a k-mer which occurs in one or more locations in the input sequences. For a given position in the input sequence s x , we make the following definitions: 1. V (s x ) is the oriented vertex representing the k-mer starting at position s x in the input sequence; this is a many-to-one mapping.
3. S(v x ) is the set of sequence positions represented by oriented vertex v x .

Repeats
A repeated substring is a substring that occurs more than once in the input genome. Repeated substrings may be overlapping and may occur reverse-complemented. A maximal repeated substring is a substring such that adding a single character to the head or tail would decrease the number of occurrences of it. A dynamic programming solution can find the set of maximal repeated substrings (including reverse-complements) for any string in time O(N 2 ) and memory O(N ). As an example, for the string ababababa the set of maximal repeated substrings is {abababa, ababa, aba, a} with occurrence counts {2, 3, 4, 5} respectively. We can paint repeats onto the input genome by exhaustively finding all maximal substrings that appear at one or more other positions in the input sequence. Repeats must be of length ≥ k, where k is the parameter of the de Bruijn graph. Graph construction is then equivalent to gluing all repeats together (see Supplementary Figure 2). All repeats now take the general form shown in 2.(d) -at the start the graph collapses down from two or more vertices and at the end forks into two or more vertices. Between the start and end of the repeat the sequences that carry the repeat do not separate. We can now see that every fork in the graph is the ending of a repeat, and conversely that every repeat ends at a fork. If we are traversing a path through the graph and hit a fork, either we started in the repeat or we passed the start of the repeat that ends at this fork. We define: 1. REP S (s x ) as the set of repeats that sequence position s x is contained in.
2. REP V (v x ) as the union of REP S (X) for all positions X in the sequence where the k-mer associated with vertex v x appears: An example is shown in Supplementary Figure 3. Note that vertex orientation has no effect on the set of repeats a vertex is in: Entering repeats presents no issue in assembly. Leaving repeats requires some information about current location(s) in the underlying sequence. This information must be stored before you enter a repeat. Picture starting graph traversal from the middle of the red repeat in Supplementary  Figure 2.(c): once you reach the end of the red repeat, you cannot make a decision about where to go. If you were to start at v w , you can see that just before you enter each repeat, you pick up an annotation which enables you to resolve it.
Starting traversal from within a repeat is equivalent to walking multiple places in the input sequence at once, and only tracing the consensus sequence of the repeats (stopping at the end of the repeat).

Sequence Traversal
Assuming a Linked de Bruijn Graph constructed with complete information (no sequencing error, coverage gaps and all repeats contain by at least one read), we can prove the following rules about traversal: Proposition 2.1 (Sequence Traversal). Let s x , s y be points on the same input sequence, where s y follows s x . Traversal from the vertex representing s x to the vertex representing s y is possible ⇐⇒ the set of maximal substrings of s x is a subset of s y i.e.
Proof. In order to traverse from one vertex to another we must not leave any of the repeats we were originally within, as doing so would mean we hit a fork we could not resolve, therefore: If we enter repeats that start after s x then we we pick up annotations just before they start which are used to resolve them. The only graph features that will stop traversal between two connected vertices are forks representing the end of repeats that we started within. These cannot be resolved unambiguously. Therefore if we do not leave repeats we started in, we can traverse connected vertices: We can see that Proposition (2.1) follows from equations (20) and (21).

Proposition 2.2 (Sequence Transitivity).
Let s x , s y , s z be points that appear in that order on the same input sequence. If we can start at V (s x ) and reach V (s y ) and start at V (s y ) and reach Proof.

Vertex Traversal
We define REP EN D V (v x ) to be the set of last oriented-vertices of repeats that vertex v x is in (see example in Supplementary Figure 4). The last vertex of a repeat is dependent on vertex orientation: Figure 4: is a set of vertices that mark forks in the graph. The set of all fork vertices in the graph is: where V is the collection of all vertices in the graph. If you start traversal in a repeat it is not possible to leave it. That means if you start traversal at some vertex v x , you cannot traverse unambiguously past any vertex in REP EN D V (v x ).

Proposition 2.3 (Vertex Traversal).
Let v 1 and v m be vertices with orientations and {v 1 , . . . , v n } be a connected path through the graph. We can traverse starting at vertex Proof. By the same logic as Proposition (2.1). The only graph features that will stop traversal between two connected vertices are forks representing the end of repeats that we started within. If we can traverse from vertex v 1 to vertex v n then repeats that started in v 1 do not end before v n : If a repeat that we started in does not end before v n , then the only forks we encounter are from repeats that start and end between v 1 and v n . For these repeats we have an opportunity to pick up annotations that will resolve them, therefore traversal will succeed: We cannot traverse past the last vertex of a repeat that we were already in when we started traversal of the graph. If we hit a vertex that is in REP EN D V (v x ), it marks a fork in the graph that we cannot resolve. We can see that Proposition (2.3) follows from equations (27) and (28).

Lossless property
If we construct an annotated graph from a single sequence that starts and ends with a unique k-mer, following from Proposition (2.3), we are able to recover it in its entirety from the graph.
In addition to assuming error free coverage, we also assume that chromosomes start and end with unique k-mers. An undesired edge effect can appear if sequences end with k-mers that appear elsewhere in the graph, resulting in loops at the start or end of the graph representation of the sequences with in-degree greater than out-degree (see Supplementary Figure 5). You can force this to be true by explicitly adding a unique k-mer to the start/end of sequences. In the case of high coverage genome assembly this edge case is rare. a) b) Figure 5: a) graph of sequence that ends with a repeat (in red) b) adding a unique sequence to the end of the input ensures infinite loops do not exist.
To extract exactly one contig from such a Linked de Bruijn graph we extract contigs and remove contained contigs (contigs that are substring, or reverse complemented substrings of other contigs).

Pipeline commands
Below, we provide the command listings used to produce assemblies on single-end data and pairedend data. Parameters and typical settings are indicated with a ${...[=X]} sigil (e.g. ${threads=8}) and remain fixed throughout the pipeline. Inputs/outputs are indicated with angle brackets with suggested file extensions (e.g. <build.ctx>).

McCortex pipeline
Assembly with McCortex consists of several steps encompassing initial construction of the de Bruijn graph (build), removal of sequencing errors (clean), addition of missed edges for k − 1 overlaps (i.e. from reads that overlap by exactly k −1 bases) (inferedges), link construction from single-and paired-end reads (thread), link-informed contig emission (contigs), and contained contig removal (rmsubstr). The following pipeline listing demonstrates the use of these tools in sequence. Note that for read error correction and contig deduplication, we use existing tools bfc (Li, 2015) and cd-hit-est (Fu et al., 2012).

SGA pipeline
As we compare our workflow to SGA often in this manuscript, we have provided the program listing for our SGA-based pipelines. We used SGA version 0.10.15 for all analyses. Our pipeline is taken from the sga-ecoli-miseq.sh example script provided by the SGA software distribution, with the notable omission of the contig scaffolding step. For all analyses in this manuscript, only minor variations of this pipeline are required (in practice, we only modify the overlap value and the paired-end mode if we are working with single-end data).

SPAdes pipeline
Here we provide the command used for performing assemblies with SPAdes. In all analyses using SPAdes, we used version 3.11.1 of the software.

Velvet pipeline
For all assemblies using Velvet, we used the latest source code available from the Git repository (https://github.com/dzerbino/velvet, short commit hash: 9adf09f). We ran Velvet in two modes: pure de novo assembly (i.e. using paired-end reads only), and reference-guided (using the Columbus module). In the latter case, our pipeline was:   Figure 6: Number of k-mers with links as a function of k-mer size. Assembling 1 Mbp of sequence (human GRCh37 chr22:28,000,000-28,999,999) with three simulated 100X read data sets: (i) error free 100 bp reads, one read starting at every base ("perfect"); (ii) error free stochastic coverage, uniformly distributed read starts ("stochastic"); (iii) an error rate of 0.5% and stochastic uniformly distributed coverage ("error"); (iv) the "error" reads error-corrected with bfc (Li, 2015). Each graph has ∼ 1 million k-mers.

Variant calling in K. pneumoniae
We investigated the utility of links to call large variants (insertions or deletions greater than 100 bp in length). We obtained Illumina data (MiSeq 2 × 151 bp, 93X coverage; HiSeq 2 × 301 bp, 42X) and PacBio RSII data (NCBI reference GCF_001870165.1_ASM187016v1) from a single haploid K. pneumoniae isolate, CAV1016. We constructed a dBG of the canonical reference sequence (NCBI reference GCF_000016305.1_ASM1630v1, unisim5.3M bp) and the Illumina data for the study isolate at k = 31, omitting the PacBio data from graph construction for later use as validation. We then constructed an LdBG by using only the single-end reads from the study isolate for link construction. We implemented a sub program to find bubbles (graph motifs where paths diverge from a k-mer and rejoin at a later k-mer) and applied it first to the dBG, then to the LdBG, allowing events up to 200 kb in length. We removed events less than or equal to 100 bp in length, as well as duplicate events (arising from navigating the graph in both the forward and reverse direction). The reference and alternate alleles were validated by aligning each to the canonical reference and CAV1016 PacBio draft reference sequence, respectively. All alleles matched their respective sequences with 100% identity (0 mismatches, 0 gaps    Chen et al. (2013) In an effort to track plasmid transmission in a K. pneumoniae outbreak, Mathers et al. (2015) sequenced 37 isolates using the Illumina HiSeq 2000 platform, as well as generating draft reference genomes of two index case plasmid transformants with long reads from PacBio RSII instruments. Despite large homology between the two drafts (designated pKPC_UVA01 and pKPC_UVA02), mapping the Illumina reads to these sequences indicated that 21/37 isolates harbored KPC alleles on one of these two plasmid backgrounds. The authors were able to report a point mutation in isolate CAV1360's copy of KPC-2 (the altered allele being known as KPC-3). However, the alignments were insufficient to characterize a large alteration in CAV1077, nor could they detail other mutations upstream or downstream from the KPC gene.
We hypothesized that constructing a panel of links from plasmid sequences could enable reconstruction of the full plasmid sequences for the 21 isolates and permit us to describe the alterations more fully. The sequences included in the panel are listed in Supplementary Table 2.
In addition to the two Mathers et al. sequences, we included two others: a plasmid sequence from E. coli harboring KPC-3, and a plasmid sequence from an unrelated K. pneumoniae isolate harboring KPC-5. The E. coli sequence was chosen to be helpful for allele identification, but of limited utility for plasmid identification due to the divergent nature of its haplotypic sequence to the haplotypes present in the 21 K. pneumoniae isolates. The pBK31567 sequence was chosen as a negative control. As no isolates in our study carry the KPC-5 allele, this entry in the panel should go unutilized.

Performance metrics
To provide details on memory usage and performance for each step of a de novo assembly, and provide a baseline against which to evaluate these metrics, we computed runtime and memory usage of McCortex and SGA submodules on selected publicly available datasets. For our comparison, we chose E. coli (4.6 Mbp genome, ∼ 81x coverage), P. falciparum (23.3 Mbp, ∼ 63x), and C. elegans (100.3 Mbp, ∼ 24x). These results are summarized in Supplementary Table 3.
While both assemblers are designed to make fuller use of the connectivity information within reads, McCortex and SGA have vastly different design philosophies. SGA commands are designed to use very little memory, and across datasets, it is apparent that SGA uses a small fraction of the memory required by McCortex on the same dataset. McCortex attempts to balance memory usage with speed. All graph vertices and edges are loaded into memory upfront, while reads are processed in streaming (and parallelizable) fashion. Link construction also requires the storage of all putative links in memory until the full read dataset has been processed to ensure that support for each junction is correctly calculated. Thus, McCortex commands that store information based on reads supplied as input (build and thread) consistently have the highest memory use across the toolchain. The clean step requires as much memory as the build step in order to store the raw graph (containing genomic data and sequencing errors), but once most errors have been removed, memory usage by subsequent tools is reduced.
McCortex supports the use of multiple threads for processing data, greatly speeding up runtime. By default, the number of threads is 2 (one to read data from disk and one to perform processing steps). Link construction benefits from the use of many more threads, as the alignment of reads to the graph is an embarrassingly parallel process). As the major computational cost is in this thread step, our pipelines typically apply this step with several threads.