Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections

Abstract Motivation The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. Results We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively. Availability and implementation Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
With the increasing affordability and throughput of sequencing, the set of whole genome references available for comparative analyses and indexing has been growing tremendously.Modern sequencing technologies can generate billions of short-read sequences per-sample, and with state-of-the-art de novo and reference-guided assembly algorithms, we now have thousands of mammalian-sized genomes available.Moreover, we now have successfully assembled genomes that are an order of magnitude larger than typical mammalian genomes, with the largest ones among these being the sugar pine ($31 Gbp) (Stevens et al., 2016) and the Mexican walking fish ($32 Gbp) (Nowoshilow et al., 2018).With modern long-read sequencing technologies and assembly techniques, the diversity and the completeness of reference-quality genomes is expected to continue increasing rapidly.Representation of these reference collections in compact forms facilitating common genomic analyses is thus of acute interest and importance, as are efficient algorithms for constructing these representations.
In this work, we tackle the initial step of the pipelines associated with whole genome and pan-genome analysis tasks based on de Bruijn graphs: the time-and memory-efficient construction of (colored) compacted de Bruijn graphs.We present a novel algorithm, implemented in the tool Cuttlefish, for this purpose.It has excellent scaling behavior and low memory requirements, exhibiting better performance than state-of-the-art tools for compacting de Bruijn graphs from reference collections.The algorithm models each distinct k-mer (i.e.vertex of the de Bruijn graph) of the input reference collection as a finite-state automaton, and designs a compact hash table structure to store succinct encodings of the states of the automata.It characterizes the k-mers that flank maximal unitigs through an implicit traversal over the original graph-without building it explicitly-and dynamically transitions the states of the automata with the local information obtained along the way.The maximal unitigs are then extracted through another implicit traversal, using the obtained k-mer characterizations.
We assess the algorithm on both individual genomes, and collections of genomes with diverse structural characteristics: 7 closely related humans; 7 related but different apes; 11 related, different and huge conifer plants; and 100 humans.We compare our tool to three existing state-of-the-art tools: Bifrost, deGSM and TwoPaCo.Cuttlefish is competitive with or faster than these tools under various settings of k-mer length and thread count, and uses less memory (often multiple times less) on all but the smallest datasets.
Note that, based on the input to the construction of a de Bruijn graph being a set of either reference sequences or sequencing reads, we distinguish the graphs as reference de Bruijn graphs and read de Bruijn graphs, respectively (De Bruijn graphs may also be constructed from k-mer sets directly, which themselves are generated from sets of references or reads.).The presented Cuttlefish algorithm works with de Bruijn graphs based on reference sequences, i.e. it constructs compacted reference de Bruijn graphs.

Related work
The amount of short-read sequences produced from samples, like whole-genome DNA, RNA or metagenomic samples, can be in the order of billions.Construction of long-enough contiguous sequences, also referred to as contigs (Staden, 1980), from the sets of reads is known as the fragment assembly problem, and is a central and long-standing problem in computational biology.Many short-read fragment assembly algorithms [BCALM (Chikhi et al., 2014), BCALM2 (Chikhi et al., 2016), Bruno (Pan et al., 2018), and deGSM (Guo et al., 2019)] use the de Bruijn graph to represent the input set of reads, and assemble fragments through graph compaction.More generally, fragment assembly algorithms are typically present as parts of larger and more complex genome assembly algorithms.In this work, we study the closely related problem of constructing and compacting de Bruijn graphs in the whole genome setting.While the computational requirements for the compacted de Bruijn graph construction from genome references are typically substantial compared to the latter phases of sequence analysis (wholeand pan-genome) tasks, only a few tools have focused on the specific problem.
SplitMEM (Marcus et al., 2014) exploits topological relationships between suffix trees and compacted de Bruijn graphs.It introduces a construct called suffix skips, which is a generalization of suffix links and is similar to pointer jumping techniques, to fast navigate suffix trees for identifying MEMs (maximal exact matches), and then transforms those into vertices and edges of the output graph.SplitMEM is improved upon by Baier et al. (2015) using two algorithms: based on a compressed suffix tree, and on the Burrows-Wheeler Transform (BWT) (Burrows and Wheeler, 1994).
TwoPaCo (Minkin et al., 2016) takes the approach of enumerating the edges of the original de Bruijn graph, which aids in identifying the junction positions in the genomes, i.e. positions that correspond to vertices in TwoPaCo's compacted graph format.Initially having the entire set of vertices as junction candidates, it shrinks the candidates set using a Bloom filter (Bloom, 1970), with a pass over the graph.Since the Bloom filter may contain false positive junctions, TwoPaCo makes another pass over the graph, this time keeping a hash table for the reduced set of candidates and thus filtering out the false positives.deGSM(Guo et al., 2019) builds a BWT of the maximal unitigs without explicitly constructing the unitigs themselves.It partitions the (k þ 2)-mers of the input into a collection of buckets, and applies parallel external sorting on the buckets.Then characterizing the k-mers at the ends of the maximal unitigs using the sorted information, it merges the buckets in a way to produce the unitigs-BWT.
Bifrost (Holley and Melsted, 2020) initially builds an approximate de Bruijn graph using blocked Bloom filters.Then for each kmer in the input, it extracts the maximal unitig that contains that kmer, by extending the k-mer forward (and backward), constructing the suffix (and prefix) of the unitig.This produces an approximate compacted de Bruijn graph.Then using a k-mer counting and minimizer-based policy, it refines the extracted unitigs by removing the false positive k-mers present in the compacted graph.

Preliminaries
We consider all strings to be over the alphabet R ¼ fA; C; G; Tg.For some string s, jsj denotes its length.s½i::j denotes the substring of s from the i'th to the j'th character, inclusive (with 1-based indexing).suf ' ðsÞ and pre ' ðsÞ denote the suffix and the prefix of s with length ', respectively.For two strings, x and y such that suf ' ðxÞ ¼ pre ' ðyÞ, the glue operation is defined as x ' y ¼ x Á y½' þ 1::jyj, where ðÁÞ is the append operation.
A k-mer is a string of length k.For some string x, its reverse complement x is the string obtained by reversing the order of the characters in x and complementing each character according to the nucleotide bases' complementarity.The canonical form x of a string x is the string x ¼ minðx; xÞ, according to the lexicographic ordering.
For a set S of strings and an integer k > 0, the corresponding de Bruijn graph is defined as a bidirected graph with: (i) its set of vertices being exactly the set of canonical k-mers from S; and (ii) two vertices u and v being connected with an edge iff there is some ðk þ 1Þmer e in S such that u and v are the canonical forms of the k-mers pre k ðeÞ and suf k ðeÞ, respectively, i.e. pre k ðeÞ ¼ u and ŝuf k ðeÞ ¼ v (This is the bidirected variant of de Bruijn graphs, applicable practically with the treatment of double-stranded DNA.For a discussion on the simpler directed variant, see Chikhi et al. (2021).We adopt an edge-centric definition of the de Bruijn graph where an edge exists iff there is some corresponding ðk þ 1Þ-mer present in S, as opposed to the node-centric definition where the edges are implicit given the vertices, i.e. an edge u !v exists iff suf kÀ1 ðuÞ ¼ pre kÀ1 ðvÞ.).A vertex v is said to have exactly two sides, referred to as the front side s vf and the back side s vb .An edge e between two vertices u and v is incident to exactly one side of u and one side of v.These incidence sides are determined using the following rule: 1.If pre k ðeÞ ¼ u, i.e. pre k ðeÞ is in its canonical form, then e is incident to the back side s ub of u; otherwise it is incident to u's front side s uf .2. If suf k ðeÞ ¼ v, i.e. suf k ðeÞ is in its canonical form, then e is incident to the front side s vf of v; otherwise it is incident to v's back side s vb .
If u ¼ v, then e is said to be a loop.So, an edge e can be defined as a 4-tuple ðu; s u ; v; s v Þ, with s u and s v being the sides of the vertices u and v, respectively to which e is incident to.The canonical k-mer x corresponding to a vertex v is referred to as its label, i.e. labelðvÞ ¼ x.An example illustration of a de Bruijn graph is given in Figure 1a.
A walk is defined as an alternating sequence of vertices and edges, w ¼ ðv 0 ; e 1 ; v 1 ; . . .; v mÀ1 ; e m ; v m Þ, such that any two successive vertices v iÀ1 and v i in w are connected with the edge e i (through any side).v 0 and v m are called the endpoints, and the v i for 0 < i < m are called the internal vertices of the walk.w is said to enter v i using e i , and exit v i using e iþ1 .For the endpoints v 0 and v m , w does not have an entering and an exiting edge, respectively.For 0 < i m, w is said to enter v i through its front side s vi f if e i is incident to s vi f .Otherwise, it is said to enter v i through its back side s vi b .For 0 i < m, the terminology for the exiting side is similarly defined using the incidence side of e iþ1 to v i .w is said to be inputconsistent if, for each of its internal vertices v i , the sides of v i to which e i and e iþ1 are incident are different.Intuitively, an input-consistent walk enters and exits a vertex through different sides.We prove in lemma 1 (see Supplementary Section S2) that the reconstruction (spelling) of the input strings s 2 S are possible only from input-consistent walks over G(S, k).Therefore, we are only interested in such walks, and refer to input-consistent walks whenever using the term walk onward.
For a vertex v i in w, let its label be l i , i.e. l i ¼ labelðv i Þ.We say that w sees the spelling s i for the vertex v i (for 0 < i m), such that s i is l i if w enters v i from the front, and is l i otherwise.s 0 is defined analogously but using the exiting edge e 1 for v 0 .The spelling of w is defined as s 0 kÀ1 s 1 kÀ1 . . .kÀ1 s m .
A path p ¼ ðv 0 ; e 1 ; v 1 ; . . .; e m ; v m Þ is said to be a unitig if jpj ¼ 1, or in G(S, k): 1. each internal vertex v i has exactly two incident edges, e i and e iþ1 ; 2. and v 0 has exactly one edge e 1 and v m has exactly one edge e m incident to the sides respectively through which p exits v 0 and enters v m .
A unitig is said to be maximal if it cannot be extended by a vertex on either side.Figure 1 illustrates examples of walks, paths, their spellings and maximal unitigs in de Bruijn graphs.
The compacted de Bruijn graph G c ðS; kÞ of the graph G(S, k) is obtained by collapsing each of its maximal unitigs into a single vertex.Figure 1b shows the compacted form G c ðS; kÞ of the graph G(S, k) from Figure 1a.The problem of compacting a de Bruijn graph is to compute the set of its maximal unitigs.
To ensure that each s 2 S can be expressed as a sequence tiling of complete maximal unitigs, a unitig needs to be prevented from spanning multiple input strings.We effectuate this through using a slightly altered topology of the de Bruijn graph G(S, k).During construction of G c ðS; kÞ, we treat each s 2 S as ð Á s Á Þ, where denotes the empty symbol.We refer a vertex v as a sentinel if the first or the last k-mer x of some input string s 2 S corresponds to v.In this occurrence of x, at least one side s v of v does not have any incident edge-we refer to these sides as sentinel-sides.Thus, each sentinel-side s v has an incident edge that can be encoded with .Theencoded edge connects s v to a special vertex, say !.All such edges connect to an arbitrary but fixed side of !. Thus, ! has one side with 0 incident edges, and another with 2 Â jSj.Therefore, ! will be a maximal unitig by itself.Furthermore, for a sentinel v connecting to some vertex u through its sentinel-side s v , a unitig containing v cannot extend to u through s v , as s v has multiple incident edgesrestraining unitigs from spanning multiple input strings.

Motivation
Given a set S of strings and an odd integer k > 0, a simple naive algorithm to construct the corresponding compacted de Bruijn graph G c ðS; kÞ is to first construct the ordinary de Bruijn graph G(S, k), and then enumerate all the maximal non-branching paths from G(S, k).The graph construction can be performed by a linear scan over each s 2 S storing information along the way in some graph representation format, like an adjacency list.Enumeration of the nonbranching paths can be obtained using a linear-time graph traversal algorithm (Cormen et al., 2009).However, storing the ordinary graph G(S, k) usually takes an enormous space for large genomes, and there is no simple way to traverse it without having the entire graph present in memory.This motivates the design of methods able to build G c ðS; kÞ directly from S without having to construct G(S, k).
For some s 2 S, it is important to note that by definition of edgecentric de Bruijn graphs, each ðk þ 1Þ-mer in s is an edge of G(s, k).Therefore, we can obtain a complete walk traversal w(s) over G(s, k) through a linear scan over s, without having to build G(s, k) explicitly.Also, each maximal unitig of the graph is contained as a subpath in this walk, as proven in lemma 2 (see Supplementary  1a), w1 ¼ ðGAC; ACA; ATG; ATGÞ (edges not listed) is a walk.w 1 spells the string GACATG.It is not a path as the vertex ATG is repeated here.Whereas the walk w2 ¼ ðGAC; ACA; ATGÞ is a path, spelling GACAT.Besides, it is a unitig, and also a maximal one as it cannot be extended on either side while retaining itself a unitig.There are four maximal unitigs in the graph (the paths referred with the arrows), with the canonical spellings: CGA, ATGTC, CTAAGA and GAGC.(a) A (bidirected) de Bruijn graph for S ¼ fCGACATGTCTTAG; GCTCTTAGg, with k ¼ 3. The vertices are the canonical k-mers from S, and each edge corresponds to some 4-mer(s) in S. Each pentagon is a vertex, with the flat and the pointy sides (vertically) denoting its front and back, respectively.For each vertex v, the string inside it is label(v), to be read in the visual direction from the front to the back.The string below it is labelðvÞ, to be read in the opposite direction.For example, the 4-mers CGAC and GCTC correspond to the edges fCGA, GACg and fAGC, CTCg, respectively.The edge corresponding to the 4-mer CATG is a loop fATG, ATGg.(b) The corresponding compacted de Bruijn graph, with each maximal unitig in its canonical form Section S2).For the set of strings S, the maximal unitigs are similarly contained in a collection of walks W(S).
Thus, to construct G c ðS; kÞ efficiently, one approach is to extract off the maximal unitigs from these walks W(S), without building G(S, k).We describe below an asymptotically and practically efficient algorithm that performs this task.

Flanking vertices of the maximal unitigs
Similar to the TwoPaCo (Minkin et al., 2016) algorithm, our algorithm is based on the observation that there exists a bijection between the maximal unitigs of G(S, k) and the substrings of the input strings in S whose terminal k-mers correspond to the endpoints of the maximal unitigs.We refer to these endpoint vertices as flanking vertices.This observation reduces the graph compaction problem to the problem of determining the set of the flanking vertices.Once each vertex in G(S, k) can be characterized as either flanking or internal with respect to the maximal unitig containing it, the unitigs themselves can then be enumerated using a walk over G(S, k), by identifying subpaths having flanking k-mers at both ends.
Consider a maximal unitig p and one of its endpoints v. Say that v is connected to p through its side s vin , and its other side is s vout .v is a flanking vertex as it is not possible to extend p through s vout while retaining itself a unitig, due to one of the following cases: i. there are either multiple edges incident to s vout ; or ii.there is exactly one edge ðv; s vout ; u; s u Þ incident to s vout , but s u has multiple incident edges (Due to the -encoded edges for sentinel-sides, it is not possible for any side to have zero incident edges-except for the special vertex !(see Section 3).).
For trivial unitigs (i.e.unitigs with exactly one k-mer), extending the unitig through s vout in both the cases violates the second property of (non-trivial) unitigs; while for non-trivial unitigs, the first property is violated in case (i) and the other one is violated in case (ii).
From this, we can observe that the adjacency information of the sides of the vertices can be used to determine the flanking vertices.As per lemma 4 (see Supplementary Section S2), a side of a vertex can have at most four distinct edges.This being finite, the adjacency information (including the presence of an -encoded edge) can be tracked using some data structure.Through a set of walks over G(S, k), each encountered edge ðu; s u ; v; s v Þ can be recorded for the sides s u and s v .With another set of walks, the maximal unitigs can then be extracted as per the flanking vertices, determined using the obtained adjacency information.
Another approach-adopted in TwoPaCo (Minkin et al., 2016)-is to have an edge list data structure supporting queries, and filling it up with the graph edges (i.e.ðk þ 1Þ-mers) with a set of walks.In another set of walks, the flanking vertices can be characterized using incidence queries (four per side), and the maximal unitigs can be extracted on the fly.

A deterministic finite-state automaton model for vertices
In these characterization approaches of the flanking vertices, a source of redundancy in the information stored is the vertex sides that have multiple incident edges, or are sentinel-sides.By definition, these sides belong to the flanking vertices of the maximal unitigs.So, in a first set of walks over the graph, once a side of a vertex has been determined to have more than one distinct edge, or to be a sentinel-side, the adjacency information for the side becomes redundant: for our purposes, we do not need to be able to enumerate the incident edges for this side, or to differentiate between them.Only the information that this side makes the vertex flanking is sufficient.Complementarily, another source of redundant information is the vertex sides observed to have exactly one incident edge (excluding an -encoded edge) up-to some point in the walks.For such a side, keeping track of only that single edge is sufficient, instead of having options to track more distinct edges: when a different edge is encountered, or the side is observed to be a sentinel-side, distinguishing between the edges becomes redundant.
Thus, we observe that the only required information to keep per side of a vertex is: i. whether it has exactly one distinct edge (excluding the -encoded edge), and if so, which edge it is among the four possibilities (as per lemma 4, see Supplementary Section S2); or ii.if it has either multiple distinct incident edges, or an -encoded edge.
Hence, a side can have one of five different configurations: one for each possible singleton edge (excluding the one), and one for when it has either multiple edges or an -encoded edge.This implies that each vertex can be in one of ð5 Â 5Þ ¼ 25 different configurations.We refer to such a configuration as a state for a vertex.Figuring out the states of the vertices provides us with enough information to extract the maximal unitigs.
To compute the actual state of each vertex v in G(S, k), we treat v as a deterministic finite-state automaton (DFA).Prior to traversing G(S, k), no information is available for v-we designate a special state for such, the unvisited state.Then in response to each incident edge or sentinel k-mer visited for v, an appropriate state-transition is made.
Formally, a vertex v in G(S, k) is treated as a DFA ðQ; R 0 ; d; q 0 ; Q 0 Þ, where- fq 0 g is the set of all possible 26 states.The 25 actual states (configurations) of the vertices in G(S, k) can be partitioned into four disjoint classes based on their properties, as in Figure 2a.
Input symbols: R 0 ¼ f ðc 1 ; c 2 Þ j c 1 ; c 2 2 R [ fg g (where denotes the empty symbol, used for sentinel-sides) is the set of input symbols for the automaton.In the algorithm, we actually make state-transitions not per edge, but per vertex.That is, for some walk w ¼ ðv 0 ; . . .; e i ; v i ; e iþ1 ; . . .; v m Þ, we process the vertices v i .Processing v i means checking the edges e i and e iþ1 simultaneously and only for v i (excluding the edges' other endpoints); making state transitions for v i' s automaton as required (This shrinks the statespace for the automata.For the subwalk ðe i ; v i ; e iþ1 Þ, if the edges e i and e iþ1 are to be processed independent of each other, then during each processing, only one side of v i can be seen.This requires each side to have an unvisited state independently, making the state-space size ð6 Â 6Þ ¼ 36.From lemma 1 (see Supplementary Section S2), the edges e i and e iþ1 are incident to different sides of v i .Processing them simultaneously for v i thus ensures that both the sides are seen together, making one state sufficient to denote the unvisited status' of both the sides together.This reduces the state-space size to ð5 Â 5 þ 1Þ ¼ 26.).Thus, the two incident edges fe i ; e iþ1 g are being used as input for the automaton of v i .The edges can be represented succinctly with a pair of characters (c 1 , c 2 ) for v i .c 1 and c 2 encode the edges incident to v i' s front and back respectively, in the subwalk ðe i ; v i ; e iþ1 Þ (c 1 and c 2 do not necessarily correspond to e i and e iþ1 , in this order; the order can also be the opposite based on the side of entrance of w to v i .).
Transition function: d : Q Â R 0 !Q is the function governing the state-transitions.Figure 2b presents a high-level view of the possible types of transitions between states of the four classes.Supplementary Figure S1 illustrates a detailed overview of the transitions defined for the states as per various input symbols.
Initial state:q 0 is the initial state for the automaton, which is the unvisited state.

The Cuttlefish algorithm
For a set S of input strings, the proposed algorithm, CuttlefishðSÞ, works briefly as follows.Cuttlefish treats each vertex of the de Bruijn graph G(S, k) as a DFA-based on the novel modeling scheme introduced in Section 4.3.First, it enumerates the set K of vertices of G(S, k), applying a k-mer counting algorithm on S. Then it builds a compact hash table structure for K-employing a minimal perfect hash function-to associate the automata to their states (yet to be determined).Having instantiated a DFA M v for each vertex v 2 K, with M v being in the initial state q 0 (i.e. the unvisited state), it makes an implicit traversal W(S) over G(S, k).For each instance (i.e.k-mer) x of each vertex v visited along W(S), it makes an appropriate state transition for v's automaton M v -using the local information obtained for x, i.e. the two incident edges of x in W(S).Having finished the traversal, the computed final states of the automata correspond to their actual states in the underlying graph G(S, k)which had not been built explicitly.Cuttlefish then characterizes the flanking vertices of the maximal unitigs in another implicit traversal over G(S, k), using the states information of the automata computed in the preceding step, and extracts the maximal unitigs on the fly.Cuttlefish(S) The major components of the algorithm-an efficient associative data structure for the automata and their states, computing the actual states of the automata and extraction of the maximal unitigs from the input strings are discussed in the next subsections.Finally, the correctness of the algorithm is proven in theorem 1 in Supplementary Section S2.

Hash table structure for the automata
To maintain the (transitioning) states of the automata for the vertices in G(S, k) throughout the algorithm, some associative data structure for the vertices (i.e.canonical k-mers) and their states is required.We design a hash table for this purpose, with a (k-mer, state) key-value pairing.An efficient representation of hash tables for k-mers is a significant data structure problem in its own right, and some attempts even relate back to the compacted de Bruijn graph (Marchet et al., 2019).In solving the subproblem efficiently for our case, we exploit the fact that the set K of the keys, i.e. canonical k-mers, are static here, and it can be built prior to computing the states.We build the set K efficiently from the provided set S of input strings using the KMC3 algorithm (Kokot et al., 2017).Afterwards, we construct a minimal perfect hash function (MPHF) h over the set K, employing the BBHash algorithm (Limasset et al., 2017).
A perfect hash function h p for a set K of keys is an injective function from K to the set of integers, i.e. for x 1 ; x 2 2 K, if x 1 6 ¼ x 2 , then h p ðx 1 Þ 6 ¼ h p ðx 2 Þ.Perfect hash functions guarantee that no hashing collisions are made for the keys in K.A perfect hash function is minimal if it maps K to the set ½0; jKjÞ.Since we do not need to support lookup of k-mers nonexistent in the input strings, i.e. no alien keys will be queried, an MPHF works correctly in this case.Also, as an MPHF produces no collisions, we do not need to store the keys with the hash table structure for collision resolution-reducing the space requirement for the structure.Although instead of the keys, we need to store the function itself with the structure, it takes much less space than the set of keys.In our setting, the function constructed using BBHash takes %3:7 bits/k-mer, irrespective of the k-mer size k.Whereas storing the keys would take 2k bits/k-mer.
As the hash value h(x) for some key x 2 K is at most ðjKj À 1Þ, we use a linear array of size jKj, indexed by h(x), to store the state of the vertices (canonical k-mers) x as the hash table values.We call this array the buckets table B. So, the hash table structure consists of two components: an MPHF h, and an array B. For a canonical kmer x, its associated bucket is in the hðxÞ'th index of B-B hðxÞ stores the state of x throughout the algorithm.

Computing the states and extracting the maximal unitigs
For a set S of input strings with n distinct canonical k-mers and an MPHF h over those k-mers, the algorithm Compute-States(S, h, n) computes the state of each vertex in G(S, k).Initially, it marks all the n vertices (i.e.their automata) with the unvisited state, in a buckets table B. Then it makes a collection of walks over G(S, k)-a walk w ¼ ðv 0 ; . . .; e i ; v i ; e iþ1 ; . . .; v m Þ for each s 2 S. For each vertex v i in w(s), i.e. its associated k-mer instance x in s, it examines the two incident edges of v i in w(s), i.e. e i and e iþ1 , making appropriate state transition of the automaton of v i accordingly.Supplementary Figure S1 provides a detailed overview of the DFA transition function d.After the implicit complete traversal over G(S, k), the actual transition to a state of the class multi-in single-out that has X 2 at the back; 3.
transition to a state of the class single-in multi-out that has X 1 at the front; 4. ðY 1 6 ¼ X 1 Þ ^ðY 2 6 ¼ X 2 Þ: transition to the only state of the class multi-in multi-out orangutan and the chimp genera, a western gorilla, a human and a bonobo.The conifer dataset consists of nine references belonging to the pine (Pinaceae) family: from the pine, the spruce and the larch genera, and a douglas fir; and two references from the redwoods (Sequoioideae) subfamily.We assembled the ape and the conifer datasets from the GenBank database (Sayers et al., 2018) of NCBI.We also assessed Cuttlefish's performance on compacting a huge number of closely related genomes.For such, we used the 93 human references generated using the FIGG genome simulator (Killcoyne and Sol, 2014), that was used to benchmark TwoPaCo (Minkin et al., 2016).Coupled with the previous 7 human genomes, this gives us a dataset of 100 human references ($322 Gbp).

Benchmarking comparisons
We benchmarked Cuttlefish against three other implementations of whole genome reference de Bruijn graph compaction algorithms: Bifrost (Holley and Melsted, 2020), deGSM (Guo et al., 2019) and TwoPaCo (Minkin et al., 2016).While TwoPaCo, like Cuttlefish, is specialized to work with reference sequences, Bifrost and deGSM are also capable of constructing the compacted graph from shortread sequences as well.We compare against Bifrost and deGSM using their appropriate settings for construction from references.We also note that among the tools, only Bifrost constructs the compacted de Bruijn graph without using any intermediate disk space.See Supplementary Section S4.2 for a discussion on the disk space usage of the benchmarked tools.
All these tools have multi-threading capability.deGSM has a max-memory option, and it is set to the best memory-usage obtained from the other tools.The rest of the tools are run without any memory restrictions.All the results reported for each tool are produced with a warm cache.Tables 1 and 2 contain a summary of the comparison results.Note that, the benchmark performances include all the steps of the pipelines to construct the compacted reference de Bruijn graphs, which for deGSM and Cuttlefish include the k-mer counting.
We do not compare against compaction algorithms designed for constructing the compacted de Bruijn graph only for sequencing data.These tools usually filter k-mers and alter the topology of the resulting de Bruijn graph based on certain conditions (branch length, path coverage, etc.), and these filters cannot always be finely tuned.Thus, even if ideally configured, these approaches will produce (potentially non-trivially) different compacted graphs.More importantly, however, methods that explicitly build the compacted de Bruijn graph from reference sequences can adopt or even center their algorithms around certain optimizations that are not available to methods building the compacted de Bruijn graph from reads-it therefore seems most fair not to compare methods that do not explicitly support the compacted de Bruijn graph construction from references against methods that explicitly support or are explicitly designed for this purpose.This is evident when evaluating methods that construct compacted de Bruijn graphs from reference sequences for other purposes (Minkin and Medvedev, 2020).In such cases, even authors who have worked on both state-of-the-art construction methods for compacted reference- (Minkin et al., 2016) and read- (Chikhi et al., 2016) de Bruijn graphs select the former over the latter.
We verified the correctness of the produced unitigs by designing a validator, that confirms: (i) the unique existence of each k-mer in the output unitigs collection [the set of maximal unitigs is unique and forms a node decomposition of the graph (Chikhi et al., 2016)]; and (ii) the complete spelling-ability of the set of input references by the unitigs (i.e. each input reference is correctly constructible from the compacted graph).
A direct correspondence between the outputs of the different tools is not straightforward.See Supplementary Section S4.3 for a detailed discussion.The command line configurations for executing the tools, and the dataset sources are present in Supplementary Section S5.

Parallel scalability
To assess the scalability of Cuttlefish across varying number of processor-threads, we used the human genome and set k ¼ 31, and executed Cuttlefish using 1-32 processor threads.Figures 3a and b show the scaling performance charts.
Generation of the set of keys (i.e.k-mers) using KMC3 (Kokot et al., 2017) is very fast for individual references, and the timing is quite low to begin with-the speedup is almost linear up-to around eight threads, and then saturates afterwards.The MPHF (minimal perfect hash function) construction does not scale past 24 threads; which we perceive as due to limitations involving bottlenecks in disk-access throughput, and memory access bottlenecks associated to the increasing cache misses in BBHash (Limasset et al., 2017).The next two steps: computing the states of the vertices and extracting the maximal unitigs, both scale roughly linearly all the way upto 32 threads.For the extraction step, we chose to report each maximal unitig, and skipped outputting the edges for the compacted graph in some GFA format.

Scaling with k
Next, we assess Cuttlefish's performance with a range of different kvalues.We use the 7 human genomes dataset for this experiment, and use 16 threads during construction.Table 3 shows the performance of each step with varying k-values.
We observe that the k-mer set construction step slows down with the increasing k, which may be attributable to overhead associated with the KMC3 algorithm (Kokot et al., 2017) in processing larger k-mer sizes.Although the count of distinct k-mers grows slowly with increasing k, the MPHF construction slows down.Since the MPHF is constructed by iterating over the k-mer set on disk, the decreased speed in this step is mostly related to disk-access throughput.The steps involving graph traversals, computing the vertices' states and extracting the maximal unitigs, are affected little in their running time by altering the value of k.Outputting the compacted graph in the GFA2 (and in GFA1) format takes much longer than outputting only the unitigs for lower values of k, due to structural properties of the compacted graph (see Supplementary Section S4.4 for a discussion).
The memory usage by Cuttlefish is directly proportional to the distinct k-mers-count n, which typically increases with k.Thus, the maximum memory usage also grows with k.
Referring back to Section 4.8, the running time of Cuttlefish is Oððm þ nÞðd k 32 eþhÞÞ = , and its memory usage is Oð8:7 Â nÞ.Table 3 demonstrates this asymptotic increase in running time with k (also with effects from n), and the increase in memory usage with n (which typically grows with k).

Effects of the input structure
Next, we evaluate the effects of some of the structural characteristics of the input genomes on the performance of Cuttlefish.Specifically, we assess the impact of the genome sizes (total reference length, m) and their structural variations (through distinct k-mers count, n) on the time and memory consumptions of Cuttlefish.The running time is Oððm þ nÞHðkÞÞ and the memory usage is Oð8:7 Â nÞ (see Section 4.8).The input size determines the total number of k-mers to be processed at the k-mer set construction, vertices' states computation and the maximal unitigs extraction steps.The variation in the input determines the performance of the MPHF construction, and indirectly affects the states computation and the unitigs extraction steps through their use of the hash table structure.We use the 7 humans and the 7 apes datasets with k ¼ 31 for such, using 16 processor threads.References are incrementally added to the input set, and the performances are measured for each input instance.The benchmarks include both the steps of building the compacted graph and extracting the maximal unitigs, and are illustrated in Supplementary Figure S2.
We observe that the running time varies both with the input size and with the structural variation.For the humans dataset, the distinct k-mers count does not increase much from one reference to seven: the increase is just $ 5%.This is because the dataset contains references from the same species, hence the genomes are very closely related.Thus, the effect of the distinct k-mers count remains roughly similar on all the instances of the dataset.The increase in running time is almost completely dominated by the total workload, i.e. the total size of the genomes.For the apes dataset, however, the genomes are not from the same species, and the variations in the genomes contribute to increasing the distinct k-mers count to $294% from one reference to seven.Thus, the running time on this dataset increases with the total genome size, with additional nontrivial effects from the varying k-mers count.
On the other hand, the memory usage by the algorithm is completely determined by the distinct k-mers count, with no effect from the input size.As described in Section 4.8, the memory usage of the algorithm is constant per distinct k-mer, taking $8.7 bits/k-mer.Supplementary Figures 2b and d conform to the theory-the shapes of the memory consumption and the distinct k-mers count plots are identical.

Conclusion
We present a highly scalable and very low-memory algorithm, Cuttlefish, to construct the (colored) compacted de Bruijn graph for collections of whole genome references.It models each vertex of the original graph as a deterministic finite-state automaton; and without building the original graph, it correctly determines the state of each automaton.These states characterize the vertices of the graph that flank the maximal unitigs (non-branching paths), allowing efficient extraction of those unitigs.The algorithm is very scalable in terms of time, and it uses a small and constant number of bits per distinct k-mer in the dataset.
Besides being efficient for medium and large-scale genomes, e.g. common mammalian genomes, the algorithm is highly applicable for huge collections of (very) large genomes.For example, Cuttlefish constructed the compacted de Bruijn graph for 100 closely related human genomes of total length $322 Gbp in <9 h, taking just $29 GB of memory.For 11 conifer plants that are from the same taxonomic order, each with very large individual genomes and having a total genome length of $204 Gbp, Cuttlefish constructed the compacted graph in <9 h, using $84 GB of memory.For these datasets, the next best method required more than 23 h using $126 GB of memory, and more than 16 h using $289 GB of memory, respectively.
The improvement in performance over the state-of-the-art tools stems from the novel modeling of the graph vertices as deterministic finite-state automata.The core data structure is a fast hash table, designed using a minimal perfect hash function to assign k-mers to indices, and a bit-packed buckets table storing succinct encodings of the states of the automata.This compact data structure obtains a memory usage of 8.7 bits/k-mer, leading the algorithm to greatly outperform existing tools at scale in terms of memory consumption, while being equally fast if not faster.For scenarios with further memory savings requirements, the memory usage can be reduced to 8 bits/k-mer, trading off the speed of the hash function.
The algorithm is currently only applicable for whole genome references.The assumption of the absence of sequencing errors in these references makes complete walk traversals over the original de Bruijn graph possible without having the graph at hand.This is not the case for short-read sequences, and the sequencing errors make such implicit traversals difficult.A significant line of future research is to extend Cuttlefish to be applicable on raw sequencing data.
On repositories with large databases containing many genome references, Cuttlefish can be applied very fast in an on-demand manner, as follows.First, one can build and store a set of hash functions, each one over some class of related genome references.This consists of the first two steps of the algorithm and is a one-time procedure, containing the bottleneck part of the algorithm.Then, whenever some set of references is to be compacted, the hash function of the appropriate super-class can be loaded into memory, and the algorithm then executes only the last two steps, which are quite fast and scalable.This works correctly because the sets of keys that these hash functions are built upon are supersets of the sets of keys being used later for the construction.Cuttlefish provides the option to save and load hash functions, making the scheme feasible.
As the number of sequenced and assembled reference genomes increases, the (colored) compacted de Bruijn graph will likely continue to be an important foundational representation for comparative analysis and indexing.To this end, Cuttlefish makes a notable advancement in the number and scale of the genomes to build compacted de Bruijn graphs upon.The algorithm is fast, highly parallelizable and very memory-frugal; and we provide a comprehensive analysis, both theoretical and experimental, of its performance.We believe that the progression will further improve the role of the de Bruijn graph in comparative genomics, computational pan-genomics and sequence analysis pipelines; also facilitating novel biological studies-especially for large-scale genome collections that may not have been possible earlier.Cuttlefish i185

Fig. 1 .
Fig. 1.For G(S, k)  in Figure (1a), w1 ¼ ðGAC; ACA; ATG; ATGÞ (edges not listed) is a walk.w 1 spells the string GACATG.It is not a path as the vertex ATG is repeated here.Whereas the walk w2 ¼ ðGAC; ACA; ATGÞ is a path, spelling GACAT.Besides, it is a unitig, and also a maximal one as it cannot be extended on either side while retaining itself a unitig.There are four maximal unitigs in the graph (the paths referred with the arrows), with the canonical spellings: CGA, ATGTC, CTAAGA and GAGC.(a) A (bidirected) de Bruijn graph for S ¼ fCGACATGTCTTAG; GCTCTTAGg, with k ¼ 3. The vertices are the canonical k-mers from S, and each edge corresponds to some 4-mer(s) in S. Each pentagon is a vertex, with the flat and the pointy sides (vertically) denoting its front and back, respectively.For each vertex v, the string inside it is label(v), to be read in the visual direction from the front to the back.The string below it is labelðvÞ, to be read in the opposite direction.For example, the 4-mers CGAC and GCTC correspond to the edges fCGA, GACg and fAGC, CTCg, respectively.The edge corresponding to the 4-mer CATG is a loop fATG, ATGg.(b) The corresponding compacted de Bruijn graph, with each maximal unitig in its canonical form

Fig. 2 .
Fig. 2. Classes of states of the vertices, and the transition relationships among those.(a) Time taken by each step.(b) Speedup for each step.(a) Four disjoint classes of states of the vertices, based on the properties of their sides.The pictorial shape of the classes correspond to the actual incidence properties of the vertices.For example, the first class of states is for vertices that have exactly one edge incident per side: the edges incident to the front and to the back being encoded with the characters X 1 and X 2 , respectively.There can be ð4 Â 4Þ ¼ 16 different configurations of this shape, and this class contains those 16 states.Whereas the second class is for vertices that have either an -edge or > 1 distinct edges incident to the front, and one unique edge incident to the back.Due to four possible configurations with this property, this class contains four states.Note that, pictorially, a singular incident edge denotes a unique edge, whereas multiple incident edges mean either >1 edge or an -edge being incident.(b) Possible transition types between the various classes of the states.For example, consider a state of the class single-in single-out, with the unique edges incident to its front and back being encoded with the characters X 1 and X 2 , respectively.Now, if the state is provided with the input (Y 1 , Y 2 ), then based on the four different joint outcomes of the conditionals ðX1 ¼ Y1Þ and ðX 2 ¼ Y 2 Þ, the following transitions can happen: 1.ðY 1 ¼ X 1 Þ ^ðY 2 ¼ X 2 Þ: self-transition; 2. ðY 1 6 ¼ X 1 Þ ^ðY 2 ¼ X 2 Þ:transition to a state of the class multi-in single-out that has X 2 at the back; 3. ðY 1 ¼ X 1 Þ ^ðY 2 6 ¼ X 2 Þ: transition to a state of the class single-in multi-out that has X 1 at the front; 4. ðY 1 6 ¼ X 1 Þ ^ðY 2 6 ¼ X 2 Þ: transition to the only state of the class multi-in multi-out

Table 2
Time-and memory-performance benchmarking for compacting colored de Bruijn graphs (i.e.multiple input references) for k ¼ 31, using 16 threads Each cell contains the running time in minutes, and the maximum memory usage in gigabytes, in parentheses.The output step is excluded from executions.The best value with respect to each metric in each row is highlighted.The filter-sizes for the TwoPaCo executions are set as described in Table1.Dashed cells in the Bifrost and the deGSM columns indicate that the experiments were not performed, as it is anticipated that insufficient memory would be available given their memory usages for smaller datasets (w.r.t.k-mer count).

Table 1 .
Time-and memory-performance benchmarking for compacting single input reference de Bruijn graphs

Table 3
Time-and memory-performance benchmarking for the steps of Cuttlefish on the 7 human genomes dataset, across different k The running times are in seconds, and the maximum memory usages are in gigabytes.