MBG: Minimizer-based sparse de Bruijn Graph construction

Abstract
Motivation. De Bruijn graphs can be constructed from short reads efficiently and have been used for many purposes. Traditionally, long-read sequencing technologies have had too high error rates for de Bruijn graph-based methods. Recently, HiFi reads have provided a combination of long-read length and low error rate, which enables de Bruijn graphs to be used with HiFi reads.
Results. We have implemented MBG, a tool for building sparse de Bruijn graphs from HiFi reads. MBG outperforms existing tools for building dense de Bruijn graphs and can build a graph of 50× coverage whole human genome HiFi reads in four hours on a single core. MBG also assembles the bacterial E. coli genome into a single contig in 8 s.
Availability and implementation. Package manager: https://anaconda.org/bioconda/mbg and source code: https://github.com/maickrau/MBG.
Supplementary information. Supplementary data are available at Bioinformatics online.


Introduction
De Bruijn graphs (DBGs) have been used in sequence analysis for purposes such as genome assembly (Bankevich et al., 2012; Garg et al., 2018; Pevzner et al., 2001; Wick et al., 2017) and error correction (Miclotte et al., 2016; Rautiainen and Marschall, 2019; Salmela and Rivals, 2014). Sparse de Bruijn graphs (Ye et al., 2011) are a form of DBG that uses only a subset of k-mers and so reduces runtime and memory use. Minimizer winnowing (Roberts et al., 2004; Schleimer et al., 2003) is a method of selecting a subset of k-mers from a sequence, which has been applied to building sparse DBGs (Coombe et al., 2020). Recently, HiFi reads (Wenger et al., 2019) have reached read lengths of thousands of base pairs with error rates comparable or superior to those of short reads. The combination of long read lengths and low error rates makes DBGs an attractive idea for HiFi reads and might enable hybrid methods, which typically use a combination of Illumina and long reads, to use a combination of HiFi and even longer ONT reads (Logsdon et al., 2020). Increasing the k-mer size leads to better repeat resolution and therefore better assembly. However, current tools do not scale to k-mer sizes in the thousands.
Contributions. We have implemented the tool MBG (Minimizer-based sparse de Bruijn Graph) for constructing sparse de Bruijn graphs. MBG selects k-mers by minimizer winnowing (Schleimer et al., 2003) and builds the graph from those k-mers. This approach has previously been used in the ntJoin scaffolder (Coombe et al., 2020) for building graphs from assembled contigs to scaffold assemblies.
MBG can construct graphs with arbitrarily high k-mer sizes, and we show that k-mer sizes of thousands of base pairs are practical with real HiFi read data. MBG outperforms existing de Bruijn graph construction tools in runtime, with a runtime of only a few hours on a single core for constructing a graph of 50× coverage whole human genome HiFi reads.

Materials and methods
We give a brief overview of the implementation here with detailed explanations of the individual steps in Supplementary Note SA. Since most errors in HiFi reads are homopolymer run length errors (Wenger et al., 2019), the input reads are homopolymer compressed by collapsing homopolymer runs into one character. Homopolymer compression removes most errors but it might also lead to repeat collapses if there are long repeats which only differ in homopolymer run lengths. A rolling hash function (Mohamadi et al., 2016) is used to assign a hash value to each k-mer. Minimizer winnowing (Schleimer et al., 2003) is then used to select a subset of k-mers. The selected k-mers are compressed by hashing them into 128-bit integers, which form the nodes of the minimizer graph. Edges are added whenever two minimizers are adjacent to each other in the reads. Transitive edges caused by sequencing errors are cleaned. Nonbranching paths of the graph are then condensed into unitigs. Finally, the 128-bit hashes are replaced with their base pair sequences, and homopolymer runs are expanded. The graph is then written in the GFA format (Li, 2016).
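The first steps of this pipeline can be illustrated with a minimal Python sketch. It is not MBG's implementation: all function names are hypothetical, a generic hash from the standard library stands in for the rolling hash, k-mers are kept as plain strings instead of 128-bit hashes, and reverse complements, transitive edge cleaning and unitig condensation are omitted. The sketch also assumes that the window size w counts consecutive k-mers, so that w = 1 selects every k-mer.

```python
import hashlib
from collections import Counter
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse every run of identical bases into a single character."""
    return "".join(base for base, _ in groupby(seq))


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (stand-in for a rolling hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_positions(seq, k, w):
    """Start positions of the k-mers selected by winnowing.

    Each window of w consecutive k-mers contributes its smallest hash
    (leftmost on ties). Sequences with fewer than w k-mers select nothing.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        pos = start + min(range(w), key=window.__getitem__)
        # The leftmost-minimum position never moves backwards, so comparing
        # with the last selected position is enough to avoid duplicates.
        if not selected or selected[-1] != pos:
            selected.append(pos)
    return selected


def build_minimizer_graph(reads, k, w):
    """Count selected k-mers (nodes) and adjacent pairs of them (edges)."""
    nodes = Counter()
    edges = Counter()
    for read in reads:
        seq = homopolymer_compress(read)
        kmers = [seq[p:p + k] for p in minimizer_positions(seq, k, w)]
        nodes.update(kmers)
        # An edge joins two minimizers that follow each other in a read.
        edges.update(zip(kmers, kmers[1:]))
    return nodes, edges
```

For example, homopolymer_compress("AAACCGTTTT") returns "ACGT", and calling build_minimizer_graph on a list of read strings with a larger w yields fewer nodes and edges for the same reads.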

Results
We built sparse de Bruijn graphs using HiFi read data. We varied the k-mer size k, and for MBG, the window size parameter w (Schleimer et al., 2003), which determines the sparseness of the resulting graph, with higher w leading to sparser graphs. Details of the experimental setup are in Supplementary Note SB. Table 1 shows the results for selected parameters and Supplementary Table S1 contains the full results.
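The effect of w on sparseness can be made concrete with a small simulation. The sketch below is an illustration and not part of MBG: it draws independent random 64-bit values in place of real k-mer hashes, assumes that w counts consecutive k-mers per window, and uses a hypothetical helper name. Under these assumptions, the fraction of selected positions is 1 for w = 1 and is expected to be roughly 2/(w + 1) for larger w.

```python
import random
from collections import deque


def selected_fraction(hashes, w):
    """Fraction of positions that are the minimum of at least one window of w."""
    candidates = deque()  # indices in the current window, hashes non-decreasing
    chosen = set()
    for i, h in enumerate(hashes):
        # Drop candidates that can no longer be a window minimum.
        while candidates and hashes[candidates[-1]] > h:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it has left the window [i - w + 1, i].
        if candidates[0] <= i - w:
            candidates.popleft()
        if i >= w - 1:
            chosen.add(candidates[0])  # leftmost minimum of the window
    return len(chosen) / len(hashes)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
hashes = [rng.getrandbits(64) for _ in range(100_000)]
for w in (1, 10, 100):
    print(w, round(selected_fraction(hashes, w), 4))
```

Real reads contain repeats and the hash values of overlapping k-mers are not truly independent, so the fractions observed on sequencing data can deviate from this idealized estimate.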
Comparison to existing tools. We compared MBG to BCalm2 (Chikhi et al., 2016) for building graphs using HiFi reads of E. coli. Note that N50 is not directly comparable between MBG and BCalm2 since the homopolymer compression step removes most errors and therefore greatly improves N50. BCalm2 uses less memory than MBG with w = 1, but for w = 10 and higher MBG uses less memory. MBG is faster than BCalm2 when w > 1, and slightly slower with w = 1. The runtime of BCalm2 increases greatly as the k-mer size increases while MBG scales efficiently to high k. Due to homopolymer errors in the reads, the N50 for BCalm2 suffers when k grows above 1001. On the other hand, the homopolymer compression of MBG enables it to scale to higher k. With higher w, MBG is an order of magnitude faster than BCalm2. Supplementary Table S2 and Supplementary Note SB contain an evaluation of the error rates of the assemblies. With k = 2501 and w = 2500, MBG assembles E. coli correctly into a single contig in 8 s on a single core with an estimated error rate of 45 errors per 100 kbp. Almost all errors are homopolymer run length errors, with only 0.18 non-homopolymer errors per 100 kbp.
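The two metrics used in this comparison follow standard definitions, sketched here for reference. The helper names are hypothetical and the code is not taken from MBG or BCalm2. N50 is the largest length L such that sequences of length at least L account for at least half of the total length, and the error rate is normalized to 100 kbp of assembled sequence.

```python
def n50(lengths):
    """N50 of a collection of sequence lengths (e.g. unitigs or contigs)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where the running sum covers half the total.
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one sequence length")


def errors_per_100kbp(num_errors, assembly_length):
    """Number of errors normalized to 100,000 bp of assembled sequence."""
    return num_errors * 100_000 / assembly_length
```

A single contig has an N50 equal to its own length, which is why an assembly of a genome into one contig maximizes this metric.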
Whole human genome HiFi. We ran MBG on whole human genome HiFi data from the individual HG002. Runtime is between 2 and 7 h on a single core with all parameter sets, showing that MBG is fast and scales to large k. The limitation on increasing k and w even higher is the error rate and read length of the HiFi reads. We also ran BCalm2 on the same reads with k = 127. We did not run BCalm2 with higher k since the previous experiment suggests the runtime would be prohibitive. MBG is an order of magnitude faster than BCalm2; however, its memory use is higher because MBG keeps all data in memory while BCalm2 uses temporary files on disk.

Conclusion
We have implemented MBG, a tool for building sparse de Bruijn graphs from HiFi reads using minimizer winnowing. The sparsification enables MBG to run orders of magnitude faster than tools for building dense de Bruijn graphs. Increasing the sparsity parameter w speeds up assembly but can reduce homopolymer run length consensus accuracy. MBG uses a novel method to compress long k-mers to constant-sized hashes and enables k to scale arbitrarily high.
MBG can quickly build de Bruijn graphs of mammalian-sized genomes, with runtimes ranging from 2 to 7 h on a single core. The memory use currently prevents MBG from being run on mammalian datasets on laptops and desktop computers. However, MBG fits comfortably in the RAM of most computing servers. Disk-based approaches used by previous tools (Chikhi et al., 2016) might enable MBG to run on mammalian datasets on laptops. MBG enables small genomes such as E. coli to be assembled in a few seconds and mammalian genomes in a few hours.
Financial Support: none declared.