Efficient graph-color compression with neighborhood-informed Bloom filters

Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains inaccessible to the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data on a more abstract level is its transformation into an assembly graph. Although the sequence information is then accessible, any contextual annotation and metadata are lost. We present a new approach for a compressed representation of a graph coloring based on a set of Bloom filters. By dropping the requirement of fully lossless compression and using the topological information of the underlying graph to decide on false positives, we can reduce the memory requirements for a given set of colors per edge by three orders of magnitude. As insertion and query on a Bloom filter are constant-time operations, the complexity of compressing and decompressing an edge color is linear in the number of color bits. As individual colors are represented as independent filters, our approach is fully dynamic and can be easily parallelized. These properties allow for easy upscaling to the problem sizes common in the biomedical domain.

The next logical step of data integration for genome sequencing projects is the use of assembly graphs that help to gather short sequence reads into genomic contigs and eventually draft genomes. While the assembly of a single species' genome is already a challenging task [6], assembling a set of genomes from one or many WMS samples is even more difficult. Although preprocessing methods such as taxonomic binning [10] help to reduce its complexity, the task remains a challenge. A commonly used strategy to generate sequence assemblies is based on de Bruijn graphs, which collapse redundant sequence information into a node set of unique substrings of length k (k-mers) and transform the assembly problem into the problem of finding an Eulerian path in the graph [20]. Especially in a co-assembly setting, where a mixture of multiple source sequence sets is combined and information in addition to the sequences needs to be stored, colored de Bruijn graphs form a suitable data structure, as they allow associating one or many colors with each node or edge [13]. A second use case is the efficient representation and indexing of multiple genomes, a so-called pan-genome store [18].
Owing to the large size and excessive memory footprints of such graphs, recent work has suggested compressed representations for de Bruijn graphs based on approximate membership query data structures [7,3] or generalizations of the Burrows-Wheeler transform to graphs [5]. The latter is often referred to as the BOSS representation, an acronym of the authors' initials. Recent work on compressed colored de Bruijn graphs has followed this trend. Currently, there exist two distinct paradigms: the first is to compress the complete colored graph in a single data structure, the second to handle two separate (compressed) representations of graph and coloring. Into the first group fall approaches such as the Bloom Filter Trie [12] for pan-genome representations or deBGR [19], an encoding for a weighted de Bruijn graph. The second group contains approaches such as VARI [16], which uses succinct Raman-Raman-Rao or Elias-Fano compression on the annotation vector, and Rainbowfish [1], which additionally takes the distribution of the annotations in the graph into account to achieve more efficient annotation compression rates. The recently introduced Metannot [17] is a succinct data structure based on wavelet tries. It performs similarly to the other approaches, but allows for efficient handling of dynamic settings where the annotation or the underlying graph structure are subject to change.
For many genomics applications, for instance the encoding of a pan-genome index for read labeling, an exact reconstruction of the colors is not necessary and an approximate recovery with high accuracy is sufficient. In this work, we present a probabilistic compression scheme for an arbitrarily sized set of colors given an arbitrary underlying graph. Based on Bloom filters [4], a data structure for efficient approximate membership queries with a one-sided error, we encode colors as bit vectors and store them as a set of filters. We further reduce the necessary storage requirements of the individual filters by maintaining only weak requirements on their respective false-positive rates, which are subsequently corrected for using neighborhood information in the graph.

Approach
We implement our reference metagenome as a colored de Bruijn graph (cDBG), which consists of a de Bruijn graph constructed from a collection of input sequences and an associated annotation. We represent this annotation as a bit matrix, associating each edge in the graph with a subset of predefined annotation classes.
To index the cDBG in a space-efficient manner, we employ the BOSS representation [5] of the DBG encoded with rank- and select-supporting succinct vectors, while the columns of the annotation matrix are stored in independent Bloom filters. As an error correction step, we employ an additional Bloom filter to indicate nodes at which neighboring edges change their colors.
Finally, we test the utility and scalability of the structures with a series of data sets derived from viral, bacterial, and human genomes.

Preliminaries and notation
Let Σ be an alphabet of fixed size (in the case of genome graphs, Σ = {$, A, C, G, T, N}). Given a string s ∈ Σ*, we use s[i : j] to denote the substring of s from 1-based position i up to and including position j.
Given a collection of input strings S = {s_1, . . . , s_n} ⊂ Σ*, we define the input sequence S to be the concatenation of the s_i, separated by the delimiter string $ · · · $ of length k. Finally, given bit vectors a, b ∈ {0, 1}^m, we use the notation a | b and a & b to denote the bitwise OR and AND operators, respectively.

de Bruijn graph representation
Definition 2.1 A colored de Bruijn graph of order k (where k > 1 is an integer) of the string S together with n associated color bits is an ordered tuple of the form G = (V, E, A), where V = { S[i : i + k − 1] : 1 ≤ i ≤ |S| − k + 1 } is the set of length-k substrings of S, E = { (v, w) ∈ V × V : v[2 : k] = w[1 : k − 1] } is the edge set, and A is the annotation matrix defined below. The edge set may also be defined in terms of substrings of S, as Ẽ = { S[i : i + k] : 1 ≤ i ≤ |S| − k }. It is clear that E ≅ Ẽ, and we use the map (v, w) ↦ v · w[k] to interconvert. The elements of V are also known as k-mers and the elements of Ẽ as (k + 1)-mers.
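As a concrete illustration of Definition 2.1, the node and edge sets can be sketched in a few lines of Python. This is a toy sketch only; the function names `kmers`, `edge_kmers` and `edge_to_pair` are ours, not part of the paper, and a real implementation would use the succinct BOSS representation rather than Python sets.

```python
def kmers(S, k):
    """Set of all length-k substrings (k-mers) of S -- the node set V."""
    return {S[i:i + k] for i in range(len(S) - k + 1)}

def edge_kmers(S, k):
    """Set of all (k+1)-mers of S -- the edge set in substring form."""
    return {S[i:i + k + 1] for i in range(len(S) - k)}

def edge_to_pair(e):
    """Interconvert a (k+1)-mer e into its node pair (v, w)."""
    return e[:-1], e[1:]

S = "ACGTACGG"
V = kmers(S, 3)
E = edge_kmers(S, 3)
# Every edge's endpoints are k-mers of S overlapping in k-1 characters.
assert all(v in V and w in V and v[1:] == w[:-1]
           for v, w in map(edge_to_pair, E))
```

Note how redundant substrings collapse: repeated k-mers of S appear only once in V, which is the source of the de Bruijn graph's compression of the input.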

bioRxiv preprint first posted online Dec. 26, 2017; doi: http://dx.doi.org/10.1101/239806. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC 4.0 International license.

Let {e_1, . . . , e_|E|} denote the reverse-lexicographically sorted elements of E. We then define the annotation matrix A ∈ {0, 1}^(|E| × n) such that A_ij = 1 if and only if edge e_i carries color bit j. We represent the tuple (V, E) using the BWT-based BOSS representation [5]. Further details are provided in Supplemental Section A.

Bloom filter-based compression
Since the columns of the annotation matrix A encode set inclusion, we use a Bloom filter to probabilistically store these vectors [4].
Let 1_i ∈ {0, 1}^m denote the bit vector in which only the i-th bit is set to one. A Bloom filter BF ∈ {0, 1}^m with hash functions h_1, . . . , h_d supports the operation insert, the relation ∈, and the operator ∪:

insert(BF, x) := BF | 1_{h_1(x)} | · · · | 1_{h_d(x)},
x ∈ BF :⟺ BF & (1_{h_1(x)} | · · · | 1_{h_d(x)}) = 1_{h_1(x)} | · · · | 1_{h_d(x)},
BF ∪ BF' := BF | BF'.

Let X be a random variable defined on the universe 𝒳 and let X̂ = {x_1, . . . , x_s} be a sample drawn from X which is inserted sequentially into a Bloom filter BF. Then the false positive probability (FPP) can be approximated [14] as

FPP ≈ (1 − e^(−ds/m))^d.
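The following is a minimal Bloom filter sketch illustrating insertion, membership query, and the FPP approximation. It is an illustration only, not the paper's implementation; in particular, deriving the d hash positions from salted SHA-256 digests is our assumption.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array with d hash functions."""
    def __init__(self, m, d):
        self.m, self.d = m, d
        self.bits = 0  # a Python int used as an m-bit vector

    def _positions(self, item):
        # Derive d bit positions from salted SHA-256 digests (our choice).
        for j in range(self.d):
            h = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p              # BF | 1_{h_j(x)}

    def __contains__(self, item):
        # One-sided error: inserted items are always reported present.
        return all(self.bits >> p & 1 for p in self._positions(item))

def fpp(m, d, s):
    """Approximate false positive probability after s insertions."""
    return (1 - math.exp(-d * s / m)) ** d

bf = BloomFilter(m=1024, d=3)
for kmer in ["ACGT", "CGTA", "GTAC"]:
    bf.insert(kmer)
assert "ACGT" in bf  # no false negatives, by construction
```

The one-sided error is what the neighborhood-informed correction below exploits: a query can spuriously return "present", but never spuriously "absent".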

Neighborhood-informed compression
To reduce the FPP of the filters, we propose to exploit the fact that annotations of neighboring nodes in the graph tend to be the same. Based on the assumption that the annotation is constant over a segment of length ℓ, we can compute the intersection of the annotations over all nodes of the segment and obtain an annotation with a much lower FPP. Following the argument in [14], the FPP for an annotation of a segment of length ℓ can be approximated as

(1 − e^(−ds/m))^(ℓd),

since there are effectively ℓd independent hash functions in use, each of which leads to a reduced FPP.
We need an additional data structure to store nodes at which the annotations of incoming and outgoing edges differ. For this we introduce an additional bit vector called the continuity vector that stores for each node whether the colors of the incoming edges match the outgoing ones.
Using this vector, given an edge e_i = (v_i, w_i) ∈ E, its continuity path N(i) ⊆ E is defined recursively as the set of edges reached from e_i by traversing through nodes whose continuity bit indicates matching incoming and outgoing colors. We then encode the continuity vector C using a Bloom filter.

Decompression
We encode A as a collection of Bloom filters B = {BF 1 , . . . , BF n } defined on the input space E.
The annotation of the edge e_i ∈ E is then queried by testing e_i for membership in each filter and correcting for false positives by intersecting over its continuity path. We can then use this to define the annotation function as

a(e_i)_j := ⋀_{e ∈ N(i)} 1[e ∈ BF_j], for j = 1, . . . , n.
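The decompression step can be sketched as follows, with plain Python sets standing in for Bloom filters; the function name and the toy data are hypothetical, not from the paper.

```python
def decompress_color(edge_path, color_filters):
    """Intersect per-color memberships over all edges of a continuity
    path; a false positive on a single edge is voted down by the rest."""
    return [all(e in bf for e in edge_path) for bf in color_filters]

# Toy example: the second "filter" spuriously contains ACGT (a simulated
# false positive), but since CGTA on the same continuity path is absent
# from it, the intersection clears the bit.
color_filters = [{"ACGT", "CGTA"},   # color 0: truly present on the path
                 {"ACGT"}]           # color 1: false positive on ACGT only
path = ["ACGT", "CGTA"]
assert decompress_color(path, color_filters) == [True, False]
```

Since Bloom filter errors are one-sided, the intersection can only remove spurious bits, never genuine ones, so correctness of true colors is preserved.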

Dynamic properties
A desirable property for genomics applications is that a given data structure allows for a dynamic representation, that is, the ability to add or remove color classes and to adapt the coloring to changes in the underlying graph structure. Returning to the conceptual representation of the coloring as a bit matrix, we would like to allow for dynamic behavior for both changes in the edges (rows) and the color classes (columns).
With the proposed method it is possible to efficiently extend the graph with additional nodes. New edges are added to the color bit Bloom filters, and new discontinuity nodes are immediately added to the continuity Bloom filter. With this strategy, it is important that the final number of edges carrying a specific color is estimated correctly, as this determines the optimal size chosen for the corresponding filter.
In addition to the dynamic behavior on edges, our approach also supports dynamic coloring. When the colored de Bruijn graph is extended with additional color labels, each new color bit simply gets a new Bloom filter. This has no effect on the accuracy of the remaining colors. Vice versa, removing a color bit is as simple as ignoring or discarding the corresponding Bloom filter. A second advantage of this strategy is that each color bit can be compressed at an independent compression rate. Hence, we can easily prioritize certain annotations on the de Bruijn graph by increasing their corresponding Bloom filter size and therefore their decompression accuracy.
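The dynamics described above can be sketched with a toy per-color store; the class and method names are ours, and plain sets again stand in for the independent Bloom filters.

```python
class ColorStore:
    """Toy sketch: one independent approximate set per color label."""
    def __init__(self):
        self.filters = {}  # color label -> set of edges (Bloom filter stand-in)

    def add_color(self, label):
        # Adding a color is just allocating one new filter.
        self.filters[label] = set()

    def remove_color(self, label):
        # Dropping a color discards its filter; other colors are unaffected.
        del self.filters[label]

    def annotate(self, label, edge):
        self.filters[label].add(edge)

    def query(self, edge):
        return {label for label, bf in self.filters.items() if edge in bf}

store = ColorStore()
store.add_color("sample_A")
store.annotate("sample_A", "ACGT")
store.add_color("sample_B")
store.remove_color("sample_B")   # sample_A's accuracy is untouched
assert store.query("ACGT") == {"sample_A"}
```

Because each filter is independent, per-color sizes (and hence per-color accuracies) can be chosen separately, and insertions into different filters can proceed in parallel.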

Data
We used several different data sets to evaluate the behavior of our Bloom filter color compression. These four data sets originate from viruses (Virus1000), bacteria (BacteriaSelect and BacteriaAll) and humans (chr22+gnomAD) and were chosen to test the method on different coloring distributions, coloring sizes and coloring densities. We constructed de Bruijn graphs of order k = 63 for each data set.
The virus and bacteria data sets were both generated from publicly available GenBank [8] complete genome data. The Virus1000 data set consists of 1000 randomly selected complete virus
genomes, whose resulting graph consists of several disjoint, linear paths. The BacteriaSelect and BacteriaAll data sets consist of 45 and 136 different bacterial strains from the genus Lactobacillus, respectively, which leads to a largely linear topology in the graphs, with many shorter paths disconnecting from and reconnecting to a main backbone path. For the human chr22+gnomAD data set, chromosome 22 from the hg19 assembly of the human reference genome was used as the main reference backbone, together with exome variants from the gnomAD data set [9]. This results in a graph that has a similar structure to the Virus1000 graph, scaling the number of nodes three-fold while reducing the total number of colors.
A summary of the data sets is shown in Table 1. The data sets have been used in previous work [17] and further information about the data sets and the list of all virus and bacteria strains that were used can be found in their appendices.

Parameters
For each data set, the size of each color Bloom filter is computed as a constant factor ε (shared between all filters) of the number of edges in the graph annotated with that color. Appropriate values of ε, which resulted in color Bloom filter collections with average accuracies of 95% and 99%, were determined through binary search.
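The calibration of ε by binary search can be sketched as follows, assuming accuracy grows monotonically with the filter-size factor. Here `measure_accuracy` abstracts a full compression/decompression run on a data set; the closed-form model below merely stands in for it and is not from the paper.

```python
def calibrate_epsilon(measure_accuracy, target, lo=0.1, hi=64.0, iters=30):
    """Binary-search the filter-size factor eps until the measured
    decompression accuracy reaches the target.  Assumes
    measure_accuracy(eps) is monotonically increasing in eps."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if measure_accuracy(mid) < target:
            lo = mid   # filters too small: raise eps
        else:
            hi = mid   # target met: try smaller filters
    return hi          # smallest tested eps known to meet the target

# Toy monotone accuracy model standing in for an actual compression run.
model = lambda eps: 1.0 - 2.0 ** (-eps)
eps95 = calibrate_epsilon(model, 0.95)
assert model(eps95) >= 0.95
```

Binary search is appropriate here because each probe is expensive (a full compress/decompress cycle), so the number of evaluations should be logarithmic in the desired precision of ε.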

Evaluation and Applications
This section empirically evaluates the performance of the proposed color compression scheme. Our experiments are based on a range of data sets originating from viruses (Virus1000), bacteria (BacteriaSelect and BacteriaAll) and humans (chr22+gnomAD) compiled by Mustafa et al. [17]. Table 1 gives a comprehensive overview of these collections in terms of number of nodes, edges, color bits and unique colors.

Table 1: Virus, bacteria and human data sets used for evaluation. b = color bits per edge, U = unique edge colors (bit combinations). Virus1000 is composed of 1000 viral strains. BacteriaSelect and BacteriaAll consist of 45 and 136 bacterial strains, respectively. chr22+gnomAD uses chromosome 22 from the hg19 assembly of the human reference genome as the main reference backbone, together with the gnomAD data set [9].

We evaluate the space efficiency of the color compression scheme in terms of bits per edge across the entire de Bruijn graph. As discussed earlier, the method is parameterized by its desired accuracy. Table 2 shows results across all data sets for accuracy settings of 95% and 99%. Note that decompression speed is slightly decreased for higher accuracy settings, as more context is needed. Here we count an annotation as correct if all of its annotation bits (i.e., the color) are correct; the accuracy per individual annotation bit is much higher.

Table 2: Compression efficiency at color accuracy settings of 95% and 99%, respectively. The average individual Bloom filter false positive probability (FPP) is the ratio of set bits in the Bloom filter raised to the power of the number of hash functions (standard deviation smaller than 1% for all data sets). The average context length (including standard deviation) is the average number of queries to a color bit Bloom filter before a specific false positive bit is detected and removed from the color of a path.

We observe that even at high accuracy thresholds, compression to 1-3 bits per edge can be achieved. To measure the computational complexity of the proposed method, we report the single-thread compression and decompression times for the entire graph on a system with 36 cores of Intel(R) Xeon(R) CPU E5-2697 v4 (2.30GHz) that is part of ETH's shared high-performance compute systems. While decompression is generally the more costly of the two steps, both operations attain a throughput of several hundred thousand edges per second, even on a single thread.

To further investigate the connection between target accuracy and compression ratio, Figure 1a plots accuracy as a function of the number of bits per edge invested into the color Bloom filters. All data sets show a similar, steeply rising behavior as we increase the relative filter sizes. This trend reaches its asymptote between 1 and 3 bits per edge, at accuracy levels approaching 1.0. Similarly, Figure 1b focuses on the Virus1000 collection and linearly increases the number of coloring bits per edge. For this analysis, we computed a chain of virus genome collections named Virus50 to Virus950 in steps of 50. Virus50 consists of 50 randomly selected genomes from Virus1000, while each subsequent collection is generated by randomly sampling an additional set of 50 genomes without replacement (i.e., Virus100 is a subsample of 100 genomes that contains Virus50). To report average compression ratios, 10 random draws were generated from Virus1000 with different random seeds and used to compute the derived virus genome collections.
Finally, we close with a side-by-side comparison of the various de Bruijn graph color compression schemes presented in Section 1. In addition to these domain-specific methods, we include two popular general-purpose static compression methods, gzip and bzip2. Table 3 lists the number of bits per edge required to compress our four experimental collections. At an accuracy of 95% our method is considerably more space efficient, achieving compression ratios orders of magnitude greater than the competing methods. At 99% accuracy our approach performs comparably to Rainbowfish on the human genome collection while on all other collections we see a continued significant performance advantage of our method.

Conclusion
We have presented a probabilistic, compressed representation of a color encoding for arbitrary graphs, demonstrated on colored de Bruijn graphs. Our method uses approximate set representations to store an arbitrary number of annotations on the graph and leverages the graph topology, taking advantage of the continuous colorings of neighboring nodes to improve the achieved compression ratios. Our representation can be efficiently decompressed and queried to retrieve the color of arbitrary paths in the graph. Although it is helpful to know the frequency of individual colors upfront to optimally choose the sizes of the individual Bloom filters, this factor can easily be estimated from the size of the input data, allowing us to directly build the full coloring. We have shown the utility of our approach on different biological data sets, including data from virus, bacteria and human genomes, representing different classes of graph topologies and colorings. On all data sets we achieve comparable or strongly improved compression performance at a very high level of decompression accuracy. Notably, our approach is fully dynamic and allows for an easy extension with additional labels/colors or for changes in the underlying graph structure, enabling the augmentation of large colored graphs with new annotations, a scenario commonly occurring in the genomics setting. In future work we will adapt our method to better scale with dynamic changes. If a data set grows rapidly in the number of edges, the decoding accuracy will eventually drop, requiring the re-initialization of a larger Bloom filter; currently this means reloading all elements into a larger filter. Further, despite being dynamic, our current representation does not allow for the removal of colors or edges from the graph. To support this, we could replace the Bloom filters with other probabilistic set representations that allow for item removal [2,11].
To further increase the performance of the neighborhood-informed color compression, a separate continuity vector could be stored for each color Bloom filter (producing a continuity matrix) to further reduce the FPP. Given their greater sparsity compared to maintaining a single continuity vector, wavelet tries [17] may be an option for a dynamic structure to compress them losslessly. Lastly, an additional space improvement could be achieved with more space-efficient probabilistic set representations such as compressed Bloom filters [15].

A Supplemental Methods
For simplicity, the vector pairs (W, B) and (F, L) can be interleaved into W^-, F^- ∈ (Σ ∪ Σ^-)^|E|, respectively. With this encoding, we have:

Lemma A.1 Forward graph traversal can be done using the first-last equivalence between F and W^- inherent to FM-indices [5].
