Bedtk: finding interval overlap with implicit interval tree

Li, Heng; Rong, Jiazhen

doi:10.1093/bioinformatics/btaa827

Abstract

Summary

We present bedtk, a new toolkit for manipulating genomic intervals in the BED format. It supports sorting, merging, intersection, subtraction and the calculation of the breadth of coverage. Bedtk uses implicit interval tree, a data structure for fast interval overlap queries. It is several to tens of times faster than existing tools and tends to use less memory.

Availability and implementation

The source code is available at https://github.com/lh3/bedtk.

1 Introduction

Processing genomic intervals is a routine task and has many practical applications in bioinformatics. Some common uses of genomic intervals include computing coverage and depth of specific genomic features and finding intersections of new sequencing results with known genomic features such as exons or genes. BEDTools (Quinlan and Hall, 2010) is a popular toolkit to manipulate intervals in the BED format. However, it can be slow given large datasets. BEDOPS (Neph et al., 2012) addresses this issue by streaming sorted BED files. While this strategy improves performance, it is less convenient to use and is limited to a subset of interval operations. These observations motivated us to develop bedtk that achieves high performance without requiring sorting.

2 Materials and methods

Efficiently finding overlapping intervals is a core functionality behind all interval processing tools. Bedtk uses implicit interval tree, a data structure extending the concept of array layouts of searching tree (Brodal et al., 2002; Khuong and Morin, 2017), to perform interval overlap queries against a static list of intervals.

2.1 Implicit binary search trees

A binary search tree (BST) is a binary tree where each node is associated with a key and this key is no less than all keys in the left subtree and no greater than all keys in the right subtree. Here, we show that a BST can be implicitly represented by a sorted array. In this implicit BST, each node is an element in the array and the tree topology is inferred from the index of each element.

More specifically, consider a sorted array consisting of $2^{K + 1} - 1$ elements. This array can implicitly represent a binary tree with $K + 1$ levels, with leaves put at level 0 and the root at level $K$ (Fig. 1). This implicit BST has the following properties:

Fig. 1.

Open in new tab Download slide

Example of implicit interval tree

At level $k$ ⁠, the first node is indexed by $2^{k} - 1$ ⁠. As a special case, the root of the tree is indexed by $2^{K} - 1$ ⁠.
For a node indexed by $x$ at level $k$ ⁠, its left child is indexed by ${x - 2}^{k - 1}$ ⁠, and its right child indexed by ${x + 2}^{k - 1}$ ⁠.
For a node indexed by $x$ at level $k$ ⁠, it is a left child if its $(k + 1)$ th bit is 0 (i.e. $⌊x / 2^{k + 1}⌋$ is an even number), and its parent node is ${x + 2}^{k}$ ⁠. Node $x$ is a right child if its $(k + 1)$ th bit is 1 (i.e. $⌊x / 2^{k + 1}⌋$ is an odd number), and its parent node is ${x - 2}^{k}$ ⁠.
For a node indexed by $x$ at level $k$ ⁠, there are $2^{k + 1} - 1$ nodes descending from $x$ (including $x$ ⁠). The left-most leaf in this subtree is $x - (2^{k} - 1)$ ⁠.

With these properties, we can identify the parent and the children of a node in $O (1)$ time. Figure 1 shows an example of implicit BST for the start positions of a list of intervals which is sorted by the start positions. When there are fewer than $2^{K + 1} - 1$ elements in the array, we still assume a full binary tree with some nodes set to ‘empty’. Notably in this case, an empty node may have a non-empty child (e.g. in Fig. 1, empty node 11 has a non-empty child node 9).

2.2 Implicit interval trees

An augmented interval tree is a data structure for interval overlap queries. It is an extension to BST where a node corresponds to an interval and each node additionally keeps a MaxEnd field which is the largest end position in the subtree descending from the node. Because a BST can be implicitly represented with an array, an augmented interval tree can be represented with an array as well (Fig. 1).

Sorting input intervals takes $O (N \log N)$ time. Computing MaxEnd requires a bottom-up tree traversal in $O (N)$ time. The time complexity of searching is $O (\log N + m)$ per query, where $N$ is the number of intervals in the tree and $m$ is the number of overlapping intervals.

3 Results

We implemented implicit interval tree in bedtk along with few other common operations such as sorting and interval merging. We compared bedtk to BEDTools v2.29.2 and BEDOPS v2.4.39 on two BED files: (i) 1 194 285 exons from GenCode annotations and (ii) 9 944 559 alignment intervals of long RNA-seq reads. Both are available at https://github.com/lh3/cgranges/releases.

As is shown in Table 1, bedtk is consistently faster than BEDTools and BEDOPS for all the evaluated operations, even when we discount sorting time for the BEDOPS ‘intersection’ operation. The performance gap between bedtk and BEDTools is even larger for unsorted input (‘intersect’ and ‘coverage’). This exemplifies the advantage of the implicit interval tree. BEDOPS takes the least memory for sorting potentially because it uses advanced encoding; it uses even less memory for ‘intersect’ as it streams the input instead of loading one or both input files into memory. However, requiring sorted input complicates data processing pipelines and wastes working disk space. And even with sorted input, BEDOPS is slower than bedtk.

Table 1.

Performance of interval operations

Operation	Bedtk	BEDTools	BEDOPS
Sort	5.35 s/393 MB	21.14 s/2731 MB	11.25 s/205 MB
Intersect	6.42 s/19 MB	54.87 s/397 MB	8.78 s/3 MB
Coverage	10.91 s/19 MB	257.11 s/678 MB

Operation	Bedtk	BEDTools	BEDOPS
Sort	5.35 s/393 MB	21.14 s/2731 MB	11.25 s/205 MB
Intersect	6.42 s/19 MB	54.87 s/397 MB	8.78 s/3 MB
Coverage	10.91 s/19 MB	257.11 s/678 MB

Note: Each cell gives the CPU time in sections and peak memory in megabases (MB) measured on a Linux server with two AMD EPYC 7301 CPUs at 2.2 GHz. The ‘sort’ operation sorts file 2. ‘intersect’ reports intervals in file 2 that overlaps intervals in file 1. For each interval in file 2, ‘coverage’ computes the number of bases covered by intervals in file 1. BEDOPS does not support the ‘coverage’ operation. It also requires sorted input for ‘intersect’; sorting time is excluded. All operations are CPU bounded, not I/O bounded.

Open in new tab

Table 1.

Performance of interval operations

Operation	Bedtk	BEDTools	BEDOPS
Sort	5.35 s/393 MB	21.14 s/2731 MB	11.25 s/205 MB
Intersect	6.42 s/19 MB	54.87 s/397 MB	8.78 s/3 MB
Coverage	10.91 s/19 MB	257.11 s/678 MB

Operation	Bedtk	BEDTools	BEDOPS
Sort	5.35 s/393 MB	21.14 s/2731 MB	11.25 s/205 MB
Intersect	6.42 s/19 MB	54.87 s/397 MB	8.78 s/3 MB
Coverage	10.91 s/19 MB	257.11 s/678 MB

Note: Each cell gives the CPU time in sections and peak memory in megabases (MB) measured on a Linux server with two AMD EPYC 7301 CPUs at 2.2 GHz. The ‘sort’ operation sorts file 2. ‘intersect’ reports intervals in file 2 that overlaps intervals in file 1. For each interval in file 2, ‘coverage’ computes the number of bases covered by intervals in file 1. BEDOPS does not support the ‘coverage’ operation. It also requires sorted input for ‘intersect’; sorting time is excluded. All operations are CPU bounded, not I/O bounded.

Open in new tab

4 Discussions

An implicit interval tree can be implemented in <100 lines of C++ code. It is a simple yet efficient data structure for fast interval queries. Daniel Jones alters the memory layout of implicit interval tree to van Emde Boas for improved cache locality (https://github.com/dcjones/coitrees). It speeds up query at the cost of more memory and unsorted query output. Meanwhile, Mike Lin gives up the standard top-down interval tree query and instead allows to start a query from any node in the tree (https://github.com/mlin/iitii). This improves the performance on practical data. In additional to interval trees, nested containment list (Alekseyenko and Lee, 2007) and augmented interval list (Feng et al., 2019) are alternative data structures for fast interval overlap queries. However, no standalone user-oriented tools have implemented these advanced algorithms. Bedtk is the first toolkit that is designed for end users and outperforms popular tools for common interval operations.

Acknowledgements

The authors thank Daniel C. Jones and Mike Lin for further improving the performance of implicit interval trees.

Funding

This work was supported by National Institutes of Health [R01HG010040].

Conflict of Interest: none declared.

References

Alekseyenko

A.V.

,

Lee

C.J.

(

2007

)

Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases

.

Bioinformatics

,

23

,

1386

–

1393

.

Brodal

G.S.

et al. (

2002

) Cache oblivious search trees via binary trees of small height. In SODA '02: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January. pp.

39

–

48, San Francisco, California.

Feng

J.

et al. (

2019

)

Augmented Interval List: a novel data structure for efficient genomic interval search

.

Bioinformatics

,

35

,

4907

–

4911

.

Khuong

P.V.

,

Morin

P.

(

2017

) Array layouts for comparison-based searching. ACM Journal of Experimental Algorithmic,

22

, 1.3:1-39.

Neph

S.

et al. (

2012

)

BEDOPS: high-performance genomic feature operations

.

Bioinformatics

,

28

,

1919

–

1920

.

Quinlan

A.R.

,

Hall

I.M.

(

2010

)

BEDTools: a flexible suite of utilities for comparing genomic features

.

Bioinformatics

,

26

,

841

–

842

.

Author notes

The authors wish it to be known that, in their opinion, Heng Li and Jiazhen Rong should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Associate Editor:

Download all slides

Month:	Total Views:
September 2020	18
October 2020	10
November 2020	5
December 2020	6
January 2021	20
February 2021	8
March 2021	10
April 2021	1
May 2021	9
June 2021	69
July 2021	44
August 2021	16
September 2021	8
October 2021	17
November 2021	5
December 2021	13
January 2022	11
February 2022	16
March 2022	10
April 2022	6
May 2022	26
June 2022	43
July 2022	18
August 2022	33
September 2022	61
October 2022	56
November 2022	46
December 2022	16
January 2023	29
February 2023	30
March 2023	35
April 2023	19
May 2023	25
June 2023	24
July 2023	25
August 2023	20
September 2023	168
October 2023	43
November 2023	61
December 2023	28
January 2024	199
February 2024	60
March 2024	56
April 2024	28

Article Contents

Bedtk: finding interval overlap with implicit interval tree

Abstract

1 Introduction

2 Materials and methods

2.1 Implicit binary search trees

2.2 Implicit interval trees

3 Results

4 Discussions

Acknowledgements

Funding

References

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

Bedtk: finding interval overlap with implicit interval tree

Abstract

1 Introduction

2 Materials and methods

2.1 Implicit binary search trees

2.2 Implicit interval trees

3 Results

4 Discussions

Acknowledgements

Funding

References

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only