SeQuiLa-cov: A fast and scalable library for depth of coverage calculations

Abstract Background Depth of coverage calculation is an important and computationally intensive preprocessing step in a variety of next-generation sequencing pipelines, including the analysis of RNA-sequencing data, detection of copy number variants, or quality control procedures. Results Building upon big data technologies, we have developed SeQuiLa-cov, an extension to the recently released SeQuiLa platform, which provides efficient depth of coverage calculations, reaching >100× speedup over the state-of-the-art tools. The performance and scalability of our solution allow for exome and genome-wide calculations running locally or on a cluster while hiding the complexity of the distributed computing with Structured Query Language Application Programming Interface. Conclusions SeQuiLa-cov provides significant performance gain in depth of coverage calculations streamlining the widely used bioinformatic processing pipelines.


Introduction
Given a set of sequencing reads and a genomic contig, depth of coverage for a given position is de ned as a total number of reads overlapping the locus.
The coverage calculation is a frequently performed but timeconsuming step in the analysis of Next Generation Sequencing (NGS) data. In particular, Copy-Number Variant detection pipelines require obtaining su cient read depth of the analyzed samples [1,2,3]. In other applications, the coverage is computed to assess the quality of the sequencing data (e.g. to calculate the percentage of genome with at least 30X read depth) or to identify genomic regions overlapped by insucient number of reads for reliable variant calling [4]. Finally, depth of coverage is one of the most computationally intensive parts of di erential expression analysis using RNA-seq data at single-base resolution [5,6,7].
Traditionally, these methods calculate the depth of coverage using a pileup-based approach (introduced in samtools [9] and used in GATK [11]), which is ine cient since it iterates through each nucleotide position at every read in a Binary Alignment Map (BAM) le. An optimized, event-based approach has been proposed in bedtools [10] and mosdepth [13]. These algorithms use only speci c 'events', i.e. start and end of the alignment blocks within each read ( Figure 1A) instead of analyzing every base of each read, which signi cantly reduces the overall computational complexity.
Samtools and bedtools depth of coverage modules do not provide any support for multi-core environment. Mosdepth implements parallel BAM decompression, but its main algorithm remains sequential. Sambamba, on the other hand, promotes itself as a highly parallel tool, implementing depth of coverage calculations in a map-reduce fashion utilizing multiple threads on a single node. Regardless of parallelization degree, all of the above mentioned tools share a common bottleneck caused by using a single thread for returning results. Finally, GATK was the rst genomic framework providing a support for distributed computations, however, the DepthOfCoverage method has not been ported yet to the current software release of the toolkit. We present the rst fully scalable, distributed, SQLoriented solution designated for the depth of coverage calculations. SeQuiLa-cov, an extension to the recently released Se-QuiLa [14] platform, runs a redesigned event-based algorithm for the distributed environment and provides convenient, SQLcompliant interface. The tool can be easily integrated with other applications implemented in Scala, R, or Python.
In the most general case, the algorithm can be used in a distributed environment where each cluster node computes the coverage for the subset of data slices using the event-based method. Speci cally, for the i-th partition containing the set of reads (read_set i ), the set of events i,chr vectors (where chr is an index of genomic contig represented in read_set) is allocated and updated, based on the items from read_set i . For all reads, the algorithm parses the CIGAR string and for each continuous alignment block characterized by start position and length len it increments by one the events i,chr (start) and decrements by one the value of events i,chr (start + len). To compute the partial coverage vector for partition i and contig chr, a vector value at the index j is calculated as follows: The result of this stage is a set of partial_coverage i,chr vectors distributed among the computation nodes. To calculate the nal coverage for the whole read_set, an additional step of correction for overlaps between the partitions is required. An overlap overlap i,chr of length l between vectors partial_coverage i,chr and partial_coverage i+1,chr may occur on the partition boundaries where l tailing genomic positions of partial_coverage i,chr are the same as l heading genomic positions of partial_coverage i+1,chr (see Figure 1C).
If an overlap is identi ed then the coverage values from the partial_coverage i,chr 's l-length tail are added into the partial_coverage i+1,chr 's head and subsequently the last l elements of partial_coverage i,chr are removed. Once this correction step is completed, non-overlapping coverage i,chr vectors are collected and yield the nal coverage values for the whole input read_set.
The main characteristic of the described algorithm is its ability to distribute data and calculations (such as BAM decompression and main coverage procedure) among the available computation nodes. Moreover, instead of simply performing full data reduction stage of the partial coverage vectors, our solution minimizes required data shu ing among cluster nodes by limiting it to the overlapping part of coverage vectors. Importantly, SeQuiLa-cov computation model supports ne-grained parallelism at user-de ned partition size in contrary to the traditional, coarse-grained parallelization strategies that involve splitting input data at a contig level.

Implementation
We have implemented SeQuiLa-cov in Scala programming language using the Apache Spark framework. To e ciently access the data from a BAM le we have prepared a custom data source using Data Source Application Programming Interface (API) exposed by SparkSQL. Performance of the read operation bene ts from the Intel Genomics Kernel Library (GKL) [15] used for decompressing the BAM les chunks and from predicate pushdown mechanism that lters out data at the earliest stage.
The implementation of the core coverage calculation algorithm aimed at minimizing, whenever possible memory footprint by using parsimonious data types, e.g. Short type instead of Integer, and e cient memory allocation strategy for large data structures, e.g. favoring static Arrays over dynamic size ArrayBu ers. Additionally, to reduce the overhead of data shu ing between the worker nodes in the correction for the overlaps stage we used Spark's shared variables [16] accumulators and broadcast variables ( Figure 1C). Accumulator is used to gather information about the worker nodes' coverage vector ranges and coverage vector tail values, that are subsequently read and processed by the driver. This information is then used to construct a broadcast variable distributed to the worker nodes in order to perform adequate trimming and summing operations on partial coverage vectors. Both samtools and bedtools calculate coverage using only a single thread, however, their results di er signi cantly, with samtools being around twice as fast. Sambamba positions itself as a multithreaded solution although our tests revealed that its execution time is nearly constant, regardless of the number of CPU cores used, and even twice as slow as samtools. Mosdepth achieved speedup against samtools in blocks coverage and against sambamba in windows coverage calculations, however, its scalability reaches limit at 5 CPU cores. Finally, SeQuiLa-cov, achieves nearly identical performance as mosdepth for the single core but the execution time decreases substantially for greater number of available computing resources which makes this solution the fastest when run on multiple cores and nodes. 1 per-base results are treated as blocks output. Samtools lacks the functionality of blocks coverage calculations, however, we included this tool in our benchmark for completeness, treating its per-base results as blocks outcome assuming that both result types require nearly the same resources.

Supported coverage result types
SeQuiLa-cov features three distinct result types: per-base, blocks, and xed-length windows coverage ( Figure 1A). For perbase, the depth of coverage is calculated and returned for each genomic position making it the most verbose output option. The method producing block level coverage (blocks) involves merging adjacent genomic positions with equal coverage values into genomic intervals. As a consequence, fewer records than in case of per-base output type are generated with no information loss. The xed-length windows the algorithm generates set of xed length, tiling, non-overlapping genomic intervals and returns arithmetic mean of coverage values over positions within each window.

ANSI SQL compliance
SeQuiLa-cov solution promotes SQL as a data query and manipulation language in genomic analysis. Data ows are performed in SQL-like manner through the custom data source supporting convenient Create Table as Select and Insert as Select methods. SeQuiLa-cov provides a table abstraction over existing BAM/CRAM les, with no need of data conversion, which can be further conveniently queried and manipulated in a declarative way. The coverage calculation function bdg_coverage, as described in Algorithm sub-section, has been implemented as table-valued function( Figure 1D).

Benchmarking
We have benchmarked SeQuiLa-cov solution with leading software for depth of coverage calculations, speci cally samtools depth, bedtools genomeCov, sambamba depth and mosdepth (results of DepthOfCOverage from outdated GATK version are available in supplementary data). The tests were performed on the aligned WES and WGS reads from the NA12878 sample (see Methods for details) and aimed at calculating blocks and window coverage. To compare the performance and scalability of each solution, we have executed calculations for 1, 5, and 10 cores on a single computation node (see Table 2). Samtools depth and bedtools genomeCov are both natively non-scalable and were run on a single thread only. Exomewide calculations exceeded 10 minutes and genome-wide analyses took over two hours in case of samtools, while bedtools' performance was signi cantly worse, i.e ∼1.9x for WES and ∼4.75x for WGS. Sambamba depth declares to take advantage of fully parallelized data processing with the use of multithreading. However, our results revealed that even when additional threads were used, the total execution time of coverage calculations remained nearly constant and greater than samtools's result. Mosdepth shows signi cant speedup (∼1.3x) against samtools when using single thread. This performance gain increases to ∼3.7x when using 5 decompression threads, however, it does not bene t from adding additional CPU power. In case of xed-length window coverage mosdepth achieves over ∼1.3 speedup against sambamba.
SeQuiLa-cov achieves performance similar to mosdepth when run using a single core. However, SeQuiLa-cov is ∼1.3x and ∼2.5x as fast as mosdepth when using 5 and 10 CPU cores, respectively, demonstrates its better scalability. The similar performance characteristic is observed for both blocks and xed-length windows methods.
To fully assess the scalability pro le of our solution, we have performed additional tests in a cluster environment (see Methods for details). Our results show that when utilizing additional resources (i.e. more than 10 CPU cores), SeQuiLa-cov is able to reduce the total computation time to 15 seconds for WES and less than one minute for WGS data (Figure 2). Scalability limit is achieved for 200 and ∼500 CPU cores in case of WES and WGS data, respectively.
To evaluate the impact of Intel GKL library on de ate operation (BAM bzgf block decompression), we have performed blocks coverage calculations on WES data on 50 CPU cores. The results showed on average ∼1.18x speedup when running with Intel GKL de ate implementation.
Finally, our comprehensive functional unit testing showed that results calculated by SeQuiLa-cov and samtools depth are identical.

Conclusions
The application of the recent advancements in big data technologies and distributed computing can contribute to both speeding up genomic data processing and management. Analysis of large genomic data sets require e cient, accurate, and scalable algorithms to perform calculations utilizing the computing power of multiple cluster nodes. In this work, we show that with su ciently large cluster genome-wide coverage calculations may last less than a minute and at the same time being over 100x faster than the best single-threaded solution.
SeQuiLa-cov is one of the building blocks of SeQuiLa [14] ecosystem, which initiated the move towards e cient,   The presented call for coverage method takes sample identi er (sample1) and result type (blocks) as input parameters. bdg_coverage is implemented as a tablevalued function. Therefore, it outputs a table as a result allowing for customizing a query using Data Manipulation Language e.g. in the SELECT or WHERE clause.
For the purpose of this example, we assume that BAM le for sample1 contains only reads from chr3. Panel C shows the concept of distributed version of events-based algorithm. Assuming that we run our calculations in a distributed environment, the computation nodes do not work on the whole input data set (table read_set) but on n smaller data partitions (slice 1 , slice 2 , .. ,slicen), each containing subset of input aligned reads. First the algorithm calculates partial events vector for available data slices and subsequently produces corresponding partial partial_coverage vector. Due to the possibility of overlapping of ranges between two consecutive data slices, additional correction step needs to be performed. When an overlap is identi ed, the distributed processing of genomic data and providing SQLoriented API for convenient and elastic querying. We foresee that following this direction will enable the evolution of genomic data analysis from the le-oriented to table-oriented processing.

Test data
We have tested our solution using reads from NA12878 sample which were aligned to hg18 genome. WES data containing over 161 million of reads weights 17 GB and WGS data include over 2,6 billion of reads taking 272 GB of disk space. Both BAM les were compressed at the default BAM's compression level (5).

Testing environment
To perform comprehensive performance evaluation, we have setup a test cluster consisting of 28 Hadoop nodes (1 edge node, 3 master nodes and 24 data nodes) with Hortonworks Data Platform 3.0.1 installed. Each data node has 28 cores (56 with hyper-threading) and 512 GB of RAM, Yet Another Resource Negotiator (YARN) resource pool has been con gured with 2640 virtual cores and 9671 GB RAM.

Investigated solutions
In our benchmark we have used the most recent versions of the investigated tools i.e. samtools version 1.9, bedtools 2.27.0, sambamba 0.6.8, mosdepth version 0.2.3 and SeQuiLa-cov version 0.5.1.

Availability of supporting data and materials
The Docker image is available at https://hub.docker.com/r/ biodatageeks/. Supplementary information on benchmarking procedure as well as test data is publicly accessible at http://biodatageeks.org/sequila/benchmarking/benchmarking. html#depth-of-coverage.

Author's Contributions
MW -conceptualization, formal analysis, investigation, software and writing. AS -data curation, formal analysis, investigation, software, visualization and writing. WK -formal analysis, investigation, writing. TG -formal analysis, supervision, investigation, visualization and writing. All authors approved the nal manuscript.  Each experiment setting was repeated several times and the height of each bar along with the corresponding error bars indicate the average, as well as the minimum and maximum execution time, respectively. The best pileup-based solution is de nitely slower (two times for WGS calculations) than both event-based solutions what clearly shows the superiority of the latter one. Mosdepth execution time scales up to 5 cores, afterwards it shows no furthe gain in performance. SeQuiLa-cov has nearly the same execution time results as mosdepth for both blocks and windows calculations for a single core, but scales out desirably utilizing all 500 CPU cores on cluster nodes and at the same time performing WGS calculations in less than 1 minute.