Boiler: lossy compression of RNA-seq alignments using coverage vectors

We describe Boiler, a new software tool for compressing and querying large collections of RNA-seq alignments. Boiler discards most per-read data, keeping only a genomic coverage vector plus a few empirical distributions summarizing the alignments. Since most per-read data is discarded, storage footprint is often much smaller than that achieved by other compression tools. Despite this, the most relevant per-read data can be recovered; we show that Boiler compression has only a slight negative impact on results given by downstream tools for isoform assembly and quantification. Boiler also allows the user to pose fast and useful queries without decompressing the entire file. Boiler is free open source software available from github.com/jpritt/boiler.

SUPPLEMENTARY NOTE 1
Hardness of the read decompression problem.
In the Multiple Subset Sum Problem (MSSP), we are given a set of items, each with a weight, and a set of knapsacks, each with a capacity. The goal is to assign items to knapsacks such that:
1. each item is assigned to at most one knapsack,
2. the capacity of each knapsack is not exceeded by the combined weights of the items assigned to it, and
3. the total weight of the items in all the knapsacks is maximized.
MSSP is known to be strongly NP-hard (1). We reduce MSSP to a special case of the read decompression problem where the coverage vector never exceeds 1. We first construct a vector C encoding the knapsack capacities in unary: starting with an empty C, for each knapsack i we append c_i ones followed by a single 0. Because the length of C depends on the numeric knapsack weights, this is a pseudopolynomial-time reduction. Next, we let the read length tally equal the item weight tally. Finally, we run our decompression algorithm on the coverage vector C and the read length tally. The algorithm packs reads into the nonzero stretches of C. This solution is converted to an MSSP solution by converting reads to the corresponding items and stretches of the coverage vector to the corresponding knapsacks.
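The unary construction above can be sketched in a few lines of Python (the function names are ours, for illustration only, not from the Boiler codebase):

```python
from collections import Counter

def build_coverage_vector(capacities):
    """Encode knapsack capacities in unary: for each capacity c_i,
    append c_i ones followed by a single 0."""
    C = []
    for c in capacities:
        C.extend([1] * c)
        C.append(0)
    return C

def read_length_tally(item_weights):
    """In the reduction, the read length tally of the constructed
    instance is exactly the item weight tally."""
    return Counter(item_weights)
```

For example, capacities (3, 2) yield the vector 1,1,1,0,1,1,0, whose nonzero stretches are the two knapsacks.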
The reduction satisfies the requirements of a pseudo-polynomial transformation (2). Hence, the read decompression problem for unpaired reads is strongly NP-hard.

SUPPLEMENTARY NOTE 2
Greedy algorithm for obtaining reads from coverage vector.
The algorithm works from one end of the coverage vector to the other, removing reads that remain consistent with the coverage vector. We take advantage of the homogeneous read length distribution produced by sequencing experiments by preferentially removing reads of the most common length.
When necessary, we adjust the lengths of previously found reads by a few bases to match the coverage vector as closely as possible.
Initially, we extract reads in end-to-end sets of the form (a, b, n), where a and b are the starting and ending indices in the coverage vector and n is the number of end-to-end reads. Each read set must satisfy n·l_min ≤ (b − a) ≤ n·l_max, where l_min and l_max are the minimum and maximum lengths in the read length distribution, respectively. Each time we find a new read (b, c), we search for an existing read set of the form (a, b, n) and update it to (a, c, n+1). If no such read set exists, we add a new read set (b, c, 1).
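A minimal sketch of this read-set bookkeeping, assuming read sets are stored as a list of (a, b, n) tuples (the helper name is ours):

```python
def add_read(read_sets, b, c):
    """Record a newly found read (b, c): if an existing end-to-end set
    ends at b, extend it to c and increment its read count; otherwise
    start a new set containing just this read."""
    for i, (a, end, n) in enumerate(read_sets):
        if end == b:
            read_sets[i] = (a, c, n + 1)
            return
    read_sets.append((b, c, 1))
```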
extend(x_0, x_1) searches for a read set of the form (a, x_0, n) satisfying n·l_min ≤ (x_1 − a) ≤ n·l_max; if one is found, it is updated to (a, x_1, n) and the coverage vector is decremented by 1 in the range [x_0, x_1).
shorten(x_0, x_1) searches for a read set of the form (a, x_1, n) satisfying n·l_min ≤ (x_0 − a) ≤ n·l_max; if one is found, it is updated to (a, x_0, n) and the coverage vector is incremented by 1 in the range [x_0, x_1).
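Under the same list-of-tuples representation, extend and shorten might look like the following sketch (the argument order and data structures are our assumptions, not Boiler's actual implementation):

```python
def extend(read_sets, cov, x0, x1, l_min, l_max):
    """Find a read set (a, x0, n) whose n reads can legally stretch to
    cover [a, x1); move its end to x1 and consume the extra coverage."""
    for i, (a, b, n) in enumerate(read_sets):
        if b == x0 and n * l_min <= (x1 - a) <= n * l_max:
            read_sets[i] = (a, x1, n)
            for j in range(x0, x1):
                cov[j] -= 1
            return True
    return False

def shorten(read_sets, cov, x0, x1, l_min, l_max):
    """Find a read set (a, x1, n) that remains valid when it ends at x0;
    move its end back to x0 and return the freed coverage."""
    for i, (a, b, n) in enumerate(read_sets):
        if b == x1 and n * l_min <= (x0 - a) <= n * l_max:
            read_sets[i] = (a, x0, n)
            for j in range(x0, x1):
                cov[j] += 1
            return True
    return False
```

Both return a boolean so the caller can fall through to the next repair strategy when no suitable read set exists.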
These functions allow us to adjust previously found reads by small amounts to accommodate later reads. The read extraction algorithm works as follows. Let start and end be the indices of the first and last nonzero elements in the coverage vector, respectively. We find a and b such that cov[i] > 0 for all i ∈ [start, a), cov[a] = 0, and b is the first index after a at which the coverage vector is nonzero again. Special end case: if a = end and a < start + l_min, we first attempt to run extend(start, a). If unsuccessful, we decrement the coverage vector in the range [start, a) but do not add a new read.
If (a − start) ≥ l_mode, we add a new read (start, start + l_mode) and update the coverage vector. Otherwise, we attempt to run extend(start, a). If unsuccessful, we attempt to run shorten(a, b). If this is also unsuccessful, we do one of the following:
1. If (a − start) ≥ l_min, we add a new read (start, a) and update the coverage vector.
2. If l_min/2 ≤ (a − start) < l_min, we add a new read (start, start + l_mode) and update the coverage vector.
3. If (a − start) < l_min/2, we decrement the coverage vector in the range [start, a) but do not add a new read.
We then update start and end and repeat until the coverage vector is empty.
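The loop above can be sketched as follows; this simplified version omits the extend/shorten repair steps and the special end case, and collapses the fallback cases to emitting an l_mode-length read, emitting a shorter read, or discarding a stretch too short to be a read:

```python
def extract_reads(cov, l_min, l_mode):
    """Simplified left-to-right greedy extraction from a coverage
    vector.  Returns a list of (start, end) reads; mutates cov."""
    reads = []
    n = len(cov)
    while True:
        # start: first nonzero position, or None if the vector is empty
        start = next((i for i in range(n) if cov[i] > 0), None)
        if start is None:
            return reads
        # a: first zero at or after start
        a = next((i for i in range(start, n) if cov[i] == 0), n)
        if a - start >= l_mode:
            end = start + l_mode      # prefer the most common length
        elif a - start >= l_min:
            end = a                   # shorter, but still a legal read
        else:
            # too short to be a read: drop this stretch of coverage
            for i in range(start, a):
                cov[i] -= 1
            continue
        reads.append((start, end))
        for i in range(start, end):
            cov[i] -= 1
```

On coverage produced entirely by l_mode-length reads, this sketch recovers the reads exactly; the extend/shorten steps in the full algorithm handle the remaining cases.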

SUPPLEMENTARY NOTE 3
Tool versions and parameter settings. Boiler and CRAMTools were run with default options. Goby was run in full "ACT H+T+D" mode, as described in the "Goby parameter settings" section of the Goby study.
Boiler and Goby remove read names by default, but CRAM does not. CRAMTools has a --preserve-read-names option, but we could not find a working mechanism in version 3 for removing read names. Thus, for a fairer comparison, we stripped the read names before compressing.
When running Cufflinks, we used the --no-effective-length-correction option to avoid variability due to an issue (recently resolved) in how Cufflinks performs effective transcript length correction. StringTie was run with default parameters.
A full listing of the parameter settings used for the evaluations is at the following URL: http://bit.ly/boiler-201605-expts

SUPPLEMENTARY NOTE 4
Boiler Compression Ratio for HISAT Output
Table 2 compares the compression ratio and compressed size for alignment files generated by TopHat 2 and HISAT. The initial file size is the size of the BAM with read names removed. For most paired-end datasets the HISAT alignment BAM was larger than the TopHat 2 BAM, but the compressed files generated by Boiler were roughly the same size. This leads to a better compression ratio for HISAT alignments.

Goby Compression Ratio
In Table 4 and Figure 3 of our main paper, we compare Boiler's compression rate to CRAM and Goby. As summarized in Table 1 of the main paper, CRAM does not preserve quality scores and tags by default, and we further remove read names before compressing for a fair comparison. Goby discards read names and most tags, but preserves quality score information and the 'MD' tag for mismatches. Goby also preserves the identity of orphaned reads, which Boiler converts to unpaired reads. Table 2 compares the effect on Goby's compression rate of (1) converting orphaned reads to unpaired reads by modifying the SAM flags field, and (2) replacing all quality scores with '#' for efficient compression. For all datasets, orphaned reads represent only a small fraction of Goby storage, and removing quality scores reduces compressed size by less than 5% in unpaired datasets and less than 3% in all paired-end datasets.

SUPPLEMENTARY NOTE 5
Though Boiler allows the user to pose targeted queries without decompressing the entire file, we also evaluate how long Boiler takes to decompress an entire file relative to other tools. These results are presented in Table 3. Overall, Boiler takes roughly 2 to 4 times longer than CRAMTools and about 1 to 2 times longer than Goby to decompress entire files.

SUPPLEMENTARY NOTE 6
We investigated the effect of varying the threshold k in the exon scoring equation (1) in the main paper. As k increases, the scoring function becomes more relaxed, allowing exons with more divergent boundaries to contribute to the score. If k = 0, two exons receive a score of 1 if they are identical and 0 otherwise. At the other extreme, as k → ∞ the score approaches 1 for all pairs of exons. Figure 1 shows the precision and recall at varying k thresholds for the simulated D. melanogaster dataset containing 1 million paired-end reads. We observe that, across the range of k cutoffs, accuracy decreases slightly after Boiler compression. Both plots show an inflection point around k = 5 to 10, after which the accuracy measure becomes more stable. Based on these results, we used a threshold of k = 10 in all experiments.
SUPPLEMENTARY NOTE 7
Table 4 shows the read-level precision and recall for the HISAT-generated alignments, before and after compression with Boiler.

SUPPLEMENTARY NOTE 8
Tables 5 and 6 show the transcript-level precision and recall not weighted by coverage, compared to the reference transcriptome. Tables 7 and 8 show the precision and recall not weighted by coverage for the direct comparison of transcriptomes before and after compression.

SUPPLEMENTARY NOTE 9
Weighted k-mer recall.
We assess fidelity by measuring weighted k-mer recall (WKR), a component of the KC score developed by Li et al. (3) to assess transcriptome assemblies. WKR measures the degree to which an assembly recovers k-mers from the true simulated transcriptome, weighted by the abundances of the simulated transcripts containing each k-mer. For a k-mer r, its frequency profile p(r) is defined as:

p(r) = ( Σ_{t∈T} n(r,t)·c(t) ) / ( Σ_{t∈T} n(t)·c(t) )

where T is the simulated transcriptome and, for each transcript t ∈ T:
• n(r,t) is the number of times r occurs in t,
• n(t) is the total number of k-mers in t, and
• c(t) is the coverage of t.
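The definition above can be transcribed directly into Python, representing the simulated transcriptome as a map from transcript sequence to coverage (the names and representation are ours); we also include the resulting weighted k-mer recall, i.e., the total p(r) weight of the true k-mers recovered by an assembly:

```python
from collections import defaultdict

def kmers(seq, k):
    """All (overlapping) k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def frequency_profile(transcripts, k):
    """p(r) = sum_t n(r,t)*c(t) / sum_t n(t)*c(t), where transcripts
    maps each transcript sequence t to its coverage c(t)."""
    num = defaultdict(float)
    denom = 0.0
    for seq, cov in transcripts.items():
        ks = kmers(seq, k)
        denom += len(ks) * cov       # n(t) * c(t)
        for r in ks:
            num[r] += cov            # accumulates n(r,t) * c(t)
    return {r: v / denom for r, v in num.items()}

def wkr(true_transcripts, assembly_seqs, k):
    """Weighted k-mer recall: total p(r) over true k-mers that also
    appear somewhere in the assembly."""
    p = frequency_profile(true_transcripts, k)
    recovered = set()
    for seq in assembly_seqs:
        recovered.update(kmers(seq, k))
    return sum(v for r, v in p.items() if r in recovered)
```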
Letting R(T) be the set of all k-mers in transcriptome T, the WKR of an assembly A is the total frequency-profile weight of the true k-mers it recovers:

WKR(A) = Σ_{r ∈ R(T) ∩ R(A)} p(r)

WKR is defined with respect to the true transcriptome T, which we obtain from Flux Simulator's output. The GEUVADIS sample is not considered here, since it is not simulated. Figure 2 shows that WKR is largely unchanged after Boiler compression across a range of k-mer length settings. It also shows that the difference in WKR is more pronounced for Cufflinks than for StringTie. Table 9 shows the WKR for k = 15 for all datasets. Overall, the differences are slight; the biggest difference at k = 15 is an increase of 0.4% for the paired-end 2.5M-read D. melanogaster sample.

SUPPLEMENTARY NOTE 10
The tripartite score.
There are two versions of this score, strict and loose.

We first construct a tripartite graph containing a node for each transcript in the Cufflinks output for the alignments both before and after compression, as well as a node for each transcript in the reference transcriptome. We add an edge from each transcript in the original set to the best-matching transcript in the reference set, determined using the transcript scoring method described previously. Similarly, we add an edge from each transcript in the compressed set to its best match in the reference set.
For the strict tripartite score, we take all the nodes from the set of reference transcripts that are connected to a single node A_i from the set of original transcripts and a single node B_i from the set of compressed transcripts. The final score is the average of the transcript scores over every pair A_i, B_i. For the loose tripartite score, we take all the nodes from the set of reference transcripts that are connected to at least one node from the set of original transcripts and at least one node from the set of compressed transcripts. Let A_i be the original transcript with the highest score against the reference node, and let B_i be the compressed transcript with the highest score against the reference node. The final score is the average of the transcript scores over every pair A_i, B_i. Tables 10 and 11 show the tripartite scores alongside the percentage of transcripts from the original and compressed sets that contribute to the score.
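The two scores can be sketched as follows, with the transcript scoring method abstracted as a score(t1, t2) function returning a value in [0, 1] (a placeholder for the method described previously; transcripts may be any hashable objects):

```python
def tripartite_scores(orig, comp, ref, score):
    """Compute (strict, loose) tripartite scores.  Each original and
    compressed transcript is connected to its best-matching reference
    transcript under `score`; reference nodes then pair up original
    and compressed transcripts as described in the text."""
    def best(t, pool):
        return max(pool, key=lambda r: score(t, r))

    # edges from each original / compressed transcript to its best
    # reference match
    orig_edges = {t: best(t, ref) for t in orig}
    comp_edges = {t: best(t, ref) for t in comp}

    strict, loose = [], []
    for r in ref:
        A = [t for t, m in orig_edges.items() if m == r]
        B = [t for t, m in comp_edges.items() if m == r]
        if len(A) == 1 and len(B) == 1:
            strict.append(score(A[0], B[0]))       # strict: unique pair
        if A and B:
            a = max(A, key=lambda t: score(t, r))  # loose: best on each side
            b = max(B, key=lambda t: score(t, r))
            loose.append(score(a, b))

    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(strict), avg(loose)
```

For illustration, a Jaccard similarity over exon sets can stand in for the score function; the actual transcript scoring method is the one described above.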

SUPPLEMENTARY NOTE 11
Transcript quantification with Cufflinks and StringTie.
We measured Boiler's effect on quantification accuracy when transcripts are quantified directly from alignments without an initial assembly step. We ran Cufflinks and StringTie in quantification-only mode (-G for Cufflinks and -G -e for StringTie). As input we used the alignment BAM files output by TopHat 2 in each of the simulation experiments described in the main text. We ran each tool on the BAM file both before and after Boiler compression. For comparison, we also ran each tool on each of the 5 technical replicates described in the "Fidelity" section of the main text. The reference transcriptome provided to each tool was the same one used to generate the original dataset with Flux Simulator. Both tools output a file called "isoforms.fpkm_tracking" with FPKM estimates for all transcripts. Figure 3 plots Boiler-compressed vs. original FPKM for all transcripts in the StringTie quantitation output for the D. melanogaster 20M paired-end dataset (blue points). We also plot the same for a randomly chosen pair of technical replicates (red points).
While the Boiler RMSE for Cufflinks quantitation is often (though not always) higher than the highest observed for the technical replicates, the Boiler RMSE for StringTie quantitation is always lower than the lowest observed for the technical replicates. This is likely because Boiler removes read names during compression, breaking the relationship between a multi-mapping read and its several alignments. This in turn affects Cufflinks quantification, which depends on read names to group multi-mapping alignments during quantification. A subject for future work is whether and how Boiler can preserve enough information about multi-mapping alignments to control RMSE without substantially decreasing compression ratio.
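The RMSE values referred to above can be computed from paired FPKM vectors as in this sketch (whether the published comparison used raw or log-transformed FPKMs is not stated here; this version uses raw values):

```python
import math

def fpkm_rmse(fpkm_a, fpkm_b):
    """Root-mean-squared error between two FPKM vectors over the same
    transcripts, e.g. original vs. Boiler-compressed quantitation."""
    diffs = [a - b for a, b in zip(fpkm_a, fpkm_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```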
Note that these technical replicates are simulated, and so exhibit less inter-sample variability than real-world replicates.
Supplementary Figure 3. Accuracy of quantification results computed with StringTie.