Crumble: reference free lossy compression of sequence quality values

Abstract

Motivation: The bulk of space taken up by NGS sequencing CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving.

Results: On the Syndip test set, a 17 fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2 to 7.4 fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details).

Availability and implementation: Crumble is OpenSource and can be obtained from https://github.com/jkbonfield/crumble.

Supplementary information: Supplementary data are available at Bioinformatics online.

    bcftools norm -m -both -t $region -f $href $v 2>/dev/null | \
        vt decompose_blocksub - | \
        bcftools view -T ^$exclude.bed | \
        bcftools view -T $include.bed > $v.norm.vcf

The normalised / filtered files are then compared with bcftools isec to count the shared variants between truth and call sets and those occurring only in one file:

    bcftools isec -c both -p $call.isec $truth.norm.vcf.gz $call.norm.vcf.gz

This is a relatively strict definition of identity, meaning that the variant must occur at both the same site and be the same call. The isec command produces 4 VCF files in the $call.isec directory:

0000.vcf: private to truth.norm.vcf (false negatives)
0001.vcf: private to call.norm.vcf (false positives)
0002.vcf: records from truth.norm.vcf, shared by both files (correct calls)
0003.vcf: records from call.norm.vcf, shared by both files (correct calls)

By counting the VCF records in each file we observe the recall and precision (a sketch follows below). The files can be filtered by quality and type using bcftools view, for example:

    FN_SNP=`bcftools view -H -i "TYPE='snp' && QUAL >= 30" $call.isec/0000.vcf | wc -l`

More aggressive filtering was also applied based on the recommended practices from each tool, where available. The following are exclusion filter rules, applied using `bcftools view -e $filter`. We also applied a simple over-depth filter of DP>90 for the full 50x sample and DP>30 for the 15x sample.
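For illustration, a minimal sketch of that record counting (a sketch only; the file layout follows the isec output above, and bc handles the arithmetic):

    TP=$(bcftools view -H $call.isec/0002.vcf | wc -l)   # shared: correct calls
    FP=$(bcftools view -H $call.isec/0001.vcf | wc -l)   # private to the call set
    FN=$(bcftools view -H $call.isec/0000.vcf | wc -l)   # private to the truth set
    echo "recall:    $(echo "scale=4; $TP / ($TP + $FN)" | bc)"
    echo "precision: $(echo "scale=4; $TP / ($TP + $FP)" | bc)"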
It is also noted that the normalisation step is not always perfect and we cannot compute whether a compound insertion and deletion is identical to a series of SNPs. Hence some of the reported numbers of false positives / negatives may be pessimistic. However we do not believe the results are biased in favour of any specific method of quality reduction.

Results
The original BAM input file was chromosome 1 of CHM1_CHM13_2.bam, from ERR1341796 with depth ∼50x. We also subsampled this to evaluate performance on a ∼15x data set, where quality values become much more important.
The first assessment evaluates the baseline of lossless quality values, followed by no quality values (replacing them all with a fixed score) to demonstrate the impact of having any quality values at all.
Subsequent tests evaluate quality quantisation, Crumble, CALQ and QVZ2. We test variant calling precision and recall using GATK HaplotypeCaller, Bcftools and Freebayes.
Tables below show the number of true positives (TP), false positives (FP) and false negatives (FN) for all variants, after filtering by quality, and with a more complete filtering by quality, depth and per-tool recommended rules.
For our tables we use variant quality 30 in our filters, but variant callers calibrate quality values differently and the trade-off between precision and recall may shift at a different quality threshold.
To get a better comparison between tools and the effect that variant quality filtering has on each tool, we plot the true positive vs false positive rates as a line, with points produced by varying the quality filter over the values 10, 15, 20, 25, 30, 40, 50 and 75.

For the binary quantisation, we observe a dip in the quality frequency distribution between 16 and 20, so we split the distribution into bases with quality >= 20 and those below. Counting expected errors in the same manner, these two bins lead to amortised base quality scores of 4 and 28 which, unlike the single quality of 13, work well for all three tools.
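A minimal sketch of that amortised-quality computation (assuming, for illustration, a two-column "quality count" histogram in a file qual_hist.txt; the file name and format are ours): the amortised score of a bin is the Phred-scaled mean error probability of the bases assigned to it.

    awk '{ bin = ($1 >= 20) ? "high" : "low"   # split at the Q20 dip
           err[bin] += $2 * 10^(-$1 / 10)      # expected errors in this bin
           n[bin]   += $2                      # bases in this bin
         }
         END { for (b in n)                    # Phred-scale the mean error rate
                 printf "%s: Q%.0f\n", b, -10 * log(err[b] / n[b]) / log(10)
         }' qual_hist.txt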
The binary quantisation using values 4 and 28 has minimal impact on bcftools and freebayes recall and accuracy. With GATK it also has minimal impact on the 15x data, but on the 50x data it has a small negative impact.
Tables with the actual counts of true positives, false positives and false negatives are shown below.

As with GATK HaplotypeCaller, Bcftools is harmed by having no quality values. However, the lines showing binary binned (4 and 28) qualities are nearly superimposed on top of the lossless quality calls, at some points being marginally improved by the binning process.
Note the bcftools indel filtering doesn't use quality values, hence these come out as a single point.

As with Bcftools, fixed quality is harmful, but again we see binary quantisation having either no effect or a small benefit.

Tool Comparisons

Given the above analysis, we are also able to do a side-by-side comparison between GATK HaplotypeCaller, Bcftools and Freebayes results on both the 50x and 15x data sets. Such an analysis is not the primary focus of this paper, but given we have the data available it is an interesting diversion.
Missing from these figures is the usefulness of the output. In order to compare between tools and get a constant total number of variants we have split all multi-allelic sites and MNPs into individual records, as this permits Freebayes haplotype calls to be compared against bcftools and GATK HaplotypeCaller. However, doing so removes one of the strengths of Freebayes, namely that neighbouring mutations are phased. It should be noted this is purely a snapshot of one single individual with two alleles in even proportion, so we do not encourage any broader conclusions to be drawn. Also note that regardless of the tool used for calling, the data has previously been passed through GATK BQSR (base quality score recalibration).
On this data set we observe that each tool occupies its own distinct space in the accuracy (true positives) vs recall (false negatives) graph for SNP calling, meaning that each tool has its own strengths.
The lightest compression level (crumble -1) is designed to cope better with subsequent remapping to different reference sequences, achieved by storing more lossless quality values in regions of low mapping score, potential collapsed repeats or missing insertions. However this requires a considerably larger amount of storage.
For the full 50x data set, running crumble -9p8 on chromosome 1 took 41 minutes elapsed time on a 2.2GHz Intel Xeon E5-2660, using 3GB of RAM. Processing the entire genome (a 155GB BAM file) took just over 10 hours, peaking at 3.8GB of RAM.
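For reference, a sketch of such a run (the file names are illustrative, and we assume here that crumble takes input and output files as positional arguments; check the usage text of your version):

    # extract chromosome 1 (sequence name depends on the reference; needs a BAM index)
    samtools view -b CHM1_CHM13_2.bam 1 > chr1.bam
    # heaviest lossy level with the P-block parameter set to 8, as used above
    crumble -9p8 chr1.bam chr1.crumble.bam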
The effect differs slightly per caller, although as expected the lowest level of lossy compression (crumble -1) was always closest to the original calls. Even so, crumble -1 gives a compressed quality size only 14% larger than the binary quantisation method using scores 4 and 28 introduced in the previous section. Crumble with the maximally optimised GATK parameters also appears to work well with bcftools and freebayes, indicating that the optimisation relates more to the data than to the caller.
Both higher levels of crumble tested give around 2.3 times better quality compression than the binary quantisation. There is some variation between 50x / 15x and between SNP / Indel on whether the light Crumble -1 qualities are better than the lossless ones. However, the P-score smoothing and more aggressive compression modes of Crumble are uniformly beneficial across all tests, with the more optimised parameters working best overall.

We only show GATK HaplotypeCaller results for CALQ and QVZ2, as evaluating these tools is not the primary focus of this paper.
Compared to the lossless qualities, with the 50x data sets CALQ gives a significant decrease in true positives. The 15x data set fares better, representing a different trade-off between precision and recall. The compressed quality size is comparable to the lightest compression with crumble (`crumble -1`). It required around 10GB of RAM and took 27 minutes to encode. It uses its own compressed file format for storing the quality values. After decoding we ran the replace_qual_sam.py tool from CALQ to update the SAM file prior to variant calling.
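That round trip looks roughly as follows. This is a sketch only: the flag spellings, argument order and file names are our assumptions, so consult the CALQ documentation for the exact interface.

    calq in.sam -o in.cq        # encode: qualities go to CALQ's own format (flags assumed)
    calq -d in.cq -o in.qual    # decode the quality stream back out (flags assumed)
    python replace_qual_sam.py in.sam in.qual > in.calq.sam   # splice qualities back before calling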
Comparing the Crumble results with QVZ2 we see the effect of minimising the quality mean squared error (QVZ2's objective) vs aggressively increasing and decreasing qualities based on the likelihood of variant calls changing (Crumble's approach). The mean squared error introduced by Crumble will be very significant, but the size reduction is proportionally far greater while still achieving minimal changes to variant calling, in this case a small gain. QVZ2 has minimal impact on calling precision and recall at its lowest level (-t1).
QVZ2 -t4 produces a slight shift towards more false positives with fewer false negatives, but is broadly beneficial, especially post-filtering. The compression ratio at this option is not far behind CALQ and Crumble -1. Finally QVZ2 -t16 gives the smallest file of all (about 10% smaller than crumble -9p8), but has a significant increase in false positives.

Syndip regions
The Syndip data set is not perfect and there are BED files to filter out poor regions. This may lead to concern that we are testing only well-behaved data and do not know how the tools work in hard-to-sequence regions. This concern applies to all truth sets generated from real sequencing data, including the Genome in a Bottle (GIAB) and Platinum Genomes (PlatGen) data sets that have been established for longer. The Syndip paper indicates that testing variant callers on Syndip probes more of the genome, including more difficult parts, leading to substantially higher false positive rates than seen with GIAB and PlatGen. Its Figure 2a reveals that the FPPM of SNPs estimated from Syndip is often 5-10 times higher than the FPPM estimated from GIAB or PlatGen. Looking into the Syndip FP SNPs, most were found to be located in CNVs that are evident in PacBio data in the context of long flanking regions, but look dubious in short-read data alone.
The total number of bases included in chromosome 1 from Syndip is 212.9Mb out of 225.3Mb of non-N reference. This compares favourably to 204.4Mb in filtered GIAB.
Furthermore, we can subtract the GIAB regions from the Syndip regions to obtain only the regions unique to Syndip (around 8.4Mb). For the overall false positive rate to be significantly elevated, either the Syndip data must be highly erroneous or the bulk of the extra false positives must fall within this region.
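One way to derive that Syndip-only region (the choice of bedtools and the file names are our assumptions; any interval-subtraction tool would do):

    # intervals present in the Syndip BED but absent from the GIAB BED
    bedtools subtract -a syndip_chr1.bed -b giab_chr1.bed > syndip_only.bed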
To test this we ran GATK on the data sets filtered to this region alone.
Comparing this to the full Syndip regions for chromosome 1 we see that 65% of false positives occur within this small portion. This addresses the possibility that we are restricting ourselves to only good quality data. The results show that Crumble still performs well in this region.

The effect of Crumble on both data sets is a considerable reduction in quality size within CRAM (32 fold and 15 fold respectively). The ability to perform lossy read name compression (which is part of CRAM rather than Crumble) is hampered on the RNASeq data by having reads split over larger regions, not colocating within the same CRAM slice, and very few being labelled as properly paired. As a consequence the read names are the largest data type in the crumbled RNASeq data set. Neither of these files has an excessively large collection of auxiliary tags.
For both files the ratio of original to crumbled size is higher with CRAM (7.8 and 3.4) than BAM (3.7 and 3.0), demonstrating the benefit of combining lossy quality encoding with a columnar file format.
During preparation of this manuscript a bug was fixed affecting the speed of Crumble on RNASeq data. Thus the RNASeq TopHat data was processed using a more recent git commit (v0.8-4-g556c716). Both versions 0.8 and 0.8-4 were tested on the E.Coli data and observed to give identical results. Final timings were 6 minutes 43 seconds for the E.Coli data and 599 minutes 11 seconds for the RNASeq data, corresponding to unthreaded BAM processing speeds of 3.3MB/s and 0.34MB/s, demonstrating that CPU performance remains somewhat degraded on RNASeq data sets.

Conclusion
As expected, the 15x sample has fewer confident consensus bases than the 50x sample, leading to a slightly lower quality compression ratio; however, even at 15x there is sufficient confidence in calling to discard most quality values.
The original CRAM file for the 15x chromosome 1 comprised 24 million reads and 3.6 billion base pairs, giving 2.67 bits per lossless quality value. After optimal Crumble parameters were applied, this reduced to 0.16 bits per quality.
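As a quick consistency check, the ratio of those two figures reproduces the roughly 17 fold quality-size reduction quoted in the abstract:

    echo "scale=1; 2.67 / 0.16" | bc    # prints 16.6, i.e. the ~17 fold reduction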
It is clear there are many parameters that can be adjusted to control when to change quality values, and to which values. We have not exhaustively explored this search space.
There are also open questions on the performance of Crumble on somatic / non-clonal samples, such as cancers, or mixed sample data sets. Hence we do not recommend the use of Crumble on such data without prior evaluation.
We also do not recommend usage of Crumble on non-Illumina data sets until further evaluation has been made.