Abstract

Motivation

The bulk of the space taken up by CRAM files from next-generation sequencing (NGS) consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving.

Results

On the Syndip test set, a 17-fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2- to 7.4-fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details).

Availability and implementation

Crumble is open source and can be obtained from https://github.com/jkbonfield/crumble.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The rapid reduction in the cost of genome sequencing (Wetterstrand, 2016) has far outstripped both Moore’s Law for CPUs and Kryder’s Law for disk storage, leading to a corresponding growth in storage costs and to considerable research into DNA sequence data compression (Numanagić et al., 2016).

The most significant component in data storage cost is the per-nucleotide confidence values, which carry information about the likelihood of each base call being in error. The original CRAM proposal (Fritz et al., 2011) introduced the term ‘quality budget’ for lossy compression: given a fixed amount of storage, we can decide how to spend this budget, either by degrading all qualities uniformly or by preserving full fidelity only in important regions. How to target this budget has been the focus of lossy compression research, with two main strategies: ‘horizontal’ and ‘vertical’.

‘Horizontal’ compression smooths qualities along each sequence in turn, as implemented in libCSAM (Cánovas et al., 2014), QVZ (Malysa et al., 2015) and FaStore (Roguski et al., 2018) or via quantization (Illumina, 2014). This type of compression can be applied before alignment and is entirely reference free.
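For illustration, the sketch below applies ‘horizontal’ quantization to a quality string using the widely cited 8-level binning scheme from the Illumina white paper; the exact bin boundaries shown should be treated as an assumption for illustration rather than a description of any specific tool.

```python
# Reference-free 'horizontal' quantization: collapse ~40 distinct Phred
# values into 8 representative levels, one read at a time.

# (low, high, representative) bins per the commonly published scheme;
# treat these boundaries as illustrative assumptions.
BINS = [(2, 9, 6), (10, 19, 15), (20, 24, 22), (25, 29, 27),
        (30, 34, 33), (35, 39, 37), (40, 99, 40)]

def quantize_quality(q: int) -> int:
    if q < 2:                       # 'no call' qualities pass through
        return q
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    return q

print([quantize_quality(q) for q in [38, 37, 12, 40, 27, 2]])
# -> [37, 37, 15, 40, 27, 6]
```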

‘Vertical’ compression takes a slice through an aligned dataset in the SAM format (Li et al., 2009) to determine which qualities to keep and which to discard, as used in CALQ (Voges et al., 2018), or via hashing techniques on unaligned data in Leon (Benoit et al., 2015) and GeneCodeq (Greenfield et al., 2016). Traditional loss measures, such as mean squared error, appear very poor for these tools, which instead focus on minimizing changes to the post-processed results (variant calls).

We present Crumble as a mixture of both horizontal and vertical compression. It operates on coordinate sorted aligned Sequence Alignment/Map (SAM), Binary Alignment/Map (BAM) or CRAM files. Although this approach does not explicitly use a reference, the sequence aligner does, which may result in some reference bias.

2 Materials and methods

A variant caller evaluates the sequence base calls overlapping each genome locus, along with their associated qualities, to determine whether that site represents a variant. Irrespective of whether the call is a variant, if the same call is made with comparable confidence both with and without the quality values present, then the qualities are not necessary in that column.

Naïvely, assessing this change requires running the variant caller twice, but by limiting quality removal to sites with high-confidence calls the second run can be avoided. We implemented a fast, but naïve, caller derived from Gap5’s consensus algorithm (Bonfield and Whitwham, 2010). This is a pure pileup-column oriented approach that treats the lack of a base (deletion) as a fifth base type (‘*’) and then identifies the most likely homozygous or heterozygous combination of bases given the observed base calls, confidence values and mapping qualities. It therefore assumes a single diploid individual and is not tuned to work with somatic variants. The consensus confidence is further reduced according to the bases that do not match the chosen hypothesis, deliberately producing a pessimistic caller. The aim is not to build in a high-quality caller, but to preserve quality values wherever any downstream variant caller may be uncertain, while remaining independent of any standard tool.
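To make the calculation concrete, here is a minimal sketch of this pileup-column genotype call. It spreads a base-call error evenly over the four other base types, omits mapping qualities and the deliberate pessimism described above, and illustrates the idea rather than Crumble’s actual implementation.

```python
# Pick the most likely homozygous or heterozygous genotype for one
# pileup column, treating a deletion ('*') as a fifth base type.
import itertools
import math

BASES = "ACGT*"

def call_column(column):
    """column: list of (base, phred_quality) pairs overlapping one locus.
    Returns (genotype, phred_scaled_confidence)."""
    scores = {}
    for gt in itertools.combinations_with_replacement(BASES, 2):
        logp = 0.0
        for base, qual in column:
            p_err = 10.0 ** (-qual / 10.0)
            # P(observed base | allele), averaged over the two alleles;
            # an erroneous call is spread evenly over the other 4 types.
            p = sum((1.0 - p_err) if base == allele else p_err / 4.0
                    for allele in gt) / 2.0
            logp += math.log(p)
        scores[gt] = logp
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_gt, best), (_, runner_up) = ranked[0], ranked[1]
    # Confidence: Phred-scaled gap between the two best hypotheses.
    conf = 10.0 * (best - runner_up) / math.log(10.0)
    return "".join(best_gt), conf

# Three high-quality As plus three gaps call a heterozygous deletion:
print(call_column([("A", 30)] * 3 + [("*", 25)] * 3))  # ('A*', ~75)
```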

Even when deemed unnecessary, qualities cannot be discarded entirely, as downstream tools expect them to be present. Replacing the qualities of bases that agree with a confident consensus call by a constant high value reduces the entropy of the quality signal. Qualities of bases that disagree with a confident consensus call may optionally be set to a constant low value, heavily quantized or left intact.
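A minimal sketch of this replacement rule follows; the constants QUAL_HIGH and QUAL_LOW are illustrative placeholders, not Crumble’s actual defaults.

```python
# Rewrite a read's qualities given per-column consensus calls: agreeing
# bases under a confident call get a constant high value, disagreeing
# bases a constant low one, and columns without a confident call keep
# their original values verbatim.
QUAL_HIGH = 40   # placeholder for the constant 'agreeing' value
QUAL_LOW = 10    # placeholder for the optional 'disagreeing' value

def replace_quals(bases, quals, consensus, confident):
    out = []
    for base, qual, cons, ok in zip(bases, quals, consensus, confident):
        if not ok:
            out.append(qual)          # caller unsure: keep verbatim
        elif base == cons:
            out.append(QUAL_HIGH)     # agrees with confident consensus
        else:
            out.append(QUAL_LOW)      # disagrees: constant low value
    return out

# Long runs of identical values compress extremely well downstream:
print(replace_quals("ACGTT", [31, 17, 38, 25, 9],
                    "ACGTA", [True, True, True, True, False]))
# -> [40, 40, 40, 40, 9]
```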

There are sites where any variant caller may make an incorrect call with high confidence. Furthermore, the reference itself may be wrong, and a subsequent realignment to an updated reference may change read locations and alignment strings. We do not wish to replace qualities in such regions. We therefore use a set of heuristics to identify potentially unreliable calls and retain verbatim the confidence values for those locations and, depending on sequence context, the surrounding bases. Similarly, there are places where an entire read needs its qualities retained because there is evidence that it is misplaced or part of a large structural rearrangement.

The heuristics used in Crumble to identify where confidence values should be retained vary with the requested compression level, but include the following (a sketch of the first check appears after the list):

  • Concordant soft clipping: many reads with soft-clipped bases at the same site often indicate a large insertion (absent from the reference) or contamination.

  • Excessive depth: possible contamination or collapsed repeat. Variant calls often appear unusually good in such data, even when wrong.

  • Low mapping quality: possibly caused by a poor reference. We can also optionally store quality values for reads with high mapping quality that colocate with many low mapping quality reads.

  • Unexpected number of variants: we assume data from a single diploid sample with at most two alleles at each locus. More than two alleles imply misaligned data, duplication or contamination.

  • Low quality variant calls: typically a single base locus where the consensus is unclear.

  • Proximity to short tandem repeats: alignments are often poor in such regions, especially if indels are present, leading to bases occurring in the wrong pileup column.
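As mentioned above, here is a minimal sketch of the concordant soft-clipping check, over a simplified tuple representation of aligned reads; the threshold and the data model are illustrative assumptions, not Crumble’s internals.

```python
# Flag reference positions where many reads share a soft-clip boundary;
# such pile-ups of clips suggest a large insertion or contamination, so
# the surrounding qualities should be kept verbatim.
from collections import Counter

def softclip_hotspots(reads, min_reads=5):
    """reads: iterable of (reference_start, reference_end, cigar), where
    cigar is a list of (op, length) pairs and op "S" is a soft clip.
    Returns positions where at least min_reads reads are clipped."""
    clip_sites = Counter()
    for start, end, cigar in reads:
        if cigar and cigar[0][0] == "S":    # leading clip ends at `start`
            clip_sites[start] += 1
        if cigar and cigar[-1][0] == "S":   # trailing clip starts at `end`
            clip_sites[end] += 1
    return sorted(pos for pos, n in clip_sites.items() if n >= min_reads)

# Three reads whose leading clips all abut reference position 100:
reads = [(100, 180, [("S", 20), ("M", 80)])] * 3
print(softclip_hotspots(reads, min_reads=3))   # -> [100]
```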

Finally, for the quality values that we deem necessary to keep, we optionally provide horizontal compression via the P-block algorithm from libCSAM. This is most useful on older Illumina datasets with over 40 distinct quality levels.
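A sketch of P-block-style smoothing is shown below. It assumes a block keeps growing while its running min/max span stays within 2p, then emits the midpoint as the representative, which bounds every replacement error by ±p; the published libCSAM algorithm may differ in detail.

```python
# Greedy P-block smoothing: replace each run of similar qualities with a
# single representative such that |original - replacement| <= p.
def pblock(quals, p):
    out = []
    lo = hi = quals[0]
    run = 1
    for q in quals[1:]:
        if max(hi, q) - min(lo, q) <= 2 * p:   # q still fits this block
            lo, hi = min(lo, q), max(hi, q)
            run += 1
        else:                                  # close block, start a new one
            out += [(lo + hi) // 2] * run
            lo = hi = q
            run = 1
    out += [(lo + hi) // 2] * run
    return out

print(pblock([38, 40, 37, 39, 12, 11, 13], p=2))
# -> [38, 38, 38, 38, 12, 12, 12]
```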

The Crumble algorithm is amenable to streaming and does not require large amounts of memory to operate.

3 Results

Analysis of how quality compression affects variant calling was performed on Syndip (Li et al., 2018), an Illumina-sequenced library artificially constructed from the haploid cell lines CHM1 and CHM13, with an associated high-quality truth set based on two PacBio assemblies (Schneider et al., 2017). Compared with the Genome in a Bottle (GIAB) and Platinum Genomes (PlatGen) truth sets, Syndip contains a considerably larger set of tricky indels, giving SNP false positive rates 5–10 times higher (Li et al., 2018). Although Syndip still requires a list of regions to exclude, the total number of excluded non-N reference bases is 40% lower than for GIAB 3.3.2. Restricting analysis to the regions within Syndip but not within GIAB, we observe that 65% of Chromosome 1 false positives occur within this region, yet Crumble still shows good performance (see Supplementary Material).

The input BAM file (ERR1341796) had previously been created following GATK best practices, including the IndelRealigner and Base Quality Score Recalibration steps. To test the impact on raw variant calling, we ran GATK HaplotypeCaller (Poplin et al., 2017), Bcftools (Li, 2011) and Freebayes (Garrison and Marth, 2012), filtering to calls of quality 30 or above, without GATK Variant Quality Score Recalibration. As baselines we compare Crumble against the original lossless results and against a single fixed quality value. The latter test demonstrates that quality values are important, but that only a small quality budget is needed to achieve results comparable to lossless compression. Indeed, we observe that vertical quality score compression can marginally improve variant calling by standard callers, as noted previously in the QVZ (Malysa et al., 2015) and Leon (Benoit et al., 2015) papers.

Table 1 shows the lossless GATK results on Syndip along with the changes caused by lossy compression, using a variety of Crumble options on both the full Syndip data and a low-coverage subset. We chose the minimal compression level (-1), an expected maximum compression level (-9p8) and a set of manually tuned parameters optimized for this dataset. The manual tuning traded false positives against false negatives in an attempt to obtain a call set comparable to or better than the original in all regards. It is unknown whether the tuned parameters are appropriate for all datasets. More complete comparisons, including against other tools, are available in the online Supplementary Material.

Table 1.

Effect of lossy quality compression on 50× and 15× Syndip data using GATK HaplotypeCaller

Category                  Original  Original F  Crumble-1  Crumble-1 F  Crumble-9p8  Crumble-9p8 F  Crumble*  Crumble* F
50× Qual size (MB)            4107                    614                       235                     229
50× SNP False Positive        6226        2968       –359          –79         –251            –67     –526        –181
50× SNP False Negative        4648        7625          0          –53          –25           –184      +41        –123
50× Indel False Positive      3965        3649         –7          –41          +19             +9      –35         –32
50× Indel False Negative      7881        7972         +7          +11         –103            –82      –93         –72
15× Qual size (MB)            1211                    260                        77                      72
15× SNP False Positive        4798        2517        –10          +63         +347           +225     –359         –29
15× SNP False Negative       14985       27761       –205         –297        –3027          –4608    –1866       –2865
15× Indel False Positive      2781        2521         +2          –14         +109            +60      +53         +26
15× Indel False Negative     13136       13925         –8           +5         –484           –427     –444        –410

Note: Comparison of unfiltered and filtered (marked with ‘F’) calls against the Syndip truth set. GATK filtering rules are listed in the Supplementary Material. Crumble* refers to parameters optimized for this dataset: ‘crumble -9p8 -u30 -Q60 -D100’. The false positive/negative counts for GATK calls on the Crumble-processed data are shown relative to the corresponding lossless GATK call set. The truth set for Chromosome 1 has 269 655 SNPs and 46 036 indels, counting multi-allelic sites once per allele. Quality sizes apply to the whole file regardless of call filtering, hence a single value per compression setting.


On the original BAM file with ∼50× coverage we observed a 17-fold reduction in the size of CRAM-compressed quality values, while achieving a 6% drop in the filtered SNP false positive rate (higher precision) and a 2% drop in the false negative rate (higher recall). Indels also see a 1% improvement in both measures. At a sub-sampled 15× coverage we see a 1% drop in filtered SNP false positives and a 10% reduction in SNP false negatives. Indel calls were more comparable, with 1% more false positives and 3% fewer false negatives.

It is likely these gains to both SNP precision and recall only apply to data coming from a single individual, but they demonstrate the efficacy of lossy quality compression.

4 Conclusion

We have demonstrated that Crumble, when combined with CRAM, can greatly reduce file storage costs while having a minimal, if not beneficial, impact on variant calling accuracy for individual samples. For maximum compression Crumble also permits discarding read identifiers and some auxiliary tags, typically yielding files of 5–10 GB for a 30× whole genome processed with Crumble -9p8. Across a variety of BAM and CRAM files this gave an overall file size reduction of 3- to 7.8-fold (details in Supplementary Material).

Crumble is designed to operate on a single-sample file. For multiple samples it is best to apply Crumble to each sample independently, produce a gVCF per sample, and then call jointly from those. Note that Crumble is explicitly designed for diploid data, so it is not appropriate for sequence from cancer or other samples with subclonal genetic structure.

Acknowledgements

We would like to thank Yasin Memari for help testing and evaluating an earlier version of the program.

Funding

This work was funded by the Wellcome Trust [WT098051].

Conflict of Interest: none declared.

References

Benoit, G. et al. (2015) Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics, 16, 288.

Bonfield, J.K. and Whitwham, A. (2010) Gap5–editing the billion fragment sequence assembly. Bioinformatics, 26, 1699–1703.

Cánovas, R. et al. (2014) Lossy compression of quality scores in genomic data. Bioinformatics, 30, 2130–2136.

Fritz, M.H.-Y. et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res., 21, 734–740.

Garrison, E. and Marth, G. (2012) Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907.

Greenfield, D.L. et al. (2016) GeneCodeq: quality score compression and improved genotyping using a Bayesian framework. Bioinformatics, 32, 3124–3132.

Illumina (2014) Reducing whole-genome data storage footprint. Technical report. www.illumina.com/documents/products/whitepapers/whitepaper_datacompression.pdf (6 October 2017, date last accessed).

Li, H. (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987–2993.

Li, H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.

Li, H. et al. (2018) A synthetic-diploid benchmark for accurate variant calling evaluation. Nat. Methods. doi: 10.1038/s41592-018-0054-7.

Malysa, G. et al. (2015) QVZ: lossy compression of quality values. Bioinformatics, 31, 3122–3129.

Numanagić, I. et al. (2016) Comparison of high-throughput sequencing data compression tools. Nat. Methods, 13, 1005–1008.

Poplin, R. et al. (2017) Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, doi: 10.1101/201178.

Roguski, Ł. et al. (2018) FaStore: a space-saving solution for raw sequencing data. Bioinformatics, 34, 2748–2756.

Schneider, V.A. et al. (2017) Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res., 27, 849–864.

Voges, J. et al. (2018) CALQ: compression of quality values of aligned sequencing data. Bioinformatics, 34, 1650–1658.

Wetterstrand, K.A. (2016) DNA sequencing costs: data from the NHGRI Genome Sequencing Program (GSP). www.genome.gov/sequencingcostsdata (6 October 2017, date last accessed).
