SPRING: a next-generation compressor for FASTQ data

Chandak, Shubham; Tatwawadi, Kedar; Ochoa, Idoia; Hernaez, Mikel; Weissman, Tsachy

doi:10.1093/bioinformatics/bty1015

Abstract

Motivation

High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors cannot exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack one or more crucial properties, such as support for variable-length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression.

Results

In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools. For example, it compresses 195 GB of 25× whole genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources.

Availability and implementation

SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The amount of genomic data produced has increased tremendously in the past few years, driven mainly by improvements in High-Throughput Sequencing technologies and the reduced cost of sequencing a genome. A single human genome sequencing experiment typically yields hundreds of millions of short reads (of length 100–150 bp), which are (possibly corrupted) substrings of the same underlying genome sequence. These raw sequencing data are typically stored in the FASTQ format, which consists of the reads, the quality values that indicate the confidence in each read sequence, and the read identifiers, which carry metadata related to the sequencing process. In most cases, the reads are sequenced in pairs from short fragments of the genome, resulting in paired-end FASTQ files. A typical FASTQ dataset for a human genome sequencing experiment requires hundreds of GBs of storage space (for a typical sequencing coverage of 30×). Due to the huge sizes involved, compression of the FASTQ files is of utmost importance for their storage and distribution.

There is a significant amount of recent work on FASTQ compression (Numanagić et al., 2016), including SCALCE (Hach et al., 2012), Fqzcomp (Bonfield and Mahoney, 2013), DSRC 2 (Roguski and Deorowicz, 2014) and FaStore (Roguski et al., 2018). Since the reads are substrings of the underlying genome, there is much redundancy to be exploited for compression. Specialized compressors, which explicitly utilize the structure present in the reads, can achieve a compression gain of more than 10× over generic universal compressors such as Gzip (Numanagić et al., 2016). The quality values, on the other hand, have less structure and thus can take up a more significant fraction of the storage space in the compressed domain. Recent work (Ochoa et al., 2017; Roguski et al., 2018) has shown that the quality values can be lossily compressed without adversely affecting the performance of variant calling, one of the most widely used downstream applications in practice. Moreover, newer technologies such as Illumina’s NovaSeq use quality values with fewer levels (4 levels instead of the previous 8 or 40 levels), supporting the claim that the precision of the quality values can be reduced with no impact on variant calling performance.

Although there has been a lot of work on designing FASTQ compressors, most of them lack one or more crucial properties, such as support for variable-length reads (Roguski et al., 2018), scalability to high coverage datasets, pairing-preserving compression (Roguski and Deorowicz, 2014) and lossless compression (Hach et al., 2012). Partly due to these factors, Gzip is still the prevalent FASTQ compressor, even though it provides worse compression ratios (Numanagić et al., 2016).

In this work, we present the next-generation compressor SPRING, which supports all the crucial properties, while achieving significantly better compression as compared with state-of-the-art FASTQ compressors. SPRING is also eminently practical in terms of its memory/time requirements, and supports selective access to the compressed data.

2 Methods and results

SPRING supports the following recommended modes of FASTQ compression:

Lossless mode (default): In this mode, the FASTQ file is compressed so that it can be exactly reconstructed, i.e. the reads, quality values, read identifiers and the read order information can be perfectly recovered.
Recommended lossy mode: In this mode, the information relevant for most genomic applications (such as alignment, assembly, variant calling, etc.) is preserved. This includes the reads along with pairing information and binned quality values. The quality values are subjected to Illumina’s standardized 8-level binning (https://www.illumina.com/documents/products/whitepapers/whitepaper_datacompression.pdf) before compression (NovaSeq qualities are left unchanged). The read identifiers and the order of the pairs are discarded (i.e. the decompressed FASTQ file contains the read pairs in an arbitrary order). The relative ordering of the first and the second read in each pair is still preserved.
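
The quality binning step of the recommended lossy mode can be sketched as a simple lookup table. The bin boundaries and representative scores in `ASSUMED_BINS` are an assumption based on the commonly cited form of Illumina's 8-level scheme and should be checked against the whitepaper linked above; in particular, the treatment of scores 0 and 1 (mapped to 0 here) is a guess. Phred+33 encoding is also assumed, and the function names are hypothetical rather than SPRING's own.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoded quality strings

# (lowest score in bin, highest score in bin, representative score).
# ASSUMPTION: values follow the commonly cited Illumina 8-level binning;
# verify against the whitepaper before relying on them.
ASSUMED_BINS = [
    (0, 1, 0),
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
    (40, 93, 40),
]


def build_binning_table() -> dict:
    """Map every quality symbol to the symbol of its bin representative."""
    table = {}
    for low, high, representative in ASSUMED_BINS:
        for score in range(low, high + 1):
            symbol = chr(score + PHRED_OFFSET)
            table[symbol] = chr(representative + PHRED_OFFSET)
    return table


BINNING_TABLE = build_binning_table()


def bin_quality(quality: str) -> str:
    """Replace each quality symbol by its bin representative."""
    return "".join(BINNING_TABLE[symbol] for symbol in quality)


original = "IF5#A"               # scores 40, 37, 20, 2, 32
binned = bin_quality(original)   # scores 40, 37, 22, 6, 33
print(original, "->", binned)
```

Binning leaves the length of the quality string unchanged but shrinks its alphabet to eight symbols, which is what makes the stream cheaper to encode.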

Although we advocate for these default modes, SPRING can be highly customized based on the user needs, and provides additional capabilities such as custom binning of quality values using QVZ (Malysa et al., 2015) and binary thresholding.
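
Binary thresholding of quality values can be illustrated in a few lines. This is a sketch under stated assumptions, not SPRING's implementation: scores at or above a threshold are replaced by one "high" score and all others by one "low" score. The function name, its parameters and the example values (threshold 20, high 40, low 6) are hypothetical, and Phred+33 encoding is assumed.

```python
def binary_threshold(quality: str, threshold: int, high: int, low: int,
                     offset: int = 33) -> str:
    """Binarize a quality string.

    Scores >= threshold become `high`; all other scores become `low`.
    All parameters are illustrative assumptions.
    """
    out = []
    for symbol in quality:
        score = ord(symbol) - offset
        out.append(chr((high if score >= threshold else low) + offset))
    return "".join(out)


original = "IF5#"  # scores 40, 37, 20, 2
binarized = binary_threshold(original, threshold=20, high=40, low=6)
print(original, "->", binarized)
```

After thresholding, the quality stream uses only two symbols, so it is the coarsest of the lossy options mentioned here.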

For short reads (up to 511 bp), the read compression in SPRING is based on HARC (Chandak et al., 2018), with significant improvements and added support for variable-length reads. SPRING also supports long read compression, where BSC (https://github.com/IlyaGrebnov/libbsc/) is used as the read compressor. Furthermore, SPRING compresses the streams in blocks, allowing for fast decompression of a subset of reads (random access). More details and results for these features are provided in the Supplementary Material.
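
The idea of block-wise compression for random access can be sketched as follows. This is an illustration only: Python's standard `bz2` module stands in for BSC, the block size of 10 reads is arbitrary, and the function names and synthetic reads are hypothetical. The point is that retrieving a range of reads only requires decompressing the blocks that overlap that range.

```python
import bz2
from typing import List


def compress_in_blocks(reads: List[str], reads_per_block: int) -> List[bytes]:
    """Compress consecutive groups of reads as independent blocks."""
    blocks = []
    for start in range(0, len(reads), reads_per_block):
        chunk = "\n".join(reads[start:start + reads_per_block])
        blocks.append(bz2.compress(chunk.encode("ascii")))
    return blocks


def blocks_needed(first: int, last: int, reads_per_block: int) -> range:
    """Indices of the blocks that cover reads first..last (0-based, inclusive)."""
    return range(first // reads_per_block, last // reads_per_block + 1)


def fetch_reads(blocks: List[bytes], reads_per_block: int,
                first: int, last: int) -> List[str]:
    """Return reads first..last, decompressing only the blocks that cover them."""
    needed = blocks_needed(first, last, reads_per_block)
    decoded: List[str] = []
    for index in needed:
        decoded.extend(bz2.decompress(blocks[index]).decode("ascii").split("\n"))
    skip = first - needed.start * reads_per_block
    return decoded[skip:skip + (last - first + 1)]


# Synthetic reads of length 20, for illustration only.
reads = ["".join("ACGT"[(i * 7 + j * 3) % 4] for j in range(20))
         for i in range(95)]
blocks = compress_in_blocks(reads, reads_per_block=10)
subset = fetch_reads(blocks, reads_per_block=10, first=37, last=42)
print(len(blocks), "blocks; reads 37-42 touch blocks",
      list(blocks_needed(37, 42, 10)))
```

Smaller blocks make random access cheaper but give the compressor less context per block, so the block size trades compression ratio against access granularity.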

Table 1 shows the compression results for the two recommended modes for selected datasets. We compare SPRING to FaStore (Roguski et al., 2018), the best performing FASTQ compressor, and to pigz (parallelized Gzip), the most commonly used FASTQ compressor in practice. SPRING achieves significant compression gains over FaStore in both modes, especially for human NovaSeq datasets, while using comparable computational resources (Supplementary Material). For example, running with eight threads in the lossless mode, SPRING requires 2 h and 31 GB RAM to compress the 25× NovaSeq Homo sapiens dataset, which is competitive with FaStore (2.5 h and 41 GB). To decompress the dataset, SPRING requires 26 min and 6.1 GB RAM. This is slower than FaStore (12 min), but with significantly lower memory consumption (23 GB for FaStore). In comparison to pigz, SPRING achieves 2×–5× better compression ratios but requires higher computational resources (see Supplementary Material for further details and more extensive results).

Table 1. Compressed sizes (in MB) for selected datasets

Organism	Technology	Coverage (×)	Uncompressed size	Lossless: pigz	Lossless: FaStore	Lossless: SPRING	Lossless: improvement	Lossy: FaStore	Lossy: SPRING	Lossy: improvement
Pseudomonas aeruginosa	GAIIx	50	768	279	145	115	1.26×	88	62	1.41×
Metagenomic	HiSeq 2000	–	19 284	6911	3602	3206	1.12×	1935	1736	1.11×
H.sapiens	HiSeq 2000	28	227 246	74 250	35 662	28 901	1.23×	17 417	13 460	1.29×
H.sapiens	NovaSeq	25	195 748	36 131	11 101	6971	1.59×	9927	5657	1.75×
H.sapiens	NovaSeq	100	787 616	144 927	33 734	25 883	1.30×	28 846	20 316	1.42×

Notes: "Lossy" refers to the recommended lossy mode. Improvement is reported with respect to FaStore. The best result for each mode is the SPRING column in every row.
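
The improvement factors can be recomputed from the lossless columns of Table 1 with a short script. It assumes that "improvement" means the size achieved by the other tool divided by the SPRING size; because the table lists sizes rounded to whole MB, a recomputed ratio can differ from a reported one in the last digit. The dictionary keys and function name are hypothetical labels.

```python
# Lossless-mode sizes in MB from Table 1: (pigz, FaStore, SPRING).
LOSSLESS_MB = {
    "P. aeruginosa GAIIx 50x": (279, 145, 115),
    "Metagenomic HiSeq 2000": (6911, 3602, 3206),
    "H. sapiens HiSeq 2000 28x": (74250, 35662, 28901),
    "H. sapiens NovaSeq 25x": (36131, 11101, 6971),
    "H. sapiens NovaSeq 100x": (144927, 33734, 25883),
}


def improvement(other_mb: float, spring_mb: float) -> float:
    """Assumed definition: size with the other tool divided by SPRING's size."""
    return other_mb / spring_mb


vs_fastore = {}
vs_pigz = {}
for name, (pigz_mb, fastore_mb, spring_mb) in LOSSLESS_MB.items():
    vs_fastore[name] = round(improvement(fastore_mb, spring_mb), 2)
    vs_pigz[name] = round(improvement(pigz_mb, spring_mb), 2)
    print(f"{name}: {vs_fastore[name]:.2f}x over FaStore, "
          f"{vs_pigz[name]:.2f}x over pigz")
```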

In conclusion, this work presents the FASTQ compressor SPRING, which outperforms existing tools, offering 1.3×–1.8× improvement in compression over the next best performing tool on data sequenced on Illumina’s latest sequencer, NovaSeq. SPRING supports a wide variety of modes and features and is competitive in terms of computational requirements. Furthermore, the streams generated by SPRING can be easily transformed to streams compatible with the upcoming standard developed by the MPEG-G group for genomic information representation (Alberti et al., 2018). Future work includes integration of SPRING into the standard and developing specialized read compressors for long read technologies.

Funding

This work was partially supported by NIH Grant 5U01CA198943-03, grant numbers 2018-182798 and 2018-182799 from the Chan Zuckerberg Initiative DAF, an advised fund of SVCF, and an SRI grant from UIUC.

Conflict of Interest: none declared.

References

Alberti C. et al. (2018) An introduction to MPEG-G, the new ISO standard for genomic information representation. https://www.biorxiv.org/content/early/2018/10/08/426353.

Bonfield J.K., Mahoney M.V. (2013) Compression of FASTQ and SAM format sequencing data. PLoS One, 8, e59190.

Chandak S. et al. (2018) Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics, 34, 558–567.

Hach F. et al. (2012) SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics, 28, 3051–3057.

Malysa G. et al. (2015) QVZ: lossy compression of quality values. Bioinformatics, 31, 3122–3129.

Numanagić I. et al. (2016) Comparison of high-throughput sequencing data compression tools. Nat. Methods, 13, 1005.

Ochoa I. et al. (2017) Effect of lossy compression of quality scores on variant calling. Brief. Bioinform., 18, 183–194.

Roguski L., Deorowicz S. (2014) DSRC 2-industry-oriented compression of FASTQ files. Bioinformatics, 30, 2213–2215.

Roguski L. et al. (2018) FaStore: a space-saving solution for raw sequencing data. Bioinformatics, 34, 2748–2756.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
