Abstract

Summary

Modern sequencing projects cover tens of thousands of individuals, and the huge files summarizing their findings require effective compression. We propose a tool able to compress large collections of genotypes almost 30% better than the best tool to date, squeezing a single human genotype to less than 62 KB. Moreover, it can also compress single samples against an existing compressed database, achieving comparable results.

Availability and implementation

https://github.com/refresh-bio/GTShark.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The rapid decrease of genome sequencing costs has allowed many sequencing projects to grow to impressive sizes. The numbers of individuals covered by the Haplotype Reference Consortium (McCarthy et al., 2016) or the Exome Aggregation Consortium (Lek et al., 2016) are counted in tens of thousands, and even larger projects are under way.

The aggregate results of such projects are usually stored in Variant Call Format (VCF) files (Danecek et al., 2011), in which the descriptions of genome variations are stored in successive rows. Each tab-separated row contains nine mandatory fields describing the variant and some (unlimited) number of optional values representing genotypes. The sizes of VCF files are huge, often counted in hundreds of gigabytes per single chromosome, so gzip is a common solution that partially resolves the storage and transfer problems. Nevertheless, a lot of effort has been made to provide even better compression and, in some cases, faster access to the data. The most remarkable attempts were TGC (Deorowicz et al., 2013), PBWT (Durbin, 2014), BGT (Li, 2016) and GTC (Danek and Deorowicz, 2018). The first of them focused solely on the compression ratio, while the remaining ones aimed at supporting rapid queries with good compression ratios.
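
For illustration only, a hypothetical VCF data line with three samples could look as follows (coordinates, identifiers and values are invented); the nine mandatory fields, CHROM to FORMAT, are followed by one genotype column per sample:

#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  S1   S2   S3
11      108731  .   C    T    100   PASS    .     GT      0|0  0|1  1|1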

In this article, our goal is to provide the best compression ratio for a collection of genotypes. Moreover, the compressed database can serve as a knowledge base, which allows the files of newly sequenced individuals to be reduced to a remarkably small size.

Our tool, GTShark, is based on the Positional Burrows–Wheeler Transform (PBWT) introduced by Durbin (2014). The key idea of PBWT is to permute the vector of genotypes of every single variant so that the samples are ordered according to their genotypes at the previous variants. Due to linkage disequilibrium, the neighboring genotypes (after the permutation) are likely to be the same. Thus, the permuted vector of genotypes is usually composed of a few runs of 1s (alternate allele present) and 0s (reference allele present). PBWT and BGT (Li, 2016), which builds on PBWT, store the run lengths using a simple run-encoding scheme. BGT additionally stores the permutation of genotypes at every 8192nd variant, so that permutations can be recomputed from such a position, allowing fast random-access queries. GTC does not use PBWT but a specialized technique based on the Ziv–Lempel compression algorithm, run-length encoding and Huffman coding. Similarly to BGT, it processes variants in blocks but permutes the haplotypes within each block.
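
To make the run-length idea concrete, the following is a minimal sketch of one step of the binary PBWT: the allele vector of the current site is emitted as runs in the current PBWT order, and the permutation for the next site is obtained by a stable counting sort (all haplotypes with allele 0 first, then those with allele 1). The names and the restriction to a strictly binary alphabet are ours for illustration; GTShark itself uses the generalized, non-binary variant described in Section 2.

#include <cstdint>
#include <utility>
#include <vector>

using Permutation = std::vector<uint32_t>;

// Permute the allele vector of the current site by the PBWT order 'a',
// emit it as (allele, run length) pairs, and compute the order for the
// next site: haplotypes with allele 0 first, then those with allele 1,
// both groups keeping their relative order (a stable counting sort).
Permutation pbwt_step(const std::vector<uint8_t>& alleles,   // alleles[h] for haplotype h
                      const Permutation& a,                  // current PBWT permutation
                      std::vector<std::pair<uint8_t, uint32_t>>& runs)
{
    runs.clear();
    Permutation zeros, ones;
    zeros.reserve(a.size());
    ones.reserve(a.size());

    for (uint32_t idx : a) {
        uint8_t al = alleles[idx];
        (al == 0 ? zeros : ones).push_back(idx);

        if (!runs.empty() && runs.back().first == al)
            ++runs.back().second;                 // extend the current run
        else
            runs.emplace_back(al, 1u);            // start a new run
    }

    zeros.insert(zeros.end(), ones.begin(), ones.end());
    return zeros;                                 // permutation for the next site
}

Thanks to linkage disequilibrium, the runs vector typically contains only a handful of entries per site, which is what makes the subsequent encoding compact.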

2 Materials and methods

In GTShark, we essentially follow the same approach as PBWT and BGT. Nevertheless, there are several differences. First, we use a generalized PBWT [designed for non-binary alphabets by Deorowicz et al. (2019)] to directly support multi-allelic variants and unknown genotypes. Second, since we aim at the best compression ratio, we do not store the intermediate sample permutations. Third, we employ an entropy coder (namely, a range coder) with special contextual modeling, which improves the compression ratio more than 2-fold. Fourth, we slightly modify the generalized PBWT, i.e., we refrain from permuting the vectors of samples for extremely rare (or frequent) variants. The experiments show that this improves the compression ratio by up to 10%.
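
The fourth modification can be illustrated with a short sketch. The predicate below skips the permutation update when the minor allele is extremely rare; the permuted vector of such a site is still encoded, only the ordering carried to the next site is left unchanged. The threshold value is an assumption made for illustration, not the one used by GTShark.

#include <algorithm>
#include <cstdint>
#include <vector>

// Decide whether the PBWT permutation update should be skipped for the
// current variant. The threshold is an assumed value for illustration;
// GTShark's actual criterion may differ.
bool skip_permutation_update(const std::vector<uint8_t>& alleles,
                             uint32_t min_minor_count = 4)
{
    // Count non-reference calls; if almost all samples agree, permuting
    // them changes little and only perturbs the ordering built so far.
    uint32_t alt = static_cast<uint32_t>(
        std::count_if(alleles.begin(), alleles.end(),
                      [](uint8_t g) { return g != 0; }));
    uint32_t minor = std::min(alt, static_cast<uint32_t>(alleles.size()) - alt);
    return minor < min_minor_count;
}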

A unique feature of GTShark is the ability to compress new, external, single-sample, VCF files with the use of the compressed collection as a reference. Such mode is not available in other tools. Such a scenario can appear, for example, in large sequencing projects, like the UK Biobank (Bycroft et al., 2018), in which the genotypes are determined using microarrays when a set of variants is fixed. To compress such sample we traverse the compressed collection and for each genotype we determine the position in the permuted vector in which this genotype would be placed if the sample were a part of the collection. We make use of the found neighborhood to predict and store the genotype compactly. More details of the algorithms can be found in the Supplementary Material.
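
A minimal sketch of the prediction step under simplifying assumptions (binary alleles, a fixed context window, a plain majority vote): the window size and the voting rule are ours for illustration; GTShark instead uses the neighborhood as the context of its range coder rather than taking a hard decision.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative prediction of a new sample's allele from its PBWT
// neighborhood. 'permuted' is the allele vector of the collection at the
// current site in PBWT order; 'insert_pos' is where the new haplotype
// would be placed in that order.
uint8_t predict_allele(const std::vector<uint8_t>& permuted,
                       std::size_t insert_pos,
                       std::size_t window = 8 /* assumed context width */)
{
    std::size_t lo = (insert_pos > window) ? insert_pos - window : 0;
    std::size_t hi = std::min(insert_pos + window, permuted.size());

    std::size_t alt = 0, total = 0;
    for (std::size_t i = lo; i < hi; ++i, ++total)
        alt += (permuted[i] != 0);

    // Neighbors in PBWT order share long matches with the new haplotype
    // ending just before the current site, so their majority allele is a
    // strong predictor.
    return (2 * alt > total) ? 1 : 0;
}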

The compressor was implemented in C++14 and is distributed under the GNU GPL 3 licence. It is slightly parallelized: the PBWT construction and the remaining parts are executed in separate threads. Thus, the maximum parallelization gain is 2-fold, but in practice it is much smaller.
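
A minimal sketch of this two-thread organization, assuming a simple single-producer/single-consumer channel between the PBWT stage and the entropy-coding stage; the stage functions are placeholders, not GTShark routines.

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Vec = std::vector<uint8_t>;

// Single-producer/single-consumer channel used to hand permuted vectors
// from the PBWT thread to the entropy-coding thread. Unbounded for
// brevity; a real implementation would cap the queue size.
struct Channel {
    std::mutex m;
    std::condition_variable cv;
    std::queue<Vec> q;
    bool done = false;

    void push(Vec v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }
    bool pop(Vec& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;              // producer has finished
        out = std::move(q.front());
        q.pop();
        return true;
    }
};

// 'permute_site' and 'encode_site' are placeholders for the PBWT and
// entropy-coding stages.
void compress_two_threads(const std::vector<Vec>& sites,
                          Vec (*permute_site)(const Vec&),
                          void (*encode_site)(const Vec&))
{
    Channel ch;
    std::thread pbwt_thread([&] {
        for (const Vec& s : sites) ch.push(permute_site(s));
        ch.close();
    });
    std::thread coder_thread([&] {
        Vec v;
        while (ch.pop(v)) encode_site(v);
    });
    pbwt_thread.join();
    coder_thread.join();
}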

3 Results

For evaluation we used two H. sapiens datasets: the 1000 Genomes Project Phase 3 (1000GP3: 2504 samples, 84.80 M variants) (Sudmant et al., 2015) and the Haplotype Reference Consortium (HRC: 27 165 samples, 40.40 M variants) (McCarthy et al., 2016).

The comparison of compression ratios and times of the state-of-the-art methods, i.e., TGC, PBWT, BGT, GTC and the proposed GTShark, is shown in Table 1 and in the Supplementary Material. GTShark is the clear winner in compression ratio. The running times of the PBWT, BGT, GTC and GTShark algorithms are similar, as they are dominated by the processing of VCF files. GTShark is the fastest in compression due to the slight parallelization of the code and some small technical improvements in parsing the VCF files. GTShark also allows a single sample to be extracted from a compressed collection, in about 11.5 min on average, which is over 16 times faster than PBWT but over 7 and 5 times slower than BGT and GTC, respectively.

Table 1.

Comparison of compressors of VCF files for HRC collection (40.40 M variants, 27 165 samples, 4310 GB of VCF, 69.7 GB of gzipped VCF)

Tool        C-size [MB]   C-time [s]   C-RAM [GB]   D-time [s]
TGC            2386        1 090 940      76.8          2 442
PBWT           3693          134 065       0.8         33 499
BGT            6717          146 178       0.008       25 610
GTC            3731          142 474       5.0          2 082
GTShark        1678           97 716       1.6         20 563

Note: Hardware configuration: two AMD Opteron 6348 CPUs (2.8 GHz, 12 cores each), 128 GB of RAM. Column description: ‘C-size’ is the compressed size, ‘C-time’ the compression time, ‘C-RAM’ the RAM usage in the compression stage and ‘D-time’ the decompression time.


In the next experiment, we measured the compression of new external samples. We used the HRC dataset and processed it as follows. We excluded 100 randomly chosen samples to obtain a collection of 27 065 samples, for which we built a compressed database of total size 1674 MB (i.e., 61.8 KB per sample). Then we compressed the excluded samples one by one, taking the compressed collection as a reference. On average, GTShark processed a single sample in about 12 min and compressed it to 65.5 KB. To the best of our knowledge, the most recent experiment of this type was described by Pavlichin et al. (2013). The authors used a reference human genome and the dbSNP database as a knowledge base. They were able to compress single individual genotypes from the 1000GP Phase 1 (39.7 M variants) to about 2.5 MB. The results should not, however, be compared directly, as in the Pavlichin et al. experiment the compressed samples contained variants absent from the reference database, whereas in our experiment we assumed that the sets of variants in the sample and in the reference data are the same. Nevertheless, the comparison shows what is possible in such a restricted scenario.

In the final experiment, we investigated the impact of the selected knowledge base. We used Chromosome 11 data from the 1000GP3, containing 2504 samples from 26 populations. Initially, we divided the VCF file into 26 files according to population. The ASW and MXL populations were significantly smaller than the rest, so we excluded them from further studies. The remaining populations contained at least 85 samples each, and we randomly subsampled the larger ones to obtain 24 VCF files, each containing exactly 85 samples. Then we used GTShark to compress each VCF file, obtaining 24 compressed collections serving as references in the rest of the experiment. We also constructed 24 larger reference collections. Each of them, composed of 23 population VCF files and containing 23 × 85 = 1955 samples, was also compressed using GTShark. Then we compressed each single-sample VCF file (i.e., 24 × 85 = 2040 samples in total) using each of the smaller (85-sample) collections as a reference. The results were averaged over all 85 samples from each population. For easier interpretability of Figure 1, we grouped the populations into five superpopulations (African, American, East Asian, European and South Asian). The columns are labeled with the reference population. As one could expect, the compression is best when a dataset from the same superpopulation is used as a reference. Some populations, though, are quite a good reference for a sample of any ethnicity (for example, some African populations, such as ACB). The last column (beyond the matrix) shows the results for the larger collections (containing all populations except the one being compressed). Here, the diversity and size of the database work in favor of the sample compression ratio.

Fig. 1.

Comparison of single-sample GTShark compression for 24 populations (data: 1000GP3, Chromosome 11); sizes of sample archives [B]. The references labeled with population codes consist of 85 samples from that population. The ‘Coll.’ reference is a collection of 1955 samples from 23 populations (excluding the population of the compressed sample)

4 Conclusions

The proposed algorithm compressed large collections of genotype data significantly better than the existing methods and was also the fastest. Its unique feature is a mode in which a compressed collection of genotypes serves as a reference for external single-sample data. In this scenario, we were able to shrink a single human genotype down to about 65 KB, achieving better results than the former leader in compression ratio, TGC.

Funding

The work was supported by National Science Centre, Poland, project DEC-2017/25/B/ST6/01525 and by POIG.02.03.01-24-099/13 grant: ‘GeCONiI—Upper Silesian Center for Computational Science and Engineering’ (the infrastructure).

Conflict of Interest: none declared.

References

Bycroft, C. et al. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
Danecek, P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Danek, A. and Deorowicz, S. (2018) GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics, 34, 1834–1840.
Deorowicz, S. et al. (2013) Genome compression: a novel approach for large collections. Bioinformatics, 29, 2572–2578.
Deorowicz, S. et al. (2019) CoMSA: compression of protein multiple sequence alignment files. Bioinformatics, 35, 227–234.
Durbin, R. (2014) Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30, 1266–1272.
Lek, M. et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536, 285–291.
Li, H. (2016) BGT: efficient and flexible genotype query across many samples. Bioinformatics, 32, 590–592.
McCarthy, S. et al. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet., 48, 1279–1283.
Pavlichin, D. et al. (2013) The human genome contracts again. Bioinformatics, 29, 2199–2202.
Sudmant, P.H. et al. (2015) An integrated map of structural variation in 2,504 human genomes. Nature, 526, 75–81.
