Abstract

Summary

Modern sequencing projects cover tens of thousands of individuals, and the huge files summarizing their findings require effective compression. We propose a tool able to compress large collections of genotypes almost 30% better than the best tool to date, squeezing a single human genotype to less than 62 KB. Moreover, it can also compress single samples against an existing compressed database, achieving comparable results.

Availability and implementation

https://github.com/refresh-bio/GTShark.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The rapid decrease of genome sequencing costs has allowed many sequencing projects to grow to impressive sizes. The numbers of individuals covered by the Haplotype Reference Consortium (McCarthy et al., 2016) or the Exome Aggregation Consortium (Lek et al., 2016) are counted in tens of thousands, and even larger projects are under way.

The aggregate results of such projects are usually stored in Variant Call Format (VCF) files (Danecek et al., 2011), in which the descriptions of genome variations are stored in successive rows. Each tab-separated row contains nine mandatory fields describing the variant and some (unlimited) number of optional values representing genotypes. The sizes of VCF files are huge, often counted in hundreds of gigabytes per single chromosome, so gzip is a common solution that partially resolves the storage and transfer problems. Nevertheless, a lot of effort has been made to provide even better compression and, in some cases, faster access to the data. The most remarkable attempts were TGC (Deorowicz et al., 2013), PBWT (Durbin, 2014), BGT (Li, 2016) and GTC (Danek and Deorowicz, 2018). The first of them focused solely on the compression ratio, while the remaining ones aimed at supporting rapid queries with good compression ratios.
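
For illustration only, a hypothetical VCF data line with three samples could look as follows (coordinates, identifiers and values are invented); the nine mandatory fields, CHROM to FORMAT, are followed by one genotype column per sample:

#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  S1   S2   S3
11      108731  .   C    T    100   PASS    .     GT      0|0  0|1  1|1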

In this article, our goal is to provide the best compression ratio for a collection of genotypes. Moreover, the compressed database can serve as a knowledge base, which allows the files of newly sequenced individuals to be reduced to a remarkably small size.

Our tool, GTShark, is based on the Positional Burrows–Wheeler Transform (PBWT) introduced by Durbin (2014). The key idea of PBWT is to permute the vector of genotypes of every single variant so that the samples are ordered according to their genotypes at the previous variants. Due to linkage disequilibrium, the neighboring genotypes (after the permutation) are likely to be the same. Thus, the permuted vector of genotypes is usually composed of a few runs of 1s (alternate allele present) and 0s (reference allele present). PBWT and BGT (Li, 2016), which builds on PBWT, store the run lengths using a simple run-encoding scheme. BGT additionally stores the permutation of genotypes at every 8192nd variant, so that permutations can be recomputed from such a position, allowing fast random-access queries. GTC does not use PBWT but a specialized technique based on the Ziv–Lempel compression algorithm, run-length encoding and Huffman coding. Similarly to BGT, it processes variants in blocks but permutes the haplotypes within each block.
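
To make the run-length idea concrete, the following is a minimal sketch of one step of the binary PBWT: the allele vector of the current site is emitted as runs in the current PBWT order, and the permutation for the next site is obtained by a stable counting sort (all haplotypes with allele 0 first, then those with allele 1). The names and the restriction to a strictly binary alphabet are ours for illustration; GTShark itself uses the generalized, non-binary variant described in Section 2.

#include <cstdint>
#include <utility>
#include <vector>

using Permutation = std::vector<uint32_t>;

// Permute the allele vector of the current site by the PBWT order 'a',
// emit it as (allele, run length) pairs, and compute the order for the
// next site: haplotypes with allele 0 first, then those with allele 1,
// both groups keeping their relative order (a stable counting sort).
Permutation pbwt_step(const std::vector<uint8_t>& alleles,   // alleles[h] for haplotype h
                      const Permutation& a,                  // current PBWT permutation
                      std::vector<std::pair<uint8_t, uint32_t>>& runs)
{
    runs.clear();
    Permutation zeros, ones;
    zeros.reserve(a.size());
    ones.reserve(a.size());

    for (uint32_t idx : a) {
        uint8_t al = alleles[idx];
        (al == 0 ? zeros : ones).push_back(idx);

        if (!runs.empty() && runs.back().first == al)
            ++runs.back().second;                 // extend the current run
        else
            runs.emplace_back(al, 1u);            // start a new run
    }

    zeros.insert(zeros.end(), ones.begin(), ones.end());
    return zeros;                                 // permutation for the next site
}

Thanks to linkage disequilibrium, the runs vector typically contains only a handful of entries per site, which is what makes the subsequent encoding compact.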

2 Materials and methods

In GTShark, we essentially follow the same approach as PBWT and BGT. Nevertheless, there are several differences. First, we use a generalized PBWT [designed for non-binary alphabets by Deorowicz et al. (2019)] to directly support multi-allelic variants and unknown genotypes. Second, since we aim at the best compression ratio, we do not store the intermediate sample permutations. Third, we employ an entropy coder (namely, a range coder) with special contextual modeling, which improves the compression ratio more than 2-fold. Fourth, we slightly modify the generalized PBWT, i.e., we refrain from permuting the vectors of samples for extremely rare (or frequent) variants. The experiments show that this improves the compression ratio by up to 10%.
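
The fourth modification can be illustrated with a short sketch. The predicate below skips the permutation update when the minor allele is extremely rare; the permuted vector of such a site is still encoded, only the ordering carried to the next site is left unchanged. The threshold value is an assumption made for illustration, not the one used by GTShark.

#include <algorithm>
#include <cstdint>
#include <vector>

// Decide whether the PBWT permutation update should be skipped for the
// current variant. The threshold is an assumed value for illustration;
// GTShark's actual criterion may differ.
bool skip_permutation_update(const std::vector<uint8_t>& alleles,
                             uint32_t min_minor_count = 4)
{
    // Count non-reference calls; if almost all samples agree, permuting
    // them changes little and only perturbs the ordering built so far.
    uint32_t alt = static_cast<uint32_t>(
        std::count_if(alleles.begin(), alleles.end(),
                      [](uint8_t g) { return g != 0; }));
    uint32_t minor = std::min(alt, static_cast<uint32_t>(alleles.size()) - alt);
    return minor < min_minor_count;
}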

A unique feature of GTShark is the ability to compress new, external, single-sample, VCF files with the use of the compressed collection as a reference. Such mode is not available in other tools. Such a scenario can appear, for example, in large sequencing projects, like the UK Biobank (Bycroft et al., 2018), in which the genotypes are determined using microarrays when a set of variants is fixed. To compress such sample we traverse the compressed collection and for each genotype we determine the position in the permuted vector in which this genotype would be placed if the sample were a part of the collection. We make use of the found neighborhood to predict and store the genotype compactly. More details of the algorithms can be found in the Supplementary Material.
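
A minimal sketch of the prediction step under simplifying assumptions (binary alleles, a fixed context window, a plain majority vote): the window size and the voting rule are ours for illustration; GTShark instead uses the neighborhood as the context of its range coder rather than taking a hard decision.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative prediction of a new sample's allele from its PBWT
// neighborhood. 'permuted' is the allele vector of the collection at the
// current site in PBWT order; 'insert_pos' is where the new haplotype
// would be placed in that order.
uint8_t predict_allele(const std::vector<uint8_t>& permuted,
                       std::size_t insert_pos,
                       std::size_t window = 8 /* assumed context width */)
{
    std::size_t lo = (insert_pos > window) ? insert_pos - window : 0;
    std::size_t hi = std::min(insert_pos + window, permuted.size());

    std::size_t alt = 0, total = 0;
    for (std::size_t i = lo; i < hi; ++i, ++total)
        alt += (permuted[i] != 0);

    // Neighbors in PBWT order share long matches with the new haplotype
    // ending just before the current site, so their majority allele is a
    // strong predictor.
    return (2 * alt > total) ? 1 : 0;
}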

The compressor was implemented in C++14 and is distributed under the GNU GPL 3 licence. It is slightly parallelized: the PBWT construction and the remaining parts are executed in separate threads. Thus, the maximum parallelization gain is 2-fold, but in practice it is much smaller.
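
A minimal sketch of this two-thread organization, assuming a simple single-producer/single-consumer channel between the PBWT stage and the entropy-coding stage; the stage functions are placeholders, not GTShark routines.

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Vec = std::vector<uint8_t>;

// Single-producer/single-consumer channel used to hand permuted vectors
// from the PBWT thread to the entropy-coding thread. Unbounded for
// brevity; a real implementation would cap the queue size.
struct Channel {
    std::mutex m;
    std::condition_variable cv;
    std::queue<Vec> q;
    bool done = false;

    void push(Vec v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }
    bool pop(Vec& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;              // producer has finished
        out = std::move(q.front());
        q.pop();
        return true;
    }
};

// 'permute_site' and 'encode_site' are placeholders for the PBWT and
// entropy-coding stages.
void compress_two_threads(const std::vector<Vec>& sites,
                          Vec (*permute_site)(const Vec&),
                          void (*encode_site)(const Vec&))
{
    Channel ch;
    std::thread pbwt_thread([&] {
        for (const Vec& s : sites) ch.push(permute_site(s));
        ch.close();
    });
    std::thread coder_thread([&] {
        Vec v;
        while (ch.pop(v)) encode_site(v);
    });
    pbwt_thread.join();
    coder_thread.join();
}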

3 Results

For evaluation we used two H. sapiens datasets: the 1000 Genomes Project Phase 3 (1000GP3: 2504 samples, 84.80 M variants) (Sudmant et al., 2015) and the Haplotype Reference Consortium (HRC: 27 165 samples, 40.40 M variants) (McCarthy et al., 2016).

The comparison of compression ratios and times of the state-of-the-art methods, i.e., TGC, PBWT, BGT, GTC and the proposed GTShark, is shown in Table 1 and in the Supplementary Material. GTShark is the clear winner in compression ratio. The running times of the PBWT, BGT, GTC and GTShark algorithms are similar, as they are dominated by the processing of VCF files. GTShark is the fastest in compression due to the slight parallelization of the code and some small technical improvements in parsing the VCF files. GTShark also allows a single sample to be extracted from a compressed collection, in about 11.5 min on average, which is over 16 times faster than PBWT but over 7 and 5 times slower than BGT and GTC, respectively.

Table 1.

Comparison of compressors of VCF files for HRC collection (40.40 M variants, 27 165 samples, 4310 GB of VCF, 69.7 GB of gzipped VCF)

Tool        C-size [MB]   C-time [s]   C-RAM [GB]   D-time [s]
TGC            2386        1 090 940      76.8          2 442
PBWT           3693          134 065       0.8         33 499
BGT            6717          146 178       0.008       25 610
GTC            3731          142 474       5.0          2 082
GTShark        1678           97 716       1.6         20 563

Note: Hardware configuration: two AMD Opteron 6348 CPUs (2.8 GHz, 12 cores each), 128 GB of RAM. Column description: ‘C-size’ is the compressed size, ‘C-time’ the compression time, ‘C-RAM’ the RAM usage in the compression stage and ‘D-time’ the decompression time.


In the next experiment, we measured the compression of new external samples. We used the HRC dataset and processed it as follows. We excluded 100 randomly chosen samples to obtain a collection of 27 065 samples, for which we built a compressed database of total size 1674 MB (i.e., 61.8 KB per sample). Then we compressed the excluded samples one by one, taking the compressed collection as a reference. On average, GTShark processed a single sample in about 12 min and compressed it to 65.5 KB. To the best of our knowledge, the most recent experiment of this type was described by Pavlichin et al. (2013). The authors used a reference human genome and the dbSNP database as a knowledge base. They were able to compress single individual genotypes from the 1000GP Phase 1 (39.7 M variants) to about 2.5 MB. The results should not, however, be compared directly, as in the Pavlichin et al. experiment the compressed samples contained variants absent from the reference database, whereas in our experiment we assumed that the sets of variants in the sample and in the reference data are the same. Nevertheless, the comparison shows what is possible in such a restricted scenario.

In the final experiment, we investigated the impact of the selected knowledge base. We used Chromosome 11 data from the 1000GP3, containing 2504 samples from 26 populations. Initially, we divided the VCF file into 26 files according to population. The ASW and MXL populations were significantly smaller than the rest, so we excluded them from further studies. The remaining populations contained at least 85 samples each, and we randomly subsampled the larger ones to obtain 24 VCF files, each containing exactly 85 samples. Then we used GTShark to compress each VCF file, obtaining 24 compressed collections serving as references in the rest of the experiment. We also constructed 24 larger reference collections. Each of them, composed of 23 population VCF files and containing 23 × 85 = 1955 samples, was also compressed using GTShark. Then we compressed each single-sample VCF file (i.e., 24 × 85 = 2040 samples in total) using each of the smaller (85-sample) collections as a reference. The results were averaged over all 85 samples from each population. For easier interpretability of Figure 1, we grouped the populations into five superpopulations (African, American, East Asian, European and South Asian). The columns are labeled with the reference population. As one could expect, the compression is best when a dataset from the same superpopulation is used as a reference. Some populations, though, are quite a good reference for a sample of any ethnicity (for example, some African populations, such as ACB). The last column (beyond the matrix) shows the results for the larger collections (containing all populations except the one being compressed). Here, the diversity and size of the database work in favor of the sample compression ratio.

Fig. 1.

Comparison of single-sample GTShark compression for 24 populations (data: 1000GP3, Chromosome 11); sizes of sample archives [B]. The references labeled with population codes consist of 85 samples from that population. The ‘Coll.’ reference is a collection of 1955 samples from 23 populations (excluding the population of the compressed sample)

4 Conclusions

The proposed algorithm compressed large collections of genotype data significantly better than the existing methods and was also the fastest. Its unique feature is a mode in which a compressed collection of genotypes serves as a reference for external single-sample data. In this scenario, we were able to shrink a single human genotype down to about 65 KB, achieving better results than the former leader in compression ratio, TGC.

Funding

The work was supported by National Science Centre, Poland, project DEC-2017/25/B/ST6/01525 and by POIG.02.03.01-24-099/13 grant: ‘GeCONiI—Upper Silesian Center for Computational Science and Engineering’ (the infrastructure).

Conflict of Interest: none declared.

References

Bycroft, C. et al. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
Danecek, P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Danek, A. and Deorowicz, S. (2018) GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics, 34, 1834–1840.
Deorowicz, S. et al. (2013) Genome compression: a novel approach for large collections. Bioinformatics, 29, 2572–2578.
Deorowicz, S. et al. (2019) CoMSA: compression of protein multiple sequence alignment files. Bioinformatics, 35, 227–234.
Durbin, R. (2014) Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30, 1266–1272.
Lek, M. et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536, 285–291.
Li, H. (2016) BGT: efficient and flexible genotype query across many samples. Bioinformatics, 32, 590–592.
McCarthy, S. et al. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet., 48, 1279–1283.
Pavlichin, D. et al. (2013) The human genome contracts again. Bioinformatics, 29, 2199–2202.
Sudmant, P.H. et al. (2015) An integrated map of structural variation in 2,504 human genomes. Nature, 526, 75–81.
