Abstract

Research in the genomic sciences is confronted with the volume of sequencing and resequencing data increasing at a higher pace than that of data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to efficiently store sequencing and resequencing data is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS, offering the ability to compress sequences that GRS cannot handle, faster running times and compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz .

INTRODUCTION

Inspired by the Biocompress algorithm of Grumbach and Tahi ( 1 ), a myriad of algorithms for compressing genomic sequences has been proposed over the last two decades ( 2–16 ) [see ( 16 ) for a recent review]. The knowledge about genome structure that these compression algorithms provide, through their probabilistic modeling of genomic sequences, is likely to prove more relevant than the storage space reduction they actually deliver.

One of the most successful compression algorithms specifically designed for genomic sequences is XM, a statistical method proposed by Cao et al. ( 14 ), though other approaches may present competitive or even superior results for some classes of genomes ( 15 , 17 ). XM relies on a mixture of experts for providing symbol by symbol probability estimates that are fed to an arithmetic encoder. The XM algorithm comprises three types of experts: (i) order-2 Markov models; (ii) order-1 context Markov models, i.e. Markov models that rely on statistical information from a recent past (typically, the 512 previous symbols); (iii) the copy expert, which considers the next symbol as part of a copied region from a particular offset. The probability estimates provided by the set of experts are then combined using Bayesian averaging and sent to the arithmetic encoder.
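To make the idea of mixing experts concrete, the following minimal C sketch (an illustration only, not code from XM; the experts, weights and symbol probabilities are made-up examples) averages the per-symbol estimates of several experts, with weights that reflect how well each expert predicted the symbols already encoded:

    #include <stdio.h>

    #define NEXPERTS 3
    #define NSYMBOLS 4   /* a, c, g, t */

    /* Combine expert estimates by weighted (Bayesian) averaging.
     * w[i] is the current weight of expert i (summing to 1) and
     * p[i][s] is expert i's probability estimate for symbol s. */
    static void mix_experts(const double w[NEXPERTS],
                            double p[NEXPERTS][NSYMBOLS],
                            double mixed[NSYMBOLS])
    {
        for (int s = 0; s < NSYMBOLS; s++) {
            mixed[s] = 0.0;
            for (int i = 0; i < NEXPERTS; i++)
                mixed[s] += w[i] * p[i][s];
        }
    }

    /* After the true symbol is seen, reward the experts that assigned
     * it a high probability, keeping the weights normalized. */
    static void update_weights(double w[NEXPERTS],
                               double p[NEXPERTS][NSYMBOLS], int symbol)
    {
        double total = 0.0;
        for (int i = 0; i < NEXPERTS; i++) {
            w[i] *= p[i][symbol];
            total += w[i];
        }
        for (int i = 0; i < NEXPERTS; i++)
            w[i] /= total;
    }

    int main(void)
    {
        double w[NEXPERTS] = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
        double p[NEXPERTS][NSYMBOLS] = {
            { 0.70, 0.10, 0.10, 0.10 },  /* e.g. a copy-like expert  */
            { 0.25, 0.25, 0.25, 0.25 },  /* e.g. a low-order model   */
            { 0.40, 0.20, 0.20, 0.20 }
        };
        double mixed[NSYMBOLS];

        mix_experts(w, p, mixed);
        printf("mixed P(a) = %.3f\n", mixed[0]);
        update_weights(w, p, 0);         /* the true symbol was 'a'  */
        printf("new weight of expert 0 = %.3f\n", w[0]);
        return 0;
    }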

Common practice continues to rely on standard and general purpose data compression methods, e.g. gzip or bzip2 . However, this practice may be close to a turning point, as the rate at which genomic data is being produced is clearly overtaking the rate of increase in storage resources and communication bandwidth.

The development of high-throughput sequencing technologies that offer dramatically reduced sequencing costs enables possibilities hardly foreseeable a decade ago ( 18 ). Large-scale projects such as the 1000 Genomes Project ( http://www.1000genomes.org/ ) and The Cancer Genome Atlas ( http://cancergenome.nih.gov/ ), as well as prizes that reward cheaper, faster, less error-prone and higher-throughput sequencing methodologies (e.g. http://genomics.xprize.org/ ), are paving the way to individual genomics and personalized medicine ( 19 ). As such, huge volumes of genomic data will be produced in the near future. However, as a very significant part of the genome is shared among individuals of the same species, these data will be mostly redundant. Some ideas for storing and communicating redundant genomic data have already been put forward, based on, for example, single nucleotide polymorphism (SNP) databases ( 20 ), or insert and delete operations ( 21 ).

Recently, Wang et al. ( 22 ) proposed a compression tool, GRS, that is able to compress a sequence using another one as reference without requiring any additional information about those sequences, such as a reference SNP map. RLZ, the algorithm proposed by Kuruppu et al. ( 23 ), also performs relative Lempel–Ziv compression of DNA sequences, though its current implementation cannot fully handle sequences that have characters outside the {a,c,g,t,n} set.

Other approaches propose encoding the sequence reads output by massively parallel sequencing experiments ( 24–27 ), which is also a very important problem. The compression of short reads shares some aspects with the problem addressed here, though it needs to cope with other requirements, such as the efficient representation of base calling quality information.

In this article, we describe GReEn (Genome Resequencing Encoding), a new tool for compressing genome resequencing data using a reference genome sequence. As such, it addresses the same problem as GRS ( 22 ), RLZ ( 23 ) or XM ( 14 ). However, as will be demonstrated, GReEn generally outperforms GRS in both storage space requirements and running time (although GRS handles some sequences very effectively), and it overcomes both the restricted alphabet support and the inferior compression performance of RLZ and XM.

GReEn is a compression tool based on arithmetic coding that handles arbitrary alphabets. Its running time depends only on the size of the sequence being compressed. Moreover, it provides compression gains of over 100-fold for some sequences when compared to GRS, and even larger gains when compared to RLZ. Finally, GReEn handles, without restriction, sequences that cannot be compressed with GRS due to excessive divergence from the reference sequence.

MATERIALS AND METHODS

Dataset

We use the same data as in ( 22 ), for ease of comparison with GRS: two versions of the first individual Korean genome sequenced, KOREF_20090131 and KOREF_20090224 ( 28 ); two versions of the genome of the thale cress Arabidopsis thaliana , TAIR8 and TAIR9 ( 29 , 30 ); and two versions of the rice Oryza sativa genome, TIGR5.0 and TIGR6.0 ( 31 ). We also present results for four additional human genome assemblies, namely, the genome of J. Craig Venter referred to as HuRef ( 32 ), the Celera alternate assembly referred to as Celera ( 33 ), the genome of a Han Chinese individual referred to as YH ( 34 ), and the human genome reference assembly build 37.p2, as made available by the National Center for Biotechnology Information and referred to as NCBI37 ( 35 ).

Software availability

The codec (encoder/decoder) is implemented in the C programming language and is freely available for non-commercial purposes. It can be downloaded at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz .

The compression method

As with GRS ( 22 ), GReEn relies on a reference sequence for compressing the target sequence. The reference sequence is generally only slightly different from the target sequence, although this is not mandatory. In fact, it is possible to use a sequence from a different species as reference, though, as expected, the compression efficiency depends on the degree of similarity between both reference and target sequences. Moreover, in order to recover the target sequence, the decoder needs access to exactly the same reference sequence as that used by the encoder.

The codec developed in GReEn is able to handle arbitrary alphabets, although it automatically ignores all lines beginning with the ‘>’ character, as well as all newline characters. We denote by $\mathcal{A}$ the set of all different characters, or symbols, found in the target sequence, and by $|\mathcal{A}|$ the number of elements of $\mathcal{A}$, i.e. the alphabet size.
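As a rough illustration of this pre-processing step (a minimal sketch, not the actual GReEn parser; the file name is a placeholder), the following C fragment skips header lines and newlines while accumulating the character counts of the target sequence, from which the alphabet and the relative frequencies θ(c) used below can be obtained:

    #include <stdio.h>

    /* Count character occurrences in the target sequence, ignoring
     * lines that begin with '>' and all newline characters. */
    static long count_symbols(FILE *fp, long counts[256])
    {
        long total = 0;
        int ch, at_line_start = 1, in_header = 0;

        while ((ch = fgetc(fp)) != EOF) {
            if (ch == '\n') {            /* newlines are never encoded */
                at_line_start = 1;
                in_header = 0;
                continue;
            }
            if (at_line_start && ch == '>')
                in_header = 1;           /* header line: ignore it     */
            at_line_start = 0;
            if (in_header)
                continue;
            counts[(unsigned char)ch]++; /* part of the target sequence */
            total++;
        }
        return total;
    }

    int main(int argc, char *argv[])
    {
        long counts[256] = { 0 };
        FILE *fp = fopen(argc > 1 ? argv[1] : "target.fa", "r");
        if (!fp) { perror("fopen"); return 1; }

        long n = count_symbols(fp, counts);
        fclose(fp);

        /* The alphabet is the set of characters with non-zero counts;
         * theta(c) = counts[c] / n is the relative frequency used by
         * the static (fallback) model. */
        for (int c = 0; c < 256; c++)
            if (counts[c] > 0)
                printf("%c: theta = %.6f\n", c, (double)counts[c] / n);
        return 0;
    }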

Each character of the target sequence is encoded by an arithmetic encoder ( 36 ). As with any arithmetic encoder, besides the symbol to encode, it is necessary to provide the probability distribution of the symbols. One major advantage of arithmetic coding is its ability to adjust the probabilistic model as the encoding proceeds, in response to the changing probability distribution from one encoded symbol to the next.

We denote by θ( c ) the relative frequency of character $c \in \mathcal{A}$ in the target sequence, and by $P_n(c)$ the estimated probability of character $c \in \mathcal{A}$ when encoding the character at position $n$ of the target sequence. The set of probabilities $\{P_n(c), c \in \mathcal{A}\}$ is passed down to the arithmetic coder. Note that, whereas the θ( c ) values are fixed for a given target sequence, the $P_n(c)$ values usually change along the coding process. For a sequence $x_1 x_2 \ldots x_N$, with $N$ characters, the arithmetic coder produces a bitstream with approximately

$$-\sum_{n=1}^{N} \log_2 P_n(x_n) \quad (1)$$

bits, which demonstrates the importance of providing good probability estimates to the arithmetic coder.
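The following toy C fragment illustrates Equation (1) (a sketch with made-up probabilities, not part of the codec): it accumulates the ideal number of bits output when the character coded at each position was assigned probability $P_n(x_n)$, showing how sharper estimates translate into fewer bits.

    #include <math.h>
    #include <stdio.h>

    /* Ideal code length, in bits, of a sequence whose n-th character
     * was assigned probability p[n] by the model. */
    static double code_length_bits(const double *p, int n)
    {
        double bits = 0.0;
        for (int i = 0; i < n; i++)
            bits += -log2(p[i]);
        return bits;
    }

    int main(void)
    {
        /* Probabilities assigned to the characters actually coded. */
        double sharp[] = { 0.95, 0.90, 0.97, 0.99 };  /* good model    */
        double flat[]  = { 0.25, 0.25, 0.25, 0.25 };  /* uniform model */

        printf("sharp model  : %.2f bits\n", code_length_bits(sharp, 4));
        printf("uniform model: %.2f bits\n", code_length_bits(flat, 4));
        return 0;
    }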

The probability distribution, Pn ( c ), can be provided by two different sources: (i) an adaptive model (the copy model) which assumes that the characters of the target sequence are an exact copy of (parts of) the reference sequence; (ii) a static model that relies on the frequencies of the characters in the target sequence, i.e. θ( c ). The adaptive model is the main statistical model, as it allows a high compression rate of the target sequence, particularly in areas where the target and reference sequences are highly similar. However, this adaptive, or copy, model will at times not be used (the reasons why will be detailed shortly), and the static model will act as a fallback mechanism, feeding the arithmetic coder with the required probability distribution.

The copy model

The copy model is inspired by the copy expert of the XM DNA compression method ( 14 ), relying on a pointer to a position in the reference sequence that has a ‘good chance’ of containing a character identical to that being encoded. As encoding of the target sequence proceeds, the pointer associated with the copy model may be repositioned to different locations of the reference sequence. When this repositioning occurs, all parameters of the model are reset. Besides accounting for the number of times, $t_n$, that the copy model was used after the previous repositioning, two more counters are maintained: $c_n$ stores the number of times the model guessed the correct character, including the correct case (uppercase or lowercase), and $s_n$ records the number of times the model guessed the character but failed the case (for example, it guessed ‘A’ but the correct character was ‘a’).
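A minimal C sketch of this bookkeeping follows (the structure and field names are illustrative assumptions, not those of the GReEn source): the model keeps the pointer into the reference plus the three counters, and the number of outright prediction failures follows from them.

    #include <ctype.h>

    /* State of the copy model since its last repositioning. */
    typedef struct {
        long ref_pos;  /* pointer into the reference sequence          */
        long t;        /* predictions made since repositioning (t_n)   */
        long c_hits;   /* correct character and correct case (c_n)     */
        long s_hits;   /* correct character but wrong case (s_n)       */
    } CopyModel;

    /* Advance the model after one target character has been coded. */
    static void copy_model_update(CopyModel *m, char predicted, char actual)
    {
        if (predicted == actual)
            m->c_hits++;
        else if (tolower((unsigned char)predicted) ==
                 tolower((unsigned char)actual))
            m->s_hits++;
        m->t++;
        m->ref_pos++;  /* the pointer moves forward along the reference */
    }

    /* Number of outright prediction failures (wrong character). */
    static long copy_model_failures(const CopyModel *m)
    {
        return m->t - m->c_hits - m->s_hits;
    }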

Figure 1 exemplifies the operation of the copy model. Consider that the most recent repositioning occurred at position 341 587 of the reference sequence, corresponding to position 327 829 of the target sequence (in this example, the reference is ahead of the target, but this may be different in other cases). Assuming the codec is going to compress the character marked with ‘?’, the character predicted by the copy model would be ‘G’ (the one under the ‘Current position’ arrow), with $t_n = 12$, $c_n = 5$ and $s_n = 6$. The characters linked by the dashed arrow indicate a prediction error (the predicted character was ‘A’, whereas the correct one was ‘G’).

Figure 1.

The copy model. In this example, the copy model was restarted at position 341 587 of the reference sequence, corresponding to position 327 829 of the target sequence. Since then, it has correctly predicted 5 characters, if the case is considered, and a total of 11 characters if the case is ignored. The dashed arrow indicates a failed prediction. According to this example, the next character to be predicted is ‘G’.

Computing the probabilities

Let us denote by $\hat{c}$ the character predicted by the copy model (‘G’ in the example in Figure 1 ) and by $\tilde{c}$ the case-converted version of $\hat{c}$ (‘g’ according to the example in Figure 1 ). If both $\hat{c} \in \mathcal{A}$ and $\tilde{c} \in \mathcal{A}$ (note that characters of the reference sequence that do not appear in the target sequence do not belong to $\mathcal{A}$), the probabilities that are passed down to the arithmetic coder are given by

$$P_n(c) = \begin{cases} \dfrac{c_n + 1}{t_n + 3}, & c = \hat{c}\\[2mm] \dfrac{s_n + 1}{t_n + 3}, & c = \tilde{c}\\[2mm] \left(1 - \dfrac{c_n + s_n + 2}{t_n + 3}\right) \dfrac{\theta(c)}{1 - \theta(\hat{c}) - \theta(\tilde{c})}, & \text{otherwise.} \end{cases} \quad (2)$$

The first two branches of Equation ( 2 ) correspond to Laplace probability estimators of the form

$$P(E_k) = \frac{n(E_k) + 1}{\sum_{i=1}^{K} n(E_i) + K}, \quad (3)$$

where the $E_k$ form a set of $K$ collectively exhaustive and mutually exclusive events, and $n(E_k)$ denotes the number of times that event $E_k$ has occurred in the past. In Equation ( 2 ) we considered three events, namely, $c = \hat{c}$, $c = \tilde{c}$ and $c \in \mathcal{A} \setminus \{\hat{c}, \tilde{c}\}$. The third branch of Equation ( 2 ) defines how the probability assigned to the third event, i.e. $1 - P_n(\hat{c}) - P_n(\tilde{c})$, is distributed among the individual characters of $\mathcal{A} \setminus \{\hat{c}, \tilde{c}\}$. This distribution is proportional to the relative frequencies of the characters, θ( c ), after discounting the effect of treating $\hat{c}$ and $\tilde{c}$ differently.

If only $\hat{c}$ or $\tilde{c}$ belongs to $\mathcal{A}$, the probabilities are given by

$$P_n(c) = \begin{cases} \dfrac{h_n + 1}{t_n + 2}, & c = c'\\[2mm] \left(1 - \dfrac{h_n + 1}{t_n + 2}\right) \dfrac{\theta(c)}{1 - \theta(c')}, & \text{otherwise,} \end{cases} \quad (4)$$

where $c' = \hat{c}$ and $h_n = c_n$ if $\hat{c} \in \mathcal{A}$, or $c' = \tilde{c}$ and $h_n = s_n$ if $\tilde{c} \in \mathcal{A}$. As such, we have considered only two events, namely, $c = c'$ and $c \in \mathcal{A} \setminus \{c'\}$, and the distribution of probabilities among the characters of $\mathcal{A} \setminus \{c'\}$ is performed as before.

Finally, if both $\hat{c} \notin \mathcal{A}$ and $\tilde{c} \notin \mathcal{A}$, the probabilities communicated to the arithmetic coder are simply the character frequencies of the target sequence, i.e.

$$P_n(c) = \theta(c), \quad \forall c \in \mathcal{A}. \quad (5)$$
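The following C sketch shows one possible way of evaluating Equations (2)–(5) as reconstructed above (an illustration under the stated notation, not the actual GReEn code; in the real codec the probabilities must additionally be mapped to the integer counts required by the arithmetic coder).

    /* Probability of character c, given the copy-model prediction 'pred'
     * and its case-converted version 'pred_cc'. t, c_hits and s_hits are
     * the counters t_n, c_n and s_n; theta[] holds the target-sequence
     * character frequencies; in_alphabet[c] is non-zero if c occurs in
     * the target sequence. */
    static double copy_model_prob(unsigned char c, unsigned char pred,
                                  unsigned char pred_cc, long t,
                                  long c_hits, long s_hits,
                                  const double theta[256],
                                  const int in_alphabet[256])
    {
        int pred_in = in_alphabet[pred], cc_in = in_alphabet[pred_cc];

        if (pred_in && cc_in) {                      /* Equation (2) */
            double p_pred = (c_hits + 1.0) / (t + 3.0);
            double p_cc   = (s_hits + 1.0) / (t + 3.0);
            if (c == pred)    return p_pred;
            if (c == pred_cc) return p_cc;
            return (1.0 - p_pred - p_cc) *
                   theta[c] / (1.0 - theta[pred] - theta[pred_cc]);
        }
        if (pred_in || cc_in) {                      /* Equation (4) */
            unsigned char cp = pred_in ? pred : pred_cc;
            long hits = pred_in ? c_hits : s_hits;
            double p_cp = (hits + 1.0) / (t + 2.0);
            if (c == cp) return p_cp;
            return (1.0 - p_cp) * theta[c] / (1.0 - theta[cp]);
        }
        return theta[c];                             /* Equation (5) */
    }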

Starting and stopping the copy model

Typically, the codec starts by constructing a hash table with the occurrences and corresponding positions in the reference sequence of all k -mers of a given size (the default size is k  = 11, but it can be changed using a command line option). Figure 2 shows an example where k  = 8 and k -mers ‘CTNANGTC’ and ‘AAAGTTGG’ have been mapped by the hashing function into the same index (index 4 529 821). As usual in hashing schemes, disambiguation is achieved by direct comparison of the k -mers that originated the index, which have to be stored in the data structure in order to be compared. Using the hash table, it is easy to find in the reference sequence the characters that come right after all occurrences of a given k -mer.
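A minimal C sketch of such a table follows (simplified and illustrative: the bucket count, hash function and growth policy are arbitrary choices, not those of the GReEn implementation). Each entry stores the k-mer itself, so that k-mers hashed to the same index can be disambiguated by direct comparison, together with the list of positions where it occurs in the reference.

    #include <stdlib.h>
    #include <string.h>

    #define K        11        /* default k-mer size                   */
    #define NBUCKETS (1 << 20) /* number of hash buckets (a choice)    */

    typedef struct Entry {
        char kmer[K + 1];      /* the k-mer itself, for disambiguation */
        long *positions;       /* where it occurs in the reference     */
        int npos, cap;
        struct Entry *next;    /* chaining resolves index collisions   */
    } Entry;

    static Entry *buckets[NBUCKETS];

    static unsigned long hash_kmer(const char *s)
    {
        unsigned long h = 5381;            /* djb2-style string hash   */
        for (int i = 0; i < K; i++)
            h = h * 33 + (unsigned char)s[i];
        return h % NBUCKETS;
    }

    /* Record one occurrence of the k-mer starting at ref[pos]. */
    static void insert_kmer(const char *ref, long pos)
    {
        unsigned long h = hash_kmer(ref + pos);
        Entry *e;

        for (e = buckets[h]; e; e = e->next)
            if (memcmp(e->kmer, ref + pos, K) == 0)
                break;
        if (!e) {
            e = calloc(1, sizeof *e);
            memcpy(e->kmer, ref + pos, K);
            e->next = buckets[h];
            buckets[h] = e;
        }
        if (e->npos == e->cap) {
            e->cap = e->cap ? 2 * e->cap : 4;
            e->positions = realloc(e->positions, e->cap * sizeof(long));
        }
        e->positions[e->npos++] = pos;
    }

    /* Return the entry of a given k-mer, or NULL if it never occurs. */
    static Entry *lookup_kmer(const char *kmer)
    {
        for (Entry *e = buckets[hash_kmer(kmer)]; e; e = e->next)
            if (memcmp(e->kmer, kmer, K) == 0)
                return e;
        return NULL;
    }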

Figure 2.

Data organized in a hash table.


Before encoding a new character from the target sequence, the performance of the copy model, if in use, is checked. If $t_n - c_n - s_n > m_f$, where $m_f$ is a parameter that indicates the maximum number of prediction failures allowed, the copy model is stopped. The default value for $m_f$ is zero, but this may be changed through a command line option.

Following this performance check, if the copy model is not in use, an attempt is made to restart the copy model before compressing the character. This is accomplished by looking for the positions in the reference sequence where the k -mer composed of the k -most-recently-encoded characters occurs. If more than one position is found, the one closest to the encoding position is chosen. If none is found, the current character is encoded using the static model and a new attempt for starting a new copy model is performed after advancing one position in the target sequence.
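For illustration, the choice among candidate restart positions could look like the following C sketch (a hypothetical helper; the positions array would come from the hash table lookup described above):

    #include <stdlib.h>

    /* Given the positions where the k-mer formed by the k most recently
     * encoded characters occurs in the reference, pick the one closest
     * to the current encoding position (or -1 if there is none). */
    static long choose_restart_position(const long *positions, int npos,
                                        long current_pos)
    {
        long best = -1, best_dist = -1;

        for (int i = 0; i < npos; i++) {
            long dist = labs(positions[i] - current_pos);
            if (best < 0 || dist < best_dist) {
                best = positions[i];
                best_dist = dist;
            }
        }
        return best;
    }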

Special case for equal size sequences

When the reference and target sequences have the same size, the codec assumes that both sequences are aligned. Therefore, whenever the copy model is restarted, it is forced to use the current encoding position as reference. This avoids constructing the hash table, hence, increasing the codec speed and generally producing better results. However, this mode of operation may be overridden by a command line option, as it may lead to poor performance for same-sized sequences that are not aligned.

RESULTS

We compare the performance of the method proposed here, GReEn, with that of GRS ( 22 ), the most recently proposed approach for compressing genome resequencing data that handles sequences drawn from arbitrary alphabets. We also include results of the RLZ algorithm ( 23 ), although only for some sequences, due to its restriction to sequences drawn from the alphabet {a, c, g, t, n}.

Tables 1–3 present both the number of bytes produced and the time taken by the respective methods for compressing the sequences used in ( 22 ). The results for both the GRS and RLZ methods have been obtained using the software publicly provided by the authors. All experimental results were obtained using an Intel Core i7-2620M laptop computer at 2.7 GHz with 8 GB of memory and running Ubuntu 11.04. The best results are highlighted in boldface.

Table 1.

Arabidopsis thaliana genome: compression of TAIR9 using TAIR8 as reference

Chr    |A|   Size          GRS Bytes   GRS Secs   GReEn Bytes   GReEn Secs
1      11    30 427 671    715                    1551          13
2      11    19 698 289    385                    937
3      10    23 459 830    2989                   1097
4            18 585 056    1951                   2356
5            26 975 502    604                    618           11
Total        119 146 348   6644        28         6559          48

Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The |A| column indicates the size of the alphabet of the target sequence.

Table 1 displays the compression results for the TAIR9 version of the thale cress genome using the TAIR8 version as reference. Globally, GReEn required 6559 bytes for storing the sequences, whereas GRS needed a little more (6644 bytes). While GReEn took 48 s to encode the data, GRS needed only 28 s. Therefore, in this case, GRS is equivalent to the proposed method in terms of storage space, but faster.

Table 2 displays the compression results for the TIGR6.0 version of the rice genome using the TIGR5.0 version as reference. In this case, the outcome differs dramatically from the previous results ( Table 1 ). The first significant difference can be observed in both the compressed size and compression time of chromosome 1: 1 502 040 bytes in 708 s using GRS versus 4972 bytes in 18 s (more than a 300-fold improvement) using GReEn. A similarly significant difference can be observed in chromosome 11 (with a gain of over 160-fold). Globally, GRS required 4 901 902 bytes and 2290 s, whereas GReEn was able to store the entire genome in just 125 535 bytes (39-fold improvement) using only 123 s of computing time.

Table 2.

Oryza sativa genome: compression of TIGR6.0 using TIGR5.0 as reference

Chr    |A|   Size          RLZ Bytes   RLZ Secs   GRS Bytes   GRS Secs   GReEn Bytes   GReEn Secs
1            43 268 879    185 715     35         1 502 040   708        4 972         18
2            35 930 381    210 295     28         1 409                  1 906         14
3            36 406 689    –           –          47 764      28         17 890        15
4            35 278 225    175 663     27         36 145      20         6 750         14
5            29 894 789    120 625     21         6 177                  5 539         12
6            31 246 789    61 038      23         14                     482
7            29 696 629    167 822     21         4 067                  2 448         12
8            28 439 308    109 608     20         118 246     43         9 507         11
9            23 011 239    44 953      16         14                     366
10           23 134 759    –           –          788 542     339        60 449
11     11    28 512 666    –           –          2 397 470   1 122      14 797        12
12           27 497 214    53 714      19         14                     429
Total        372 317 567   –           –          4 901 902   2 290      125 535       123

Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The |A| column indicates the size of the alphabet of the target sequence. The missing RLZ values correspond to sequences with characters that cannot be handled by the current implementation of this algorithm.

The main conclusion from these results is that, under certain conditions not yet investigated, GRS fails to find large-scale similarities between the two sequences. Therefore, the number of bytes generated is much larger than necessary and, probably as a consequence, the running time explodes. Moreover, when the target sequence is exactly equal to the reference sequence (as in chromosomes 6, 9 and 12), GRS reports a number of bytes that is essentially zero [in ( 22 ) they are shown as zero, although we opted to display the number of bytes effectively used], whereas GReEn uses a few hundred bytes. However, if critical, this could easily be reduced to almost zero by comparing the sequences before starting the encoding (note that, because the probabilities communicated to the arithmetic coder have to be represented as integers, there is a lower bound on the number of bits that can be generated in each coding step).

Table 3 displays the compression results for the KOREF_20090224 version of the human genome using the KOREF_20090131 version as reference. In this case, GReEn gives consistently better results, both in terms of storage requirements and computing time. In fact, this latter aspect deserves a special note because, contrary to GRS, the running time of GReEn grows linearly with the size of the sequence. Therefore, GReEn allows a good a priori estimate of the time required to compress a given sequence.

Table 3.

Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference

Chr    Size            GRS Bytes    GRS Secs   GReEn Bytes   GReEn Secs
1      247 249 719     1 336 626    222        1 225 767     32
2      242 951 149     1 354 059    230        1 272 105     31
3      199 501 827     1 011 124    165        971 527       26
4      191 273 063     1 139 225    193        1 074 357     25
5      180 857 866     988 070      173        947 378       23
6      170 899 992     906 116      146        865 448       22
7      158 821 424     1 096 646    167        998 482       20
8      146 274 826     764 313      125        729 362       19
9      140 273 252     864 222      134        773 716       18
10     135 374 737     768 364      122        717 305       17
11     134 452 384     755 708      119        716 301       17
12     132 349 534     702 040      114        668 455       17
13     114 142 980     520 598      87         490 888       15
14     106 368 585     484 791      81         451 018       14
15     100 338 915     496 215      79         453 301       13
16     88 827 254      567 989      91         510 254       11
17     78 774 742      505 979      81         464 324       10
18     76 117 153      408 529      71         378 420       10
19     63 811 651      399 807      62         369 388
20     62 435 964      282 628      48         266 562
21     46 944 323      226 549      40         203 036
22     49 691 432      262 443      41         230 049
M      16 571          183                     127
X      154 913 754     3 231 776    500        2 712 153     20
Y      57 772 954      592 791      96         481 307
Total  3 080 436 051   19 666 791   3 188      17 971 030    396

Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The size of the alphabet in the target sequence is 21 for all chromosomes, except for the M chromosome where it is 11.

Besides considering the datasets in ( 22 ), we also investigated four human genome assemblies, in order to provide a more comprehensive comparison of the GRS and GReEn compression approaches. However, this comparison turned out to be limited, because GRS failed to compress most of the sequences due to an excessive difference between the reference and target sequences. Table 4 displays the results obtained when the YH genome was compressed using KOREF_20090224 as reference. It is clear that GRS gave unacceptable results, both regarding the size of the compressed sequences and the time required to compress them, for the few chromosomes that it could compress.

Table 4.

Homo sapiens genome: compression of YH using KOREF_20090224 as reference

Chr    Size            GRS Bytes    GRS Secs   GReEn Bytes   GReEn Secs
1      247 249 719     –            –          2 349 124     22
2      242 951 149     –            –          2 420 007     22
3      199 501 827     17 410 946   2879       1 730 477     18
4      191 273 063     –            –          1 877 056     17
5      180 857 866     –            –          1 792 278     16
6      170 899 992     25 815 446   7526       1 588 739     15
7      158 821 424     –            –          1 820 425     14
8      146 274 826     –            –          1 358 770     13
9      140 273 252     –            –          1 476 495     13
10     135 374 737     –            –          1 353 193     12
11     134 452 384     –            –          1 274 433     12
12     132 349 534     16 136 610   2120       1 174 966     12
13     114 142 980     11 227 954   3181       866 266       10
14     106 368 585     –            –          826 672       10
15     100 338 915     –            –          892 429
16     88 827 254      –            –          1 015 246
17     78 774 742      –            –          864 710
18     76 117 153      13 187 892   4061       713 787
19     63 811 651      –            –          589 422
20     62 435 964      8 409 776    1449       493 404
21     46 944 323      726 269      664        374 383
22     49 691 432      –            –          444 932
M      16 571          321                     127
X      154 913 754     –            –          3 258 188     11
Y      57 772 954      –            –          859 688

Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The missing values are due to the inability of GRS to compress sequences that differ from the reference by more than a predefined amount.

Table 5 displays the compression results, using GReEn, for four different human genome assemblies (HuRef, Celera, YH and KOREF_20090224) using the NCBI37 version as reference. As this article is about sequence compression, not sequence analysis, we refrain from elaborating too much on the differences observed. Nevertheless, we hint at what we believe may be possible explanations. First, the HuRef and Celera assemblies are not resequencing assemblies and this, per se, accounts for their larger compressed sizes with respect to the reference assembly.

Table 5.

Homo sapiens genome: compression with GReEn of the HuRef, Celera, YH and KOREF_20090224 versions using the NCBI37 as reference

Chr   HuRef       Celera      YH          KOREF
1     6 652 184   5 106 720   1 979 661   2 074 258
2     4 109 606   3 271 105   2 205 102   1 833 388
3     1 718 683   1 125 544   2 868 462   2 808 941
4     2 440 255   1 675 878   1 815 309   1 844 448
5     2 084 630   1 962 869   1 327 235   1 289 709
6     1 926 853   1 846 101   1 460 666   1 436 168
7     2 216 643   2 345 859   1 381 234   1 511 664
8     1 755 512   1 084 584   1 323 845   1 310 275
9     3 939 856   2 906 969   1 049 456   1 152 997
10    2 235 388   2 025 459   1 075 899   1 237 129
11    1 565 536   1 459 854   1 068 335   1 104 478
12    1 495 696   1 559 635   1 199 709   1 260 183
13    4 429 154   3 023 681   1 065 006   1 052 608
14    3 480 676   2 325 885   803 902     854 166
15    3 358 239   2 944 889   946 244     958 050
16    1 848 172   2 319 629   747 166     802 956
17    1 091 917   1 163 879   955 918     905 359
18    893 600     625 364     726 165     765 927
19    697 898     621 943     2 777 894   2 832 746
20    611 521     433 253     468 215     490 498
21    884 601     415 412     434 679     481 691
22    929 001     655 089     404 354     431 417
X     3 159 205   3 259 716   492 893     740 530
Y     565 746     1 157 801   138 838     279 461

Number of bytes after compressing each sequence. For ease of comparison we transformed all characters to lowercase and mapped all unknown nucleotides to ‘n’ before compression. Therefore, after this transformation, all sequences were composed only of characters from the alphabet {a,c,g,t,n}.

The HuRef assembly is an individual genome sequenced with capillary-based whole-genome shotgun technologies and de novo assembled with the Celera Assembler. Hence, this assembly is the farthest apart from the reference NCBI37 assembly, i.e. it requires the largest number of bytes for its compression.

The Celera assembly represents one of the two pioneering efforts in sequencing a human genome. Its consensus sequence is derived from the genomes of five individuals using a capillary-based whole-genome shotgun sequencing approach. Unlike the reference assembly generated by the International Human Genome Sequencing Consortium (here represented by the NCBI37 assembly), which used a clone-based hierarchical shotgun strategy that is more likely to produce a high-quality finished genome sequence, because the sequence assembly is local and anchored to the genome, the Celera Genomics Sequencing Team opted for a whole-genome shotgun strategy in which sequence contigs and scaffolds must be individually anchored to the genome, rendering assembly more complex and more prone to long-range misassembly. Moreover, this whole-genome shotgun assembly resulted from a combined analysis of the genomic data generated by the Celera Genomics Sequencing Team and some data generated by the International Human Genome Sequencing Consortium; hence, it has been claimed that this Celera assembly is not a totally independent human genome assembly ( 37 ). We believe this may partly explain why the compressed sizes for this assembly in Table 5 are smaller than those of the HuRef assembly.

The YH assembly is an individual genome based on resequencing data from massively parallel sequencing technologies and assembled with the Short Oligonucleotide Alignment Program, using the NCBI human genome assembly as reference. Essentially, it is a map of SNPs with respect to the reference assembly, hence it displays very low compression values in Table 5 .

The KOREF_20090224 assembly is also an individual genome based on resequencing data from massively parallel sequencing technologies and assembled with the Mapping and Assembly with Qualities program, using the NCBI human genome assembly as reference. As with the YH assembly, resequencing renders the resulting assembly very redundant with respect to the reference (NCBI37) assembly, hence also displaying very low compression values in Table 5 .

The compression values for chromosome 19 in the YH and KOREF_20090224 assemblies are unexpectedly high. This chromosome has the highest GC content (48.4%) and the lowest (median) sequence depth (28-fold) in the YH genome ( 34 ), hence constraining the quality of the final sequence. Not surprisingly, chromosome 19 in the YH genome has a very large number of unsequenced bases (‘N’ symbols in our encoding), more than twice that of the reference NCBI37 assembly. Chromosome 19 in the KOREF_20090224 assembly faces the same hurdles, which we assume to be a consequence of the similar sequencing methodology.

Finally, Table 6 displays again the compression results for the KOREF_20090224 version of the human genome using the KOREF_20090131 version as reference. However, to allow a comparison of GReEn with GRS and RLZ on a larger genome, we converted the sequences to the {a,c,g,t,n} alphabet.

Table 6.

Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference

Chr     RLZ         GRS         GReEn
1       591 629     152 388     90 555
2       576 769     146 754     89 440
3       472 814     117 544     72 708
4       471 157     134 628     83 611
5       428 287     108 407     66 597
6       411 404     109 866     67 264
7       395 524     119 223     71 898
8       350 337     94 139      56 650
9       357 584     119 647     68 607
10      335 464     101 486     60 303
11      326 836     91 380      54 966
12      320 444     89 170      55 408
13      266 378     64 313      36 962
14      248 165     58 865      34 245
15      235 094     56 569      32 693
16      217 748     60 580      35 315
17      193 700     55 582      33 836
18      182 604     48 098      29 191
19      162 826     53 355      30 505
20      149 403     38 114      22 969
21      112 822     29 048      16 620
22      119 791     32 562      18 423
M       56          75          54
X       428 878     224 997     129 497
Y       150 901     61 306      33 312
Total   7 506 615   2 168 096   1 291 629

Number of bytes after compressing each sequence. To allow the comparison with RLZ and GRS, all characters were transformed to lowercase before compression and all unknown nucleotides were mapped to ‘n’. Therefore, after this transformation, all sequences were composed only of characters from the set {a,c,g,t,n}.

DISCUSSION

The GRS tool, recently introduced by Wang et al. ( 22 ) for compressing DNA resequencing data using a reference sequence, allows data storage space requirements to be significantly reduced. However, this tool seems to be effective only when the target sequence is very similar to the reference sequence, preventing the compression of many sequences of interest. Moreover, as we have shown, for example, for chromosomes 1 and 11 of the TIGR6.0 version of the rice genome, it may fail to give reasonable results even for similar sequences. Another drawback of GRS is that the encoding time does not depend only on the sequence size, but mainly on the similarity between the target and reference sequences (lower similarity implying longer compression times), making the time required to compress a given sequence largely unpredictable.

To overcome these limitations, we propose a statistical compression method that uses a probabilistic copy model. The probabilities are estimated for every character of the target sequence and are used to feed an arithmetic coder. The compression tool has two control parameters, namely, the size of the k-mer that is used for searching copies (with a default value of k = 11) and the number of prediction failures that are tolerated by the copy model before it is restarted (with a default value of 0). Changing these parameters may change the performance of the codec, degrading the performance for some sequences while improving it for others. The decision of whether to try to optimize these parameters or, as we have done when producing the experimental results included in this article, to use the default values is left to the user.

CONCLUSION

In this article, we described a computational tool, GReEn, aimed at compressing genome resequencing data using another sequence as reference. This tool is able to handle arbitrary alphabets and does not pose any restrictions or requirements on the sequences to compress. Several examples of its efficiency in compressing genomic data, and of its improvements with respect to other recently proposed tools, have been included, making evident the practical interest of the proposed tool.

With the generation of increasingly larger volumes of genome sequencing and resequencing data, and the increasing costs associated with storing and transmitting those data, compression tools that efficiently recognize redundancies are in demand. However, the interest in such compression methodologies goes beyond data storage and communication. Because they embody a probabilistic model of the underlying genomic sequence(s), compression tools reveal similarities and differences that are paramount for studies of human genomic variation between individuals and, hence, key for progress in personalized medicine efforts.

FUNDING

European Fund for Regional Development (FEDER) through the Operational Program Competitiveness Factors (COMPETE); Portuguese Foundation for Science and Technology (FCT) through project grants FCOMP-01-0124-FEDER-010099 (FCT reference PTDC/EIA-EIA/103099/2008) and FCOMP-01-0124-FEDER-022682 (FCT reference PEst-C/EEI/UI0127/2011); European Social Fund (to S.P.G.); Portuguese Ministry of Education and Science (to S.P.G.). Funding for open access charge: Portuguese Foundation for Science and Technology (FCT) project grant FCOMP-01-0124-FEDER-022682 (FCT reference PEst-C/EEI/UI0127/2011).

Conflict of interest statement . None declared.

REFERENCES

1. Grumbach, S. and Tahi, F. (1993) Compression of DNA sequences. Proceedings of the Data Compression Conference, DCC-93, Snowbird, Utah. IEEE, pp. 340–350.
2. Grumbach, S. and Tahi, F. (1994) A new challenge for compression algorithms: genetic sequences. Inform. Process. Manag., 30, 875–886.
3. Rivals, E., Delahaye, J.-P., Dauchet, M. and Delgrange, O. (1996) A guaranteed compression scheme for repetitive DNA sequences. Proceedings of the Data Compression Conference, DCC-96, Snowbird, Utah. IEEE, p. 453.
4. Loewenstern, D. and Yianilos, P.N. (1997) Significantly lower entropy estimates for natural DNA sequences. Proceedings of the Data Compression Conference, DCC-97, Snowbird, Utah. IEEE, pp. 151–160.
5. Chen, X., Kwong, S. and Li, M. (1999) A compression algorithm for DNA sequences and its applications in genome comparison. In Asai, K., Miyano, S. and Takagi, T. (eds), Genome Informatics 1999: Proceedings of the 10th Workshop, Tokyo, Japan. Universal Academy Press, pp. 51–61.
6. Matsumoto, T., Sadakane, K. and Imai, H. (2000) Biological sequence compression algorithms. In Dunker, A.K., Konagaya, A., Miyano, S. and Takagi, T. (eds), Genome Informatics 2000: Proceedings of the 11th Workshop, Tokyo, Japan, pp. 43–52.
7. Chen, X., Kwong, S. and Li, M. (2001) A compression algorithm for DNA sequences. IEEE Eng. Med. Biol. Mag., 20, 61–66.
8. Chen, X., Li, M., Ma, B. and Tromp, J. (2002) DNACompress: fast and effective DNA sequence compression. Bioinformatics, 18, 1696–1698.
9. Tabus, I., Korodi, G. and Rissanen, J. (2003) DNA sequence compression using the normalized maximum likelihood model for discrete regression. Proceedings of the Data Compression Conference, DCC-2003, Snowbird, Utah, pp. 253–262.
10. Manzini, G. and Rastero, M. (2004) A simple and fast DNA compressor. Softw. Pract. Exp., 34, 1397–1411.
11. Korodi, G. and Tabus, I. (2005) An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM T. Inform. Syst., 23, 3–34.
12. Behzadi, B. and Le Fessant, F. (2005) DNA compression challenge revisited. Combinatorial Pattern Matching: Proceedings of CPM-2005, Vol. 3537 of LNCS, Jeju Island, Korea. Springer, pp. 190–200.
13. Korodi, G. and Tabus, I. (2007) Normalized maximum likelihood model of order-1 for the compression of DNA sequences. Proceedings of the Data Compression Conference, DCC-2007, Snowbird, Utah. IEEE, pp. 33–42.
14. Cao, M.D., Dix, T.I., Allison, L. and Mears, C. (2007) A simple statistical algorithm for biological sequence compression. Proceedings of the Data Compression Conference, DCC-2007, Snowbird, Utah. IEEE, pp. 43–52.
15. Pinho, A.J., Ferreira, P.J.S.G., Neves, A.J.R. and Bastos, C.A.C. (2011) On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS ONE, 6, e21588.
16. Giancarlo, R., Scaturro, D. and Utro, F. (2009) Textual data compression in computational biology: a synopsis. Bioinformatics, 25, 1575–1586.
17. Pinho, A.J., Pratas, D. and Ferreira, P.J.S.G. (2011) Bacteria DNA sequence compression using a mixture of finite-context models. Proceedings of the IEEE Workshop on Statistical Signal Processing. IEEE, Nice, France.
18. Lander, E.S. (2011) Initial impact of the sequencing of the human genome. Nature, 470, 187–197.
19. Venter, J.C. (2010) Multiple personal genomes await. Nature, 464, 676–677.
20. Christley, S., Lu, Y., Li, C. and Xie, X. (2009) Human genomes as email attachments. Bioinformatics, 25, 274–275.
21. Brandon, M.C., Wallace, D.C. and Baldi, P. (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics, 25, 1731–1738.
22. Wang, C. and Zhang, D. (2011) A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res., 39, e45.
23. Kuruppu, S., Puglisi, S.J. and Zobel, J. (2011) Optimized relative Lempel-Ziv compression of genomes. In Reynolds, M. (ed.), Proceedings of the 34th Australasian Computer Science Conference (ACSC 2011), Conferences in Research and Practice in Information Technology (CRPIT), Vol. 113. Australian Computer Society, Perth, Australia.
24. Tembe, W., Lowey, J. and Suh, E. (2010) G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics, 26, 2192–2194.
25. Deorowicz, S. and Grabowski, S. (2011) Compression of DNA sequence reads in FASTQ format. Bioinformatics, 27, 860–862.
26. Fritz, M.H.-Y., Leinonen, R., Cochrane, G. and Birney, E. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res., 21, 734–740.
27. Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V. and Varghese, G. (2011) Compressing genomic sequence fragments using SlimGene. J. Comput. Biol., 18, 401–413.
28. Ahn, S.-M., Kim, T.-H., Lee, S., Kim, D., Ghang, H., Kim, D.-S., Kim, B.-C., Kim, S.-Y., Kim, W.-Y., Kim, C. et al. (2009) The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res., 19, 1622–1629.
29. Huala, E., Dickerman, A.W., Garcia-Hernandez, M., Weems, D., Reiser, L., LaFond, F., Hanley, D., Kiphart, D., Zhuang, M., Huang, W. et al. (2001) The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res., 29, 102–105.
30. Rhee, S.Y., Beavis, W., Berardini, T.Z., Chen, G., Dixon, D., Doyle, A., Garcia-Hernandez, M., Huala, E., Lander, G., Montoya, M. et al. (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res., 31, 224–228.
31. Ouyang, S., Zhu, W., Hamilton, J., Lin, H., Campbell, M., Childs, K., Thibaud-Nissen, F., Malek, R.L., Lee, Y., Zheng, L. et al. (2007) The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res., 35, D883–D887.
32. Levy, S., Sutton, G., Ng, P.C., Feuk, L., Halpern, A.L., Walenz, B.P., Axelrod, N., Huang, J., Kirkness, E.F., Denisov, G. et al. (2007) The diploid genome sequence of an individual human. PLoS Biol., 5, 2113–2144.
33. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.
34. Wang, J., Wang, W., Li, R., Li, Y., Tian, G., Goodman, L., Fan, W., Zhang, J., Li, J., Zhang, J. et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456, 60–66.
35. The International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
36. Rissanen, J. (1976) Generalized Kraft inequality and arithmetic coding. IBM J. Res. Develop., 20, 198–203.
37. Waterston, R.H., Lander, E.S. and Sulston, J.E. (2002) On the sequencing of the human genome. Proc. Natl Acad. Sci. USA, 99, 3712–3716.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
