CRAM 3.1: advances in the CRAM file format

Abstract

Motivation: CRAM has established itself as a high-compression alternative to the BAM file format for DNA sequencing data. We describe updates to further improve this on modern sequencing instruments.

Results: With Illumina data CRAM 3.1 is 7–15% smaller than the equivalent CRAM 3.0 file, and 50–70% smaller than the corresponding BAM file. Long-read technology shows more modest compression due to the presence of high-entropy signals.

Availability and implementation: The CRAM 3.0 specification is freely available from https://samtools.github.io/hts-specs/CRAMv3.pdf. The CRAM 3.1 improvements are available in a separate open-source HTScodecs library from https://github.com/samtools/htscodecs, and have been incorporated into HTSlib.

Supplementary information: Supplementary data are available at Bioinformatics online.

Demonstration JavaScript implementations of these codecs also exist in the HTScodecs package, although these should be considered as examples rather than production-ready code. They do however demonstrate the practicality of implementing CRAM 3.1 in a web environment, with JavaScript speeds typically 5x slower than their C counterparts.
The quality files used in testing here are available for download from ftp://ftp.sanger.ac.uk/pub/users/jkb/CRAM3.1/. Each of these is the first 1 million quality strings from a local NovaSeq run (q4+dir) and a HiSeq 2000 run (q40+dir, from 9827_2#49.bam). The first column in each file is the quality string, with the second column being a flag for first read (READ1) or second read (READ2). All entropy benchmarks were performed on the first column only with newlines removed, except for FQZComp which utilises the read start points and "selector" value.

Static 16-bit rANS coding
A modern CPU can dispatch multiple instructions per cycle and permits a degree of out-of-order execution. However, there is a delay between starting an operation and the result becoming available, so a series of dependent expressions (e.g. b = a * 2; c = b + 17; d = c − a) stalls the instruction pipeline.
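To make the stall concrete, the toy sketch below (illustrative only, not taken from HTScodecs) contrasts a single dependency chain with four independent chains of the kind that interleaved rANS states provide; the second loop gives the CPU independent work to schedule while earlier results are still in flight.

#include <stdint.h>
#include <stddef.h>

/* One long dependency chain: every iteration must wait for the previous
 * result, so the multiply/add latency is fully exposed. */
uint32_t chained(const uint32_t *v, size_t n) {
    uint32_t a = 1;
    for (size_t i = 0; i < n; i++)
        a = a * 2 + v[i] - 17;
    return a;
}

/* Four independent chains: the CPU can overlap their latencies, which is
 * the same trick as interleaving 4 (or 32) rANS states. */
uint32_t interleaved(const uint32_t *v, size_t n) {
    uint32_t a[4] = {1, 1, 1, 1};
    for (size_t i = 0; i + 4 <= n; i += 4)
        for (int j = 0; j < 4; j++)
            a[j] = a[j] * 2 + v[i + j] - 17;
    return a[0] ^ a[1] ^ a[2] ^ a[3];
}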
The existing CRAM 3.0 rANS entropy encoder interleaves 4 rANS states so independent calculations can be performed, avoiding pipeline stalls. Each state renormalises 8 bits of data at a time (emitting 8 bits at a time during encoding). This is listed as "rANS4x8" below.
While CRAM 3.1 still permits the old rANS implementation, it adds a newer version which renormalises 16 bits at a time for speed. Additionally it can interleave either 4 or 32 states. Four states are sufficient to avoid pipeline stalls with traditional scalar arithmetic, but with 32 we can use 4 lanes of 8x32-bit SIMD vectors (AVX2) or 2 lanes of 16x32-bit (AVX512). Unfortunately compilers are not able to auto-vectorise this code, so we have written multiple SIMD implementations in the HTScodecs GitHub repository, with the ability to detect the CPU and execute the fastest available. The CRAM 3.1 rANS codecs are listed as "rANS4x16" and "rANS32x16" below.
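A minimal sketch of the 4-way interleaved decode loop is shown below to illustrate the idea; the constants, table layout and function names are placeholders rather than the HTScodecs implementation, which additionally handles frequency table (de)serialisation, final-state flushing and the SIMD variants.

#include <stdint.h>
#include <stddef.h>

#define TF_SHIFT 12                 /* total frequency = 4096 (assumed)   */
#define RANS_LOW (1u << 16)         /* lower bound of the state invariant */

/* Pull 16 bits from the byte stream whenever a state drops below RANS_LOW. */
static inline uint32_t renorm16(uint32_t x, const uint8_t **in) {
    if (x < RANS_LOW) {
        x = (x << 16) | (uint32_t)((*in)[0] | ((*in)[1] << 8));
        *in += 2;
    }
    return x;
}

/* Decode n symbols with 4 interleaved rANS states.  freq[s] and cfreq[s] are
 * the (pre-loaded) symbol frequencies and cumulative frequencies; sym_of[]
 * maps each of the 4096 slots back to its symbol. */
void rans4x16_decode(const uint8_t *in, uint8_t *out, size_t n,
                     const uint16_t *freq, const uint16_t *cfreq,
                     const uint8_t *sym_of) {
    uint32_t R[4];
    for (int j = 0; j < 4; j++) {    /* four little-endian 32-bit start states */
        R[j] = in[0] | (in[1] << 8) | (in[2] << 16) | ((uint32_t)in[3] << 24);
        in += 4;
    }
    for (size_t i = 0; i < n; i++) {
        int j = i & 3;                              /* round-robin over states */
        uint32_t x = R[j];
        uint32_t slot = x & ((1u << TF_SHIFT) - 1); /* low 12 bits pick a slot */
        uint8_t  s = sym_of[slot];
        out[i] = s;
        /* Standard rANS update: x' = freq*floor(x/2^12) + slot - cumfreq */
        x = freq[s] * (x >> TF_SHIFT) + slot - cfreq[s];
        R[j] = renorm16(x, &in);                    /* 16-bit renormalisation  */
    }
}

The 32-state variant follows the same pattern with R[32], which maps naturally onto 32-bit SIMD lanes.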
We present figures showing the encoding and decoding rates for various rANS options, with and without bit-packing and RLE, and with varying degrees of vectorisation from none (compiler automatic, mostly traditional scalar code) to AVX512. As we needed AVX512 support, all Intel figures were tested on an Intel Xeon Gold 6242 CPU at 2.8GHz. We also present ARM Neon benchmarks on an AWS Neoverse-N1 (stepping r3p1).
Tables 1 and 2 show the performance of encoding NovaSeq quality values on Intel and ARM platforms. Huffman encoding is poor due to the need to emit at least 1 bit per symbol. Uncompressed data size is 151MB. Table 3 shows the encoding of HiSeq 2000 quality values. As with the NovaSeq there is a considerable speed gain available by interleaving 32 rANS states and utilising SIMD, particularly when decoding. The uncompressed data file is 100MB.
In addition to the PACK and RLE methods, rANS also supports a STRIPE filter. This splits the data stream into N sub-streams, the first holding every Nth byte starting at position 0, the second every Nth byte starting at position 1, and so on. Each of these is then compressed using its own rANS parameters. This can be particularly useful for BAM "B" auxiliary arrays. For example an "XX:B:i," auxiliary tag will store 32-bit numbers. If such numbers fall in the 1 to 100,000 range then bits 24-31 will always be 0 and bits 16-23 will be 0 or 1 only, with the bulk of the entropy in the first two bytes per value. STRIPE can permit RLE on the 4th byte, PACK on the 3rd byte, and Order-0 or Order-1 on the first two bytes.
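As a rough illustration of the transform (not the on-disk format, which also records the number of sub-streams and their compressed sizes), the sketch below splits little-endian 32-bit values into four byte planes so that, for the example above, plane 3 can use RLE, plane 2 PACK and planes 0-1 an Order-0 or Order-1 entropy coder.

#include <stdint.h>
#include <stddef.h>

/* Split n little-endian 32-bit values into 4 byte planes (the STRIPE idea). */
void stripe4(const uint32_t *vals, size_t n, uint8_t *plane[4]) {
    for (size_t i = 0; i < n; i++) {
        plane[0][i] = (uint8_t)(vals[i]      );   /* bits 0-7: most entropy  */
        plane[1][i] = (uint8_t)(vals[i] >> 8 );   /* bits 8-15               */
        plane[2][i] = (uint8_t)(vals[i] >> 16);   /* bits 16-23: 0 or 1 here */
        plane[3][i] = (uint8_t)(vals[i] >> 24);   /* bits 24-31: always 0    */
    }
}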

Adaptive arithmetic coding
The adaptive arithmetic codec is best for non-stationary probabilities, where the characteristics of the data may change throughout the block. However this is not a common feature of sequencing data, so the extra complexity is rarely required. This implementation of arithmetic coding is byte-wise rather than the more usual bit-wise, which can be faster for low-order models, particularly on data sets with only a few symbols in use. As with the CRAM 3.1 rANS implementation, the arithmetic coder also includes the bit-packing, run-length encoding and data-striping modes.
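The "adaptive" part can be illustrated with a sketch of a simple byte-wise frequency model of the sort such coders drive; the update increment and rescaling threshold here are placeholders, not the values used by the CRAM 3.1 codec.

#include <stdint.h>

#define MAX_TOTAL (1 << 16)   /* rescale threshold (assumed) */
#define INC       16          /* adaptation rate (assumed)   */

typedef struct {
    uint32_t freq[256];
    uint32_t total;
} AdaptiveModel;

void model_init(AdaptiveModel *m) {
    for (int i = 0; i < 256; i++) m->freq[i] = 1;  /* every symbol stays codable */
    m->total = 256;
}

/* After coding symbol s, boost its frequency so the model tracks changing
 * statistics; halve all counts when the total grows too large, which also
 * gives more weight to recent data. */
void model_update(AdaptiveModel *m, int s) {
    m->freq[s] += INC;
    m->total   += INC;
    if (m->total > MAX_TOTAL) {
        m->total = 0;
        for (int i = 0; i < 256; i++) {
            m->freq[i] -= m->freq[i] >> 1;      /* halve, rounding up, so >= 1 */
            m->total   += m->freq[i];
        }
    }
}

The coder then assigns symbol s the probability freq[s]/total for the next encode or decode step.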
For comparisons with other tools, Tables 4 and 5 show the same NovaSeq and HiSeq 2000 quality data alongside a representative non-vectorised rANS line. Also included is MPEG-G's GABAC algorithm, as implemented in the Genie open-source tool. This is also an adaptive arithmetic codec (albeit bit-wise) with the ability to perform bit-based transformations before entropy encoding, so it is a good comparison against the arithmetic coder used in CRAM 3.1.
Note the Genie gabac-app tool requires little-endian 32-bit data, so the quality data was first transformed with a perl script. This transformation has not been included in the timing figures.
The adaptive arithmetic codec is an order of magnitude slower than rANS, but offers some modest improvement in ratios, particularly on the NovaSeq data. GABAC is an order of magnitude slower again, although this is perhaps not indicative of the speed of a fully optimised implementation. It also appears poor on compression ratio, but there may be room for further improvement within the GABAC data format.

Generalised FQZComp
The original FQZComp tool had a choice of three quality models, selected by parameters -q1, -q2 and -q3 using 12, 16 and 20 bits respectively for the model context. The construction of these contexts was hard-coded.
CRAM 3.1's FQZComp model is designed around block-based compression offering random access, so it uses at most 16 bits of context. The construction of this context is configurable, with the construction rules stored in the data stream so the decoder can follow the same instructions during decode. This offers considerable freedom in context construction.
For example, the base qualities on long-read technologies may not vary much along the length of the read, so attempting to utilise the position in the sequence will likely harm compression by making the model slower to adapt. Conversely, Illumina data has a strong bias between the 5' and 3' ends, which also means that knowing the read orientation can help improve compression.
CRAM's codecs are designed to operate in isolation, without knowledge of other CRAM fields. This is a deliberate design decision in order to remove data dependencies and permit a more column-based selective access pattern. However, the additional bit required per read to store (duplicate) the orientation within the quality stream may be more than offset by the space saved. It also permits a flag per read to indicate entire quality string duplication, which is beneficial when supplementary or secondary alignments are present.
Other data available for use in context construction are previous quality values, a running total of successive quality differences since the start of this read, and the position along the current read. These values can be indirected via a lookup table, so they do not need to be uniformly distributed.
We can also take part of the context from the data stream itself, stored at the start of each new record. We use this to classify data by a variety of methods. For example, separating by READ1 and READ2 flags can sometimes be beneficial. We may also wish to group reads by their average quality value, the number of differences, or even their position on the flow cell (extracted from the read name), with edge effects or bubbles changing the quality characteristics. Any categorisation can be used by the encoder, as the decoder simply uses the data from the file format without knowing how it was derived. These are known as the "selector" bits.
Pseudo-code for the context construction used during the decoding process is given in the CRAM codecs specification. Table 6 shows the performance of FQZComp on the HiSeq 2000 and NovaSeq test sets, with several different context configurations. Also shown is libbsc, as a high-quality and commonly used general-purpose compression tool with a similar data throughput, and the original 2012 standalone FQZComp tool. We also repeat the best rANS and arithmetic coder ratios for ease of comparison. It is clear that the generalised CRAM implementation of the FQZComp quality model is a significant advance over the original, although this general-purpose nature currently carries a small speed penalty.
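As an illustration of the style of context construction described above, the C sketch below mirrors the shape of that pseudo-code; the table sizes, shift amounts and input ranges are placeholders standing in for the parameters that are actually stored in the data stream.

#include <stdint.h>

typedef struct {
    uint16_t qtab[256];    /* indirection table for the previous quality value */
    uint16_t ptab[1024];   /* indirection table for position along the read    */
    uint16_t dtab[256];    /* indirection table for the running delta count    */
    uint16_t stab[256];    /* indirection table for the per-read selector bits */
    int qshift, pshift, dshift, sshift;
} fqz_params;

/* Combine the (looked-up) inputs into at most 16 bits of model context. */
static inline uint16_t fqz_context(const fqz_params *p,
                                   unsigned prev_qual, unsigned pos,
                                   unsigned delta, unsigned sel) {
    uint32_t ctx = 0;
    ctx += (uint32_t)p->qtab[prev_qual & 0xff]          << p->qshift;
    ctx += (uint32_t)p->ptab[pos < 1024 ? pos : 1023]   << p->pshift;
    ctx += (uint32_t)p->dtab[delta < 256 ? delta : 255] << p->dshift;
    ctx += (uint32_t)p->stab[sel & 0xff]                << p->sshift;
    return (uint16_t)(ctx & 0xffff);   /* i.e. ctx AND (2^16 - 1) */
}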

Name tokeniser
The name tokeniser deconstructs a read name into a series of elements and then compares each element against a previously tokenised read name. Typically this is the immediately preceding name, but with mixed data sets it may be an earlier one matching the same pattern. The tokens are collated in columns, with each column then compressed using either the CRAM 3.1 rANS or adaptive arithmetic coders. Table 7 shows the compression performance on position-sorted and name-sorted data files. We show results for our name tokeniser (tok3) against brotli (bro), libbsc (bsc), bzip2, gzip (libdeflate implementation), mcm, xz and zstd. Tok3 compression levels 1 to 9 use rANS encoding and 11 to 19 use the adaptive arithmetic coder. The last digit governs how hard it searches for the optimal encoding parameters.
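The basic tokenisation step can be sketched as below; this is a simplification for illustration (the real tokeniser also emits MATCH/DELTA-style token types describing how each column differs from the previous name).

#include <ctype.h>
#include <stdio.h>

/* Split a read name into alternating alpha / digit / punctuation tokens,
 * printing one "column" per token.  In CRAM each column would then be
 * diffed against the previous name and compressed independently. */
void tokenise(const char *name) {
    const char *p = name;
    int col = 0;
    while (*p) {
        const char *start = p;
        const char *type;
        if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;
            type = "DIGITS";   /* often delta-encoded against the previous name */
        } else if (isalpha((unsigned char)*p)) {
            while (isalpha((unsigned char)*p)) p++;
            type = "ALPHA";    /* often identical to the previous name          */
        } else {
            p++;
            type = "CHAR";     /* separators such as ':' or '.'                 */
        }
        printf("col %d: %-6s \"%.*s\"\n", col++, type, (int)(p - start), start);
    }
}

int main(void) {
    /* A typical Illumina-style name; the identifier itself is made up. */
    tokenise("A00123:45:HXXXXXXXX:1:1101:2338:1000");
    return 0;
}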
The position sorted data is the aligned NovaSeq file used in the CRAM benchmarks (ERR3239334). The read name sorted data set is the interleaved read1/read2 records from ERR174326. Both data sets are 1 million names, separated into 10 lots of 100,000 names to simulate coarse random access capabilities.
Mcm and libbsc are strong contenders, with mcm sometimes ahead of the CRAM name tokeniser on ratio; however, as with the other codecs, CRAM aims to strike a balance between speed and size.

HTSlib's CRAM implementation
The CRAM specification describes the byte stream and how it must be decoded. However there are many possible byte streams that decode to the same data, and hence considerable freedom is available to the encoder. We can see similar choices in general-purpose tools such as gzip, where the user can select compression levels -1 to -9 governing the algorithms used by the encoder (and how much time it spends), but all decoders can process the compressed file.
CRAM extends this further by permitting a choice of different algorithms, each with their own internal compression levels, for example selecting from bzip2, gzip or rANS.
Some of these choices are controllable by the user, while others are chosen automatically. The user-controllable options are:

• A general compression level, indicating the trade-off between CPU time and file size.
• The choice of which compression codecs may be used. This does not mean they will be used for all data blocks, but if a compression codec offers a significant advantage then it may be selected.
• The number of records and/or bases per slice, and the number of slices per container. This governs the random access capabilities of the file format. Note HTSlib's number of sequences per slice has a nominal cap of 500bp per sequence unless explicitly changed. This prevents long reads from producing excessively large containers.
• Whether to encode against an external reference sequence, an internal reference sequence, or referenceless.
• A compression profile. This is a simple way of specifying the above options as a single parameter. The profiles available are:

fast: Sets the compression level to 1 and disables the name tokeniser. 10,000 sequences per slice.

normal: The default compression level (5). The name tokeniser is enabled and there are 10,000 sequences per slice.

small: Compression level is set to 6 unless otherwise changed. Bzip2 and fqzcomp quality modes are enabled and there are 25,000 sequences per slice.

archive: Compression level is set to 6 unless otherwise changed. Bzip2, fqzcomp quality modes and the adaptive arithmetic coder are enabled (both as a standalone codec and also as part of the name tokeniser). If the compression level was previously set to 7 or above then lzma is also enabled. The number of sequences per slice is set to 100,000. Use of lzma is typically the main source of difference between "archive" and "archive9" in the tables below.
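For reference, these user-controllable options map onto HTSlib's CRAM format options; the sketch below shows the general shape of setting them through the C API when writing CRAM, though the exact option names and the HTSlib version that introduced each one should be checked against the hts.h in use.

#include <htslib/sam.h>

/* Open an output CRAM and request CRAM 3.1 with some of the options
 * described above (sketch only; error handling abbreviated). */
samFile *open_cram31(const char *fn, const char *ref_fn, sam_hdr_t *hdr) {
    samFile *out = sam_open(fn, "wc");                /* "wc" = write CRAM     */
    if (!out) return NULL;

    hts_set_opt(out, CRAM_OPT_VERSION, "3.1");        /* enable 3.1 codecs     */
    hts_set_opt(out, CRAM_OPT_REFERENCE, ref_fn);     /* external reference    */
    hts_set_opt(out, CRAM_OPT_SEQS_PER_SLICE, 25000); /* records per slice     */
    hts_set_opt(out, CRAM_OPT_USE_BZIP2, 1);          /* permit extra codecs   */
    hts_set_opt(out, CRAM_OPT_USE_FQZ, 1);            /* fqzcomp for qualities */

    if (sam_hdr_write(out, hdr) < 0) { sam_close(out); return NULL; }
    return out;                                       /* then sam_write1(...)  */
}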
HTSlib's CRAM implementation was derived from the Staden io_lib package, and both share a similar strategy. The encoding process is as follows.
1. Read sufficient records to fill a container, and produce the uncompressed CRAM data series associated with these records. New containers are also started when a reference changes, unless this leads to many small containers in which case multi-reference containers are used with an additional data-series present to hold the reference ID.
Data-series are analogous to SAM columns, such as identifier, chromosome, position, qualities. Sequence is split into multiple data series based on the delta to the reference (such as length of deletion, SNPs, etc). Each auxiliary tag type is also assigned its own data series.
Modern CRAM implementations store each data series in their own block, but earlier versions would merge some together. Merging may boost compression where correlation between values occur, but it reduces the ability to do efficient data-type selective retrieval.
2. Gather statistics on each data series, such as the number of distinct values for an integer series and their minimum and maximum values. Data series that have constant values may be encoded using a Huffman tree with a zero-length bit stream (ie they consume no space other than the Huffman meta-data). In CRAM 4 this is achieved with a special CONST encoding type.
CRAM 4 also adds a new variable sized integer encoding which can distinguish signed vs unsigned values and uses the minimum value as a baseline to compare against.
3. For all other data blocks, a list of candidate compression codec methods is maintained. This is initially everything permitted by the user; bzip2 and lzma are not included unless explicitly requested for CRAM 3.0, and likewise the arithmetic coder and fqzcomp quality encoder for CRAM 3.1.
Note that a single compression codec listed in the CRAM specification may be trialled as multiple methods, as a codec may have several possible parameter choices. For Gzip this consists of GZIP-RLE (run-length encoding only), GZIP-1 (level 1 only), and GZIP (at the current compression level). FQZComp has 4 different sets of parameters to trial different encoding models. CRAM 3.0's 8-bit rANS has Order-0 and Order-1 encodings. CRAM 3.1's 16-bit rANS has Order-0 and Order-1, plus combinations including run-length encoding and bit-packing; similarly for the CRAM 3.1 arithmetic coder. This gives a large choice of available methods.
Compression level can restrict this initial choice.
Level 1 removes gzip, bzip2 and lzma support for quality scores (but may still use gzip in run-length-encoding mode). No rANS RLE or bit-packing is permitted.
Level 2 permits standard gzip for all data series, but still only uses gzip level 1 for quality scores and bases. Some combinations of rANS bit-packing and RLE are enabled.
Levels 3 and above permit the external codecs (gzip, bzip2, lzma), if enabled, at the compression level selected.
Levels 6 and above also enable further rANS bit-packing and RLE combinations.
The data series also governs which initial methods to utilise. If Fqzcomp is enabled, quality scores are compressed using 1 pre-defined parameter set at levels 1-4, two parameter sets at levels 4 and 5, and 4 parameter sets at levels 6 and above.
The read identifier data series removes the rANS methods and GZIP-RLE from the initial method list as they are almost never beneficial.
4. Each data series is then compressed by trialling each of the available methods listed for that specific data series type. A cost function is applied based on an estimated decoding speed and the compression level; at level 9 speed is ignored, while at level 1 it is given a high weighting (a sketch of this trade-off follows this list). Thus if a compression method yields only 1% smaller data but is likely to be 100% slower, it may not be utilised for encoding.
The output size of each tested method is summed per data series.
5. After a fixed (low) number of trials, a decision is made on which method is best and it is then applied without trial for the next (high) number of containers.
6. This trial-and-deploy strategy is repeated multiple times. A secondary level of trial statistics is maintained, so that if a method for a data series is consistently rejected by a considerable margin it will be dropped from subsequent trial phases.
In this way HTSlib learns the nature of the data and which methods work best. This is particularly important for auxiliary data, which may contain any data with unknown characteristics. Hence speed assessments of HTSlib on small files may not be indicative of those on larger ones.
7. Specific events which may change the nature of the data, such as moving from mapped to unmapped or many large references to many small ALT references, can trigger a reset of the learnt compression metrics.
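A sketch of the kind of size-versus-speed trade-off used in step 4 is given below; the weighting scheme and the relative speed estimates are placeholders for illustration, not HTSlib's actual cost function.

#include <stddef.h>

typedef struct {
    const char *name;
    size_t      size;        /* compressed size from the trial               */
    double      rel_speed;   /* estimated decode speed, 1.0 = fastest method */
} method_trial;

/* Lower score wins.  At level 9 the speed term vanishes (size only); at
 * level 1 a slow decoder is penalised heavily, so a method that is 1%
 * smaller but twice as slow loses. */
static double method_score(const method_trial *m, int level) {
    double speed_weight  = (9 - level) / 8.0;     /* 1.0 at level 1, 0 at 9 */
    double speed_penalty = 1.0 / m->rel_speed;    /* >1 for slower methods  */
    return (double)m->size * (1.0 + speed_weight * (speed_penalty - 1.0));
}

const method_trial *pick_method(const method_trial *t, int n, int level) {
    const method_trial *best = &t[0];
    for (int i = 1; i < n; i++)
        if (method_score(&t[i], level) < method_score(best, level))
            best = &t[i];
    return best;
}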
Note that this learning strategy is specific to HTSlib and the io_lib code it was derived from. The HTSJDK implementation has fixed compression methods for each data series. This may give a speed advantage on small files, but it lacks the ability to adapt to the data. This is one example of where evaluation of the CRAM file format must be distinguished from evaluation of a specific implementation, although we cannot evaluate the format without evaluating at least one implementation. How best to select the encoding methods and compression codecs per data series remains an open question. We evaluate HTSlib here as we believe it to be the best of the currently available implementations, but there is still potential for future improvement while retaining CRAM compatibility.

Detailed CRAM file compression results: aligned data
The "io lib" cram dump tool can be used to interrogate the layout of a CRAM file. An example container header is listed below. The record encoding map describes how each data series is encoded. This is mostly outputting to blocks as-is (EXTERNAL) or with variable length data (either null terminated or with explicit length encodings). HUFFMAN is also (ab)used in this example for constant items, with a Huffman tree containing a single zero-bit long symbol. The BETA encoding is also used for alignment positions. The parameters to each encoding describe, among other things, the block it is stored in, for example soft-clipped sequence ("SC") is written to block number 14.
The blocks are then grouped together in slices, and each block describes the compression codec used for that data type. The io_lib cram_size tool can summarise all of this, reporting how much space each data series consumes and which compression methods were used (possibly more than one, as different codecs may be used in different slices). In the tables below, we report total file size and the time taken to encode and decode, gathered using the time command. This shows the difference between total CPU usage across all threads and the elapsed time, giving an indication of thread scalability. The entry for "Deez q1" corresponds to the -q1 option of Deez, which uses the samcomp quality model. Random access is not available on the quality field for this method, so it is excluded from the main figure.
We also include numbers for the experimental CRAM 4.0 version. This is ongoing work and is expected to change further, but the primary reasons for the size reduction over 3.1 are deduplication of read names and the option to store quality values in their original 5' to 3' orientation.

Figure 2 shows the same data as CPU time (User + System). This now includes Deez, which is quite performant. MPEG-G is not shown as we only have the elapsed time for that test system.
The improvements with Genozip in the 12.0.34 to 13.0.5 upgrade are particularly effective for the NovaSeq data, giving the smallest file of all tools tested. The primary gain over CRAM is improved compression of the XA:Z auxiliary tag, which consumes over 20% of that data set in CRAM.

NovaSeq data
We tried to obtain details on how the MPEG-G dataset 37 was aligned, in order to compare tools against MPEG-G, but were unable to obtain them or the BAM file. Hence the results below use a different public data set.
Source: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram

Figure 3 shows the breakdown of data type sizes for CRAM 2.1 (normal and small profiles), CRAM 3.0 (normal, small and archive level 9), CRAM 3.1 (normal, small and archive level 9), CRAM 4.0 (normal and small), Deez (-q1 and -q2) and Genozip (normal and -fast). Sequence size is everything that is not Name, Qual or Auxiliary fields, and hence also includes mapping quality, template lengths and flags. This is to make it easier to compare between tools.
Deez is significantly ahead on sequence here, presumably due to the two-level delta, while the new CRAM codecs are significantly ahead on read names and quality values.
The updates that occurred in Genozip during the review of this manuscript are most noticeable on this data set. The sequence encoding is considerably improved and the auxiliary tag encoding (dominated by the large "XA:Z" tag) is now the best of any tool. This backs up our discussion point in the main paper on tag-specific encoding. Most of this size reduction occurred in 13.0.5, but since Genozip 13 the CRAM codecs presented here have also been incorporated, for improved speed and further ratio gains.

Memory usage is also shown in Table 8, but note that with 12 threads each tool will have 12 times as much data in memory, with the associated compression buffers and statistics. Single-threaded conversions will use substantially less RAM. As we switch between CRAM profiles the memory also changes, which is primarily down to increasing the number of sequences per slice (the unit of random access). BAM is by far the lowest memory user, due to having a very small 64k compression buffer as required by bgzf.

HiSeq 2000 data
This is the Illumina Platinum Genomes deep sequencing of NA12878. It is also the MPEG-G Data-Set 02.
Source: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR194/ERR194147/NA12878_S1.bam

The MPEG-G size quoted in their paper's supplementary material is 58444073745 bytes, at 8,280s to encode and 3,990s to decode using 8 threads on an Intel i7-7700 running at 3.6GHz. The earlier MPEG document M56361 quotes the size as a very similar 58457106765 bytes, taking 7,320s to encode and 3,745s to decode using 12 threads on an Intel E5-2670 running at 2.6GHz. Our tests are on an Intel E5-2660, also with 12 threads, nominally running at 2.2GHz but with CPU frequency scaling potentially overclocking this. We believe the speed benchmarks to be sufficiently similar to be of value in comparisons, although we cannot quantify the impact of any I/O speed differences.

Figure 4 shows the breakdown of data type sizes for CRAM 2.1 (normal and small profiles), CRAM 3.0 (normal, small and archive level 9), CRAM 3.1 (normal, small and archive level 9), Deez (-q1 and -q2), MPEG-G and Genozip (normal and -fast). Sequence size is everything that is not Name, Qual or Auxiliary fields, and hence also includes mapping quality, template lengths and flags. This is to make it easier to compare between tools.
The quality values dominate the file size here, giving CRAM 3.1 a significant advantage in the overall file size. However the new name tokeniser is clearly a visible gain too.

PacBio CLR data
Source: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20131209_na12878_pacbio/si/NA12878.pacbio.bwa-sw.20140202.bam

The document "Results MPEG-G Detailed M56361.xlsx" states 13,766,776 records. The original BAM file has 25,968,256 records with 13,682,101 primary alignments and 12,286,155 secondary alignments. Given that secondary alignments share the same sequence and quality as their primary, it may be possible to deduplicate these. MPEG-G merges records from the same template together if they occur within the same Access Unit, which is stated to be blocks of 65,536 records. 68% of the secondary alignments are within 65,536 records of their primary, so this cannot account for the disparity in the number of records. This document also states "BAM AUX: NM, MD, RG".
Dropping all secondary alignments and most auxiliary tags with samtools view -F 0x100 -x AS -x PG -x SA -x XS produces a BAM file closely matching the one quoted in M56361 (it is 0.7% smaller, but that may be due to Deflate implementation differences) and an even closer match to the CRAM 3.0 size (0.3% smaller). We attempted to clarify this, but received no answer. Therefore we work on the assumption that this transformation was applied, and have done the same on our results for the purposes of a fair comparison. Our results are valid between the tests we performed on versions of CRAM, Deez and Genozip, but we have listed the MPEG-G sizes as "estimated" due to this lack of certainty.

Figure 5 shows the breakdown of data type sizes. As the sequences are long, read names are a tiny proportion of the total data. The PacBio qualities are dominant and hard to compress effectively. As explained above, auxiliary tags were removed, but Genozip (default mode) still stored the MD:Z, NM:i and RG:Z tags. As in the HiSeq test, the data type sizes for MPEG-G come from the M56361 document. Although this was an earlier report, the total file size matches the one listed in their supplementary results, so we believe them to be valid.

Of note is that this data set is dominated by the complexity of the quality scores. These are very random, with little variation between tools. The exception is the latest release of Genozip, which utilises the sequence as a context for compression of quality values. Such cross-correlation between data types is currently deliberately avoided in CRAM as it makes selective decoding of individual data types impossible. However it is clear the gains may be significant, and it is unlikely that there is a demand for obtaining quality values without also obtaining the sequences (unlike vice versa). Experiments (unpublished) demonstrate that using a sequence context has a similarly considerable impact on PacBio CRAM quality compression, so this restriction could potentially be relaxed for CRAM 4.0, although the preferable route would be for the sequencing vendor to perform a significant quantisation to reduce the cost of quality encoding.
Unlike the HiSeq data, timings for the MPEG-G results come from the MPEG-G paper's supplementary results and not the M56361 publication. Note these were performed using 8 threads on an Intel Core i7-7700 CPU running at 3.6 GHz. This is considerably different to our results using 12 cores on an Intel Xeon E5-2660 (nominally running at 2.2 GHz but with CPU frequency scaling sometimes temporarily overclocking this). However we note the HiSeq data set has speeds reported on both the i7-7700 and E5-2670 systems, with encode times of 8,280s and 7,320s respectively and decode times of 3,990s and 3,745s respectively. This implies the MPEG-G timings below may be between 7 and 12% longer than if they had been executed on the E5-2660 system. Although the MPEG-G timings are not directly comparable, we believe them to be broadly in the same ballpark and certainly within ±50%.
We also repeated the PacBio CRAM 3.1 test on a different system with AVX512 support to test the fastest rANS implementations. This test was with 4 cores on an Intel Xeon Gold 6142 CPU (nominally at 2.6GHz but with auto-scaling enabled). The reduced core count is to ensure a long enough test and to avoid I/O effects. Results averaged over 3 runs are presented in Table 11. The time saved varies by data set and codec used. This file has about 77% of its contents in the quality data series, so a SIMD Order-1 rANS codec makes a significant difference (16 to 22%). However the slower archival compression profiles would not save nearly as much, nor would files containing a lot of long textual auxiliary tags (such as "XA:Z:"). A small sample of NovaSeq data showed a modest 4% speed gain. It was also observed that using many threads reduced the impact as other factors become more dominant. Furthermore, without CPU pinning the clock speed of CPUs running AVX512 was generally lower, reducing the potential gains.

CRAM compression: unaligned data
Sequencing data is typically generated in FASTQ format. We would not consider this a suitable long-term storage format, as aligning or assembling the data both improves compression ratios and substantially boosts its usefulness. However, as an interim solution, data compression is still valuable.
Traditionally standard text compressors (such as gzip) have been used. There are a wide variety of dedicated FASTQ compressors, but we limit ourselves to evaluating the same multi-purpose encoders listed above with unaligned data.
The most naive solution is simply to generate an unaligned BAM, CRAM and related formats. This is a fast workable solution and better than straight FASTQ as it permits the presence of sample meta-data in the file header and has a rigorously defined way to attach additional per sequence meta-data, such as bar-coding, instead of exploiting the FASTQ comment section to create a format within a format.
The ERR174310_1.fastq.gz file is in NCBI SRA format, which unfortunately is a non-standard FASTQ variant where the original read identifier is in column 2. To correct this, a fix to samtools import was made in PR#1485. The command to convert from FASTQ pairs to BAM is samtools import -@8 -N -1 ERR174310_1.fastq.gz -2 ERR174310_2.fastq.gz -o ERR174310.u.bam. The samtools fastq command can be used to reverse the process.

Table 12 shows encoding performance for various tools. The MPEG-G line here comes from their paper, so was run on an Intel i7-7700. All others were benchmarked by us on an Intel Xeon Gold 6142 using 8 threads. This is a different system to the aligned data benchmarks above, as the aligner needed more memory. Note samtools import currently has no multi-threaded decoding of FASTQ, so threading scalability is limited. However this is a tool-chain implementation issue rather than a limitation of the file formats, so we expect it to improve.

It can be seen that storing FASTQ in unaligned BAM offers little benefit over gzip / bgzf, but using CRAM is a significant improvement due to the columnar nature of the format. This is further improved by the new codecs included in CRAM 3.1. Despite the poor multi-threading scalability of samtools import, it is competitive on speed. CRAM 3.1 was only slightly beaten by Genozip on size, but ran in a fraction of the time. The default block size for Genozip is 16MB, which corresponds to about 65,000 records. This is also comparable to the quoted block size for MPEG-G. CRAM 3.1 defaults to 10,000 short-read records per slice for the normal profile, 25,000 for the small profile and 100,000 for archive (not tested above).
A more advanced solution is to perform either a rapid sequence assembly, as originally proposed by Fritz et al., or reference-based alignment against an existing known reference. Such methods do not need to be biologically rigorous, as we are not attempting to get the best assembly or the best alignments, but simply to reduce the data volume. There are many dedicated FASTQ compressors using such techniques, but we limit our comparison to the same tools above that also support aligned data, with bgzf (parallel gzip) as a baseline.
Alignment / assembly considerably reduces the data size, but comes with an additional CPU cost. CRAM does not prescribe a specific method and SAMtools / HTSlib do not have this built in; however we can use very fast aligners for this purpose. (It should be noted that spending too much time on this strategy simply implies you should be converting to fully aligned BAM/CRAM anyway, in order to leverage the utility of alignment or assembly rather than using it purely as a compression method.) For this research we chose the SNAP aligner. SNAP also cannot cope with the non-standard SRA / ENA FASTQ variant, and we get better CPU performance if we naively align the data as interleaved single-ended reads rather than as pairs. As a pipeline we use samtools import to convert from two FASTQ files to uncompressed unaligned BAM, which is fed into the SNAP aligner. The aligner outputs uncompressed aligned SAM, which is finally converted to CRAM. As we do not need fine-grained random access on FASTQ, we also increased the number of sequences per CRAM slice to 100,000. We provide a crude run_aligned.sh script to demonstrate this process.
The SNAP aligner parameters have been chosen to optimise speed, as we are not attempting to get good alignments, rather to improve compression ratios. Options were: snap-aligner single grch37.snap -t 8 --b -d 8 -D 0 -f -bam -map -o -sam -. We used SNAP 1.0beta.24 as this gave better speed and size than the latest release.
Note this is very much a proof of principle, as the resulting CRAM cannot be trivially converted back to FASTQ without some processing to split the two reads apart into their /1 and /2 components (given that we aligned without the "paired" option). However, sufficient data exists to do this without data loss. The data has also been reordered slightly, in large blocks, due to the multi-threaded implementation in SNAP. This could be fixed without affecting the size, and likely with minimal impact on speed, but this work has not been done.
The ideal solution would be an aligner which can go directly from compressed FASTQ to compressed CRAM in a single process while retaining the exact file ordering.

Table 13 shows alignment- and assembly-based methods of FASTQ compression. The BAM file is larger than the unaligned BAM, due to storing chromosome name and position, mate-pairing information and auxiliary tags, without any benefits from reference-based compression. CRAM shows a considerable improvement in compression ratios, with an expected elevated CPU cost. Note Genozip was run using options -@8 --pair --reference h38.snap while SNAP was running on h37. Genozip gave the smallest file, but was considerably slower than CRAM, and the same comments regarding block sizes from the unaligned data test apply.
Using SNAP + CRAM in this way is not the only combination, and it is not yet a production-ready solution; however, the results demonstrate that the concept of marrying a rapid approximate aligner to CRAM is a viable and performant solution, without needing to change to a different encoding format for FASTQ data.

CRAM compression: Crumble lossy output
The SynDip synthetic diploid genome, consisting of the CHM1 and CHM13 samples, has previously been evaluated for lossy compression with the Crumble tool.
Unlike many tools, Crumble does not utilise its own file format. Rather it modifies the quality string via a simple experiment: if removing the quality values at a specific locus (after alignment) does not change the variant call and does not significantly alter the variant confidence, then these values can be set to a constant high value if they confirm the call or a constant low value if they disagree. The impact of this is a heavily skewed distribution of quality values with much greater predictability and better compression ratios.
To evaluate CRAM 3.1 we compared CRAM 3.0 and CRAM 3.1 on the pre-Crumble (original) BAM and the post-Crumble BAM. These are shown in Tables 14 and 15. The data was Chromosome 1 of CHM1_CHM13_2.bam from ENA accession PRJEB13208. The original data has been passed through GATK's quality score recalibration tool, with the original qualities copied to the OQ:Z BAM auxiliary tag and the updated qualities put into the QUAL field. It can be seen that this has dramatically increased the entropy of the quality values. Note we believe this approach is no longer part of the GATK best practice, but we have no direct experience of this; the data files are, however, a real dataset.
The default CRAM 3.0 file is 43% smaller than the default BAM, and the default CRAM 3.1 is 4.1% smaller than the 3.0. Note for the default BAM mode we estimated the space taken by quality values and read-names by replacing these fields with "*" and measuring the impact on the BAM file size.
The Crumble output here has modified quality scores as outlined above, but also discards read names (via HTSlib's "lossy-names" option) where the read-pair falls entirely within a single CRAM slice, and removes the large OQ:Z auxiliary tag. The modified qualities have reduced in size by 19-fold.
It can be seen that all versions of CRAM now outperform BAM by a larger ratio than pre-Crumble, saving between 75 and 79% of storage. CRAM 3.1 is between 3.4 and 4.6% smaller than the analogous CRAM 3.0 file, mirroring the relative gains we saw in the pre-Crumble data. Also included is the early prototype of CRAM 4.0, which pushes this trend slightly further.
For both pre- and post-Crumble data, CRAM 3.1's default compression level is within 1% of, or smaller than, CRAM 3.0's small profile (which uses larger block sizes and bzip2), while also being faster (speed not shown above).

Random access
High data compression is useful, but it sometimes comes with a random access cost. As we make the blocks larger and the granularity of random access coarser, the compression ratio will improve.
The comparisons so far have been with mostly default options, and different tools have very different default block sizes. For the NA12878 NovaSeq data set we have computed the average number of alignment records per block as reported by the tools. These can be seen in Table 16. The BAM record count was computed by dividing the uncompressed file size by 0xff00 (the size of a BGZF data block), and then dividing the total number of records by that figure. Similarly the Deez and Genozip numbers are computed from the total record count and the number of blocks. MPEG-G is reported to use 65,536 record pairs per Access Unit, and we counted the number of records observed per 65,536 unique read identifiers. This is not quite a 2:1 ratio as some templates span access units or are aligned far away.
To test the I/O performance with random access, we performed a single range query on Chr1 at position 10,000,000 of length 1Kbp, 10Kbp, 100Kbp, 1Mbp and 10Mbp (i.e. position 10M to 20M for the largest). The average depth across the largest region was 36x. We used the io_trace tool to evaluate the number of read and seek system calls, and the total number of bytes transferred. We repeated the test using Unix time to measure the elapsed time to query and report SAM records, without threading enabled. These were piped into wc -l to validate the result. These results are presented in Table 17.
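The range queries above were issued with samtools view; for completeness, the sketch below shows the equivalent single-region query through the HTSlib C API (the file name and region are placeholders).

#include <stdio.h>
#include <htslib/sam.h>

/* Count the records overlapping one region of a BAM or CRAM file. */
long count_region(const char *fn, const char *region) {
    samFile   *in  = sam_open(fn, "r");
    if (!in) return -1;
    sam_hdr_t *hdr = sam_hdr_read(in);
    hts_idx_t *idx = sam_index_load(in, fn);       /* .bai/.csi or .crai */
    hts_itr_t *itr = sam_itr_querys(idx, hdr, region);
    bam1_t    *b   = bam_init1();

    long n = 0;
    while (sam_itr_next(in, itr, b) >= 0)          /* overlapping records */
        n++;

    bam_destroy1(b);
    hts_itr_destroy(itr);
    hts_idx_destroy(idx);
    sam_hdr_destroy(hdr);
    sam_close(in);
    return n;
}

int main(void) {
    /* e.g. the 1Mbp query from the text */
    printf("%ld\n", count_region("NA12878.final.cram", "chr1:10000000-11000000"));
    return 0;
}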
There were some disparities in the number of records reported, with Deez and Genozip reporting records that start on or after Chr1:10000000 rather than records that overlap it. There is still some small discrepancy between these tools, but this was not investigated. I/O stats are not visible for Deez as it uses memory-mapped I/O, which cannot be seen in a system call trace. All reference-based tools exclude the I/O on the reference itself (which is also mmapped for the HTSlib CRAM implementation). The CRAM and BAM I/O figures do not include the index, while Genozip and Deez have an internal index. The index file size for BAM was 9,282,200 bytes and for CRAM it was 1,379,148 bytes.
A more complete comparison was made with BAM, CRAM 3.0 and CRAM 3.1, accessing all exon and all gene regions across Chromosome 1. The BED files were produced from the Ensembl release 104 GTF at http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/.
This was converted to exon and gene BED files with awk, and tested using the SAMtools multi-region iterator:

zcat Homo_sapiens.GRCh38.104.gtf.gz | awk '$1 == 1 && $3 == "gene" \
    {printf("chr%d\t%d\t%d\n", $1, $4-1, $5)}' > genes.bed
zcat Homo_sapiens.GRCh38.104.gtf.gz | awk '$1 == 1 && $3 == "exon" \
    {printf("chr%d\t%d\t%d\n", $1, $4-1, $5)}' > exons.bed
samtools view -f 0xffff -M -L exons.bed NA12878.final.bam

The genes and exons files have 5475 and 136680 regions respectively, with the gene regions covering 78.6Mbp. The I/O statistics were gathered with, for example:

io_trace -r `pwd` -- samtools view -f 0xffff -M -L exons.bed `pwd`/NA12878.final.31.cram

Table 18 shows the I/O statistics for the exon and gene regions. The I/O load is significantly lower for the CRAM files, but the elapsed time was larger. Note these test files were in disk cache, so this mainly reflects CPU time. It is unknown what the elapsed time would be after purging the cache, as we were unable to perform this experiment, but querying the genes.bed file on a different BAM out of disk cache took 85s with an I/O read throughput of 45MB/s. This likely indicates that on uncached data BAM is still faster than CRAM with local storage. For remote network-based I/O, for example via an htsget server, we would expect this to be reversed due to fewer access regions (seeks), higher latencies and probable lower transfer speeds.