CARGO: effective format-free compressed storage of genomic information

The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting from them suboptimal design choices; this hinders flexible and effective data sharing. Here, we present CARGO (Compressed ARchiving for GenOmics), a high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors and scale well to multi-TB datasets. All CARGO software components can be freely downloaded for academic and non-commercial use from http://bio-cargo.sourceforge.net.


What is CARGO?
CARGO (Compressed ARchiving for GenOmics) is a set of tools and a library providing building blocks for the creation of applications to store, compress and manipulate large-scale genomic data. The main goal of CARGO is to supply universal and format-independent storage methods, whereby the record data type can easily be described by the user in terms of a special meta-language, high-performance compressing/decompressing tools can be generated from the record data type with little effort, and the tools thus produced can be used to store compressed genomic datasets in big containers.

Main features
The main features of CARGO are:
• Efficient storage of genomic data in compressed form
• Data aggregated into configurable containers of giga- and terabytes in size, which can hold multiple datasets having different formats
• Record format defined by the user in a special meta-language that allows describing any file format used in genomic applications; for some of those (at the moment FASTQ and SAM) support is provided out-of-the-box (see Quickstart and Examples)
• Automatic high-performance and multi-threaded processing of the data
• Possibility of implementing range searches on top of an arbitrary order defined by the user
• Data parsing and transformation methods explicitly specified by the user
• Multiple compression methods to be selected by the user depending on the characteristics of the input data.

General workflow
To create a simple compressor for a specified genomic file format from scratch, the user only needs to:
1. Define the record data type in the high-level CARGO meta-language
2. Translate the record definition into a set of C++ files with the CARGO tools
3. Write a simple record parser in C++ using the record data type automatically generated in the previous step
4. Compile the automatically generated application template using the automatically generated Makefile
5. Create a container, or use an existing one with enough free space
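Assembled from the commands shown in the following chapters, the whole workflow might look as follows (file and container names are illustrative, taken from the FASTQ example below):

    export CARGO_PATH=/path/to/cargo/directory/
    cargo_translate -i FastqRecord.cargo               # steps 1-2: translate the record definition
    # step 3: complete the generated FastqRecord_Parser.h by hand
    make -f FastqRecord_Makefile.mk                    # step 4: build the cargo_fastqrecord application
    cargo_tool --container-file=fastq_container --create-container \
        --large-block-count=16 --large-block-size=4 \
        --small-block-count=64 --small-block-size=256  # step 5: create a container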

Simple FASTQ format compressor
Tip: The application source files for this tutorial are available in the directory cargo/examples/fastq/fastq-simple of the official CARGO distribution. Should one prefer to skip the introduction, record data type translation, parser coding and application building steps, a pre-compiled, ready-to-use FASTQ format compressor is available in the directory cargo/examples/bin. The sample FASTQ file compression use-case is presented starting from the Creating container subchapter below.

Prerequisites
Before compiling the application, the path to the CARGO distribution directory needs to be set in the build environment in order to access the CARGO C++ header files and libraries:

    export CARGO_PATH=/path/to/cargo/directory/

As CARGO relies on several publicly-available compression libraries, the zlib (libz) and bzip2 (libbz2) libraries need to be present in the system for linking.
Compiling CARGO applications also requires a compiler with C++11 standard support (for multi-threading); by default, the gcc compiler version 4.8 or above should be used.

FASTQ format
The FASTQ format is an ASCII text-based format used to store biological sequences together with their quality score values. A sample record looks like the following:

    @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
    GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

A general record contains:
• read id - an identifier of the read, starting after the @ symbol,
• sequence - a sequence of nucleotides encoded using the letters AGCTN,
• plus - a control line, optionally containing a repetition of the read identifier,
• quality - the Phred sequencing quality scores of the sequence.

Record type definition
In general, a FASTQ record can be seen as a triplet consisting of 3 fields: read id, sequence and read quality (discarding the redundant information in the plus field). The read id field is usually a collection of tokens separated by a set of delimiters incl. -_ ,.;:/#; the sequence field is, in the majority of cases, a list of nucleotide bases AGCTN; and the quality is a list of Phred numeric values encoded as ASCII characters. However, for the simplicity of this example, all the record fields will be represented as a string type. Such a record definition in the CARGO meta-language is as follows (more details about the meta-language syntax and the available data types can be found in The CARGO meta-language chapter):
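A minimal sketch of the definition; the exact punctuation of the meta-language is an assumption here, while the field names (tag, sequence, quality) follow the FASTQ example used throughout this manual:

    @record FastqRecord

    FastqRecord = {
        tag      : string
        sequence : string
        quality  : string
    }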

Translating
Having the FASTQ record definition saved in the FastqRecord.cargo file, the next step is to translate the definition from the CARGO meta-language to C++ code by running the cargo_translate tool available in the cargo/tools subdirectory:

    cargo_translate -i FastqRecord.cargo

After the translation, a set of files will be generated:
• FastqRecord.h: C++ definition of the FastqRecord user record type
• FastqRecord_Parser.h: C++ parser template for the FastqRecord C++ user record; it will need to be completed by the user
• FastqRecord_Type.h: C++ TypeAPI-based record type specification for the FastqRecord record type (for subsequent internal use; it does not need to be opened or modified by the user)
• FastqRecord_main.cpp: template file containing the compressor/decompressor applications writing/reading a stream of FastqRecord records to/from containers
• FastqRecord_Makefile.mk: Makefile template used to build those applications.
The translated C++ record definition, which will be used later when implementing the parsing methods, is sketched below.
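A plausible form of the generated header fragment, assuming a plain struct with all fields represented as strings (as stated above):

    struct FastqRecord
    {
        std::string tag;       // read id
        std::string sequence;  // nucleotide sequence
        std::string quality;   // Phred quality scores
    };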

Note: More details regarding the translation of the FASTQ record definition into C++ code can be found in the FASTQ example subchapter.

Writing a FASTQ records parser
In the next step, the missing record parsing functions will be implemented; the C++ functions are in the class FastqRecord_Parser (FastqRecord_Parser.h) and include:
• void SkipToEndOfRecord(io::MemoryStream&) - skips the characters in the memory stream until the end of the current record (if any),
• void SkipToEndOfHeader(io::MemoryStream&) - skips the characters in the memory stream until the end of the file header (none in the case of FASTQ),
• bool ReadNextRecord(io::MemoryStream& stream_, FastqRecord& record_) - reads the next record from the memory stream and fills the FastqRecord structure member fields with the parsed data; returns true on success,
• bool WriteNextRecord(io::MemoryStream& stream_, FastqRecord& record_) - formats the FastqRecord structure member fields into the textual FASTQ format, saving the output to the memory stream.
A complete implementation of the FastqRecord_Parser class is outlined below.
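A minimal sketch, assuming the FieldParser helpers described later in Helper classes; the ReadNextField/SkipNextField calls and the record-boundary logic are assumptions, not the distributed implementation:

    using namespace type;

    class FastqRecord_Parser
    {
    public:
        // FASTQ files have no header, so there is nothing to skip
        void SkipToEndOfHeader(io::MemoryStream&) {}

        // skip the four lines making up the current record
        void SkipToEndOfRecord(io::MemoryStream& stream_)
        {
            for (int i = 0; i < 4; ++i)
                FieldParser::SkipNextField(stream_, '\n');
        }

        bool ReadNextRecord(io::MemoryStream& stream_, FastqRecord& record_)
        {
            return FieldParser::ReadNextField(stream_, record_.tag, '\n')
                && FieldParser::ReadNextField(stream_, record_.sequence, '\n')
                && FieldParser::SkipNextField(stream_, '\n')   // redundant '+' line
                && FieldParser::ReadNextField(stream_, record_.quality, '\n');
        }

        bool WriteNextRecord(io::MemoryStream& stream_, FastqRecord& record_)
        {
            return FieldParser::WriteNextField(stream_, record_.tag, '\n')
                && FieldParser::WriteNextField(stream_, record_.sequence, '\n')
                && FieldParser::WriteNextField(stream_, std::string("+"), '\n')
                && FieldParser::WriteNextField(stream_, record_.quality, '\n');
        }
    };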

Building
Important: Before building the compressor, the Prerequisites need to be met on the development machine.
To build the simple FASTQ compressor, the generated makefile FastqRecord_Makefile.mk can be used with GNU make:

    make -f FastqRecord_Makefile.mk

As a result, a cargo_fastqrecord executable will be created.

Creating container
As the sequencing data is stored in CARGO containers (independently of the record formats), an existing container needs to be used or a new one created using the cargo_tool utility from the cargo/tools directory. Creating a container fastq_container with a sample configuration of 1024 (16 multiplied by 64) large blocks of 4 MiB in size and 4096 (64 multiplied by 64) small blocks of 256 KiB in size is straightforward:

    cargo_tool --container-file=fastq_container --create-container \
        --large-block-count=16 --large-block-size=4 \
        --small-block-count=64 --small-block-size=256

As a result, 3 files defining a single container will be generated:
• fastq_container.cargo-meta - holds the container's meta information,
• fastq_container.cargo-stream - contains the data streams,
• fastq_container.cargo-dataset - holds information about the stored datasets.

Running
To store (compress) the SRR001666.fastq file in the fastq_container container under the dataset name SRR001666 using the compiled cargo_fastqrecord compressor:

    ./cargo_fastqrecord c -c fastq_container -n SRR001666 -i SRR001666.fastq

To retrieve (decompress) the SRR001666 dataset from the container and save it as the SRR001666.decomp.fastq file:
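the corresponding command, assuming the d (decompressor) selector and the -o output option described later under Running the application, would be:

    ./cargo_fastqrecord d -c fastq_container -n SRR001666 -o SRR001666.decomp.fastq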

CARGO CONTAINERS
In contrast to the standard file-based approach to compression, CARGO uses specially created containers to store, retrieve and query genomic data. This strategy makes it possible to aggregate multiple files, each one possibly having a different file format, into a single container, and thus to store in a compact way sequencing data, analysis intermediates and final results coming from either a single experiment or multiple sequencing projects.

Architecture
The CARGO container consists of 3 different areas (or parts); their conceptual structure is presented in the figure The conceptual structure of a CARGO container. In order to achieve a better organization and simplify the implementation of backup methods, the container areas are stored on disk as separate files sharing the common prefix *.cargo- (where * stands for an arbitrary container name specified by the user):
• *.cargo-meta: a file storing the container meta-information (i.e. the container's internal block configuration)
• *.cargo-dataset: a file storing information about the datasets present within the container, including their underlying structure in terms of data streams
• *.cargo-stream: a file storing the data streams, distributed into a big number of large blocks and small blocks.
Warning: All three files together define the container as a whole, so corruption of any of them may lead to the unrecoverable loss of stored data. In particular, the information contained in the *.cargo-stream file cannot be recovered without the other two files. This is why a backup functionality for the meta-information and dataset areas has been implemented in cargo_tool (for more details, see Container tool).

Meta-information area
The meta-information area (file: *.cargo-meta) contains information regarding the container block configuration and the block allocation table. The container block configuration is determined by the sizes and numbers of large blocks and small blocks, which are defined at the moment of container creation. The block allocation table holds information about each block's occupancy state, which can be either free, reserved (currently being written to, but not finalized) or occupied. This area is crucial for the proper block allocation mechanisms when operating on the streams area of the container.

Dataset area
The dataset area (file: *.cargo-dataset) contains the descriptions of the datasets stored inside the container. Each dataset description contains information about the dataset record type, its underlying stream hierarchy and the blocks it occupies, optional file header information (if available) and selected data statistics. This area is crucial for 'understanding' the data stored inside the streams area, thus allowing for its retrieval or removal.

Streams area
The streams area (file: *.cargo-stream) is the heart of the container: it holds the genomic data organized into streams determined by the records' type definition. Internally, the data inside each stream is stored as a collection of blocks; the list of occupied blocks, together with the other stream information, is held in the dataset area. The streams area is divided into the large block area and the small block area, which are defined by the number and size of large and small blocks. The sizes and numbers of the blocks are configured at the time of container creation (for more information, see Container tool).

Container tool
The CARGO container tool, cargo_tool, is a general utility for working with containers. It provides functionality to create, shrink, remove and back up containers, and to display information about the stored data.

Options
When launched from the command line, cargo_tool displays the following message:

    Options:
    --container-file=<name> - container filename (required for all operations)
    * Container lifecycle:
    --create-container      - creates a container file with the specified parameters
    --large-block-size=<n>  - large block size in MiB (n = [1 - 256], power of 2)
    --large-block-count=<n> - large blocks count, which will be multiplied by 64
    --small-block-size=<n>  - small block size in KiB (n = [64 - 16384], power of 2)
    --small-block-count=<n> - small blocks count, which will be multiplied by 64
    --remove-container      - removes a container file
    --clear-container       - clears the contents of the container
    --shrink-container      - shrinks the container size to fit its content
    * Container backup:
    --snapshot-file=<file>  - file name of the snapshot
    --create-snapshot       - creates a snapshot of the dataset and meta areas
    --restore-snapshot      - restores the meta and dataset areas from a snapshot
    * Container diagnostics:
    --print-blocks          - prints the information about container blocks
    --list-datasets         - lists the names of all contained datasets
    --print-dataset         - prints information about the specified dataset
    --remove-dataset        - removes the specified dataset
    --dataset-name=<name>   - dataset name
    --help                  - displays this message

where the options specify:
--container-file=<c_file> - the container prefix file name (the suffix .cargo-* will be added to the file)
--create-container - an action indicator to create a container with the specified block configuration
--large-block-size=<lb_s> - the size of a large block (in MiB); must be in the range [1 - 256] and a power of 2
--large-block-count=<lb_n> - the number of large blocks; the actual number of blocks will be multiplied by 64
--small-block-size=<sb_s> - the size of a small block (in KiB); must be in the range [64 - 16384] and a power of 2
--small-block-count=<sb_n> - the number of small blocks; the actual number of blocks will be multiplied by 64
--remove-container - an action indicator to remove the specified container
--clear-container - an action indicator to clear the contents of the specified container
--shrink-container - an action indicator to shrink the size of the specified container
--snapshot-file=<s_file> - a file name for the container snapshot of the meta and dataset areas
--create-snapshot - an action indicator to create a snapshot and save it under the given file name
--restore-snapshot - an action indicator to restore a snapshot from the given file name
--print-blocks - an action indicator to print block information from the meta-information area
--list-datasets - an action indicator to list the names of all datasets stored in the container
--print-dataset - an action indicator to print the information about the specified dataset in the container
--remove-dataset - an action indicator to remove the specified dataset from the container
--dataset-name=<d_name> - the name of the queried dataset

For example, to create a container data_container with the total number of 2048 large blocks of 8 MiB and 1024 small blocks of 256 KiB:

    cargo_tool --create-container --container-file=data_container \
        --large-block-size=8 --large-block-count=32 \
        --small-block-size=256 --small-block-count=16
Note: When shrinking a container (to adapt its size to the size of the stored data), free and otherwise non-occupied blocks will be released, and the total number of used blocks will be rounded up to the nearest multiple of 64.
In the case of the previous example, having reserved 2048 large blocks and 1024 small blocks of which only a fraction is occupied, i.e. 500 large blocks and 100 small blocks, after shrinking the stream area will keep 512 large blocks (500 rounded up to a multiple of 64) and 128 small blocks (100 rounded up to a multiple of 64). Further examples:
• Print the data_container block information:

    cargo_tool --print-blocks --container-file=data_container

• Print the HG00380 dataset information from the container data_container:

    cargo_tool --print-dataset --dataset-name=HG00380 --container-file=data_container

THE CARGO META-LANGUAGE
With the aim of making genomic data compression prototyping accessible to a wider audience, CARGO introduces a flexible meta-language that can be used to define record data types, in the spirit of what happens with databases. Subsequently, by running commands from the CARGO toolchain, one can automatically translate the record type into low-level C++ code, thus achieving both flexibility of implementation and high runtime performance.
In this section we introduce and explain the syntax of the CARGO meta-language in Backus-Naur form. The latter represents the formal definition of the language, and hence should be regarded as the authoritative reference to be consulted in case of doubt. However, in order to make the semantics of the language easier to grasp, the material presented in this section is slightly different from the actual implementation: some productions have been rearranged and some replaced with conceptually equivalent ones whenever implementation technical details could be confusing to the reader.

Typesetting conventions
Backus-Naur productions are typeset as in the example below, where example, case_one and case_two are nonterminals, COMMA is a terminal, and "@" is a literal. In addition, constructs in square brackets like [case_three] are optional.
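An illustrative production built from these elements (its exact shape is an assumption):

    example ::= case_one | case_two COMMA "@" [case_three]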

Pre-processing directives
There are two pre-processing directives implemented so far: inclusions and root type declarations. Inclusion directives (introduced by the keyword @include) allow textual inclusion of other meta-program source files into the current one:

    directive_include ::= "@include" QUOTED_STRING

Root type declarations (introduced by the keyword @record) have the following form:
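a plausible form, assumed by analogy with the inclusion directive and the uppercase type-name convention, is:

    directive_record ::= "@record" UPPER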
    typeexpr ::=
      | typedef [":" one_or_more_annotations]
      | typeext [":" one_or_more_annotations]

Type definitions, which lead to the production of both a C++ type and an associated automatic type interface, can be either bare type declarations, or type declarations followed by modifiers (annotations, see annotation). Annotations always have the effect of generating a new automatic type interface, even if the type being annotated has already been defined. The generation of new C++ code for the type is not performed when a type is defined in terms of an already defined type; however, it can be forced by using the operator := in lieu of = in type definitions (we also say that by doing so the type is being extended).
The name of the type being defined can only begin with an uppercase letter (it can only be an uppercase identifier):

    typedef ::= UPPER "=" typedef_rhs
    typeext ::= UPPER ":=" typedef_rhs

The following table summarizes the semantics of type definitions when a new type is derived from a previously defined one:

Assignable types
There are four possible main ways of defining a type:
1. As a record/product type (record_type, similar in spirit to a C struct). Any record type must have two or more named fields.
2. As a union/variant type (union_type, similar in spirit to a C union). Any union type must have two or more named fields.
3. As a basic type (basic_type), that is, a simple combination of predefined/already defined types.
4. As a subtype (referenced_subtype), i.e. as part of an already defined type.
In addition, a final semicolon can optionally be present after the list of member types. Each member definition has a general form in which the name of the member can only begin with a lowercase letter (a lowercase identifier).

Reference to a subtype
Finally, a type definition can be the name of a subtype (subtype, a subtree of an already defined type) surrounded by parentheses:

    referenced_subtype ::=
      | "(" subtype ")"
      | array_unknown_length
      | string_known_length
      | string_unknown_length

General arrays
    array_known_length ::= basic_type "array" "*" INTEGER
    array_unknown_length ::= basic_type "array"

For instance:

    int array * 4
    int array

where the first line declares an array of signed integers of fixed length 4, while the second declares an array of signed integers of unknown (variable) length.

Strings
By definition, the following type equation holds true:

    string = char array

that is, a string is an array of characters. Hence the following aliases are provided as notational shortcuts:

    string_known_length ::= "string" "*" INTEGER
    string_unknown_length ::= "string"

Qualified predefined types
They are characters, signed integers or unsigned integers with a defined bitness:

    qual_predef_type ::=
      | "char" "^" INTEGER
      | "int" "^" INTEGER
      | "uint" "^" INTEGER

In addition, the following type equation holds true:

    bool = uint ^ 1

that is, a boolean is an unsigned integer with 1 bit; the alias is provided as a notational shortcut.

Unqualified predefined types
They are characters, signed integers or unsigned integers without a defined bitness:

    unqual_predef_type ::=
      | "char"
      | "int"
      | "uint"

In fact, all those definitions are notational shortcuts, as the following type equations hold true:

    char = char ^ 8
    int = int ^ 64
    uint = uint ^ 64

i.e. a generic character is assumed to have 8 bits, while a generic integer is assumed to have 64 bits.

User-defined types
And finally, a type can also be defined in terms of an already defined type (which must be an identifier starting with an uppercase character, UPPER):

    user_type ::= UPPER

Subtypes
Subtypes are parts (or, more precisely, subtrees) of already defined types. Subtypes can be: the empty subtype (empty_subtype), in which case the parser will take as subtype the type defined last; a subtype previously defined by the user (user_subtype); the base type of an array type (vector_element_subtype); or the type of a field of a record type (compound_subfield_subtype). The empty subtype comes in handy when annotating a type that has just been defined (see Annotations).

Annotations
Annotations allow the user to provide the CARGO framework with more information about one or more (sub)members of record types. A typical example is a directive stating that a particular field should be compressed using some specific compression method. Such information is subsequently gathered and used by the backend in order to generate more efficient C++ code.

Block types
The size of the block used when compressing (the content of) a given subtype can be specified by assigning a value to the virtual field Block of the subtype; likewise, the compression method can be chosen from a list of literals that includes:

    | "PPMd" | "LZMAL1" | "LZMAL2" | "LZMAL3" | "LZMAL4" | "LZMA"

where Integer and Text are predefined values suitable for the compression of the corresponding types (see The Type API). None can be used to turn off compression. For a precise definition of all other methods in terms of their corresponding algorithms, see The Type API.

Sorting field
The user can optionally annotate (the content of) a subtype as the value by which records should be sorted by assigning a True value to the virtual field Key of the subtype:

    sorting_field ::= subtype "." "Key" "=" "True"

The annotation can be used to generate additional C++ code (see Translator tool).

Translator tool
The cargo_translate tool is a utility to translate record data type definitions written in the CARGO meta-language into all the low-level C++ components needed to produce a working compressor/decompressor tool for the record: a C++ user record definition (see The Type API), a TypeAPI-based definition (see The Type API) and several CARGO application template files (see Application templates). Among others, the following options are accepted:

    -t, --transform   generate a record transformation class template in order to apply
                      transformations on records while processing data (to file output_prefix_Transform.h)
    -k, --keygen      generate a key generator class template in order to index sorted
                      records while processing data (to file output_prefix_KeyGenerator.h)
    -v, --verbose     display additional information while parsing
    -h, --help        display help message

As explained in Pre-processing directives, cargo_translate will generate several C++ files for each CARGO meta-language record definition that has been flagged with a @record keyword (see Examples for examples of use).
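For instance, to translate the FASTQ record definition from the earlier example while also generating the transform and key generator templates (this flag combination is assumed from the descriptions above):

    cargo_translate -i FastqRecord.cargo -t -k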

THE TYPE API
The main objective of the CARGO TypeAPI is to provide an abstract layer separating the low-level CARGO data streams representation and the high-level record type definition. In addition, the actual C++ data types and the stream access patterns can be deduced at compile time, resulting in the generation of optimized, high-performance data processing routines. The user only needs to define a record data structure in C++ and provide its definition in terms of the TypeAPI -the compiler will then transparently generate all the specialized code that encapsulates the underlying CARGO data streams logic.
Tip: For ease of use, the TypeAPI types definitions and their corresponding C++ record data structures (with some additional helper classes) can be automatically generated from CARGO meta-language (see The CARGO meta-language) by using the cargo_translate tool.

C++ types
From the CARGO standpoint, standard C++ types can be divided into two sets -basic and complex types, which differ in the way they are internally handled by the TypeAPI layer.

Basic types
Basic types are a subset of the plain old C data types, i.e. integer, character and boolean types. The numeric type names follow the pattern [u]int(_bits_), where:
• the optional u prefix specifies that the numeric type is unsigned,
• (_bits_) represents the integer width in bits (a power of two).
In addition to the standard char and bool types, there are also uchar (unsigned char) and byte (uint8) types, which in the general case correspond to the same C++ type; all the available C++ basic types are presented in the table Basic C types.

Complex types
The complex set of C++ data types consists of:
• string types - the C++ STL std::string type,
• array types - containers storing basic or complex types, based on the C++ STL std::vector,
• struct types - compositions or tagged unions made of basic and complex types, defined using the standard C++ struct type.

Compression options
When defining the record types, compression method and block size might be specified explicitly to achieve better data compression or performance. By default, when defining types using TypeAPI, those parameters are optional.

Compression methods
CARGO currently implements a set of compression methods based on the popular open-source compressors gzip, bzip2, PPMd and LZMA, and can easily be extended with other ones as plugins. The compression method names follow the consistent schema Compression(_method_)L(_level_), where:
• (_method_) - defines the compressor,
• (_level_) - defines the compression level, in the range 1-4.
The available compression methods are presented in tables Compression methods and Default compression methods.


Compression block sizes
Alongside the compression method, when defining the record type the size of the compression block can be selected by specifying the appropriate enumeration. Block size enumerations follow the schema BlockSize(_size_), where (_size_) specifies a number (a power of two) together with a unit suffix (k for KiB, M for MiB), as in BlockSize512k or BlockSize32M. The specified block size corresponds to the maximum available size of the work data packet to be compressed (and the size of the internal data buffer); the data streams are stored in a block-wise manner. The available compression block sizes are presented in the tables Compression block sizes and Default compression block sizes.
Note: The size of the compression block might influence the resulting compression ratio at the higher levels of compression when using the PPMd or LZMA schemes, especially when the specified block size is much smaller than the internal compressor buffer (see Compression methods).

Basic types
TypeAPI basic types provide an interface for C++ basic types i.e. the numeric, character and boolean types (see: Basic types).

Type definition
The type definition interface takes the following parameters:
• _c_basic_type_ - stands for the corresponding basic C++ data type,
• _compression_method_ - stands for the compression method enumeration (optional), default: CompressionDefault,
• _block_size_ - stands for the underlying block size enumeration (optional), default: BlockSizeDefault.
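The template itself presumably has the following shape; the name TBasicType is an assumption based on the parameter list above and on the naming of the other TypeAPI templates:

    TBasicType< _c_basic_type_, _compression_method_, _block_size_ >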

Integer specialization
In addition to the general type definition, a specialized interface for defining integer types exists and is defined as follows:

    TIntegerType< _c_int_type_ >

where _c_int_type_ is a basic C++ numeric type.

Character specialization
The specialized type for the character is defined as CharType, which uses as a compression method CompressionText and as a block size BlockSizeText.

Boolean specialization
In a similar way as the character type, the boolean type is defined as BoolType using CompressionDefault as compression method and BlockSizeDefault as block size.

Basic array types
TypeAPI also provides an interface for array types containing elements of C++ basic types (see: Basic C types). The array types are based on the standard C++ std::vector and std::string types, with either a variable (the standard behavior) or a fixed length.
Despite using a fixed length in the case of TFixedArrayType, both array types are based on the standard C++ std::vector type for compatibility, ease of use and ease of integration.

Character array (string) specialization
The variable- and fixed-length character arrays (or strings) are special cases of array types; their definitions, sketched below, take the following parameters:
• _length_ - specifies the length of the fixed array,
• _compression_method_ - specifies the compression method enumeration (optional), default: CompressionText,
• _block_size_ - specifies the block size enumeration (optional), default: BlockSizeText.
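A sketch of the two definitions; TStringType's parameter order follows the usage example below, while the fixed-length variant's name is an assumption:

    TStringType< _compression_method_, _block_size_ >
    TFixedStringType< _length_, _compression_method_, _block_size_ >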
As in the case of TFixedArrayType<>, both string types are based on the standard C++ std::string type. In addition, TypeAPI defines StringType, a specialized case of TStringType<> which uses CompressionText as the compression method and BlockSizeText as the size of the compression block.

Integer array type specialization
TypeAPI also provides specialized integer array types in both variants, variable- and fixed-length; they are defined as:

    TIntArrayType< _c_int_type_ >
    TFixedIntArrayType< _c_int_type_, _length_ >

where:
• _c_int_type_ - defines the basic plain C++ numeric data type,
• _length_ - specifies the length of the fixed array.
These types are specialized cases of the TArrayType<> and TFixedArrayType<> types, respectively, using CompressionNumeric as compression method and BlockSizeNumeric as block size.

Usage examples
• Defining a byte array type with the bzip2 level 4 compression scheme and a 512 KiB block size:

    typedef TBasicArrayType< byte, CompressionBzip2L4, BlockSize512k > MyByteArrayType;

• Defining a string type using the PPMd level 2 compression scheme and a 32 MiB block size:

    typedef TStringType< CompressionPPMdL2, BlockSize32M > MyStringType;

• Defining an unsigned 64-bit integer array type of fixed length 20, using the default compression options:

    typedef TFixedIntArrayType< uint64, 20 > MyFixedInt64ArrayType;

Enumeration types
Enumeration types provide functionality for working with integer, character or string label enumerations. The enumeration type is a special case of basic data type, and its C++ definition differs from the plain old C enum data type.
The goal of the _enum_key_map_ class template is to map user-specified enumeration symbols to the numeric index values used internally by TypeAPI. Such a key mapping class is parameterized as follows:
• _key_type_ - specifies one of the C++ basic types (numeric or character types) or the C++ std::string (string label type),
• _key_count_ - specifies the total number of available keys, i.e. user-specified enumerations,
and its static member function KeyType Key(uint32 idx_) returns the enumeration value (i.e. the key) associated with the specified index value idx_.

Usage examples
A sample DNA enumeration type definition consists of a set of 5 characters representing the possible nucleobases, i.e. ACGTN, with char as the enumeration value type:

    typedef TEnumType< SampleCharEnumKeyMap > DnaEnumType;
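A minimal sketch of the corresponding key-mapping class; the member layout follows the description above, though the exact member names required by TypeAPI (such as KeyType and KeyCount) are assumptions:

    struct SampleCharEnumKeyMap
    {
        typedef char KeyType;              // _key_type_: a C++ basic type or std::string
        static const uint32 KeyCount = 5;  // _key_count_: total number of enumeration keys

        // returns the enumeration value (i.e. the key) associated with index idx_
        static KeyType Key(uint32 idx_)
        {
            static const char keys[5] = { 'A', 'C', 'G', 'T', 'N' };
            return keys[idx_];
        }
    };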

Struct type
The TypeAPI struct type provides an interface for defining product data types similar to the standard struct type available in C/C++; a struct type can be composed of either basic or complex types.

Type definition
To use the TypeAPI struct type, a class containing the description of the C++ struct type members and their access must first be defined, where:
• _cpp_struct_ - the C++ structure type name,
• _cpp_struct_description_ - the structure type description class name,
• _field_*_cpp_type_, _field_*_ - the C++ type names with their corresponding member names,
• _members_count_ - the total number of data fields defined in the C++ structure type,
• _field_*_typeapi_type_ - the TypeAPI type names corresponding to the C++ _field_*_cpp_type_.
In addition to describing the C++ types used in _cpp_struct_ (and linking them with their equivalent TypeAPI types), _cpp_struct_description_ also needs to define static member functions to access the corresponding struct members, implementing the mutable and immutable Get* functions. Having the C++ struct type described in TypeAPI, the actual struct type definition is as follows:

    TStructType< _user_struct_description_ >

Usage examples
• Defining a sample record type with 3 fields of character and numeric types, representing a nucleobase with its corresponding quality score at a given position. The sample record type VariantRecord consists of 3 fields:
• nucleobase of char type,
• quality of byte type,
• position of uint64 type.
The character stream representing the consecutive nucleobase values is defined in TypeAPI as NucleobaseType and is compressed using the default compression scheme and the default block size associated with CharType.
The consecutive quality values are defined as QualityType and are compressed using the bzip2 compressor at level 4 with a 512 KiB block size. Finally, the consecutive position values are defined as PositionType and are compressed using the default options associated with TIntegerType. These type definitions are then used to describe the C++ record type VariantRecord in VariantRecordDescription, which is later used to create the final struct record type definition in TypeAPI - VariantRecordType.
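A minimal sketch of these definitions; the TBasicType template name and the exact contents of the description class are assumptions, while the field types and compression options follow the description above:

    struct VariantRecord
    {
        char nucleobase;
        byte quality;
        uint64 position;
    };

    // TypeAPI stream types for the three fields
    typedef CharType NucleobaseType;                                           // default char options
    typedef TBasicType< byte, CompressionBzip2L4, BlockSize512k > QualityType; // bzip2 L4, 512 KiB blocks
    typedef TIntegerType< uint64 > PositionType;

    // links VariantRecord members with the TypeAPI types above and
    // implements the mutable and immutable Get* accessors, e.g.:
    struct VariantRecordDescription
    {
        static uint64& GetPosition(VariantRecord& r_) { return r_.position; }
        static const uint64& GetPosition(const VariantRecord& r_) { return r_.position; }
        // ... analogous accessors for nucleobase and quality, plus the type links
    };

    typedef TStructType< VariantRecordDescription > VariantRecordType;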

Union type
The union type provides an interface for defining tagged union types; these are based on the C/C++ struct type and differ from the standard C/C++ union type.

Type definition
As the union type and struct type interfaces are similar, the union type likewise first requires its type description in TypeAPI. However, as the union type defines a tagged union which, in usability terms, differs from the standard C/C++ union, it is based on the C/C++ struct type instead of union. As a result, it defines an additional kind field (uint32 __kind), explicitly holding the index of the currently used member of the union. The sample C++ union type definition with its corresponding TypeAPI description is parameterized as follows:
• _cpp_union_ - the C++ union type name,
• _cpp_union_description_ - the union type description class name.
Finally, using _cpp_union_description_, the TypeAPI union type is then defined as follows:

    TUnionType< _user_union_description_ >

Usage examples
Defining a sample union record type with 2 fields, a string and an integer, representing either a sequence or a match length depending on the usage scenario: the sample record type MatchRecord consists of 2 fields - sequence of std::string type and match of int64 type. The string stream values are defined in TypeAPI as SequenceType and are compressed using the PPMd level 4 compression scheme with a 32 MiB block size. The consecutive match values are defined in TypeAPI as MatchType and are compressed using the default options associated with the TIntegerType type. These type definitions are then used to describe the user record MatchRecord in MatchRecordDescription, which is later used to define the final user union record type definition in TypeAPI - MatchRecordType.
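A minimal sketch of these definitions; the match member name and the CompressionPPMdL4 enumerator are assumptions following the naming schemes above, while the __kind tag field is taken from the type definition:

    // tagged union: based on a C++ struct with an explicit tag field
    struct MatchRecord
    {
        uint32 __kind;         // index of the currently used union member
        std::string sequence;  // variant 0: a sequence
        int64 match;           // variant 1: a match length
    };

    typedef TStringType< CompressionPPMdL4, BlockSize32M > SequenceType;  // PPMd L4, 32 MiB blocks
    typedef TIntegerType< int64 > MatchType;                              // default integer options

    // MatchRecordDescription links the members with SequenceType/MatchType,
    // analogously to the struct type example above
    typedef TUnionType< MatchRecordDescription > MatchRecordType;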

Complex array types
TypeAPI complex array types provide interfaces for defining more advanced array types used to store C++ complex types (see: Complex types).

Array type definition
In order to use a complex array type, the underlying complex type needs to be previously defined in TypeAPI; it can be an enum, array, struct or union type. The variable- and fixed-length array types are then defined as follows:

    TComplexArrayType< _t_complex_type_ >
    TFixedComplexArrayType< _t_complex_type_, _length_ >

where:
• _t_complex_type_ - stands for a TypeAPI complex type definition name,
• _length_ - specifies the length of the fixed array.
The compression method and block size are not specified, as they are already defined in the TypeAPI complex type definition. As in the case of the basic array types, despite the fixed length provided for TFixedComplexArrayType, both array types use the standard C++ std::vector type for compatibility, ease of use and ease of integration.

Metalanguage to TypeAPI cheatsheet
The types available in the CARGO meta-language (see The CARGO meta-language), with their corresponding TypeAPI types, are presented in the tables included in the following subchapter.
The symbols used in the descriptions mean:
• T - TypeAPI type or type definition,
• t - the C++ user type,
• C - compression method enumeration (usually optional),
• B - compression block size enumeration (usually optional),
• n - the length of the fixed array.

Note:
The record type definitions translated from the CARGO meta-language to TypeAPI by cargo_translate (see: Translator tool) might differ in syntax, especially by using shorter versions of type names and automatically generated type names.

RECORD DATA TYPE PROCESSING
In order to convert the raw input data passed to CARGO applications into semantically meaningful records, and to process it further, the user needs to implement a small set of parsing functions in C++. The only mandatory ones are the raw data parsing routines defined in Record type parser. To that end, a toolbox made of several helper classes is provided within the CARGO library (see Helper classes). Furthermore, in order to perform certain operations on records, the user sometimes needs to define a data transformation function (see Record type transform) operating on the parsed record data type. Additionally, to store the records according to some sorting criterion specified by the user, a key generation function, useful in the case of data extraction and range querying (see Extractor), must occasionally be specified (see Record type key generator). The general concepts underlying the CARGO data processing workflow are presented in General concepts underlying the CARGO data processing workflow. More in detail, the data storage pipeline is as follows:

General workflow
1. First, the raw input genomic data in text form is parsed into the corresponding C++ data structures using the parsing functions implemented by the user (see: Record type parser)
2. Optionally, in the next step the records are transformed according to the forward transformation function specified by the user (see: Record type transform)
3. The elements of each record are split into a number of streams, subsequently compressed and finally stored inside the container.
Similarly, the general data access pipeline is as follows:
1. First, streams are decompressed from the container and the single stream elements are merged in order to reconstruct the original records
2. Optionally, in the next step the records are un-transformed according to the backward transformation function specified by the user
3. Records are unparsed, from the C++ data structure defining the record into raw text, using the unparsing function implemented by the user.

Helper classes
To quickly and easily implement record data parsers, i.e. in order to write procedures able to read/write raw data from input/to output, CARGO provides a set of basic helper classes.

Memory stream
The MemoryStream class (defined in <cargo/core/MemoryStream.h>) implements safe and easy-to-use member functions to read/write data from/to a specified chunk of memory. The most important member functions are presented in the table Selected MemoryStream class member functions:

bool ReadByte(uchar& b_)
    Reads the next byte from memory, on success advancing the read position by 1.
    returns: true on success, false when the end of memory is reached
    params: b_ - reference to the byte to store the read value

bool PeekByte(uchar& b_)
    Reads the next byte available in memory without advancing the read position.
    returns: true on success, false when the end of memory is reached
    params: b_ - reference to the byte to store the read value

bool WriteByte(uchar b_)
    Writes the next byte to memory, on success advancing the position by 1.
    returns: true on success, false when the end of memory is reached
    params: b_ - byte to be stored in memory

int64 Read(uchar * data_, uint64 size_)
    Reads up to size_ bytes from memory, on success advancing the position by the number of bytes read.
    returns: the number of bytes read; a value less than size_ means the end of memory was reached
    params: data_ - pointer to the raw memory region to store the data, with at least size_ bytes of available space; size_ - the number of bytes to read

int64 Write(const uchar * data_, uint64 size_)
    Writes up to size_ bytes to memory, on success advancing the position by the number of bytes written.
    returns: the number of bytes written; a value less than size_ means the end of memory was reached
    params: data_ - pointer to the raw memory region to write the data from, holding at least size_ bytes of stored data; size_ - the number of bytes to write

Tip: When implementing one's own record data parser code, the user can assume that an already initialized MemoryStream object will be passed to the parsing function.

FieldParser
The FieldParser class (defined in <cargo/type/UserDataParser.h>) provides static member functions for parsing delimiter-separated record fields, reading/writing the data directly from/to a MemoryStream. The most important member functions are presented in the table Selected FieldParser member functions:

static bool ReadNextField(MemoryStream& stream_, _T& v_, char separator_)
    Parses the next field from the memory stream, storing the parsed value.
    returns: true on success, false when the end of the memory stream is reached or on a parsing error
    params: stream_ - memory stream object to read from; v_ - reference to the output value to store the parsed data; separator_ - record fields separator character

static bool WriteNextField(MemoryStream& stream_, _T v_, char separator_)
    Formats the next numeric field of type _T and writes it to the memory stream.
    returns: true on success, false when the end of the memory stream is reached or on a parsing error
    params: stream_ - memory stream object to write to; v_ - integer value to be formatted and stored; separator_ - record fields separator character

static bool WriteNextField(MemoryStream& stream_, const std::string& str_, char sep_)
    Writes the next string field to the specified stream.
    returns: true on success, false when the end of the memory stream is reached or on a parsing error
    params: stream_ - memory stream object to write to; str_ - reference to the string to be stored; sep_ - record fields separator character

static bool PeekNextByte(MemoryStream& stream_, uchar& b_)
    Peeks the next available byte from the specified memory stream.
    returns: true on success, false when the end of the memory stream is reached
    params: stream_ - memory stream object to read from; b_ - reference to the byte to store the read value

static bool SkipNextField(MemoryStream& stream_, char separator_)
    Skips the next field, up to and including the separator character.
    returns: true on success, false when the end of the memory stream is reached or on a parsing error
    params: stream_ - memory stream object to read from; separator_ - record fields separator character

static bool SkipBlock(MemoryStream& stream_, char blockStart_, char blockEnd_)
    Skips a whole block (or line) of text delimited by blockStart_ and blockEnd_.
    returns: true on success, false when the end of the memory stream is reached
    params: stream_ - memory stream object to read from; blockStart_ - character delimiting the beginning of the text block; blockEnd_ - character delimiting the end of the text block

Record type parser
The record type parser is a class providing static member functions to parse raw textual data chunks, filling the provided C++ record data structures. The class can be seen as a link between the user-understandable text data and the CARGO-understandable structured data. It is the only obligatory class whose member functions need to be implemented by the user, using the previously introduced helper classes, i.e. FieldParser and/or MemoryStream.

Parser class template
The parser class code template, sketched below, operates on the following arguments:
• MemoryStream& stream_ - represents the memory stream from/to which the record data is going to be read/written,
• TRecord& record_ - represents the user C++ record which contains or will contain the data of interest.
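A minimal sketch of the template's shape, based on the four functions described below (the class and template parameter names are assumptions):

    template <class TRecord>
    class TRecordParser
    {
    public:
        // skip bytes up to the end of the file header (may be empty)
        void SkipToEndOfHeader(MemoryStream& stream_);
        // skip bytes up to the end of the current record
        void SkipToEndOfRecord(MemoryStream& stream_);
        // parse the next record from the stream into record_
        bool ReadNextRecord(MemoryStream& stream_, TRecord& record_);
        // format record_ back into text, writing it to the stream
        bool WriteNextRecord(MemoryStream& stream_, TRecord& record_);
    };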

# skip until the end of the header
The function:

    void SkipToEndOfHeader(MemoryStream& stream_)

implements skipping the next bytes in the memory stream until the end of the (file) header, which may appear depending on the file format. Quite often the genomic file header lines begin with a special symbol, e.g. @ (SAM format) or ## (VCF format), and they need to be filtered out before the actual record processing. This function can be left empty, as not every genomic file format uses a file header (e.g. the FASTA and FASTQ formats).

# skip until the end of the record
The function:

    void SkipToEndOfRecord(MemoryStream& stream_)

implements skipping the next bytes in the memory stream until the end of the (current) record, in order to properly position the parser at the beginning of the next one. As the records are quite often stored line-by-line (e.g. in the SAM and VCF formats), the function will usually just skip to the end of the (current) line. In the case of the FASTQ and FASTA formats, where one record spans multiple lines, an analysis of the line ending the record needs to be taken into account (see Simple FASTQ format compressor).

# read next record
The function:

    bool ReadNextRecord(MemoryStream& stream_, TRecord& record_)

implements filling the translated C++ record data structure of TRecord type with the data parsed from the memory stream. For records defined in a tab-separated text format, using only the FieldParser helper class to parse the data should be sufficient.

# write next record
The function:

    bool WriteNextRecord(MemoryStream& stream_, TRecord& record_)

implements formatting the contents of the translated C++ record data structure of TRecord type and writing it to the memory stream. For records defined in a tab-separated text format, using just the FieldParser helper class should be sufficient.

FASTQ example
Following the FASTQ example from the Simple FASTQ format compressor subchapter, the simple FASTQ records parser can be defined as shown there. An empty SkipToEndOfHeader function is defined: the FASTQ format does not specify any file header, so the body of the function is left empty.
As a FASTQ record cannot be uniquely identified by analyzing a single line (it is defined by 4 consecutive lines), skipping to the end of a record requires analyzing the lines' structure to find the record boundary.

Record type transform
The record type transform is a class implementing user-specified operations that will be applied to every record while processing the data. Providing the transform class is not obligatory.

Transform class template
The transform class code template is as follows:
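A minimal sketch of its shape, based on the two functions described below (the class and template parameter names are assumptions):

    template <class TRecord>
    class TRecordTransform
    {
    public:
        // applied after parsing, before compression and storage
        void TransformForward(TRecord& record_);
        // applied after decompression, before unparsing
        void TransformBackward(TRecord& record_);
    };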

# transform forward
The function:

    void TransformForward(TRecord& record_)

implements the transformation operation applied to the records after parsing from the raw textual data format (i.e.: # read next record) and before the actual data compression and storage in the container. Practical use cases include: trimming or filtering sequences, and down-sampling or transforming the sequencing quality score values.

# transform backward
The function:

    void TransformBackward(TRecord& record_)

implements the transformation operation applied to the records after decompression from the container and before unparsing (i.e.: # write next record) back to the raw textual data format.

FASTQ example
Following the FASTQ example, an Illumina quality score reduction scheme will be applied as the record transformation, reducing the number of possible Q-scores to 8 values. The Q-score mapping implementing this scheme is presented in the table Q-scores mapping using Illumina quality scores reduction scheme. Each value of the qualities (stored using a std::string type) is processed in turn. As a first step, an offset QualityOffset is subtracted from each quality value; for the simplicity of this example it has a fixed value of 33 (for more information regarding quality values see: FASTQ format). The resulting absolute quality score is assigned to the q variable, which is then compared against the possible quality ranges and assigned a new value following the Illumina quality score reduction scheme. At the end of the processing, the new, transformed quality score, with the offset added back, replaces the current one.
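A sketch of such a TransformForward implementation; the bin boundaries below are an assumption, following Illumina's published 8-level Q-score binning scheme, since the mapping table itself is documented separately:

    void TransformForward(FastqRecord& record_)
    {
        static const int QualityOffset = 33;  // fixed Phred+33 offset, as in the text
        for (size_t i = 0; i < record_.quality.size(); ++i)
        {
            int q = record_.quality[i] - QualityOffset;  // absolute Q-score
            int b;                                       // binned Q-score
            if      (q < 2)  b = 0;   // no-call qualities collapsed
            else if (q < 10) b = 6;
            else if (q < 20) b = 15;
            else if (q < 25) b = 22;
            else if (q < 30) b = 27;
            else if (q < 35) b = 33;
            else if (q < 40) b = 37;
            else             b = 40;
            record_.quality[i] = (char)(b + QualityOffset);
        }
    }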
Tip: Performance-wise, to skip the 7 conditional range checks on the q value, a simple translation lookup table indexed by the q value itself can be used; see the Record transform sub-chapter example for the implementation.

Record type key generator
The record type key generator is a class implementing the key generation method that provides the sorting order of the records stored inside the container. The user needs to implement a function returning a key object of the C++ std::string type, which will later be used to compare the ordering of records. Providing this class is optional.
Caution: CARGO assumes that the records to be parsed and read are already sorted externally by the user in the specified order, e.g. using the standard Linux sort tool.

Key generator class template
The key generator class template is presented in the code snippet below. The key generation function is the only obligatory one to be implemented; it returns the user-specified record key value used for subsequent record comparisons.
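A minimal sketch of its shape; the class and function names are assumptions, while the std::string key type follows the description above:

    template <class TRecord>
    class TRecordKeyGenerator
    {
    public:
        // the only obligatory member function: returns the record key
        // used for subsequent record order comparisons
        std::string GenerateKey(const TRecord& record_);
    };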
Following the FASTQ example from the previous subchapters, the tag field will be used to generate the key for subsequent record order comparisons.

APPLICATION TEMPLATES
When translating a record data type definition in the CARGO meta-language with cargo_translate (see: Translator tool), in addition to the C++ and TypeAPI record type definitions a command-line application template file is also generated. This file, <record_type_name>_main.cpp, contains the definitions of a set of routines able to compress, decompress and optionally extract and transform input data formatted as a stream of records conforming to the user's original type specification. Such routines are based on the CARGO library application template interfaces (available in <cargo/type/IApp.h>). They are ready to be compiled and used right away; they support multi-threaded processing and Unix pipes for easy integration with existing tools.
Given a specified record data type, the generated standalone application contained in <record_type_name>_main.cpp can provide the following functionality:
• compressor routine: stores the data inside the container under a specified dataset name
• decompressor routine: retrieves a dataset from the container by name
• extractor routine: extracts from the container a contiguous range of the records belonging to a sorted dataset
• transformator routine: applies a user-specified transformation to the records belonging to a dataset.

Application interface class
The hierarchy of CARGO application interface classes (defined in <cargo/type/IApp.h>) is presented in CARGO application interfaces hierarchy.

# IApp
The IApp base class provides a simple interface for developing applications based on the CARGO library. In addition to the virtual destructor, it contains 2 pure virtual methods which need to be implemented by derived classes. The first one should contain the implementation of the application logic alongside the parameter parsing; argc_ and argv_ are the passed command-line parameters.

The other virtual method:

    void Usage()

should display the application usage information or help.

# ISpecApp
From the IApp interface, the further specialized classes ICompressorApp, IDecompressorApp, ITransformatorApp and IExtractorApp (named in general ISpecApp) are derived, each implementing its own specific functionality; they share a common concept. Each such class interface, derived from the IApp class, implements the run function, whose functionality is split into two smaller virtual functions open to future overloading by derived classes. The first one implements the parsing and processing of the command line arguments, saving them into the passed args_ parameter of the specified InputArgs type. The second one:

    int RunInternal(const InputArgs& args_)

implements the application-specific logic based on the CARGO library. In addition, the function:

    type::IUserDataProcessorProxy * CreateDataProcessor(const InputArgs& args_) = 0

is a pure virtual function playing the role of a placeholder, whose implementation is to be provided by the further derived classes, i.e. TCompressor, TDecompressor, TTransformator and TExtractor, named in general TSpecApp.

# TSpecApp
The aim of the TSpecApp class layer is to provide a record-data-type- and operation-specialized interface from which the final application can be created. The general interface of the specialized application template, sketched below, takes the following template parameters:
• class _TDataRecordType - TypeAPI record data type definition (see: The Type API),
• class _TRecordsParser - the user-specified record data type parser (see: Record type parser),
• class _TRecordsTransformator - an optional, user-specified record data type transform (see: Record type transform),
• class _TRecordsKeyGenerator - an optional, user-specified record data type key generator (see: Record type key generator).
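A minimal sketch of the shape such a specialization might take, using the compressor variant; the class and member names are taken from the interface hierarchy above, while the exact layout is an assumption:

    template <
        class _TDataRecordType,        // TypeAPI record data type definition
        class _TRecordsParser,         // user-specified records parser
        class _TRecordsTransformator,  // optional records transform
        class _TRecordsKeyGenerator    // optional key generator
    >
    class TCompressor : public ICompressorApp
    {
        // combines the parser, transform and key generator into the record
        // data processor used by the application logic
        type::IUserDataProcessorProxy * CreateDataProcessor(const InputArgs& args_);
        // further members omitted
    };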
More information about the usage of the specialized application templates is given below in the corresponding subchapter Running the application.

Application template file
The application template file <record_type_name>_main.cpp generated by the cargo_translate tool (see: Translator tool) contains the implementations of a sample record type compressor, decompressor, (optionally) transformator and (optionally) extractor sub-applications. In the generated standalone application a main function is defined, where the number of input arguments and their description depend on the sub-application being selected. The first argument is always the sub-application selector (c|d|t|e), which is checked in order to launch, respectively, the records compressor, decompressor, transformator or extractor.

Building the application
When a CARGO meta-language record definition is translated with cargo_translate (see: Translator tool), a Makefile is generated in addition to the several C++ files. This way, CARGO applications can be easily built: the Makefile contains all the required compilation flags, paths and libraries, which makes it the most convenient and recommended method. More advanced examples can be created by tweaking the Makefile.

Build prerequisites
Before building CARGO applications from the generated template application files, the CARGO_PATH environment variable needs to be set, pointing to the root of the CARGO installation directory: export CARGO_PATH=/path/to/cargo/directory/ As CARGO relies on several publicly-available compression libraries, the zlib (libz) and bzip2 (libbz2) libraries need to be present in the system when compiling.
Compiling CARGO applications also requires a compiler with C++11 standard support (for multi-threading); by default, the gcc compiler version 4.8 or above should be used.

Using generated Makefile
To build a CARGO application using the generated Makefile <record_type_name>_Makefile.mk, run:

    make -f <record_type_name>_Makefile.mk

A successful build will generate an executable named cargo_<record_type_name>_toolkit.

Running the application
A compiled application built from the provided CARGO templates, when run from the command line, displays the following message:

    CARGO <record_type_name> toolkit
    usage: ./cargo_<record_type_name>_toolkit <c|d|t|e> [options]

where the switch launches the sub-application related to the user-defined record data type:
• c - compressor,
• d - decompressor,
• t - transformator,
• e - extractor,
and [options] specifies the available options depending on the selected sub-application switch (described below).

Compressor
The compressor sub-application stores the record type input data inside the containers. When launched from the command line as:
./cargo_<record_type_name>_toolkit c
it displays the following message listing the available options:
-c container, --container=container    container file name prefix (e.g. <container_file>.cargo-)
-n name, --dataset-name=name           name of the dataset to be stored
-i file, --input-file=file             input file name (optional)
-t n, --threads-num=n                  the number of processing threads (optional)
-b size, --block-size=size             the block size (in MiB) of the input buffer (optional)
-q, --no-io-buffering                  disables the IO buffering of container blocks, reducing memory usage (optional)
-a, --apply-transform                  apply records forward transformation (if defined by the user, optional)
-g, --generate-key                     generate a records key for future ranged queries (if defined by the user, optional)
-s, --skip-key                         skip the 1st field as a generated key when parsing records (requires -a, optional)
-h, --help                             display help message
The compressor processes the user-defined type input data (from a file or, by default, from standard input) and stores it in the container under the specified dataset name. To better control performance, a maximum number of processing threads can be specified. The input block buffer size should be picked relatively large (by default 64 MiB), depending on the input data type and size, as it influences the compression ratio: in principle, the larger, the better, at the expense of higher memory usage. In addition to compressing, the records forward transformation can be applied, but only if the operator has previously been defined by the user in the <record_type>_Transform class (see: Record type transform) and compiled into the final compressor application. A similar concept applies to records key generation (see: Record type key generator), which is used inside the blocks to identify the order of the processed records.
Important: CARGO applications do not implement sorting of the records internally. Thus, when storing data in a specific requested order using a user-provided key generator, the input data needs to be sorted beforehand; this can be done with a combination of standard Linux tools such as sort.

Decompressor
The decompressor sub-application retrieves the record type data from the containers. When launched from the command line as:
./cargo_<record_type_name>_toolkit d
it displays the following message listing the available options:
-c container, --container=container    container file name prefix (e.g. <container_file>.cargo-)
-n name, --dataset-name=name           name of the dataset to be read
-o file, --output-file=file            output file name (optional)
-t n, --threads-num=n                  the number of processing threads (optional)
-q, --no-io-buffering                  disables the IO buffering of container blocks, reducing memory usage (optional)
-a, --apply-transform                  apply records backward transformation (if defined by the user, optional)
-h, --help                             display help message
The decompressor retrieves the user-defined type data stored in the container, outputting it either to a file or (by default) to standard output. In addition to decompression, the record data type backward transformation can be applied if it was previously defined by the user in the <record_type>_Transform class (see: Record type transform). The maximum number of processing threads can also be specified.

Transformator
The transformator sub-application performs the record type data transformations, applying the user-specified transform (either forward or backward) to all input records and saving the result to the output. It is launched from the command line as:
./cargo_<record_type_name>_toolkit t
which displays a message listing the available options. The transformator is the only sub-application that does not operate directly on CARGO containers. It applies the user-defined record data type transformation, either forward or backward (see: Record type transform), and/or key generation (see: Record type key generator) to all the records; a generated key is appended as an extra field at the beginning of each record. The input data is read either from a file or from standard input, and the output is saved either to a file or to standard output. The input buffer block size and the number of processing threads control the overall performance.

Extractor
The extractor sub-application retrieves the record type data from the containers within a specified key range. When launched from the command line as:
./cargo_<record_type_name>_toolkit e
it displays the following message listing the available options:
-c container, --container=container    container file name prefix (e.g. <container_file>.cargo-)
-n name, --dataset-name=name           name of the dataset to be read
-k k_begin-k_end, --key=k_begin-k_end  records extraction range, from key k_begin to key k_end
-o file, --output-file=file            output file name (optional)
-t n, --threads-num=n                  the number of processing threads (optional)
-q, --no-io-buffering                  disables the IO buffering of container blocks, reducing memory usage (optional)
-a, --apply-transform                  apply records backward transformation (if defined by the user, optional)
-h, --help                             display help message
The extractor retrieves the user-defined type data stored in the specified dataset of the container, restricted to the specified key range; the key generation function must have been defined earlier by the user (see: Record type key generator). In addition to extraction, the record data type backward transformation can be applied if it was previously defined by the user in the <record_type>_Transform class (see: Record type transform). The sub-application outputs the data either to a file or (by default) to standard output. The maximum number of processing threads can also be specified.

Important: The format of the keys specified in the extraction range from k_begin to k_end must match the key format defined by the user in the <record_type>_KeyGenerator class.

Example usages
• Using the compiled SAM record type toolkit cargo_samrecord_toolkit, store the HG00306.sam input SAM file in the HG00 container as a dataset named HG00306:
cargo_samrecord_toolkit c -c HG00 -n HG00306 -i HG00306.sam
• Retrieve the SAM dataset HG00306 from the HG00 container using 8 processing threads, saving the output as HG00306.out.sam:
cargo_samrecord_toolkit d -c HG00 -n HG00306 -o HG00306.out.sam -t 8
• Decompress the HG00306.bam BAM file using SAMTools and stream the output SAM data into the HG00 CARGO container under the HG00306 dataset name, using 8 processing threads and applying the user-defined records transformation:
samtools view -h HG00306.bam | cargo_samrecord_toolkit c -c HG00 \
-n HG00306 -t 8 -a
• Apply the user-specified records forward transformation and key generator to the HG00306.bam file, sort the records and store the streamed data in the HG00 container as a dataset named HG00-sorted. One possible pipeline (a sketch only: the transformator prepends the generated key, the standard sort tool orders the records by it, and the compressor skips the key field via -s):
samtools view -h HG00306.bam | cargo_samrecord_toolkit t -a -g \
| sort -k1,1 | cargo_samrecord_toolkit c -c HG00 -n HG00-sorted -t 8 -a -s

Examples
When creating a specific file format compressor or decompressor, the general workflow is as follows:
1. Define the record data type in the CARGO metalanguage
2. Translate the definition using the cargo_translate tool; a set of user files will be generated
3. Fill in the generated record data type parser template file
4. Optionally, define the record data type transformation (forward and backward)
5. Optionally, define the record data type key generator (for sorting order)
6. Build the application using the generated Makefile.
Important: Before building the examples, make sure that the build prerequisites described in the subchapter Build prerequisites are met.

FASTQ
The FASTQ example shows a simple proof of concept: with a few lines of code we create a compressor specialized for the FASTQ format. The results are comparable to those produced by state-of-the-art FASTQ format compressors.
The sources for this example are available in the cargo/examples/fastq/fastq-simple subdirectory of the standard CARGO distribution. A precompiled binary can be found in the directory cargo/examples/bin/cargo_fastqrecord_toolkit-simple.
Tip: Should one prefer to skip the following steps and build the FASTQ example straight away, a Makefile file is provided in the main directory together with a FASTQ record type definition in CARGO metalanguage (FastqRecord.cargo) and a complete record parser (FastqRecord_Parser.bak).

General format description
The FASTQ format is an ASCII text-based format used to store biological sequences together with their quality score values. A sample record looks like the following:
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
A general record contains:
• read id: an identifier of the read, starting after the @ symbol
• sequence: a sequence of nucleotides encoded using the AGCTN letters
• plus: a control line, optionally containing a repetition of the read identifier
• quality: the Phred sequencing quality scores of the sequence.
Note: Other IUPAC symbols can also appear in the sequence field and, in special cases, nucleotide sequences may use the lowercase notation agctn.

Record type definition
In our example, the FASTQ record is defined in the CARGO metalanguage (the definition is kept in FastqRecord.cargo). The record type thus defined consists of only 3 string fields, the read id tag, the sequence seq and the quality qua, omitting the optional plus control field. No explicit compression methods are selected, so the default ones (see: Default compression methods) will be applied when storing the data inside a container.
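The metalanguage source itself is not reproduced here; purely for illustration, a plain C++ struct with the same three fields would look like this:

    #include <string>

    // Illustration only: the C++ equivalent of the three-field FASTQ record.
    struct FastqRecord
    {
        std::string tag;   // read id line (the part after '@')
        std::string seq;   // nucleotide sequence
        std::string qua;   // quality scores
    };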

Translation
Once the record type definition in the CARGO meta-language has been saved as the file FastqRecord.cargo, it can be used as input for the translation step. Running the command cargo_translate -i FastqRecord.cargo will generate the files used in the remainder of this example, including the translated record type definition (FastqRecord.h and FastqRecord_Type.h), the parser skeleton (FastqRecord_Parser.h), the application template (FastqRecord_main.cpp) and the Makefile (FastqRecord_Makefile.mk).

Records parser
The generated parser file FastqRecord_Parser.h contains the skeleton of the record data type parser class; some of its functions need to be implemented by the user (for more information see: Record type parser). In the completed code, an empty SkipToEndOfHeader function is defined: the FASTQ format does not specify any file header, so the body of the function is intentionally left empty.
As a FASTQ record cannot be uniquely identified by analyzing a single line (it is defined by 4 consecutive lines), the function SkipToEndOfRecord needs to:
1. Skip until the end of the current text line, passing a newline separator \n
2. Peek the next symbol from the memory stream
3. Check whether the next line starts with the @ symbol and, if so, exit; otherwise:
4. Skip the next line and go to (2).
A sketch of this logic is given below.
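In the sketch that follows, the MemoryStream interface is a hypothetical stand-in for CARGO's actual stream API, kept minimal only to make the numbered steps concrete.

    // Hypothetical minimal stream interface, standing in for CARGO's own.
    struct MemoryStream
    {
        bool Eof() const;      // true when no more data is available
        char Peek() const;     // next symbol, without consuming it
        void SkipLine();       // consume everything up to and including '\n'
    };

    // Position the stream at the beginning of the next FASTQ record.
    void SkipToEndOfRecord(MemoryStream& stream_)
    {
        stream_.SkipLine();                    // 1. pass the '\n' separator
        while (!stream_.Eof()                  // 2. peek the next symbol
               && stream_.Peek() != '@')       // 3. record start found? then exit
        {
            stream_.SkipLine();                // 4. otherwise skip the line, go to 2
        }
    }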
Once the memory stream is positioned at the beginning of the next FASTQ record, reading and parsing it is straightforward; unparsing a FASTQ record and writing it back to a memory stream proceeds in a similar way.

Building
The generated Makefile FastqRecord_Makefile.mk contains the information required to build the FASTQ toolkit, taking the options specified by the user into account. Running GNU make in the example's directory as make -f FastqRecord_Makefile.mk will produce the executable cargo_fastqrecord_toolkit.

Running
The generated application uses the default command-line parameters to select the compressor (c) and decompressor (d) sub-applications (the details are described in Application templates).

SAM-STD
The SAM-STD example shows a simple proof of concept: with a few lines of code we create a compressor specialized for the SAM format. The compression ratio and speed achieved by this example outperform the widely used BAM format as implemented by SAMTools. In addition, with minor parameter tweaks of the record type definition one can achieve a compression ratio comparable to that of current state-of-the-art SAM compression tools.
Tip: Should one prefer to skip the following steps and build the SAM-STD example straight away, a Makefile is provided in the main directory together with a SAM record type definition in the CARGO metalanguage (SamRecord.cargo), a complete record parser (SamRecord_Parser.bak) and a record transformation function (SamRecord_Transform.bak).

General format description
SAM (Sequence Alignment/Map) is a tab-delimited text format used by a variety of bioinformatics tools to store sequence alignment information in a record-like way. The file consists of an optional header block (whose lines start with the @ symbol) followed by an arbitrary number of consecutive SAM records, each consisting of the fields described in table SAM format field description. In addition to the 11 standard SAM format fields, there are also optional fields defined as TAG:TYPE:VALUE, where TAG is a two-character string matching the expression /[A-Za-z][A-Za-z0-9]/; they are presented in table SAM format optional fields. The type names of the SAM record fields in the CARGO record definition correspond to those presented in table SAM format field description; the optional fields, however, are treated as one single long string. The specified compression methods (using the default block sizes) were selected through a set of experiments and provide a satisfactory compression ratio and performance.
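To make the field layout concrete, here is a minimal record line adapted from the canonical example in the SAM format specification; the 11 tab-separated mandatory fields are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL (columns are space-aligned here for readability, but separated by single tabs in a real file):

    r001  99  ref  7  30  8M2I4M1D3M  =  37  39  TTAGATAAAGGATACTG  *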

Translation
The record definition was saved under the file name SamRecord.cargo; translating it with: cargo_translate -t -i SamRecord.cargo creates the following files: the translated record definitions SamRecord.h and SamRecord_Type.h, the parser skeleton SamRecord_Parser.h, the application template SamRecord_main.cpp and the Makefile SamRecord_Makefile.mk.

Translated record type definition
The translated C++ record definition is provided in SamRecord.h; a simplified, hypothetical sketch of its content is shown below.
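The generated header is not reproduced in full here; the struct below is only a hypothetical approximation, with the 11 mandatory SAM fields plus the optional fields kept as one long string, as described above.

    #include <cstdint>
    #include <string>

    // Hypothetical approximation of the translated SamRecord definition.
    struct SamRecord
    {
        std::string qname;   // query template name
        uint32_t    flag;    // bitwise flag
        std::string rname;   // reference sequence name
        uint32_t    pos;     // 1-based leftmost mapping position
        uint8_t     mapq;    // mapping quality
        std::string cigar;   // CIGAR string
        std::string rnext;   // reference name of the mate/next read
        uint32_t    pnext;   // position of the mate/next read
        int32_t     tlen;    // observed template length
        std::string seq;     // segment sequence
        std::string qual;    // ASCII of Phred-scaled base qualities
        std::string opt;     // optional fields, kept as one single long string
    };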

The corresponding record type definition in the TypeAPI is skipped here due to its length and for clarity; it can be found in the generated SamRecord_Type.h file in the example subdirectory cargo/examples/sam/sam-std.

Records parser
The generated parser file SamRecord_Parser.h contains the skeleton of the user record data type parser class, whose functions are to be implemented by the user (for more see: Record type parser). Since a SAM file can contain an optional header, the function SkipToEndOfHeader needs to be defined; it skips a block of text in which every line starts with the @ symbol and finishes with a newline \n character.
Skipping ahead in the memory stream until the end of the currently parsed SAM record is trivial, as it only requires skipping until the end of the currently read text line.
The actual parsing of the next SAM record is then straightforward, reading (or writing) the consecutive fields in the order given in the SAM format description.

Record transform
In this example, the transformation function implements the Illumina quality score reduction scheme, based on the published Illumina Q-score mapping. To implement the quality transformation, a lookup table of quality values is used in order to avoid multiple comparisons against quality value ranges (a simplified version using branches is described in the FASTQ example subchapter). A sketch of the lookup-table approach is shown below.
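The generated SamRecord_Transform class is not reproduced here; the sketch below only illustrates the lookup-table idea, with bin values following Illumina's published 8-level Q-score reduction scheme and qualities assumed to be Phred+33 encoded in the range Q0-Q63.

    #include <cstdint>
    #include <string>

    // 8-level Illumina quality binning as a 64-entry lookup table.
    static const uint8_t QualityBin[64] = {
         0,  0,  6,  6,  6,  6,  6,  6,  6,  6,   // Q0-Q1 -> 0, Q2-Q9 -> 6
        15, 15, 15, 15, 15, 15, 15, 15, 15, 15,   // Q10-Q19 -> 15
        22, 22, 22, 22, 22,                       // Q20-Q24 -> 22
        27, 27, 27, 27, 27,                       // Q25-Q29 -> 27
        33, 33, 33, 33, 33,                       // Q30-Q34 -> 33
        37, 37, 37, 37, 37,                       // Q35-Q39 -> 37
        40, 40, 40, 40, 40, 40, 40, 40, 40, 40,   // Q40 and above -> 40
        40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
        40, 40, 40, 40
    };

    // Bin every quality symbol in place, avoiding per-range comparisons.
    inline void ReduceQualities(std::string& qua)
    {
        for (std::string::size_type i = 0; i < qua.size(); ++i)
            qua[i] = char(QualityBin[uint8_t(qua[i]) - 33] + 33);
    }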

Building
The generated build script SamRecord_Makefile.mk contains the recipes required to build the SAM toolkit with the options specified by the user. Running GNU make from the example's directory as: make -f SamRecord_Makefile.mk will produce the executable cargo_samrecord_toolkit.

SAM-EXT
The SAM-EXT example is an extended version of the previous SAM-STD example, which implements a tokenization of the SAM optional fields in addition to the Illumina quality score reduction transformation.
Tip: Should one prefer to skip the following steps and build the SAM-EXT example straight away, a Makefile is provided in the main directory together with a SAM record type definition in the CARGO metalanguage (SamRecord.cargo), a complete record parser (SamRecord_Parser.bak) and a record transformation function (SamRecord_Transform.bak).

Record type definition
The extended version of the SAM record in the CARGO metalanguage differs only in the representation of the optional fields. The types of the SAM record fields correspond to their definition presented in table SAM format field description, with the optional fields following the simplified definition presented in SAM format optional fields. In this example, the optional fields are defined as an array of tokens of the OptionalField type, where each token contains a tag identifying the data together with its value of the OptionalValue union type; a hypothetical C++ rendering of these types is sketched below.
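The metalanguage definition itself is not reproduced here (see SamRecord.cargo); as an illustration only, a plain C++ rendering of the token types could look as follows, with the value member types following the int32/char/string set described later in the SAM-REF example.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustration only: a tagged union for an optional field's value.
    struct OptionalValue
    {
        enum Kind { Int32, Char, String } kind;  // selects the active member
        int32_t     i32;   // e.g. NM:i:2
        char        ch;    // e.g. XT:A:U
        std::string str;   // e.g. MD:Z:36
    };

    // One token of the tokenized optional-fields column.
    struct OptionalField
    {
        char          tag[2];  // two-character TAG, e.g. {'N','M'}
        OptionalValue value;   // the token's value
    };

    // A SAM record's optional column becomes an array of such tokens.
    typedef std::vector<OptionalField> OptionalFields;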
The compression methods and block sizes for the fields were selected through a set of experiments and provide a satisfactory compression ratio and performance.

Translation
The translation process follows a path similar to the one described in the previous SAM-STD example.

Translated record type definition
The record definition translated into C++ is provided in SamRecord.h. The corresponding record type definition in the TypeAPI is skipped here due to its length and for clarity; it can be found in the generated SamRecord_Type.h in the example's subdirectory cargo/examples/sam/sam-ext.

Records parser
The generated parser file SamRecord_Parser.h contains the skeleton of the user record data type parser class, whose functions are to be implemented by the user (for more see: Record type parser). As in the previous example, the SAM file can contain an optional header, so the function SkipToEndOfHeader needs to be defined; it skips a block of text in which every line starts with the @ symbol and finishes with a newline \n character.
Skipping ahead in the memory stream until the end of the currently parsed SAM record is trivial, as it only requires skipping until the end of the currently read text line.
The actual parsing of the next SAM record is then straightforward, reading (or writing) the consecutive fields in the order given in the SAM format description. Afterwards, the tokenization of the optional fields takes place, i.e. the splitting (or joining) of the sub-fields, parsing and saving the individual sub-fields into an optString array.

Record transform
As the transformation function, the Illumina quality score reduction was implemented, based on the same Q-score mapping and using the same code as presented in the previous example's Record transform subchapter.

Building
The generated Makefile SamRecord_Makefile.mk contains the recipes required to build the SAM toolkit with the options specified by the user. Running GNU make from the example's directory as: make -f SamRecord_Makefile.mk will produce the executable cargo_samrecord_toolkit.

Running
• Using the compiled SAM record type toolkit cargo_samrecord_toolkit, store the C_SAM306.sam input SAM file in the C_SAM container under the dataset name C_SAM306:
cargo_samrecord_toolkit c -c C_SAM -n C_SAM306 -i C_SAM306.sam
• Retrieve the SAM dataset C_SAM306 from the C_SAM container using 8 processing threads, saving the output as C_SAM306.out.sam:
cargo_samrecord_toolkit d -c C_SAM -n C_SAM306 -o C_SAM306.out.sam -t 8
• Store the C_SAM306.sam input SAM file in the C_SAM container under the C_SAM306 dataset name, additionally applying the records transformation, with a 256 MiB input block buffer and 8 processing threads:
cargo_samrecord_toolkit c -c C_SAM -n C_SAM306 -i C_SAM306.sam -a -t 8 -b 256
• Decompress C_SAM306.bam using SAMTools and apply the user transformation (in this case, Illumina quality score reduction), saving the output to the C_SAM306.out file:
samtools view -h C_SAM306.bam | cargo_samrecord_toolkit t -o C_SAM306.out -a
• Print the C_SAM306 dataset info from the C_SAM container:
cargo_tool --print-dataset --dataset-name=C_SAM306 --container-file=C_SAM
• Remove the C_SAM306 dataset from the C_SAM container:
cargo_tool --remove-dataset --dataset-name=C_SAM306 --container-file=C_SAM

SAM-REF
The SAM-REF example is a more advanced version of SAM-EXT, which implements a set of additional features providing a much better compression ratio and performance. Those include:
• an extended tokenization of the SAM optional fields
• an internal alignment format defined as a special combination of the SEQ and CIGAR fields and the MD optional field
• reference-based sequence compression
• transformations of SAM numerical fields, including TLEN and PNEXT
• optionally, the Illumina quality score reduction transformation
• optionally, BAM-compatible range queries by chromosome and position.

As this example uses additional functionality outside the standard CARGO library, i.e. reference-based compression using BFF (Binary Fasta File, see: BFF file format) files, it was necessary to implement our own modified versions of the sub-application routines, in order to handle the more advanced application initialization (i.e. loading the compressed reference file) and an extended number of parameters. Thus the application template files generated by default are not used here.

Record type definition
Using the CARGO metalanguage, the optional-field information is represented by defining 2 compound types: OptionalField and OptionalValue. The OptionalField structure holds the field's tag together with its corresponding value, which is of the OptionalValue type. The OptionalValue type is a tagged union which can hold values of either the int32, char or string types.
To aid the reference-based compression, the reference alignment information is represented by an MdOperation structure holding the operation identifier opId and the mismatch information, which is of the MdOperationValue type. The MdOperationValue type, similarly to OptionalValue, defines a tagged union which can hold values of 2 different types: int32 and string. A hypothetical C++ rendering of these MD types is sketched below.
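Again as an illustration only, a plain C++ rendering of these two types could look as follows; the payload comments are assumptions about what the fields hold.

    #include <cstdint>
    #include <string>

    // Illustration only: tagged union holding the mismatch information.
    struct MdOperationValue
    {
        enum Kind { Int32, String } kind;  // selects the active member
        int32_t     count;   // e.g. a run length of matching bases
        std::string bases;   // e.g. mismatching or deleted reference bases
    };

    // One parsed alignment operation of the internal format.
    struct MdOperation
    {
        uint8_t          opId;   // operation identifier
        MdOperationValue value;  // operation payload
    };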
The actual SAM record type definition in the CARGO metalanguage is for the most part the same as in the previous SAM-STD example, differing only in the representation of the optional fields, which changes from a single string type to two complex array types.

BFF file format
To handle the reference-based sequence compression and to keep the sequences in memory in a compact way, a simple BFF (Binary Fasta File) format was developed, which encodes the reference sequences in binary form and enables random access for sequence retrieval. The source code of the BFF file methods, the conversion to/from the FASTA format and the methods to access the sequences are available in cargo/examples/sam/common/SequenceRetriever.h.
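The actual BFF layout is defined by the sources referenced above; the toy sketch below merely illustrates why a binary packing makes random access cheap, packing ACGT at 2 bits per base (a real implementation must also handle N and the other IUPAC symbols).

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Toy 2-bit packing: 4 bases per byte.
    static inline uint8_t BaseToCode(char b)
    {
        switch (b) { case 'C': return 1; case 'G': return 2; case 'T': return 3; }
        return 0;  // 'A' (a real encoder would handle N / IUPAC symbols too)
    }
    static const char CodeToBase[4] = { 'A', 'C', 'G', 'T' };

    std::vector<uint8_t> Pack(const std::string& seq)
    {
        std::vector<uint8_t> buf((seq.size() + 3) / 4, 0);
        for (size_t i = 0; i < seq.size(); ++i)
            buf[i / 4] |= uint8_t(BaseToCode(seq[i]) << ((i % 4) * 2));
        return buf;
    }

    // O(1) random access: read base i without unpacking anything else.
    char GetBase(const std::vector<uint8_t>& buf, size_t i)
    {
        return CodeToBase[(buf[i / 4] >> ((i % 4) * 2)) & 3];
    }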

Building
To build the SAM-REF example, a Makefile is available in the example directory; invoking GNU make as: make will produce three executables, cargo_samrecord_toolkit-ref, cargo_samrecord_toolkit-ref-q8 and cargo_samrecord_toolkit-ref-q8-max, where the -q8 variants additionally implement the Illumina quality score reduction when performing the records transformations.

Command line options
The SAM-REF application example is based on the standard CARGO application templates (see: Application templates), but with an additional set of input parameters used for handling the reference file (-f) and BAM-compatible key generation (-z). When run directly from the command line, it displays:
CARGO SAM toolkit usage: ./cargo_samrecord_toolkit-ref <c|d|t|e|r> [options]
where the selected subprogram can be either a c-ompressor, d-ecompressor, t-ransformator, e-xtractor or r-eference BFF file generator.

Compressor
The SAM-REF compressor subroutine processes the input SAM data and stores it into a container; when launched, it displays the following message, where the available options are:
-c container, --container=container    container file name prefix (e.g. <container_file>.cargo-)