BioSeqZip: a collapser of NGS redundant reads for the optimization of sequence analysis

Abstract Motivation High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the 2-fold advantage of reducing file size and speeding-up the alignment, avoiding to map the same sequence multiple times. Method BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets even on computers with medium computational capabilities. On request, it can even re-expand the compacted files to their original state. Results Our extensive experiments on RNA-Seq data show that BioSeqZip considerably brings down the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion of reads into 963 million of unique tags reducing the size of sequence files up to 70% and speeding-up the alignment by 50% at least. Availability and implementation BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. Supplementary information Supplementary data are available at Bioinformatics online.

character represents the lowest quality and the '∼' character the highest, respectively. In a FASTQ file each sequence is represented on four lines: i) the first line start with a '@' character and is followed by a sequence identifier and an optional description (like in a FASTA title line). ii) the second line contains the sequence of letters of a read. iii) the third line begins with a '+' character; iv) the fourth line encodes the quality values of the sequence, one per base. Hence, it must contain the same number of symbols as line two.
• Tag and Tagq file formats are customized output files of BioSeqZip_Collapser, and have a column-based structure that is usually adopted by the smallRNA-Seq analysis tools. Tagq files have three columns: i) in the first column, the unique sequence (called tag) obtained after the read collapsing. ii) in the second column, the tag quality, obtained as the average quality of the collapsed sequences. iii) in the third column, the number of identical sequences in the input file that were collapsed to generate the tag. Tag files will have the same structure but with two columns only, as the quality information is not provided in this format.

In/Out File Formats of BioSeqZip_Expander
BioSeqZip_Expander can work on SAM and BAM file formats detailed in the following.
• SAM format is used for storing sequence data, either aligned or unaligned, in a human readable format [?]. A header section provides a list of reference sequences as well as other supplementary information provided by the alignment tool. In the second part, the file lists all the reads and the mapping positions on the reference in a tabular format. Each line includes eleven mandatory fields, the most important ones being the query name, the reference name, the position in the reference, and the read sequence). • BAM format provides a binary version of the same data available in the SAM format. Clear advantage of this format is that it can be efficiently compressed to reduce storage occupancy.

Supplementary Availability
For the BioSeqZip project, we provide the following three repositories: • https://github.com/bioinformatics-polito/BioSeqZip.git containing the BioSeqZipsource code implementing the collapser and expander functionalities. • https://github.com/bioinformatics-polito/BioSeqZip-BWA.git containing a modified version of BWA capable of using as input the read files generated by BioSeqZip for creating the full mapping files. This tool version does not need an expansion procedure of the output mapping files. • https://github.com/bioinformatics-polito/BioSeqZip-Yara.git containing a modified version of Yara capable of using as input the read files generated by BioSeqZip for creating the full mapping files. This tool version does not need an expansion procedure of the output mapping files.

BioSeqZip on DNA-Seq data
We tested the collapsing procedure provided by BioSeqZip on a 74x coverage DNA-seq paired-end sample (SRR7890958) with the following Library specs: • Name: FFG_IL_N_6h • Instrument: Illumina HiSeq 4000 • Strategy: WGS • Source: GENOMIC • Selection: RANDOM • Layout: PAIRED • Coverage: 74x • Total reads after trimming: 1,523,752,982 First of all, we tested single-end collapsing running BioSeqZip on the first mate of the sample, experiencing an 18.38% gain in terms of the number of reads in the file which corresponds to 19.65% gain in disk space. Then, we run the BioSeqZip_collapser on the paired-end files experiencing a read compression of 8.70% and a disk space reduction of 10.01%.
The reason of these observations could be that DNA-seq samples, given the origin of the sequenced data, exposes a lower redundancy, leading to lower compression, making the collapsing procedure less effective reducing the gain the user can expect to have in post-collapsing pipelines.

BioSeqZip -RAM/Runtime trade-off
In Table 1 we explore the trade-off between the maximum amount of RAM BioSeqZip is allowed to use and the number of seconds required for running the collapse procedure. The analysis highlights that the runtime required for collapsing a sample slightly increases as the size of the internal buffer, used for storing the sequences, increase. This is due to the algorithmic structure of BioSeqZip: it is composed of an I/O part, which scales linearly with the amount of data to be read or written, and a collapsing part whose core is the buffer sorting engine, which sorts the sequences to be collapsed in alphabetical order. The higher runtime observed when collapsing a larger amount of data is related to the asymptotic complexity of the sorting algorithm being O(n * log(n)). Basically, given X sequences read from the disk in time R and sorted in time S, performing I/O on 2 * X data requires 2 * R, but sorting them requires more than 2 * S.
This trend is broken when the collapser is allowed to use an amount of RAM bigger than the size of the samples (all three samples have approximately 17-18 GB size). In this case, the higher runtime required by the sorting algorithm is balanced by the fact that BioSeqZip does not need to create temporary files on the disk for performing sequences merging, allowing the collapser to used much less I/O operations, which are the real runtime bottleneck in the typical scenario where BioSeqZip operates.

BioSeqZip+STAR vs Rail-RNA
In table 2 we reports the output of the comparison between samples alignment with Rail-RNA and with the combination of BioSeqZip and STAR. Rail-RNA was run with the default local configuration using 8 running threads, in order to be fair with the previous analysis carried on with STAR. Also, Rail-RNA was not asked to report deliverables different from its default one. The exact command line used is: r a i l −r n a go l o c a l −m < m a n i f e s t p a t h > \ −x < b o w t i e i n d e x p a t h > < b o w t i e 2 i n d e x p a t h > \ −p 8 −o < o u t p u t d i r e c t o r y p a t h > As highlighted by the result table the runtime of the pipeline featuring BioSeqZip and STAR is lower than the one of Rail-RNA. However, such comparison is not completely fair: first of all, the information the two alignment tools provide are not the same, with Rail-RNA providing more data than STAR. Then, it should be noticed that Rail-RNA is a cloud-ready alignment tool whose design and implementation may have led to solutions which do not perform best on local and resource-constrained machines, which are the targets BioSeqZip was mainly designed for.  Table 4: Runtime report of three experiments: (BWA only) samples alignment using the BWA tool as it is, (BWA + BioSeqZip separate expander) collapsed samples alignment using BWA, considering the runtime of the BioSeqZipexpander tool, (BWA + BioSeqZip embedded expander) collapsed samples alignment using BioSeqZip-BWA, i.e. BWA with the expanding functionalities integrated (BioSeqZip-BWA is available at https://github.com/bioinformaticspolito/BioSeqZip-BWA.git Table 5: Runtime report of three experiments: (Yara only) samples alignment using the Yara tool as it is, (Yara + BioSeqZip separate expander) collapsed samples alignment using Yara, considering the runtime of the BioSeqZipexpander tool, (Yara + BioSeqZip embedded expander) collapsed samples alignment using BioSeqZip-Yara, i.e. Yara with the expanding functionalities integrated (BioSeqZip-Yara is available at https://github.com/bioinformaticspolito/BioSeqZip-Yara.git