Zekun Yin, Hao Zhang, Meiyang Liu, Wen Zhang, Honglei Song, Haidong Lan, Yanjie Wei, Beifang Niu, Bertil Schmidt, Weiguo Liu, RabbitQC: high-speed scalable quality control for sequencing data, Bioinformatics, Volume 37, Issue 4, February 2021, Pages 573–574, https://doi.org/10.1093/bioinformatics/btaa719
Abstract
Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task cannot fully exploit the capabilities of computing platforms, which leads to slow runtimes.
We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups between one and two orders-of-magnitude compared to other state-of-the-art tools.
C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC.
Supplementary data are available at Bioinformatics online.
1 Introduction
Many applications of high-throughput sequencing technologies, such as somatic variant discovery or single-cell RNA-seq, are sensitive to artifacts of the produced read datasets. Thus, typical workflows apply pre-processing methods to FASTQ input files in order to improve data quality (such as trimming, adapter removal, etc.) before any downstream analysis is conducted. Consequently, various tools for corresponding tasks have been designed including FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), Cutadapt (Martin, 2011) and Trimmomatic (Bolger et al., 2014). More recently, integrated pre-processing methods, such as AfterQC (Chen et al., 2017) and fastp (Chen et al., 2018), have been introduced, which avoid the repetitive loading of sequencing data. However, none of them is able to take full advantage of modern CPUs, which leads to high execution times for large-scale datasets. Furthermore, most quality control tools are suboptimal when applied to long-read sequencing data from Oxford Nanopore Technologies or Pacific Biosciences (PacBio). NanoQC (De Coster et al., 2018) has recently been introduced to bridge this gap but also suffers from severe performance issues. This establishes the need for a high-speed integrated quality control software for processing sequencing data produced by various technologies.
We address this need by presenting RabbitQC. RabbitQC is at least one order-of-magnitude faster than other state-of-the-art tools and can process Illumina data with a throughput of over 3.7 (2.7) million reads per second in single-end (SE) [paired-end (PE)] format. Furthermore, it successfully processes 203 GB of PacBio long-read data in only 6 min while NanoQC is not able to complete this task within 2 days.
2 Methods
We exploit the capacity of modern CPUs by proposing a novel I/O-efficient framework based on multi-threading that embeds various operations for quality control. Our design is based on a producer–consumer pattern. A producer thread loads chunks of data in FASTQ format into a data pool. Several consumer threads concurrently access the data pool and format consecutive chunks into individual sequence read entries. Each consumer thread then calls embedded function modules to complete the requested per-read processing tasks and writes the results to an output buffer.
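The producer-consumer design described above can be illustrated with a minimal sketch. It is not the RabbitQC implementation: the `ChunkPool` class, the pool capacity of 4, and the `countRecords` stand-in for per-read processing are all hypothetical, and the sketch assumes four-line FASTQ records.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded data pool shared by one producer and several consumers.
class ChunkPool {
public:
    explicit ChunkPool(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Producer side: blocks while the pool is full.
    void push(std::string chunk) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push(std::move(chunk));
        notEmpty_.notify_one();
    }

    // Producer side: signals that no further chunks will arrive.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        notEmpty_.notify_all();
    }

    // Consumer side: returns false once the pool is closed and drained.
    bool pop(std::string& chunk) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;
        chunk = std::move(queue_.front());
        queue_.pop();
        notFull_.notify_one();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable notEmpty_;
    std::condition_variable notFull_;
    std::queue<std::string> queue_;
    std::size_t capacity_;
    bool closed_ = false;
};

// Stand-in for per-read processing: a FASTQ record is assumed to span four lines.
std::size_t countRecords(const std::string& chunk) {
    std::size_t lines = 0;
    for (char c : chunk) {
        if (c == '\n') ++lines;
    }
    return lines / 4;
}

// The calling thread acts as the single producer; numConsumers threads drain the pool.
std::size_t processChunks(const std::vector<std::string>& chunks, unsigned numConsumers) {
    assert(numConsumers > 0);
    ChunkPool pool(4);
    std::vector<std::size_t> perThread(numConsumers, 0);  // one private counter per consumer
    std::vector<std::thread> consumers;
    for (unsigned i = 0; i < numConsumers; ++i) {
        consumers.emplace_back([&pool, &perThread, i] {
            std::string chunk;
            while (pool.pop(chunk)) perThread[i] += countRecords(chunk);
        });
    }
    for (const std::string& c : chunks) pool.push(c);
    pool.close();
    for (std::thread& t : consumers) t.join();

    std::size_t total = 0;
    for (std::size_t n : perThread) total += n;
    return total;
}
```

Each consumer accumulates into its own counter, so the only shared state is the pool itself. A real tool would replace `countRecords` with the requested filtering and trimming modules and would write results to an output buffer.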
Our framework enables thread scalability by means of an efficient reading and formatting method as follows. The producer thread reads a chunk of raw sequence data from a FASTQ file and calls a lightweight formatting function that cuts off the incomplete record at the tail of the chunk and leaves it for the next chunk (see Supplementary Figs S2 and S3 and Section S2 for more details). This ‘lightweight’ formatting guarantees that each chunk in the data pool contains a batch of complete FASTQ records and thus is more efficient compared to full formatting. Each consumer thread then fetches a chunk from the data pool and further formats each read entry into a per-read data structure used by the function modules. Multiple consumer threads work concurrently, thereby eliminating the performance bottleneck for formatting. The single producer thread handles efficient disk I/O. Each consumer launches its own processing thread. Thus, we can scale the number of consumer threads until all available CPU cores are occupied or disk I/O becomes the bottleneck. The processing results are organized in an output buffer, which is then written to (a number of) output file(s). Our framework is mainly optimized for plain FASTQ files but also supports gzip-compressed files; see Supplementary Section S2.3 for more details.
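As a rough illustration of the lightweight formatting step, the sketch below cuts an incomplete tail record from a raw buffer and carries it into the next chunk. It is not the method from the Supplementary material: the function names `cutIncompleteTail` and `chunkRecords` are hypothetical, and the sketch assumes that each buffer starts at a record boundary and that every record occupies exactly four lines.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Splits `buffer` into (complete records, incomplete tail).
// Assumption: the buffer starts at a record boundary and each FASTQ record
// spans exactly four lines, so every fourth newline closes a record.
std::pair<std::string, std::string> cutIncompleteTail(const std::string& buffer) {
    std::size_t newlines = 0;
    std::size_t lastRecordEnd = 0;  // one past the newline closing the last complete record
    for (std::size_t i = 0; i < buffer.size(); ++i) {
        if (buffer[i] == '\n') {
            ++newlines;
            if (newlines % 4 == 0) lastRecordEnd = i + 1;
        }
    }
    return std::make_pair(buffer.substr(0, lastRecordEnd), buffer.substr(lastRecordEnd));
}

// Simulates a producer that reads `readSize` bytes at a time and prepends the
// tail left over from the previous read to the next buffer.
std::vector<std::string> chunkRecords(const std::string& data, std::size_t readSize) {
    assert(readSize > 0);
    std::vector<std::string> chunks;
    std::string carry;
    for (std::size_t pos = 0; pos < data.size(); pos += readSize) {
        std::string buffer = carry + data.substr(pos, readSize);
        std::pair<std::string, std::string> parts = cutIncompleteTail(buffer);
        if (!parts.first.empty()) chunks.push_back(std::move(parts.first));
        carry = std::move(parts.second);
    }
    // A non-empty carry at this point is a truncated final record; this sketch drops it.
    return chunks;
}
```

Because only newlines are counted, the producer never parses individual fields; the per-read parsing is left to the consumers, which is the division of labour the paragraph above describes.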
RabbitQC integrates a number of quality control operations within a single application. For Illumina data, we adopt the algorithms for sliding window quality pruning, PolyG/PolyX tail trimming, UMI pre-processing, adapter trimming, duplication evaluation and over-represented sequence analysis as proposed by Chen et al. (2018) but offer more efficient (parallelized) implementations. For MinION and PacBio data, we have re-implemented all the quality control functions (summary of read length distribution, ATCG content and quality score summary) in NanoQC (De Coster et al., 2018) using C++. More details of our function module implementations are described in Supplementary Section S3.
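To make the idea of sliding window quality pruning concrete, here is a simplified sketch. It does not reproduce the fastp or RabbitQC code: the function name, the Phred+33 encoding, and the rule of cutting at the start of the first failing window are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the number of leading bases to keep.
// A window of `window` bases slides from the 5' end; the read is cut at the
// start of the first window whose mean Phred score drops below `minMeanQ`.
// Assumption: qualities are Phred+33 encoded. Reads shorter than the window
// are returned unchanged in this sketch.
std::size_t slidingWindowKeepLength(const std::string& quality,
                                    std::size_t window,
                                    double minMeanQ) {
    assert(window > 0);
    if (quality.size() < window) return quality.size();

    long sum = 0;  // running sum of Phred scores inside the current window
    for (std::size_t i = 0; i < window; ++i) sum += quality[i] - 33;

    for (std::size_t start = 0;; ++start) {
        if (static_cast<double>(sum) / static_cast<double>(window) < minMeanQ) {
            return start;
        }
        if (start + window >= quality.size()) break;  // no further full window
        // Slide by one base: add the entering score, drop the leaving one.
        sum += (quality[start + window] - 33) - (quality[start] - 33);
    }
    return quality.size();
}
```

The running sum keeps the cost linear in the read length regardless of the window size. Since each read is processed independently, a function of this kind can be called from any consumer thread without synchronization.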
3 Results
We have implemented RabbitQC in C++, compiled it with GCC 5.4.0, and compared its performance to fastp (v.0.19.5), AfterQC (v.0.9.7), Trimmomatic (v.0.38), FASTQC (v.0.11.8) and SOAPnuke (v.1.5.6) for short-read data and to NanoQC (v.0.9.1) and FASTQC (v.0.11.8) for long-read data. All experiments have been conducted on a Linux server with a 20-core Xeon Gold 6148 CPU, 96 GB of DDR4 RAM, a 128 GB NVMe SSD and a RAID HDD array (MegaRAID SAS 9271-8i controller, RAID 5) running Ubuntu 16.04. In our experiments, we use two Illumina datasets consisting of 18.6 M reads of length 100 in SE format (SRR2496699_1) and 25.2 M reads of length 100 in PE format (SRR2496709), a PacBio dataset consisting of 10.7 M reads of average length 10 106 (SRX2645672), and a MinION dataset consisting of 206.8 K reads of average length 9357 (SRR9588287). All input files except for the large 203 GB PacBio dataset are loaded from SSD.
Tools were executed with default parameters and a varying number of threads (if multi-threading is supported). When dealing with Illumina data, RabbitQC, fastp and AfterQC perform the same series of operations including filtering, trimming and quality control, while Trimmomatic, FASTQC and SOAPnuke only perform a subset of these operations. For MinION and PacBio data, RabbitQC and NanoQC perform the same quality control functions. The per-read processing results of RabbitQC are identical to those of fastp for Illumina data and to those of NanoQC for PacBio/MinION data.
Table 1 shows the measured runtimes and execution speeds. For tools supporting multi-threading, we report the best runtime obtained with between 1 and 20 threads. The results show that RabbitQC is the fastest tool for each tested dataset, achieving a throughput of over 3.7 million reads per second for Illumina SE data and over 2.7 million reads per second for Illumina PE data. This corresponds to speedups between one and two orders-of-magnitude compared to all other tools. Furthermore, RabbitQC achieves a speedup of at least 450-fold compared with NanoQC for processing long-read data.
Tool | Dataset | Time (s) | Speed (reads/s) | Speedup |
---|---|---|---|---|
RabbitQC | Illumina SE (SRR2496699_1) | **5.0** | **3.725 M** | - |
fastp | Illumina SE (SRR2496699_1) | 83.3 | 0.223 M | 16.7 |
AfterQC | Illumina SE (SRR2496699_1) | 618.2 | 0.030 M | 123.6 |
Trimmomatic | Illumina SE (SRR2496699_1) | 248.8 | 0.074 M | 49.8 |
FASTQC | Illumina SE (SRR2496699_1) | 59.3 | 0.314 M | 11.9 |
SOAPnuke | Illumina SE (SRR2496699_1) | 330.7 | 0.056 M | 66.1 |
RabbitQC | Illumina PE (SRR2496709) | **9.1** | **2.770 M** | - |
fastp | Illumina PE (SRR2496709) | 116.2 | 0.217 M | 12.8 |
AfterQC | Illumina PE (SRR2496709) | 1757.8 | 0.014 M | 193.2 |
Trimmomatic | Illumina PE (SRR2496709) | 240.1 | 0.105 M | 26.4 |
SOAPnuke | Illumina PE (SRR2496709) | 187.8 | 0.134 M | 20.6 |
RabbitQC | PacBio (SRX2645672) | **360.0** | **30.0 K** | - |
NanoQC | PacBio (SRX2645672) | >2 days | <62 | >480 |
FASTQC | PacBio (SRX2645672) | 8576.8 | 1.2 K | 23.8 |
RabbitQC | MinION (SRR9588287) | **2.0** | **103.0 K** | - |
NanoQC | MinION (SRR9588287) | 990.0 | 0.2 K | 450.0 |
FASTQC | MinION (SRR9588287) | 230.1 | 0.9 K | 115.0 |
Bold values indicate the best performance achieved for each dataset.
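The Speed and Speedup columns follow from the read counts and runtimes by simple division. The helper functions below are a hypothetical sketch of that arithmetic, not part of RabbitQC.

```cpp
#include <cassert>

// Throughput in reads per second.
double readsPerSecond(double reads, double seconds) {
    assert(seconds > 0.0);
    return reads / seconds;
}

// How many times faster the reference run is than the baseline run.
double speedup(double baselineSeconds, double referenceSeconds) {
    assert(referenceSeconds > 0.0);
    return baselineSeconds / referenceSeconds;
}
```

For the Illumina SE dataset, `speedup(83.3, 5.0)` gives about 16.7 for fastp, which matches the table. `readsPerSecond(18.6e6, 5.0)` gives about 3.72 million reads per second; the small difference from the tabulated 3.725 M presumably comes from rounding the read count to 18.6 M.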
In order to evaluate the efficiency of our I/O framework, we have performed a thread scalability analysis. Supplementary Figure S1 shows the speedups for each tested multi-threaded tool when increasing the number of threads for the Illumina SE dataset. RabbitQC is the only tool achieving near-linear scalability with a speedup of over 13 for 20 threads (parallel efficiency of over 65%). In comparison, none of the other tested tools can improve its performance beyond 4 threads corresponding to a parallel efficiency of ≤10% for 20 threads.
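Parallel efficiency is the speedup divided by the number of threads. A one-function sketch (the function name is hypothetical) shows how the 65% figure follows from a speedup of 13 on 20 threads.

```cpp
#include <cassert>

// Parallel efficiency: the fraction of ideal linear speedup actually achieved.
double parallelEfficiency(double speedup, unsigned threads) {
    assert(threads > 0);
    return speedup / static_cast<double>(threads);
}
```

Here `parallelEfficiency(13.0, 20)` evaluates to 0.65, so any speedup above 13 on 20 threads corresponds to an efficiency above 65%.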
In summary, RabbitQC is an extremely fast integrated quality control tool that can exploit the capabilities of modern CPUs, thereby accelerating the downstream analysis of sequencing datasets produced by a variety of technologies.
Acknowledgement
This work is partially supported by NSFC Grants 61972231 and U1806205; the Key Project of Joint Fund of Shandong Province (Grant No. ZR2019LZH007); the Shenzhen Basic Research Fund (Grant No. JCYJ20180507182818013); the PPP project from CSC and DAAD; the program for outstanding PhD candidates of Shandong University; Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao).
Conflict of Interest: none declared.