Nextflow vs. plain bash: different approaches to the parallelization of SNP calling from the whole genome sequence data

Abstract This study compared computational approaches to parallelization of an SNP calling workflow. The data comprised DNA from five Holstein-Friesian cows sequenced with the Illumina platform. The pipeline consisted of quality control, alignment to the reference genome, post-alignment, and SNP calling. Three approaches to parallelization were compared: (i) a plain Bash script in which a pipeline for each cow was executed as separate processes invoked at the same time, (ii) a Bash script wrapped in a single Nextflow process and (iii) a Nextflow script with each component of the pipeline defined as a separate process. The results demonstrated that on average, the multi-process Nextflow script performed 15–27% faster depending on the number of assigned threads, with the biggest execution time advantage over the plain Bash approach observed with 10 threads. In terms of RAM usage, the most substantial variation was observed for the multi-process Nextflow, for which it increased with the number of assigned threads, while RAM consumption of the other setups did not depend much on the number of threads assigned for computations. Due to intermediate and log files generated, disk usage was markedly higher for the multi-process Nextflow than for the plain Bash and for the single-process Nextflow.


Introduction
In animal genomics, the rapid development of highthroughput technologies over the past few decades has seen a considerable increase in the availability of data ( 1 ).Among them, the most common data structure is the whole genome sequence (WGS) which is now available to thousands of individuals.For example, the 1000 Bull Genomes Project database for cattle ( 2 ) currently harbours polymorphic variants identi-fied from WGSs of over 5000 individuals.Not only the efficient storage of WGS data and stable processing pipelines are essential for the analysis, but also the whole pipeline from the raw fastq files to Variant Call Format (VCF) files containing identified variants has to be completed in a time-effective manner using systems that are executed in a parallel mode and are robust towards the fluctuation of computational resources available at run-time.The so-called rWGS (rapid WGS) is an emerging topic in the analysis of bio-data (see e.g. ( 3)), including cattle, for which fast variant identification may have implications on fast selection decisions.Therefore, effective, efficient, and robust computing approaches are becoming increasingly important ( 4 ,5 ) and so is the pipeline management system software.
There exists a plethora of workflow management systems, ranging from open-source solutions e.g.Jenkins ( www.jenkins.io ) or Snakemake ( 6 ) to commercial software such as the Automic Automation ( www.broadcom.com).Moreover, platform-dedicated systems are also available, like the AWS Code Pipeline for users of Amazon Web Services.The Nextflow pipeline management system ( 7 ) has recently gained popularity, mainly within the field of genomics and, more broadly, bioinformatics to a large extent due to its simplicity of implementation, good tutorials, community support, and most importantly due to the availability of built-in directives dedicated to standard processing of WGS pipelines that are missing in general-purpose pipeline management software.One of the important aspects of using the Nextflow system is the automatic parallelization and scaling of data processing from local computers to clusters, both within and across individual WGS samples, which accelerates the execution of computationally intensive tasks.It also enables the use of multiple scripting languages, including Bash, R and Python, which are very popular within the bioinformatics community.
Our study aimed to compare the computational efficiency and hardware requirements of the native Bash implementation with the implementation through the pipeline management system.The Nextflow DSL2 (domain-specific language) pipeline management system and the context of detecting single nucleotide polymorphisms (SNPs) in the WGS data were chosen as an example management system and a pipeline, respectively.Since all the elements of the pipeline are required for obtaining the final outcome-the VCF of called SNP genotypes.The practical aspect of the underlying comparison was to present the overall runtime of the entire workflow, without splitting between memory, time and disk usage of particular stages.

Animals and DNA-sequencing
The genomic DNA of five Holstein-Friesian cows was sequenced with the Illumina HiSeq2000 platform in pair-end read mode with a read length of 100 bp.The number of reads available for a single animal ranged between 391 952 216 and 407 377 182.In this study, for demonstration purposes, only sequence reads mapped to chromosome 25 (BTA25) were processed.

Bioinformatic analysis
The bioinformatics pipeline for SNP calling consisted of: (i) the quality control step performed using fastQC software ( 8 ) to assess the quality of the raw DNA sequence reads, (ii) the alignment of sequence reads to BTA25 from the ARS-UCD1.2reference genome (NCBI accession number: PRJNA391427) with BWA-MEM software ( 9 ), (iii) the post-alignment processing and (iv) SNP calling both performed with Samtools package ( 10 ).
The pipeline defined above was executed using three different setups visualized in Figure 1 : ( 1 ) as a plain Bash script run in parallel for each of the five individuals (plain Bash), ( 2 ) as an entire Bash script wrapped into a single Nextflow process (single-process Nextflow), ( 3 ) as each component of the Bash script, corresponding to each step of the SNP calling pipeline, defined as a separate Nextflow process, with processes connected via channels (multi-process Nextflow).Each setup ( 2 ,3 ) was executed in a parallel mode across each cow, additionally with multiple numbers of threads defined within each cow.The following constellations of the numbers of threads (T) and the numbers of forks (F) represented by a cow-level process were implemented: F5T1, F5T5, F5T10, and F5T15.In the plain Bash setup, parallelization across individuals was implemented by executing the full pipeline for each cow as a separate process.In single-process and multi-process Nextflow setups, Nextflow was used to implement across-cow parallelization.Furthermore, to compare the parallelization strategies implemented via Nextflow, the multi-process pipeline was executed with 50 threads, but sequentially processing each cow (F1T50).
All setups were compared for execution time, maximum memory usage, and hard disk storage space.

Hardware
All computations were performed on a server equipped with two Intel Xeon CPUs (E5-2699 v4) with 44 threads each, with a base frequency of 2.20 GHz and 188 GB of RAM.During the execution of the pipelines, the server was dedicated solely to our analysis.Only housekeeping processes were active along with pipeline executions.

Results
The result of the plain pipeline consisted of five VCF files generated for BTA25 in the single-individual mode, that is, separately for each cow for which the number of SNPs varied between 312 100 and 353 855.The output comprised quality reports on sequenced reads from fastQC software in HTML format, as well as text log files.Additionally, for pipelines that implement Nextflow, reports were generated in HTML format with information on pipeline execution.Each execution setup (plain Bash, single-process Nextflow, multi-process Nextflow) resulted in the same set of identified SNPs.
Regarding execution time (Figure 2 ), multi-process Nextflow was the fastest regardless of the number of threads assigned, except for sequential implementation on only one core (F5T1), where plain Bash was the most computationally efficient, being 13.80% faster than multi-process Nextflow and 19.80% faster than single-process Nextflow.On the contrary, for the parallelized computations, the advantage in execution time of multi-process Nextflow over plain Bash varied between 15.71% and 21.15% and between 23.10% and 26.79% over single-process Nextflow, depending on thread configuration.The largest difference was observed for F5T10, when multi-process Nextflow executed 11 h and 25 min, while plain Bash ran for 15 h and 30 min.Interestingly, no marked differences in execution time were observed between setups of 10 and 15 cores per animal (approximately 20 min).When comparing plain Bash with single-process Nextflow, there was no clear winner in terms of execution times, since they were very similar.Regardless of the implementation, the largest decrease in execution time occurred when processing of each animal was assigned five cores (F5T5), compared to  serial execution F5T1.The F5T5 configuration was almost 6.5 times faster than the serial implementation.
When comparing parallelization implemented via Nextflow with internal parallelization implemented in the programmes that are the components of the pipeline, i.e. fastQC, BWA-MEM and Samtools, the advantage of splitting the entire pipeline into separate processes corresponding to each individual (F5) became evident (Figure 3 ).F5T10 multi-process Nextflow configuration separately assigns 10 threads to each of the 5 cows executed in 11 h and 29 min, while the sequential approach computing one cow after another with 50 threads assigned for each animal (F1T50), despite the same resources defined, ran over three times longer (34 h and 14 min).
Regarding RAM utilization (Figure 4 A), a large difference was observed between multi-process Nextflow and the other two setups in all thread constellations implemented.In multiprocess Nextflow, RAM consumption increased with the number of assigned threads, while in the other setups, it did not depend much on the number of threads assigned for computations.This result clearly demonstrates the superiority of Nextflow in memory management.The largest difference between the three setups appeared for the F5T1 constellation, for which the multi-process Nextflow pipeline used 1.16 GB of memory, while the other setups consumed up to 7.44 GB (plain Bash) and 8.26 GB (single-process Nextflow).Therefore, the multi-process Nextflow was 7 times more memory efficient.In terms of hard drive space requirements (Figure 4 B), multi-process Nextflow occupied a markedly larger disk space, 1227 GB, compared to 156 GB used by single-process Nextflow and the plain Bash script.These results were consistent in all configurations.This disparity is primarily due to the creation of a working directory in which Nextflow generated temporary files were stored.

Discussion
With the rapidly growing popularity of the Nextflow workflow management system, it is important to implement the available tools in the most effective way to maximize profit in execution time and computer resources ( 11 ).Published reports of genomic workflow comparisons are scarce and do not formally compare the same pipeline implemented with and without a workflow management system.Recently, ( 12 ) proposed a Nextflow pipeline for single-cell A T AC-seq data analysis and compared it with two other pipelines implemented without the workflow management system ( https://github.com/ wbaopaul/ scA T AC-pro ) and with Snakemake workflow management ( https:// github.com/liulab-dfci/ MAESTRO ).Interestingly, all three pipelines produced different results, but with regard to memory consumption and execution time, no marked differences between implementations emerged.The authors of this study stated that the Nextflow pipeline was characterized by a much higher level of flexibility and ease of parameter optimization.( 13) created a Nextflow pipeline to obtain raw read counts from RNA-seq data and compared it with the Rsubread package ( 14 ) to implement the pipeline in R.However, since both implementations used different software, observed differences in execution times or memory usage are not meaningful in the context of a comparison of workflow efficiency.
An added value of using Nextflow pipeline management systems is the presence of the nf-core library, which provides a platform where researchers can contribute and share their analysis workflows that even aim to become standardized workflows for processing various types of omic data ( 15 ).Furthermore, the platform also provides good documentation that facilitates pipeline implementation.Workflow management systems address the problem of pipeline portability and reproducibility, which pose a serious problem in many research areas ( 16 ), ( 17 ).Moreover, the managed system application provides visualization tools during the execution of processes and after their completion, helping to compare and tune pipeline execution parameters.Another very practical feature is the ability to resume a process after a halt.This negates the need to run the process from the very beginning after solving a problem, which makes the debugging process more efficient.However, the downside of this feature lies in creating a work directory that consumes substantial amounts of drive space.Nextflow also enables the user to specify the number of threads used on several forks for each pipeline process.
The feature of using a pipeline management system is sharing common data across processes that does not enforce repeated computations when it is not necessary, e.g.genome indexing.Our comparison demonstrated that, for the shared memory architecture used for computations, Nextflow workflows are more efficient in terms of memory and CPU management of processes running in parallel.The most important benefit is related to the fact that Nextflow implements the functional reactive programming paradigm that supports non-synchronous data processing through defining so-called channels that transfer data between parallel processes, so that, in practice, when computations are completed for one animal, available resources are reallocated to other animals.Separating tasks into channels also allows for running the quality control independently of the alignment process, which is not possible under plain Bash implementation.However, it should be realized that as the number of defined threads per cow increases, multi-process Nextflow begins to use more of the available computing power, resulting in higher memory usage.Still, from a certain point on, increasing the number of threads did not result in a marked benefit in terms of execution time.Although the aspect was not formally investigated in our study, we suspect that the increased computational load of handling multiple parallel processes impeded the benefits of parallel computations, especially in the pipeline (like ours) that contains components that do not execute in the parallel mode, or do not employ parallel processing for the majority of its computations, e.g. the Samtools package, which uses multithreading exclusively for compressing alignment map files but not for their downstream processing.Although parallel computations provide important benefits for the overall execution time, this benefit is hampered by overhead of multithreading that is mainly composed of thread management, context-switch costs, and cash repletion (see already e.g. ( 18)).
A missing aspect of our study was the comparison of workflow performance implemented in a distributed memory architecture.Although due to the lack of the appropriate computing environment dedicated entirely to the comparison, i.e. running no other processes, it can be speculated that multiprocess Nextflow implementation would be even more beneficial over single-process Nextflow.Furthermore, the use of the plain Bash approach would require manual implementation of the MPI directives, which would impose an additional  programming burden.It is also worth mentioning that HDD access is a critical point in pipeline execution.The potential benefit of using a management system is that HDD IO operations can be handled in a more efficient way compared to a naïve implementation of a parallel process and the corresponding naïve HDD access scheme.

Conclusions
In dairy cattle, we currently experience a fast-growing load of digital data that originates on the phenotypic and environmental level from precision livestock farming systems utilized on many farms and on the genomic level-originating from sequencing of whole genomes of many individuals, mainly bulls.The expectation is that this information will be routinely used in dairy management and breeding decisions.In view of those fast-growing sizes of data, workflows and code parallelization are very important computational aspects.In this context, the Nextflow workflow management system is a useful tool not only for managing pipelines, which strongly and efficiently supports computational parallelization and enables the user to specify the number of threads used.Still, it is important to consider that, in parallel computing, a critical element is the proper design of the computing architecture expressed by the number of computing tasks and available CPU cores ( 19 ).Moreover, also efficient handling IO pressure is an important part of optimizing workflows that heavily rely on interacting with HDD for writing and / or reading.Efficient resource management guarantees that these multiple tasks coexist without interfering with one another, resulting in improved system performance, so that a well-designed system implementing optimal number of threads and a number of samples processed in parallel leads to more efficient utilization of computational resources ( 7 ).

Figure 1 .
Figure 1.SNP calling pipelines implemented in the study.From left to right, (i) not managed, plain Bash, (ii) single-process Nextflow where the entirety of the pipeline was wrapped into one Nextflow process, (iii) proper Nextflow pipeline design in which each step of the pipeline was a separate Nextflow process, connected via channels.

Figure 2 .
Figure 2. Average execution times of various setups of the SNP calling pipeline.On the X axis there are a number of threads used per individual.Differences in computation time between animals were not significant.

Figure 3 .
Figure 3. Execution time of NextFlow enforced parallelization (F5T10) to the internal parallelization implemented into programmes being the components of the pipeline: fastQC, B W A-MEM, Samtools (F1T50).

Figure 4 .
Figure 4. ( A ) Maximum RAM utilization of various setups of the SNP calling pipeline.( B ) Usage of hard drive by different implementations of the SNP calling pipeline.