CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language

Abstract
Background: Massive growth in the amount of research data and computational analysis has led to increased use of pipeline managers in biomedical computational research. However, each of the >100 such managers uses its own way to describe pipelines, leading to difficulty porting workflows to different environments and therefore poor reproducibility of computational studies. For this reason, the Common Workflow Language (CWL) was recently introduced as a specification for platform-independent workflow description, and work began to transition existing pipelines and workflow managers to CWL.
Findings: Herein, we present CWL-Airflow, a package that adds support for CWL to the Apache Airflow pipeline manager. CWL-Airflow uses the CWL version 1.0 specification and can run workflows on stand-alone MacOS/Linux servers, on clusters, or on a variety of cloud platforms. A sample CWL pipeline for processing of chromatin immunoprecipitation sequencing data is provided.
Conclusions: CWL-Airflow will provide users with the features of a fully fledged pipeline manager and the ability to execute CWL workflows anywhere Airflow can run, from a laptop to a cluster or cloud environment. CWL-Airflow is available under Apache License, version 2.0 (Apache-2.0), and can be downloaded from https://barski-lab.github.io/cwl-airflow, https://scicrunch.org/resolver/RRID:SCR_017196.

Keywords: Common Workflow Language, workflow manager, pipeline manager, Airflow, reproducible data analysis, workflow portability

Background
Modern biomedical research has seen a remarkable increase in the production and computational analysis of large datasets, leading to an urgent need to share standardized analytical techniques. However, of the more than one hundred computational workflow systems used in biomedical research, most define their own specifications for computational pipelines [1,2]. Furthermore, the evolving complexity of computational tools and pipelines makes it nearly impossible to reproduce computationally heavy studies or to repurpose published analytical workflows. Even when the tools are published, the lack of a precise description of the operating system environment and component software versions can lead to inaccurate reproduction of the analyses, or to analyses failing altogether when executed in a different environment.
To ameliorate this situation, a team of researchers and software developers formed the Common Workflow Language (CWL) working group [3] with the intent of establishing a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments. The CWL specification provides a set of formalized rules that can be used to describe each command line tool and its parameters, and optionally a container (e.g., a Docker [4] or Singularity [5] image) with the tool already installed. CWL workflows are composed of one or more of such command line tools. Thus, CWL provides a description of the working environment and version of each tool, how the tools are "connected" together, and what parameters were used in the pipeline. Researchers using CWL are then able to deposit descriptions of their tools and workflows into a repository (e.g., dockstore.org) upon publication, thus making their analyses reusable by others.
After version 1.0 of the CWL standard [6] and the reference executor, cwltool, were finalized in 2016, developers began adapting the existing pipeline managers to use CWL. For example, companies such as Seven Bridges Genomics, Inc. and Curoverse, Inc. are developing the commercial platforms Rabix [7] and Arvados [8], whereas academic developers (e.g., Galaxy [9], Toil [10] and others) are adding CWL support to their pipeline managers (see Discussion).
Airflow [11] is a lightweight workflow manager initially developed by Airbnb, Inc. that has since graduated from the Apache Incubator and is available under the permissive Apache License.
Airflow executes each workflow as a directed acyclic graph (DAG) of tasks that have directional, noncircular dependencies. The tasks are usually atomic and are not supposed to share any resources with each other; therefore, they can be run independently. The DAG describes relationships between the tasks and defines their order of execution. The DAG objects are initiated from Python scripts placed in a designated folder. Airflow has a modular architecture and can distribute tasks to an arbitrary number of workers and across multiple servers while adhering to the task sequence and dependencies specified in the DAG. Unlike many of the more complicated platforms, Airflow imposes little overhead, is easy to install, and can be used to run task-based workflows in various environments ranging from standalone desktops and servers to Amazon or Google cloud platforms.
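To make the notion of a DAG-defined execution order concrete, the following minimal sketch computes one valid order for a small graph using only the Python standard library (graphlib, Python 3.9 or later). It does not use Airflow itself; the task names and the execution_order helper are hypothetical and chosen purely for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "align": set(),
    "sort": {"align"},
    "index": {"sort"},
    "call_peaks": {"sort"},
    "report": {"index", "call_peaks"},
}


def execution_order(deps):
    """Return one valid execution order; raises CycleError if deps is not acyclic."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Tasks that do not depend on each other (here, index and call_peaks) may appear in either order, which is what allows a scheduler to run them independently.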
Airflow also scales horizontally on clusters managed by Apache Mesos [12] and may be configured to send tasks to the Celery [13] task queue. Herein, we present an extension of Airflow, allowing it to run CWL-based pipelines. Altogether, this gives us a lightweight workflow management system with full support for CWL, the most promising scientific workflow description language.
The CWL-Airflow package extends Airflow's functionality with the ability to parse and execute workflows written with the CWL version 1.0 (v1.0) specification [6]. CWL-Airflow can be easily integrated into the Airflow scheduler logic as shown in the structure diagram in Figure 1. The Apache Airflow code is extended with a Python package that defines four basic classes: JobDispatcher, CWLStepOperator, JobCleanup, and CWLDAG. Additionally, the automatically generated cwl_dag.py script is placed in the DAGs folder. While periodically loading DAGs from the DAGs folder, the Airflow scheduler runs the cwl_dag.py script and creates DAGs on the basis of the available jobs and corresponding CWL workflow descriptor files.

Methods
To run a CWL workflow in Airflow, a file describing the job should be placed in the jobs folder (Fig. 1). Each job is described by a file in JSON or YAML format that includes workflow-specific input parameters (e.g., input file locations) and three mandatory fields: workflow (absolute path to the CWL descriptor file to be run with this job), output_folder (absolute path to the folder where all the output files should be moved after successful pipeline execution) and uid (unique identifier for the run). CWL-Airflow parses every job file from the jobs folder, loads the corresponding CWL workflow descriptor file and creates a CWLDAG-class instance on the basis of the workflow structure and input parameters provided in the job file. The uid field from the job file is used to identify the newly created CWLDAG-class instance.
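A job file of the kind described above might look like the following JSON document. This is a minimal sketch rather than the CWL-Airflow parser: the load_job helper, the file paths, the uid value and the fastq_file input name are all hypothetical, and only the three mandatory field names come from the text.

```python
import json
import posixpath

MANDATORY_FIELDS = ("workflow", "output_folder", "uid")


def load_job(text):
    """Parse a JSON job description and check the three mandatory fields."""
    job = json.loads(text)
    missing = [name for name in MANDATORY_FIELDS if name not in job]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    # Both paths are required to be absolute (POSIX-style paths assumed here).
    for name in ("workflow", "output_folder"):
        if not posixpath.isabs(job[name]):
            raise ValueError(f"{name} must be an absolute path: {job[name]!r}")
    return job


# Hypothetical job description; fastq_file is a workflow-specific input.
example_job = json.dumps({
    "workflow": "/path/to/workflow.cwl",
    "output_folder": "/path/to/output",
    "uid": "run-001",
    "fastq_file": {"class": "File", "location": "/path/to/reads.fastq"},
})

job = load_job(example_job)
print(job["uid"])
```

Any key other than the three mandatory ones is treated here as a workflow input and passed through unchanged.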
CWLDAG is a class for combining the tasks into a DAG that reflects the CWL workflow structure. Every CWLStepOperator task corresponds to a workflow step and depends on others on the basis of the workflow step inputs and outputs. This implements dataflow principles and architecture that are missing in Airflow. Additionally, the JobDispatcher and JobCleanup tasks are added to the DAG. JobDispatcher is used to serialize the input parameters from the job file and provide the pipeline with the input data; JobCleanup returns the calculated results to the output folder. When the Airflow scheduler executes the pipeline from the CWLDAG, it runs the workflow with a structure identical to that of the CWL descriptor file used to create this graph.
Though running CWL-Airflow on a single node may be sufficient in most cases, it is worth switching to the multi-node configuration (Fig. 2) for computationally intensive pipelines. Airflow uses the Celery task queue to distribute processing over multiple nodes. Celery provides the mechanisms for queueing and assigning tasks to multiple workers, whereas the Airflow scheduler uses the Celery executor to submit tasks to the queue. The Celery system helps not only to balance the load over the different machines, but also to define task priorities by assigning tasks to separate queues.
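The way task dependencies follow from workflow step inputs and outputs, as described above for CWLDAG, can be sketched in a few lines. This is not the CWLDAG implementation; it assumes a simplified CWL v1.0 workflow already loaded into a Python dict, with steps given in map form and each input source given either as a string or as a mapping with a source key. The step names and the step_dependencies helper are hypothetical.

```python
def step_dependencies(workflow):
    """Map each step id to the set of step ids whose outputs it consumes."""
    deps = {}
    for step_id, step in workflow["steps"].items():
        upstream = set()
        for source in step["in"].values():
            if isinstance(source, dict):
                source = source.get("source")
            # "step/output" names another step's output;
            # a bare name refers to a workflow-level input.
            if isinstance(source, str) and "/" in source:
                upstream.add(source.split("/", 1)[0])
        deps[step_id] = upstream
    return deps


# Hypothetical three-step workflow in simplified CWL form.
workflow = {
    "class": "Workflow",
    "inputs": {"reads": "File", "genome": "Directory"},
    "steps": {
        "align": {"in": {"fastq": "reads", "index": "genome"}, "out": ["sam"]},
        "sort": {"in": {"unsorted": "align/sam"}, "out": ["bam"]},
        "peaks": {"in": {"treatment": {"source": "sort/bam"}}, "out": ["calls"]},
    },
}

deps = step_dependencies(workflow)
print(deps)
```

The resulting mapping has the same shape as a task dependency graph, so each step can become one task whose upstream tasks are the steps listed for it.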
An example of a CWL-Airflow Celery cluster of 4 nodes is shown in Figure 2. The tasks are submitted to the queue by node 1 and executed by any of the 3 workers (nodes 2, 3 and 4). Node 1 runs two mandatory components: the Airflow database and the scheduler. The latter schedules the task execution by adding tasks to the queue. All Celery workers are subscribed to the same task queue. Whenever a worker pulls a new task from the queue, it runs the task and returns the execution results. For sequential steps, the Airflow scheduler then submits the next tasks to the queue. During the task execution, intermediate data are kept in the temp folder. Upon successful pipeline completion, all output files are moved to the output folder. Both the temp and output folders, as well as the dags and jobs folders, are shared among all the nodes of the cluster. Optionally, node 1 can also run the Airflow webserver (Fig. 3) and the Celery monitoring tool Flower (Fig. 4) to provide users with the pipeline execution details.
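The interplay between the scheduler and the workers can be illustrated with a small simulation. The sketch below does not use Celery or Airflow: it is a sequential, deterministic model in which a scheduler enqueues every task whose upstream tasks have finished and hypothetical workers take turns pulling from one shared queue. Real workers run concurrently and pull tasks as they become free; the task names, node names and the simulate_cluster helper are assumptions made for illustration.

```python
from collections import deque
from graphlib import TopologicalSorter


def simulate_cluster(deps, workers):
    """Return (task, worker) pairs in the order the tasks were executed."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    queue = deque()
    log = []
    turn = 0
    while sorter.is_active():
        # Scheduler: enqueue every task whose dependencies are complete.
        queue.extend(sorted(sorter.get_ready()))
        # Workers: take turns pulling tasks from the shared queue.
        while queue:
            task = queue.popleft()
            worker = workers[turn % len(workers)]
            turn += 1
            log.append((task, worker))
            # Reporting the result lets the scheduler release downstream tasks.
            sorter.done(task)
    return log


deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
log = simulate_cluster(deps, ["node2", "node3", "node4"])
print(log)
```

In this model, tasks b and c become available together once a has finished, so they land on different workers, whereas d has to wait for both.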

ChIP-Seq analysis with CWL-Airflow
As an example, we used a workflow for basic analysis of chromatin immunoprecipitation sequencing (ChIP-Seq) data [14] (Fig. 5, Research object: Additional file 1). This workflow is a CWL version of a Python pipeline from BioWardrobe [15,16]. It starts by using Bowtie [17] to perform alignment to a reference genome, resulting in an unsorted SAM file. The SAM file is then sorted and indexed with SAMtools [18] to obtain a BAM file and a BAI index. Next, MACS2 [19] is used to call peaks and to estimate fragment size. In the last few steps, the coverage by estimated fragments is calculated from the BAM file and is reported in bigWig format (Fig. 5). The pipeline also reports statistics, such as read quality, peak number and base frequency, and other troubleshooting information using tools such as FASTX-Toolkit [20] and BamTools [21]. The directions for how to run a sample pipeline can be found on the CWL-Airflow webpage [14].
Execution time in CWL-Airflow was similar to that of the reference implementation (Table 1).

Portability of CWL analyses
The key promise of CWL is the portability of analyses. Portability refers to the ability to seamlessly run a containerized CWL pipeline developed for one CWL platform on another CWL platform, allowing users to easily share computational workflows. To check whether CWL-Airflow can use pipelines developed by others, we downloaded an alternative workflow for the analysis of ChIP-Seq data developed by the ENCODE Data Coordination Center [26,27] and ran it on a test dataset (CEBPB ChIP-Seq in A549 cells, ENCODE accession: ENCSR000DYI). CWL-Airflow was able to run the pipeline and produced results identical to those obtained with the reference cwltool. The execution time is shown in Table 1. Notably, running the tested pipelines on the single-node CWL-Airflow system increased execution time by 18%, whereas running them on the three-node CWL-Airflow cluster reduced execution time by 41% per workflow compared to the reference cwltool. These results confirm that CWL-Airflow complies with the CWL specification, supports portability and performs analysis in a reproducible manner. Additional testing of pipeline portability is currently being conducted as a part of the Global Alliance for Genomics and Health (GA4GH) workflow portability challenge [28].

CWL-Airflow in multi-node configuration with Celery executor
To demonstrate the use of CWL-Airflow in a multi-node configuration, we set up a Celery cluster of 3 nodes with 4 CPUs and 94 GB of RAM each, with each node running an instance of the Airflow Celery worker. Tasks were queued for execution by the Airflow scheduler that was launched on the first node. Communication between the Celery workers was managed by the message queueing service RabbitMQ. RabbitMQ, as well as the Airflow database and web server, was run on the first node. Executing the two tested pipelines on the Airflow Celery cluster demonstrated only a slight slow-down on a per-run basis (Table 1).
Unlike most of the other workflow managers, Airflow provides a convenient, web-based GUI that allows a user to monitor and control the pipeline execution. Within this web interface, a user can easily track the workflow execution history and collect and visualize statistics from multiple workflow runs. Similar to some of the other pipeline managers, Airflow provides a REST API (representational state transfer application program interface) that allows a user to access its functionality through dedicated endpoints. The API can be used by other software to communicate with the Airflow system.

Discussion
Airflow supports parallel workflow step execution.
Step parallelization can be convenient when the workflow complexity is not high and the computational resources are not limited.
However, when running multiple workflows, especially on a multi-node system, it becomes reasonable to limit parallelism and balance load over the available computing resources. Besides the standard load balancing algorithms provided by the computing environment, Airflow supports pools and queues that allow for even distribution of tasks among multiple nodes.
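The effect of limiting parallelism can be estimated with a simple model. In Airflow, a pool caps the number of tasks that run at the same time through a fixed number of slots. The sketch below does not call Airflow: it assumes tasks are started in the given order on whichever slot frees up first, and the task durations and the makespan helper are hypothetical.

```python
import heapq


def makespan(durations, slots):
    """Total wall-clock time when tasks start in order on the first free slot."""
    if slots < 1:
        raise ValueError("a pool needs at least one slot")
    free_at = [0] * slots  # min-heap of the times at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


durations = [3, 2, 2, 1]  # hypothetical task run times, in arbitrary units
for slots in (1, 2, 4):
    print(slots, makespan(durations, slots))
```

With these numbers, going from one slot to two halves the total time, whereas four slots bring little further gain because the longest task dominates.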
Addition of the CWL capability to Airflow has made it more convenient for scientific computing, in which the users are more interested in the flow of data than the tasks being executed.
Though Airflow itself (like most pipeline managers [28]) defines workflows only as sequences of steps to be executed (i.e., as DAGs), the CWL description of inputs and outputs leads to a better representation of data flow, which allows for a better understanding of data dependencies and produces more readable workflows.
Furthermore, as one of the most lightweight pipeline managers, Airflow contributes only a small amount of overhead to the overall execution of a computational pipeline (Table 1). We believe that this overhead is an advantageous exchange for (1) Airflow's ability to monitor and control workflow execution and (2) CWL's enablement of better reproducibility and portability of biomedical analyses. In summary, CWL-Airflow will provide users with the ability to execute CWL workflows anywhere Airflow can run, from a laptop to a cluster or cloud environment.