PM4NGS, a project management framework for next-generation sequencing data analysis

Background: FAIR (Findability, Accessibility, Interoperability, and Reusability) next-generation sequencing (NGS) data analysis relies on complex computational biology workflows and pipelines to guarantee reproducibility, portability, and scalability. Moreover, workflow languages, managers, and container technologies have helped address the problem of data analysis pipeline execution across multiple platforms in scalable ways. Findings: Here, we present a project management framework for NGS data analysis called PM4NGS. This framework is composed of an automatic creation of a standard organizational structure of directories and files, bioinformatics tool management using Docker or Bioconda, and data analysis pipelines in CWL format. Pre-configured Jupyter notebooks with minimum Python code are included in PM4NGS to produce a project report and publication-ready figures. We present 3 pipelines for demonstration purposes including the analysis of RNA-Seq, ChIP-Seq, and ChIP-exo datasets. Conclusions: PM4NGS is an open source framework that creates a standard organizational structure for NGS data analysis projects. PM4NGS is easy to install, configure, and use by non-bioinformaticians on personal computers and laptops. It permits execution of the NGS data analysis on Windows 10 with the Windows Subsystem for Linux feature activated. The framework aims to reduce the gap between researcher in experimental laboratories producing NGS data and workflows for data analysis. PM4NGS documentation can be accessed at https://pm4ngs.readthedocs.io/ .

or the cloud. In addition, some workflow managers have been developed to manage and execute CWL workflows (e.g., Toil [5] and CWL-Airflow [6]). These systems allow the control, monitoring, and checking of workflows deployed on stand-alone Linux machines or in cloud environments. Similarly, software installation tools specifically dedicated to bioinformatics, such as Bioconda [7], and Docker containers standardizations, such as Biocontainers [8], are available and result in easier bioinformatics tool installation and management. These languages, systems, and technologies solve the problem for data analysis pipeline execution across multiple platforms in a scalable way.
Moreover, applications providing user-friendly interfaces for NGS data analysis without any programming knowledge have been published; PhenoMeNal [9] and BioWardrobe [10] are popular examples. These applications offer a web-based graphical user interface encapsulating different NGS data analysis pipelines. They were implemented to analyze large NGS datasets taking advantage of cloud infrastructures or bare metal clusters. However, they are complex systems that require technical expertise for their use, deployment, and maintenance. Moreover, most of them do not work on personal laptops or workstations because they require a complete GNU/Linux environment with high-performance computing (HPC) or cloud infrastructure software components.
NGS data management through a standard organizational structure, automatic project report, and the production of publication-ready figures is still a complex process that involves the integration of multiple components and knowledge of the standards for publication. Experimental laboratories, however, demand the use of NGS data analysis tools to verify sample quality. These quality control tests are frequently executed by non-bioinformaticians but are limited to the use of tools such as FastQC [11] for pre-processing quality control. Nevertheless, for high-throughput data, pre-processing quality control is not sufficient to guarantee quality standards. Even with a positive pre-processing quality report, samples could include contamination or a high level of read duplication that could compromise further analysis. Time and resources can be saved if experimental researchers are able to align their samples and execute postprocessing quality controls in parallel with their experiments.
Here, we present PM4NGS, a computational framework for NGS data analysis that integrates all required components, including automatic creation of a standard organizational structure of directories and files, bioinformatics tool installation using Docker or Bioconda, and data analysis pipelines in CWL format. Pre-configured Jupyter notebooks with minimum Python code are included in PM4NGS to produce a project report and publication-ready figures. The framework is designed in a modular way that allows non-bioinformaticians a quick check and quality control of produced samples on a personal laptop. The authors would like to highlight that PM4NGS was designed as a fully interactive tool for data analysis on personal laptops or workstations. It can also be used as an educational tool to train new bioinformaticians on how to organize an NGS data analysis project showing a detailed view of the pipeline components.

Methods
Next-generation sequencing data analysis workflows PM4NGS currently includes 3 NGS data analysis workflows: differential gene expression and gene ontology (GO) enrichment analysis from RNA sequencing (RNA-Seq) data, differential binding analysis from chromatin immunoprecipitation sequencing (ChIP-Seq) data, and DNA motif binding detection from ChIPexo data. These workflows are in demand by NGS data analysis researchers, but PM4NGS also can easily be modified to process other NGS technologies and data types, such as ATAC-Seq or END-Seq data.
The workflows included in PM4NGS are based on the ENCODE Data Coordinating Center Uniform Processing Pipelines [12].

Differential gene expression and GO enrichment from RNA-Seq
RNA-Seq has become a standard technique to measure RNA expression levels. It is frequently used in basic and clinical research [13]. Differential expression (DE) analysis is one of the most used applications for RNA-Seq and allows the comparison of RNA expression levels between multiple conditions [14]. Benchmark studies have compared DE analysis tools [15,16]. Currently, the most common tools used for DE analysis are DE-Seq [14] and EdgeR [17]. DE analysis identifies gene expressionlevel alterations between conditions that can later be integrated with GO [18] annotations to identify key biological processes.
Our DE and GO enrichment pipeline comprises 5 steps, presented in Supplementary Table S1. The first step involves downloading samples from the NCBI SRA database [19], if necessary, or executing the pre-processing quality control tools on local samples. Subsequently, sample trimming, alignment, and quantification processes are executed. Once all samples are processed, groups of differentially expressed genes are identified per condition, using DESeq and EdgeR. Over-and underexpressed genes are reported by each program, and the intersection of their results is computed. Finally, once differentially expressed genes are identified, a GO enrichment analysis is executed to provide key biological processes, molecular functions, and cellular components for identified genes. The pipeline can be accessed via GitHub repository [20].

Differential binding analysis from ChIP-Seq
ChIP-Seq is a method to study genomic regions enriched in a specific DNA-binding protein in a massively parallel DNA sequencing experiment. It has broad applications for studying different epigenetic marks processed through histone modifications or transcription factor (TF) binding events [21]. Our differential binding analysis pipeline from ChIP-Seq experiments is applicable to any protein used as an immuno-precipitable target. The workflow comprises 5 steps, as presented in Supplementary  Table S2. The first step, sample download and quality control, is for downloading samples from the NCBI SRA database [19], if necessary, or for executing the pre-processing quality control tools on all samples. Sample trimming and alignment are executed next. Post-processing quality control on ChIP-Seq samples is executed using Phantompeakqualtools [22], as recommended by the ENCODE pipelines [23]. Then, a peak-calling step is performed using MACS2 [24], and peak annotation is executed with Homer [25]. Peak reproducibility is explored with IDR [26]. Finally, a differential binding analysis is completed with DiffBind [27]. The pipeline can be accessed via GitHub repository [28].

DNA motif binding detection from ChIP-exo data
ChIP-exo was developed by Rhee and Pugh [29] to improve the resolution of ChIP-Seq experiments. A phage exonuclease step is added to digest the 5 end of the TF-unbound DNA after ChIP. ChIP-exo reduces the statistically enriched highoccupancy binding regions detected with ChIP-Seq to singlebase accuracy [29]. The method has been widely adopted to study 1-base resolution of DNA protein bound regions, although the tools developed to analyze ChIP-Seq data do not take advantage of the ChIP-exo resolution. The pipeline presented is based on MACE [30], which was developed to take advantage of the characteristics of the ChIP-exo data that produce binding peaks with high sensitivity and specificity.
The 5-step workflow is presented in Supplementary Table  S3. The first step, sample download and quality control, is for downloading samples from the NCBI SRA database [19], if necessary, or for executing the pre-processing quality control tools on all samples. After that, sample trimming and alignment are executed. Post-processing quality control on the ChIP-exo samples is performed using Phantompeakqualtools [22], as recommended by the ENCODE pipelines [23]. After that, peak calling is performed, using MACE [30]. Finally, DNA motif annotation is performed with the MEME Suite [31]. The pipeline can be accessed via GitHub repository [32].

The PM4NGS computational framework
A key factor in any computational biology experiment is organization and effective management of the data [33]. A standard project organizational structure is mandatory to properly handle the interaction of several components such as bioinformatics tools, genome annotations, databases, input data files, output data files, figures, and scientific computational notebooks. PM4NGS automatically generates a standard organizational structure that allows a researcher external to the project, after a quick view of the directories and files, to gain an understanding of the project structure and be able to easily find any relevant data file, input, or output.
Our framework is based on a popular, free Python tool called Cookiecutter [34]. This tool creates projects from previously defined templates. It has been used to generate projects as dissimilar as pre-defined websites or project structures for data science [35]. Cookiecutter requires a Python installation for its execution. A simple command can be executed on Windows Subsystem for Linux (WSL), available from Windows 10 or direct use on Mac OS and Linux. In addition, the framework uses Jupyter Notebooks for data and workflow management. Jupyter is a free, open source, interactive web application that permits the distribution of life code in the form of computational notebooks. These computational notebooks are being used in data science, making data analysis shareable and reproducible [36,37]. Fig. 1 is a schematic description of the PM4NGS project with input requirements, main components, and outputs created for each pipeline. The Project Creation step is performed by Cookiecutter with pre-defined project templates that are based on the data to be analyzed. PM4NGS uses as input a sample sheet in TSV format (see description in the documentation [38]), which describes the samples' location, conditions, and replicates. The 4 Project Components are described in greater detail in the section below.

Standard project organizational structure
Our project organizational structure is inspired by an article published by Noble [33]. We adapted his directory and file system structure to be compatible with the PM4NGS approach, which is executed as including a dedicated folder for computational notebooks and CWL workflows. Fig. 2 shows the organizational structure created for a project. As an illustrative example, we describe the structure created for a differential gene expression and GO enrichment pipeline. First, a top-level directory is created with the project name. This directory is self-contained, which means that the transfer of the entire directory to another computational environment will ensure that all Jupyter Notebooks and CWL workflows will be executable elsewhere. See Box 1 for a full description of each folder inside the organizational structure.

Box 1:
r bin/cwl: the CWL workflows. r config: includes a file named "init.py," which defines all configurational variables for the project.
r data: includes all data used as input for the project, as well as a directory for the dataset to be processed that contains all compressed FASTQ files.    r results: this folder follows the same organizational structure that "data" does. It contains the "dataset" directory with all results from the workflow. r tmp: temporary folders used for CWL or Docker. r index.html: HTML file that will redirect to the 00-Project Report saved in HTML format for visualization in a browser (see [39]). r README.md: text file to add description about the project.
r LICENSE: text file with a license description for project distribution.
Jupyter Notebook for data and workflow management PM4NGS uses Jupyter Notebooks for data and workflow management. The notebooks are designed with minimum Python code, making them simple to understand and execute. For a standard analysis, notebooks can be executed without any modification. It is important to highlight that Jupyter Notebook was selected as the user interface in this project because it allows more advanced users to extend the analysis with customized workflows.
In addition, the entire workflow can be distributed in multiple notebooks, making it easy to understand and re-execute, if necessary. Furthermore, making the workflow interactive allows a visualization and check of the intermediate results, saving time and resources in case errors are detected.
All workflows include a notebook named "00-Project Report." This notebook should be executed after each step is finished. It will automatically create tables and figures that users can use to validate the data analysis process. This notebook also creates its own version in HTML format that can be used to share the results with collaborators or supervisors (see examples for an RNA-Seq project [39]).

Bioconda/Docker for bioinformatics tool management and execution
Software version management is also a key aspect of any computational experiment to obtain reproducible results. Compu-  tational technologies, such as Docker or Conda, have been developed to guarantee simple software version management for a non-expert user to implement computational environments straightforwardly.
Bioconda is a channel for the Conda package manager specialized on bioinformatics software. It contains >7,000 bioinformatics packages. Biocontainers is a project closely related to Bioconda, and it provides access to Bioconda packages through standardized containers. Both Bioconda and Bicontainers are community-driven projects that provide the infrastructure and basic guidelines to create, manage, and distribute bioinformatics packages. PM4NGS adopted Docker (Biocontainers) or Conda (Bioconda) for software management and execution. Users can decide between them at the time of project creation, depending on their needs or infrastructure. The framework also can be used without these technologies, but tool availability and version should be ensured by the user.
If Conda is selected, a Conda directory, named "conda," will automatically be created inside the directory "bin" by cwltool and runtime execution. This environment can be recreated automatically if the project is moved to another computational environment. If Docker is selected, all Docker images required by the workflow will be pulled automatically by cwltool as required by the workflow to the local computer.

CWL tools and workflow specifications
The tools and workflows included in this framework are published in a separate Git repository [40]. This repository comprises 2 main directories. The first directory, named "tools," includes all computational tools used by the workflows in CWL format. The second folder, named "workflows," includes all workflows.
Each CWL tool includes 2 YAML files with suffixes "bioconda.yml" and "docker.yml." As indicated by its name, the bioconda.yml file stores the software requirements for executing the CWL tool using Conda (Bioconda). The files specify the package names and versions. The CWL runner will create a Conda environment and install the package, if it does not exist, at runtime.
The docker.yml file defines the Biocontainers Docker image to be used. This image will be pulled by the CWL runner at runtime. The bioconda.yml and docker.yml files defined for each computational tool can be used in any computational environment to recreate the same scenario in which the project was executed.
PM4NGS uses the Biocontainers Registry [41] through its Python interface named bioconda2biocontainer [42] to update the CWL docker images defined in the docker.yml file during project creation. The Bioconda package name and version defined in the bioconda.yml file is passed as an argument to the update cwl docker from tool name method in bio-conda2biocontainer, returning the latest Docker image available for the tool. PM4NGS, after cloning the CWL repository, parses the Bioconda package names and versions from the bioconda.yml files and updates all defined Docker images to its latest tags, modifying all docker.yml files.
The workflows are developed in a modular way, as defined by Cohen-Boulakia et al. [43], which means that the workflow steps are packed in sub-workflows that can be replaced if required. For instance, the workflow for the alignment and quality control of ChIP-Seq data includes a sub-workflow for sample alignment that uses BWA [44] and SAMtools [45], producing as output a sorted BAM file with an index. This sub-workflow could be replaced by a similar workflow using a different aligner without modifying the main workflow. For more detail on workflow steps, tools, and outputs, refer to Supplementary Tables S1-S3.
Each data analysis project created with PM4NGS will include a cloned version of the CWL repository. The workflow directory will be in the "bin" folder, as shown in Fig. 2. This guarantees that each project is self-consistent and will allow a user to keep updating the tools and workflows without affecting closed or active projects.
PM4NGS uses cwltool as CWL runner and the option "parallel" to execute workflow steps in parallel. Multithreading tools uses the CWL variable "coresMin" to set the number of threads. Cwltool distributes the load using this variable and executes workflow steps until all available CPUs are in use. Tools that required a minimum RAM to be executed used the "minRAM" variable. The "maxRAM" variable is used to limit the amount of RAM that cwltool assigns to a step running the tool. These 2 variables are also used to determine the number of parallel steps to be executed.

Results and Discussion
PM4NGS was designed to be used on laptops or workstations. Execution is possible in any computational system that runs on 64-bit Linux, MAC OS, WSL, or a Linux virtual machine available on Windows.
The minimum computational resources required depends on the target organism of the samples. For analyzing human samples, a minimum of 32 GB of RAM is required owing to the size of the human genome. For analyzing bacterial samples, however, a minimum of 4 GB of RAM is enough.
The examples included in the supplementary materials were executed in a Dell Tower 5810 with 64 GB of RAM, 16-core Intel Xeon CPU E5-2640, and 1 TB of hard disk, using the Linux operating system OpenSUSE Leap 15.1 [46] and WSL with Ubuntu available from Windows 10 64-bit, 10 May 2019 update, downloaded from Microsoft [47].

Software requirements
The framework was developed to manage all software installation automatically. A minimum of Python 3.6 or later is required. In addition, if Conda will be used as a software manager, then a minimum version of Miniconda Python 3.7 64-bit should be installed. If, instead of Conda, Docker will be used, then the Docker Server should be installed. A more detailed description can be found in the project documentation [48].
For use in the WSL, any Linux distribution in the Microsoft Store can be used. We tested Ubuntu 20.04 and OpenSUSE Leap 15.1. Once the Linux distribution is available, the guidelines to follow are the same as those for the native Linux.

Input data
All implemented workflows require a sample sheet file during project creation. The file is copied to the "data/<DATASET>/" folder with name "sample table.csv." Note that the variable "DATASET" is chosen during project creation, and it is the name used to identify the dataset to be analyzed. For example, if data in the NCBI SRA database will be analyzed, we recommend using the BioProject ID as the dataset name. This file should include ≥4 columns, and names are case sensitive: sample name, file, condition, and replicate. See Box 2 for more description and example. PM4NGS uses the dataset encapsulation to allow multiple datasets using the same workflow version and notebooks.

Box 2: Columns:
r sample name: Sample names, which can be different from the sample file name. r file: This is the absolute path or URL to the raw fastq file. For paired-end data the files should be separated using the Unix pipe | as in, e.g., HTTP and FTP URLs are also allowed and PM4NGS will download the data to the folder "data/<dataset name>/" during project creation. This column should remain empty if the data are in the NCBI SRA database; PM4NGS will add a cell to the "01-Pre-processing QC" notebook to download the data using fastq-dump. For RNA-Seq projects, the differential gene expression that compares these conditions will be generated. If there are multiple conditions, all comparisons will be generated. There must be ≥2 conditions. For ChIP-Seq projects, differential binding events will be detected when comparing these conditions. If there are multiple conditions, all comparisons will be generated. There must be ≥2 conditions. For ChIP-exo projects, the samples of the same condition will be grouped for the peak calling with MACE. If the project is analyzing data from the NCBI SRA database, the 01-Pre-processing QC Jupyter notebook will download the raw files, using fastq-dump to read the SRA accessions from the column "sample name" in the sample table.csv file. The raw files downloaded will be placed in the data/<DATASET>/ folder. If the samples are being provided by the users, however, the absolute path or URL needs to be defined in the sample sheet; PM4NGS will copy or download the samples to the data/<DATASET>/ folder. The naming convention defined in Box 2 should be followed.

Reference genome and annotations
All workflows require a reference genome and its annotations. The locations of these directories are defined during project creation but can be changed, in case of need, in the project configuration file config/init.py.
Depending on the NGS technology that needs to be analyzed, a short-read aligner is used. For RNA-Seq, the STAR [49] aligner is used. For ChIP-Seq and ChIP-exo, the BWA [44] aligner is used. The project documentation includes a detailed description of how to create the genome indexes for each aligner used by PM4NGS.

Supporting data
We analyzed 3 public BioProjects for demonstration purposes to show the functionality of the 3 workflows presented in this article (see Table 1). The links in Table 1 give access to the project report, in HTML format, automatically generated by PM4NGS.
A detailed description of the procedure for downloading, installing, and configuring PM4NGS is available in the documentation [48]. The steps for analyzing each BioProject in Table 1 are described with execution time, results and figures, and troubleshooting and expected results.

Comparison with other frameworks
Workflow managers for NGS data analysis are publicly available. The open source solutions are Nextflow [52], CWL-Airflow [6], Arvados [53], REANA [54], and, more recently, SnakePipes [55]. With the exception of SnakePipes, all open source platforms require multiple steps to get a local version installed and configured. These steps demand high expertise in system administration and limit their use for non-bioinformaticians. Nextflow, SnakePipes, CWL-Airflow, and REANA only execute the workflows and do not offer user-friendly interfaces for managing the analysis data. Arvados has a limited local client "Arvados-in-abox" that should be used for workflow development and is not intended for production use. It is focused on executing pipelines on Kubernetes [56], Minikube [57], and GKE [58]. BioWardrobe [10] is no longer supported as stated on the GitHub repository [59].
Commercial platforms, such as DNAnexus [60], DNAstar [61], Seven Bridges [62], and SciDAP [63], also provide web and desktop tools for NGS data analysis. They are designed to target HPC systems or cloud infrastructures and are not designed for use on personal laptops or workstations. In addition, platforms such as DNAnexus, Seven Bridges, and SciDAP constrain clients to use their computational infrastructure.
PM4NGS presents simple solutions to many of these problems. It is free; it is easy to install and use; it includes Jupyter Notebooks as a user interface; it allows the use of Docker or Conda; projects are self-contained through an organizational standard structure; and it automatically creates project reports and publication-ready figures.

Limitations
PM4NGS is currently limited to being executed on a single computer. Although it can execute multiple workflows in parallel, it does not include an implemented interface to submit jobs to HPC clusters or batch submissions to a cloud infrastructure. In addition, the number of available data analysis workflows is limited to 3, but this will be increased in the future.
The framework guarantees the repeatability, replicability, and shareability of the data analysis, following the definitions of Cohen-Boulakia et al. [43]. "Repeatability" means that, if the same analysis is executed multiple times, the results will be the same; "replicability" means that, if the same analysis is executed in different computational environments, the results will be the same; and "shareability" means that sharing a packaged data analysis with all its input data and conditions should guarantee repeatability and replicability. This framework does not, however, present a final solution for reproducibility and reuse because it does not include independent workflows to verify conclusions. In the case of differential gene expression analysis, 2 DGA tools, DESeq [14] and EdgeR [17], are used to corroborate the results with the aim of introducing a minimum level of reproducibility in the analysis.

Future improvements
We are working on implementing standard interfaces that allow users to submit workflows as batch submissions to private HPC clusters or to the different publicly available cloud providers. Users with technical expertise, however, can modify the current notebooks to submit jobs to their computational infrastructure. Furthermore, new pipelines are being implemented and tested for inclusion. Nevertheless, it is useful to highlight that existent workflows can be modified for analyzing other NGS technologies. These are important features, and we think that this framework could help researchers in experimental laboratories by providing a user-friendly tool for sample quality control and quick analysis.

Conclusion
PM4NGS is an open source framework that creates a standard organizational structure for NGS data analysis projects. It is easy to install and configure and is designed to be used by non-bioinformaticians on their personal computers or laptops. It permits execution of NGS data analysis on Windows 10 with the Windows Subsystem for Linux feature activated. Conda/Bioconda or Docker are used for bioinformatics tool management, allowing the framework to keep tool versions fixed for workflow reproducibility.
The users interact with Jupyter Notebooks with minimum Python code to process the data. The workflows are modular, which allows intermediate results to be inspected at any time. Project reports and publication-ready figures are also automatically created.
The 3 NGS data analysis workflows currently included in PM4NGS are based on the ENCODE Data Processing Pipelines.
Finally, we highlight that this framework aims to reduce the gap between researcher in experimental laboratories producing NGS data and the workflows for the data analysis. The complexity of working with multiple directories, data files, and programs on the Linux command line interface is completely managed by PM4NGS, allowing a researcher to focus on interpreting results.