Accumulating computational resource usage of genomic data analysis workflow to optimize cloud computing instance selection

Abstract Background Container virtualization technologies such as Docker are popular in the bioinformatics domain because they improve the portability and reproducibility of software deployment. Along with software packaged in containers, the standardized workflow descriptors Common Workflow Language (CWL) enable data to be easily analyzed on multiple computing environments. These technologies accelerate the use of on-demand cloud computing platforms, which can be scaled according to the quantity of data. However, to optimize the time and budgetary restraints of cloud usage, users must select a suitable instance type that corresponds to the resource requirements of their workflows. Results We developed CWL-metrics, a utility tool for cwltool (the reference implementation of CWL), to collect runtime metrics of Docker containers and workflow metadata to analyze workflow resource requirements. To demonstrate the use of this tool, we analyzed 7 transcriptome quantification workflows on 6 instance types. The results revealed that choice of instance type can deliver lower financial costs and faster execution times using the required amount of computational resources. Conclusions CWL-metrics can generate a summary of resource requirements for workflow executions, which can help users to optimize their use of cloud computing by selecting appropriate instances. The runtime metrics data generated by CWL-metrics can also help users to share workflows between different workflow management frameworks.

start data analysis, researchers need to select the tools by their experimental design and install them to their computing environment.
Installing open source tools in one's computational environment is, however, not always straightforward. Tools developed by different developers and different programming framework require different prerequisites, which forces one to follow the instruction provided by each tool's developer. Installing various software in one environment also can occur a conflict of software dependencies that are hard to resolve. Even if one could successfully install all the tools required for the analysis, maintaining the environment where all the tools keep working as expected is also a burden. There are also many events that can break the environment such as changes or updates of hardware, operating system, or software libraries. Therefore, the complexity of data analysis environment management gets higher when a project performs genomic data analysis that requires many tools. The high cost of setting up an environment results in the prevention of scaling out the computational resources as well. The difficulty also brings researchers' dependency to the existing computing platform already set up, and the concentration of data processing jobs to the limited resource.
The container virtualization technology, represented by Docker, enables users to create a software runtime environment isolated from the host machine [3]. This technology that is getting popular also in the biomedical research domain is a promising method to solve the problem of installing software tools [4]. Along with the containers, using workflow description and execution frameworks such as those from the Galaxy project [5] or the Common Workflow Language (CWL) project [6] lowered the barrier to deploy the data analysis environment to a new computing environment. Moreover, the workflows described in a standardized format can help researchers to share the environment with collaborators with ease. The improvement of portability of data analysis environment, consequently, has made the on-demand cloud infrastructure an appealing option for researchers.
On-demand cloud is beneficial for most cases in genome science because users can increase or decrease the number of computing instances without maintaining hardware as the amount of data from laboratory experiments changes [7]. For example, some sequencing applications require data analysis software that uses a considerable amount of memory, but individual research projects often cannot afford such a large scale computing platform. Users can save their budget by using the on-demand cloud platform as most of the service providers charge per usage.
However, to use an on-demand cloud environment efficiently regarding time and economic cost, it is essential to select a suitable computing unit, so-called instance type, from many options offered by the cloud service providers. For example, Amazon Web Service (AWS), one of the popular cloud service providers, offers instance types of different scales for five categories (general purpose, compute optimized, memory optimized, accelerated computing, and storage optimized) [8]. Each data analysis tool has the different minimum requirement of computational resources such as memory or storage, and it can change by input parameters. Executing data analysis workflows on an instance without enough computational resource will result in a runtime failure or unexpected outputs. For example, tools to assemble short reads to construct genome by constructing De Bruijn graph usually take long processing time and a large amount of memory. If one failed to estimate the required amount of memory, the process might fail after a few days of execution, which results in losing one's time and budget. Thus, users need to know the minimum amount of computational resource required by the execution of their workflows to select a suitable instance type.
To optimize the instance type selection concerning processing time or running cost, users need to compare runtime metrics of workflow executions on environments of different computational specs. Here, we developed CWL-metrics, a system to accumulate runtime metrics of workflow executions with information of the workflow and the machine environment. CWL-metrics provides runtime metrics summary such as usage of CPU, memory, storage I/O with workflow's input files and parameters to help users to select the proper cloud instance for their workflows.

Implementation of CWL-metrics
CWL-metrics is designed to capture runtime metrics data of workflows described in CWL, a workflow description specification developed by an open source community. We designed the system as it does not require the users to perform any configurations to capture runtime metrics. Figure 1 shows the procedures of runtime metrics collection by CWL-metrics. To start collecting metrics, one only needs to install the system, and then run their workflows with cwltool, a reference implementation of CWL [9]. After the installation, the system starts monitoring the processes running on the host machine. Once the system found a cwltool process, it automatically starts collecting runtime metrics via Docker API and environmental information from the host machine.
CWL-metrics also captures the log file generated by cwltool to extract workflow metadata such as input files and input parameters.
To capture and store the information from multiple data source, CWL-metrics launches multiple components as Docker containers ( Figure 2).
These components keep running on the host machine after the initialization to cooperate the data collection. The Telegraf container collects runtime metrics data from the Docker API for every sixty seconds, and send the data to the Elasticsearch container. The Elasticsearch container provides data storage and the data access API. CWL-metrics automatically launches and stops these components on the single host machine. If users need to collect metrics of workflows running on multiple instances, they need to install CWL-metrics on each instance and assemble the summary data after the metrics data capture.
Users can use their Elasticsearch server by setting environment variable ES_HOST and ES_PORT before initializing CWL-metrics.
To access and analyze the data collected by CWL-metrics, users can use the command cwl-metrics to get the data in JSON (Figure 3) or tab separated values (TSV) format. The JSON format contains workflow metadata such as the name of the workflow, the time of start and end of the workflow execution. It also has the information of the environment including the total amount of memory and the size of storage available on the machine. The steps field of the JSON format contains information of the runtime metrics, the executed container, and the input files and parameters. Users can parse the data to analyze the performance of a tool execution or the whole workflow. The TSV format provides minimum information for each container execution so that one can easily compare the metrics data of steps.

Use CWL-metrics to capture runtime metrics of RNA-Seq workflows
As an example use case to capture and analyze runtime metrics of workflows, we performed an analysis to optimize instance type selection for RNA-Seq quantification workflows. We run seven RNA-Seq workflows ( Table   1) for nine public human RNA-Seq data with different read length and number of reads (Table 2) on six types of Amazon Web Service (AWS) Elastic Compute Cloud (EC2) service (Table 3) to capture the runtime metrics by CWL-metrics for each combination. Each workflow description has two different options for read layout; single-end and paired-end. For the selection of workflows, we chose two read mapping tools STAR and Hisat2, with two transcriptome assembly and read count programs Cufflinks and StringTie. We also used two popular tools using alignment-like algorithms, Kallisto and Salmon. TopHat2, the program which was once the most popular, but now obsolete, was added among them for comparing purpose. We performed metrics data collection five times for each combination of workflow, input data, and instance type. The analysis used only the succeeded runs. Table 4 shows that the summary of runtime metrics, processing duration, and the calculated cost of instance usage per run for two workflows, HISAT2-Cufflinks and TopHat2-Cufflinks. The fastest processing time was one of the HISAT2-Cufflinks workflow run on the c5.4xlarge instance, but the execution at the cheapest cost was the HISAT2-Cufflinks workflow on the c5.2xlarge instance. It indicates that workflows on cloud instances can have a trade-off of the processing time and the financial cost. The priority of the research project, the execution speed over the financial cost or vice versa, will be required for the final decision of instance selection optimization. The table also shows the possibility of loss of time or money when one failed to choose a proper instance type. For example, if one used the r5.4xlarge instance to run the HISAT2-cufflinks workflow, it is 7% slower than c5.4xlarge, and about 1.6 times expensive per sample. The impact of the instance type optimization failure will be more serious for the data processing jobs that take days or weeks. The runtime metrics data provided by CWL-metrics also helps to perform a tool comparison. Figure 5 shows that the difference of processing time between the used workflows. Although users need to know the difference of the design concept and the strength of the tools to select the proper one for their research objectives, this result helps to understand the difference of the resource requirement of the workflows for similar purpose. For example, HISAT2 and STAR marked almost the same processing time, but STAR uses far more amount of memory. The plot of the processing time also shows that the obsolete tool TopHat2 is remarkably slower than the other tools.

Discussion
CWL-metrics enabled users to choose a proper cloud instance for workflow runs based on the runtime metrics data. The metrics data summarized by workflow inputs, such as the number of threads to use or total file size of input data, provides the most efficient cloud use for a research project. The data will also help the administrator of computational infrastructure to encourage researchers to use the cloud environment in case their local environment has too many running jobs to accept new job submissions. Each user might perform different analyses and visualizations concerning input parameters of their interest. Thus CWL-metrics outputs JSON and TSV data which are easy to parse and used for visualization by any language of users' favorite, rather having a custom visualization tool other than Kibana.
CWL-metrics is applicable for most cases in bioinformatics data analysis.
However, there are cases that the system does not work as effectively as expected. For example, the current implementation of CWL-metrics cannot capture the precise runtime metrics data of a tool that scatter its processes to multiple computation nodes. Also, it cannot estimate the performance of software that uses hardware acceleration systems such as GPU, since the information of those specific architectures is not available via Docker API.
Nevertheless, in the example use case using RNA-Seq workflows, we showed CWL-metrics could provide beneficial information to help users to decide on the use of cloud infrastructure.
There are also the other workflow operation frameworks that have functions to capture runtime metrics, such as Galaxy [5], Toil [10], or Nextflow [11]. However, we chose CWL as the workflow description framework and its reference implementation cwltool as the workflow runner for the system because CWL is the project providing a way to share the workflow across the different workflow systems. Once users collected the runtime metrics of workflows by CWL-metrics, they can use the same workflow description with multiple workflow runner implementations. There are fifteen implementations listed as those supporting CWL [12]. Some implementations including Galaxy are still not covering full functions to import and export CWL description to share and run workflows, but the others including Arvados, Toil, and Apache Airflow are already available to users. If one wanted to use a workflow system that does not support CWL yet, the summary of runtime metrics collected through Docker container is still valuable resource across the different frameworks.
CWL project has a subproject, CWL-Prov, to provide the provenance information of workflow executions to improve reproducibility of workflows by tracking intermediate files and logs [13]. The provenance information helps users to track inputs and outputs of workflow runs by using file checksum but does not record the detail of the resource usage. Adding runtime metrics data into the provenance information will cover the information regarding deployment, which helps users to reproduce the runs on a proper computing environment. Thus, the summary of runtime metrics collected by CWL-metrics should be bundled with the provenance information.
There will be more amount of sequencing data that one researcher needs to process by the technologies that produce a large amount of sequencing data such as high-throughput single-cell sequencing. In such a situation, it is essential to have a flexible computing environment that can quickly scale out according to the amount of data. The fast deployment of the data analysis environment to the proper cloud instance supported by Docker, CWL, and CWL-metrics is a way to achieve the computational scale out, which brings a huge benefit for bioinformatics researchers.

Potential Implications
The Common Workflow Language project aims to provide the workflow description specification for all domains that work with data analysis pipelines.
Therefore, CWL-metrics can contribute to other domains through the application of CWL. Sharing CWL workflows with the metrics data captured by CWL-metrics can help users to deploy them on an appropriate environment.

CWL-metrics software components
CWL-metrics runtime metrics capturing system is composed of five , and a Perl daemon script. Telegraf is an agent to collect runtime metrics of running containers via Docker API using Telegraf Docker plugin. Fluentd works as a log data collector to send metrics data produced by Telegraf to Elasticsearch server. Elasticsearch is a data store to accumulate runtime metrics data and workflow metadata, accepting JSON format data via API endpoint.
Kibana is a data browsing dashboard for Elasticsearch to view raw JSON data and to summarize and visualize data. Telegraf, Fluentd, Elasticsearch/Kibana launch as a set of containers during the initialization of CWL-metrics.
CWL-metrics runs a Perl script which monitors processes on the host machine to capture cwltool processes. Once the script found a cwltool process, the script runs a function to collect workflow information via debug output of the cwltool process, "docker info" command output, Docker container log via "docker ps" command, and output of system commands to collect environment information.
CWL-metrics provides a command cwl-metrics , which allows users to start and stop the metrics collection system, and fetch summarized runtime metrics data in a specified format, JSON or tab-separated format. The script to launch the whole system, CWL-metrics installation instruction, and the documentation are available on GitHub [18].

Packaging RNA-Seq tools and workflows
We used seven different RNA-Seq quantification workflows to capture runtime metrics and analyze performance on cloud infrastructure. Each To analyze the effect of sequence data quality to workflow runtime performance, we chose nine samples of different read length and number of reads from the public raw sequencing data repository, SRA (Table 2). We used the Quanto database [22] to select the data by filtering length and number of sequence reads, with the condition of read length, 50, 75, or 100 and the approximate number of sequence, 1,000,000, 5,000,000, or 10,000,000. We filtered the data with the query "organism == Homo sapiens", "study type == RNA-Seq", "read layout == PAIRED", and "instrument model == Illumina HiSeq", then manually picked suitable data. Both single-end and paired-end workflows used the same dataset while single-end workflows treated paired-end read files reads as two single-end read files. The version of the reference genome is GRCh38. We downloaded the reference genome file from the UCSC genome browser [23], and the transcriptome was from Gencode [24].

Run workflows on AWS EC2
To evaluate the performance on running different RNA-Seq workflows, we selected instance types of two different sizes 2xlarge and 4xlarge from three categories, general purpose, compute optimized, and memory optimized to run all workflows for all samples (Table 3). Each combination of instance type, workflow, and sample data was executed for five times while CWL-metrics is running on the same machine to capture the runtime metrics information. All workflow runs used Elastic Block Storage of General Purpose SSD volumes as file storage. We downloaded all the reference data used for workflows in advance. The scripts to get reference data and run workflows are available online [21].

Collect runtime metrics and summarize
After the workflow executions, we collected summarized metrics data from Elasticsearch by cwl-metrics fetch command. Exported JSON format data were parsed by a ruby script to create data summarized per workflow runs,       x-axis shows workflow names, and the y-axis shows the processing duration in seconds and total memory usage in bytes. We iterated each combination of workflow and instance type for five times. The plot of processing duration shows that there is a significant difference in execution time between the TopHat2 workflow and the others. While the difference of processing durations is relatively small, workflows with STAR aligner require four or five times much memory than HISAT2 workflows. These data suggest users know about runtime metrics of workflows before selecting cloud instance type.  We described seven different RNA-Seq quantification workflows in CWL. Each workflow description has two different options for read layout, single-end and paired-end. We selected two major read mapping tools STAR and Hisat2, with two transcriptome assemble and read count programs Cufflinks and StringTie.
We also used two popular tools using alignment-like algorithms, Kallisto and Salmon. We added TopHat2, one of the most popular but obsolete program for comparing purpose.   We summarized the runtime metrics values to compare two different workflows HISAT2-cufflinks and TopHat2-cufflinks. All runs are of input data SRR2567462. The read length was 100bp, the number of reads was 10,007,044.00, and the read layout was single-end. The shown values are workflow duration in seconds, the maximum CPU usage in percentage, the total amount of memory in bytes, the total amount of cache in bytes, the total amount of block IO in bytes, and the cost per run in USD. We calculated the median values for metrics values from the data of five times workflow iteration. Values can be zero for short-time steps since CWL-metrics collects these metrics with sixty seconds interval.

RNA-Seq workflows
We used eleven tools in total to construct seven RNA-Seq quantification workflows. The two tools we developed, download-sra and pfastq-dump, are packaged in containers by ourselves. The container of Salmon was available on its developer's build. We found the rest of tools in Biocontainers registry. We wrapped all the tools as CWL CommandLineTool class files and available on GitHub.