Container Profiler: Profiling resource utilization of containerized big data pipelines

Abstract

Background: This article presents the Container Profiler, a software tool that measures and records the resource usage of any containerized task. Our tool profiles the CPU, memory, disk, and network utilization of containerized tasks, collecting over 60 Linux operating system metrics at the virtual machine, container, and process levels. The Container Profiler supports performing time-series profiling at a configurable sampling interval to enable continuous monitoring of the resources consumed by containerized tasks and pipelines.

Results: To investigate the utility of the Container Profiler, we profile the resource utilization requirements of a multistage bioinformatics analytical pipeline (RNA sequencing using unique molecular identifiers). We examine profiling metrics to assess patterns of CPU, disk, and network resource utilization across the different stages of the pipeline. We also quantify the profiling overhead of our Container Profiler tool to assess the impact of profiling a running pipeline with different levels of profiling granularity, verifying that impacts are negligible.

Conclusions: The Container Profiler provides a useful tool that can be used to continuously monitor the resource consumption of long and complex containerized applications that run locally or on the cloud. This can help identify bottlenecks where more resources are needed to improve performance.


Background
Large-scale and diverse biomedical data have been generated to advance the understanding of biological mechanisms. Interpreting these data typically involves multiple analytical steps, each of which consists of different computational methods and software tools. An analytical pipeline (or workflow) is a sequence of computational tasks used to process and analyze specific biomedical data. Each analytical step in a pipeline can potentially require a different set of applications, libraries, and software dependencies. As a result, software containers that encapsulate executables with their dependencies have become popular to facilitate the deployment of complicated pipelines and to enhance their reproducibility [1,2]. Additionally, different analytical steps in a pipeline can have different computing resource requirements. In particular, many bioinformatics pipelines consist of one or more computationally intensive steps stemming from their operation on large datasets requiring significant CPU, memory, network, and disk resources. As an example, the alignment step in an RNA sequencing pipeline typically requires relatively more CPU and memory resources than other steps, while the data download step typically requires more disk I/O and network resources.
Cloud computing has emerged as a solution that can provide the necessary resources for computationally intensive bioinformatics analyses [3,4,5,6,7,8,9]. However, deployment of analytical pipelines using Infrastructure-as-a-Service (IaaS) cloud platforms requires selecting the appropriate type and quantity of virtual machines (VMs) to address performance goals while balancing hosting costs. Cloud resource type selection is presently complicated by the rapidly growing number of available VM instance types and pricing models offered by public cloud providers. For example, the Amazon, Microsoft, and Google public clouds presently offer hundreds of different VM types under different pricing models. Further, Google allows users to create custom VM types with unique combinations of CPUs, memory, and disk capacity. These cloud VMs are available directly, or through various container platforms. Determining the best cloud deployment requires understanding the resource requirements of the pipeline.

Key Points

• We present the Container Profiler, a tool that enables profiling the resource utilization of any script or container-based task on Linux.
• The Container Profiler collects CPU, memory, disk, and network resource utilization metrics at the virtual machine, container, and process levels.
• The Container Profiler supports delta and time series resource utilization profiling at an adjustable time interval (e.g. 1 second), supporting monitoring and graphing of resource utilization and enabling time series analysis to help identify performance bottlenecks for any Linux-based computational task.
• The Container Profiler can profile complex containerized computational jobs such as bioinformatics pipelines where multiple individual containers are used to implement specific steps.
• The Container Profiler is provided as a container which can merge with any existing container or be used separately to profile independent Linux scripts or executables, characterizing task resource utilization locally or on the cloud.
• We illustrate how different resources are required when performing different steps of a containerized pipeline analyzing unique molecular identifier (UMI) RNA sequencing data.

Our Contributions
This paper presents the Container Profiler, a tool that supports profiling the computational resources utilized by software within a Docker container. Our tool is simple, easy-to-use, and can record the resource utilization for any Dockerized computational job. Understanding fine-grained resource utilization of containerized computational tasks can help identify resource bottlenecks and inform the choice of optimal cloud deployment. The Container Profiler collects over sixty metrics to characterize the CPU, memory, disk, and network resource utilization at the VM, container, and process levels. In addition, the Container Profiler supports time-series graphing, enabling the visualization and monitoring of resource utilization of containerized tasks and pipelines.
We present a case study involving profiling the resource utilization of a multi-stage containerized bioinformatics pipeline that analyzes the unique molecular identifiers (UMI) of RNA sequencing data. In this study, we demonstrate how the Container Profiler performed time-series sampling at a one-second interval while a compute-bound bioinformatics pipeline simultaneously ran up to 85 distinct processes. Under load, our tool was able to profile the RNA-sequencing pipeline with full verbosity (all metrics), with 100% of the profiling samples obtained in under 100 ms.

Related Work
Cloud computing has been used to process massive RNA sequencing (RNA-seq) datasets [10,11]. These pipelines typically consist of multiple computational tasks, where not all tasks necessarily have the same resource requirements. Tatlow et al. studied the performance and cost profiles for processing large-scale RNA-seq data using pre-emptible virtual machines (VMs) on the Google Cloud Platform [10]. The authors collected resource utilization metrics to characterize user and system vCPU utilization, memory usage, disk activity, and network activity for the different computational stages of the RNA-seq pipeline. Tatlow et al. observed how resource utilization can vary dramatically across different processing tasks in the pipeline, while demonstrating that resource profiling can help to identify resource requirements of unique pipeline stages. Juve et al. developed a pair of tools called wfprof (pipeline profiling) to collect and summarize performance metrics for diverse scientific pipelines from multiple domains including bioinformatics [12]. Wfprof consists of two tools: ioprof, which measures process I/O, and pprof, which characterizes process runtime, memory usage, and CPU utilization. These tools accomplish profiling at the machine level primarily by analyzing process-level resource utilization; they do not focus on profiling containerized pipelines, nor do they collect container-specific metrics.
Tyryshkina, Coraor, and Nekrutenko leveraged coarse-grained resource utilization data from historical job runs collected over 5 years on the Galaxy platform to estimate the required CPU time and memory to improve task scheduling [13]. Galaxy, a scientific workflow, data integration, data analysis, persistence, and publishing platform, was initially developed for genomics research and is now considered largely domain agnostic and is used for processing general bioinformatics pipelines. The authors identified the challenge of determining the appropriate amount of memory and processing resources for scheduling bioinformatics analyses at scale. The majority of metrics in the study consisted of metadata regarding job configurations. Assessing the utility of fine-grained operating system metrics, as collected by the Container Profiler, for profiling resource utilization of genomics pipelines was not the focus. This effort considered many older jobs that ran on Galaxy before containers were used; these jobs thus lacked container-based metrics.
Outside bioinformatics, Weingartner et al. highlight the importance of profiling resource requirements of applications for deployment in the cloud to improve resource allocation and forecast performance [14]. Brendan Gregg described the USE method (Utilization, Saturation, and Errors) as a tool to diagnose performance bottlenecks [15]. Gregg's method involves checking utilization of every resource involved in the system, including CPUs, disks, memory, and more, to identify saturation and errors. Lloyd et al. provided a virtual machine manager known as VM-Scaler that integrated resource utilization profiling of software deployments to Infrastructure-as-a-Service (IaaS) cloud VMs [16]. VM-Scaler focused on the management and profiling of cloud infrastructure used to host environmental modeling web services. This work was later extended by building resource utilization models to identify the most cost-effective cloud VM types to host environmental modeling web service workloads without sacrificing runtime or throughput [17]. This effort demonstrated a cost variance of 25% for hosting these workloads across different VM types on the Amazon Elastic Compute Cloud (EC2) while identifying potential for cost savings up to $25,000 for 10,000 hours of compute time.
To characterize resource requirements of containerized tasks and pipelines, a variety of commercial and open source tools exist. The vast majority of the available tools, however, require the setup and maintenance of a complete monitoring application, including a time-series database and web application server [18]. These monitoring applications require dedicated infrastructure (i.e. servers and/or virtual machines) to run always-on daemons. Many of these tools are also oriented towards monitoring entire container clusters (e.g. Kubernetes). Access to such cluster-level monitoring tools is often restricted organizationally to system administrators and privileged users and not made freely available to any user. For container profiling, there are far fewer solutions that enable a user to easily profile the resource utilization of containerized tasks or pipelines on a local computer or personal cloud VM with minimal effort and expertise. The lack of lightweight, easy-to-use developer tools that require no setup or maintenance of a permanent monitoring application and/or database server is what motivated the creation of the Container Profiler.
CMonitor is a related tool that has been developed to support similar goals of lightweight container profiling without setup of a full monitoring application [19,20]. CMonitor is installed and run on the host and is used to profile host metrics in addition to container metrics, as the tool is not focused specifically on profiling a containerized task or pipeline. CMonitor, however, runs as an external tool which requires the user to possess detailed information about the host's operating system, runtime configuration, and Docker setup. Additionally, CMonitor does not support container profiling of ARM-based Linux VMs or servers. These systems are of interest with the advent of low-cost compute-optimized VMs based on the Graviton series of ARM CPUs (e.g. c6g and c7g) on Amazon EC2 [21,22,23]. These VMs offer performance improvements and cost savings of interest for executing bioinformatics pipelines. CMonitor is installed as a package requiring several dependencies.

Container Profiler: Overview
The Container Profiler tool supports profiling resource utilization, including CPU, memory, disk, and network metrics, of containerized tasks. Resource utilization metrics are obtained across three levels: virtual machine (VM)/host, container, and process. Our implementation leverages facilities provided by the Linux operating system, which underpins Docker containers. Development and testing of the Container Profiler described in this paper was completed using Debian-based Ubuntu Linux.
The Container Profiler collects information from the Linux /proc and /sys/fs/cgroup filesystems while a workload is running inside a container on the host machine. To support collecting metrics, the Container Profiler is implemented using Python 3 while leveraging psutil, a cross-platform library for retrieving information on running processes and system utilization [24]. It should be noted that psutil itself is not a profiling tool. Psutil assists with collecting host-level and process-level metrics from the system, but does not process metrics for time-series analysis or graphing. Psutil also does not output metrics in specific formats (e.g. JSON, CSV) or orchestrate time-series profiling. The host machine could be a physical computer such as a laptop or a virtual machine (VM) in the public cloud. The workload being profiled can be any job capable of running inside a Docker container. Figure 1 provides an overview of the various metrics collected by the Container Profiler.
Host-Level Metrics: Host/VM-level resource utilization metrics are obtained from the Linux /proc virtual filesystem using psutil. The /proc filesystem is a virtual filesystem that consists of dynamically generated files produced on demand by the Linux operating system kernel, providing an immense amount of data regarding the state of the system [25]. Files in the /proc filesystem are generated at access time from metadata maintained by Linux to describe current resource utilization, devices, and hardware configuration as managed by the Linux kernel. The Container Profiler queries the /proc filesystem directly and by using the psutil library at regular time intervals to obtain resource utilization metrics. Documentation regarding the Linux /proc filesystem is found on the /proc Linux manual pages [25], though other references provide more detailed descriptions of available metadata [26,27,28,29,30,31,32,33,34,35,36]. User-mode and kernel-mode CPU utilization metrics can be found in the /proc/stat file. Table 1 provides a subset of CPU, disk, and network utilization metrics profiled by the Container Profiler at the VM/host level.
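As a minimal sketch of the data source described above, the following Python snippet shows how user- and kernel-mode CPU times can be extracted from the aggregate "cpu" line of /proc/stat. The field layout follows the proc(5) manual page; this is an illustration of the underlying mechanism, not the Container Profiler's own implementation.

```python
# Minimal sketch: extract user- and kernel-mode CPU time from a /proc/stat
# "cpu" line. Values are in clock ticks (USER_HZ, typically 100 Hz).

def parse_cpu_times(stat_text):
    """Return (user, system, idle) tick counts from /proc/stat content."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):            # aggregate line, not cpu0/cpu1/...
            fields = line.split()
            # Field order per proc(5): user nice system idle iowait irq softirq ...
            user, nice, system, idle = (int(v) for v in fields[1:5])
            return user, system, idle
    raise ValueError("no aggregate 'cpu' line found")

# Example with a captured snippet; real usage would read open("/proc/stat").read()
sample = "cpu  4705 150 1120 1625085 200 0 30 0 0 0\ncpu0 2400 75 560 812542 100 0 15 0 0 0\n"
user, system, idle = parse_cpu_times(sample)
print(user, system, idle)   # → 4705 1120 1625085
```

Deltas between two such snapshots, divided by the elapsed ticks, yield the CPU utilization percentages reported at the host level.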
Container-Level Metrics: Docker relies on the Linux cgroup and namespace features to aggregate a set of Linux processes together to form a container. Cgroups were originally added to the Linux operating system to provide system administrators with the ability to dynamically control hardware resources for a set of related Linux processes [37]. Linux control groups (cgroups) provide a kernel feature to both limit and monitor total resource utilization of containers. Docker leverages cgroups for resource management to restrict hardware access to the underlying host machine, facilitating sharing when multiple containers share the host. Linux subsystems such as CPU and memory are attached to a cgroup, enabling control over the resources of the cgroup. Resource utilization of cgroup processes is aggregated for reporting purposes under the /sys/fs/cgroup virtual filesystem, and we leverage this filesystem to obtain container-level metrics in the Container Profiler. Cgroup files provide aggregated resource utilization statistics describing all of the processes inside a container. Container-level metrics are not available from psutil. As a profiling example, a container's CPU utilization statistics can be obtained from /sys/fs/cgroup/cpuacct/cpuacct.stat. Table 2 describes a subset of the CPU, disk, and network utilization metrics profiled at the container level by the Container Profiler.
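The cpuacct.stat example above can be sketched as follows. Under cgroup v1 the file holds two lines, "user &lt;ticks&gt;" and "system &lt;ticks&gt;", aggregated over all processes in the container's cgroup; paths and file names differ under cgroup v2 (cpu.stat). The snippet is illustrative, not the profiler's actual code.

```python
# Minimal sketch: parse container CPU accounting as exposed by cgroup v1
# at /sys/fs/cgroup/cpuacct/cpuacct.stat. Tick counts aggregate the CPU
# time of every process inside the container.

def parse_cpuacct_stat(text):
    """Return {'user': ticks, 'system': ticks} from cpuacct.stat content."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

# Example with captured content; real usage would read the file from the cgroup fs
sample = "user 1834\nsystem 422\n"
print(parse_cpuacct_stat(sample))   # → {'user': 1834, 'system': 422}
```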
Process-Level Metrics: The Container Profiler also supports profiling the resource utilization of each process running inside a container. The Container Profiler leverages support from the psutil library to capture process-level metrics from Linux. Table 3 describes a subset of the process-level metrics collected by the Container Profiler to profile resource utilization of container processes.
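Per-process metrics ultimately come from files such as /proc/&lt;pid&gt;/stat, the kind of source psutil reads on Linux. As a hedged illustration (not the profiler's code), the sketch below extracts a process's command name and its user/kernel CPU times, which are fields 14 and 15 of the stat line per the proc(5) manual page.

```python
# Minimal sketch: per-process CPU times from one /proc/<pid>/stat line.
# The command name (comm) is parenthesised and may itself contain spaces,
# so the line is split on the last ')' before splitting the numeric fields.

def parse_proc_stat(line):
    """Return (comm, utime, stime) from a /proc/<pid>/stat line."""
    head, _, tail = line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()                 # fields[0] is 'state' (field 3)
    utime, stime = int(fields[11]), int(fields[12])   # fields 14 and 15
    return comm, utime, stime

sample = "1234 (bwa mem) R 1 1234 1234 0 -1 4194304 50 0 0 0 9876 543 0 0 20 0 8 0 100 0 0"
print(parse_proc_stat(sample))   # → ('bwa mem', 9876, 543)
```

Collecting such a record for every PID in the container each sampling interval is what makes process-level profiling the most expensive verbosity level.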
Resource utilization data collected at the VM/host, container, and process levels allows characterization of resource use with increasingly greater isolation. Host-level resource metrics, for example, do not isolate background processes. This could lead to variance in measurements, as background processes on the host machine outside the container may be randomly present. Profiling at the container level allows fine-grained resource profiling of ONLY the resources used by the containerized task or pipeline. Finally, profiling at the process level allows very fine-grained profiling so that resource bottlenecks can be attributed to specific activities or tasks. The ability of the Container Profiler to characterize resource utilization at multiple levels enables high observability of the resource requirements of computational tasks. This observability can be crucial to improving job deployments to cloud platforms.

Results
We demonstrate the Container Profiler using unique molecular identifier (UMI) RNA sequencing data generated by the LINCS Drug Toxicity Signature (DToxS) Generation Center at the Icahn School of Medicine at Mount Sinai in New York [38]. The scripts and supporting files for the analytical pipeline to analyze this dataset originated from the Broad Institute [39]. In addition to downloading the datasets, there are 3 other stages. The first stage is a demultiplexing or split step that sorts the reads using a sequence barcode to identify the originating sample. The second stage aligns the reads to a human reference sequence to identify the gene that produced the transcript. The final stage is the "merge" step, which counts all the aligned reads to identify the number of transcripts produced by each gene. The unique molecular identifier (UMI) sequence is used to filter out reads that arise from duplication during the sample preparation process. In the original pipeline, only the most CPU-intensive part of the pipeline, the alignment step, was optimized and executed in parallel. We further optimized the split and align steps in the original pipeline [39] to decrease the running time from 29 to 3.5 hours in our previous work [40]. We also encapsulated the pipeline in containers in that work.

To profile resource utilization, we deployed our UMI RNA-sequencing pipeline alongside the Container Profiler on an IBM Cloud bx2d-metal-96x384 virtual machine with dual Intel Platinum 8260 CPUs at 2.4 GHz, with 96 virtual CPU cores, 384 GB of memory, and a 960 GB SATA M.2 mirrored SSD as the local boot disk. We leveraged the UMI RNA-sequencing pipeline as our case study as each stage of the RNA-seq pipeline exhibits different resource utilization characteristics. Specifically, the dataset download stage is limited by the network capacity. The split stage writes many files and is limited by the speed of disk writes. The alignment stage is performed by multiple CPU-intensive processes and performance is primarily limited by the CPU. However, it is possible that available memory capacity will limit the performance in some circumstances. The final merge stage involves reading many files in parallel, consuming both memory and CPU resources depending on the number of threads used. Despite the fact that the align stage is expected to be limited by the CPU resources, there is significant CPU-idle time during that stage. This suggests the presence of a bottleneck that may be the target for further optimization.

We collected CPU, memory, network, and disk utilization metrics at both the container and VM/host levels for the RNA sequencing analytical pipeline. These are visualized in Figure 3. Note that the x-axis depicting time in this figure encompasses the entire pipeline, incorporating all stages: download, split, align, and merge. Overall, our profiling results depict resource utilization patterns that we expected. The download stage consumes network resources. The split stage is the most disk-intensive step. The alignment and merge stages consume the most CPU resources. Our profiling data also points to areas where resource consumption may be a problem. For example, memory usage is high for all the stages. This may be due to greedy allocation by the executables, or it may indicate that allocating more memory could benefit the pipeline. Most interesting is CPU utilization during the alignment stage. Just before the 3-hour mark, we see a series of drops over the next 30 minutes, creating a ladder of 8 steps. The alignment stage uses up to 8 vCPUs to align different files of reads simultaneously. Near the end of the alignment stage, most of the files will have been processed and there will be more available vCPUs than unprocessed files. As a result, the CPU utilization drops as vCPUs lie idle waiting for the final files to be processed. However, this under-utilization of resources lasts for 30 minutes, indicating that the final files are rather large. This presents an opportunity to improve pipeline performance by splitting the processing into smaller files (which is an option in the split software), or by processing the largest files first. We would not have known about these potential optimizations without fine-grained profiling results from the Container Profiler.

Figure 3 caption: For disk usage and memory usage, the native host metric was transformed to have the same units as the container metric. All metrics have been transformed to the same units and scaled as a percentage of the maximum observed value. The four stages of the pipeline include downloading the data (download), splitting and demultiplexing the reads (split), aligning the reads to the reference (align), and assembling the counts while removing duplicate reads (merge). We observe that the container and VM-level metrics mostly overlap in the stages. However, there are differences when there are background processes, most notably when there is considerable disk usage. The alignment stage is also notable in that we can see that the CPU usage declines near the end, probably indicating that the pipeline is waiting on some slower threads (i.e. stragglers) to finish before it can proceed, indicating this stage might be improved with better load balancing, or with smaller workloads for the threads. This is an example of how the Container Profiler can be used to identify portions of the pipeline that can be optimized.

Container-level metrics can provide useful additional information
A key feature of the Container Profiler is the ability to capture container-level metrics that describe resource utilization of only the containerized task(s). We expect these metrics to be similar to the VM/host-level metrics, though they could differ given that VM/host-level metrics also encompass resources used by processes running on the host external to the container and pipeline. Since we only ran our pipeline on a dedicated test VM, the container metrics should be very similar to the VM/host metrics, which was in fact the case in our observations. However, one can see differences between the disk utilization metrics during the split and alignment stages where there are a large number of disk writes to the host file system. Docker manages these disk writes by providing the container with an internal mount point which is eventually written to a host file. The caching and management of this data is external to the container and is not captured by the container metrics, but is captured by the host metric. In addition, during the alignment stage, intermediate results from the aligner are continuously piped to another process which then re-formats the intermediate output and writes the final output to a file on the host system. Multiple threads are used, more than the available number of cores, resulting in frequent context switches. The pipe management and context switching are also handled by the operating system and are captured by the host metrics but not the container metrics. The separation of container- and OS-based consumption can be useful, for example, when trying to assess effects due to resource contention that may occur when multiple jobs are run on the same physical host, which often happens on public clouds where the assignment of instances to hosts is controlled by the vendor.

Container Profiler can sample container and host metrics with sub-second resolution
For the Container Profiler to be useful, the collection of profiling metrics must have sufficiently low overhead to enable rapid sampling of resource utilization, so that many samples can be collected for time series analysis. The time required to collect the metrics limits the granularity of the profile. Achieving 1-second time series sampling requires the ability to repeatedly sample resource utilization every 1 second (1000 ms). However, profiling time is not constant, and depends on the state of resources being utilized by the containerized pipeline and the host. The variability of profiling time is shown in the histogram in Figure 4. When profiling our RNA-sequencing pipeline, VM-level and container-level profiling had a bi-modal distribution, while process-level sampling had a tri-modal distribution. The slowest profiling was observed during the stressful compute-bound alignment stage of the pipeline. For all levels of profiling verbosity, the Container Profiler was able to profile resource utilization in less than 100 ms. The longest profiling time and highest variation was for process-level profiling, as metrics are collected for each process in the pipeline. The number of processes can vary throughout the execution of complex parallel pipelines, as was the case for the align stage of our RNA-sequencing pipeline. Our RNA-sequencing pipeline featured a maximum of 85 concurrent processes during the align stage. These processes ran for approximately 39% of the duration of the align stage. The time required to capture host- and container-level metrics was less variable as the number of metrics collected is fixed. As shown in Figure 4, 90% of the time, the container and host level metrics were collected in less than 63 milliseconds and always under 75 milliseconds. The process metrics do take longer to collect, but still less than 100 milliseconds in the worst case. Profiling at the process level involves collecting all metrics every second. For profiling our UMI RNA-sequencing pipeline use case, which required 2.5 hours to execute with one-second sampling and full profiling verbosity (process-level metrics), 9,000 JSON files were collected, requiring 296 MB of storage space.

Container Profiler has lower overhead than the variation in pipeline execution time on public clouds
A design objective for the Container Profiler is to not significantly impact the performance of the pipeline being profiled. Failing to realize this objective may result in the overhead from resource profiling impacting the collected metrics. While some overhead is unavoidable, ideally it should be lower than the inherent variation of pipeline execution time on the public cloud.
To measure the performance impact of resource utilization profiling when running the RNA-seq pipeline, we initially attempted to assess the overhead using Amazon Elastic Compute Cloud (EC2) cloud VMs. However, we discovered that the runtime of the RNA-seq pipeline varied by more than 5% on Amazon EC2, which was more than 5x greater than the overhead of the Container Profiler. This degree of performance variance made it difficult to evaluate the performance overhead of the Container Profiler, since we could not easily distinguish between pipeline performance variance and profiling overhead on EC2. We then measured the performance overhead of the Container Profiler by profiling the pipeline using the IBM Cloud bx2d-metal-96x384 server, which had performance variance around 1%. Figure 5 depicts the overhead from one-second resource utilization sampling by the Container Profiler for the RNA-seq pipeline on the IBM metal server. IBM metal servers are private and not shared with multiple users. Running on this isolated server greatly reduced the performance variance of running RNA-seq. We measured the worst-case overhead for the Container Profiler to be 0.71%, which equates to about 3.4 minutes for an 8-hour pipeline with full verbosity metrics collection (VM + container + process). Overhead is reduced to as little as 0.07%, or about 20 seconds for an 8-hour pipeline, when only collecting VM-level metrics. Adding container-level, and especially process-level, metrics slightly increased the runtime of the RNA-seq pipeline. We believe that this profiling overhead is within an acceptable level and note that even at maximum profiling verbosity, it is substantially less than the observed performance variance for running our RNA-seq pipeline on a public cloud VM. Users can reflect on our reported overhead times to make informed decisions when planning to profile their own pipelines. The software to be profiled can be installed using an install script, or the Container Profiler can be installed on top of the original container image. The user provides a sampling interval and profiling arguments to initialize profiling.
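The overhead percentages above translate into wall-clock cost as a quick arithmetic check:

```python
# Quick check of the reported profiling overhead for an 8-hour pipeline run.
pipeline_hours = 8

full_verbosity_overhead = 0.0071   # 0.71%: VM + container + process metrics
vm_only_overhead = 0.0007          # ~0.07%: VM-level metrics only

print(round(full_verbosity_overhead * pipeline_hours * 60, 1))   # → 3.4 (minutes)
print(round(vm_only_overhead * pipeline_hours * 3600))           # → 20 (seconds)
```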

Methods

Implementation Details
The Container Profiler is implemented as a collection of Bash and Python scripts. Figure 6 provides an overview. There are three basic use cases for building a Docker image for the Container Profiler. The first use case allows users to profile an existing Docker container by providing a Dockerfile which specifies their own setup and software installation inside the container. This is the simplest approach to profiling when the user has a working Dockerfile. The other two use cases support users who do not know how to write Dockerfiles but are familiar with writing Bash scripts. The second use case gives users the ability to install all software inside the Docker image when the software installation becomes too complicated to put in the Dockerfile. In other words, it puts the required installation commands into a script that will be executed by the Dockerfile. For the third use case, the user provides their own executable bash script as the entry point in the Docker container. This use case can help the user simplify a set of commands they have to profile. In this case, the user just puts a set of commands into an executable script file and runs it as the entrypoint of the container. When the Container Profiler is executed inside a Docker container, it snapshots the resource utilization for the host (i.e. VM), container, and all processes running inside the container, producing output statistics to a .json file. A sampling interval (e.g. once per second) is specified to configure how often resource utilization data is collected to support time series analysis of containerized applications and pipelines. Time series data can be used to train mathematical models to predict the runtime or resource requirements of applications and pipelines. Time series data can be visualized using matplotlib Python graphing scripts that are included with the Container Profiler.
To improve the periodicity of time series sampling, we continuously subtract the most recent observed run time of the Container Profiler's sample collection from the configured sampling interval (e.g. 1 second) in rudataall.py. This approach notably improved the periodicity of sampling when the container was under load, improving our ability to obtain samples at evenly spaced intervals. To enable addressing any potential drift of sample collection times, we capture timestamps for when each resource utilization metric is sampled in the output JSON. These timer ticks enable precise calculation of the time that transpires between resource utilization samples for each metric. This allows the rate of consumption of system resources (e.g. CPU, memory, disk/network I/O) to be precisely determined throughout the pipeline's execution. The Container Profiler consists of the profiling script and two supporting scripts (for installation and pipeline execution) depicted in Figure 6: profiler.sh, install.sh, and execute.sh.
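The drift-corrected sampling described above can be sketched as a simple loop: the time spent collecting a sample is subtracted from the configured interval so that samples stay evenly spaced under load. This is an illustrative sketch; `collect_sample` is a stand-in for the profiler's actual metric collection.

```python
# Sketch of drift-corrected periodic sampling: sleep only for the portion of
# the interval not already consumed by collecting the sample.
import time

def next_sleep(interval, elapsed):
    """Seconds remaining in this interval after collection; never negative."""
    return max(0.0, interval - elapsed)

def sampling_loop(collect_sample, interval=1.0, n_samples=3):
    for _ in range(n_samples):
        start = time.monotonic()
        collect_sample()                      # e.g. snapshot /proc and cgroup metrics
        elapsed = time.monotonic() - start
        time.sleep(next_sleep(interval, elapsed))

# A 63 ms collection leaves roughly 937 ms of the 1-second interval to sleep
print(round(next_sleep(1.0, 0.063), 3))   # → 0.937
```

If collection ever exceeds the interval, the loop simply proceeds immediately rather than sleeping a negative duration.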
The profiler.sh script is the primary script that generates profiling information in JSON format describing the resource utilization of the containerized task. The profiler.sh script requires the user to provide a command or a set of commands, along with arguments, to start profiling. This script internally invokes another Python script, rudataall.py.
The rudataall.py script collects the resource utilization data. Specifically, this script takes a snapshot of the resource utilization metrics and records output to a JSON file, using the time of the sample as a unique filename. The script accepts the parameters -v, -c, and -p to inform the tool what type of data to collect: VM-, container-, and/or process-level metrics, respectively. The default behavior when running this script without any parameters is to collect all metrics.
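As a simplified illustration of how VM-level CPU metrics such as vCpuTimeUserMode and vCpuTimeKernelMode (Table 1) can be read from /proc/stat, consider the following sketch. The parsing shown here is ours for illustration; the actual rudataall.py implementation may differ:

```python
def parse_proc_stat(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into named fields.
    Values are cumulative CPU time in clock ticks (USER_HZ)."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":   # aggregate line, not cpu0/cpu1/...
            names = ["user", "nice", "system", "idle",
                     "iowait", "irq", "softirq", "steal"]
            return dict(zip(names, map(int, fields[1:1 + len(names)])))
    raise ValueError("no aggregate 'cpu' line found")

# Example /proc/stat excerpt (values in clock ticks)
sample = "cpu  74608 2520 24433 1117073 6176 4054 0 0 0 0\n"
metrics = {"vCpuTimeUserMode": parse_proc_stat(sample)["user"],
           "vCpuTimeKernelMode": parse_proc_stat(sample)["system"]}
```

Each metric in Tables 1-3 is sourced from a procfs file in this manner, so a single snapshot amounts to reading and parsing a set of small kernel-provided text files.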
The profiler.sh script only works if the workflow/software is already installed in the containerized environment; workflows/software that have not been containerized cannot be profiled. The Container Profiler therefore provides an option that enables users to install software in a container using the install.sh script. Users provide a set of commands in the install.sh script to install the dependencies and software they wish to profile. Once installed, the user can run the profiler.sh script against the newly installed software. To profile the resource utilization of a Bash script, users can specify a series of commands in the optional execute.sh script to configure profiling. Some users may be more familiar with editing Dockerfiles than Bash scripts; we also support users providing their own Dockerfile to build a custom container to be profiled.

Technical details using our scripts
To use the Container Profiler scripts with any container, a Linux-based Docker container that encapsulates a script or job to run inside is required. To configure the Container Profiler to profile the container, users can optionally provide an executable script inside the Container Profiler, which is specified when running the build.sh script. In the executable script, the user launches the container's job or task to be profiled.
The profiler.sh script has four different modes: profile, delta, csv, and graph. For the profile mode, there are two required parameters: the output directory, which specifies the location of generated profiling output, and the sampling time interval. For the delta mode, there are two required parameters: the input directory, which contains the original raw JSON files, and the output directory, where the delta JSON files will be written. The delta mode also provides an option that allows users to specify the modification operator for performing the delta. The default delta operator calculates the difference between two samples (i.e., final minus initial value). The typical use case is to calculate the delta of resource utilization between the first and last sample to capture the full resource utilization of a task or pipeline. Other operators include max, min, and average, which determine the maximum, minimum, and average values of metrics across a set of JSON files.
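The delta operators can be illustrated with a short sketch. The metric names below are hypothetical and the real JSON schema is richer, but the reductions are the same:

```python
def apply_delta(samples, op="delta"):
    """Reduce a time-ordered list of metric dicts to a single dict using
    the selected operator: delta (final - initial), max, min, or average."""
    keys = samples[0].keys()
    if op == "delta":
        return {k: samples[-1][k] - samples[0][k] for k in keys}
    if op == "max":
        return {k: max(s[k] for s in samples) for k in keys}
    if op == "min":
        return {k: min(s[k] for s in samples) for k in keys}
    if op == "average":
        return {k: sum(s[k] for s in samples) / len(samples) for k in keys}
    raise ValueError(f"unknown operator: {op}")

# Three one-second samples: a cumulative counter and a gauge
samples = [{"vCpuTimeUserMode": 100, "vMemoryFree": 4000},
           {"vCpuTimeUserMode": 180, "vMemoryFree": 3500},
           {"vCpuTimeUserMode": 250, "vMemoryFree": 3800}]
```

The default delta operator suits cumulative counters such as CPU time, while max/min/average are more meaningful for gauges such as free memory.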
For the csv mode, there are two required parameters: the input directory, which contains processed JSON files in delta format, and the name of an output CSV file into which all resource utilization data from the processed JSON files will be aggregated.
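A minimal sketch of this aggregation step follows, assuming each delta JSON holds a flat dictionary of metrics; the real csv mode reads the files from a directory rather than from strings:

```python
import csv
import io
import json

def jsons_to_csv(json_texts):
    """Aggregate a list of delta-JSON documents into one CSV string,
    one row per JSON document, columns taken from the first document."""
    rows = [json.loads(t) for t in json_texts]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=sorted(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

docs = ['{"cCpuTime": 12, "cDiskReadBytes": 4096}',
        '{"cCpuTime": 30, "cDiskReadBytes": 8192}']
```

The resulting CSV places each profiling sample on its own row, which is the shape expected by the graph mode and by spreadsheet tools.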
For the graph mode, there are two required parameters: the input CSV file capturing all resource utilization data from processed JSON files, and the output directory for writing graph files. In addition, there are a few other options, such as one that specifies whether to plot the curves together or in separate graph files.

Visualization
In graph mode, the Container Profiler also provides an option to specify the creation of time-series graphs. The graphing configuration file supports multiple settings that specify how to generate the graph(s). Each graph configuration file should start with a line containing these components: ### followed by the title and the y-axis label. This is followed by line(s) that describe the metric(s) the user wants to output in a single graph (one metric per line). As a starting point, a default graph configuration file, graph.cfg, is provided in the cfg directory.
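The configuration format described above can be parsed with a few lines of Python. This sketch reflects our reading of the format, with a single-word title for simplicity; the shipped graphing scripts may parse it differently:

```python
def parse_graph_cfg(cfg_text):
    """Parse a graph.cfg-style config: a '### <title> <y-label>' header
    line followed by one metric name per line."""
    lines = [l.strip() for l in cfg_text.splitlines() if l.strip()]
    if not lines or not lines[0].startswith("###"):
        raise ValueError("config must start with a '###' header line")
    header = lines[0].lstrip("#").split()
    title, ylabel = " ".join(header[:-1]), header[-1]
    return {"title": title, "ylabel": ylabel, "metrics": lines[1:]}

cfg = """### CPU_utilization percent
vCpuTimeUserMode
vCpuTimeKernelMode
"""
```

Each metric listed after the header becomes one curve in the generated graph.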
We are writing to submit a revised manuscript as a technical note for GigaScience. Our original submission (GIGA-D-20-00159) was titled "Profiling Resource Utilization of Bioinformatics Workflows". The revised paper's title is now "Container Profiler: Profiling Resource Utilization of Containerized Big Data Pipelines".
Our article describes a new tool that helps identify performance bottlenecks to aid in optimizing the performance and costs of cloud computing for bioinformatics applications deployed using containers. This paper presents the Container Profiler, a software tool that measures and records the resource usage of any containerized task. Our tool profiles the CPU, memory, disk, and network utilization of any containerized task by collecting Linux operating system metrics at the virtual machine, container, and process levels. The Container Profiler can produce utilization snapshots at multiple time points, allowing for continuous monitoring of the resources consumed by a container workflow. For this resubmission, we have developed a new version of the tool that addresses performance issues and critiques from the initial manuscript submission.
To illustrate the utility of the Container Profiler, as a case study we examine the resource utilization of a multi-stage bioinformatics analytical pipeline for RNA sequencing data using unique molecular identifiers (UMI). We examine and visualize resource utilization metrics across the four stages of this pipeline: download, split, align, and merge. We also measured the profiling overhead introduced by the Container Profiler to investigate its impact on runtime at different profiling verbosities.
Below we describe how the revised manuscript addresses feedback from the initial submission. Responses to the reviewers appear in italic text.

Reviewer 1
* Please modify the figures in the paper (and in the tool, if applicable) to use color palette(s) that are distinguishable to people with red/green color blindness. The current figures are not colorblind friendly.
* In Figure 2, there is no reason for the graphs to be 3-dimensional. In fact, it makes it more difficult to see the sizes of the subsets and to compare the bars (see https://serialmentor.com/dataviz/no-3d.html). It is also difficult (for me) to see the colors in the legend of the graph, so it's difficult to interpret the meaning of the bars.
* It would be helpful to discuss the tradeoff between frequent polling and data-storage issues. Is all the profiling data stored in memory? For example, how much memory is required to store data that are collected at 1-second intervals for an hour-long process? At what point should users be concerned about memory usage (or the size of the resulting JSON files)?
We have included details in the Results section under the subheading 'Container Profiler can sample container and host metrics with sub-second resolution'.
We profiled our UMI RNA-sequencing pipeline use case, which required ~2.5 hours to execute. This pipeline featured up to 85 concurrent processes. Profiling with one-second sampling and full profiling verbosity (collection of all metrics: process-level + container-level + host-level) produced ~9,000 JSON files requiring 296 MB of storage space. We believe 296 MB is not a significant amount of storage space given the size of modern-day storage systems.
* The GitHub repository should have a file that indicates the terms of the open-source license. As it is, the README file just says "Copyright," but the terms of reuse or liability are unclear. GitHub provides options for selecting a license.
The MIT License is now included.
* It would be very helpful if the README page on GitHub had a brief tutorial on how to install and use the tool. The paper provides some insight on this, but it would be helpful to provide more detailed and specific instructions there.
The GitHub description has been expanded to include instructions on how to install and use the tool. In addition, a series of videos has been posted on a YouTube channel that is linked from the project's GitHub.

Container Profiler YouTube Channel:
https://www.youtube.com/@containerprofiler6371
* I was not able to test the app because I was unclear on how to do it. The paper mentions scripts called processpack.sh, runDockerProfile.sh, ru_profiler.sh, and rudataall.sh, but I do not see all of these in the GitHub repository. Although the functionality sounds very interesting in principle, I cannot be confident that this tool is useful to the research community until I can try it myself. It would be very useful to provide a full working example of using the Container Profiler for one or more real-world data-processing tasks, such as the one described in the paper. (Also, there are some extra files in the GitHub repo that seem unnecessary, like an empty file called "1", etc.)
The tool has been redesigned for the paper resubmission. Figure 6 now describes the structure.

The GitHub includes installation instructions as well as profiling examples to help users carry out a variety of profiling tasks depending on the profiling use case.
* The paper mentions the ability to produce visualizations, but there are few details about how to do this. I am also unsure about this because I am not sure what software is used to create the visualizations or how to install it (or whether there is some way of doing this within a Docker container). Please provide information on this so that I and others can do that.
The GitHub now includes instructions on how to generate graphs. In addition, a YouTube video has been created to demonstrate and describe how to generate graphs with the Container Profiler: https://youtu.be/cI8D4JRuyjw

Reviewer 2
While resource monitoring is certainly an important technique, this paper does not present any new method. Essentially, the monitoring is already done in the Linux kernel, and the provided tool simply reads out those values and reports them without any further analysis. There is no discussion of any particular challenge or problem encountered in doing so.
While the Linux kernel provides profiling metrics, they are not organized, archived, or aggregated in a usable way. In particular, profiling metrics are not accessible to novice or casual users, such as those who may seek to profile bioinformatics pipelines. The paper discusses the requirement for a fast tool capable of time series sampling at a one-second interval. The tool must be able to profile all metrics in under one second while not increasing the execution time of the pipeline being profiled.
The performance is quite slow (typically ~1 s per sample) and the paper does not explore or explain why over 10% of the process-level measurements take ~9 s. The reader may guess that part of the problem lies in invoking a new shell script that invokes a new Python process to open and access a large number of files.
To address performance issues with the Container Profiler noted in the previous paper submission, we have rewritten the tool, reducing sampling time from 9,000 milliseconds for full verbosity sampling (process + container + VM metrics) to under 100 milliseconds, a 90x performance improvement over the previous version. The tool can now profile any containerized pipeline while sustaining one-second periodicity of the sampling interval.
Further, a quick Google search for "container resource monitoring" shows a large number of commercial solutions in this space, which are not discussed. There are a number of open problems related to monitoring at high frequencies, monitoring short processes, monitoring policy violations, and the like, but this paper does not explore them.
While there are a number of commercial and open-source tools, the majority of the available tools require the setup and maintenance of a complete monitoring application, including a time-series database and web server. This effort can be considerable, creating a technical hurdle for users. Our intent is to offer a profiling solution that is lightweight, with minimal-to-no setup required. Biomedical scientists seeking to profile computational pipelines should not require extensive database or web server administration skills. These burdens reduce the likelihood that biomedical scientists will successfully profile their pipelines, preventing them from gaining a better understanding of how to optimize them for deployment to clouds and clusters.
The second half of the paper gives a suggestion as to what can be done with resource measurements, which is a more promising area on which to focus research. A case study is given in which CPU utilization is much lower than expected for over two hours, and this is indicated as a problem. It may very well be, but any adjustment to the configuration of a program always involves a tradeoff between time, scale, and other resources consumed. This is a complex multi-constraint problem worthy of some study, but this paper does not address it.
Since the initial submission of the paper, errors in the merge step of the UMI RNA-sequencing use case have been corrected. The multi-hour period of CPU idle time is gone. The pipeline now has only a short idle period at the end of the alignment phase before completing the merge phase, which is also shorter.

Reviewer 3
The Container Profiler is implemented as a collection of Bash and Python scripts that collect performance data, including CPU, memory, disk, and network-related monitoring metrics. However, cache misses, branch prediction, instructions per cycle, and other hardware-level performance metrics also play a key role in detecting program bottlenecks. Performance collection tools such as Perf [1] already include many of the metrics mentioned in the paper. It is recommended to refer to Perf when optimizing the Container Profiler.
Process-level, container-level, and host-level memory page faults and major page faults are reported. Profiling cache misses, branch predictions, and instructions per cycle is primarily considered a low-level code profiling task performed at the source level, not at the container level. Providing simultaneous profiling of these metrics for all processes running concurrently inside a container is beyond the scope of the tool. A developer looking for this information can consider running the Perf tool to directly profile their code.
The Container Profiler is a tool for online data collection and offline visual analysis. It is recommended to add visualization functions such as real-time display of CPU and memory utilization. For details, please refer to the GPU performance tool Nvprof [2].
Adding real-time display features is not presently a goal or objective for the Container Profiler. Our focus has been to enable profiling of total resource utilization deltas to observe the total resource footprint, and to perform time series sampling of computational pipelines. Real-time display of metrics is possible, but the creation of a GUI and display add-on is beyond the scope of this paper.
The authors demonstrate the Container Profiler using only one workload. Is it validated on other workloads? Only one dataset may not be convincing enough.

Yes, we have validated the Container Profiler on multiple computational tasks (or workloads).
Specifically, the Container Profiler is used to profile a four-stage unique molecular identifier (UMI) RNA sequencing analytical pipeline comprising distinct tasks: download, split, alignment, and merge. Each of these stages is implemented in a separate container and represents a distinct computational step in the analysis pipeline. Additionally, we have applied the Container Profiler to common Linux benchmarks including sysbench and pgbench. These benchmarks are presented as examples in the GitHub instructions and videos. In light of this comment, we define the terms "pipelines" and "workflows" in the first paragraph of the paper.
The profiling tool saves the collected data in JSON files. If the workload runs for a long time and the sampling interval is 1 s, how do you deal with JSON text length overflow? It is recommended to save the data directly to a database or to periodically clean old (or useless) data from the JSON files.
The maximum supported size for a single JSON file is approximately 4 GB. On average, JSON files produced by the Container Profiler with full verbosity (host + container + process) are approximately 33 KB, so JSON text overflow should not be an issue. The Container Profiler includes an option to purge existing files when repeatedly running an identical pipeline. Profiling our full UMI RNA-sequencing pipeline with full profiling verbosity (collection of all metrics: process-level + container-level + host-level) and one-second sampling produced ~9,000 JSON files requiring 296 MB of storage space. This pipeline featured up to 85 concurrent processes. For a 24-hour pipeline with a similar number of concurrent processes, we estimate a data storage requirement of ~3 GB with full verbosity. For long-running pipelines, the user can reduce the storage requirements of profiling by adjusting the profiling interval from 1 second to 1 minute; this would reduce the storage requirement for a 24-hour job to approximately 50 MB.
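The storage estimates above follow from simple arithmetic, sketched below. The ~33 KB average per full-verbosity sample is the only measured input; actual file sizes vary with the number of concurrent processes:

```python
def storage_estimate_mb(duration_hours, interval_seconds, kb_per_sample=33):
    """Estimate profiling storage: one JSON file per sampling interval."""
    n_samples = duration_hours * 3600 / interval_seconds
    return n_samples * kb_per_sample / 1024  # KB -> MB

# 2.5-hour pipeline at 1-second sampling: ~9,000 samples, ~290 MB
# 24-hour pipeline at 1-second sampling:  ~2.7 GB
# 24-hour pipeline at 1-minute sampling:  ~46 MB
```

Lengthening the sampling interval reduces storage linearly, which is why moving from one-second to one-minute sampling brings a 24-hour job down to tens of megabytes.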
The authors evaluate the overhead of the profiling tool, but they only verify the impact on the running time of the workload. Will it interfere with other metrics such as CPU or memory utilization?
We did not measure the percentage increase in CPU, disk, or network metrics. Determining baseline CPU, disk, and network utilization itself requires a profiling tool. It might be possible to create a tool that measures only one metric, and then repeat the profiling task many times to check the percentage increase in metrics from the baseline one-metric-tool measurement versus the Container Profiler. However, for an 8-hour dataset, full profiling verbosity with one-second sampling added just 3.4 minutes to the UMI RNA-Seq pipeline's execution time, an increase of only 0.71%. We do not believe the effort to measure the percentage increase in metric utilization from baseline would be particularly informative: utilization will increase, but likely not by much if runtime increases by less than 1%. The goal of the Container Profiler here has been to profile containerized genomics computational pipelines.

Figure 1 .
Figure 1. Overview summarizing the resource utilization metrics (61 total) collected by the Container Profiler across three levels (i.e., host/VM, container, and process level) and four categories (i.e., CPU, memory, network, and disk). Process-level metrics are depicted in red and prefaced with a lowercase "p", container-level metrics in yellow and prefaced with a lowercase "c", and host/VM-level metrics in blue and prefaced with a lowercase "v".

Figure 2 .
Figure 2. CPU utilization graph for the four stages (i.e., download, split, align, and merge) of the UMI RNA-seq pipeline. This graph depicts the percentage of CPU utilization in each CPU mode. CpuUsr (shown in green) captures time the pipeline spent executing its source code. CpuKrn (shown in yellow) captures time when the processor executed code in the Linux kernel. Typically, the kernel is invoked to support disk and network I/O, which are considered privileged operations. CpuIdle (shown in blue) is unused time across the 8 available CPU cores throughout each stage. CpuIdle time is common when waiting for disk or network I/O to complete. High CpuIdle time during computational stages indicates potential for performance optimization through better parallelization of code. CpuIOWait (shown in maroon) depicts CPU time where the pipeline was waiting for I/O (disk or network) to complete. CpuSftIntSrvc (shown in magenta) is time spent handling soft interrupts, which commonly occur with network I/O.

Figure 2
Figure 2 summarizes the CPU utilization characteristics of the different stages of the UMI RNA-seq pipeline. The CPU usage profile is consistent with our expectations. The align and merge steps are expected to be bound by CPU resources, and they indeed spent the majority of their time executing source code. Download is limited by network bandwidth, and the split stage by disk I/O; hence, cpuIdle time is highest in these stages.
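Per-mode CPU percentages such as those plotted in Figure 2 can be derived from successive cumulative tick samples. The sketch below illustrates the computation with assumed field names; it is not the tool's exact code:

```python
def cpu_mode_percentages(prev, curr):
    """Convert two cumulative per-mode CPU tick samples into the
    percentage of elapsed CPU time spent in each mode."""
    deltas = {k: curr[k] - prev[k] for k in prev}
    total = sum(deltas.values())
    return {k: 100.0 * d / total for k, d in deltas.items()}

# Two consecutive samples of cumulative ticks across all cores
prev = {"cpuUsr": 1000, "cpuKrn": 200, "cpuIdle": 8000, "cpuIOWait": 100}
curr = {"cpuUsr": 1600, "cpuKrn": 300, "cpuIdle": 8200, "cpuIOWait": 200}
```

Because the counters are cumulative, differencing adjacent samples yields the utilization within each sampling interval, and the percentages always sum to 100.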

Figure 3 .
Figure 3. Output graphs comparing container- and VM (host)-level metrics over time for a multi-stage RNA sequencing data pipeline. Four output graphs are shown: disk writes (top left), CPU usage (top right), network usage (bottom left), and memory usage (bottom right). In each graph, the container-level metrics are shown in red and the VM (host)-level metrics are shown in blue. For disk usage and memory usage, the native host metric was transformed to have the same units.

Figure 4 .
Figure 4. Distribution plot (log scale) of the time required to collect profiling data. We profiled resource utilization of the RNA-sequencing pipeline on an IBM Cloud bx2d-metal-96x384 virtual machine with dual Intel Platinum 8260 CPUs at 2.4 GHz, 96 virtual CPU cores, 384 GB of memory, and a 960 GB SATA M.2 mirrored SSD as the local boot disk. We executed the complete RNA-seq pipeline four times to profile 1) only VM/host metrics, 2) VM/host and container metrics, 3) all metrics, and 4) no metrics, by running the pipeline in the absence of the profiler. Plots depict the time to collect resource utilization samples at one-second intervals with the Container Profiler while running the entire RNA-seq pipeline. The time to collect 9,000 samples of each type (process-level, container-level, and VM-level) is shown. 99.95% of process-level samples were collected in under 100 milliseconds, all container-level samples were collected in under 74 milliseconds, and all VM-level samples were collected at or under 60 milliseconds. The figure shows the process-level, container-level, and VM-level profiling time distribution over 120 milliseconds on the x-axis. The 90th percentiles for sample collection are shown.

Figure 5 .
Figure 5. This figure depicts the profiling overhead of the Container Profiler and the resulting percentage increase in the total runtime of the entire RNA-seq pipeline. The increases in runtime are very modest: host/VM only (0.07%), host/VM + container (0.09%), and host/VM + container + process (0.71%). Error bars depict one standard deviation from the average. The standard deviation of pipeline runtime across 5 runs of the RNA-seq pipeline on the IBM bx2d-16x64 virtual machine with no profiling was 1.38%, approximately 194% greater than the worst-case overhead of the Container Profiler when profiling with full verbosity (i.e., collecting all metrics).

Figure 6 .
Figure 6. Profiling scripts used in the implementation of the Container Profiler. All scripts are deployed inside the container alongside the software being profiled.

Figure 1 ,
Figure 1, Figure 2, and Figure 3 have been recreated and recolored after checking with a color-blindness simulator: https://www.color-blindness.com/coblis-color-blindness-simulator/. The other images feature only limited use of colors.

Figure 2
Figure 2 has been recreated and recolored and is no longer 3-dimensional. The size of the color boxes in the legend has been quadrupled. Colors appear left-to-right in the legend and top-to-bottom in the graph.

Table 1 .
Selected CPU, disk, and network utilization metrics profiled at the VM/host level.
vCpuTimeUserMode: Time the CPU spent executing in user mode (/proc/stat)
vCpuTimeKernelMode: Time the CPU spent executing in kernel mode (/proc/stat)

Table 2 .
Selected CPU, disk, and network utilization metrics profiled at the container level.

Table 3 .
List of important metrics for profiling process resource utilization.
pNonvoluntaryContextSwitches: Number of involuntary context switches (/proc/[pid]/status)
pBlockIODelays: Aggregated block I/O delays (/proc/[pid]/stat)
pResidentSetSize: Number of pages the process has in real memory (/proc/[pid]/stat)
We encapsulated each step in the pipeline in separate Docker containers to facilitate deployment and ensure reproducibility.

Table 4 .
The Container Profiler's four different modes. In profile mode, the output directory specifies the location of generated profiling output in JSON format, and the time interval specifies a time series sampling interval in milliseconds. The profiler generates a JSON file at the beginning and the end of the process if the sampling interval is set to zero; otherwise, the profiler generates a JSON file at each sampling interval. The Container Profiler also collects static metrics, which typically describe hardware characteristics. The profiler first checks whether a static information file (static.json) exists. If it is missing, the profiler captures the static parameters and writes out the static information file at the start of profiling. By default, 11 static metrics are captured. They include: the host's kernel info, the host's CPU type, CPU Level 1 instruction cache size, CPU Level 1 data cache size, CPU Level 2 cache size, CPU Level 3 cache size, host boot time, host VM ID, the number of CPU cores available to the container, and the container ID.
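The static-metrics caching behavior described above can be sketched as follows. This is a minimal illustration using hypothetical field names; the real static.json contains the metrics listed above:

```python
import json
import os

def ensure_static_metrics(path, collect):
    """Write static (hardware) metrics to `path` once; later profiling
    runs reuse the cached file instead of re-collecting."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    metrics = collect()
    with open(path, "w") as f:
        json.dump(metrics, f)
    return metrics
```

Because hardware characteristics do not change between samples, capturing them once at the start of profiling avoids repeating the work on every sampling interval.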