Hot-starting software containers for bioinformatics analyses



Abstract
Using software containers has become standard practice to reproducibly deploy and execute biomedical workflows on the cloud. We demonstrate that hot-starting, from containers that have been frozen after the application has already begun execution, reduces the costs of cloud computing by avoiding repetitive initialization steps. The method is widely applicable and can provide substantial savings both for small jobs and for large-scale deployments using automated schedulers.
With the availability of high-throughput next-generation sequencing technologies and the subsequent explosion of big biomedical data, processing these data has become a major challenge. Cloud computing plays an important role in addressing this challenge by offering massively scalable computing and storage, data sharing, and on-demand access to resources and applications1,2. The National Institutes of Health is launching a Data Commons Pilot Phase to provide access to and storage of biomedical data and bioinformatics tools on the cloud (https://commonfund.nih.gov/bd2k). Additionally, software containers such as Docker have become increasingly popular for deploying bioinformatics workflows on the cloud.
When containers are deployed, applications are launched de novo each time the container is spun up. This means that any initial preparatory steps are repeated each time the container is used. For applications such as the alignment of reads, these initial steps can be quite substantial, as an entire reference genome is read in and indices are generated. In an automated large-scale deployment, these steps are replicated many times. It would be far more efficient if one could "checkpoint" and save containers in states where the application has already completed its initialization steps, so as to avoid unnecessary repetition. One could then "hot-start" workflows from these checkpoints. This is analogous to hot-start PCR, where all the necessary reagents are pre-mixed, awaiting only the addition of the template.
Our key idea is to save and restore memory states in software containers using the Checkpoint/Restore In Userspace (CRIU) tool. CRIU freezes a running container and saves the checkpoint as a collection of files on disk (https://criu.org/Main_Page). These files can subsequently be used to restore and resume the application from that checkpoint. CRIU was originally developed for Linux but has recently become available for Docker (https://criu.org/Docker).
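As a rough sketch of what this looks like with Docker's experimental CRIU integration (the container name `myapp`, image name `myimage`, and checkpoint name `after-init` are hypothetical; the Docker daemon must have experimental features enabled and CRIU installed):

```shell
# Start a long-running container (placeholder image and name).
docker run -d --name myapp myimage

# Freeze the container's full process state to disk as a named checkpoint.
# By default this stops the container; add --leave-running to keep it up.
docker checkpoint create myapp after-init

# Later -- or on another host that has the same image and checkpoint
# files -- resume the container from the saved state instead of cold-starting.
docker start --checkpoint after-init myapp
```

Because the checkpoint is an ordinary collection of files, it can be copied to another machine and restored there, which is the basis of the container migration discussed below.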
While it is possible to stop Docker containers with native docker commands, this process does not preserve the memory state. Although re-starting from a ready-to-go state is an intuitive application of checkpointing, we have been unable to find any previous description of using checkpointing as a general method for improving the efficiency of container deployments.
We demonstrate that hot-starting from a saved container checkpoint can significantly reduce the execution time of the STAR aligner5,6 for RNA-seq data analyses. We chose STAR as a proof-of-concept example because it has an option to save an intermediate state. However, our idea of using checkpoints has broad applicability for optimising the performance of containerized cloud deployments of any bioinformatics task into which a pause can be inserted to capture a re-usable state.
The STAR aligner5,6 consists of two steps. In the first step, genome indices are generated using the reference genome as input. In the second step, read sequences from a specific experimental sample are mapped to the reference genome, assuming that the genome indices have already been generated. In particular, STAR has the option of keeping the indices in memory after they have been generated, to avoid repeating the first step when multiple files are to be aligned to the same reference genome. We used the CRIU tool to create checkpoints after the first step of generating genome indices. Instead of launching a new container and starting STAR from scratch, we restore the container state using CRIU and resume running STAR after it has loaded the indices. Figure 1 shows an overview of our approach with and without using checkpoints.
We deployed both approaches on virtual machine instances from Amazon Web Services (AWS) and Microsoft Azure. Both the Azure File Storage and the Amazon Elastic Block Store (EBS) represent network disks. We observe that our hot-start containers (orange and grey bars) provide a major reduction in execution time, especially on local disks.
In this article, we have presented a novel idea for optimising cloud deployments by using checkpointing to save containers in which the application has already started. Using CRIU for Docker, we can save a container with a preloaded genome for STAR alignment and restore the container from these checkpoint files to any host. We have achieved successful migration of checkpointed containers to different virtual machine instances running on the Amazon and Azure cloud platforms, while realizing up to a 3.57x speedup with our approach, saving up to 20 minutes for a single STAR alignment workflow on Azure with network disks. For STAR alignment, it is possible to use a checkpointed container to align multiple sequences at once by retaining the genome indices in memory. Our approach yields a significant benefit from hot-starting even when as few as one or two files are aligned.
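The two STAR steps can be sketched as follows (the paths `/idx`, `ref.fa`, and `sample.fastq` are hypothetical placeholders; `--runMode genomeGenerate` and `--genomeLoad LoadAndKeep` are standard STAR options):

```shell
# Step 1 (slow, done once): build genome indices from the reference genome.
STAR --runMode genomeGenerate \
     --genomeDir /idx \
     --genomeFastaFiles ref.fa

# Step 2 (per sample): map reads against the indices. LoadAndKeep keeps
# the loaded indices in shared memory, so subsequent alignments on the
# same host skip the expensive load.
STAR --genomeDir /idx \
     --readFilesIn sample.fastq \
     --genomeLoad LoadAndKeep
```

Our checkpoint is taken after the indices have been loaded, so a restored container begins directly at step 2.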
Additionally, multiple STAR alignment tasks can be computed in parallel by different processes using the same genome indices. For automated schedulers such as Docker Compose (https://docs.docker.com/compose/), "hot-starting" reduces execution time every time the STAR container is launched. While it is possible to design a workflow that performs all the alignments in a single container, load balancing is better served by allowing the scheduler to distribute the computation over the cluster as shorter jobs.
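The fan-out pattern looks roughly like the sketch below. Here `star_align` is a hypothetical stand-in for the real STAR invocation with `--genomeLoad LoadAndKeep`; it simply echoes, so the control flow is runnable anywhere:

```shell
#!/bin/sh
# Fan out per-sample alignment jobs that all read the same preloaded
# index. star_align is a stub standing in for the actual STAR call.
star_align() {
  echo "aligned $1"
}

for fq in sample1.fq sample2.fq sample3.fq; do
  star_align "$fq" &   # each background job reuses the shared in-memory index
done
wait                   # block until every alignment has finished
```

In a real deployment each loop iteration would instead be a scheduler-dispatched container restored from the same checkpoint.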
Our hot-start strategy only requires that there be a convenient place to pause, checkpoint, and restart. In the case of STAR, this is provided by a flag that allows the container to keep the genome indices in shared memory between invocations of STAR. For other workflows, one could add a flag to pause the computation where the checkpoint is to be created, and a flag to resume the computation afterwards. With these straightforward modifications, any application could take advantage of checkpointing to avoid repetitive initialization. This is a novel and unexplored approach to optimising containerized workflows while reducing the costs of cloud computing.
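A minimal sketch of such a pause/resume split, assuming a hypothetical wrapper script with `--init` and `--run` flags (all names and paths here are illustrative, not part of any real tool):

```shell
#!/bin/sh
# Hypothetical split of a workflow into an "init" phase (checkpointed
# once) and a "run" phase (executed on every hot-start).
STATE_DIR="${STATE_DIR:-/tmp/hotstart_demo}"
mkdir -p "$STATE_DIR"

init_phase() {
  # One-time expensive setup, e.g. building an index (simulated here).
  echo "index-v1" > "$STATE_DIR/index"
  echo "initialized"
}

run_phase() {
  # Fast per-job work that reuses the state saved by init_phase.
  echo "processed $1 with $(cat "$STATE_DIR/index")"
}

case "${1:-}" in
  --init) init_phase ;;       # checkpoint the container right after this
  --run)  run_phase "$2" ;;   # restored containers resume directly here
esac
```

The container is checkpointed after `--init` completes; every restored copy then skips straight to `--run`.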

METHODS
CRIU. CRIU (Checkpoint/Restore In Userspace) is a Linux software tool that freezes a running application and saves it as a collection of files on disk (https://criu.org). The application can later be restored on the same or on a different host. Docker currently integrates CRIU as an experimental checkpoint sub-command that saves the state of processes to a collection of files on disk. The checkpointing command has been used to migrate containers from a source host to a target host when the resources of the source are limited8, for fault-tolerance purposes9, and to provide highly available and scalable micro-services10.
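Outside of Docker, the same operation can be performed with the CRIU command-line tool directly (the PID and image directory below are hypothetical; CRIU typically requires root privileges):

```shell
# Dump a running process tree (root PID 1234) into an image directory.
# --shell-job is needed for processes attached to a terminal.
criu dump -t 1234 -D /tmp/ckpt --shell-job

# Restore the saved process tree from the image directory, on the same
# machine or on another host with a compatible kernel.
criu restore -D /tmp/ckpt --shell-job
```

Docker's checkpoint sub-command wraps this same dump/restore cycle at the level of a whole container.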

Cloud configurations tested
In our experiments, we deployed our containers on instances from two cloud platforms: Amazon Web Services (AWS) and Microsoft Azure. To create the hot-start container, we ran STAR with the option that keeps the genome indices in shared memory after STAR exits. To trap the container in this state, we launched STAR using a parent shell script that did not exit, and checkpointed the container after STAR exited. This produces checkpoint files that store the state of the hot-start container. Because different kernel versions are used, we created separate hot-start containers for AWS and Azure.
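The parent-script trick can be sketched as follows (the index path, container name, and checkpoint name are hypothetical; `--genomeLoad LoadAndExit` is the STAR option that loads the indices into shared memory and then exits):

```shell
#!/bin/sh
# entrypoint.sh -- parent script that outlives STAR, so the container
# stays up and can be checkpointed once the indices are resident in
# shared memory.
STAR --genomeDir /idx --genomeLoad LoadAndExit  # load indices, then exit
tail -f /dev/null                               # keep the container alive

# From the host, once the load step above has finished:
#   docker checkpoint create star-loader genome-ready
```

Restoring from the `genome-ready` checkpoint on a compatible host yields a container in which the indices are already loaded, ready for alignment jobs.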
Comparing hot-start containers and standard cold-start containers