Toward a scalable framework for reproducible processing of volumetric, nanoscale neuroimaging datasets

Abstract
Background: Emerging neuroimaging datasets (collected with imaging techniques such as electron microscopy, optical microscopy, or X-ray microtomography) describe the location and properties of neurons and their connections at unprecedented scale, promising new ways of understanding the brain. These modern imaging techniques used to interrogate the brain can quickly accumulate gigabytes to petabytes of structural brain imaging data. Unfortunately, many neuroscience laboratories lack the computational resources to work with datasets of this size: computer vision tools are often not portable or scalable, and there is considerable difficulty in reproducing results or extending methods.
Results: We developed an ecosystem of neuroimaging data analysis pipelines that use open-source algorithms to create standardized modules and end-to-end optimized approaches. As exemplars we apply our tools to estimate synapse-level connectomes from electron microscopy data and cell distributions from X-ray microtomography data. To facilitate scientific discovery, we propose a generalized processing framework, which connects and extends existing open-source projects to provide large-scale data storage, reproducible algorithms, and workflow execution engines.
Conclusions: Our accessible methods and pipelines demonstrate that approaches across multiple neuroimaging experiments can be standardized and applied to diverse datasets. The techniques developed are demonstrated on neuroimaging datasets but may be applied to similar problems in other domains.


Introduction
Testing modern neuroscience hypotheses often requires robustly processing large datasets. Often the labs best suited for collecting such large, specialized datasets lack the capabilities to store and process the resulting images [1]. A diverse set of imaging modalities, including electron microscopy (EM) [1], array tomography [2], CLARITY [3], light microscopy [4], and X-ray microtomography (XRM) [5], will allow scientists unprecedented exploration of the structure of healthy and diseased brains. The resulting structural connectomes, cell type maps, and functional data have the potential to radically change our understanding of neurodegenerative disease.
Traditional techniques and pipelines developed and validated on smaller datasets may not easily transfer to datasets that are acquired by a different laboratory or that are too large to analyze on a single computer or with a single script. Prior machine vision pipelines for EM processing, for instance, have had considerable success [6,7,8,9,10]. However, these pipelines may require extensive configuration and are not scalable [8], may require proprietary software and have unknown hyperparameters [9], or are highly optimized for a single hardware platform [10].
In other domains, computer science solutions exist for improving algorithm portability and reproducibility, including containerization tools like Docker [11] and workflow specifications such as the Common Workflow Language (CWL) [12]. Cloud computing frameworks enable the deployment of containerized tools [13,14], pipelines for scalable execution of Python code [15], and reproducible execution [16]. Workflow management and execution systems such as Apache Airflow [17] and related projects such as Toil [18] and CWL-Airflow [19] allow execution of pipelines on scalable cloud resources. Despite the existence of these tools, a gap currently exists for extracting knowledge from neuroimaging datasets (due to the general lack of experience with these solutions as well as a lack of neuroimaging-specific features). We propose a solution that includes a library of reproducible tools and pipelines, integration with compute and storage solutions, and tools to automate and optimize deployment over large (spatial) datasets. This gap is highlighted in Table 1 and discussed further in the methods section; critically, our proposed solution combines common workflow specifications, Dockerized tools, and automation for large-scale jobs over volumetric neuroimaging datasets.
We introduce a library of neuroimaging pipelines and tools, Scalable Analytics for Brain Exploration Research (SABER), to address the needs of the neuroimaging community. SABER introduces canonical pipelines for EM and XRM, specified in CWL, with a library of Dockerized tools. These tools are deployed with the workflow execution engine Apache Airflow [17], using Amazon Web Services (AWS) Batch to scale compute resources, with imaging data stored in the volumetric database bossDB [20]. Metadata, parameters, and tabular results are logged using the neuroimaging database DataJoint [21]. Automated tools allow deployment of pipelines over blocks of spatial data, as well as end-to-end optimization of hyperparameters given labeled training data.
We demonstrate the use of SABER for three use cases critical to neuroimaging using EM, XRM, and light microscopy methods as exemplars. While light microscopy is commonly used to image cell bodies and functional activity with calcium markers, EM offers unique insight into nanoscale connectivity [22,23,24,25], and XRM allows for rapid assessment of cells and blood vessels at scale [26,5,27]. These approaches provide complementary information and have been successfully used on the same biological sample [5], as XRM is non-destructive and compatible with EM sample preparations and light microscopy preparations. Being able to extract knowledge from large-scale volumes is a critical capability, and being able to reliably and automatically apply tools across these large datasets will enable the testing of exciting new hypotheses.
Our integrated framework is an advance toward easily and rapidly processing large-scale data, both locally and in the cloud. Processing these datasets is currently the major bottleneck in making new, large-scale maps of the brain: maps that promise insights into how our brains function and are impacted by disease.

Findings
Pipelines and Tools for Neuroimaging Data
To address the needs of the neuroimaging community, we have developed a library of containerized tools and canonical workflows for reproducible, scalable discovery. Key features required for neuroimaging applications include:
• Canonical neuroimaging workflows specified in CWL [12] and containerized, open-source image processing tools
• Integration of workflows with infrastructure to deploy jobs and store imaging data at scale
• Tools to optimize workflow hyperparameters and automate deployment of imaging workflows over blocks of data
Building on existing tools, our framework provides a more accessible approach for neuroimaging analysis and can enable a set of use cases for the neuroscientist by improving reproducibility. Details on adoption can be found in the Section "Required Background and Getting Started." To ensure broad impact, SABER is designed to be as generalizable as possible. The core abilities to schedule and launch Dockerized workflows are applicable to a wide range of volumetric datasets provided that 1) Dockerized tools exist, 2) CWL workflows can be specified, and 3) raw data can be accessed from existing volumetric repositories [20,28,29], local files, or cloud buckets. The standardized workflows described below are developed specifically for EM and XRM. These workflows perform generalized, repeated processing techniques like classification, object detection, and 2D and 3D segmentation, but with parameters and weights specific to these modalities. Users may be able to adapt these tools to additional problems with the use of annotated training data and appropriate tuning.

Standardized Workflows and Tools
While many algorithms and workflows exist to process neuroimagery datasets, these tools are frequently lab- and task-specific. As a result, teams often duplicate common infrastructure code (e.g., data download or contrast enhancement) and re-implement algorithms, when it would be faster and more reliable to instead reuse previously vetted tools. This hinders attempts to reproduce results and accurately benchmark new image processing algorithms.
In our framework, workflows are specified by CWL pipeline specifications. Individual tools are then specified by an additional CWL file, a container file, and corresponding source code. This ensures a modular design for pipelines and provides a library of tools for the neuroscientist. This library of pre-packaged tools and workflows helps reduce the number of computational frameworks and software libraries users need to be familiar with, helping to limit the computational experience required to run these pipelines.
Initially, we have implemented two canonical pipelines for EM and XRM processing. For EM, we estimate graphs of connectivity between neurons from stacks of raw images. Given XRM images, we estimate cell body position and blood vessel position. Each of these workflows is broken into a sequence of canonical steps. Such a step-wise workflow can be viewed as a directed, acyclic graph (DAG). Each step of a pipeline is implemented by a particular containerized software tool. The specific tools implemented in our reference canonical pipelines are discussed below.

Cell Detection from X-ray Microtomography and Light Microscopy
XRM provides a rapid approach for producing large-scale submicron images of intact brain volumes, and computational workflows have been developed to extract cell body densities and vasculature [5]. Individual XRM processing tools have been developed for tomographic reconstruction [30], pixel classification [31], segmentation of cells and blood vessels [5], estimation of cell size [5], and computation of the density of cells and blood vessels [5]. Running this workflow on a volume of X-ray images produces an estimate of the spatially-varying density of cells and vessels. Cubic millimeter-sized samples (100 GB) can be imaged, reconstructed, and analyzed in a few hours [5].
To implement a canonical XRM workflow, we define a set of steps: extracting subvolumes of data, classifying cell and vessel pixel probabilities, identifying cell objects and vasculature, merging the results, and estimating densities. Details on data storage and access can be found in the implementation section. We defined Dockerized tools implementing a random forest classifier, a Gaussian mixture model, and a U-net [32] for pixel classification, as well as the cell detection and vessel detection strategies of [5]. These tools provide a standard reference for the XRM community, and modular replacements can be made as new tools are developed and benchmarked against this existing standard. Figure 1 shows this canonical workflow for XRM data, with each block representing a separate containerized tool. Also shown in Panel B is example output from running the pipeline, highlighting the resulting cell body positions and blood vessels.
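To make these steps concrete, the sketch below illustrates the pixel classification and cell detection stages with a simple random forest classifier followed by connected-component analysis. It is a minimal illustration assuming scikit-learn and scikit-image, with placeholder arrays standing in for real XRM slices and annotations; the containerized SABER tools implement these stages separately.

```python
# Minimal sketch of the pixel classification and cell detection steps, assuming
# scikit-learn and scikit-image; the arrays below are placeholders for real XRM slices
# and annotations, and the containerized SABER tools implement these stages separately.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skimage.filters import gaussian, sobel
from skimage.measure import label, regionprops

def pixel_features(img):
    # Simple per-pixel features: raw intensity, smoothed intensity, gradient magnitude.
    return np.stack([img, gaussian(img, sigma=2), sobel(img)], axis=-1).reshape(-1, 3)

# Placeholder "labeled slice" for training the pixel classifier (1 = cell, 0 = background).
train_img = np.random.rand(128, 128)
train_lbl = (train_img > 0.95).astype(int)
clf = RandomForestClassifier(n_estimators=50).fit(pixel_features(train_img), train_lbl.ravel())

# Classify a new slice, then treat connected components of the probability map as candidate cells.
test_img = np.random.rand(128, 128)
prob = clf.predict_proba(pixel_features(test_img))[:, 1].reshape(test_img.shape)
cells = [region.centroid for region in regionprops(label(prob > 0.5))]
print(f"detected {len(cells)} candidate cell bodies")
```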
These same tools can also be applied (with appropriate retraining) to detecting cell bodies in light microscopy data, such as from the Allen Institute Brain Atlas [4]. Here the same pipeline tools can be reused to detect cell bodies using the pixel classification step followed by the cell detection step. This result demonstrates the application of these tools across modalities and datasets to ease the path to discovery.

Deriving Synapse-level Connectomes from Electron Microscopy
Several workflows exist to produce graphs of brain connectivity from EM data [6,10,7], including an approach that optimizes each stage in the processing pipeline based on end-to-end performance [8]. However, these tools were not standardized into a reproducible processing environment, making reproduction of results and comparison of new algorithms challenging.
We have defined a series of standard steps required to produce brain graphs from EM images, seen in Figure 2. First, data is divided into subvolumes, and cell membranes are estimated for each subvolume. Next, synapses are estimated and individual neurons are segmented from the data. After this, synaptic connections must be associated with neurons, and results merged together across blocks. Then a graph can be generated by iterating over each synapse to find the neurons representing each connection. Many tools have been developed for various sections of this pipeline, and a single tool may accomplish more than one step of the pipeline. Examples of tools for membrane segmentation include CNN [33] and U-net [32] approaches. Synapse detection has been achieved using deep learning techniques and random forest classifiers [34,35]. Neural segmentation has previously been done using agglomeration-based approaches [36] and automated selection of neural networks [9]. For our initial implementation of this workflow, we create CWL specifications and containerized versions of U-nets [32] for synapse and membrane detection, the GALA tool [37] for neuron segmentation, and algorithms for associating synapses to neurons and generating connectomes [8].
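As a rough illustration of the final association and graph-generation steps, the sketch below pairs each synapse location with the two neuron labels surrounding it and accumulates a connectivity graph. It assumes numpy and networkx and uses toy data; the SABER pipeline performs these steps with dedicated containerized tools [8].

```python
# Rough sketch of associating synapses with segmented neurons and building a graph,
# assuming numpy and networkx; the segmentation and synapse list here are toy data.
import numpy as np
import networkx as nx

def synapses_to_graph(neuron_seg, synapse_centroids, radius=2):
    """For each synapse centroid, find the two most frequent neuron labels in a small
    neighborhood and add (or reinforce) an edge between them."""
    graph = nx.Graph()
    for z, y, x in synapse_centroids:
        patch = neuron_seg[max(z - radius, 0):z + radius + 1,
                           max(y - radius, 0):y + radius + 1,
                           max(x - radius, 0):x + radius + 1]
        labels = patch[patch > 0]                    # ignore background voxels
        ids, counts = np.unique(labels, return_counts=True)
        if len(ids) < 2:
            continue                                 # synapse not bounded by two neurons
        a, b = ids[np.argsort(counts)[-2:]]
        weight = graph.get_edge_data(a, b, {"weight": 0})["weight"]
        graph.add_edge(int(a), int(b), weight=weight + 1)
    return graph

# Toy example: two neurons split along x, with one synapse at their interface.
seg = np.zeros((10, 10, 10), dtype=int)
seg[:, :, :5] = 1
seg[:, :, 5:] = 2
connectome = synapses_to_graph(seg, [(5, 5, 5)])
print(list(connectome.edges(data=True)))             # expect one edge between neurons 1 and 2
```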
When creating this canonical pipeline for EM processing, our initial implementation goal is not to focus on pipeline performance in the context of reconstruction metrics. Rather, we aim to provide a reference pipeline for scientists and algorithm developers. For scientists, this provides an established and tested pipeline for initial discovery. For algorithm developers, this pipeline can be used to benchmark algorithms which encompass one or more steps in the pipeline.

Optimization and Deployment of Workflows
To process modern neuroimaging datasets, users need more than standardized pipelines and the ability to deploy them to individual blocks of data. Scaling these workflows to current datasets requires specialized interfaces to distribute jobs over large volumes and tune them to new data. The SABER project provides 1) a parameterization API to distribute jobs over large volumes of data and 2) an optimization API to train pipelines and fine-tune hyperparameters for new datasets.
To apply SABER workflows to large volumetric datasets, such as those hosted in bossDB [20], a parameterization API allows control over creating blocks from large datasets (by specifying the sizes and overlap of blocks in each dimension), running pipelines on each block, and merging results (i.e., a distribute-collect approach). A second parameter file specifies these desired parameters and can be used with any compatible workflow to deploy it to a new dataset. Deployment scripts enable rapid configuration and deployment of workflows for new datasets.
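The blocking idea behind this distribute-collect approach can be sketched as follows. This is a minimal, hypothetical illustration of generating overlapping subvolume coordinates in Python; the actual parameterization API is driven by a parameter file rather than called directly like this.

```python
# Minimal, hypothetical sketch of the distribute-collect blocking step: generate
# overlapping (start, stop) coordinates per axis so each block can be handed to one
# pipeline run; the real parameterization API is configured through a parameter file.
def iter_blocks(extent, block_size, overlap):
    """Yield ((z0, z1), (y0, y1), (x0, x1)) blocks covering `extent` with the given overlap."""
    per_axis = []
    for ext, size, ov in zip(extent, block_size, overlap):
        step = size - ov
        per_axis.append([(start, min(start + size, ext)) for start in range(0, ext, step)])
    for zb in per_axis[0]:
        for yb in per_axis[1]:
            for xb in per_axis[2]:
                yield zb, yb, xb

# Example: a 1000^3-voxel volume cut into 256^3 blocks with 32 voxels of overlap.
blocks = list(iter_blocks((1000, 1000, 1000), (256, 256, 256), (32, 32, 32)))
print(len(blocks), blocks[0])   # 125 blocks; first block is ((0, 256), (0, 256), (0, 256))
```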
In order to tune SABER workflows for new datasets, it is necessary to train the parameters of the pipeline, including any hyperparameter optimization (Figure 3). Our tools currently require a small volume of labeled training data from the new dataset (although recent efforts are also exploring unsupervised methods [38]). To perform the hyperparameter search, we pursue an optimization strategy that assumes a black-box workflow, avoiding assumptions such as differentiability of the objective function. SABER enables iteratively selecting parameters, scheduling parallel jobs, and collecting results. This approach supports both batch and sequential optimization approaches. Initially, we implemented a simple grid search, random search, and the adaptive search method shown in Figure 3, based on random resampling [39]. This will be expanded to techniques such as sequential Bayesian optimization [40] and convex bounding approaches [41] to develop a library of readily available, proven techniques. To provide benchmarking for these approaches, the team hosts available ground truth data (e.g., [23] for EM) and scoring tools to compute metrics such as precision-recall or f1-score.

Datasets for Benchmarking Workflows
A critical feature for new users as well as developers of new containerized tools is the availability of benchmark datasets for deriving synapse-level connectomes from EM as well as segmentation of cell bodies and vasculature from XRM data.
Datasets are hosted in the bossDB system ([20], https://bossdb.org/projects) for this purpose. For testing XRM pipelines, data from the datasets "Dyer et al. 2016" [5] and "Prasad et al. 2020" [42] can be used. These datasets cover different brain regions and include labels of cell bodies and vasculature for training new users and developing new algorithms. Similarly, for EM data, datasets such as "Kasthuri et al. 2015" [23] provide EM data along with segmentation and synapse labels. These similarly enable new users and algorithm developers to compare to existing data and approaches.

Use Case 1: Pipeline Optimization
When collecting a new neuroimaging dataset, it is often necessary to fine-tune or retrain existing pipelines. This is typically done by labeling a small amount of training data, which can often be labor intensive, followed by optimizing the automated image processing pipeline for the new dataset. These pipelines consist of heterogeneous tools with many hyperparameters and are not necessarily end-to-end differentiable.
Users can execute the optimization routines using a simple configuration file to specify algorithms, parameter ranges, and metrics. Figure 3 demonstrates the application of three algorithms for pipeline optimization. We choose the Allen Institute for Brain Science (AIBS) Reference Atlas [4] as a demonstration of generalization beyond EM and XRM datasets. In order to optimize the pipeline, this example optimizes over: the initial threshold applied to the probability map, the size of the circular template, the size of the circular window used when removing a cell from the probability map, and the stopping criterion for maximum correlation within the image. The user specifies the range of each parameter.
Our framework supports implementations of different optimization routines, such as random selection of parameters with resampling, as seen in Figure 3. Random selection of parameters often produces comparable results to grid search, and users may need to explore algorithms to find an approach that works well for the structure of their pipeline [39]. For the resampling approach, we initially choose parameters at random, and then refine the search by choosing new parameters near the best initial set, with the user setting a maximum number of iterations. Figure 3B shows a parameter reduction of twenty percent at each resampling, leading to a more efficient parameter search and improved performance. Using SABER, it is possible for a user to explore the trade-offs for a range of hyperparameter optimization routines.
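A minimal sketch of this random-search-with-resampling idea is shown below, assuming the pipeline can be evaluated as a black-box score function (e.g., returning an f1 score) and shrinking each parameter range by twenty percent around the best point after each round; the actual SABER optimizer schedules these evaluations as parallel containerized jobs rather than calling a Python function directly.

```python
# Minimal sketch of random search with resampling, assuming a black-box score function;
# each round shrinks the parameter ranges by twenty percent around the best point so far.
import random

def resampling_search(score_fn, ranges, n_per_round=20, n_rounds=3, shrink=0.8):
    best_params, best_score = None, float("-inf")
    for _ in range(n_rounds):
        for _ in range(n_per_round):
            params = {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
            score = score_fn(params)
            if score > best_score:
                best_params, best_score = params, score
        # Re-center and shrink each range around the best parameters for the next round.
        ranges = {k: (max(lo, best_params[k] - shrink * (hi - lo) / 2),
                      min(hi, best_params[k] + shrink * (hi - lo) / 2))
                  for k, (lo, hi) in ranges.items()}
    return best_params, best_score

# Toy objective standing in for an end-to-end pipeline evaluation over two hyperparameters.
best, score = resampling_search(
    lambda p: -(p["threshold"] - 0.3) ** 2 - ((p["template_size"] - 12) / 10) ** 2,
    {"threshold": (0.0, 1.0), "template_size": (5.0, 30.0)})
print(best, score)
```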

Use Case 2: Scalable Pipeline Deployment
The second critical use case of interest to neuroscientists is the deployment of pipelines to large datasets of varying sizes. Datasets may range from gigabytes or terabytes, as in XRM, to multiple petabytes, as in large EM volumes used for connectome estimation. SABER provides a framework for blocking large datasets, executing optimized pipelines on each block, and then merging the results through a functional API. Given a dataset in a volumetric database, such as bossDB, our Python scripts control blocking, execution, and merging. Results are placed back into a database for further analysis, or stored locally. An example of this use case for XRM data can be seen in Figure 4, and another example of this use case for extracting synapse-level connectomes can be seen in Figure 5.

Use Case 3: Benchmarking Neuroimaging Algorithms
The third major use case applies to developers implementing new algorithms for neuroimaging datasets. Because tools are written in a variety of languages for a variety of platforms, it has been difficult for the community to standardize comparisons between algorithms. Moreover, it is important to assess the end-to-end performance of new tools in a pipeline which has been properly optimized. Without this comparison, it is difficult to directly compare algorithms or their impact. Using the specified pipelines, a new tool may subsume one or more of these steps, with the specification defining the inputs and outputs. A new CWL pipeline can be quickly specified with the new tool replacing the appropriate step or steps. Hyperparameter optimization can be run on each example to compare tools, leveraging the reference images and annotations provided for the pipelines in SABER.
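As an illustration of the scoring side of this use case, the snippet below computes voxel-wise precision, recall, and f1 between a reference annotation and a candidate tool's output. It is a minimal sketch using scikit-learn with random placeholder masks, standing in for SABER's scoring tools and hosted ground truth (e.g., [23]).

```python
# Minimal sketch of voxel-wise scoring against reference annotations, assuming scikit-learn;
# the masks here are random placeholders for ground-truth labels and a candidate tool's output.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

reference = (np.random.rand(64, 64, 64) > 0.9).astype(int)   # placeholder ground-truth mask
predicted = (np.random.rand(64, 64, 64) > 0.9).astype(int)   # placeholder tool output
precision, recall, f1, _ = precision_recall_fscore_support(
    reference.ravel(), predicted.ravel(), average="binary")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```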

Discussion
We have developed a framework for neural data analysis along with corresponding infrastructure tools to allow scalable computing and storage. We facilitate the sharing of workflows by compactly and completely describing the associated set of tools and linkages. Future enhancements will introduce versioning to track changes in workflows and tools.
The SABER project aims to support multiple modalities, focusing initially on EM and XRM data through the development of containerized tools for different steps such as synapse and cell detection. The same tools can be used for different steps of both workflows. For instance, our U-net [32] tool can be used to generate probability maps for synapses, cell bodies, or cell membranes when trained with different data. The framework also allows for joint analysis of co-registered datasets using our CWL pipelines with different parameterized sweeps. The user can then use simple Python scripts to pull and analyze any parts of these data.
While the SABER project has focused on tools for processing large EM and XRM datasets, many of the tools and infrastructure developed would also be of interest to researchers investigating light microscopy, PET, and fMRI. The features of SABER are most appropriate for large-scale volumetric data, where records are large (gigabytes or larger) and it is difficult to process a dataset in memory. Therefore, larger light microscopy datasets may benefit the most from SABER. The developed tools focus on canonical problems such as object detection, 2D segmentation, and 3D segmentation. These are generally useful for structural neuroimaging datasets, and may be reused in other contexts.
Our goal is to establish accessible reference workflows and tools which can be used for benchmarking new algorithms and assessing performance on new datasets. Moving forward, we will encourage algorithm developers to containerize their solutions for pipeline deployment and to incorporate state-of-the-art methods. Through community engagement, we hope to grow the library of available algorithms and demonstrate large-scale pipelines which have been vetted on different datasets. We also hope to recruit researchers from different domains to explore how these tools apply outside of the neuroimaging community.
Prior solutions have taken different approaches to processing neuroimaging data. For example, the workflow execution engine LONI has been used for processing EM data [8], but requires extensive configuration and is not scalable to very large volumes. The SegEM framework [9] offers extensive features for optimizing and deploying EM pipelines, but is specifically focused on neuron segmentation from EM data and is tied to a MATLAB cluster implementation. Highly optimized pipelines can be deployed on a single workstation [10], which is ideal for proven pipelines as part of ongoing data collection, but is limited in developing and benchmarking new pipelines.
A major strength of the SABER approach is the use of CWL to provide a common specification for workflows, which has considerable advantages compared to workflow managers with specific Python syntax (e.g., [15,43]). The common, interoperable standard is important to allow reuse of the SABER workflows in other workflow managers as they continue to evolve. This approach also encourages tools developed for other open source projects to be deployed using the SABER system.
A limitation of our existing tooling is interactive visualization. Although we provide basic capabilities, additional work is needed to interrogate raw and derived data products and identify failure modes. We are extending open source packages, substrate [44] and neuroglancer [45], to easily visualize data inputs and outputs of our workflows and tools.
Scalable solutions for container orchestration such as Kubernetes [13] and general workflow execution systems like Apache Airflow [17] have provided the ability to orchestrate execution of containers at scale. These solutions, however, lack the workflow definitions, imaging databases, and deployment tools needed for neuroimaging use cases. SABER builds on top of these technologies to enable neuroimaging use cases while avoiding the specialized, one-off approaches often used in conventional neuroimaging pipelines.
Our solution leverages many powerful existing third-party solutions (e.g., AWS, Apache Airflow). While this allows use of powerful modern software packages and shared development, it creates a risk if these technologies are not supported and developed in the future. While it is not possible to completely mitigate this risk, the modular strategies for storage and computation, described below, help to address this challenge by allowing components related to these services to be replaced.
The key dependency is Apache Airflow, but even in this case the workflows and Dockerized tools have potential applications with future workflow managers.

Potential implications
While our initial workflows focus on XRM and EM datasets, many of these methods can be easily deployed to other modalities like light microscopy [46], and the overall framework is appropriate for problems in many domains. These include other scientific data analysis tasks as varied as machine learning for processing noninvasive medical imaging data or statistical analysis of population data.
Code, demos, and results of the SABER platform are available on GitHub under an open source license, along with documentation and tutorials (see below). We make SABER available to the public with the expectation that it will help to enable and democratize scientific discovery of large, high-value datasets, and that these results will offer insight into neurally-inspired computation, the processes underlying disease, and paths to effective treatment. Contributors and developers are also encouraged to visit the repository and join the open source development effort.
Future work will focus on usability, while integrating SABER into existing open-source frameworks for data storage and visualization (e.g., [20], [45]). In an effort to lower the barriers for new users, this work will include Graphical User Interfaces (GUIs), as well as the development of additional reference pipelines. Integration with datastores like bossDB will enable a common ecosystem for new users to find storage, processing, and visualization in a common location.

Existing Software Solutions
For small-scale problems, individual software tools and pipelines which are fully portable and reproducible have been produced (e.g., [47]), but this challenge has not yet been solved at the scale of modern EM and XRM volumes.
Many tools have become available for scalable computation and storage, such as Kubernetes [13] and Hadoop [48], which enable the infrastructure needed for running containerized code at scale. However, such projects are domain-agnostic and do not necessarily provide the features or customization needed by a neuroscientist. As scalable computation ecosystems, these solutions can be integrated as the backend for workflow management systems such as SABER.
Traditional workflow environments (e.g., LONI Pipeline [49], Nipype [43], Galaxy [50], and Knime [51]) provide a tool repository and workflow manager, but require connection to a shared compute cluster to scale. All of these systems rely on software that is installed locally on the cluster or local workstation, which can result in challenging or conflicting configurations that slow adoption and hurt reproducibility.
New frameworks for workflow execution have been developed, but solve only a subset of the challenges for neuroimaging. Boutiques [52] manages and executes single, command-line executable neuroscience tools in containers. Pipelines must be encapsulated in a single tool, meaning that coding is required to swap pipeline components. Dray [53] executes container-based pipelines as defined in a workflow script. While Dray contains some of the core functionality to execute container-based pipelines, non-programmers cannot easily use the system and it is limited in the types of workflows that are supported.
Similarly, Pachyderm [14] offers execution of containerized workflows but lacks support for storage solutions appropriate for neuroimaging as well as the optimization tools needed for these neuroimaging pipelines. Workflow execution engines such as Toil [18] and CWL-Airflow [19] are closely related to SABER, providing lightweight Python solutions for workflow scheduling. However, like Pachyderm, they lack the automation tools and storage scripts required by neuroimaging applications. The most closely related tool is Air-tasks [54], which provides tools to automate deployment of neuroimaging pipelines. Air-tasks, however, provides fewer capabilities to the user and does not support a common workflow specification or explicitly support optimization or benchmarking. Table 1 breaks down this comparison between SABER and existing workflow managers and execution solutions for scientific computing. In general, neuroimaging applications benefit from several key features which are not provided in these more general purpose scientific workflow approaches, due to the use of volumetric data, few large datasets (vs. many smaller images in a large collection), and the need for tool cross-compatibility. SABER delivers these key features through the use of standardized workflows, containerized tools, automation of deployment over volumetric data (as opposed to processing individual records), and the ability to optimize pipelines. The closest existing solutions are workflow managers such as Toil [18], Galaxy [50], and CWL-Airflow [19]. These approaches are powerful but focused on other problems in bioinformatics, such as gene sequence analysis, consisting of many small records. SABER adds the necessary features to provide these capabilities for the neuroimaging community.
While existing pipeline tools like LONI [49] and Nipype [43] enable the execution of scientific workflows, they still lack a few key features for the neuroimaging user, which may limit the portability and utility of workflows. SABER provides the library of tools required for modern segmentation and detection problems on EM and XRM data, including GPU-enabled DNN tools. These tools, and their corresponding CWL definitions, are useful in any system which can support them, rather than being specific to a workflow manager, as with LONI and Nipype. We enable the use and sharing of Dockerized tools and standardized workflows within and beyond the SABER framework.

SABER
To overcome limitations in existing solutions, SABER provides canonical neuroimaging workflows specified in a standard workflow language (CWL), integration with a workflow execution engine (Airflow), an imaging database (bossDB), and a parameter database (DataJoint) to deploy workflows at scale, and tools to automate deployment and optimization of neuroimaging pipelines. Our automation tools include end-to-end hyperparameter optimization methods and deployment by dividing data into blocks, executing pipelines, and merging results (block-merge). In our repository, this is broken into two key components. The first is CONDUIT, which is the core framework for deploying workflows. The second is SABER, which contains the code, Dockerfiles, and CWL files for the workflows (Fig. 6). A comparison of SABER/CONDUIT to existing solutions is given in Table 1.
The core framework (called CONDUIT) is provided in a Docker container to reduce installation constraints and increase portability (Figure 6). The core framework interfaces with scalable cloud compute and storage resources as well as local resources. The user interacts via command line tools, and can visualize the status of workflows using Airflow's graphical user interface (GUI). Each tool used in the workflows will also be built into a separate image.
In our CONDUIT framework (Figure 6 highlights the architecture of the system), workflows and tools are defined with CWL v1.0 specifications. Tools additionally include Dockerfiles and source code. Parameter files contain user-specified parameters for optimization and deployment of pipelines. The features of CONDUIT include parsing the CWL parameters and deploying workflows, as in the CWL-Airflow project [19]. Features added on top of the existing CWL-Airflow functionality include an API for parameterizing jobs for deployment over chunks of data in large volumetric datasets (specified by coordinates), iterative execution of the same workflow with different parameters (for parameter optimization), and logging of metadata and job results. Moreover, wrappers allow for the use of local files and cloud files (S3) for intermediate results with the same workflows and minimal reconfiguration.
The repository at github.com/aplbrain/saber contains both our CONDUIT framework and the SABER workflows and tools, as is visualized in Figure 6. The CONDUIT framework consists of the Python code and scripts that build upon CWL and Airflow to enable the deployment of workflows. The SABER workflow code contains the tools, Dockerfiles, CWL definitions for tools, CWL definitions for workflows, and example job files. This structure emphasizes the portability of SABER tools: the use of Docker and CWL encourages their reuse in other contexts where the full power of the framework may not be needed (e.g., running on small, locally-stored datasets).

Framework Components
The overall structure of SABER is seen in Fig. 6, and consists of tools, workflows, parsers for user commands, workflow execution, and cloud computation and storage. Workflows, found in the SABER component, consist of code, Dockerfiles, and CWL files. The core functionality of parsing workflows, running Airflow, and scheduling jobs is found in the CONDUIT component.

SABER Workflow Library
The SABER subproject consists of a library of code, tools, and workflows. Each SABER tool must have a corresponding Dockerfile. Tools and workflows are specified following CWL specifications. To package a tool for SABER, a developer must:
• Provide a Dockerfile for the tool
• Use command line arguments to specify file inputs and outputs (which can be read as any local file the tool can use)
• Provide a CWL tool file with tool parameters and input and output file names specified
Optionally, developers can choose to print metrics, scores, or other information on the command line. When building workflows, tools are wrapped to allow for either local or cloud execution, and no additional requirements are placed on the tool developer; a minimal sketch of such a tool entry point follows below.
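For illustration, a hypothetical tool entry point following these conventions might look like the following. The file names, flags, and thresholding operation are placeholders rather than an existing SABER tool, but the pattern of file-based inputs and outputs plus a metric printed to standard output matches the conventions described above.

```python
# Hypothetical tool entry point illustrating the packaging conventions: file inputs and
# outputs are passed as command-line arguments, and a metric is printed to standard output
# so it can be parsed during optimization. The flags and operation here are placeholders.
import argparse
import numpy as np

def main():
    parser = argparse.ArgumentParser(description="Threshold a probability map stored as .npy")
    parser.add_argument("--input", required=True, help="input probability map (.npy)")
    parser.add_argument("--output", required=True, help="output binary mask (.npy)")
    parser.add_argument("--threshold", type=float, default=0.5)
    args = parser.parse_args()

    probabilities = np.load(args.input)
    mask = probabilities > args.threshold
    np.save(args.output, mask)
    # A metric on stdout can be picked up by the optimization machinery described later.
    print(f"foreground_fraction: {mask.mean():.4f}")

if __name__ == "__main__":
    main()
```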
Workflows are specified using standard CWL syntax. To specify local versus cloud execution, the CWL "doc" flag can be set to run with completely local compute and storage. Individual step "hints" can be used to specify that an individual step should use local compute resources. GPU resources can be used through configuration of the system Docker installation. Workflow parameters are also specified with standard CWL files.
To enable our neuroimaging use cases, parameter sweeps are specified with a new, custom parameterization file. This specifies the parameter start, stop, step, and overlap. A typical use case is the specification of the boundaries of a large volumetric dataset (xmin, xmax, ymin, ymax, zmin, zmax, and step size). Any parameter specified by a tool's CWL can be included in the parameterization file.
To enable hyperparameter optimization of pipelines, a format similar to parameterization is used to specify which parameters are to be optimized and the range of these parameters, as well as the algorithm (e.g., grid or random search). A CWL "hint" is added to the workflow indicating the name of the optimization metric for each step, which will be parsed from standard output. This allows the specification of multiple objective functions or metrics for each workflow stage.

CONDUIT Docker Container
The CONDUIT component (Fig. 6) contains the scripts for parsing CWL workflows, processing user commands, scheduling jobs using Airflow, and storing and accessing metadata in the metadata store (DataJoint [21]). All of this functionality is itself contained in a Docker container to simplify installation on the user's machine. The CONDUIT container and related containers are started with Docker Compose.
The user interacts with CONDUIT through a series of command line tools. The user interface consists of:
• conduit init: used to configure AWS for cloud use through the provided CloudFormation template. Optional for local use, and only needs to be run when configuring a new AWS account.
• conduit build: used to build the necessary tool Docker containers
• conduit parse: used to create a directed acyclic graph from the CWL and schedule it with Airflow. Accepts an optional parameterization file.
• conduit collect: used to collect metadata results related to a workflow from the metadata database
• conduit optimize: used to schedule a hyperparameter search for a given workflow
These commands provide the key method for users to schedule workflows, which can be monitored using the Apache Airflow webserver started with CONDUIT.

Workflow Execution
The CONDUIT container shown in Figure 6 provides SABER with a managed pipeline execution environment that can run locally or scale using the AWS Batch service. Our custom command scripts and CWL parser generate DAG specifications for execution by Apache Airflow. We select Apache Airflow to interface with a cloud-based computing solution. As an example, we utilize the AWS Batch service, although Airflow can interface with scalable cluster solutions such as Kubernetes or Hadoop. The framework facilitates the execution of a batch processing (versus streaming) workflow composed of software tools packaged inside multiple software containers. This reduces the need to install and configure many, possibly conflicting software libraries.
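For intuition, the snippet below sketches the kind of two-step DAG such a parser might emit for Airflow. It is a hand-written, hypothetical example (the image names and paths are placeholders); CONDUIT builds DAGs programmatically from the CWL and typically dispatches each step to AWS Batch rather than a simple operator.

```python
# Hand-written, hypothetical example of the kind of two-step Airflow DAG a CWL parser
# could emit; image names and paths are placeholders, and CONDUIT generates such DAGs
# programmatically, typically dispatching each step to AWS Batch instead of BashOperator.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("xrm_cell_detection_example", start_date=datetime(2020, 1, 1), schedule_interval=None) as dag:
    classify = BashOperator(
        task_id="pixel_classification",
        bash_command="docker run example/xrm-classify --input /data/raw.npy --output /data/prob.npy")
    detect = BashOperator(
        task_id="cell_detection",
        bash_command="docker run example/xrm-cells --input /data/prob.npy --output /data/cells.npy")
    classify >> detect   # cell detection runs only after the probability map is produced
```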

Cloud Computation and Storage
Large neuroimaging datasets are distinct from many canonical big data solutions because researchers typically analyze a few (often one) very large datasets instead of many individual images. Custom storage solutions [20,28,29] exist, but often require tools, knowledge, and access patterns that are disparate from those used by many neuroscience laboratories. SABER provides tools to connect to specialized neuroimaging databases which integrate into CWL tool pipelines. We use intern [55,56] to provide access to bossDB and DVID and to abstract data storage, RESTful calls, and access details. Workflow parameters, objective functions, and summary results such as graphs and cell densities can be stored in a DataJoint database [21] using a custom set of table schemas. Some datasets, however, can be stored locally but are too large to process in memory on a single workstation. In addition to volumetric data stored in bossDB, SABER also supports local imaging file formats such as HDF5, PNG, or TIFF. As users share pipelines, they may wish to use a pipeline originally developed for data stored in one archive with data stored in another. Therefore, using the existing SABER tools, raw and annotated data can be accessed, retrieved, and stored through bossDB, DVID, or local files. For intermediate results in a pipeline, files can be stored locally (or on any locally mounted drive) as NumPy or HDF5 files, or stored in AWS S3 buckets. Future work will increase the number of supported file formats. The modular nature of raw data access will allow additional tools to access new data sources as they emerge. Supporting additional cloud systems will require further development, although it will not affect the SABER tools or workflows; currently only AWS is supported.
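As a small example of the bossDB access path, the snippet below uses intern [55,56] to pull a cutout from a channel. The collection, experiment, and channel names are placeholders, and credentials are assumed to be configured in the user's intern configuration file.

```python
# Minimal sketch of pulling a cutout from bossDB with intern [55,56]; the collection,
# experiment, and channel names are placeholders, and credentials are assumed to be
# configured in the user's intern configuration file.
from intern.remote.boss import BossRemote

remote = BossRemote()                                        # reads the local intern config
channel = remote.get_channel("em_channel", "my_collection", "my_experiment")  # placeholder names
cutout = remote.get_cutout(channel, 0,                        # resolution level 0
                           [0, 512], [0, 512], [0, 16])       # x, y, and z voxel ranges
print(cutout.shape, cutout.dtype)                             # returned as a z, y, x numpy array
```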
Modern cloud computing tools, such as AWS Batch or Kubernetes, allow large-scale deployment of containerized tools on demand. The CONDUIT container schedules workflows using Apache Airflow and currently supports two execution methods: AWS Batch and local compute resources. Workflows have a "local" flag which can be set to indicate a choice of resources. Tools can also be configured to run with GPU resources. Both methods can be used with local or remote data storage. Further development will be required to support additional executors, such as Kubernetes, using the operators which exist in Apache Airflow.

Required Background and Getting Started
A new user to the SABER framework will require intermediate familiarity with Python programming, the use of command line tools (e.g., Bash), and Docker. These capabilities are often found in capable computer science undergraduates or new computationally-oriented graduate students. To get started, new users will:
• Install Docker
• Build the desired tool containers (e.g., EM or X-ray containers) in the SABER folder
• Build and configure the core CONDUIT Docker containers
• Use the command line interface to schedule workflows
However, the use of SABER with the AWS cloud will require an AWS account, and at least one experienced AWS user to configure the system and serve as the administrator. To configure this system, the user needs to:
• Use the CloudFormation template to configure AWS Batch and S3
• Create credentials for other users and configure access from local machines
The envisioned users of this tool are neuroimaging labs, algorithm developers, and data analysts. One experienced user can quickly configure a cloud SABER deployment for use by others in the lab. Envisioned use cases include neuroimaging labs wanting to apply tools to newly collected datasets and tool developers who want to package and benchmark software tools to reach new users. While this framework certainly does not remove all barriers to entry, the use of Dockerized tools limits the number of competing software configurations for neuroimaging users and provides a common and powerful system for tool developers to share their work. Our system accomplishes this with a set of Dockerized tools to replace installing many, often conflicting dependencies with a single tool (i.e., Docker), the use of standard CWL definitions which are cross-compatible with other efforts, and specialized scripts to handle difficult use cases such as scheduling runs over large datasets using cloud computing resources. This approach attempts to balance the flexibility needed by tool developers with standardization to help the novice user. A user looking to deploy existing tools and workflows to new data will primarily interface through the user commands for CONDUIT, and a tool developer will primarily package tools following the Dockerfiles and CWL examples in SABER (to ensure compatibility with existing tools).

Availability of source code and requirements
The SABER framework is open source and available online at github.com/aplbrain/saber.

Availability of supporting data and materials
The source code for this project is available on GitHub, including code for tools and demonstration workflows. An extensive wiki documenting the repository is also hosted on GitHub. The data are stored in a bossDB instance at https://api.bossdb.org. Snapshots of our code and other supporting data are openly available in the GigaScience repository, GigaDB [57].

List of abbreviations
EM-Electron Microscopy, XRM-X-Ray Microtomography, AWS-Amazon Web Services, CWL-Common Workflow Language, DVID-Distributed, Versioned, Image-Oriented Dataservice, bossDB-Block and Object Storage Service Database, SABER-Scalable Analytics for Brain Exploration Research, DAG-Directed Acyclic Graph, CNN-Convolutional Neural Network, AIBS-Allen Institute for Brain Science

Figure 1. Workflow for processing XRM data to produce cell and vessel location estimates. Raw pixels are used to predict probabilities of boundaries, followed by detection of cell bodies and blood vessels. Finally, cell density estimates are created. Panel A shows the reconstruction pipeline, whereas Panel B shows a reconstruction of the detected cells and blood vessels in the test volume. Cells are shown as spheres and blood vessels as red lines.

Figure 2. Canonical workflow for graph estimation in EM data volumes. This workflow provides the ability to reconstruct a nanoscale map of brain circuitry at the single synapse level. The procedure of mapping raw image stacks to graphs representing synapse-level connectomes consists of synapse and membrane detection, segmentation of neurons, assignment of synapses, merging, and graph estimation. Panel A shows the reconstruction pipeline, and Panel B shows an example segmentation of a neuron from a block of data.

Figure 3. Use case of optimizing a pipeline for light microscopy data, comparing grid search, random search, and the random resampling approach described in the text. We demonstrate these tools on a light microscopy dataset, leveraging methods originally developed for XRM, showcasing the potential for applying tools across diverse datasets. The framework allows a user to easily compare the trade-offs of different approaches for a particular dataset. The maximum f1 score for each approach is marked with a red 'x'. Automating this process using SABER allows for rapid deployment and optimization.

Figure 5. Example deployment of the EM segmentation pipeline to extract graphical models of connectivity from raw images. The processing pipeline (Fig. 2) consists of neural network tools to perform A) membrane detection and B) synapse detection. This is followed by a segmentation tool (Panel C). Finally, segmentation and synapses are associated to create a graphical model. Visualizations of segmentations are done with Neuroglancer [45], a tool compatible with SABER and integrated with the bossDB [20] system.

Figure 6. The architecture and components of SABER. Tools, workflows, and parameters for individual use cases (optimization, deployment) are captured in a file structure using standardized CWL specifications and configuration files. The core of the framework (called CONDUIT) is run locally in a Docker container. CONDUIT consists of scripts to orchestrate deployment and optimization, a custom CWL parser, Apache Airflow for workflow execution, and tools to collect and visualize results. Containerized tools are executed locally or using AWS Batch for a scalable solution. The bossDB provides a solution for scalable storage of imaging data, and a local database is used for storing parameters and derived information.