SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines

Abstract Background The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as with nested loops, dynamic scheduling, and parametrization, which is common in, e.g., machine learning. Findings SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics, and a transcriptomics pipeline. Conclusions SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.


Findings
Driven by the highly complex and heterogeneous nature of biological data [ , ], computational biology is characterized by an extensive ecosystem of command-line tools, each specialized on one or a few of the many aspects of biological data. Because of their specialized nature, these tools generally need to be assembled into sequences of processing steps, often called "pipelines", to produce meaningful results from raw data. Due to the increasingly large sizes of biological data sets [ , ], such pipelines often require integration with High-Performance Computing (HPC) infrastructures or cloud computing resources to complete in an acceptable time. This has created a need for tools to coordinate the execution of such pipelines in an efficient, robust and reproducible manner. This coordination can in principle be done with simple scripts in languages like Bash, Python or Perl, but such scripts can quickly become fragile.
When the number of tasks becomes sufficiently large, and the execution time sufficiently long, the risk of failures during the execution of such scripts increases almost linearly with time, and simple scripts are not a good strategy when large jobs need to be restarted after a failure. They lack the ability to distinguish between finished and half-finished files, and cannot by default detect whether intermediate output files have already been created and can be reused to save computing time and resources. These limitations of simple scripts call for a strategy with a higher level of automation. This need is addressed by a class of software commonly referred to as "scientific workflow management systems" or simply "workflow tools". Through a more automated way of handling the execution, workflow tools can improve the robustness, reproducibility and understandability of computational analyses. In concrete terms, workflow tools provide means for handling atomic writes (making sure finished and half-finished files can be separated after a crashed or stopped workflow), caching of intermediate results, distribution of tasks to the available computing resources, and automatically keeping or generating records of exactly what was run, to make analyses reproducible.
It is widely agreed that workflow tools generally make it easier to develop automated, reproducible and fault-tolerant pipelines, although many challenges and potential areas for improvement remain with existing tools [ ]. This has made scientific workflow systems a highly active area of research. Numerous workflow tools have been developed, and new ones continue to appear.
The workflow tools developed differ quite widely in terms of how workflows are defined and what features are included out of the box. This probably reflects the fact that different types of workflow tools can be suited for different categories of users and use cases. Graphical tools like Galaxy [ , , ] are easy to use even for users without programming experience. Text-based tools are often implemented as domain-specific languages (DSLs), which can often provide a higher level of flexibility, at the expense of the ease of use of a graphical user interface. They can thus be well suited for "power users" with experience in scripting or programming.
Even more power and flexibility can be gained from workflow tools implemented as programming libraries, which provide their functionality through a programming API accessed from an existing programming language such as Python, Perl or Bash. By implementing the API in an existing language, users get access to the full power of the implementation language when writing workflows, as well as the existing tooling around the language. One example of a workflow system implemented in this way is Luigi [ ].
As reported in [ ], although many users find important benefits in using workflow tools, many also experience limitations and challenges with existing workflow tools, especially regarding the ability to express complex workflow constructs such as branching and iteration, as well as limitations in terms of audit logging and reproducibility. Below we briefly review a few existing, popular systems, and highlight areas where we found that the development of a new approach and tool was desirable for use cases that include very complex workflow constructs.
Firstly, graphical tools like Galaxy and Yabi, although easy to use even for users without programming experience, are often perceived to be limited in their flexibility due to the need to install and run a web server in order to use them, which is not always permitted, or practical, on HPC systems.
Text-based tools implemented as DSLs, such as Snakemake, Nextflow, BPipe, Pachyderm and Cuneiform, do not have this limitation, but have other characteristics that might be problematic for complex workflows in some cases.
For example, Snakemake depends on file naming strategies for defining dependencies, which can in some situations be limiting, and also uses a "pull-based" scheduling strategy (the workflow is invoked by asking for a specific output file, whereafter all tasks required for reproducing the file will be executed). While this makes it very easy to reproduce specific files, it can make the system hard to use for workflows involving complex constructs such as nested parameter sweeps and cross-validation fold generation, where the final file names might be hard, if at all possible, to foresee. Snakemake also performs scheduling and execution of the workflow graph in separate stages, which means that it does not support dynamic scheduling.
Dynamic scheduling, which basically means on-line scheduling during the workflow execution [ ], is useful both where the number of tasks is unknown before the workflow is executed, and where a task needs to be scheduled with a parameter value obtained during the workflow execution. Examples of the former are reading row by row from a database, splitting a file of unknown size into chunks, or processing a continuous stream of data from an external process such as an automated laboratory instrument. An example of the latter is training a machine learning model with hyperparameters obtained from a parameter optimization step prior to the final training step.
BPipe constitutes a sort of middle ground in terms of dynamic scheduling. It supports dynamic decisions about what to run by allowing execution-time logic inside pipeline stages, as long as the structure of the workflow does not need to change. Dynamic change of the workflow structure can be important in workflows for machine learning though, e.g. when parametrizing the number of folds in a cross-validation based on a value calculated during the workflow run, such as dataset size.
Nextflow, which has push-based scheduling and supports full dynamic scheduling via the dataflow paradigm, does not have this limitation. It does not, however, support creating a library of reusable workflow components, because of its use of dataflow variables shared across component definitions, as this requires processes and the workflow dependency graph to be defined together.
Pachyderm is a container-based workflow system which uses a JSON- and YAML-based DSL to define pipelines. It has a set of innovative features, including a version-controlled data management component with Git-like semantics and support for triggering pipelines based on data updates, among others. In combination, these can provide some aspects of dynamic scheduling. On the other hand, the more static nature of the JSON/YAML-based DSL might not be optimal for really complex setups, such as creating loops or branches based on parameter values obtained during the execution of the workflow. The requirement that Pachyderm be run on a Kubernetes [ ] cluster can also make it less suitable for some academic environments where the ability to run pipelines on traditional HPC clusters is also required. However, because existing tools are easy to incorporate, it is possible to provide such more complex behavior by including a more dynamic workflow tool as a workflow step in Pachyderm. We thus primarily see Pachyderm as a complement to other lightweight workflow systems, rather than necessarily an alternative.
The usefulness of such an approach, where an overarching framework primarily provides an orchestration role while calling out to other systems for the actual workflows, is demonstrated by the Arteria project [ ]. Arteria builds on the event-based StackStorm framework to allow triggering of external workflows based on any type of event, providing a flexible automation framework for sequencing core facilities.
Going back to traditional workflow systems, Cuneiform takes a different approach compared to most workflow tools by wrapping shell commands in functions in a flexible functional language (described in [ ]), which allows leveraging common benefits of functional programming languages, such as side-effect-free functions, to define workflows. It also leverages the distributed programming capabilities of the Erlang Virtual Machine (EVM) to provide automatic distribution of workloads. It is still a new, domain-specific language though, which means that tooling and editor support might not be as extensive as for an established programming language.
Luigi is a workflow library developed by Spotify, which provides a high degree of flexibility due to its implementation as a programming library in Python. For example, the programming API exposes full control over file name generation. Luigi also provides integration with many Big Data systems such as Hadoop and Spark, and cloud-centric storage systems like HDFS and S3.
SciLuigi [ ] is a wrapper library for Luigi, previously developed by the authors, which introduces a number of benefits for scientific workflows by leveraging selected principles from Flow-based programming (FBP) (named ports and separate network definition) to achieve an API that makes iteratively changing the workflow connectivity easier than in vanilla Luigi. While Luigi and SciLuigi were shown to be a helpful solution for complex workflows in drug discovery, they also have a number of limitations for highly complex and dynamic workflows. Firstly, since Python is a dynamically typed, interpreted language, certain software bugs are discovered only far into a workflow run, rather than while compiling the program. Secondly, the fact that Luigi creates separate processes for each worker, which communicate with the central Luigi scheduler via HTTP requests over the network, can lead to robustness problems in the form of HTTP connection time-outs when going beyond a certain number of workers (a limit observed in the authors' experience).
The mentioned limitations for complex workflows in existing tools are the background and motivation for developing the SciPipe library.

The SciPipe workflow library
SciPipe is a workflow library based on Flow-based programming principles, implemented as a library in the Go programming language. The library is freely available as open source on GitHub [ ]. All releases on GitHub are also archived on Zenodo [ ]. Similarly to Nextflow, SciPipe leverages the dataflow paradigm to achieve dynamic scheduling of tasks based on input data, allowing many workflow constructs not easily coded in many other tools. Combined with design principles from Flow-based programming such as separate network definition and named ports bound to processes, this has resulted in a productive and discoverable API that enables agile authoring of complex and dynamic workflows. The fact that the workflow network is defined separately from processes enables building workflows based on a library of reusable components, although the creation of ad-hoc shell-command-based components is also supported. SciPipe provides a provenance tracking feature that creates one audit log per output file, rather than only one for the whole workflow run. This means that it is always easy to verify exactly how each output of a workflow was created.
Figure : A simple example workflow implemented with SciPipe. The workflow computes the reverse base complement of a string of DNA, using standard UNIX tools. The workflow is a Go program, intended to be saved in a file with the .go extension and executed with the go run command. The SciPipe library is first imported, to be accessed later as scipipe, and a short string of DNA is defined. The full workflow is implemented in the program's main() function, meaning that it will be executed when the resulting program is executed. There, a new workflow object (or "struct" in Go terms) is initiated with a name and the maximum number of cores to use. Next, the workflow components, or processes, are initiated, each with a name and a shell command pattern. Input file names are defined with a placeholder of the form {i:INPORTNAME} and outputs with the form {o:OUTPORTNAME}. The port name is used later to access the corresponding ports when setting up data dependencies. The first component writes the previously defined DNA string to a file, and the file path pattern for its out-port dna is defined (in this case a static file name). The second component translates each DNA base to its complementary counterpart. The file path pattern for its only out-port reuses the file path of the file it will receive on its in-port named in, thus the {i:in} part; the %.txt part removes .txt from the input path. The third component reverses the DNA string. Data dependencies are then defined via the in- and out-ports defined earlier as part of the shell command patterns. Finally, the workflow is run.
SciPipe also provides a few features which are not very common among existing tools, or which do not commonly occur together in one system. These include support for streaming via Unix named pipes, the ability to run push-based workflows up to a specific stage of the workflow, and flexible support for file naming of intermediate data files generated by workflows.
By implementing SciPipe as a library in an existing language, the language's ecosystem of tooling, editor support and third-party libraries can be used directly, to avoid "reinventing the wheel" in these areas. By leveraging the built-in concurrency features of Go, such as goroutines and channels, the developed code base has been kept small compared with similar tools, and also does not require external dependencies for basic usage (some external tools are used for optional features like PDF generation and graph plotting). This means that the code base should be possible to maintain for a single developer or small team, and that the code base is practical to include in workflow developers' own source code repositories, in order to future-proof the functionality of workflows.
Below, we first briefly describe how SciPipe workflows are created. We then describe in some detail the features of SciPipe that are the most novel or improve most upon existing tools, followed by a few more commonplace technical considerations. We finally demonstrate the usefulness of SciPipe by applying it to a set of case study workflows in machine learning for drug discovery and next-generation sequencing genomics and transcriptomics.

Writing workflows with SciPipe
SciPipe workflows are written as Go programs, in files ending with the .go extension. As such, they require the Go tool chain to be installed for compiling and running them. The Go programs can be either compiled to self-contained executable files with the go build command, or run directly, using the go run command.
The simplest way to write a SciPipe program is to write the workflow definition in the program's main() function, which is executed when running the compiled executable file, or when running the file as a script with go run. An example workflow written in this way is shown in the example workflow figure. It consists of three processes and demonstrates a few of the basic features of SciPipe. The first process writes a string of DNA to a file, the second computes the base complement, and the last process reverses the string. All in all, the workflow computes the reverse base complement of the initial string.
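As a cross-check on what the three processes compute together, the same transformation can be sketched in plain Go, independently of SciPipe. The function names complement and reverse are illustrative only and are not part of the SciPipe API; the input string is the one used in the example workflow.

```go
package main

import "fmt"

// complement returns the base complement of a DNA string (A<->T, C<->G).
// Characters other than A, T, C and G are passed through unchanged.
func complement(dna string) string {
	pairs := map[byte]byte{'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
	out := make([]byte, len(dna))
	for i := 0; i < len(dna); i++ {
		if c, ok := pairs[dna[i]]; ok {
			out[i] = c
		} else {
			out[i] = dna[i]
		}
	}
	return string(out)
}

// reverse returns the input string with its bytes in reverse order.
func reverse(s string) string {
	out := []byte(s)
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return string(out)
}

func main() {
	dna := "AAAGCCCGTGGGGGACCTGTTC"
	fmt.Println(reverse(complement(dna))) // prints GAACAGGTCCCCCACGGGCTTT
}
```

In the workflow itself each of these steps is a separate shell-based process that writes its own output file, so that every intermediate result is cached and gets its own audit log.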
As can be seen in the example workflow figure, a workflow object (or struct, in Go terminology) is first initialized, with a name and a setting for the maximum number of tasks to run at a time. Processes are then defined with the Workflow.NewProc() method on the workflow struct, with a name and a command pattern which is very similar to the Bash shell command that would be used to run a command manually, but where concrete file names have been replaced with placeholders of the form {i:INPORTNAME}, {o:OUTPORTNAME} or {p:PARAMETERNAME}. These placeholders define input and output files, as well as parameter values, and work as a sort of template that will be filled in with concrete values as concrete tasks are scheduled and executed.
Output paths to use for output files are defined using the Process.SetOut() method, taking an out-port name and a pattern for how to generate the path. For simple workflows this can be just a static file name, but for more complex workflows with processes that produce more than one output on the same port (e.g. by processing different input files, or using different sets of parameters) it is often best to reuse some of the input paths and parameter values configured earlier in the command pattern to generate a unique path for each output.
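To make the role of the placeholders concrete, the following is a minimal sketch of how a command pattern could be turned into a concrete command once the paths and parameter values for a task are known. It is not SciPipe's implementation: the expand function, the regular expression, and the head-based command are assumptions made for illustration, and only the plain {i:NAME}, {o:NAME} and {p:NAME} forms are handled.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {i:NAME}, {o:NAME} and {p:NAME}.
var placeholder = regexp.MustCompile(`\{([iop]):([^{}|]+)\}`)

// expand replaces each placeholder in pattern with the concrete value
// registered for it. Placeholders without a registered value are left as-is.
// (Hypothetical helper, not part of the SciPipe API.)
func expand(pattern string, inputs, outputs, params map[string]string) string {
	return placeholder.ReplaceAllStringFunc(pattern, func(m string) string {
		parts := placeholder.FindStringSubmatch(m)
		kind, name := parts[1], parts[2]
		var table map[string]string
		switch kind {
		case "i":
			table = inputs
		case "o":
			table = outputs
		case "p":
			table = params
		}
		if v, ok := table[name]; ok {
			return v
		}
		return m
	})
}

func main() {
	cmd := expand("cat {i:in} | head -n {p:lines} > {o:out}",
		map[string]string{"in": "dna.txt"},
		map[string]string{"out": "dna.head.txt"},
		map[string]string{"lines": "1"})
	fmt.Println(cmd) // prints: cat dna.txt | head -n 1 > dna.head.txt
}
```

Because the same pattern is expanded once per task, a single process definition can give rise to many concrete shell commands, one for each combination of input files and parameter values it receives.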
Finally, in-ports and out-ports are connected in order to define the data dependencies between tasks. Here, the in-port and out-port names used in the placeholders in the command patterns described above are used to access the corresponding in-ports and out-ports, and to make connections between them, with a syntax of the general form InPort.From(OutPort).
Figure : Example audit log file in JSON format [ ], for a file produced by a SciPipe workflow. The workflow used to produce this particular audit log is the example workflow described above. The audit information is hierarchical, with each level representing a step in the workflow. The first level contains metadata about the task executed last to produce the output file that this audit log refers to. The field Upstream on each level contains a list of all upstream tasks of the current task, indexed by the file paths that each of the upstream tasks produced, and which were subsequently used by the current task. Each task is given a globally unique ID, which helps to deduplicate any duplicate occurrences of tasks when converting the log to other representations. Execution time is given in nanoseconds. Note that input paths in the command field are prepended with ../, compared to how they appear in the Upstream field. This is because each task is executed in a temporary directory created directly under the workflow's main execution directory, meaning that to access existing data files, the task first has to navigate up one step out of this temporary directory.
The last thing needed to run the workflow is to call the Workflow.Run() method. Provided that the example workflow code is saved in a file named workflow.go, it can be run using the go run command, like so:
    $ go run workflow.go
This will then produce three output files and one accompanying audit log for each file, which can be seen by listing the files in a terminal. The file dna.txt should now contain the string AAAGCCCGTGGGGGACCTGTTC, and dna.compl.rev.txt should contain GAACAGGTCCCCCACGGGCTTT, which is the reverse base complement of the first string. The audit log accompanying the latter file contains the full audit trace for this minimal workflow. An example of the content of this file is shown in the audit log figure.
In this audit log, it can be seen that the commands we executed are available, and also that the Reverse process lists its "upstream" processes, indexed by the input file names in its command. Thus, under the dna.compl.txt input file, we find the Base Complement process together with its metadata, and one further upstream process (the Make DNA process). This hierarchical structure of the audit log ensures that the complete audit trace, including all commands contributing to the production of an output file, is available for each output file from the workflow.
More information about how to write workflows with SciPipe is available on the documentation website [ ]. Note that the full documentation on this website is also available in a folder named docs inside the SciPipe Git repository, which ensures that documentation for the version currently used is always available.

Dynamic scheduling
Since SciPipe is built on the principles of Flow-based programming (see the methods section for more details), a SciPipe program consists of independently and concurrently running processes, which schedule new tasks continually during the workflow run. This is here referred to as dynamic scheduling. This means that it is possible to create a process that obtains a value and passes it on to a downstream process as a parameter, so that new tasks can be scheduled with it. This feature is important in machine learning workflows, where hyperparameter tuning is often employed to find an optimal value of a parameter, such as cost for Support Vector Machines (SVM), which is then used to parametrize the final training part of the workflow.
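The idea can be sketched with plain goroutines and channels, the Go concurrency features SciPipe builds on. The sketch below is not SciPipe code: the functions tune, train and runSweep, the toy scoring function and the train-model command are all hypothetical. It only illustrates how a downstream process can schedule a task with a parameter value that does not exist until an upstream process has produced it.

```go
package main

import "fmt"

// tune stands in for a hyperparameter search: it scores each candidate
// cost value with the given function and sends the best one downstream.
func tune(candidates []float64, score func(float64) float64, best chan<- float64) {
	defer close(best)
	if len(candidates) == 0 {
		return
	}
	top := candidates[0]
	for _, c := range candidates[1:] {
		if score(c) > score(top) {
			top = c
		}
	}
	best <- top
}

// train waits for parameter values that only become known at run time and
// creates one task (here just a command string) per value received.
func train(costs <-chan float64, tasks chan<- string) {
	defer close(tasks)
	for c := range costs {
		tasks <- fmt.Sprintf("train-model --cost %g", c) // hypothetical command
	}
}

// runSweep wires the two processes together and collects the scheduled tasks.
func runSweep(candidates []float64, score func(float64) float64) []string {
	best := make(chan float64)
	tasks := make(chan string)
	go tune(candidates, score, best)
	go train(best, tasks)
	var scheduled []string
	for t := range tasks {
		scheduled = append(scheduled, t)
	}
	return scheduled
}

func main() {
	// Toy score function peaking at cost = 0.1; purely illustrative.
	score := func(c float64) float64 { return -(c - 0.1) * (c - 0.1) }
	for _, t := range runSweep([]float64{0.001, 0.01, 0.1, 1, 5}, score) {
		fmt.Println(t)
	}
}
```

Since the downstream process simply ranges over its input channel, the number of tasks it creates is determined by how many values arrive at run time, not by anything fixed when the network was defined.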

Reusable components
Based on principles from Flow-based programming, the workflow graph in SciPipe is defined by making connections between port objects bound to processes. This makes it possible to keep the dependency graph definition separate from the process definitions. This is in contrast to other ways of connecting dataflow processes, such as with dataflow variables, which are shared between process definitions. This makes processes in flow-based programming fully self-contained, meaning that libraries of reusable components can be created and that components can be loaded on demand when creating new workflows. A graphical comparison between dependencies defined with dataflow variables and flow-based programming ports is shown in the accompanying figure.

Figure :
Comparison between dataflow variables and Flow-based programming ports in terms of dependency definition. a) shows how dataflow variables (blue and green) shared between processes (in gray) make the processes tightly coupled. In other words, process and network definitions get intermixed. b) shows how ports (in orange) bound to processes in Flow-based programming allow keeping the network definition separate from process definitions. This enables processes to be reconnected freely without changing their internals.

Running subsets of workflows
With pull-based workflow tools like Snakemake or Luigi, it is easy to reproduce a particular output file on demand, since the scheduling mechanism is optimized for the use case of asking for a specific file and calculating all the tasks required to be executed based on that.
With push-based workflow tools though, reproducing a specific set of files without running the full workflow is not always straightforward. This is a natural consequence of the push-based scheduling strategy, and dataflow in particular, as the identities and quantities of output files might not be known before the workflow is run.
SciPipe provides a mechanism for partly solving this lack of "on-demand file generation" in push-based dataflow tools, by allowing all files of a specified process to be reproduced on demand. That is, the user can tell the workflow to run all processes in the workflow upstream of, and including, a specified process, while skipping processes downstream of it.
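Selecting the processes to run in this way amounts to a traversal of the workflow graph from the specified process towards its sources. The following is a minimal sketch of such a traversal, assuming the graph is available as a map from each process name to the names of the processes feeding its in-ports. The upstreamOf function and this graph representation are illustrative assumptions, not SciPipe's internal data structures.

```go
package main

import (
	"fmt"
	"sort"
)

// upstreamOf returns the target process plus every process it depends on,
// directly or transitively, sorted by name. deps maps a process name to the
// names of the processes connected to its in-ports.
func upstreamOf(deps map[string][]string, target string) []string {
	seen := map[string]bool{}
	var visit func(name string)
	visit = func(name string) {
		if seen[name] {
			return
		}
		seen[name] = true
		for _, up := range deps[name] {
			visit(up)
		}
	}
	visit(target)
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Dependencies of the three-process example workflow.
	deps := map[string][]string{
		"Base Complement": {"Make DNA"},
		"Reverse":         {"Base Complement"},
	}
	// Running "up to" Base Complement skips the downstream Reverse process.
	fmt.Println(upstreamOf(deps, "Base Complement")) // prints [Base Complement Make DNA]
}
```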
This has turned out to be very useful when iteratively refactoring or developing new pipelines. When a part in the middle of a long sequence of processes needs to be changed, it is helpful to be able to test-run the workflow up to that particular process only, rather than the whole workflow, to speed up the development iteration cycle.

Other characteristics
Below are a few technical characteristics and considerations that are not necessarily unique to SciPipe, but could be of interest to potential users assessing whether SciPipe fits their use cases.

Data centric audit log
The audit log feature in SciPipe collects metadata about every executed task (concrete shell command invocation), which is passed along with every file that is processed in the workflow. It writes a file in the ubiquitous JSON format, with the full trace of tasks executed for every output in the workflow, with the same name as the output file in question but with the additional file extension .audit.json. Thus, for every output in the workflow, it is possible to check the full record of shell commands used to produce it. An example audit log file can be seen in the audit log figure.
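Because the audit log is hierarchical JSON, the full list of commands behind an output file can be recovered with a short recursive traversal. The sketch below illustrates this under stated assumptions: Upstream is the field described in the text, whereas the exact names ID and Command, the listCommands and parseAndList helpers, and the abbreviated sample document (including its shell commands) are hand-written for illustration and are not copied from a real SciPipe audit file.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// auditInfo mirrors only the parts of an audit log used here. Upstream is
// indexed by the file paths produced by the upstream tasks.
type auditInfo struct {
	ID       string
	Command  string
	Upstream map[string]*auditInfo
}

// listCommands returns the commands behind an output file, upstream tasks
// first, visiting each task ID only once.
func listCommands(log *auditInfo, seen map[string]bool) []string {
	if log == nil || seen[log.ID] {
		return nil
	}
	seen[log.ID] = true
	paths := make([]string, 0, len(log.Upstream))
	for p := range log.Upstream {
		paths = append(paths, p)
	}
	sort.Strings(paths) // deterministic order for sibling inputs
	var out []string
	for _, p := range paths {
		out = append(out, listCommands(log.Upstream[p], seen)...)
	}
	return append(out, log.Command)
}

// parseAndList decodes an audit log and lists its commands in dependency order.
func parseAndList(doc string) ([]string, error) {
	var log auditInfo
	if err := json.Unmarshal([]byte(doc), &log); err != nil {
		return nil, err
	}
	return listCommands(&log, map[string]bool{}), nil
}

// sample is a hand-written, abbreviated stand-in for a real audit log.
const sample = `{
  "ID": "t3", "Command": "cat ../dna.compl.txt | rev > dna.compl.rev.txt",
  "Upstream": {"dna.compl.txt": {
    "ID": "t2", "Command": "cat ../dna.txt | tr ATCG TAGC > dna.compl.txt",
    "Upstream": {"dna.txt": {
      "ID": "t1", "Command": "echo AAAGCCCGTGGGGGACCTGTTC > dna.txt", "Upstream": {}
    }}
  }}
}`

func main() {
	cmds, err := parseAndList(sample)
	if err != nil {
		panic(err)
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

Tracking already visited task IDs corresponds to the deduplication mentioned in the audit log figure caption: a task whose output feeds several downstream tasks appears several times in the hierarchy but is listed only once.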
This data-oriented provenance reporting contrasts with the provenance reports common in many workflow tools, which often provide one report per workflow run only, meaning that the link between data and provenance report is not as direct.
The audit log feature in SciPipe in many respects reflects the recommendations in [ ] for developing provenance reporting in workflows, such as producing a coherent, accurate, inspectable record for every output data item from the workflow. By producing provenance records for each data output rather than for the full workflow only, SciPipe could provide a basis for addressing the problem of iteratively developing workflow variants, as outlined in [ ].
SciPipe also loads any existing provenance reports for existing files that it uses, and merges these with the provenance information from its own execution. This means that even if a chain of processing is spread over multiple SciPipe workflow scripts, and executed at different times by different users, the full provenance record is still kept and assembled, as long as all workflow steps were executed using SciPipe shell command processes. The main limitation of this "follow the data" approach concerns data generated externally to the workflow, or by SciPipe components implemented in Go. For external processes, it is up to the external process to generate any reporting. Go-based components in SciPipe cannot currently dump a textual version of the Go code executed. This constitutes an area of future development.
SciPipe provides experimental support for converting the JSON structure into reports in HTML and TeX format, or into executable Bash scripts that can reproduce the file which the audit report describes, from available inputs or from scratch. These tools are available in the scipipe helper command. The TeX report can easily be converted further to PDF using the pdflatex command of the pdfTeX software [ ]. An example of such a PDF report is shown in the audit report figure, which was generated from the audit log for the last file generated by the example workflow.

Atomic writes
SciPipe ensures that cancelled workflow runs do not result in half-written output files being mistaken for finished ones. It does this by executing each task in a temporary folder, and moving all newly created files into their final location after the task is finished. By using a folder for the execution, any extra files created by a tool that are not explicitly configured by the workflow system are captured and treated in an atomic way. An example of where this is needed is the five extra files created by bwa index [ ] when indexing a reference genome in FASTA format.
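The general technique can be sketched with the Go standard library alone. The code below is a simplified illustration and not SciPipe's implementation: the runAtomically and demo functions, the temporary directory prefix and the file names are assumptions. A task writes into a temporary directory created under the work directory, and its files are moved to their final location only if the task completes without error.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// runAtomically creates a temporary directory under workDir, lets task write
// its files there, and moves them into workDir only if task succeeds.
func runAtomically(workDir string, task func(tmpDir string) error) error {
	tmpDir, err := os.MkdirTemp(workDir, "_tmp_task_")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tmpDir) // discards partial output if the task failed
	if err := task(tmpDir); err != nil {
		return err
	}
	entries, err := os.ReadDir(tmpDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		// A rename within one file system replaces the target in a single step.
		src := filepath.Join(tmpDir, e.Name())
		dst := filepath.Join(workDir, e.Name())
		if err := os.Rename(src, dst); err != nil {
			return err
		}
	}
	return nil
}

// demo runs one succeeding and one failing task and returns the names of the
// files that ended up in the work directory.
func demo() ([]string, error) {
	workDir, err := os.MkdirTemp("", "atomic-demo-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(workDir)

	succeeding := func(dir string) error {
		return os.WriteFile(filepath.Join(dir, "done.txt"), []byte("complete\n"), 0o644)
	}
	failing := func(dir string) error {
		if err := os.WriteFile(filepath.Join(dir, "partial.txt"), []byte("half"), 0o644); err != nil {
			return err
		}
		return fmt.Errorf("task interrupted")
	}
	if err := runAtomically(workDir, succeeding); err != nil {
		return nil, err
	}
	_ = runAtomically(workDir, failing) // expected to fail; its output is discarded

	entries, err := os.ReadDir(workDir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

func main() {
	names, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // only the file from the successful task remains
}
```

Moving every entry of the temporary directory, rather than only the declared outputs, is what allows undeclared side files such as those from an indexing tool to be handled in the same way.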

Streaming support
In data-intensive fields like Next-Generation Sequencing, it is common that intermediate steps of pipelines produce large amounts of intermediate data, often multiplying the storage requirements considerably compared to the raw data from sequencing machines [ ]. To help ease these storage requirements, SciPipe provides the ability to optionally stream data between two tasks via Random Access Memory (RAM) instead of saving to disk between task executions. This approach has two benefits. Firstly, the data does not need to be stored on disk, which can lessen the storage requirements considerably. Secondly, it enables the downstream task to start processing the data from the upstream task as soon as the upstream task has started to produce partial output. It thus makes it possible to achieve pipeline parallelism in addition to data parallelism, and can thereby shorten the total execution time of the pipeline.
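The effect of streaming can be illustrated in-process with Go's io.Pipe, which here stands in for the Unix named pipes that SciPipe uses between shell commands. This substitution, as well as the produce, consume and stream functions and the line contents, are assumptions made for the sake of a self-contained sketch.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// produce writes n lines to w one at a time, as an upstream task would,
// and closes w to signal the end of the stream.
func produce(w io.WriteCloser, n int) {
	defer w.Close()
	for i := 1; i <= n; i++ {
		fmt.Fprintf(w, "read_%d\n", i)
	}
}

// consume processes lines as they arrive, without waiting for the producer
// to finish, and returns the upper-cased lines it saw.
func consume(r io.Reader) []string {
	var out []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		out = append(out, strings.ToUpper(sc.Text()))
	}
	return out
}

// stream connects the two stages through an in-memory pipe, so that no
// intermediate file is written.
func stream(n int) []string {
	r, w := io.Pipe()
	go produce(w, n)
	return consume(r)
}

func main() {
	fmt.Println(stream(3)) // prints [READ_1 READ_2 READ_3]
}
```

Both stages run concurrently and the data exists only in transit, which corresponds to the two benefits described above: no intermediate storage, and pipeline parallelism between the upstream and downstream task.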

Flexible file naming and data "caching"
SciPipe allows flexible naming of the file path of every intermediate output in the workflow, based on input file names and parameter values passed to a process. This enables creating meaningful file naming schemes, to make it easy to manually explore and sanity-check outputs from workflows.
Configuring a custom file naming scheme is not required though. If no file path is configured, SciPipe will automatically create a file path that ensures that two tasks with different parameters or input data will never clash, and that two tasks with the same command signature, parameters and input files will reuse the same cached data.
Figure : Audit report for the last file generated by the example workflow, converted to TeX with SciPipe's experimental audit2tex feature and then converted to PDF with pdfTeX. At the top, the PDF file includes summary information about the SciPipe version used and the total execution time. After this follows an execution timeline, in a Gantt-chart style, that shows the relative execution times of individual tasks graphically. After this follows a comprehensive list of tables with information for each task executed towards producing the file to which the audit report belongs. The task boxes are color coded and ordered in the same way that the tasks appear in the timeline.

Known limitations
Below we list some design decisions and known limitations of SciPipe that might affect the decision whether to use SciPipe for a particular use case or not.
Firstly, the fact that writing SciPipe workflows requires some basic knowledge of the Go programming language can be off-putting to users who are not well acquainted with programming. Go code, although having taken inspiration from scripting languages, is still markedly more verbose and low-level in nature than Python, and can take a little longer to get used to.
Secondly, the level of integration with HPC resource managers is currently quite basic compared to some other workflow tools. The SLURM resource manager can readily be used by setting the Prepend field on processes to a string with a call to the salloc SLURM command, but more advanced HPC integration is planned for upcoming versions.
Furthermore, reproducing speci c output les is not as natural and easy as with pull-based tools like Snakemake, although SciPipe provides a mechanism to partly resolve this problem.
Finally, SciPipe does not yet support integration with the Common Work ow Language [ ], for interoperability of work ows. This is a prioritized area for future development.

Case Studies
To demonstrate the usefulness of SciPipe, we have used it to implement a number of representative pipelines from drug discovery and bioinformatics with different characteristics and hence different requirements on the workflow system. These workflows are available in a dedicated git repository on GitHub [ ].

Machine learning pipeline in drug discovery
Figure : Directed graph of the machine learning drug discovery case study workflow, plotted with SciPipe's workflow plotting function. The graph has been modified for clarity by collapsing the individual branches of the parameter sweeps and cross-validation fold generation. The layout has also been manually made more compact to be viewable in print. The collapsed branches are indicated by intervals in the box labels: tr{500-8000} represents branching into the training dataset sizes, and c{0.0001-5.0000} represents the cost values.
The initial motivation for building SciPipe stemmed from problems encountered with complex dynamic workflows in machine learning for drug discovery applications. It was thus quite natural to implement an example of such a workflow in SciPipe. To this end we re-implemented a workflow implemented previously for the SciLuigi library [ ], which was itself based on an earlier study [ ].
In short, this workflow trains predictive models using the LIBLINEAR software [ ] with molecules represented by the signature descriptor [ ]. For linear SVM a cost parameter needs to be tuned, and we tested values ( . , . , . , . , . , . , . , . , . , . , , , , , ) in a -fold cross-validated parameter sweep. Five different training set sizes ( , , , , ) were tested and evaluated with a test set size of . The raw data set consists of , logarithmic solubility values chosen randomly from a dataset extracted from PubChem [ ] according to details in [ ]. The workflow is schematically shown in figure and was plotted using SciPipe's built-in plotting function. The figure has been modified for clarity by collapsing the individual branches of the parameter sweeps and cross-validation folds, as well as by manually making the layout more compact.
The implementation in SciPipe was done by creating components which are defined in separate files (named comp COMPONENTNAME in the repository), and which can thus be reused in other workflows. This shows how SciPipe can successfully be used to create workflows based on reusable, externally defined components.
The fact that SciPipe supports parametrization of workflow steps with values obtained during the workflow run meant that the full workflow could be kept in a single workflow definition, in one file. This also made it possible to create audit logs for the full workflow execution for the final output files, and to create the automatically plotted workflow graph shown in figure . This is in contrast to the SciLuigi implementation, where the parameter sweep to find the optimal cost and the final training had to be implemented in separate workflow files (wffindcost.py and wfmm.py in [ ]) and executed as a large number of completely separate workflow runs (one for each dataset size), which meant that logging became fragmented into a large number of disparate log files.

Genomics cancer-analysis pipeline
Figure : Directed graph of workflow processes in the Genomics / Cancer Analysis pre-processing pipeline, plotted with SciPipe's workflow plotting function. Nodes represent processes, while edges represent data dependencies. The labels on the edge heads and tails represent ports.
Sarek [ ] is an open-source analysis pipeline to detect germ-line or somatic variants from whole genome sequencing, developed by the National Genomics Infrastructure and National Bioinformatics Infrastructure Sweden, which are both platforms at Science for Life Laboratory.
To test and demonstrate the applicability of SciPipe to genomics use cases, the pre-processing part of the Sarek pipeline was implemented in SciPipe. See figure for a directed process graph of the workflow, plotted with SciPipe's workflow plotting function.
The test data in the test workflow consists of multiple samples of normal and tumor pairs. The workflow starts with aligning each sample to a reference genome using BWA [ ] and forwarding the results to Samtools [ ], which saves the result as a sorted BAM file. After each sample has been aligned, Samtools is used again to merge the normal and tumor samples into one BAM [ ] file for tumor samples and one for normal samples. Picard [ ] is then used to mark duplicate reads in both the normal and tumor sample BAM files, whereafter GATK [ ] is used to recalibrate the quality scores of all reads. The outcome of the workflow is two BAM files: one containing all the normal samples and one containing all the tumor samples.
Genomics tools and pipelines have their own set of requirements, which was shown by the fact that some aspects of SciPipe had to be modified in order to ease development of this pipeline. In particular, many genomics tools produce additional output files apart from those specified on the command line. One example of this is the extra files produced by BWA when indexing a reference genome in FASTA format. The bwa index command produces some five files which are not explicitly defined on the command line (with the extensions .bwt, .pac, .ann, .amb and .sa). Based on this realization, SciPipe was amended with a folder-based execution mechanism which executes each task in a temporary folder that keeps all output files separate from the main output directory until the whole task has completed. This ensures that files that are not explicitly defined and handled by SciPipe are also captured and handled in an atomic manner, so that finished and unfinished output files are always properly separated.
Furthermore, agile development of genomics workflows often requires being able to see the full command that is used to execute a tool, because of the many options that are available in many bioinformatics tools. This workflow was thus implemented with ad-hoc commands, which are defined in-line in the workflow. The ability to do this shows that SciPipe supports different ways of defining components, depending on what fits the use case best.
The successful implementation of this genomics pipeline in SciPipe thus both ensures and shows that SciPipe works well for tools common in genomics.
Figure : Directed graph of workflow processes in the RNA-Seq pre-processing workflow, plotted with SciPipe's workflow plotting function. Nodes represent processes, while edges represent data dependencies. The labels on the edge heads and tails represent ports.

RNA-seq / transcriptomics pipeline
To test the ability of SciPipe to work with software used in transcriptomics, some of the initial steps of a generic RNA-sequencing workflow were also implemented in SciPipe. Common steps needed in transcriptomics are to run quality controls and to generate reports of the analysis steps.
The RNA-seq case study pipeline implemented for this paper uses FastQC [ ] to evaluate the quality of the raw data being used in the analysis before aligning the data using STAR [ ]. After the alignment is done, it is evaluated using QualiMap [ ], while the Subread package [ ] is used for feature counting.
The final step of the workflow is to combine all the previous steps for a composite analysis using MultiQC [ ], which summarizes the quality of both the raw data and the result of the alignment into a single quality report. See figure for a directed process graph of the workflow, plotted with SciPipe's workflow plotting function.
The successful implementation of this transcriptomics workflow in SciPipe ensures that SciPipe works well for different types of bioinformatics workflows and is not limited to one specific subfield of bioinformatics.
SciPipe is a programming library that provides a way to write complex and dynamic pipelines in bioinformatics, cheminformatics, and more generally in data science and machine learning pipelines involving command-line applications.
Dynamic scheduling allows parametrizing new tasks with values obtained during the workflow run, and the flow-based programming principles of separate network definition and named ports allow creating a library of reusable components. By having access to the full power of the Go programming language to define workflows, existing tooling is leveraged.
SciPipe adopts state-of-the-art strategies for achieving atomic writes and caching of intermediate files, and provides a data-centric audit log feature that allows identifying the full execution trace for each output, which can be exported into human-readable HTML or TeX/PDF formats, or into executable Bash scripts.
SciPipe also provides some features not commonly found in many tools, such as support for streaming via Unix named pipes, the ability to run push-based workflows up to a specific stage of the workflow, and flexible support for file naming of intermediate data files generated by workflows. SciPipe workflows can also be compiled into standalone executables, which makes deployment of pipelines very easy, requiring only Bash and any external command-line tools used to be present on the target machine.
By being a small library without required external dependencies apart from the Go tool chain and Bash, SciPipe is expected to remain possible to maintain and develop in the future, even without a large team or organization backing it.
The applicability of SciPipe for cheminformatics, genomics and transcriptomics pipelines has been demonstrated with case study workflows in these fields.

Methods
The Go Programming Language
The Go Programming Language (referred to as just "Go") was developed by Robert Griesemer, Rob Pike and Ken Thompson at Google, to provide a statically typed and compiled language that makes it easier to build highly concurrent programs that can also make good use of multiple CPU cores (i.e., "parallel programs") than is the case in widespread compiled languages like C++ [ ]. It aims to achieve this by providing a small, simple language with concurrency primitives (go-routines and channels) built into the language. Go-routines, which are so-called light-weight threads, are automatically mapped, or multiplexed, onto physical threads in the operating system. This means that very large numbers of go-routines can be created while maintaining a low number of operating system threads, such as one per CPU core on the computer at hand. This makes Go well suited for problems where many asynchronously running processes need to be handled at the same time, or "concurrently", and for making efficient use of multi-core CPUs.
The Go compiler statically links all its code as part of the compilation. This means that all dependent libraries are compiled into the executable file. Because of this, SciPipe workflows can be compiled into self-contained executable files without external dependencies apart from the Bash shell and any external command-line tools used by the workflow. This makes deploying Go programs (and SciPipe workflows) to production very easy.
Go programs are very performant, often an order of magnitude faster than interpreted languages like Python, and in the same order of magnitude as the fastest languages, like C, C++ and Java [ ].

Dataflow and Flow-based programming
Dataflow is a programming paradigm oriented around the idea of independent, asynchronously running processes that only talk to each other by passing data between each other. This data passing can happen in different ways, such as via dataflow variables or via first-in-first-out channels.
Flow-Based Programming (FBP) [ ] is a paradigm for programming developed by John Paul Morrison at IBM in the late s / early s, to provide a composable way to assemble programs to be run at mainframe computers at customers such as large banks.
It is a specialized version of dataflow, adding the ideas of separate network definition, named ports, channels with bounded buffers and information packets (representing the data) with defined lifetimes. Just as in dataflow, the idea is to divide a program into independent processing units called "processes", which are allowed to communicate with the outside world and other processes solely via message passing. In FBP, this is always done over channels with bounded buffers which are connected to named ports on the processes. Importantly, the network of processes and channels is in FBP described separately from the process implementations, meaning that the network of processes and channels can be reconnected freely without changing the internals of processes.
This strict separation of the processes, the separation of network structure from processing units, and the loosely coupled nature of their only way of communicating with the outside world (message passing over channels) make flow-based programs extremely composable and naturally component-oriented. Any process can always be replaced with any other process that supports the same format of the information packets on its in-ports and out-ports.
Furthermore, since the processes run asynchronously, FBP is, just like Go, very well suited to make efficient use of multi-core CPUs, where each processing unit can suitably be placed in its own thread or co-routine to spread out over the available CPU cores on the computer. FBP has a natural connection to workflow systems, where the computing network in an FBP program can be likened to the network of dependencies between data and processing components in a workflow [ ]. SciPipe leverages the principles of separate network definition and named ports on processes. SciPipe has also taken some inspiration for its API design from the GoFlow [ ] Go-based flow-based programming framework.
Availability of supporting source code and requirements

Consent for publication
Not applicable.