Interoperable and scalable metabolomics data analysis with microservices

Developing a robust and performant data analysis workflow that integrates all necessary components, while still being able to scale over multiple compute nodes, is a challenging task. We introduce a generic method based on the microservice architecture, where software tools are encapsulated as Docker containers that can be connected into scientific workflows and executed in parallel using the Kubernetes container orchestrator. The access point is a virtual research environment which can be launched on demand on cloud resources and desktop computers. IT-expertise requirements on the user side are kept to a minimum, and established workflows can be re-used effortlessly by any novice user. We validate our method in the field of metabolomics on two mass spectrometry studies, one nuclear magnetic resonance spectroscopy study and one fluxomics study, showing that the method scales dynamically with increasing availability of computational resources. We achieved a complete integration of the major software suites, resulting in the first turn-key workflow encompassing all steps for mass-spectrometry-based metabolomics, including preprocessing, multivariate statistics, and metabolite identification. The microservice approach is a generic methodology that can serve any scientific discipline and opens up new types of large-scale integrative science.


Introduction
Metabolomics studies measure the occurrence, concentrations and changes of small molecules (metabolites) in organisms, organs, tissues, cells and cellular compartments. Metabolite abundances are assayed in the context of environmental or dietary changes, disease or other conditions 1 . Metabolomics experimental measurements are performed using a variety of spectroscopic methods; the two most common ones are Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR). The use of metabolomics as a molecular phenotyping technique is growing across all biomedical domains, due to its ability to reflect the influence of external factors to which an organism is exposed, such as stress, nutrition and disease, subsumed under the term 'exposome' 2 . Metabolomics data analysis has matured over the years, but is still largely developed and performed at a laboratory level, with the use of conventional computing solutions and little standardisation for reproducible research. The PhenoMeNal (Phenome and Metabolome aNalysis) project (http://phenomenal-h2020.eu/home/about) was conceived to ameliorate this situation by bringing advancements in computing architecture and technology into a modern and easily deployed e-infrastructure, i.e., a computing environment combining hardware and software technology as well as required protocols and data resources, tailored specifically for efficient processing and analysis of molecular phenotype data.
Metabolomics is, as most other omics technologies, characterized by the use of high-throughput experiments that produce large amounts of data 3 . With increasing data size and number of samples, the analysis process becomes intractable for desktop computers due to requirements on compute cores, memory, storage, etc. As a result, large-scale computing infrastructures have become important components in scientific projects 4 . Moreover, making use of such complex computing resources in an analysis workflow presents its own challenges, including achieving efficient job parallelism and scheduling as well as error handling 5 . In addition, configuring the necessary software tools and chaining them together into a complete re-runnable analysis workflow commonly requires substantial IT expertise, while creating portable and fault-tolerant workflows with a robust audit trail is even more difficult.
Currently, the most common large-scale computational infrastructures in science are shared High-Performance Computing (HPC) systems. Such systems are usually designed primarily to support computationally intensive batch jobs, e.g., for the simulation of physical processes, and are managed by specialized system administrators. This model leads to rigid constraints on the way these resources can be used. For instance, the installation of software must undergo approval and may be restricted, which contrasts with the needs in omics analysis, where a multitude of software components of various versions, and their dependencies, are needed, and where these need to be continuously updated.
Cloud computing offers a compelling alternative to shared HPC systems, with the possibility to instantiate and configure on-demand resources such as virtual computers, networks, and storage, together with operating systems and software tools. Users only pay for the time the virtual resources are used, and when they are no longer needed they can be released and incur no further costs for usage or ownership. A few examples of cloud-based systems for metabolomics include XCMS ONLINE 6 , Chorus (chorusproject.org) and The Metabolomics Workbench (www.metabolomicsworkbench.org), all of which provide virtual environments that scale with computational demands. However, these applications provide limited flexibility in terms of incorporating and maintaining tools as well as constructing and using customizable workflows.
Along with infrastructure provisioning, software provisioning -i.e., installing and configuring software for users -has also advanced. Consider, for instance, containerization 7 , which allows entire applications with their dependencies to be packaged, shipped and run on a computer but isolated from one another in a way analogous to virtual machines, yet much more efficiently.
Containers are more compact, and since they share the same operating system kernel, they are fast to start and stop and incur little overhead in execution. These traits make them an ideal solution for implementing lightweight microservices, a software engineering methodology in which complex applications are divided into a collection of smaller, loosely coupled components that communicate over a network 8 . Microservices share many properties with traditional always-on web services found on the Internet, but microservices are generally smaller and portable, and can be started on demand within a separate computing environment. Another important feature of microservices is that they use a technology-agnostic communication protocol, and hence can serve as building blocks that can be combined and reused in multiple ways 9 .
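The idea of a tool wrapped as a small, network-addressable service with a technology-agnostic protocol can be sketched in a few lines of standard-library Python. This is purely illustrative and not part of the PhenoMeNal codebase: a hypothetical peak-counting step is exposed over HTTP with JSON messages, so any client, written in any language, could invoke it.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

class PeakCountHandler(BaseHTTPRequestHandler):
    """A toy 'tool' wrapped as a service: counts intensities above a threshold."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        peaks = sum(1 for i in payload["intensities"] if i > payload["threshold"])
        body = json.dumps({"peaks": peaks}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def start_service():
    """Start the service on a free local port; returns the server object."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), PeakCountHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def call_service(port, intensities, threshold):
    """Language-agnostic client side: plain HTTP POST with a JSON body."""
    req = Request(
        f"http://127.0.0.1:{port}/",
        data=json.dumps({"intensities": intensities, "threshold": threshold}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["peaks"]

if __name__ == "__main__":
    server = start_service()
    print(call_service(server.server_address[1], [1.0, 5.0, 9.5, 0.2], 4.0))  # prints 2
    server.shutdown()
```

Because the contract is just HTTP plus JSON, the same service could be containerized and replaced by an implementation in any other language without clients noticing, which is the reuse property described above.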
Microservices are highly suitable to run in elastic cloud environments that can dynamically grow or shrink on demand, enabling applications to be scaled up by simply starting multiple parallel instances of the same service. However, to achieve effective scalability, a system needs to be appropriately sectioned into microservice components, and the data to be exchanged between the microservices needs to be defined for maximum efficiency, both of which are challenging tasks.
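A minimal, stdlib-only illustration of this scale-up pattern (worker threads stand in for separately deployed service instances, and `normalise` is a hypothetical stateless step, not a PhenoMeNal tool): the same service is replicated N times behind a pool, and the output is identical whatever the degree of parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def normalise(sample):
    """A stateless 'service': scale a sample's intensities to unit maximum."""
    peak = max(sample)
    return [x / peak for x in sample]

def run_with_instances(samples, n_instances):
    # Start n_instances identical copies of the service and fan the samples
    # out over them; more instances means more throughput, but because the
    # service is stateless the results do not depend on the parallelism.
    with ThreadPoolExecutor(max_workers=n_instances) as pool:
        return list(pool.map(normalise, samples))

if __name__ == "__main__":
    samples = [[1.0, 2.0, 4.0], [3.0, 6.0], [5.0, 5.0, 10.0]]
    assert run_with_instances(samples, 1) == run_with_instances(samples, 4)
```

Statelessness is the design choice that makes this safe: any instance can handle any sample, so an orchestrator is free to add or remove instances as the cloud environment grows or shrinks.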
In this manuscript, we present a method which uses components for metabolomics data analysis encapsulated as microservices and connected into computational workflows to provide complete, ready-to-run, reproducible data analysis solutions that can be easily deployed on desktop computers as well as public and private clouds. Our approach requires virtually no involvement in the setup of computational infrastructure and no special IT skills from the user. We validate the method on four metabolomics studies and show that it enables scalable and interoperable data analysis.

Microservices
In order to construct a microservice architecture for metabolomics, we used Docker 10 (https://www.docker.com/) containers to encapsulate a large suite of software tools (see Table S1). To automate the instantiation of this cloud-portable microservice-based system and its components for metabolomics analysis, we developed a Virtual Research Environment (VRE) which uses Kubernetes (https://kubernetes.io/) to orchestrate containers over multiple compute nodes. Scientists can interact with the microservices programmatically via an Application Programming Interface (API) or via a web-based graphical user interface (GUI). Container sources are hosted in a repository such as GitHub and are subject to continuous integration testing. The containers that satisfy the testing criteria are pushed to a public container repository, and containers that are included in stable VRE releases are also pushed to Biocontainers 9 .

Demonstrator 1: Scalability of microservices in a cloud environment in the analysis of a human renal proximal tubule cells dataset
The objective of this analysis was to demonstrate the scalability of an existing workflow in a cloud environment.

Demonstrator 2: Start-to-end LC-MS-analysis workflow on Multiple Sclerosis data
The objective of this analysis was to demonstrate interoperability as well as to present a real-world scenario in which patients' data are processed using a microservices-based platform.

Demonstrator 3: 1D NMR-analysis workflow on human type 2 diabetes mellitus data
This NMR-based metabolomics study was originally performed by Salek et al. 17

Demonstrator 4: Fluxomics workflow on stable isotope resolved metabolomics data
The data (http://www.ebi.ac.uk/metabolights/MTBLS412) were analysed from raw mass spectra contained in netCDF files, using the workflow illustrated in Figure 6. The result was a detailed description of the magnitudes of the fluxes through the reactions accounting for glycolysis and the pentose phosphate pathway.

Discussion
Implementing the different tools and processing steps of a data analysis workflow as separate services that are made available over a network was in the spotlight in the early 2000s 21 as service-oriented architectures (SOA) in science. At that time, web services were commonly deployed on physical hardware and exposed and consumed publicly over the internet. However, it soon became evident that this architecture did not fulfill its promises, as it did not scale well from a computational perspective. In addition, the web services were not portable, and mirroring them was complicated (if at all possible). Furthermore, API changes and frequent service outages made it frustrating to connect them into functioning computational workflows. Ultimately, replicating an analysis on local and remote hardware (such as a computer cluster) was very difficult due to heterogeneity in the computing environments.
At first sight, microservices might seem similar to the abovementioned SOA web services, but microservices are generally executed in virtual environments (abstracting over OS and hardware architectures) in such a way that they are only instantiated and executed on demand, and then terminated when they are no longer needed. This makes such virtual environments inherently portable, and they can be launched on demand on different platforms (e.g., a laptop, a powerful physical server or an elastic cloud environment). A key aspect is that workflows are still executed identically, agnostic of the underlying hardware platform. Container-based microservices provide wide flexibility in terms of versioning, allowing the execution of newer and older versions of each container as needed for reproducibility. Since all software dependencies are encompassed within the container, which is versioned, the risk of workflow failure due to API changes is minimized. An orchestration framework such as Kubernetes further allows for managing errors in execution and transparently handles the restarting of services.
Hence, technology has caught up with service-oriented science, and microservices have taken the methodology to the next level, alleviating many of the previous problems related to scalability, portability and interoperability of software tools. This is advantageous in the context of omics analysis, which produces multidimensional data sets reaching beyond gigabytes into terabytes, leading to ever-increasing demands on processing performance 22,23 .
In Demonstrator 1, we showed that microservices enable highly efficient and scalable data analyses by executing individual modules in parallel, and that they effectively harmonize with the on-demand elasticity of the cloud computing paradigm. The achieved scaling efficiency of ~88% indicates remarkable performance on generic cloud providers. Furthermore, although our results in positive ionization mode were slightly different from those of Ranninger et al. 15 , the results of our analysis were reproducible regardless of the platform used to perform the computations, indicating a level of replicability of study results and reusability of workflows that, to the best of our knowledge, has never been reported before in metabolomics data analysis.
In addition to the fundamental demand for high performance, the increased throughput and complexity of omics experiments has led to a large number of sophisticated computational tools 24 , which in turn necessitates integrative workflow engines 25 . In order to integrate new tools in such workflow engines, compatibility of the target environment, tools and APIs needs to be considered 25 . This was exemplified in Demonstrator 2, where a complete start-to-end workflow was run on the Galaxy platform on a secure server at Uppsala University Hospital, Sweden, leading to the identification of novel disease fingerprints in the CSF metabolome of RRMS and SPMS patients. It is worth mentioning that the selected metabolites were part of tryptophan metabolism (alanyltryptophan and indoleacetic acid) and endocannabinoids (linoleoyl ethanolamide), both of which have been previously implicated in multiple sclerosis [27][28][29][30][31][32] . However, since the cross-validated predictive performance (Q2Y = 0.286) is not much higher than that of some of the models generated after random permutation of the response (Figure 4A), the quality of the model needs to be confirmed in a future study on an independent cohort of larger size.
In Demonstrator 3, we highlighted the fact that the microservice architecture is indeed domain-agnostic and is not limited to a particular assay technology, i.e., mass spectrometry. Using a fully automated 1D NMR workflow, we showed that the pattern of metabolite abundances differs between type 2 diabetic patients and healthy controls, and that a large number of metabolites contribute to this separation. The preprocessing of NMR-based experiments can be performed with minimal effort on other studies (i.e., simply by providing a MetaboLights accession number), making it possible to re-analyze data and compare the results with the findings of the original publication. Furthermore, it demonstrates the value of standardised dataset descriptions using nmrML 33 and the ISA format 34 , 35 for representing NMR-based studies, as well as the potential of the PhenoMeNal VRE to foster reproducibility.
A complete understanding of metabolic function implies not only a complete metabolic profile, but also knowledge of the associated distribution of metabolic fluxes in the metabolic network. In Demonstrator 4, the microservices architecture is applied to flux distributions derived from the application of stable isotope resolved metabolomics. Here we showed a high rate of glycolysis in cells cultured under hypoxia, which is consistent with that expected for endothelial cells 36 , and provides further confirmation of how these cells maintain energy in low-oxygen environments without oxidative phosphorylation 37 , 38 .
While microservices are not confined to metabolomics and are generally applicable to a large variety of applications, there are some important implications and limitations of the method. Firstly, tools need to be containerized in order to operate in the environment. This is, however, not particularly complex, and an increasing number of developers provide containerized versions of their tools on public container repositories such as Dockerhub or Biocontainers 9 . Secondly, uploading data to a cloud-based system can take a considerable amount of time, and having to re-do this every time a VRE is instantiated can be time-consuming. This can be alleviated by using persistent storage on a cloud resource, but the availability of such storage varies between cloud providers. Further, the storage system can become a bottleneck when many services try to access a shared storage. We observe that using a distributed storage system with multiple storage nodes can drastically increase performance, and the PhenoMeNal VRE comes with a distributed storage system by default. When using a workflow system to orchestrate the microservices, stability and scalability are inherently dependent on the workflow system's job runner. We observed that in the Galaxy workflow engine, executing a large number of jobs resulted in the VRE becoming unresponsive, whereas the Luigi engine did not have these shortcomings. Although this problem can be resolved by defining the required resources in the Galaxy job runner for each tool, the issue of knowing how much computational resource a specific tool needs remains. This can be partially addressed by having tool and workflow developers estimate the required resources for their tools and workflows. With cloud and microservices maturing, workflow systems will need to evolve and further embrace the new possibilities of these infrastructures.
Also, not all research can be easily pipelined; for example, exploratory research might be better carried out in an ad-hoc manner than with workflows and the overhead they imply. Jupyter Notebooks, as used in Demonstrator 1 or embedded in Galaxy 39 , constitute a promising way to make use of microservices for interactive analysis.
In summary, we showed that microservices allow for efficiently scaling up analyses on multiple computational nodes, enabling the processing of large data sets. By applying a number of data standards (mzML 40 , nmrML) and metadata standards (ISA serialisations for study descriptions 34 , 35 ), we also demonstrated a level of interoperability which has never been achieved in the context of metabolomics, by providing completely automated start-to-end analysis workflows for mass spectrometry and NMR data. The PhenoMeNal VRE realizes the notion of "bringing compute to the data" by enabling the instantiation of complete virtual infrastructures close to large datasets that could not be uploaded over the internet, and can also be launched close to ELSI-sensitive data that is not allowed to leave a secure computing environment. While the current PhenoMeNal VRE implementation uses Docker for software containers and Kubernetes for container orchestration, the microservice methodology is general and not restricted to these frameworks. In addition, we emphasise that the presented methodology goes beyond metabolomics and can be applied to virtually any field, lowering the barriers for taking advantage of cloud infrastructures and opening up for large-scale integrative science.

Main figures

Figure: Preprocessing workflow used in Demonstrator 1 to illustrate the scalability of a microservice approach. The preprocessing workflow is composed of 5 OpenMS tasks that were run in parallel over the 12 groups in the dataset using the Luigi workflow system. The first two tasks, peak picking (528 tasks) and feature finding (528 tasks), are trivially parallelizable, hence they were run concurrently for each sample. The subsequent feature linking task needs to process all of the samples in a group at the same time, therefore 12 of these tasks were run in parallel. In order to maximize parallelism, each feature linker container (microservice) was run on 2 CPUs. Feature linking produces a single file for each group, which can be processed independently by the last two tasks: file filter (12 tasks) and text exporter (12 tasks).