Alejandra Cervera, Ville Rantanen, Kristian Ovaska, Marko Laakso, Javier Nuñez-Fontarnau, Amjad Alkodsi, Julia Casado, Chiara Facciotto, Antti Häkkinen, Riku Louhimo, Sirkku Karinen, Kaiyang Zhang, Kari Lavikka, Lauri Lyly, Maninder Pal Singh, Sampsa Hautaniemi, Anduril 2: upgraded large-scale data integration framework, Bioinformatics, Volume 35, Issue 19, October 2019, Pages 3815–3817, https://doi.org/10.1093/bioinformatics/btz133
Abstract
Anduril is an analysis and integration framework that facilitates the design, use, parallelization and reproducibility of bioinformatics workflows. Anduril has been upgraded to use Scala for pipeline construction, which simplifies software maintenance and facilitates the design of complex pipelines. Additionally, Anduril’s bioinformatics repository has been expanded with multiple components and tutorial pipelines for next-generation sequencing data analysis.
Freely available at http://anduril.org.
Supplementary data are available at Bioinformatics online.
1 Introduction
Measurement technologies, such as next-generation sequencing, proteomics and automated imaging, are able to produce enormous amounts of data, which have transformed medical research into a data-rich field. Although producing data from biological samples is cost efficient and easy, data analysis and interpretation have become a bottleneck. Computational frameworks that allow systematic, parallel and flexible pipeline design are indispensable for the reproducibility, maintenance and execution of large-scale analyses.
The design of current frameworks is heavily influenced by the expected end-user. Some frameworks, such as Galaxy (Goecks et al., 2010), Taverna (Wolstencroft et al., 2013) and GenePattern (Reich et al., 2006), make existing pipelines easy to run through a graphical user interface, whereas frameworks such as Anduril (Ovaska et al., 2010), Snakemake (Koster and Rahmann, 2012), Ruffus (Goodstadt, 2010) and Nextflow (Di Tommaso et al., 2017) offer more flexibility in pipeline construction and tool integration for users with at least some programming skills. Currently, no single framework caters to all users and the differing demands of all data analysis projects (Leipzig, 2017). Here we present an updated version of the Anduril data analysis and integration framework, designed for bioinformaticians and suited both to laboratories working with a few in-house samples and to those working with considerably larger datasets, e.g. from The Cancer Genome Atlas (TCGA) (Bell et al., 2011), which may require integration of several layers, such as clinical data and outputs of various high-throughput technologies.
The two major improvements in Anduril 2 are (i) the change from a custom-made scripting language to Scala (Odersky et al., 2004), which grants more freedom and flexibility in pipeline construction, and (ii) the expansion of Anduril’s bioinformatics resource bundles. These resources give pipelines built-in support for analyzing data from central technologies in biomedicine, such as high-throughput imaging (Rantanen et al., 2014), exome or whole-genome sequencing, and microRNA and mRNA sequencing (Icay et al., 2016). Other recent additions to the Anduril framework include components for specific analyses, such as methylation extraction and decomposition based on tumor purity (Häkkinen et al., 2018), and components that facilitate general data analysis through a quick interface to the R library dplyr (Wickham et al., 2018) or the Python Data Analysis library (pandas) (Mckinney, 2011). Anduril 2 comes with extensive documentation, which shows not only how to get started, build new pipelines and make use of parallelization, but also how to best exploit the available components and start processing and analyzing high-throughput datasets. Several worked examples are available at https://bitbucket.org/anduril-dev/sequencing/wiki. Anduril 2 is freely extendable and is distributed and licensed as open source software. An overview of the framework is depicted in Figure 1.
2 Software description
Most popular bioinformatics frameworks, including Anduril 2, handle serial and parallel steps, complex dependencies, varied software and data file types, user-defined parameters and deliverables. Below we describe additional features and advantages of Anduril 2.
2.1 Automatic parallelization
Anduril models the component dependencies as a graph and parallelizes independent parts. Generalized prefixing of the process commands enables flexible use of workload managers such as SLURM (Jette et al., 2002) and Sun Grid Engine (qsub).
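The idea of running independent parts of a dependency graph in parallel can be illustrated with a minimal, generic sketch. It is not Anduril's implementation or API: the function `run_pipeline`, the step names and the toy step functions are assumptions made for illustration, and only the Python standard library is used.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(steps, deps, max_workers=4):
    """Run every step, executing mutually independent steps concurrently.

    steps: dict mapping step name -> function taking a dict of upstream results
    deps:  dict mapping step name -> set of step names it depends on
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in steps})
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # All steps whose dependencies have finished form one parallel wave.
            ready = sorter.get_ready()
            futures = {
                name: pool.submit(
                    steps[name], {d: results[d] for d in deps.get(name, ())}
                )
                for name in ready
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)  # unlocks steps that were waiting on this one
    return results


# Hypothetical four-step pipeline: "sort" and "total" both depend only on
# "load", so they can run in the same wave.
steps = {
    "load": lambda up: [3, 1, 2],
    "sort": lambda up: sorted(up["load"]),
    "total": lambda up: sum(up["load"]),
    "report": lambda up: (up["sort"], up["total"]),
}
deps = {"sort": {"load"}, "total": {"load"}, "report": {"sort", "total"}}
results = run_pipeline(steps, deps)
```

The sketch proceeds wave by wave for simplicity; a production scheduler would typically start each step as soon as its own dependencies finish and would hand the work to a cluster scheduler rather than a local thread pool.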
2.2 Reentrancy
Resuming execution at the point of interruption is extremely useful when executing long-running complex pipelines on big datasets, as it spares the user from having to identify the point from which to re-execute or which samples have already been analyzed. It is possible to update components or their parameters and to add samples to the workflow without triggering execution of completed independent steps. This results in significant savings in both computing and programming time.
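One common way to obtain this behavior is to store a fingerprint of each completed step and to skip any step whose fingerprint is unchanged. The sketch below assumes that approach; it is not a description of how Anduril stores its state, and `fingerprint`, `run_step`, the JSON marker files and the toy `count_bases` step are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(name, params):
    """Hash a step's name and parameters into a stable identifier."""
    payload = json.dumps({"step": name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(name, params, func, state_dir, log):
    """Run func(params) unless a result with the same fingerprint is stored."""
    marker = Path(state_dir) / f"{name}.json"
    digest = fingerprint(name, params)
    if marker.exists():
        saved = json.loads(marker.read_text())
        if saved["fingerprint"] == digest:
            return saved["result"]  # completed earlier: skip re-execution
    result = func(params)
    log.append(name)  # record which steps actually ran
    marker.write_text(json.dumps({"fingerprint": digest, "result": result}))
    return result


def count_bases(params):
    """Toy per-sample analysis step."""
    return len(params["sequence"])


samples = {"s1": "ACGT", "s2": "ACGTAC"}
executed = []
with tempfile.TemporaryDirectory() as state_dir:
    # First run: both samples are processed.
    for sample, seq in samples.items():
        run_step(f"count_{sample}", {"sequence": seq}, count_bases, state_dir, executed)
    first_run = list(executed)

    # Second run after adding a sample: only the new sample is processed.
    executed.clear()
    samples["s3"] = "AC"
    for sample, seq in samples.items():
        run_step(f"count_{sample}", {"sequence": seq}, count_bases, state_dir, executed)
    second_run = list(executed)
```

Because the parameters are part of the fingerprint, changing a parameter of one step in this sketch would cause only that step to run again, while untouched samples would still be skipped.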
2.3 Dependency support
An update to a given step, such as a parameter change, causes re-execution of all dependent downstream processes. Components can be annotated to create synthetic dependencies between them when their inputs and outputs are not linked. For example, a component may not produce an output but can modify its environment, such as a database entry, and trigger downstream execution of a component marked as its dependent.
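The propagation rule can be sketched as a graph traversal in which synthetic dependencies are simply extra edges. This is a generic illustration under that assumption, not Anduril's annotation mechanism; `downstream_of` and the step names (`align`, `call_variants`, `annotate`, `update_db`) are hypothetical.

```python
from collections import deque


def downstream_of(changed, edges):
    """Return every step reachable from `changed` along dependency edges.

    edges: iterable of (upstream, downstream) pairs. Data links and
    synthetic (annotation-only) links are treated identically.
    """
    children = {}
    for upstream, downstream in edges:
        children.setdefault(upstream, set()).add(downstream)
    stale, queue = set(), deque([changed])
    while queue:  # breadth-first walk over the dependency graph
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale


# Edges implied by one step consuming another step's output.
data_edges = [("align", "call_variants"), ("call_variants", "annotate")]
# "update_db" writes no output file, but "annotate" reads the database it
# modifies, so the link has to be declared explicitly.
synthetic_edges = [("update_db", "annotate")]

all_edges = data_edges + synthetic_edges
after_align_change = downstream_of("align", all_edges)
after_db_change = downstream_of("update_db", all_edges)
without_annotation = downstream_of("update_db", data_edges)
```

The last call shows why the annotation matters: without the synthetic edge, a change to `update_db` would not be seen as affecting any other step.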
2.4 Bioinformatics resources
More than 400 components and functions for performing common tasks in diverse bioinformatics analyses are available and fully documented (see Fig. 1). Installers for most third-party software supported by Anduril components are included, which simplifies the task of installing the myriad software packages used in standard bioinformatics analysis. Anduril 2 can use its own optional installation or a user-defined one. Furthermore, any component can be run outside Anduril 2 with the same parameters and inputs derived from the pipeline, since the effective configuration of each component is stored in a bash script, which facilitates testing and provides reproducibility.
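To make the idea of a stored, re-runnable configuration concrete, the following sketch renders the settings of one step into a standalone bash script. The script layout, the `input_`, `param_` and `output_` variable prefixes, the function `render_command_script` and the example paths are assumptions for illustration only and do not reproduce the files that Anduril actually writes.

```python
import shlex


def render_command_script(executable, inputs, params, outputs):
    """Render a bash script that re-runs one step with its recorded settings.

    Keys are assumed to be valid shell identifiers.
    """
    lines = ["#!/usr/bin/env bash", "set -euo pipefail"]
    for group, values in (("input", inputs), ("param", params), ("output", outputs)):
        for key, value in sorted(values.items()):
            # Quote values so that spaces and shell metacharacters survive.
            lines.append(f"export {group}_{key}={shlex.quote(str(value))}")
    lines.append(f"exec {shlex.quote(executable)}")
    return "\n".join(lines) + "\n"


# Hypothetical step: a read-trimming script with one input, one parameter
# and one output.
script = render_command_script(
    "./trim_reads.sh",
    inputs={"reads": "/data/sample 1.fastq"},
    params={"min_quality": 20},
    outputs={"trimmed": "/tmp/out.fastq"},
)
```

Keeping every input, parameter and output in one self-contained script means the step can be repeated or debugged from a shell without involving the workflow engine.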
2.5 Ease of integration of new tools or custom analysis
Integrating additional tools into a pipeline is simple, since a user's own or third-party code can be embedded in eval-based components. Adding a new tool to the repository of components, for private or community use, requires only defining its inputs, parameters and outputs in an XML file, ideally with appropriate test cases. Tools like Taverna require third-party software to implement plug-ins before it can be used in pipelines. Both Galaxy and Anduril 2 offer an easy way to build wrappers, but Anduril also supports immediate integration of custom analysis and software in any pipeline (see Supplementary Material).
3 Results and conclusions
3.1 Case study
To illustrate Anduril 2 in data analysis, we studied RNA-seq data from good and poor responding high-grade serous ovarian cancer patients from level 1 TCGA data. We hypothesized that comparing the 10% of patients with the longest response (n = 26) to the 10% with the shortest response (n = 24) would reveal genes and genetic variants that are associated with treatment resistance and disease progression.
An interesting finding emerged from combining the variants with survival analysis. The polymorphism (T->C at chr3: 48695047) in IP6K2 showed the most significant association with survival (P < 0.0001), as shown in Figure 1. IP6K2 activity has been linked to therapy response in ovarian cancer (Morrison et al., 2002), but the mechanism by which IP6K2 mediates apoptosis is still unclear (Nagata et al., 2005). Figures and reports produced for this case study, as well as the pipelines for processing and analyzing the data, are available at http://csbi.ltdk.helsinki.fi/p/anduril2.
3.2 Conclusions
With many frameworks for data analysis available, the choice of which to adopt needs to take into consideration the skills and backgrounds of the users, as well as the needs of the projects. When compared with Galaxy and Taverna, Anduril 2 offers ease of integration of new tools and custom analysis as well as batch processing. Frameworks like Luigi (https://github.com/spotify/luigi) handle efficient execution of pipelines, but do not provide any bioinformatics-related components. Cromwell + WDL (https://software.broadinstitute.org/wdl/), a workflow management system, can mimic the dynamic for-loops of Anduril 2 with a scattering control flow, although nested scattering and parametric data typing are not supported. A comparison of Anduril 2 to other frameworks is shown in Supplementary Material. Current and future work on Anduril 2 includes integration of new tools and data types for single-cell transcriptomics and proteomics data analysis, as well as extended support for Docker-based components and Kubernetes integration.
Acknowledgements
We thank CSC – IT Center for Science Ltd. for computing resources as well as the current and past Hautaniemi lab members for their contributions to components and documentation. The results published here are in part based upon data generated by The Cancer Genome Atlas (TCGA) managed by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). Information about TCGA can be found at http://cancergenome.nih.gov.
Funding
This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. [66740], the Academy of Finland [project 305087 and 292402], the Sigrid Jusélius Foundation, and Finnish Cancer Associations.
Conflict of Interest: none declared.
References