Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv

Abstract

Background: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.

Results: Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups.

Conclusions: The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.


Introduction
Out of the many big data domains, genomics is considered "the most demanding" with respect to all stages of the data lifecycle, from acquisition and storage to distribution and analysis [1]. As genomic data is growing at an unprecedented rate due to improved sequencing technologies and reduced cost, it is currently challenging to analyse the data at a rate matching its production. With data growing exponentially in size and volume, the practice of performing computational analyses using workflows has overtaken more traditional research methods using ad-hoc scripts, which were the typical modus operandi over the last few decades [2,3]. Scientific workflow design and management has become an essential part of many computationally driven data-intensive analyses, enabling Automation, Scaling, Adaptation and Provenance support (ASAP) [4]. Increased use of workflows has driven rapid growth in the number of computational data analysis workflow management systems (WMSs), with hundreds of heterogeneous approaches now existing for workflow specification and execution [5]. There is an urgent need for a common format and standard to define workflows and enable sharing of analysis results using a given workflow environment.
Common Workflow Language (CWL) [11] has emerged as a workflow definition standard designed to enable portability, interoperability and reproducibility of analyses between workflow platforms. CWL has been widely adopted by more than 20 organisations, providing an interoperable bridge overcoming the heterogeneity of workflow environments. Whilst a common standard for workflow definition is an important step towards interoperable solutions for workflow specifications, sharing and publishing the results of these workflow enactments in a common format is equally important. Transparent and comprehensive sharing of experimental designs is critical to establish trust and ensure authenticity, quality and reproducibility of any workflow-based research result. Currently there is no common format defined and agreed upon for interoperable workflow archiving or sharing [12].
In this paper, we utilize open-source standards such as CWL together with related efforts such as Research Objects (ROs) [13], BagIt [14] and PROV [15] to define CWLProv, a format for the interoperable representation of a CWL workflow enactment. We focus on production of a workflow-centric executable RO as the final result of a given CWL workflow enactment. This RO is equipped with the artefacts used in a given execution including the workflow inputs, outputs and, most importantly, the retrospective provenance. This approach enables the complete sharing of a computational analysis such that any future CWL-based workflow can be re-run, provided that the best practices discussed later for software environment provision are followed.
The concept of workflow-centric ROs has been previously considered [13,16,17] for structuring the analysis methods and aggregating digital resources utilized in a given analysis. The generated ROs in these studies typically aggregate data objects, example inputs, workflow specifications, attribution details and details about the execution environment, amongst various other elements. These previous efforts were largely tied to a single platform or a single Workflow Management System (WMS). CWLProv aims to provide a platform-independent solution for workflow sharing, enactment and publication. All the standards and vocabularies used to design CWLProv have an overarching goal to support a domain-neutral and interoperable solution (detailed in Section Applied Standards and Vocabularies).
The contributions of this work are summarized in the Key Points section, and the remainder of this paper is structured as follows. In Section Background and Related Work we discuss the key concepts and related work, followed by a summary of the published best practices and recommendations for workflow representation and sharing in Section Levels of Provenance and Resource Sharing. This section also details the hierarchical provenance framework that we define to provide a principled approach for provenance capture and method sharing. Section CWLProv 0.6.0 and utilized standards introduces CWLProv and outlines its format, structure and the details of the standards and ontologies it utilizes. Section Practical Realisation of CWLProv presents the implementation details of CWLProv using cwltool [10], and Section CWLProv Evaluation with Bioinformatics Workflows demonstrates and evaluates the implemented module for three existing workflow case studies. We discuss the challenges of interoperable workflow sharing and the limitations of the proposed solution, listing several possible future research directions, in Section Discussion and Future Directions before finally drawing conclusions on the work as a whole in Section Conclusion.

Background and Related Work
This work draws upon a range of topics such as Provenance and Interoperability. We define these here to provide better context for the reader.

Provenance
A number of studies have advocated the need for complete provenance tracking of scientific workflows to ensure transparency, reproducibility, analytic validity, quality assurance and attribution of (published) research results [18]. The term Provenance is defined by the World Wide Web Consortium (W3C) [19] as: "Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness." In the context of workflows, three kinds of provenance are distinguished: Retrospective Provenance, Prospective Provenance and Workflow Evolution. Retrospective Provenance refers to the detailed record of the implementation of a computational task, including the details of every executed process together with comprehensive information about the execution environment used to derive a specific product. Prospective Provenance refers to the 'recipes' used to capture a set of computational tasks and their order, e.g. the workflow specification [20]. This is typically given as an abstract representation of the steps (tools/data analysis steps) that are necessary to create a particular research output, e.g. a data artefact. Workflow Evolution refers to the tracking of any alteration in an existing workflow resulting in another version of the workflow that may produce either the same or different resultant data artefacts [21]. In this work, our focus is mainly on improving the representation and capture of Retrospective Provenance.

Interoperability
The concept of interoperability varies in different domains. Here we focus on computational interoperability, defined as "the ability of two or more components or systems to exchange information and to use the information that has been exchanged" [22].
The focus of this study is to propose and devise methods to achieve syntactic, semantic and pragmatic interoperability as defined in the Levels of Conceptual Interoperability Model (LCIM) [23]. Syntactic interoperability is achieved when a common data format for information exchange is unambiguously defined. The next level of interoperability, referred to as semantic interoperability, is reached when the content of the actual information exchanged is unambiguously defined. Once there is an agreement about the format and content of the information, pragmatic interoperability is achieved when the context, application and use of the shared information and data exchanged is also unambiguously defined. In the section Evaluation Results, we relate these general definitions to specific workflow applications with respect to workflow-centric ROs and describe to what extent these interoperability requirements are addressed.

Related Work
We focus on relevant studies and efforts trying to resolve the issue of availability of required resources used in a given computational analysis. In addition, we cover efforts directed towards provenance capture of workflow enactments. As these concepts have been around for a considerable time, we restrict our attention to scientific workflows and studies related to the bioinformatics domain.

Workflow Software Environment Capture
Freezing and packaging the run-time environment to encompass all the software components and their dependencies used in an analysis is a recommended and widely adopted practice [24], especially after the use of cloud computing resources, where images and snapshots of the cloud instances are created and shared with fellow researchers [25]. Nowadays, preservation and sharing of the software environment, e.g. in open access repositories, is becoming a regular practice in the workflow domain as well. Leading platforms managing infrastructure and providing cloud computing services and configuration on demand include DigitalOcean [26], Amazon Elastic Compute Cloud [27], Google Cloud Platform [28] and Microsoft Azure [29]. The instances launched on these platforms can be saved as snapshots and published with an analysis study to later recreate an instance representing the computing state at analysis time.
Using "system-wide packaging" for data-driven analyses, although simplest on the part of the workflow developers and researchers, has its own caveats. One notable issue is the size of the snapshot: as it captures everything in an instance at a given time, the size can range from a few gigabytes to many terabytes. To distribute research software and share execution environments, various lightweight container-based virtualisation technologies and package managers are emerging, including Docker, Singularity, Debian Med and Bioconda.
Docker [30] is a lightweight container-based virtualisation technology that facilitates the automation of application development by archiving software systems and environments to improve portability of the applications on many common platforms including Linux, Microsoft Windows, Mac OS X and cloud instances. Singularity [31] is also a cross-platform open source container engine, specifically supporting High Performance Computing (HPC) resources. An existing Docker format software image can be imported and used by the Singularity container engine. Debian Med [32] contributes packages for medical practice and biomedical research to the Debian Linux distribution, lately also including workflows [8]. Bioconda [33] packages, based on the open source package manager Conda [34], are available for Mac OS X and Linux environments, directed towards availability and portability of software used in the life science domain.

Data/Method Preservation, Aggregation & Sharing
Preserving and sharing only the software environment is not enough to verify results of any computational analysis or reuse the methods used (e.g. workflows) with a different dataset. It is also necessary to share other details including data (example or the original), scripts, workflow files, input configuration settings, the hypothesis of the experiment and any/all trace/logging information related to "what happened", i.e. the retrospective provenance of the actual workflow enactment. [...] [38]. These repositories facilitate collaborative research, in addition to public sharing of source code and the results of a given analysis. There is however no agreed format that must be followed when someone shares artefacts associated with an analysis. As a result, the quality of the shared resources can range from a highly annotated, properly documented and complete set of artefacts, to raw data with undocumented code and incomplete information about the analysis as a whole. Individual organisations or groups might provide a set of "recommended practices", e.g. in readme files, to attempt to maintain the quality of shared resources. The initiative Code as a Research Object [39] is a joint project between Figshare, GitHub and Mozilla Science Lab [40] and aims to archive any GitHub code repository to Figshare and produce a Digital Object Identifier (DOI) to improve the discovery of resources.
Reprozip [41] aims to resolve portability issues by identifying and packaging all dependencies in a self-contained package which, when unpacked and executed on another system (with Reprozip installed), should reproduce the methods and results of the analysis. Each package also contains a human readable configuration file containing provenance information obtained by tracing system calls during system execution. The corresponding provenance trace is however not formatted using existing open standards established by the community.
Several platform-dependent studies have been targeted towards extensions to existing standards by implementing the Research Object model and improving aggregation of resources. Belhajjame et al. [13] proposed the application of ROs to develop workflow-centric ROs containing data and metadata to support the understandability of the utilized methods (in this case workflow specifications). They explored five essential requirements for workflow preservation and identified data and metadata that could be stored to satisfy the said requirements. These requirements include providing example data, preserving workflows with provenance traces, annotating workflows, tracking the evolution in workflows and packaging the auxiliary data and information with workflows. They proposed extensions to existing ontologies such as Object Reuse and Exchange (ORE), the Annotation Ontology (AO) and PROV-O, with four additional ontologies to represent workflow-specific information. However, as stated in the paper, the scope of the proposed model at that time was not focused on interoperability of heterogeneous workflows, as it was demonstrated for a workflow specific to the Taverna WMS using myExperiment, which makes it quite platform-dependent.
A domain-specific solution is proposed by Gomez-Perez et al. [42] by extending the RO model to equip workflow-centric ROs with information catering for the specific needs of the Earth Science community, resulting in enhanced discovery and reusability by experts. They demonstrated that the principles of ROs can support extensions to generate aggregated resources leveraging domain-specific knowledge. Hettne et al. [16] used three genomic workflow case studies to demonstrate the utilisation of ROs to capture methods and data supporting querying and useful extraction of information about the scientific investigation under observation. The solution was tightly coupled with the Taverna WMS and hence, if shared, would not be reproducible outside of the Taverna environment. Other notable efforts to use ROs for workflow preservation and method aggregation include [7] in systems biology, [43] in clinical settings and [9] in precision medicine.

Provenance Capture & Standardization
A range of standards for provenance representation have been proposed. Many studies have emphasized the need for provenance focusing on aspects such as scalability, granularity, security, authenticity, modelling and annotation [18]. They identify the need to support standardized dialogues to make provenance interoperable. Many of these were used as inputs to initial attempts at creating a standard Provenance Model to tackle the often inconsistent and disjointed terminology related to provenance concepts. This ultimately resulted in the specification of the Open Provenance Model (OPM) [44] together with an open-source model for the governance of OPM [45]. Working towards similar goals of interoperability and standardization of provenance for web technologies, the World Wide Web Consortium (W3C) Provenance Incubator Group [46] and the authors of OPM together set the fourth provenance challenge at the International Provenance and Annotation Workshop, 2010 (IPAW'10), which later resulted in PROV, a family of documents serving as the conceptual model for provenance capture, its representation, sharing and exchange over the Web [47] regardless of the domain or platform. Since then, a number of studies have proposed extensions to this domain-neutral standard. The model is general enough to be adapted to any field and flexible enough to allow extensions for specialized cases.
Michaelides et al. [48] presented a domain-specific PROV-based solution for retrospective provenance to support portability and reproducibility of a statistical software suite. They captured the essential elements from the log of a workflow enactment and represented them using an intermediate notation. This representation was later translated to PROV-N and used as the basis for the PROV Template System. A Linux-specific system provenance approach was proposed in [49], where the authors demonstrated retrospective provenance capture at the system level. Another project, UniProv, is working to extract information from Unicore middleware and transform it into a PROV-O representation to facilitate the back-tracking of workflow enactments [50]. Other notable domain-specific efforts leveraging the established standards to record provenance and context information are PROV-man [51], PoeM [52] and micropublications [53].
Platforms such as VisTrails and Taverna have built-in retrospective provenance support. Taverna [7] implements an extensive provenance capture system, TavernaProv [54], utilising both PROV ontologies and ROs aggregating the resources used in an analysis. VisTrails [55] is an open source project supporting platform-dependent provenance capture, visualisation and querying for extraction of required information about a workflow enactment. The authors of [41] provide an overview of PROV terms and how they can be translated from the VisTrails schema and serialized to PROV-XML. WINGS [56] can report fine-grained workflow execution provenance as Linked Data using the OPMW ontology [57], which builds on both PROV-O and OPM.
All these efforts are fairly recent and use a standardized approach to provenance capture and hence are relevant to our work on the capture of retrospective provenance. However, our aim is a domain-neutral and platform-independent solution that can be easily adapted for any domain and shared across different platforms and operating systems.
As evident from the literature, there are efforts in progress to resolve the issues associated with effective and complete sharing of computational analysis including both the results and provenance information. These studies range from highly domain-specific solutions and platform-dependent objects to open source, flexible, interoperable standards. CWL has widespread adoption as a workflow definition standard, hence is an ideal candidate for portable workflow definitions. The next section investigates existing studies focused on workflow-centric science, and summarises best practice recommendations put forward in these studies. From this we define a hierarchical provenance and resource sharing framework.

Levels of Provenance and Resource Sharing
Various studies have empirically investigated the role of automated computational methods in the form of workflows and published best practice recommendations to support workflow design, preservation, understandability and re-use. We summarise a number of these recommendations and their justifications in Table 1, where each recommendation addresses a specific requirement of workflow design and sharing. These recommendations can be clustered into broad themes as shown in Figure 1. This classification can be done in more than one way, e.g. according to how these recommendations support each FAIR dimension [67]. In this study, we have focused on categories with respect to workflow design, prospective provenance, data sharing, retrospective provenance, the computational environment required/used for an analysis and, lastly, better findability and understandability of all shared resources.
Table 1: Recommendations and their justifications.

R2: Avoid manual processing of data, and if using shims [61], make these part of the workflow to fully automate the computational process [58,60]. Justification: This ensures the complete capture of the computational process without broken links, so that the analysis can be executed without the need to perform manual steps.

R3 intermediate: [...] [59,57,60]. Justification: Intermediate data products can be used to inspect and understand a shared analysis when re-enactment is not possible.

R4 sw-version: Record the exact software versions used [58,60]. Justification: This is necessary for reproducibility of results, as different software versions can produce different results.

R5 data-version: If using public data (reference data, variant databases), then it is necessary to store and share the actual data versions used [3,6,58,60]. Justification: This is needed as different versions of data, e.g. human reference genome or variant databases, can result in slightly different results for the same workflow.

R6 annotation: Workflows should be well-described, annotated and offer associated metadata. Annotations such as user-contributed tags and versions should be assigned to workflows and shared when publishing the workflows and associated results [13,17,57,62,63]. Justification: Metadata and annotations improve the understandability of the workflow, facilitate independent re-use by someone skilled in the field, make workflows more accessible and hence promote the longevity of the workflows.

R7 identifier: Use and store stable identifiers for all artefacts including the workflow, the datasets and the software components [62,63]. Justification: Identifiers play an important role in the discovery, citation and accessibility of resources made available in open-access repositories.

R8 environment: [...] Justification: Such details support requirements analysis before any re-enactment or reproducibility is attempted.

R9 workflow: [...] Justification: The same workflow specifications can be used with different datasets, thereby supporting re-usability.

R10 software: Aggregate the software with the analysis and share this when publishing a given analysis [13,6,63,64,57]. Justification: Making software available reduces dependence on third party resources and as a result minimizes workflow decay [65].

R11 raw-data: Share raw data used in the analysis [13,59,57,63,64]. Justification: When someone wants to validate published results, availability of data supports verification of claims and hence establishes trust in the published analysis.

R12 attribution: Store all attributions related to data resources and software systems used [57,64]. Justification: Accreditation supports proper citation of resources used.

R13 provenance: Workflows should be preserved along with the provenance trace of the data and results [13,17,57,60,64]. Justification: A provenance trace provides a historical view of the workflow enactment, enabling end users to better understand the analysis retrospectively.

R14 diagram: Data flow diagrams of the computational analysis using workflows should be provided [6,59]. Justification: These diagrams are easy to understand and provide a human readable view of the workflow.

R15 open-source: [...] Justification: This improves availability and legal re-use of the resources used in the original analysis, while restricted licenses would hinder reproducibility.

R16 format: Data, code and all workflow steps should be shared in a format that others can easily understand, preferably in a system-neutral language [13,59,66]. Justification: System-neutral languages help achieve interoperability and make an analysis understandable.

R17 executable: Promote easy execution of workflows without making significant changes to the underlying environment [3]. Justification: In addition to helping reproducibility, this enables adapting the analysis methods to other infrastructures and improves workflow portability.

R18 resource-use: Information about compute and storage resources should be stored and shared as part of the workflow [6]. Justification: Such information can assist users in estimating the required resources needed for an analysis and thereby reduce the amount of failed executions.

R19 example: Example input and sample output data should be preserved and published along with the workflow-based analysis [13,65]. Justification: This information enables more efficient test runs of an analysis to verify and understand the methods used.

[Figure 1: the recommendations from Table 1 classified into categories.]

Sharing "all artefacts" from a computational experiment (following all recommendations and best practices) is a demanding task without any informed guidance. It requires a consolidated understanding of the impact of the many different artefacts involved in that analysis. This places extra effort on workflow designers, (re)-users, authors and reviewers, and expectations on the community as a whole. Given the numerous WMSs and the differences in how each system deals with provenance documentation, representation and sharing of these artefacts, the granularity of provenance information preserved will vary for each workflow definition approach. Hence, devising one universal but technology-specific solution for provenance capture and the related resource sharing is impossible. Instead we propose a generic framework of provenance in Figure 2 that all WMSs can benefit from and conform to with minimum technical overheads.
The recommendations in Table 1 aid in our understanding to define this framework by classifying the granularity of the provenance and related artefacts, where the uppermost level exhibits comprehensive, reproducible, understandable and provenance-rich computational experiment sharing. The purpose of this framework is threefold. First, because of its generic nature, it brings uniformity in the provenance granularity across various WMSs belonging to different workflow definition approaches. Second, it provides comprehensive and well-defined guidelines that can be used by researchers to conduct principled analysis of the provenance of any published study. Third, due to its hierarchical nature, the framework can be leveraged by workflow authors to progress incrementally towards the most transparent workflow-centric analysis. Overall, this framework will help achieve a uniform level of provenance and resource sharing, with a given workflow-centric analysis guaranteed to fulfill the respective provenance applications.
Our proposed provenance levels are ordered from low granularity to higher degrees of specificity. In brief, Level 0 is unstructured information about the overall workflow enactment, Level 1 adds structured retrospective provenance, access to primary data and executable workflows, Level 2 enhances the white-box provenance for individual steps, and Level 3 adds domain-specific annotations for improved understanding. These levels are described in the following sub-sections and mapped to the requirements in Table 1 that these levels aim to satisfy.

Level 0
To achieve this level, researchers should share the workflow specifications, input parameters used for a given workflow enactment, raw logs and output data, preferably through an open-access repository. This is the least information that could be shared without putting in any extra effort to support seamless re-use or understandability of a given analysis. The artefacts shared at this level would only require uploading of the associated resources to a repository without necessarily providing any supporting metadata or provenance information. Information captured at Level 0 is the bare minimum that can be used for result interpretation.
Workflow definitions based on Level 0 can also potentially be re-purposed for other analyses. As argued by Ludäscher, a well-written scientific workflow and its graphical representation is itself a source of prospective provenance, giving the user an idea of the steps taken and data produced [68]. Therefore a well-described workflow specification indirectly provides prospective provenance without aiming for it. In addition to the textual workflow specification, its graphical representation should also be shared if available for better understandability, fulfilling R14-diagram. At this level, reproducing the workflow would only be possible if the end user devotes extra effort to understand the shared artefacts and carefully recreate the execution environment. As open access journals frequently require availability of methods and data, many published studies now share workflow specifications and optionally the outputs, thereby achieving Level 0 and specifically satisfying R1-parameters and R9-workflow (Table 1). In addition, the resources shared should have an open licence starting from Level 0, and this practice proposed by R15-open-source should be adopted at each higher level.

Level 1
At Level 1, R4-sw-version, R5-data-version, R12-attribution and R13-provenance should be satisfied by providing retrospective provenance of the workflow enactment, i.e. a structured representation of machine-readable provenance which can answer questions such as "what happened", "when did it happen", "what was executed", "what was used", "who did this" and "what was produced". Seamless re-enactment of the workflow should be supported at this level. This is only possible when, along with provenance information, R8-environment and R10-software are satisfied, either by packaging the software environment for analysis sharing or by providing enough information about the software environment to guide the user in reliably re-enacting the workflow. Hence R17-executable should be satisfied, making it possible for end users to re-enact the shared analyses without making major changes to the underlying software environment.
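To make the idea of structured, machine-readable retrospective provenance more concrete, the following minimal Python sketch records each executed step together with what it used, what it produced, who ran it and when. The field names loosely echo W3C PROV terms (activity, agent, used, generated), but the class, the helper functions and all step and data names are hypothetical illustrations and do not represent the CWLProv serialization.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Activity:
    """One executed step of a workflow enactment (hypothetical record layout)."""
    name: str                                        # what was executed
    agent: str                                       # who did this
    started: datetime                                # when it started
    ended: datetime                                  # when it finished
    used: list = field(default_factory=list)         # identifiers of inputs
    generated: list = field(default_factory=list)    # identifiers of outputs


def describe(activity):
    """Answer the basic provenance questions for a single step."""
    return {
        "what was executed": activity.name,
        "when": (activity.started.isoformat(), activity.ended.isoformat()),
        "who did this": activity.agent,
        "what was used": sorted(activity.used),
        "what was produced": sorted(activity.generated),
    }


def producers_of(entity_id, activities):
    """Return the names of the steps that generated a given data artefact."""
    return [a.name for a in activities if entity_id in a.generated]


# Hypothetical two-step enactment; every name and timestamp is invented.
t0 = datetime(2019, 1, 1, 9, 0, tzinfo=timezone.utc)
t1 = datetime(2019, 1, 1, 9, 30, tzinfo=timezone.utc)
t2 = datetime(2019, 1, 1, 10, 0, tzinfo=timezone.utc)
run = [
    Activity("align", "alice", t0, t1,
             used=["reads", "reference"], generated=["alignment"]),
    Activity("call_variants", "alice", t1, t2,
             used=["alignment", "reference"], generated=["variants"]),
]
```

With such records, a question like "which step produced the alignment?" becomes a simple lookup, e.g. `producers_of("alignment", run)`, rather than a search through raw logs.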
In addition to the software availability and retrospective provenance, access to input data should also be provided, fulfilling R11-raw-data. This data can be used to re-enact the published methods or utilized in a different analysis, e.g. for performance comparison of methods. At Level 1, it is preferable to provide content-addressable data artefacts such as input, output and intermediate files, avoiding local paths and file names, to make a given workflow executable outside its local environment. The intermediate data artefacts should also be provided to facilitate inspection of all step results, hence satisfying R3-intermediate. All resources, including workflow specifications and provenance, should be shared in a format that is understandable across platforms, preferably in a technology-neutral language as proposed by R16-format.
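As an illustration of content-addressable data artefacts, the sketch below stores a file under a path derived from the checksum of its bytes, so that the artefact can be referenced independently of its original local path or file name. The choice of SHA-256, the two-character directory prefix and the function names are assumptions made for this example only.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def content_address(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file's bytes, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(path, store_dir):
    """Copy a file into store_dir under a name derived from its content."""
    checksum = content_address(path)
    # Assumed layout: <store_dir>/<first two hex characters>/<full checksum>
    target = Path(store_dir) / checksum[:2] / checksum
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(path, target)
    return checksum, target


# Usage: two differently named files with identical bytes share one address.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "sample_A.txt"
    second = Path(tmp) / "renamed_copy.txt"
    first.write_bytes(b"ACGT\n")
    second.write_bytes(b"ACGT\n")
    checksum_a, stored_a = store(first, Path(tmp) / "data")
    checksum_b, stored_b = store(second, Path(tmp) / "data")
    same_address = checksum_a == checksum_b and stored_a == stored_b
    stored_bytes = stored_a.read_bytes()
```

Because the address depends only on the bytes, a provenance record can refer to the checksum and remain valid after the artefacts are moved to another machine or repository.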
While software and data can be digitally captured, the hardware and infrastructure requirements also need to be captured to fulfill R18-resource-use. This kind of information can naturally vary widely with runtime environments, architectures and data sizes [69], as well as rapidly becoming outdated as hardware and cloud offerings evolve. Nevertheless, a snapshot of the workflow's overall execution resource usage for an actual run can be beneficial to give a broad overview of the requirements, and can facilitate cost-efficient re-computation by taking advantage of spot-pricing for cloud resources [70].

Level 2
It is a common practice in scientific workflows to modularize the workflow specifications by separating related tasks into "sub-workflows" or "nested workflows" [24], which can be incorporated and used in other workflows or be assigned to compute and storage resources in the case of distributed computing [71]. These modular solutions promote understanding and reusability of the workflows, as researchers are inclined to use these modules instead of the workflow as a whole for their own computational experiments. An example of a sub-workflow is the mandatory "pre-processing" [72] needed for the Genome Analysis ToolKit (GATK) best practice pipelines used for genomic variant calling. These steps can be separated into a sub-workflow to be used before any variant calling pipeline, be it somatic or germline.
At Level 1, retrospective provenance is coarse-grained and as such, there is no distinction between workflows and their sub-workflows. Ludäscher [68] distinguishes workflow provenance as black-box from database provenance as white-box. The reasoning behind this distinction is that often the steps in a workflow, especially those based on graphical user interface-based platforms, provide levels of abstraction/obscurity to the actual tasks being implemented. In our previous work we used an empirical case study to demonstrate that declarative approaches to workflow definition resulted in transparent workflows with the least number of assumptions [6]. This resolves the black-box/white-box issue to some extent, but to further support research transparency, we propose to share retrospective provenance logs for each nested/sub-workflow, making the details of a workflow enactment as explicit as possible and moving a step closer to white-box provenance. These provenance logs will support the inspection and automatic re-enactment of targeted components of a workflow, such as a single step or a sub-workflow, without necessarily having to re-enact the full analysis. Some existing make-like systems such as Snakemake support partial re-enactments but typically rely on fixed file paths for input data and require manual intervention to provide the specific directory structure. With detailed provenance logs and the corresponding content-addressable data artefacts, partial re-runs can be achieved with automatic generation of the input configuration settings.
In addition, we propose to include permalinks at Level 2 to identify the workflows and their individual steps, which facilitates the inspection of each step and aims to improve the longevity of the shared resources, hence supporting R7-identifier. Improving R18-resource-use for Level 2 would include resource usage per task execution. Along with execution times, this can be useful information to identify bottlenecks in a workflow and for more complex calculations in cost optimization models [73]. At this provenance level, however, resource usage data will also become more noisy and highly dependent on scheduling decisions by the workflow engine, e.g. sensitivity to cloud instance reuse or co-use for multiple tasks, or variation in data transfers between tasks on different instances. Thus Level 2 resource usage information should be further processed with statistical models for it to be meaningful for a user keen to estimate the resource requirements for re-enactment of a given analysis.

Level 3
Levels 0-2 are generic and domain-neutral, and can apply to any scientific workflow. However, domain-specific information/metadata about data and processes plays an important role in better understanding of the analysis and exploitation of provenance information, e.g. for meaningful queries to extract information relevant to the domain under consideration [74,75]. Addition of domain-specific metadata, e.g. file formats, user-defined tags and other annotations, to generic retrospective provenance can improve the white-boxness by providing domain context to the analysis as described in R6-annotations. Annotations can range from adding textual descriptions and tags to marking data with more systematic and well-defined domain-specific ontologies such as EDAM [76] and BioSchemas [77] in the case of bioinformatic workflows. Some studies also propose to provide example or test data sets, which eventually helps in analyzing the shared methods and verifying their results (as described in R19-example).
At Level 3, the information from previous levels combined with specific metadata about data artefacts facilitates higher level classification of workflow steps into motifs [78] such as data retrieval, pre-processing, analysis and visualisation. This level of provenance, resource aggregation and sharing can provide a researcher-centric view of data and enable users to re-enact a set of steps or the full workflow by providing a filtered and annotated view of the execution. This can be non-trivial to achieve with mainstream methods of workflow definition and sharing, as it requires guided user annotations with controlled vocabularies, but it can be simplified by reusing related tooling from existing efforts like BioCompute Objects [9] and DataCrate [79].
Communicating resource requirements (R18-resource-use) at Level 3 would involve domain-specific models for hardware use and cost prediction, as suggested for dynamic cloud costing [80] in BioSimSpace [81], or predicting assembler and memory settings through machine learning of variables like source biome, sequencing platform, file size, read count and base count in the European Bioinformatics Institute (EBI) Metagenomics pipeline [82]. For robustness, such models typically need to be derived from resource usage across multiple workflow runs with varied inputs, e.g. by a multi-user workflow platform. Taking advantage of Level 3 resource usage models might require pre-processing workflow inputs and calculations in an environment like R or Python, and so we recommend that models are provided with separate sidecar workflows for interoperable execution before the main workflow.
By explicit enumeration of the levels of provenance, it should be possible to quantify and directly assess the effort required to re-use a workflow and reproduce experiments. Similar efforts such as 5-star Open Data [83] strongly advocate open-licensed structured representation, use of stable identifiers for data sharing and following Linked Data principles to cross-relate data. One challenge in achieving the Open Data stars is the need for tool support during data processing. In our framework we propose systematic workflow-centric resource sharing using structured Linked Data representation, including recording of the executed data operations. Hence, our effort complements the already proposed 5-star Open Data principles and contributes to further understanding by sharing the computational method following the same principles.
Requiring researchers to achieve the above defined levels individually is unrealistic without guidance and direct technical support. Ideally, the conceptual meaning of these levels would be translated into a practical solution utilising the available resources. However, given the heterogeneity of workflow definition approaches, it is expected that the proposed framework, when translated into practical solutions, will also naturally result in varying workflow-centric solutions tied to specific WMSs. To support interoperability of the workflow-centric analysis achieving the provenance levels, we propose CWLProv, a format for annotating resource aggregations equipped with retrospective provenance. The next section describes CWLProv and the associated standards that are applied in this process.

CWLProv 0.6.0 and utilized standards
Here we present CWLProv, a format for the methodical representation of workflow enactment and associated artefacts, and for capturing and using retrospective provenance information. Keeping in view the recommendations from Table 1, for example R15-open-source and R16-format, we leverage open-source, domain-independent, system-neutral, interoperable and most importantly community-driven standards as the basis for the design and formatting of reproducible and interoperable workflow-based ROs. The profile description in this section corresponds to CWLProv 0.6.0 [84] (see https://w3id.org/cwl/prov for the latest profile).

Applied Standards and Vocabularies
We follow the recommendation "Reuse vocabularies, preferably standardized ones" [85] from best practices associated with data sharing, representation and publication on the web to achieve consensus and interoperability of workflow-based analyses. Specifically, we integrate the Common Workflow Language (CWL) for workflow definition, Research Objects (ROs) for resource aggregation and the PROV Data Model (PROV-DM) to support the retrospective provenance associated with workflow enactment.
The key properties and principles of these standards are described below.

Common Workflow Language (CWL)
Common Workflow Language [11] provides declarative constructs for workflow structure and command line tool interface definition. It makes minimal assumptions about base software dependencies, configuration settings, software versions, parameter settings or indeed the execution environment more generally [6]. The CWL object model supports comprehensive recording and capture of information for workflow design and execution. This can subsequently be published as structured information alongside any resultant analysis using that workflow.
CWL is a community-driven standard effort that has been widely adopted by many workflow design and execution platforms, supporting interoperability across a set of diverse platforms. Current adopters include Toil, Arvados, Rabix [86], Cromwell [87], REANA, and Bcbio [88], with implementations for Galaxy, Apache Taverna, and AWE currently in progress.
A workflow in CWL is composed of "steps", where each step refers either to a command line tool (also specified using CWL) or to another workflow specification, incorporating the concept of "sub-workflows". Each "step" is associated with "inputs" that comprise any data artefact required for the execution of that step (Figure 3). As a result of the execution of each step, "outputs" are produced which can become (part of) the "inputs" for the next steps, making the execution data-flow oriented. CWL is not tied to a specific operating system or platform, which makes it an ideal approach for interoperable workflow definitions.

Research Object (RO)
A Research Object encapsulates all of the digital artefacts associated with a given computational analysis contributing towards preservation of the analysis [89], together with their metadata, provenance and identifiers.
The aggregated resources can include but are not limited to: input and output data for analysis results validation; computational methods such as command line tools and workflow specifications to facilitate workflow re-enactment; attribution details regarding users; retrospective as well as prospective provenance for better understanding of workflow requirements; and machine-readable annotations related to the artefacts and the relationships between them. The goal of ROs is to make any published scientific investigation and the produced artefacts "interoperable, reusable, citable, shareable and portable".
The three core principles [90] of the RO approach are to support "Identity", "Aggregation", and "Annotation" of research artefacts. They look to enable accessibility of tightly-coupled, interrelated and well-understood aggregated resources involved in a computational analysis as identifiable objects, e.g. using unique (persistent) identifiers such as DOIs and/or ORCIDs. The RO approach is well aligned with the idea of interoperable and platform-independent solutions for provenance capture of workflows because of its domain-neutral and platform-independent nature.
While ROs can be serialized in several different ways, in this work we have reused the BDBag approach based on BagIt (see box), which has been shown to support large-scale workflow data [91]. This approach is also compatible with data archiving efforts from the NIH Data Commons, Library of Congress and the Research Data Alliance. The specialized workflow-centric RO in this study encompasses the components mentioned in the previous paragraph, annotated with various targeted tools and a PROV-based workflow provenance profile to capture the detailed retrospective provenance of the CWL workflow enactment.

PROV Data Model (PROV-DM)
The World Wide Web Consortium (W3C) developed PROV, a suite of specifications for unified/interoperable representation and publication of provenance information on the Web. The underlying conceptual PROV Data Model (PROV-DM) [19] provides a domain-agnostic model designed to capture fundamental features of provenance with support for extensions to integrate domain-specific information (Figure 4). We utilize mainly two serialisations of PROV for this study, PROV-Notation (PROV-N) [93] and PROV-JSON [94]. PROV-N is designed to achieve serialisation of PROV-DM instances by formally representing the information using a simplified textual syntax to improve human readability. PROV-JSON is a lightweight interoperable representation of PROV assertions using JavaScript constructs and data types. The key design and implementation principles of these two serialisations of PROV are in compliance with the goals of this study, i.e. being understandable and interoperable, and hence they are a natural choice to support the design of an adaptable provenance profile. For completeness we also explored serializing the provenance graph as PROV-XML [95] as well as PROV-O [96], which provides a mapping to Linked Data and ontologies, with potential for rich queries and further integration using a triple store. One challenge here is the wide variety of OWL and RDF formats; we opted for Turtle, N-Triples and JSON-LD, but concluded that requiring all of these PROV and RDF serializations would be an unnecessary burden for other implementations of CWLProv.

CWLProv Research Object
The provenance framework defined in the previous section can be satisfied by using a structured approach to share the identified resources. In this section, we define the representation of data and metadata to be shared for a given workflow enactment, stored as multiple files in their native formats. The folder structure of the CWLProv Research Object complies with the BagIt [14] format such that its content and completeness can be verified with any BagIt tool or library (see box What is BagIt?). The files used and generated by the workflow are here considered the data payload; the remaining directories include metadata of how the workflow results were created. We systematized the aggregated resources into various collections for better understanding and accessibility for a CWL workflow execution (Figure 5).

data/
data/ is the payload collection of the Research Object; in CWLProv this contains all input and output files used in a given workflow enactment. Data files should be labelled and identified based on a hashed checksum rather than on their file paths during workflow execution. This use of content-addressable reference and storage [97] simplifies identifier generation for data and helps to avoid local dependencies, e.g. hard-coded file names. However, the workflow execution engine might use other unique identifiers for file objects. It is advised to re-use such identifiers to avoid redundancy and to comply with the system/platform used to run the workflow.
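The content-addressed layout described above can be illustrated with a minimal Python sketch. It is not the CWLProv reference implementation: the function name store_content_addressed, the default SHA-1 algorithm and the two-character sub-folder prefix are assumptions chosen for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store_content_addressed(source: Path, ro_root: Path, algorithm: str = "sha1") -> Path:
    """Copy `source` to ro_root/data/<xx>/<checksum>; return the path relative to ro_root.

    Hypothetical helper: the hash algorithm and folder layout are assumptions.
    """
    digest = hashlib.new(algorithm)
    with source.open("rb") as handle:
        # Read in chunks so that large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    checksum = digest.hexdigest()
    relative = Path("data") / checksum[:2] / checksum
    target = ro_root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        # Identical content is stored only once, whatever its original file name.
        shutil.copyfile(source, target)
    return relative


def demo() -> tuple:
    """Store two differently named files with the same bytes and return both paths."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        first = tmp_path / "sequence.fa"
        second = tmp_path / "copy_of_sequence.fa"
        first.write_bytes(b">seq1\nACGT\n")
        second.write_bytes(b">seq1\nACGT\n")
        ro_root = tmp_path / "ro"
        return (
            store_content_addressed(first, ro_root),
            store_content_addressed(second, ro_root),
        )
```

Because the stored name depends only on the byte content, both files in demo() map to the same path under data/, independent of their local file names.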

workflow/
CWLProv ROs must include a system-independent executable version of the workflow under the workflow/ folder. When using CWL, this sub-folder must contain the complete executable workflow specification file, an input file object with the parameter settings used to enact the workflow, and an output file object generated as a result of workflow enactment. The latter contains details of the workflow outputs, such as data files produced by the workflow, but may exclude intermediate outputs.
To ensure RO portability, these file objects may not exactly match the file names at enactment time, as the absolute paths of the inputs are recommended to be replaced with relativized content-addressed paths within the RO, e.g. /home/alice/exp15/sequence.fa is replaced with ../data/b1/b1946ac92492d2347c6235b4d2611184. The input file object should also capture any dependencies of the input data files, such as .bam.bai indexes neighbouring .bam (Binary Alignment Map) files. Any folder objects should be expanded to list the contained files and their file names at the time of enactment.
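The path rewriting described above can be sketched in Python as follows. This is an illustrative sketch rather than the cwltool code: the function names relativise and sha1_of_file are hypothetical, SHA-1 is an assumed checksum, and only CWL File objects carrying a "path" field are handled (folder expansion is left out).

```python
import hashlib
import os
import tempfile


def sha1_of_file(path):
    """Return the hexadecimal SHA-1 checksum of the file at `path`."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def relativise(value, checksum_of=sha1_of_file):
    """Recursively rewrite File objects to content-addressed relative paths.

    The original file name is preserved as "basename"; nested structures such
    as "secondaryFiles" lists are rewritten by the same recursion.
    """
    if isinstance(value, list):
        return [relativise(item, checksum_of) for item in value]
    if isinstance(value, dict):
        rewritten = {key: relativise(item, checksum_of) for key, item in value.items()}
        if value.get("class") == "File" and "path" in value:
            checksum = checksum_of(value["path"])
            rewritten["basename"] = os.path.basename(value["path"])
            rewritten["path"] = "../data/{}/{}".format(checksum[:2], checksum)
        return rewritten
    # Strings, numbers and other plain parameters are kept as-is.
    return value


def demo():
    """Relativise a small job object with a .bam file and its .bam.bai index."""
    with tempfile.TemporaryDirectory() as tmp:
        bam = os.path.join(tmp, "sample.bam")
        bai = bam + ".bai"
        for name, content in ((bam, b"alignment"), (bai, b"index")):
            with open(name, "wb") as handle:
                handle.write(content)
        job = {
            "threads": 4,
            "reads": {
                "class": "File",
                "path": bam,
                "secondaryFiles": [{"class": "File", "path": bai}],
            },
        }
        return relativise(job)
```

Keeping the original name in "basename" matters for dependencies such as index files, whose association with the main file is conventionally expressed through their file names.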
In the case of a CWL workflow, cwltool can aggregate the CWL description and any referenced external descriptions (such as sub-workflows or command line tool descriptions) into a single workflow file using cwltool --pack. This feature is used in our implementation (details in section Practical Realisation of CWLProv) to rewrite the workflow files, making them re-executable without depending on workflow or command line descriptions on the file system outside the RO. Other workflow definition approaches, WMSs or CWL executors should apply similar features to ensure workflow definitions are executable outside their original file system location.

What is BagIt?
BagIt is an IETF Internet Standard (RFC8493) [14] that defines a structured file hierarchy for the purpose of digital preservation of data files. BagIt was initiated by the US Library of Congress and the California Digital Library, and is now used by libraries and archives to ensure safe transmission and storage of datasets using "bags".
A bag is indicated by the presence of bagit.txt and a payload of digital content stored as files and sub-folders in the data/ folder. Other files are considered tag files to further describe the payload. All the payload files are listed in a manifest with checksums of their byte content, e.g. manifest-sha256.txt, and equivalently for tag files in tagmanifest-sha256.txt. Basic metadata can be provided in bag-info.txt as key-value pairs.
A bag can be checked to be complete if all the files listed in the manifests exist, and is also considered valid if the manifest matches the checksum of each file, ensuring they have been correctly transferred.
BDBag (Big Data bag) [91] is a profile of BagIt that adds a Research Object [98] metadata/manifest.json in JSON-LD [99] format to contain richer Linked Data annotations that may not fit well in bag-info.txt, e.g. authors of an individual file. BDBags can include a fetch.txt to reference external resources using ARK MinIDs or HTTP URLs, allowing bags that contain large files without necessarily transferring their bytes.
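The completeness and validity checks described in this box can be sketched with the Python standard library alone. The function check_bag below is a hypothetical helper, not an existing BagIt tool; it assumes a single manifest-sha256.txt with one "checksum path" entry per line, and it only checks that listed files exist, not that every payload file is listed.

```python
import hashlib
import tempfile
from pathlib import Path


def check_bag(bag_root: Path, manifest_name: str = "manifest-sha256.txt"):
    """Return (complete, valid) for the bag at `bag_root`.

    complete: bagit.txt, the manifest and every listed payload file exist.
    valid: additionally, every listed file matches its SHA-256 checksum.
    """
    manifest = bag_root / manifest_name
    if not (bag_root / "bagit.txt").is_file() or not manifest.is_file():
        return (False, False)
    complete, valid = True, True
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line holds a checksum, whitespace, then a relative path.
        expected, relative = line.split(None, 1)
        payload = bag_root / relative.strip()
        if not payload.is_file():
            complete = valid = False
            continue
        if hashlib.sha256(payload.read_bytes()).hexdigest() != expected.lower():
            valid = False
    return (complete, valid)


def demo():
    """Build a tiny bag, then check it intact, corrupted and with a missing file."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "ro"
        (bag / "data").mkdir(parents=True)
        (bag / "bagit.txt").write_text(
            "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
        )
        payload = bag / "data" / "result.txt"
        payload.write_bytes(b"42\n")
        checksum = hashlib.sha256(b"42\n").hexdigest()
        (bag / "manifest-sha256.txt").write_text(
            "{}  data/result.txt\n".format(checksum), encoding="utf-8"
        )
        intact = check_bag(bag)
        payload.write_bytes(b"43\n")  # simulate a corrupted transfer
        corrupted = check_bag(bag)
        payload.unlink()  # simulate an incomplete transfer
        missing = check_bag(bag)
        return intact, corrupted, missing
```

In practice an established BagIt library would normally be preferred; the sketch only makes the distinction between a complete bag and a valid bag concrete.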

snapshot/
snapshot/ comprises copies of the workflow and tool specification files "as-is" at enactment time, without any rewrites, packing or relativizing as described above.
It is recommended to use snapshot resources only for checking the validity of results and for understanding the workflow enactment, since these files might contain absolute paths or be host-specific, and thus may not be possible to re-enact elsewhere. Preserving these files untouched may nevertheless retain information that could otherwise get lost, e.g. commented out workflow code, or identifiers baked into file names.
A challenge in capturing snapshot files is that they typically live within a file system hierarchy which can be difficult to replicate accurately, and may have internal references to other files. In our implementation we utilize cwltool --print-deps to find indirectly referenced files and store their snapshots in a flat folder.

metadata/
Each CWLProv RO must contain an RO manifest file metadata/manifest.json and two sub-directories, metadata/logs and metadata/provenance. The RO manifest, part of the BDBag [91] profile, follows the JSON-LD structure defined for Research Object Bundles [98] and can provide structured Linked Data for each file in the RO, like file type and creation date. Further detail about the manifest file contents is documented on GitHub as the CWLProv specification [84].
Any raw log information from the workflow enactment should be made available in metadata/logs. This typically includes the actual commands executed for each step. Similar to the snapshot files, log files may however be difficult to process outside the original enactment system. An example of such processing is CWL-metrics [100], which post-processes cwltool log files to capture runtime metrics of individual Docker containers.
Capturing the details of a workflow execution requires rich metadata in provenance files (see section Retrospective Provenance Profile). These should exist in the sub-folder metadata/provenance. It is recommended to make a primary provenance file mandatory; this file should conform to the PROV-N [93] format and describes the top-level workflow execution. As described in Level 2 (Section Levels of Provenance and Resource Sharing), it is quite possible to have nested workflows. In that case, a separate provenance file for each nested workflow execution should be included in this folder. If there are additional formats of provenance files, such as PROV-JSON [94], PROV-XML [95] or PROV-O [96], then these should be included in the same folder, and declaring their formats in the RO manifest using conformsTo is mandatory. The nested workflow profile should be named such that there is a link between the respective step in the primary workflow and the nested workflow, preferably using unique identifiers.
As the PROV-DM has a generalized structure, there might be some provenance aspects specific to particular workflows that are hard to capture if only using PROV-N, hence ontologies such as wfdesc [101] can be used to describe the abstract representation of the workflow and its steps. Use of wfprov [102] to capture some workflow provenance aspects is also encouraged. Alternative extensions such as ProvOne [103] can also be utilized if the WMS or workflow executor is using these extensions already.
CWLProv reuses Linked Data standards like JSON-LD [99], W3C PROV [19] and Research Object [16]. A challenge with Linked Data in distributed and desktop computing is how to make identifiers that are absolute URIs and hence globally unique. For example, for CWLProv a workflow may be executed by an engine that does not know where its workflow provenance will be stored, published or finally integrated. To this end, CWLProv generators should use the proposed arcp [104] URI scheme to map local file paths within the RO BagIt folder structure to absolute URIs for use within the RO manifest and associated PROV traces. Consumers of CWLProv ROs that do not contain an arcp-based External-Identifier should generate a temporary arcp base to safely resolve any relative URI references not present in the CWLProv folder. Implementations processing a CWLProv RO may convert arcp URIs to local file:/// or http:// URIs depending on how and where the CWLProv RO was saved, e.g. using the "arcp.py" library [105].
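The mapping between paths inside the RO and absolute URIs can be sketched without any third-party dependency. The sketch below assumes the UUID-based form of the arcp scheme (arcp://uuid,<UUID>/path); the helper names arcp_base, to_arcp and to_file_uri are hypothetical and are not the API of the arcp.py library.

```python
import uuid
from pathlib import Path
from urllib.parse import quote, unquote


def arcp_base(ro_uuid=None):
    """Return an arcp base URI for an RO, minting a random UUID if none is given."""
    identifier = ro_uuid if ro_uuid is not None else uuid.uuid4()
    return "arcp://uuid,{}/".format(identifier)


def to_arcp(base, relative_path):
    """Map a path relative to the RO root to an absolute arcp URI."""
    return base + quote(relative_path.lstrip("/"))


def to_file_uri(arcp_uri, base, ro_folder):
    """Convert an arcp URI back to a file:/// URI once the RO location is known."""
    if not arcp_uri.startswith(base):
        raise ValueError("URI does not belong to this RO: " + arcp_uri)
    relative = unquote(arcp_uri[len(base):])
    return (Path(ro_folder).resolve() / relative).as_uri()
```

For example, with base = arcp_base(), the call to_arcp(base, "metadata/manifest.json") gives a globally unique identifier for the manifest that stays stable regardless of where the RO folder is later stored, while to_file_uri resolves it against a concrete local copy.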

Retrospective Provenance Profile
As stated earlier, the primary provenance file should conform to the PROV-N [93] serialisation of the PROV data model, and may optionally use ontologies specific to the workflow execution.
The key features used in the structure of the retrospective provenance profile for a CWL workflow enactment in CWLProv are listed in Table 2. These features are not tied to any platform or workflow definition approach and hence can be used to document retrospective provenance of any workflow irrespective of the workflow definition approach.
The core mapping follows the PROV data model (Figure 4): a PROV Activity represents the duration of a workflow run, as well as of individual step executions, which used file and data entities (Entity), which in turn may have been generated by (wasGeneratedBy) previous step activities. The workflow engine (e.g. cwltool) is the Agent controlling these activities according to the workflow definition (Plan).
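To make this mapping concrete, the following Python sketch builds a small PROV-JSON-style structure for one workflow run with a single step. It is an illustration only: the identifiers ex:run, ex:step1, ex:engine and ex:workflow, the example namespace and the data: checksum prefix are placeholders assumed here, not values prescribed by the CWLProv profile.

```python
import hashlib
import json


def checksum_id(content: bytes) -> str:
    """Identify a data entity by the SHA-1 checksum of its content (assumed scheme)."""
    return "data:" + hashlib.sha1(content).hexdigest()


def build_trace(input_bytes: bytes, output_bytes: bytes) -> dict:
    """Return a PROV-JSON-like dict relating Agent, Plan, Activities and Entities."""
    used_id = checksum_id(input_bytes)
    generated_id = checksum_id(output_bytes)
    return {
        # Placeholder namespaces, for illustration only.
        "prefix": {"ex": "http://example.org/run/", "data": "urn:hash::sha1:"},
        "agent": {"ex:engine": {"prov:label": "workflow engine, e.g. cwltool"}},
        "entity": {
            "ex:workflow": {"prov:label": "workflow definition (the Plan)"},
            used_id: {"prov:label": "input file"},
            generated_id: {"prov:label": "output file"},
        },
        "activity": {
            "ex:run": {"prov:label": "workflow run"},
            "ex:step1": {"prov:label": "step execution"},
        },
        # The engine controls the run according to the workflow definition.
        "wasAssociatedWith": {
            "_:assoc1": {
                "prov:activity": "ex:run",
                "prov:agent": "ex:engine",
                "prov:plan": "ex:workflow",
            }
        },
        # The workflow run started the step execution.
        "wasStartedBy": {
            "_:start1": {"prov:activity": "ex:step1", "prov:starter": "ex:run"}
        },
        "used": {"_:use1": {"prov:activity": "ex:step1", "prov:entity": used_id}},
        "wasGeneratedBy": {
            "_:gen1": {"prov:entity": generated_id, "prov:activity": "ex:step1"}
        },
    }


def entities_used_by(trace: dict, activity: str) -> list:
    """Answer "what was used?" for one activity."""
    return sorted(
        relation["prov:entity"]
        for relation in trace.get("used", {}).values()
        if relation["prov:activity"] == activity
    )
```

Serialising the result with json.dumps(build_trace(b"ACGT\n", b"TGCA\n"), indent=2) gives a document in which a question such as "what was used by ex:step1?" reduces to a simple lookup, as entities_used_by shows.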
PROV is a general standard not specific to workflows, and lacks features to relate a plan (i.e. a workflow description) with sub-plans and with workflow-centric retrospective provenance elements, e.g. a specific workflow enactment and the enactment of its related steps. We have utilized wfdesc and wfprov to represent a few elements of prospective and retrospective provenance respectively. In addition, the provenance profile documents details of all the uniquely identified activities, e.g. workflow enactment and related command line tool invocations, and their associated entities (e.g. input and output data artefacts, input configuration files, workflows and command line tool specifications). The profile also documents the relationship between activities, such as which activity (workflow enactment) was responsible for starting and ending another activity (command line tool invocation).
As described in Section Levels of Provenance and Resource Sharing, in order to achieve maximum white-box provenance, the inner workings of a nested workflow should also be included in the provenance trace. If a step represents a nested workflow, a separate provenance profile is included in the RO. Moreover, in the parent workflow trace, this relationship is recorded using has_provenance as an attribute of the Activity step, which refers to the profile of the nested workflow.

Practical Realisation of CWLProv
CWLProv [84] provides a format that can be adopted by any workflow executor or platform, provided that the underlying workflow definition approach is at least as declarative as CWL, i.e. it captures the necessary components described in Section Applied Standards and Vocabularies. In the case of CWL, as long as the conceptual constructs are common amongst the available implementations and executors, a workflow enactment can be represented in CWLProv format. To demonstrate the practical realisation of the proposed model we consider cwltool, a Python-based reference implementation of CWL.
cwltool is a feature-complete reference implementation of CWL. It provides extensive validation of CWL files as well as offering a comprehensive set of test cases to validate new modules introduced as extensions to the existing implementation. Thus it provides the ideal choice for implementing CWLProv for provenance support and resource aggregation. The existing classes and methods of the implementation were utilized to achieve various tasks such as packaging of the workflow and all associated tool specifications together. In addition, the existing Python library prov [106] was used to create a provenance document instance and populate it with the required artefacts generated as the workflow enactment proceeds.
It should be noted that we elected to implement CWLProv in the reference implementation cwltool instead of the more scalable and production-friendly CWL implementations like Toil [107], Arvados [108], Rabix [86], CWL-Airflow [109] or Cromwell [87]. An updated list of implementations is available at the CWL homepage. Compared to cwltool these generally have extensive scheduler and cloud compute support, and extensions for large data transfer and storage, and should therefore be considered by any adopters of the Common Workflow Language. In this study we have however focused on cwltool, as its code base was found to be easy to adapt for rich provenance capture without having to modify subsystems for distributed execution or data management, and, as a reference implementation, it better informs us on how to model CWLProv for the general case rather than being tied to execution details of the more sophisticated CWL workflow engines.
CWLProv support for cwltool is built as an optional module which, when invoked as "cwltool --provenance ro/ workflow.cwl job.json", will automatically generate an RO with the given folder name ro/ without requiring any additional information from the user. Each input file is assigned a hash value and placed in the folder ro/data, making it content-addressable to avoid local dependencies (Figure 6).
In order to avoid including information about attribution without the consent of the user, we introduce an additional flag "--enable-user-provenance". If a user provides the options --orcid and --full-name, this information will be included in the provenance profile related to user attribution. Enabling "--enable-user-provenance" and not providing the full name or ORCID will store user account details from the local machine for attribution, i.e. the details of the agent that enacted the workflow.
The workflow and command line tool specifications are aggregated in one file to create an executable workflow and placed in the folder ro/workflow. This folder also contains transformed input job objects containing the input parameters with references to artefacts in ro/data, based on relativising the paths present in the input object. These two files are sufficient to re-enact the workflow, provided the other required artefacts are also included in the RO and comply with the CWLProv format. The cwltool control flow [110] indicates the points at which the execution of the workflow and of the command line tools involved in the workflow enactment start and end, and how the output is reported back. This information and the artefacts are captured and stored in the RO.
When the execution of a workflow begins, CWLProv extensions to cwltool generate a provenance document (using the prov library) which includes default namespaces for the workflow enactment "activity". The attribution details as an agent are also added at this stage if user provenance capture is enabled, e.g. to answer "who ran the workflow?". Each step of the workflow can correspond to either a command line tool or another nested workflow, referred to as a sub-workflow in the CWL documentation. For each nested workflow, a separate provenance profile is initialized recursively to achieve a white-box finer-grained provenance view as explained in Section Levels of Provenance and Resource Sharing. This profile is continually updated throughout the nested workflow enactment. Each step is identified by a unique identifier and recorded as an activity in the parent workflow provenance profile, i.e. the "primary profile". The nested workflow is recorded as a step in the primary profile using the same identifier as the "nested workflow enactment activity" identifier in the respective provenance profile. For each step in the activity, the start time and association with the workflow activity are created and stored as part of the overall provenance to answer the question "when did it happen?".
The data used as input by these steps is either provided by the user or produced as an intermediate result from the previous steps. In both cases, the Usage is recorded in the respective provenance profile using checksums as identifiers to answer the question "what was used?". Non-file input parameters such as strings and integers are stored "as-is" using an additional optional argument, prov:value. Upon completion, each step typically generates some data. The provenance profile records the generation of outputs at the step level to record "what was produced?" and "which process produced it?". Once all steps complete, the workflow outputs are collected and the generation of these outputs at the workflow level is recorded in the provenance profile. Moreover, using the checksums of these files generated by cwltool, content-addressable copies are saved in the folder ro/data. The provenance profile refers to these files using the same checksums such that they are traceable or can be used for further analysis if required. The workflow specification, command line tool specifications and JSON job file are archived in the ro/snapshot folder to preserve the actual workflow history.
This prototype implementation provides a model and guidance for workflow platforms and executors to identify their respective features that can be utilized in devising their own implementation of CWLProv. Table 3 maps the best practices and recommendations from Table 1 to the Levels of Provenance (Figure 2). The methods and implementation readiness shown indicate to what extent the recommendations are addressed by the implementation of CWLProv (detailed in this section).

Achieving recommendations with provenance levels
Note that other approaches may solve this mapping differently. For instance, Nextflow [111] may fulfill R18-resource-use at Provenance Level 2 as it can produce trace reports with hardware resource usage per task execution [112], but not for the overall workflow. While a Nextflow trace report is a separate CSV file with implementation-specific columns, our planned R18-resource-use approach for CWL is to combine CWL-metrics [113], permalinks and the standard GFD.204 [114] to further relate resource use with Level 1 and Level 2 provenance within the CWLProv Research Object.
In addition to following the recommendations from Table  1 through computational methods, the work ow authors are Partially implementeď Implementation planned/ongoing also required to exercise best practices for work ow design and authoring. For instance, to achieve R1-parameters the work ow must be written in such a way that parameters are exposed and documented at work ow level, rather than hard-coded within an underlying Python script. Similarly, while the CWL format support rich details of user annotations that can ful ll R6-annotation, for these to survive into a Research Object at execution time, such annotation capabilities must actually be used by work ow authors instead of unstructured text les.
It should be a goal of a scienti c WMS to guide users towards achieving the required level of the provenance framework through automation where possible. For instance a user may in the work ow have speci ed a Docker container image without preserving the version, but the provenance log could still record the speci c container version used at execution time, achieving R4-sw-version retrospectively by computation rather than relying on a prospective declaration in the workow de nition.

CWLProv Evaluation with Bioinformatics Workflows
CWLProv as a standard supports syntactic, semantic and pragmatic interoperability (defined in Section Interoperability) of a given workflow and its associated results. We have defined a "common data format" for workflow sharing and publication such that any executor or WMS with CWL support can interpret this information and make use of it. This ensures the syntactic interoperability between the workflow executors on different computing platforms. Similarly, the "content" of the shared aggregation artefact as a workflow-centric RO is unambiguously defined, thus ensuring uniform representation of the workflow and its associated results across different platforms and executors, hence supporting semantic interoperability. With Level 3 provenance satisfied, providing domain-specific information along with Level 0-2 provenance tracking, we posit that CWLProv would be able to accomplish pragmatic interoperability by providing unambiguous information about the "context", "application" and "use" of the shared/published workflow-centric ROs. Hence, extending the current implementation (described in section ) in future to include domain-rich information in the provenance traces and the CWLProv RO will result in pragmatic interoperability.
To demonstrate the interoperability and portability of the proposed solution, we evaluate CWLProv and its reference implementation using open source bioinformatics workflows available on GitHub from different research initiatives and from different developers. Conceptually, these workflows are selected for evaluation due to their extensive use in real-life data analyses and the variety of their input data. The alignment workflow is included in the evaluation as alignment is one of the most time-consuming yet mandatory steps in any variant calling workflow. Practically, the choice of the workflows by these particular groups out of numerous existing implementations is justified in each section below.

RNA-seq Analysis Workflow
RNA sequencing (RNA-seq) data generated by Next Generation Sequencing (NGS) platforms comprises short sequence reads that can be aligned to a reference genome, where the alignment results form the basis of various analyses such as quantitating transcript expression, identifying novel splice junctions and isoforms, and differential gene expression [116]. RNA-seq experiments can link phenotype to gene expression and are widely applied in multi-centric cancer studies [24]. Computational analysis of RNA-seq data is performed by different techniques depending on the research goals and the organism under study [117]. The workflow [118] included in this case study has been defined in CWL by one of the teams [119] participating in the NIH Data Commons initiative [120], a large research infrastructure program aiming to make digital objects (such as data generated during biomedical research and software/tools required to utilize such data) shareable and accessible and hence aligned with the FAIR principles [67].
This workflow (Figure 7), designed for the pilot phase of the NIH Data Commons initiative [121], adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed) [122]. The RNA-seq pipeline originated from the Broad Institute [123]. There are five steps in total in the workflow: 1) Read alignment using STAR [124], which produces aligned BAM files including the Genome BAM and Transcriptome BAM. 2) The Genome BAM file is processed using Picard MarkDuplicates [125], producing an updated BAM file containing information on duplicate reads (such reads can lead to biased interpretation). 3) SAMtools index [126] is then employed to generate an index for the BAM file, in preparation for the next step. 4) The indexed BAM file is processed further with RNA-SeQC [127], which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics. 5) In parallel with transcript quantification, isoform expression levels are quantified by RSEM [128]. This step depends only on the output of the STAR tool and additional RSEM reference sequences.
For testing and analysis, the workflow author provided example data created by down-sampling the read files of a TOPMed public access dataset [129]. Chromosome 12 was extracted from the Homo sapiens Assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well documented, and a detailed set of instructions describing the steps performed to down-sample the data is also provided for transparency. The availability of example input data, use of containerization for underlying software and detailed documentation are important factors in choosing this specific CWL workflow for CWLProv evaluation.

Alignment Workflow
Alignment is an essential step in variant discovery workflows and is considered an obligatory pre-processing stage according to the Best Practices by the Broad Institute [72]. The purpose of this stage is to filter low-quality reads before variant calling or other interpretative steps [130]. The workflow for alignment is designed to operate on raw sequence data to produce analysis-ready BAM files as the final output. The typical steps followed include file format conversions, aligning the read files to the reference genome sequence, and sorting the resulting files. The CWL alignment workflow [131] included in this evaluation (Figure 8) is designed by Data Biosphere [132]. It adapts the alignment pipeline [133] originally developed at the Abecasis Lab, The University of Michigan [134]. This workflow is also part of the NIH Data Commons initiative (as is the RNA-seq analysis workflow) and comprises four stages. The first step, "Pre-align", accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI) [135]) and the human genome reference sequence as input and, using underlying software utilities of SAMtools such as view, sort and fixmate, returns a list of fastq files which can be used as input for the next step. The next step, "Align", also accepts the human reference genome as input along with the output files from "Pre-align" and uses BWA-mem [136] to generate aligned reads as BAM files. SAMBLASTER [137] is used to mark duplicate reads and SAMtools view to convert read files from SAM to BAM format. The BAM files generated after "Align" are sorted with SAMtools sort. Finally, these sorted alignment files are merged to produce a single sorted BAM file using SAMtools merge in the "Post-align" step. The authors provide an example CRAM file and the Homo sapiens Assembly 38 reference genome along with its index files to be used as inputs for testing and analysis of the workflow.

Somatic Variant Calling Workflow
Variant discovery analysis for high-throughput sequencing data is a widely used bioinformatics technique, focused on finding genetic associations with diseases, identifying somatic mutations in cancer and characterizing heterogeneous cell populations [138]. The pre-processing explained for the alignment workflow is part of any variant calling workflow, as reads are classified and ordered as part of the variant discovery process. Numerous variant calling algorithms have been developed depending on the input data characteristics and the specific application area [130]. Somatic variant calling workflows are designed to identify somatic (non-inherited) variants in a sample, generally a cancer sample, by comparing the set of variants present in a sequenced tumour genome to a non-tumour genome from the same host [139]. The set of tumour variants is a super-set of the set of host variants, and somatic mutations can be identified through various algorithmic approaches to subtracting host familial variants. Each somatic variant calling workflow typically consists of three stages: pre-processing, variant evaluation and post-filtering.
The somatic variant calling workflow (Figure 9) included in this case study is designed by Blue Collar Bioinformatics (bcbio) [140], a community-driven initiative to develop best-practice pipelines for variant calling, RNA-seq and small RNA analysis workflows. According to the documentation, the goal of this project is to facilitate the automated analysis of high-throughput data by making the resources quantifiable, analyzable, scalable, accessible and reproducible. All the underlying tools are containerized, facilitating software use in the workflow. The somatic variant calling workflow defined in CWL is available on GitHub [141] and equipped with a well-defined test dataset.

Evaluation Activity
This section describes the evaluation of cross-executor and cross-platform interoperability of CWLProv. To test cross-executor interoperability, two CWL executors, cwltool and toil-cwl-runner, were selected. toil-cwl-runner is an open source Python workflow engine supporting robust cross-platform workflow execution on Cloud and High Performance Computing (HPC) environments [107]. The two operating system platforms utilized in this analysis were MacOS and Ubuntu Linux. For the Linux OS, a 16-core Linux instance with 64GB RAM was launched on the Australian National eResearch Collaboration Tools and Resources (NeCTAR) research cloud [143]. To cater for the storage requirements, a 1000GB persistent volume was attached to this instance. For MacOS, a local system with 16GB RAM, 250GB storage and a 2.8 GHz Intel Core i7 processor was used. These platforms were selected to cater for the required storage and compute resources of the workflows described above. The reference genome provided with the alignment workflow was not down-sampled and hence this workflow required the most resources among the three evaluated.
It is worth mentioning that this evaluation does not include details of the installation process for cwltool, toil-cwl-runner and Docker on the systems described above. To create CWLProv ROs during workflow execution, it is necessary to use the CWL reference runner (cwltool) until this practice spreads to other CWL implementations. Moreover, it is assumed that the software container platform (Docker) is also installed on the system in order to use the workflow definitions aggregated in a given CWLProv RO.
In addition, the resource requirements (identified in R18-resource-use and discussed in Section Discussion and Future Directions) should also be satisfied by choosing a system with enough compute and storage resources for successful enactment. The systems used in this case study should serve as a reference when selecting a system, as inadequate compute and storage resources such as insufficient RAM or number of cores will hinder the successful re-enactment of workflows using these ROs. The hardware requirements may also vary if a different dataset is used as input to re-enact the workflow using the methods aggregated in the RO. In that case, the end user must ensure availability of adequate compute and storage resources by choosing a system that meets the required specifications [144].
Since the CWLProv implementation is demonstrated for one of the executors (cwltool), currently a CWLProv RO for any workflow can only be produced using cwltool. Hence, in this activity the workflows are initially enacted using just cwltool (Table 4). The outline of the steps performed to analyse CWLProv for each case study is as follows.
I) The workflow was enacted using cwltool to produce an RO on a MacOS computer.
1) The resulting RO and aggregated resources were used to re-enact the workflow using toil-cwl-runner on the same MacOS computer;
2) The RO produced in step I was transferred to the cloud-based Linux instance used in this activity;
3) On the cloud-based Linux environment and only utilizing the resources aggregated in the RO, the workflow was re-enacted using cwltool and toil-cwl-runner.
II) The workflow was enacted using cwltool to produce an RO on Linux.
1) The resulting RO and aggregated resources were utilized to re-enact the workflow using toil-cwl-runner on the same cloud-based Linux instance;
2) The RO produced in step II was transferred to the MacOS computer used in this activity;
3) On the MacOS computer and only utilizing the resources aggregated in the RO, the workflow was re-enacted using cwltool and toil-cwl-runner.
The CWLProv ROs produced as a result of this activity are published on Mendeley Data [145,146,147] with mirrors on Zenodo.

Evaluation Results
The steps described above were taken to produce ROs which were then used to re-enact the workflows (outlined in Table 4), without any further changes required. This demonstration illustrated the syntactic and semantic interoperability of the workflows across different systems. It shows that both CWL executors were able to exchange, comprehend and use the information represented as CWLProv ROs. The current implementation described in section Practical Realisation of CWLProv does not resolve Level 3. Hence, the inclusion of domain-specific annotations referring to scientific context to address pragmatic interoperability is identified as a crucial future direction and further detailed in section Discussion and Future Directions.

[99] for data modeling and Docker [30] to support portability of the run-time environments. The portability and interoperability as basic principles of the underlying workflow definition approach for any workflow-centric analysis imply that the analysis should also be portable and interoperable. However, the workflow definition/specification alone is insufficient when dealing with command-line tool specifications, data, and input configuration files used in the analysis if these are not readily available.
CWLProv ensures availability of these resources for a given analysis conforming to the framework defined in Section CWLProv 0.6.0 and utilized standards. The input configurations are saved as primary-job.json in the folder workflow/ and refer to the input data contained in the payload data/ folder of the given RO. In this way, availability of data aggregated with the analysis is made possible. Existing features of cwltool are used to generate the CWL workflow specification file, which contains all of the command-line tool specifications referred to in the workflow specification and is placed in the same workflow/ folder. One might argue that copying a folder tree would serve the same purpose, but in that case we would again be relying on users to put a substantial amount of effort on top of the actual analysis, i.e. they would have to carefully structure their directories to be aligned with those of the workflow creators. Instead, CWL encourages researchers to utilize container technologies such as Docker and Singularity, or software packaging systems like Debian (Med) or Bioconda, to ensure availability of underlying tools as recommended by numerous studies [13,6,57,63,64,148]. This practice facilitates the preservation of methods utilized in data-intensive scientific workflows and enables verification of the published claims without requiring the end-user to do any manual installation and configuration. Examples of tools available via Docker containers used here are the alignment tool (BWA mem) used in the alignment workflow and the STAR aligner used in the RNA-seq workflow.

Evaluating Provenance Profile
The retrospective provenance profile generated as part of CWLProv for each workflow enactment can be examined and queried to extract the required subset of information. Provenance Analytics is a separate domain and a next step after provenance collection in the provenance life cycle [149]. Often provenance data is queried using specialized query languages such as SQL, SPARQL or TriQL depending on the storage mechanism used. Query operations can combine information from prospective and retrospective provenance to understand computational experiments better.
The focus of this paper is not in-depth provenance analytics, but we have demonstrated the application of the provenance profile generated as part of CWLProv. We have developed a command-line tool and Python API "cwlprov-py" [150] for CWLProv RO analytics to interpret the captured retrospective provenance of CWL workflow enactment. This API currently supports the following use-cases.
Given a CWLProv RO:

• Workflow Runs
Each RO can contain more than one workflow run if sub-workflows are utilized to group related tasks into one workflow. In that case, the provenance traces are stored in separate files for each workflow run. cwlprov-py identifies the workflow enactments including the sub-workflows (if any) and returns the workflow identifiers annotated with the step names. The user can select the required trace and explore particular traces in detail.

• Attribution
Each RO is assumed to be associated with a single enactment of the primary workflow and hence assumed to be enacted by one person. As discussed previously, CWLProv provides additional flags to enable user provenance capture. A user can provide their name and ORCID details that can be stored as part of an RO. cwlprov-py displays attribution details of the researcher responsible for the enactment (if enabled) and the versions of the workflow executor utilized in the analysis.

• Input/Output of a Process
Provenance traces contain associations between the steps/workflows and the data they used or generated. A user interested in a particular step can identify the inputs used and outputs produced, linked explicitly to that process.
Re-running or re-using only desired parts of a given workflow has been emphasized [24] as important to evaluate the workflow process or validate the associated published results without necessarily re-enacting the workflow as a whole. cwlprov-py uses the identifier of the step/workflow to be re-run, parses the provenance trace to identify the inputs required and ultimately creates a JSON input object with the associated input parameters. This input object can then be used for partial re-runs of the desired step/workflow, making segmented analysis possible even for CWLProv consumers who do not have sufficient hardware resources for re-executing more computationally heavy steps.
While the above explores some use cases for consuming and re-using workflow execution data, we have not explored this in full detail. Further work could develop more specific user scenarios and perform usability testing with independent domain experts who have not seen the executed workflow before.
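The construction of a job input object for a partial re-run can be sketched as follows. This is not the cwlprov-py API: the function job_for_step, the flat list of usage records and the data/<prefix>/<checksum> path layout are hypothetical simplifications, assumed here only to show the idea of turning recorded usages into a CWL input object.

```python
import json


def job_for_step(usages, step_id, data_dir="data"):
    """Build a CWL job input object for one step from usage records.

    Each record is a dict with 'step' and 'role' keys plus either a
    'checksum' (file input) or a 'value' (non-file input). This record
    layout is an assumption of the sketch, not the CWLProv PROV schema.
    """
    job = {}
    for usage in usages:
        if usage["step"] != step_id:
            continue
        if "checksum" in usage:
            checksum = usage["checksum"]
            # File inputs point at the content-addressed copy inside the RO.
            job[usage["role"]] = {
                "class": "File",
                "location": f"{data_dir}/{checksum[:2]}/{checksum}",
            }
        else:
            # Strings, integers and similar values are passed through as-is.
            job[usage["role"]] = usage["value"]
    return job


if __name__ == "__main__":
    trace = [
        {"step": "index", "role": "bam", "checksum": "ab12cd"},
        {"step": "index", "role": "threads", "value": 4},
        {"step": "align", "role": "reads", "checksum": "ff00aa"},
    ]
    print(json.dumps(job_for_step(trace, "index"), indent=2))
```

The resulting JSON document could then be passed to a CWL executor together with the tool or sub-workflow description of the selected step.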
An important point of CWLProv is to capture sufficient information at workflow execution time, so that post-processing (potentially by a third party) can support unforeseen queries without requiring instrumentation at workflow design time. For instance, cwlprov runtimes calculates the average runtime per step (requiring capture of the start/stop time of each step iteration), while cwlprov derived calculates derivation paths back to input data (requiring consistent identifiers during execution). Further work could build a more researcher-oriented interface based on this approach, e.g. hardcoded data exploration for a particular workflow.

Table 5 shows the run-times for the three workflow enactments using cwltool and toil-cwl-runner on Linux and MacOS with and without enabling provenance capture as described in the evaluation activity section. These workflows were enacted at least once before this time calculation, hence the timing does not include the time for Docker images to be downloaded. On a new system, when re-running these workflows for the first time, the Docker images will be downloaded and the enactment may take significantly longer than the time specified here, especially in the case of the somatic variant calling workflow because of the image size.
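The average-runtime query mentioned above for cwlprov runtimes only needs the start and stop time of each step iteration. A minimal sketch of such a calculation is given below; the function name average_runtimes and the tuple-based event format are assumptions for illustration and do not reflect how cwlprov-py reads the PROV files.

```python
from collections import defaultdict
from datetime import datetime


def average_runtimes(events):
    """Return the mean duration in seconds per step name.

    events: iterable of (step_name, start, end) tuples, where start and
    end are ISO 8601 timestamps without a timezone suffix (an assumption
    of this sketch).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for step, started, ended in events:
        elapsed = datetime.fromisoformat(ended) - datetime.fromisoformat(started)
        totals[step] += elapsed.total_seconds()
        counts[step] += 1
    return {step: totals[step] / counts[step] for step in totals}
```

A scattered step contributes one event per iteration, so its average is taken over all iterations recorded in the trace.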

Temporal and Spatial Overhead with Provenance
Run-time and storage overheads are important for provenance-enabled computational experiments. The choice of different operating systems and provenance capture mechanisms such as operating-system level, application-level or workflow-level, as well as I/O workload, interception mechanism and fine-grained information capture, are key for provenance [151,152].
In our case study, a significant time difference can be seen for the alignment workflow, which used the most voluminous dataset and hence produced a sizable RO as well. This was due to the RO generation, where data was aggregated within the RO. The difference between the provenance-enabled enactment and the enactment without provenance is barely noticeable for the other two workflow enactments with the smaller datasets. The discussion about handling big -omics data such as the human genome reference sequence, its index files and other database files (e.g. dbsnp) in Section Discussion and Future Directions provides a possible solution to avoid such overheads.
In addition, the noticeable time difference between the cwltool and toil-cwl-runner enactments is due to the default parallel versus serial job execution in toil-cwl-runner and cwltool respectively. The "scatter" operation in CWL, when applied to one or more input parameters of a workflow step or a sub-workflow, supports parallel execution of the associated processes. Parallelism is also available without "scatter" when separate processes have all their inputs ready. If sufficient compute resources are available, these jobs will be enacted concurrently; otherwise they are queued for subsequent execution. Compute-intensive steps of a workflow can benefit from the scatter feature, as parallel execution reduces the overall runtime. Both the alignment and somatic variant calling workflows utilize the scatter feature to enable higher degrees of parallel job execution in the case of toil-cwl-runner, which explains the time difference between the cross-executor enactments of these two workflows. The difference is negligible for the RNA-seq workflow, which comprises serial jobs with comparatively small test data.

Output Comparison Across Enactments
We compared the workflow outputs after each enactment to observe the concordance and/or discordance (if any) of the workflow enactment results produced across the platforms and across the executors. As a CWLProv RO refers to the data with hashed checksums, these checksums are utilized for the result comparison. It is worth mentioning that the comparison was made between the output files generated by the different enactments and a single "truth-set" output file and checksum available in the respective Git repositories.
Comparing the checksums of the output data generated by the initial enactments and by the re-runs using the CWLProv ROs, across platforms and across executors, showed concordance in all but one case. The "correctness" as well as agreement of these outputs given different execution environments (e.g. platform and executor) hold true except for the alignment workflow. The alignment workflow produced varying outputs after every execution, even with the same executor and platform. The output of the alignment algorithm "BWA mem" used in this workflow was non-deterministic, as it depended on the number of threads -t and the seed length -K, which affected the output produced. While the seed length in this case was set to a constant value, the number of threads varied depending on the availability of hardware resources at run-time, thereby resulting in varying output for the same input files.
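The checksum-based comparison against a truth set can be sketched in a few lines. The function name compare_outputs, the enactment labels and the dictionary layout are hypothetical and chosen only for illustration.

```python
def compare_outputs(truth, enactments):
    """List, per enactment, the outputs whose checksum differs from truth.

    truth: dict mapping output name to expected checksum.
    enactments: dict mapping an enactment label to such a dict.
    A missing output counts as discordant.
    """
    report = {}
    for label, outputs in enactments.items():
        report[label] = sorted(
            name for name, checksum in truth.items()
            if outputs.get(name) != checksum
        )
    return report


if __name__ == "__main__":
    truth = {"out.bam": "aaa", "out.bai": "bbb"}
    runs = {
        "cwltool-linux": {"out.bam": "aaa", "out.bai": "bbb"},
        "toil-macos": {"out.bam": "ccc", "out.bai": "bbb"},
    }
    print(compare_outputs(truth, runs))
```

An empty list for an enactment indicates full concordance with the truth set, whereas a non-deterministic tool would show up as a recurring entry across enactments.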

Discussion and Future Directions
This section discusses the current and future work with reference to enriched provenance capture and smart resource aggregation, and enhancements to both the CWLProv standard and implementation.

Compute and Storage Resources
The CWLProv format encapsulates the data and workflow definitions involved in a given workflow enactment along with its retrospective provenance trace. CWL as a standard provides constructs to declare basic hardware resource requirements such as minimum and maximum cores, RAM and reserved file system storage required for a particular workflow enactment. The workflow authors can provide this information in the "requirements" or "hints" section as "ResourceRequirement". These requirements/hints can be declared at workflow or individual step level, to help platforms/executors allocate the required resources. This information indirectly stores some aspects of the prospective view of provenance with respect to hardware requirements of the underlying system used to enact a workflow. Currently this information is only available if declared as part of the workflow specification. In future, we plan to include these requirements as part of the provenance for a given workflow such that all such information is gathered in one place and users are not required to inspect multiple sources to extract this information. This information can then be used as a pre-condition for the potential successful enactment of a given workflow.
As CWLProv is focused on retrospective provenance capture of workflow enactment, we plan to include provenance information about the compute and storage resources utilized in a given enactment to fulfill R18-resource-use. We believe that documenting these resources will allow users to analyse their environment and resource allocations before execution, as opposed to trial and error methods that may result in multiple failed enactments of a given workflow. Although resource usage is an important factor, most existing provenance standards surprisingly lack dedicated constructs to represent the underlying hardware resource usage information as part of prospective or retrospective provenance. In the case of complex workflows using distributed resources, where each step could be executed on a different node/server, including all this information in a single PROV profile will clutter the profile and render it potentially incomprehensible. Therefore, we plan to add a separate Usage Record document in the RO conforming to GFD.204 [114] to describe Level 1 (and potentially Level 2) resource usage in a common format independent of the actual execution environment.
Capturing such resource usage records requires a tighter integration with the execution platform, and so we consider this future work better suited for a cloud-based CWL engine like Toil or Arvados, as the reference implementation cwltool does not exercise fine-grained control of its task execution. Detailed raw log files can also be provided as Level 0 provenance, as we have demonstrated with cwltool, but these will by their nature be custom per execution platform and thus should be considered unstructured. Related work that is already exploring this approach is cwl-metrics [113], which analyses raw cwltool log files in combination with detailed Docker invocation statistics using the container monitoring tool Telegraf. Ongoing collaboration is exploring adding these metrics as additional provenance to the CWLProv RO with summaries in PROV and GFD.204 formats.
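Using declared requirements as a pre-condition for enactment can be sketched as a simple check. The dictionary below mirrors the coresMin and ramMin fields of a CWL ResourceRequirement (ramMin is expressed in mebibytes); the function satisfies, the example values and the treatment of missing fields as zero are assumptions of this sketch rather than behaviour defined by CWL or CWLProv.

```python
def satisfies(requirement, cores, ram_mib):
    """Check a host against the minimums of a ResourceRequirement-like dict.

    Missing fields are treated as 0 here, which is a simplification and
    not the default defined by the CWL specification.
    """
    return (
        cores >= requirement.get("coresMin", 0)
        and ram_mib >= requirement.get("ramMin", 0)
    )


if __name__ == "__main__":
    # Hypothetical declaration, as it might appear under "hints".
    requirement = {"class": "ResourceRequirement", "coresMin": 8, "ramMin": 32768}
    print(satisfies(requirement, cores=16, ram_mib=65536))
    print(satisfies(requirement, cores=4, ram_mib=16384))
```

Such a check could be run before re-enactment to avoid the trial and error that results from starting a workflow on an undersized system.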

Provenance Profile Augmented with Domain Knowledge
CWLProv benefits from existing best practices proposed by numerous studies (Table 1) and includes defined standards for workflow representation, resource aggregation and provenance tracking (Section Applied Standards and Vocabularies). We posit that the principle of following well-defined data and metadata standards enables explicit data sharing and reuse. In order to include rich metadata that allows bioinformaticians to produce specialized ROs for bioinformatics and so achieve CWLProv Level 3 as defined in section Levels of Provenance and Resource Sharing, we are investigating re-use of concepts from the BioCompute Object (BCO) project [9]. This domain-specific information is not necessary for computation and execution but for understandability of the shared resources. We encourage workflow authors to include such metadata and external identifiers for data and underlying tools, e.g. EDAM identifiers for the resources employed in designing a given workflow. The plan is to extract these annotations and represent them in the retrospective provenance profile in CWLProv to ultimately achieve pragmatic interoperability by providing domain-specific scientific context of the experiments. Domain-specific information is essential in determining the nature of inputs, outputs and context of the processes linked to a given workflow enactment [74]. This information can be captured in the RO if and only if the workflow author adds it to the workflow definition; thus, achieving CWLProv Level 3 depends on the individual workflows.

Big -omics Data
While aggregating all resources as one downloadable object improves reproducibility, the size of the resulting RO is an important factor in practice. On the one hand, completeness of the resources helps to minimize the workflow decay phenomenon by reducing dependence on the availability of third-party resources. On the other hand, the size of -omics data can result in hard-to-manage workflow-centric ROs, also leading to the spatial and temporal overheads discussed in the evaluation.
One solution is archiving the big datasets in online repositories or data stores and including the existing persistent identifiers and checksums in the RO instead of the actual data files, as previously demonstrated with BDBags [91,153]. While CWL executors like toil-cwl-runner can be configured to deposit data in a shared repository, the cwltool reference implementation explored in this study can only write to the local file system. External references raise the risk of unavailability of data at a later time. Therefore we recommend including the data in the RO if sufficient network and storage resources are available. Future work may explore post-processing CWLProv ROs to replace large data files with references to stable data repositories, producing a slimmer RO for transfer where individual data items can be retrieved on demand, as well as reducing data duplication across multiple related ROs.

Improving CWLProv e ciency with selective provenance capture
Shim refers to an adaptor step to resolve a format incompatibility issues between two work ow tasks [61], typically converting the previous output into an acceptable format for the next step. For example in our case study RNA-seq work ow, RNA-SeQC require an indexed BAM le, whereas the output of STAR or Picard MarkDuplicates only comprises of the BAM le alone. Hence, a shim step executing SAMtools index make the aligned reads analysis ready for RNA-SeQC. Compared to the more analytical steps, the provenance of such shim steps are not particularly interesting for domain scientists, and in many cases their intermediate data would e ectively double the storage cost with little information gains, as such data can be reliably recreated by re-applying the predictable transformation step (considering it as a pure function without side-e ects). Another type of ignorable steps could be purely diagnostic, which outputs are used primarily during work ow design to verify tool settings. A work ow engine does not necessarily know which steps are "boring" 3 and our proof of concept implementation will dutifully store provenance from all steps.
To improve efficiency, future CWLProv work could add options to skip capturing the outputs of specified shim steps, or to not store files over a particular file size. Similarly, a scientist or a WMS may elect to capture provenance only at a particular provenance level (see Section Levels of Provenance and Resource Sharing). Provenance captured under such settings would be "incomplete" (e.g. PROV would say RNA-SeQC consumed an identified BAM index file, but the corresponding bytes would not be stored in the RO). We therefore envision indicating this in the RO manifest as a variant of the CWLProv profile identifier, giving the end user a clear indication of what to expect in terms of completeness. Tools like cwlprov-py could then be extended to re-create missing outputs and verify their expected checksums, or to collapse the provenance listing of "boring" steps to improve human tractability.
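The options proposed above can be sketched in Python as follows. This is an illustration under stated assumptions, not an existing cwltool or cwlprov-py feature: `CapturePolicy`, `record_output` and `verify_recreated` are hypothetical names, as are the step names used in the usage example. The sketch always records the identity (checksum) of an output, stores the bytes only when the policy allows it, and later checks a re-created output against the recorded checksum.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CapturePolicy:
    """Hypothetical capture settings; not existing cwltool options."""
    skip_steps: frozenset = frozenset()   # names of shim or diagnostic steps
    max_bytes: Optional[int] = None       # None means no size limit

    def should_store(self, step_name: str, size: int) -> bool:
        """Return True if the output bytes should be kept in the RO."""
        if step_name in self.skip_steps:
            return False
        if self.max_bytes is not None and size > self.max_bytes:
            return False
        return True


def record_output(policy: CapturePolicy, step_name: str, path: str, data: bytes) -> dict:
    """Keep the checksum in every case; flag whether the bytes are stored."""
    return {
        "step": step_name,
        "path": path,
        "sha256": hashlib.sha256(data).hexdigest(),
        "stored": policy.should_store(step_name, len(data)),
    }


def verify_recreated(record: dict, data: bytes) -> bool:
    """Check that a re-created output matches the recorded checksum."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]
```

For example, with `CapturePolicy(skip_steps=frozenset({"samtools_index"}))`, the record for the BAM index would carry its checksum with `stored` set to `False`, so a consumer re-running the shim step can confirm that the regenerated index is byte-identical to the original.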

Enforcement of Best Practices: An Open Problem
Recommendations and best practices are frequently proposed by the scientific community to guide researchers in designing their computational experiments so that their research is reproducible and verifiable. Best practices are put forward not only for workflow design, but also for resource declaration, software packaging and configuration management [148], to avoid dependence on local installations and manual dependency management. The principle "Better Software, Better Research" [154] applies equally well to, and can be adapted for, the workflow design process.
Declarative approaches to workflow definition such as CWL facilitate and encourage users to explicitly declare everything in a workflow, improving the white-box view of the retrospective as well as the prospective provenance. Such workflows should provide insight into the complete process followed to produce a data artefact, resolving the black-box nature often associated with workflow provenance. However, it is entirely up to researchers to leverage these approaches to produce well-defined workflows with explicit details, facilitating enriched capture of the provenance trace at the appropriate level, and this can require considerable effort and consistency on the workflow designer's behalf. For instance, the alignment workflow used in this case study embeds bash scripts in the CWL tool definition, which adds another layer that must be penetrated to extract provenance information. In such cases, despite the use of CWL for the workflow definition and CWLProv for provenance capture, the provenance trace will lack critical information, making it coarse-grained, and the raw logs capturing the enactment will also be less informative.
The three criteria defined by Cohen-Boulakia et al. [24] to be followed by workflow designers are: modularized specifications, unified representation and workflow annotations. CWL facilitates a modular structure for workflow definitions by grouping similar steps into subworkflows; and, as an interoperable standard, CWL provides a common platform that moves towards resolving the heterogeneity of workflow specification languages. In addition, users can add standardised domain-specific annotations to data and workflows, incorporating the constructs defined by external ontologies (e.g. EDAM) to enhance understanding of the shared specification and the resources it refers to. All these features can be utilized to design better workflows and maximize the information declared, resulting in semantically rich and provenance-complete CWLProv ROs, and should thus be expressed clearly in user guides 4 for workflow authors.
The usability of any CWLProv RO relies directly on the practices that researchers follow to design and communicate their computational analyses. Workflow-centric initiatives similar to Software Carpentry [155] and Code Is Science [156] are one possible way to organize training and create awareness around best practices. Community-driven efforts should be made to further consolidate the understanding of what is required to make a given workflow explicit and understandable. Awareness of workflow design is not enough; the availability of the associated resources should also be emphasized, e.g. software as containers or software packages, big datasets in public repositories, and pre-processing/post-processing as part of the workflow. Without putting proposed best practices into actual practice, complete communication and hence the reproducibility of a workflow-centric computational analysis is likely to remain challenging.

Conclusion
The comprehensive sharing and communication of the computational experiments employed to achieve a scientific objective establishes trust in published results. Shared resources are sometimes rendered ineffective due to incomplete provenance, heterogeneity of platforms, unavailability of software and limited access to data. In this context, the contributions of this study are fourfold. First, we have provided a comprehensive summary of the recommendations put forward by the community regarding workflow design and resource sharing. Second, we define a hierarchical provenance framework to achieve homogeneity in the granularity of the information shared, with each level addressing specific provenance recommendations.
Third, we leverage the existing standards best suited to define a standardized format, CWLProv, for the methodical representation of workflow enactments, their provenance and the associated artefacts employed. Finally, to demonstrate the applicability of CWLProv, we extend an existing workflow executor (cwltool) to provide a reference implementation that generates interoperable workflow-centric ROs, aggregating and preserving data and methods to support the coherent sharing of computational analyses and experiments.
With any published scientific research, statements such as "Methods and data are available upon request" should no longer be acceptable in a modern open-science-driven research community. Considering on one hand the collaborative nature and emerging openness of bioinformatics research and on the other hand the heterogeneity of workflow design approaches, it is essential to provide open access to the structured representation of the data and methods utilized in any scientific study to achieve interoperable solutions facilitating reproducibility of science.
Provenance capture and its subsequent use to support the transparency of published research should not be treated as an afterthought but rather as a standard practice of utmost priority. With the adoption of well-defined standards for provenance and declarative workflow definition approaches, the assumption of black-box provenance often associated with workflows can be addressed. Workflow authors should be encouraged to follow well-established and agreed-upon best practices for workflow design and software environment deployment. In conclusion, we do not require new standards, new WMSs or indeed new best practices; instead, the focus should be to implement, utilize and re-use existing mature community-driven initiatives to achieve consensus in representing different aspects of computational experiments.