ReGaTE: Registration of Galaxy Tools in Elixir

Abstract
Background: Bioinformaticians routinely use multiple software tools and data sources in their day-to-day work and have been guided in their choices by a number of cataloguing initiatives. The ELIXIR Tools and Data Services Registry (bio.tools) aims to provide a central information point, independent of any specific scientific scope within bioinformatics or technological implementation. Meanwhile, efforts to integrate bioinformatics software in workbench and workflow environments have accelerated to enable the design, automation, and reproducibility of bioinformatics experiments. One such popular environment is the Galaxy framework, with currently more than 80 publicly available Galaxy servers around the world. In the context of a generic registry for bioinformatics software, such as bio.tools, Galaxy instances constitute a major source of valuable content. Yet there has been, to date, no convenient mechanism to register such services en masse.
Findings: We present ReGaTE (Registration of Galaxy Tools in Elixir), a software utility that automates the process of registering the services available in a Galaxy instance. This utility uses the BioBlend application program interface to extract service metadata from a Galaxy server, enhance the metadata with the scientific information required by bio.tools, and push it to the registry.
Conclusions: ReGaTE provides a fast and convenient way to publish Galaxy services in bio.tools. By doing so, service providers may increase the visibility of their services while enriching the software discovery function that bio.tools provides for its users. The source code of ReGaTE is freely available on GitHub at https://github.com/C3BI-pasteur-fr/ReGaTE.


Introduction
In recent years, various initiatives have aimed at cataloguing bioinformatics tools and services [1][2][3]. The ELIXIR Tools and Data Services Registry [4] (bio.tools) aims to provide a central point of information, independent of any specific scientific scope within bioinformatics or technological implementation. Another ongoing trend is the integration of bioinformatics software in workbench and workflow environments, which allow data analysts to design, automate, and reproduce bioinformatics experiments. The Galaxy framework [5][6][7] is one of the most popular of these environments, with currently more than 80 publicly available Galaxy servers around the world. In the context of a generic registry for bioinformatics software, such as bio.tools, Galaxy instances constitute a major source of valuable content. The ReGaTE utility is a software component that automates the registration of the bioinformatics tools installed on a Galaxy server. The following sections present the major aspects of its implementation, its architecture, and finally the mapping of tool metadata from Galaxy to bio.tools.
* Correspondence: olivia.doppelt@pasteur.fr

Implementation
ReGaTE pulls tool descriptions from a Galaxy server, augments the information, and pushes it to the bio.tools registry.
A Galaxy server is a framework that lets users configure and run a range of bioinformatics tools and workflows, and that provides many other features for the sharing, visualization, and reproducibility of analyses. The user interface and the execution of each tool are based on a tool definition stored in an XML file. Each file describes the bioinformatics tool in detail, including its parameters, inputs, and outputs. This makes it possible to display the tool's sometimes complex configuration options in a graphical user interface, primarily so that the tool can be parameterised and executed. Such tool definitions are loaded by the Galaxy server and are accessible through the Galaxy RESTful interface. The BioBlend library [8] allows convenient access to the Galaxy API from Python. Here, we have used BioBlend to extract Galaxy tool definitions from remote Galaxy instances.
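The extraction step described above can be sketched as follows. This is a minimal illustration rather than the ReGaTE code itself: the helper names `connect` and `fetch_tool_definitions` are hypothetical, and the server URL and API key in the usage note are placeholders. The only BioBlend calls used are `GalaxyInstance`, `tools.get_tools()`, and `tools.show_tool()`.

```python
from typing import Any, Dict, List


def connect(url: str, api_key: str) -> Any:
    """Open a BioBlend connection to a Galaxy server (hypothetical helper)."""
    # Imported lazily so that fetch_tool_definitions can be exercised
    # with a stand-in client when BioBlend is not installed.
    from bioblend.galaxy import GalaxyInstance

    return GalaxyInstance(url=url, key=api_key)


def fetch_tool_definitions(gi: Any) -> List[Dict[str, Any]]:
    """Return the detailed description of every tool the server exposes."""
    detailed = []
    for tool in gi.tools.get_tools():
        # io_details=True asks Galaxy to include input and output descriptions.
        detailed.append(gi.tools.show_tool(tool["id"], io_details=True))
    return detailed
```

Assuming a reachable server, a call such as `fetch_tool_definitions(connect("https://galaxy.example.org", "YOUR_API_KEY"))` would return one dictionary per installed tool, which is the raw material for the later annotation step.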
bio.tools [4] is a web portal provided by ELIXIR, the European infrastructure for biological information, for the exploration of bioinformatics resources including software packages, Web services, and database portals. Through a dedicated graphical interface, users can search for and compare resources. Bioinformatics resource providers can therefore use bio.tools to enhance the visibility of their services. A resource can be described and registered manually via a Web user interface, or programmatically using the registry API. Registry entries follow a model that is formalized in biotoolsXSD, an XML schema that defines a resource description model for bioinformatics with a mandatory core of ten attributes.
ReGaTE fetches the Galaxy tool definitions, enhances them with additional annotations, and converts them into the biotoolsXSD format, based on the mapping mechanism described in the next section, before pushing them to bio.tools. This process can be run all at once or in two steps: first extracting the tool metadata, then pushing the enhanced metadata to bio.tools. A ReGaTE user needs an account on the targeted Galaxy server and must retrieve the corresponding API key.

ReGaTE architecture
ReGaTE is a Python script coupled with a configuration file and a mapping between the semantics used by Galaxy and those used by bio.tools. An overview of its architecture is shown in Figure 1.
Figure 1 ReGaTE software architecture
The configuration file includes the Galaxy server URL, an API key, and a directory to store the generated tool files that can be uploaded to bio.tools. Suffix and prefix variables, for tagging the names of the tools extracted by ReGaTE, may also be specified. For example, the tool sartools deseq2 [9], implemented at Institut Pasteur, can be named sartools deseq2:InstitutPasteur.
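To make the role of these settings concrete, the sketch below parses a small INI-style configuration and applies the prefix and suffix to a tool name. The section names, key names, and values are assumptions chosen for illustration; they are not taken from the configuration file shipped with ReGaTE, and the suffix `:InstitutPasteur` is inferred from the example name above.

```python
import configparser

# Hypothetical configuration; section and key names are illustrative only.
EXAMPLE_CONFIG = """
[galaxy_server]
url = https://galaxy.example.org
api_key = YOUR_API_KEY
tools_dir = ./biotools_files

[naming]
prefix =
suffix = :InstitutPasteur
"""


def load_config(text: str) -> configparser.ConfigParser:
    """Parse the configuration text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def tag_name(tool_name: str, config: configparser.ConfigParser) -> str:
    """Wrap a Galaxy tool name with the configured prefix and suffix."""
    prefix = config.get("naming", "prefix", fallback="")
    suffix = config.get("naming", "suffix", fallback="")
    return f"{prefix}{tool_name}{suffix}"
```

With this configuration, `tag_name("sartools deseq2", load_config(EXAMPLE_CONFIG))` yields the tagged name used in the example above, which lets several institutes register the same tool without name clashes.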

Tool metadata mapping
The ReGaTE package includes mapping files for the annotation enhancement (see below), as well as a copy of the biotoolsXSD schema (XSD) for validating tool descriptions before they are pushed to the registry. A biotoolsXSD XML file describes a given software application, covering different properties:
• scientific properties, such as the domain catered for and a description of the type of task(s) done by a tool
• technical properties, such as the type of software and its interface(s), e.g., command line tool, Web application, Web service, etc.
• credit, for instance the references that need to be cited when referring to this work
• administrative information, such as the license of the software
Some of these properties are described using the EDAM ontology [10], a community-defined and machine-understandable vocabulary of common bioinformatics concepts:
• topics, i.e., scientific disciplines or domains covered by the resource
• specific operations performed by a tool or service
• types of input and output data
• formats in which inputs and outputs are available
The mapping from a Galaxy tool definition file to a bio.tools file is handled by the ReGaTE code, taking advantage of the large number of properties shared by such workbench wrappers and registry entries [11]. A few properties are not natively available in the Galaxy tool files retrieved by BioBlend; these missing data are provided by the ReGaTE configuration files.
The mapping of Galaxy tool properties to EDAM concepts is a key component. This translation is handled by YAML mapping files included in the ReGaTE distribution, which convert Galaxy datatypes to EDAM data and format concepts and also allow EDAM topics and operations to be specified.
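The datatype translation can be illustrated with a small lookup table. This is a sketch under stated assumptions: ReGaTE keeps its mapping in YAML files, whereas the table here is an inline Python dictionary with only two entries, and the function name `map_datatypes` and the output field names are hypothetical. Unknown datatypes fall back to the EDAM root concepts for format and data.

```python
from typing import Dict, List

EDAM_BASE = "http://edamontology.org/"

# Illustrative mapping only; ReGaTE ships its own mapping as YAML files.
DATATYPE_TO_EDAM: Dict[str, Dict[str, str]] = {
    "fasta": {"format": "format_1929", "data": "data_2044"},  # FASTA, Sequence
    "fastq": {"format": "format_1930", "data": "data_2044"},  # FASTQ, Sequence
}

# Fallback to the EDAM root concepts "Format" and "Data".
FALLBACK: Dict[str, str] = {"format": "format_1915", "data": "data_0006"}


def map_datatypes(extensions: List[str]) -> List[Dict[str, str]]:
    """Annotate Galaxy datatype extensions with EDAM format and data URIs."""
    annotations = []
    for ext in extensions:
        concepts = DATATYPE_TO_EDAM.get(ext, FALLBACK)
        annotations.append(
            {
                "galaxy_datatype": ext,
                "format_uri": EDAM_BASE + concepts["format"],
                "data_uri": EDAM_BASE + concepts["data"],
            }
        )
    return annotations
```

Falling back to the root concepts keeps every input and output annotated even when a datatype has no specific entry, at the cost of a less informative registry record.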

Conclusion and future work
The bio.tools registry allows Galaxy server maintainers to increase the visibility of their services, set in the context of offerings from other providers. The ReGaTE utility is a fast and convenient solution to enhance, publish, and maintain any services provided by a Galaxy server in the registry. Furthermore, ReGaTE can contribute to giving bio.tools more comprehensive coverage of community resources.
Current work on ReGaTE is focused on migration of the core functionality and tool semantics to the Galaxy Project itself. This integration will rely on the direct annotation of Galaxy datatypes with EDAM format and data concepts, as well as the possibility to specify EDAM topic and operation concepts directly in Galaxy tool definitions.