Manuel Rueda, Roberto Ariosa, Mauricio Moldes, Jordi Rambla, Beacon v2 Reference Implementation: a toolkit to enable federated sharing of genomic and phenotypic data, Bioinformatics, Volume 38, Issue 19, October 2022, Pages 4656–4657, https://doi.org/10.1093/bioinformatics/btac568
Abstract
Beacon v2 is an API specification established by the Global Alliance for Genomics and Health initiative (GA4GH) that defines a standard for federated discovery of genomic and phenotypic data. Here, we present the Beacon v2 Reference Implementation (B2RI), a set of open-source software tools that allow lighting up a local Beacon instance ‘out-of-the-box’. Along with the software, we have created detailed ‘Read the Docs’ documentation that includes information on deployment and installation.
The B2RI is released under the GNU General Public License v3.0 and the Apache License v2.0. Documentation and source code are available at: https://b2ri-documentation.readthedocs.io.
Supplementary data are available at Bioinformatics online.
1 Introduction
In April 2022, the Global Alliance for Genomics and Health (GA4GH) released v2 of the Beacon specification, which defines an open standard for secure federated discovery of genomic and phenotypic data in biomedical research and clinical applications (Rambla et al., 2022). The Beacon v2 specification consists of two components, the Framework and the Models. The Framework defines the format of the requests and responses, whereas the Models define the structure of the biological data in the response (see Supplementary Data ST1 and SF1). Together, these components provide the instructions needed to design a REST API.
Implementing a Beacon v2 API directly from the specification can be challenging for centers without trained personnel. To demonstrate Beacon v2 capabilities and facilitate its adoption, we at the Centre for Genomic Regulation (CRG) have developed the Beacon v2 Reference Implementation (B2RI), an open-source Linux-based software toolkit that allows lighting up a local instance of Beacon ‘out-of-the-box’. In this communication, we describe the software and summarize how its components work together to enable ‘beaconization’ of biological data.
2 Methods and implementation
Overall, two basic elements are needed to implement a local instance of Beacon v2: (i) an internal database (where the biological data are stored), and (ii) a REST API that provides a standardized way to receive requests and send responses. The B2RI provides these basic elements, as well as a set of tools to transform biological data to the internal database format. The B2RI consists of four components:
(1) A set of tools for extraction, transformation and loading of metadata (e.g. sequencing methodology, bioinformatics tools), phenotypic data and genomic variants into a database.
(2) The database (an instance of MongoDB) (https://www.mongodb.com).
(3) The Beacon v2 query engine (i.e. a REST API).
(4) An example dataset consisting of synthetic data (CINECA synthetic cohort EUROPE UK1) (see Supplementary Text ST6).
The software is available for download from Docker Hub (https://hub.docker.com/r/beacon2ri/beacon_reference_implementation) or through GitHub repositories (see Supplementary Text ST7) and must be deployed on a local workstation/server. Hence, some security aspects of data access, such as external IP access, rely on the ‘jurisdiction’ of each research center. The software is written in Python, Perl and Bash and is controlled and operated through a command-line interface.
We will now describe how the components work together to enable data conversion and access through the REST API.
2.1 Data ingestion
The data ingestion consists of three steps:
2.1.1 Transforming metadata and phenotypic data
Researchers and clinicians store metadata and phenotypic data in a wide variety of sources and formats (e.g. text files, CSV, Excel, databases, Electronic Health Records, PDF). The B2RI facilitates converting data in those formats to the hierarchical structure of the Beacon v2 Models. The Models are a set of seven entities, called entry types in the Beacon v2 specification (analyses, biosamples, cohorts, datasets, genomicVariations, individuals and runs), created to provide uniformity for the biological data responses (see Supplementary Fig. SF2). The entry types are defined using JSON Schema and consist of multiple properties (or terms). As input, we provide an Excel template (see https://github.com/EGA-archive/beacon2-ri-tools/tree/main/utils/bff_validator) in which all Models properties are ‘flattened-out’ and separated into seven sheets (one per entry type). It is not necessary to fill out all the sheets to light up a Beacon v2 instance: users fill out the Excel file according to the entities and terms they want to share. Ontologies are defined at this level, but we do not enforce the use of any particular one, as ontologies depend on the domain of study (we provide examples in the documentation). Once the sheets are filled out, a B2RI utility validates the Excel file against the Models JSON Schemas and, if validation succeeds, creates a set of JSON text files (JSON arrays) that will later be loaded into the database.
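To illustrate what ‘flattened-out’ properties mean in practice, the following minimal Python sketch rebuilds a nested record from a flat spreadsheet row. It is not the B2RI validator: the dotted column names (e.g. `sex.id`), the sample identifiers and the helper `unflatten` are assumptions made for illustration, and no JSON Schema validation is performed here.

```python
import json


def unflatten(row):
    """Rebuild a nested record from a flat row with dotted column names.

    The dotted-column convention is an assumption of this sketch, not
    necessarily the layout used by the B2RI Excel template.
    """
    nested = {}
    for column, value in row.items():
        if value in (None, ""):  # empty cells: the term is simply not shared
            continue
        node = nested
        *parents, leaf = column.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


# Two hypothetical rows from an 'individuals' sheet.
rows = [
    {"id": "IND_001", "sex.id": "NCIT:C20197", "sex.label": "male", "ethnicity.id": ""},
    {"id": "IND_002", "sex.id": "NCIT:C16576", "sex.label": "female", "ethnicity.id": ""},
]

individuals = [unflatten(row) for row in rows]

# A JSON array, one object per row, analogous to one entry-type file.
print(json.dumps(individuals, indent=2))
```

In a real workflow each rebuilt record would then be validated against the corresponding entry-type JSON Schema before being written out.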
2.1.2 Transforming genomic variations
For genomic data, the B2RI comes with a tool (see https://github.com/EGA-archive/beacon2-ri-tools) that takes as input a VCF file (Danecek et al., 2011) from DNAseq and annotates it using BCFtools (Narasimhan et al., 2016), SnpEff (Cingolani et al., 2012a) and SnpSift (Cingolani et al., 2012b) [with data from dbNSFP (Liu et al., 2020) (see Supplementary Text ST8) and ClinVar (Landrum et al., 2016)]. Once the VCF is annotated, the tool transforms its data to the genomicVariations entry type and serializes it as a JSON file.
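The transformation step can be pictured with a small Python sketch that turns VCF data lines into JSON objects. This is a simplification and not the B2RI tool: annotation is skipped entirely, the output field names are only loosely modeled on the genomicVariations entry type (the `location` layout in particular is illustrative), and the variant classification is a crude length-based rule that ignores symbolic alleles.

```python
import json


def classify(ref, alt):
    """Crude length-based variant type (an assumption of this sketch)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt):
        return "DEL"
    if len(ref) < len(alt):
        return "INS"
    return "MNP"


def vcf_to_variations(lines):
    """Convert VCF text lines into a list of simplified variation records."""
    docs = []
    for line in lines:
        if not line.strip() or line.startswith("#"):  # skip blank and header lines
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        start = int(pos) - 1  # VCF positions are 1-based; this sketch stores 0-based starts
        for alt in alts.split(","):  # one record per alternate allele
            docs.append({
                "variantInternalId": f"{chrom}:{pos}:{ref}:{alt}",
                "variation": {
                    "referenceBases": ref,
                    "alternateBases": alt,
                    "variantType": classify(ref, alt),
                    # Illustrative location layout, not the official schema.
                    "location": {"chromosome": chrom, "start": start, "end": start + len(ref)},
                },
            })
    return docs


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "22\t16050075\t.\tA\tG,AT\t100\tPASS\t.\n",
]
variations = vcf_to_variations(example)
print(json.dumps(variations, indent=2))
```

Splitting multi-allelic sites into one record per alternate allele is a design choice of the sketch; it keeps each JSON object self-describing.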
2.1.3 Load data into MongoDB
Once transformed, the set of seven JSON files makes up what we call the Beacon Friendly Format (BFF) (see online documentation). The same tool used to process the VCF (see above) also loads BFF files into a MongoDB instance. We chose MongoDB as the de facto database because it works directly with JSON documents. This way, we can store the data in the database according to the Beacon v2 Models and provide Beacon v2 compliant responses without re-mapping the data at the API level (see Supplementary Text ST2). Once loaded into the database, the entry types are referred to as MongoDB collections.
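As a rough picture of the loading step, the sketch below reads one JSON array per entry type and inserts it into a collection of the same name. It is not the B2RI loader: the file naming (`<entryType>.json`) and the function `load_bff` are assumptions, and `db` is any object whose `db[name]` exposes `insert_many`, which is the case for a pymongo `Database`.

```python
import json
from pathlib import Path

# The seven entry types named in the text; each becomes one collection.
ENTRY_TYPES = (
    "analyses", "biosamples", "cohorts", "datasets",
    "genomicVariations", "individuals", "runs",
)


def load_bff(db, bff_dir):
    """Insert each '<entryType>.json' array found in bff_dir into db[entryType].

    Returns a dict mapping entry type to the number of documents read.
    Missing files are skipped, since not every entry type has to be shared.
    """
    loaded = {}
    for name in ENTRY_TYPES:
        path = Path(bff_dir) / f"{name}.json"  # hypothetical file naming
        if not path.exists():
            continue
        docs = json.loads(path.read_text(encoding="utf-8"))
        if docs:  # insert_many rejects an empty list in pymongo
            db[name].insert_many(docs)
        loaded[name] = len(docs)
    return loaded


# Possible usage with pymongo (connection string and database name are placeholders):
#   from pymongo import MongoClient
#   db = MongoClient("mongodb://localhost:27017")["beacon"]
#   print(load_bff(db, "path/to/bff"))
```

Because the documents are inserted unchanged, the stored structure stays identical to the Models, which is the property the text relies on to avoid re-mapping at the API level.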
2.2 REST API
2.2.1 Queries
The API (see https://github.com/EGA-archive/beacon2-ri-api) follows REST principles, and queries are carried out by sending requests (using either the GET or POST HTTP method) to Beacon v2 API endpoints (see Supplementary Text ST2). Request parameters map the API’s vocabulary to MongoDB collections. Queries can be further refined by using filtering terms, of which there are four types: Bio-ontology, Custom, Numeric and Alphanumeric (see Supplementary Text ST4). Examples of API requests and responses are given in Supplementary Text ST5 and in the online documentation.
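To give a feel for a POST query with filtering terms, the following Python sketch assembles a request body. The overall layout (`meta`, `query`, `filters`, `pagination`, `requestedGranularity`) follows our reading of the Beacon v2 Framework and should be checked against the specification; the helper names and the `ageOfOnset` filter id are hypothetical, and nothing is sent over the network.

```python
import json


def ontology_filter(term_id):
    """Bio-ontology filter: matches records annotated with the given term."""
    return {"id": term_id}


def comparison_filter(field, operator, value):
    """Numeric or alphanumeric filter expressed as field / operator / value."""
    return {"id": field, "operator": operator, "value": value}


def build_request(filters, granularity="count", skip=0, limit=10):
    """Assemble a request body (structure assumed, see the note above)."""
    return {
        "meta": {"apiVersion": "2.0"},
        "query": {
            "filters": filters,
            "requestedGranularity": granularity,
            "pagination": {"skip": skip, "limit": limit},
        },
    }


body = build_request([
    ontology_filter("NCIT:C16576"),             # NCIT term for 'female'
    comparison_filter("ageOfOnset", ">", 50),   # hypothetical numeric filter
])
print(json.dumps(body, indent=2))

# The body would be POSTed to an entry-type endpoint of a local instance,
# e.g. <base URL>/individuals, where the base URL depends on the deployment.
```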
2.2.2 Security
The API can be configured according to different security and granularity levels. Three security levels (public, registered and controlled) can be set to grant differential external access and another three (boolean, counts and records) can be set for the granularity of the response (see Supplementary Text ST3).
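The interplay between security level and granularity can be sketched as follows. This is a toy model, not the B2RI configuration: the mapping from access level to maximum granularity is a hypothetical policy chosen for illustration, and the response keys are simplified.

```python
# Granularities ordered from least to most detailed.
GRANULARITIES = ("boolean", "count", "record")

# Hypothetical policy: the most detailed granularity each access level may see.
MAX_GRANULARITY = {
    "public": "boolean",
    "registered": "count",
    "controlled": "record",
}


def shape_response(matches, requested, access_level):
    """Trim a result set to the lower of the requested and the allowed granularity."""
    allowed = MAX_GRANULARITY[access_level]
    level = min(requested, allowed, key=GRANULARITIES.index)
    response = {"exists": len(matches) > 0}
    if level in ("count", "record"):
        response["numTotalResults"] = len(matches)
    if level == "record":
        response["results"] = list(matches)
    return response


matches = [{"id": "IND_001"}, {"id": "IND_002"}]
print(shape_response(matches, "record", "public"))
print(shape_response(matches, "record", "registered"))
print(shape_response(matches, "record", "controlled"))
```

Taking the minimum of the requested and the allowed level means a client never receives more detail than it asked for, nor more than its access level permits.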
Acknowledgements
We would like to thank Dietmar Fernández-Orth, Sabela de La Torre and Toshiaki Katayamai for their contributions to previous versions of the software, and Prof. Michael Baudis (UZH) and EGA members for their comments. We also thank all early testers of the software and the referees for their valuable feedback.
Funding
This study was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies 2019–2021 and 2022–2023).
Conflict of Interest: none declared.
References
Author notes
The authors wish it to be known that, in their opinion, Manuel Rueda and Roberto Ariosa should be regarded as Joint First Authors.