Abstract

Summary

Beacon v2 is an API specification established by the Global Alliance for Genomics and Health initiative (GA4GH) that defines a standard for federated discovery of genomic and phenotypic data. Here, we present the Beacon v2 Reference Implementation (B2RI), a set of open-source software tools that allow users to light up a local Beacon instance ‘out-of-the-box’. Along with the software, we have created detailed ‘Read the Docs’ documentation that includes information on deployment and installation.

Availability and implementation

The B2RI is released under GNU General Public License v3.0 and Apache License v2.0. Documentation and source code are available at: https://b2ri-documentation.readthedocs.io.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

In April 2022, the Global Alliance for Genomics and Health (GA4GH) released v2 of the Beacon specification, which defines an open standard for secure federated discovery of genomic and phenotypic data in biomedical research and clinical applications (Rambla et al., 2022). The Beacon v2 specification consists of two components, the Framework and the Models. The Framework defines the format for the requests and responses, whereas the Models define the structure of the biological data response (see Supplementary Data ST1 and SF1). Together, these components provide the instructions needed to design a REST API.

Implementing a Beacon v2 API directly from the specification can be challenging for centers that lack trained personnel. To demonstrate Beacon v2 capabilities and to facilitate its adoption, at the Centre for Genomic Regulation (CRG) we have developed the Beacon v2 Reference Implementation (B2RI), an open-source Linux-based software toolkit that allows users to light up a local instance of Beacon ‘out-of-the-box’. In this communication, we describe the software and summarize how its components work together to enable ‘beaconization’ of biological data.

2 Methods and implementation

Overall, two basic elements are needed to implement a local instance of Beacon v2: (i) an internal database (where the biological data are stored), and (ii) a REST API that provides a standardized way to receive requests and send responses. The B2RI provides these basic elements, as well as a set of tools to transform biological data to the internal database format. The B2RI consists of four components:

  • A set of tools for extraction, transformation and loading of metadata (e.g. sequencing methodology, bioinformatics tools), phenotypic data and genomic variants into a database.

  • The database (an instance of MongoDB) (https://www.mongodb.com).

  • The Beacon v2 query engine (i.e. a REST API).

  • An example dataset consisting of synthetic data (CINECA synthetic cohort EUROPE UK1) (see Supplementary Text ST6).

The software is available for download from Docker Hub (https://hub.docker.com/r/beacon2ri/beacon_reference_implementation) or through GitHub repositories (see Supplementary Text ST7) and must be deployed on a local workstation/server. Hence, some security aspects of data access, such as external IP access, rely on the ‘jurisdiction’ of each research center. The software is written in Python, Perl and Bash and is operated through a command-line interface.

We will now describe how the components work together to enable data conversion and access through the REST API.

2.1 Data ingestion

The data ingestion consists of three steps:

2.1.1 Transforming metadata and phenotypic data

Researchers and clinicians store metadata and phenotypic data in a wide variety of sources and formats (e.g. text files, CSV, Excel, databases, Electronic Health Records, PDF). The B2RI facilitates converting data in those formats to the hierarchical structure of the Beacon v2 Models. The Models are a set of seven entities (called entry types in the Beacon v2 specification): analyses, biosamples, cohorts, datasets, genomicVariations, individuals and runs. They were created to provide uniformity for the biological data responses (see Supplementary Fig. SF2). The entry types are defined using JSON Schema and consist of multiple properties (or terms). As input, we provide an Excel template (see https://github.com/EGA-archive/beacon2-ri-tools/tree/main/utils/bff_validator) consisting of all Models properties ‘flattened-out’ and separated into seven sheets (one per entry type). Note that it is not necessary to fill out all the sheets to light up a Beacon v2 instance. The user is responsible for filling out the Excel template according to the entities and terms they want to share. Ontologies are defined at this level, but we do not enforce the use of any particular ones, as ontologies depend on the domain of study (in any case, we provide examples in the documentation). Once the sheets are filled out, a B2RI utility validates the Excel file against the Models JSON Schemas and, if validation succeeds, creates a set of JSON text files (JSON arrays) as output that will later be loaded into the database.
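The idea of converting ‘flattened-out’ properties back into the hierarchical structure can be illustrated with a minimal Python sketch. It is not the B2RI validator: the dotted column names, the sample rows and the `check_required` helper are assumptions made for illustration, and the helper is only a small stand-in for full JSON Schema validation.

```python
import json


def unflatten(row):
    """Nest dotted column names, e.g. 'sex.id' becomes {'sex': {'id': ...}}."""
    nested = {}
    for dotted_key, value in row.items():
        if value in (None, ""):
            continue  # empty spreadsheet cells are skipped
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def check_required(doc, required=("id",)):
    """Stand-in for schema validation: return the required keys that are missing."""
    return [key for key in required if key not in doc]


# Hypothetical rows, as they might be read from an 'individuals' sheet.
rows = [
    {"id": "IND_001", "sex.id": "NCIT:C20197", "sex.label": "male", "ethnicity.id": ""},
    {"id": "IND_002", "sex.id": "NCIT:C16576", "sex.label": "female", "ethnicity.id": ""},
]

individuals = [unflatten(row) for row in rows]
problems = {doc.get("id", "?"): check_required(doc) for doc in individuals}

# Only serialize the JSON array when every document passed the check.
if not any(problems.values()):
    print(json.dumps(individuals, indent=2))
```

In this sketch the output is written only when all rows pass the check, mirroring the validate-then-serialize order described above.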

2.1.2 Transforming genomic variations

For genomic data, the B2RI comes with a tool (see https://github.com/EGA-archive/beacon2-ri-tools) that takes as input a VCF (Danecek et al., 2011) file (from DNAseq) and uses BCFtools (Narasimhan et al., 2016), SnpEff (Cingolani et al., 2012a) and SnpSift (Cingolani et al., 2012b) [with data from dbNSFP (Liu et al., 2020) (see Supplementary Text ST8) and ClinVar (Landrum et al., 2016)] to annotate each VCF. Once annotated, the tool transforms VCF data to the genomicVariations entry type and serializes it as a JSON file.
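As a rough illustration of the transformation step only (annotation is left out), the following Python sketch maps the first five columns of a VCF data line to a simplified variant document. The output field names, the SNP/INDEL classification and the 0-based start coordinate are assumptions chosen for the example; they do not reproduce the genomicVariations schema or the behavior of the B2RI tool.

```python
import json


def vcf_to_variants(lines):
    """Yield one simplified variant document per ALT allele, skipping header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _vid, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            # Crude classification (assumption): single-base REF and ALT is a SNP.
            variant_type = "SNP" if len(ref) == 1 and len(alt) == 1 else "INDEL"
            yield {
                "variantInternalId": f"{chrom}:{pos}:{ref}:{alt}",
                "referenceName": chrom,
                "start": int(pos) - 1,  # VCF POS is 1-based; 0-based start assumed here
                "referenceBases": ref,
                "alternateBases": alt,
                "variantType": variant_type,
            }


# Hypothetical VCF content with one multi-allelic site.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "22\t16050075\t.\tA\tG,AT\t.\tPASS\t.",
]

variants = list(vcf_to_variants(vcf_lines))
print(json.dumps(variants, indent=2))
```

Splitting multi-allelic sites into one document per alternate allele is a design choice of the sketch; it keeps each document queryable by a single reference/alternate pair.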

2.1.3 Load data into MongoDB

Once transformed, the set of seven JSON files defines what we call the Beacon Friendly Format (BFF) (see online documentation). The same tool used to process the VCF (see above) also enables loading BFF files into a MongoDB instance. We have chosen MongoDB as the default database because it works directly with JSON files. This way, we can store the data directly in the database according to the Beacon v2 Models and provide Beacon v2 compliant responses without the need to re-map the data at the API level (see Supplementary Text ST2). Once loaded into the database, the entry types are referred to as MongoDB collections.
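To make the loading step concrete, here is a minimal Python sketch that reads one JSON array and inserts its documents into a collection. It is not the B2RI loader. The function accepts any object exposing `insert_many()`, which is the bulk-insert method of a pymongo collection; the in-memory stand-in, the file name and the database name in the comment are assumptions so that the sketch runs without a MongoDB server.

```python
import json
import os
import tempfile


def load_bff(path, collection):
    """Read a JSON array from `path`, insert its documents, return how many were loaded."""
    with open(path, encoding="utf-8") as handle:
        docs = json.load(handle)
    if not isinstance(docs, list):
        raise ValueError(f"{path} does not contain a JSON array")
    if docs:  # bulk inserts of an empty list are avoided
        collection.insert_many(docs)
    return len(docs)


class InMemoryCollection:
    """Stand-in exposing the same insert_many() entry point as a pymongo collection."""

    def __init__(self):
        self.docs = []

    def insert_many(self, docs):
        self.docs.extend(docs)


# With a running MongoDB and pymongo installed, the collection could instead be
# obtained as (database name 'beacon' is hypothetical):
#   from pymongo import MongoClient
#   collection = MongoClient("mongodb://localhost:27017")["beacon"]["individuals"]

with tempfile.TemporaryDirectory() as tmp:
    bff_path = os.path.join(tmp, "individuals.json")
    with open(bff_path, "w", encoding="utf-8") as handle:
        json.dump([{"id": "IND_001"}, {"id": "IND_002"}], handle)
    collection = InMemoryCollection()
    loaded = load_bff(bff_path, collection)

print(f"loaded {loaded} documents")
```

Because the documents are inserted as they are, no re-mapping layer is needed between the JSON files and the stored collection, which is the point made in the paragraph above.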

2.2 REST API

2.2.1 Queries

The API (see https://github.com/EGA-archive/beacon2-ri-api) follows REST principles, and queries are carried out by sending requests (using either GET or POST HTTP methods) to Beacon v2 API endpoints (see Supplementary Text ST2). Queries are performed using request parameters to map the API’s vocabulary to MongoDB collections. Queries can be further refined by using filtering terms. There are four types of filtering terms: Bio-ontology, Custom, Numeric and Alphanumeric (see Supplementary Text ST4). Examples of API requests and responses are given in Supplementary Text ST5 and in the online documentation.
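The two request styles can be sketched in Python as follows. The base URL is hypothetical, and the endpoint name, parameter names and body layout are assumptions modeled on the Beacon v2 Framework; they should be checked against the specification and the Supplementary examples before use. Nothing is sent over the network: the sketch only builds the GET URL and the POST body.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://beacon.example.org/api"  # hypothetical deployment


def build_get_url(entry_type, request_parameters):
    """Encode request parameters as a query string for a GET request."""
    return f"{BASE_URL}/{entry_type}?{urlencode(request_parameters)}"


def build_post_body(request_parameters, filters=(), skip=0, limit=10):
    """Assemble a JSON body carrying request parameters and filtering terms."""
    return {
        "meta": {"apiVersion": "2.0"},
        "query": {
            "requestParameters": dict(request_parameters),
            "filters": [{"id": term} for term in filters],
            "pagination": {"skip": skip, "limit": limit},
        },
    }


# Variant query by position (parameter names are assumptions).
url = build_get_url(
    "g_variants",
    {"referenceName": "22", "start": 16050074, "alternateBases": "G"},
)

# Query on individuals refined with one bio-ontology filtering term.
body = build_post_body({}, filters=["NCIT:C16576"])

print(url)
print(json.dumps(body, indent=2))
```

GET suits short parameter lists that fit in a URL, whereas a POST body is easier to read once several filtering terms are combined.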

2.2.2 Security

The API can be configured according to different security and granularity levels. Three security levels (public, registered and controlled) can be set to grant differential external access and another three (boolean, counts and records) can be set for the granularity of the response (see Supplementary Text ST3).
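One way to picture how the two settings interact is the Python sketch below. The level names follow the wording of this section; the policy that maps each access level to a maximum granularity and the response field names are assumptions for illustration, not the configuration shipped with the B2RI.

```python
ACCESS_LEVELS = ("public", "registered", "controlled")
GRANULARITIES = ("boolean", "counts", "records")  # ordered from least to most detailed

# Hypothetical policy: the most detailed response each access level may receive.
MAX_GRANULARITY = {"public": "boolean", "registered": "counts", "controlled": "records"}


def allowed_granularity(requested, access_level):
    """Return the requested granularity, capped by what the access level permits."""
    cap = MAX_GRANULARITY[access_level]
    index = min(GRANULARITIES.index(requested), GRANULARITIES.index(cap))
    return GRANULARITIES[index]


def shape_response(matches, granularity):
    """Build a response exposing only as much detail as the granularity allows."""
    response = {"exists": bool(matches)}
    if granularity in ("counts", "records"):
        response["numTotalResults"] = len(matches)
    if granularity == "records":
        response["results"] = list(matches)
    return response


matches = [{"id": "IND_001"}, {"id": "IND_002"}]
for level in ACCESS_LEVELS:
    granularity = allowed_granularity("records", level)
    print(level, shape_response(matches, granularity))
```

Capping the requested granularity, rather than rejecting the request, means that a public user still learns whether matching data exist, which is the basic discovery use case of a Beacon.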

Acknowledgements

We would like to thank Dietmar Fernández-Orth, Sabela de La Torre and Toshiaki Katayamai for their contribution to previous versions of the software, and Prof. Michael Baudis (UZH) and EGA members for their comments. We also thank all early testers of the software and the referees for their valuable feedback.

Funding

This study was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies 2019–2021 and 2022–2023).

Conflict of Interest: none declared.

References

Cingolani P. et al. (2012a) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6, 80–92.

Cingolani P. et al. (2012b) Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program. Front. Genet., 3, 35.

Danecek P. et al.; 1000 Genomes Project Analysis Group. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.

Landrum M.J. et al. (2016) ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res., 44, D862–D868.

Liu X. et al. (2020) dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med., 12, 103.

Narasimhan V. et al. (2016) BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics, 32, 1749–1751.

Rambla J. et al. (2022) Beacon v2 and Beacon networks: a “lingua franca” for federated data discovery in biomedical genomics, and beyond. Hum. Mutat., 43, 791–799.

Author notes

The authors wish it to be known that, in their opinion, Manuel Rueda and Roberto should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Peter Robinson