The genetics-BIDS extension: Easing the search for genetic data associated with human brain imaging

Abstract
Metadata are what make databases searchable. Without them, researchers would have difficulty finding data with the features they are interested in. Brain imaging genetics sits at the intersection of two disciplines, each with dedicated dictionaries and ontologies that facilitate data search and analysis. Here, we present the genetics Brain Imaging Data Structure extension. It consists of metadata files that accompany human brain imaging data and succinctly describe the genomic and transcriptomic data linked to them, which may be stored in different databases. This extension will help identify micro-scale molecular features that are linked to macro-scale imaging repositories, facilitating data aggregation across studies.

Introduction
Brain imaging genetics aims to study the association between brain structure or function and genetic variation [1]. Since gene expression influences cellular mechanisms, which in turn influence the neural circuits underlying behaviour, studying associations at the brain level deepens our understanding of gene function at the system level. There is also evidence that endophenotypes (i.e. brain phenotypes [2,3]) are better suited to understanding diseases, providing an intermediate level of description between genes and clinical phenotypes.
Both brain imaging and genetics are fields in which researchers are used to sharing data in order to replicate findings and allow secondary usage. The way data are shared differs, however: genetic data are sensitive personal information and must therefore be shared through secured and controlled access. Brain imaging data, by contrast, are often shared via open databases, or with fewer restrictions on who can access them. This has led to different approaches to data sharing in brain imaging genetics: fully secured and controlled access for all data (e.g. UK Biobank [4]) vs. splitting data, with open access to brain images but secured access to genetic data (e.g. Human Connectome Project [5]). The former approach works for large homogeneous projects requiring heavy data management, while the latter is easier, especially for multiple smaller individual studies or multicentric studies with heterogeneous data collection.
The Brain Imaging Data Structure (BIDS) describes a way of organising neuroimaging and behavioural data using dedicated names and dictionaries, documenting metadata [6]. Over time, extensions have been developed and integrated to address users' needs. Here we present the BIDS genetics extension. Its primary goal is to link BIDS datasets to associated genetic data, especially those held in separate repositories. Its secondary goal is to provide a succinct description of the type of genetic data available, thus enabling searches across multiple imaging datasets.

The brain imaging data structure genetic descriptor
Data organized according to BIDS have a rigid folder structure and naming convention. Every dataset comes with a dataset_description.json file that contains information about authors, funders, ethics, licence, etc. To refer to associated genetic data, this file must now include the URL pointing to the genetic data and, optionally, the URL of the database and other associated materials such as dataset descriptor articles. This allows quick searches across BIDS-compliant repositories for datasets with associated genetic data.
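As an illustration of the kind of search this enables, the following Python sketch scans a folder of BIDS datasets and lists those whose dataset_description.json declares a link to genetic data. It is a minimal sketch under stated assumptions: the key names "Genetics" and "Dataset" are not given in the text above and should be checked against the BIDS specification, and the function name find_genetic_links is hypothetical.

```python
import json
from pathlib import Path


def find_genetic_links(root):
    """Return {dataset folder name: genetics URL} for datasets that declare one.

    Assumption: the link is stored as description["Genetics"]["Dataset"].
    These key names are illustrative; consult the BIDS specification.
    """
    links = {}
    # Each immediate subfolder of `root` is treated as one BIDS dataset.
    for desc_file in sorted(Path(root).glob("*/dataset_description.json")):
        with desc_file.open(encoding="utf-8") as handle:
            description = json.load(handle)
        genetics = description.get("Genetics")
        if isinstance(genetics, dict) and genetics.get("Dataset"):
            links[desc_file.parent.name] = genetics["Dataset"]
    return links
```

Calling find_genetic_links on a directory that holds several datasets returns only those with a declared genetics URL, which is the repository-level filter the descriptor is meant to support.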
Another requirement of BIDS is that a participants.tsv file is included with, at minimum, each subject's identifier. This file is now also used to link the brain imaging and genetic datasets when different pseudo-identifiers are used, making it easy to associate pseudo-IDs without ever needing to access personal information.
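The linkage through participants.tsv can be sketched in Python as follows. This is an illustrative example only: the column name genetic_id, the identifiers and the function name map_imaging_to_genetic_ids are assumptions not given in the text above, while participant_id is the standard BIDS subject column.

```python
import csv
import io


def map_imaging_to_genetic_ids(tsv_text, genetic_column="genetic_id"):
    """Map each participant_id to its pseudo-identifier in the genetic dataset.

    Assumption: the genetic pseudo-ID lives in a column named `genetic_column`.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mapping = {}
    for row in reader:
        genetic_id = (row.get(genetic_column) or "").strip()
        # BIDS marks missing values with "n/a"; skip participants without a link.
        if genetic_id and genetic_id.lower() != "n/a":
            mapping[row["participant_id"]] = genetic_id
    return mapping


# Hypothetical participants.tsv content, for illustration only.
example = "participant_id\tage\tgenetic_id\nsub-01\t34\tG-1001\nsub-02\t29\tn/a\n"
print(map_imaging_to_genetic_ids(example))
```

Only pseudo-identifiers are read and matched here, so no personal information is needed to pair an imaging record with its genetic counterpart.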

Extension characteristics and imaging genetic information
This extension of the BIDS project aims to help researchers structure their molecular (multi-level) and imaging datasets side by side in order to improve data linkage and search performance. To facilitate metadata search, a genetic_info.json file must be associated with a BIDS dataset, describing which type of genetic information is available. Among the multiple available fields, it minimally requires the GeneticLevel at which genetic analyses were carried out: genetic, genomic, epigenomic, transcriptomic, metabolomic or proteomic [7] (figure 1), and the SampleOrigin of the data: blood, saliva, brain, csf, breast milk, bile, amniotic fluid, or other biospecimen. If the SampleOrigin value is brain, it is further recommended to specify the TissueOrigin (gray matter, white matter, csf, meninges, macrovascular or microvascular), as the genetic or genomic information may then be more specifically related to the available imaging data. This can be refined further by indicating the CellType analysed, with values taken from the Cell Ontology [8], and, if the TissueOrigin is gray matter, white matter or csf, by indicating the BrainLocation (using either MNI coordinates or labels from the Allen Brain Atlas [9]).
A last, recommended field is the AnalyticApproach, that is, the sampling methodology. While optional, it is of particular importance since it indicates in greater detail the type of genetic data available, using values from the database of Genotypes and Phenotypes (dbGaP [10]). As an example, the single nucleotide polymorphism (SNP) genotyping (array) and whole genome sequencing approaches both provide a whole-genome level of genetic information, albeit with some critical differences: SNP genotyping reports genomic data at a lower density than whole genome sequencing, which covers over 95% of the genomic DNA.
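The field rules described above can be expressed as a small check, sketched here in Python. This is not the validator distributed with the specification; it is a minimal sketch that takes the field names and value lists from the text, and assumes that each field holds a single string, that comparison can be case-insensitive, and that "other biospecimen" is a literal value. The function name check_genetic_info is hypothetical, and exact key spellings should be checked against the BIDS specification.

```python
# Value lists taken from the description above (compared in lower case).
GENETIC_LEVELS = {"genetic", "genomic", "epigenomic", "transcriptomic",
                  "metabolomic", "proteomic"}
SAMPLE_ORIGINS = {"blood", "saliva", "brain", "csf", "breast milk", "bile",
                  "amniotic fluid", "other biospecimen"}  # last entry: assumed literal
TISSUE_ORIGINS = {"gray matter", "white matter", "csf", "meninges",
                  "macrovascular", "microvascular"}


def check_genetic_info(info):
    """Return (errors, warnings) for a dict loaded from genetic_info.json."""
    errors, warnings = [], []

    # Required: GeneticLevel and SampleOrigin.
    level = str(info.get("GeneticLevel", "")).lower()
    if level not in GENETIC_LEVELS:
        errors.append(f"GeneticLevel missing or not recognised: {level!r}")

    origin = str(info.get("SampleOrigin", "")).lower()
    if origin not in SAMPLE_ORIGINS:
        errors.append(f"SampleOrigin missing or not recognised: {origin!r}")

    # Recommended when the sample comes from the brain: TissueOrigin.
    if origin == "brain":
        tissue = info.get("TissueOrigin")
        if tissue is None:
            warnings.append("TissueOrigin is recommended when SampleOrigin is brain")
        elif str(tissue).lower() not in TISSUE_ORIGINS:
            errors.append(f"TissueOrigin not recognised: {tissue!r}")

    # Recommended in all cases: AnalyticApproach (spelling as used in the text).
    if "AnalyticApproach" not in info:
        warnings.append("AnalyticApproach is recommended")

    return errors, warnings
```

Separating errors from warnings mirrors the distinction between required fields (GeneticLevel, SampleOrigin) and recommended ones (TissueOrigin, AnalyticApproach).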

Conclusion
BIDS is an openly developed, community-led standard to name, document and organize human brain imaging data, allowing FAIR data sharing and the automation of complex data preprocessing. In just four years of existence, it has revolutionized data sharing and analysis in neuroscience: it has been widely adopted, serves as a reference for publications, supports data repository architectures, and is critical to many open-source analysis pipelines. Here, we present the genetics extension, which is integrated into the BIDS specification with full documentation of the fields that may be provided, along with online examples and a JavaScript validator to ensure that datasets are compliant. By adding a genetic descriptor for imaging data, we hope to facilitate data mining and the assembly of large, multi-scale, heterogeneous human datasets that reflect human variability, which is necessary to enhance our understanding of genetic influences on brain phenotypes.

Author contributions
ClM, JT, TN, VC and CP conceptualized the BIDS extension; ClM and CP wrote the manuscript draft; ClM, MJL, ChM and CP wrote the extension and example; RB wrote the JavaScript validator. All co-authors have contributed to the preparation of the manuscript, and/or read and approved the final version.