Abstract

Summary

Over the past decade, there has been an exponential increase in the amount of disease-related genomic data available in public databases. However, this high-quality information is spread across independent sources and researchers often need to access these separately. Hence, there is a growing need for tools that gather and compile this information in an easy and automated manner. Here, we present ‘VarGen’, an easy-to-use, customizable R package that fetches, annotates and ranks variants related to diseases and genetic disorders, using a collection of public databases (viz. Online Mendelian Inheritance in Man, the Functional Annotation of the Mammalian genome 5, the Genotype-Tissue Expression and the Genome Wide Association Studies catalog). The package is also capable of annotating these variants to identify the most impactful ones. We expect that this tool will benefit research on variant-disease relationships.

Availability and implementation

VarGen is open-source and freely available via GitHub: https://github.com/MCorentin/VarGen. The software is implemented as an R package and is supported on Linux, macOS and Windows.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Complex genetic diseases are often caused by the accumulation of a large number of low-impact variants rather than a single defective gene. With the current epidemics of complex non-transmittable diseases such as diabetes mellitus and obesity, and the recent advances in sequencing technologies and genotyping, it is now possible to gain comprehensive insights into the genetics behind these disorders. Moreover, there has been an exponential increase in the amount of high-quality information available in public databases, e.g. the current build of dbSNP contains >660 million human RefSNP clusters (Sherry, 2001). Unfortunately, useful information is often scattered between independent sources, such as the Online Mendelian Inheritance in Man (OMIM), the Functional Annotation of the Mammalian genome 5 (FANTOM5), the Genotype-Tissue Expression (GTEx) and Genome Wide Association Studies (GWAS). Each of these databases provides useful and complementary information about the impact of variants on diseases, but they have to be accessed separately and are sometimes not based on the same version of the human genome. Some previous attempts to integrate single-nucleotide polymorphism (SNP)-related knowledge already exist (e.g. Cao et al., 2017; Ferrero, 2018; Pinero et al., 2017), but these often lack completeness and/or the sensitivity required. Here, we present VarGen, an easy-to-use R package for disease-associated variant discovery and annotation based on information integrated from different and complementary high-quality databases.

2 VarGen

VarGen implements a highly customizable workflow for retrieving and annotating SNPs using publicly available repositories (see Supplementary Fig. S1). The workflow’s entry point is typically a disease ID entered by the user. Alternatively, VarGen can retrieve causative SNPs based on a customized list of genes of interest. Genes related to the disease are first retrieved from the OMIM database (Amberger and Hamosh, 2017); these are subsequently called the ‘OMIM genes’. VarGen then retrieves variants situated directly on the OMIM genes, as well as variants present in their promoter regions using the FANTOM5 database. Integrating FANTOM5 data allows the detection of non-coding variants, for which there is growing evidence of a non-negligible impact on diseases (Ward and Kellis, 2012). Additionally, tissue-specific SNPs can be retrieved using the GTEx eQTLs database (GTEx Consortium, 2017). VarGen also accesses the GWAS catalog to get variants associated with GWAS traits of interest (Buniello et al., 2019). VarGen accesses these databases via BiomaRt (Smedley et al., 2009), the Ensembl application programming interface (API) and local files. The latter can be downloaded via the vargen_install function, making the installation straightforward. Moreover, all the positions are reported in hg38 coordinates. If a source still uses hg19, VarGen will lift over the positions. The variants are then annotated according to their location, impact and clinical significance, and scored with the Combined Annotation Dependent Depletion (CADD) phred score (Rentzsch et al., 2019). Since VarGen typically outputs a large number of variants as a result of the comprehensive list of repositories queried, this annotation step is helpful to rank them and identify the most relevant.
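To make the integration and ranking steps more concrete, the following is a minimal Python sketch of how variants collected from several sources could be merged by identifier and ordered by phred score. It is not VarGen's implementation (VarGen is an R package); the record fields ('rsid', 'source', 'phred'), the function name and all identifiers and scores are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

def merge_and_rank(records):
    """Merge per-source variant records by rsID and rank by phred score.

    Each record is a dict with the hypothetical keys 'rsid', 'source'
    and 'phred' (phred may be None when no score is available).
    """
    merged = defaultdict(lambda: {"sources": set(), "phred": None})
    for rec in records:
        entry = merged[rec["rsid"]]
        entry["sources"].add(rec["source"])
        # Keep the highest phred score seen for this variant.
        if rec["phred"] is not None and (
            entry["phred"] is None or rec["phred"] > entry["phred"]
        ):
            entry["phred"] = rec["phred"]
    # Highest phred first; variants without a score are placed last.
    ranked = sorted(
        merged.items(),
        key=lambda item: (item[1]["phred"] is None, -(item[1]["phred"] or 0.0)),
    )
    return [(rsid, sorted(info["sources"]), info["phred"]) for rsid, info in ranked]

# Toy input: identifiers and scores are made up for illustration only.
toy_records = [
    {"rsid": "rs0000001", "source": "omim", "phred": 12.3},
    {"rsid": "rs0000002", "source": "fantom5", "phred": 24.1},
    {"rsid": "rs0000001", "source": "gtex", "phred": 12.3},
    {"rsid": "rs0000003", "source": "gwas", "phred": None},
]
ranked_variants = merge_and_rank(toy_records)
```

Keeping the set of contributing sources next to each variant is a design choice of this sketch: it allows later filtering by source without re-querying the original repositories.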

VarPhen, a more specific alternative pipeline, is also available within the package. VarPhen limits the variant output by retrieving the variants directly linked to a list of phenotypes in BiomaRt. The list of phenotypes is automatically obtained from keywords entered by the user as input (see Supplementary Fig. S2).
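The keyword-to-phenotype-to-variant idea can be illustrated with a short Python sketch. This is an assumption-laden toy, not the VarPhen code: the case-insensitive substring matching, the function names and the phenotype-to-variant table are all hypothetical, and the real pipeline queries BiomaRt rather than an in-memory dictionary.

```python
def match_phenotypes(keywords, phenotype_terms):
    """Return the phenotype terms that contain any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [t for t in phenotype_terms if any(k in t.lower() for k in lowered)]

def variants_for_phenotypes(phenotypes, phenotype_to_variants):
    """Collect the variants linked to the selected phenotypes, without duplicates."""
    found = set()
    for phenotype in phenotypes:
        found.update(phenotype_to_variants.get(phenotype, ()))
    return sorted(found)

# Hypothetical lookup table standing in for a phenotype database.
toy_table = {
    "Obesity": ["rs0000001", "rs0000002"],
    "Severe early-onset obesity": ["rs0000002", "rs0000004"],
    "Alzheimer disease": ["rs0000009"],
}
selected = match_phenotypes(["obesity"], list(toy_table))
phenotype_variants = variants_for_phenotypes(selected, toy_table)
```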

To provide an overview of the variants discovered by the pipeline, we developed a custom visualization function, which displays the variants on each OMIM gene, grouped according to their impact (see Supplementary Fig. S3).

3 Benchmarking

VarGen was compared to two other similar tools: DisGeNET and VarFromPDB using the term ‘obesity’ (OMIM: 601665) as a use-case. The benchmarking script is available as Supplementary Data. Results of the benchmarking can be seen in Fig. 1.

Fig. 1. Venn diagram of the variants obtained with the different pipelines. Obesity (OMIM: 601665) was chosen as a use-case. VarGen and VarPhen are the two alternative pipelines available in the package, focused on sensitivity and specificity, respectively.

VarGen and VarPhen share, respectively, 456 and 365 variants with DisGeNET and VarFromPDB. Moreover, 882 variants are shared only between VarGen and VarPhen, highlighting the higher sensitivity of both pipelines.

In total, DisGeNET and VarFromPDB share 68 variants, of which only 2 are not discovered by either VarGen or VarPhen. Almost all of the 584 variants unique to DisGeNET come from literature mining and GwasDB, which are not implemented in the other pipelines (see Supplementary Fig. S4). Arguably, literature mining may introduce a large number of false positives and was therefore not included in our package. Of the 479 variants unique to VarFromPDB, 408 are not associated directly with obesity but with other phenotypes, such as leptin dysfunction, intellectual disability and Bardet-Biedl syndrome, which explains the limited overlap with the other tools (see Supplementary Fig. S5 and Table S1).
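The shared and unique counts reported in a Venn comparison of this kind reduce to set operations on variant identifiers. The Python sketch below shows the arithmetic on toy data; the tool names and rsIDs are placeholders, and it does not reproduce the benchmarking script provided as Supplementary Data.

```python
def overlap_summary(tool_variants):
    """Count variants unique to each tool and variants shared by all tools.

    tool_variants maps a tool name to a set of variant identifiers.
    """
    summary = {}
    for name, variants in tool_variants.items():
        # Union of the variants reported by every other tool.
        others = set().union(
            *(v for n, v in tool_variants.items() if n != name)
        )
        summary[name] = len(variants - others)
    summary["shared_by_all"] = len(set.intersection(*tool_variants.values()))
    return summary

# Toy sets: the identifiers are invented for illustration only.
toy_sets = {
    "tool_a": {"rs1", "rs2", "rs3", "rs4"},
    "tool_b": {"rs3", "rs4", "rs5"},
    "tool_c": {"rs4", "rs6"},
}
toy_summary = overlap_summary(toy_sets)
```

Here tool_a has two unique variants (rs1 and rs2), tool_b and tool_c have one each, and a single variant (rs4) is shared by all three.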

Some of the 119 243 variants unique to VarGen are potentially false positives or have little confirmed clinical evidence. Most of them can be filtered out using the phred score, source and clinical significance while keeping almost all the variants found by the other databases (see Supplementary Fig. S6). As some users will prefer a more specific approach, we provide an alternative pipeline, VarPhen, which retrieves only the most relevant variants.
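As a rough illustration of such filtering, the Python sketch below combines a phred threshold with optional source and clinical-significance filters. CADD phred-like scores are a rank-based scaling, -10 × log10(rank / total), so a score of 10 or more corresponds to the top 10% of scored variants and 20 or more to the top 1%. The threshold of 10, the field names and the toy records are assumptions made for this example; the text does not prescribe specific cut-offs.

```python
import math

def phred_from_rank(rank, total):
    """PHRED-like scaling of a rank, where rank 1 is the most deleterious."""
    return -10.0 * math.log10(rank / total)

def filter_variants(variants, min_phred=10.0, sources=None, significances=None):
    """Keep variants that pass the phred, source and significance filters.

    Each variant is a dict with the hypothetical keys 'rsid', 'phred',
    'source' and 'clinical_significance'. A filter set to None is skipped.
    """
    kept = []
    for variant in variants:
        if variant["phred"] is None or variant["phred"] < min_phred:
            continue
        if sources is not None and variant["source"] not in sources:
            continue
        if (
            significances is not None
            and variant["clinical_significance"] not in significances
        ):
            continue
        kept.append(variant)
    return kept

# Toy records: identifiers, scores and labels are invented for illustration.
toy_variants = [
    {"rsid": "rs0000001", "phred": 25.0, "source": "omim",
     "clinical_significance": "pathogenic"},
    {"rsid": "rs0000002", "phred": 3.2, "source": "gtex",
     "clinical_significance": "benign"},
    {"rsid": "rs0000003", "phred": 14.8, "source": "gwas",
     "clinical_significance": "uncertain significance"},
]
high_impact = filter_variants(toy_variants, min_phred=10.0)
pathogenic_only = filter_variants(toy_variants, significances={"pathogenic"})
```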

Similarly, an additional benchmark was carried out using Alzheimer’s disease (OMIM ID: 104300) as a use-case; the results were comparable to those for obesity described above (see Supplementary Figs S7–S9 and Table S2).

4 Conclusions

VarGen is a flexible, well-documented and easy-to-use R package for disease-related SNP discovery. The pipeline offers a higher degree of sensitivity than other existing tools, notably because it uses databases that they often overlook (e.g. FANTOM5). The output is a comprehensive list of annotated variants, ranked according to their phred score and clinical impact, which can also be visualized within a genome visualization track.

Funding

VarGen was developed as part of the European Union’s Horizon 2020-funded project Nutrishield (GA 818110).

Conflict of Interest: none declared.

References

Amberger J.S., Hamosh A. (2017) Searching Online Mendelian Inheritance in Man (OMIM): a knowledgebase of human genes and genetic phenotypes. Curr. Protoc. Bioinformatics, 58, 1.2.1–1.2.12.

Buniello A. et al. (2019) The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res., 47, D1005–D1012.

Cao Z. et al. (2017) VarfromPDB: an automated and integrated tool to mine disease-gene-variant relations from the public databases and literature. J. Proteomics Bioinformatics, 10, 311–315.

Ferrero E. (2018) Using regulatory genomics data to interpret the function of disease variants and prioritise genes from expression studies. F1000Res, 7, 121.

GTEx Consortium. (2017) Genetic effects on gene expression across human tissues. Nature, 550, 204–213.

Pinero J. et al. (2017) DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res., 45, D833–D839.

Rentzsch P. et al. (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res., 47, D886–D894.

Sherry S.T. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

Smedley D. et al. (2009) BioMart – biological queries made easy. BMC Genomics, 10, 22.

Ward L.D., Kellis M. (2012) Interpreting noncoding genetic variation in complex traits and human disease. Nat. Biotechnol., 30, 1095–1106.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Jonathan Wren
