Efficient population-scale variant analysis and prioritization with VAPr

Birmingham, Amanda; Mark, Adam M; Mazzaferro, Carlo; Xu, Guorong; Fisch, Kathleen M

doi:10.1093/bioinformatics/bty192

Abstract

Summary

With the growing availability of population-scale whole-exome and whole-genome sequencing, demand for reproducible, scalable variant analysis has spread within genomic research communities. To address this need, we introduce the Python package Variant Analysis and Prioritization (VAPr). VAPr leverages existing annotation tools ANNOVAR and MyVariant.info with MongoDB-based flexible storage and filtering functionality. It offers biologists and bioinformatics generalists easy-to-use and scalable analysis and prioritization of genomic variants from large cohort studies.

Availability and implementation

VAPr is developed in Python and is available for free use and extension under the MIT License. An install package is available on PyPi at https://pypi.python.org/pypi/VAPr, while source code and extensive documentation are on GitHub at https://github.com/ucsd-ccbb/VAPr.

1 Introduction

Precision medicine’s promise of diagnoses and treatments precisely tuned to each patient cannot be realized without a comprehensive understanding of the genomic variants underlying individual differences (Morgan et al., 2010; Lek et al., 2016). The rise of next-generation sequencing has drastically decreased the cost and increased the availability of the raw genomic data upon which such understanding is built, but the technical hurdles to mining full annotation of these data, especially for variants from population-scale whole genome sequencing, hinder the transformation of variant information into knowledge (Pabinger et al., 2014).

Due to the glut of sequencing data and the scarcity of computational biologists with deep specialization in variant analysis, much variant annotation and prioritization must be performed by computer-savvy lab biologists and by bioinformatics generalists. Currently, they are hampered by the necessity of learning details of multiple annotation software tools and integrating these into an analysis environment, as well as developing bespoke data structures supporting storage and filtering of variants’ extremely heterogeneous information (Beck et al., 2014). Given these issues, many such analyses are performed manually or semi-manually, using ad hoc filtering approaches that are difficult to accurately record and reproduce. Variant Analysis and Prioritization (VAPr) addresses these challenges by providing a simple software package that painlessly integrates high-quality reproducible variant investigation into any analysis pipeline.

2 Main features

The VAPr workflow is described in Figure 1. As a wide variety of tools already exist for annotating variants (Pabinger et al., 2014), VAPr builds upon the best-in-class choices from this field. For basic evaluation of novel variants, VAPr leverages the popular command-line software ANNOVAR (Wang et al., 2010); since this tool requires the installation and update of various local data files, VAPr also provides set-up methods to easily manage these resources. For collation of existing annotations for known variants, VAPr queries the MyVariant.info web service (Xin et al., 2016), which aggregates continually updated information from more than a dozen disparate variant annotation efforts. Both sets of resulting annotations for each variant are merged and recorded in a local NoSQL MongoDB, against which filtering and output steps are performed. The BigQ extension to the i2b2 framework for clinical research management (Gabetta et al., 2015) has already demonstrated that such databases provide fast, scalable and flexible storage for variant information, and VAPr thus extends these benefits to the large body of data analysts working outside the i2b2 framework. Access to all data is provided by a single Python API (Application Programming Interface); furthermore, since VAPr is available through PyPi, it can be installed with a simple one-line command (pip install VAPr).

Fig. 1.

Open in new tab Download slide

VAPr workflow for variant annotation and prioritization. Variant calls in VCF format are parsed, annotated, aggregated and stored in a MongoDB database where they can be quickly retrieved through filtering queries or output as CSV files

An additional advantage of VAPr’s storage approach over a traditional relational database is that users need not learn a complex table structure in order to filter for variants of interest (Schulz et al., 2016). VAPr manages the data in such a way that custom queries can be formulated using only MyVariant.info or ANNOVAR field names using the standard PyMongo syntax (O’Higgins, 2011). Many users, however, will find that they do not need to write custom queries at all, as VAPr provides pre-built variant analysis filters for four classes of variants that are commonly of interest: deleterious rare variants, known disease variants, deleterious compound heterozygous variants, and de novo variants from trios. The full definitions for these filters are detailed in the VAPr documentation. A user requiring different filtering criteria can easily create a custom query using these preset filters as a template.

As variant data are primarily conveyed via the variant call format (VCF) (Danecek et al., 2011), VAPr supports single-sample and multi-sample VCF file input and output. VAPr results can also be exported to comma-separated value (CSV) files for examination in an Excel spreadsheet and/or integration with a wide variety of downstream analysis tools. VAPr supports human genome assemblies hg19 and hg38 and comes with extensive user documentation that presupposes no expertise beyond basic experience with Python, including a tutorial on how to compose custom queries. In addition, we provide installation instructions, an extensive suite of unit tests and a fully executable Jupyter Notebook workflow to demonstrate a sample usage of VAPr.

3 Sample implementation

VAPr supports extremely lightweight implementation of variant investigation in Python scripts. The code sample below demonstrates the minimal code necessary to read in and parse a VCF file, annotate its variants with MyVariant.info and ANNOVAR and identify those meeting criteria for known disease variants. As shown here, these tasks can be accomplished with only four statements.

from VAPr import vapr_core.VaprAnnotator

# Parse, annotate, and save variants from input VCF

annotator = VaprAnnotator(“/path/to/vcfs/”, “/path/to/output_dir/”, mongo_db_name, mongo_ collection_name, “/path/to/annovar/”, build_ver=“hg19”)

dataset = annotator.annotate()

# Identify and prioritize known variants

known_disease_variants = dataset.get_known_disease_variants()

Run times on a laptop computer with four cores and 16GB memory were 2.5 min for a whole exome vcf file (40 000 variants) and 5 h for a whole genome vcf file (4.8 million variants).

4 Conclusions

VAPr provides a simple, turn-key Python API that integrates popular annotation tools with a flexible local data store to efficiently annotate, analyze and prioritize population-scale variant calls. It accommodates large variant files (such as those from whole genome sequencing) and meshes easily with upstream variant calling pipelines, while offering simplified Python-based custom filtering for variant prioritization and built-in filter options for common criteria. Furthermore, as VAPr is entirely open-source and available for expansion under the MIT license, users can easily extend it to include their own prioritization strategies and thus reduce dependence on manual variant filtering. We hope that, given its native ease-of-use as well as its extensibility, VAPr will empower lab biologists and non-specialist bioinformaticists to perform more transparent and reproducible analyses. Such reproducible, automated variant evaluation methods are critical not only for primary research but also in realizing many precision medicine workflows.

Funding

This work was supported by the National Institutes of Health [UL1TR001442] of CTSA and the San Diego Center for Systems Biology [P50GM085764]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Conflict of Interest: none declared.

References

Beck

T.

et al. (

2014

)

GWAS Central: a comprehensive resource for the comparison and interrogation of genome-wide association studies

.

Eur. J. Hum. Genet

.,

22

,

949

–

952

.

Danecek

P.

et al. (

2011

)

The variant call format and VCFtools

.

Bioinformatics

,

27

,

2156

–

2158

.

Gabetta

M.

et al. (

2015

)

BigQ: a NoSQL based framework to handle genomic variants in i2b2

.

BMC Bioinformatics

,

16

,

415.

Lek

M.

et al. (

2016

)

Analysis of protein-coding genetic variation in 60, 706 humans

.

Nature

,

536

,

285

–

291

.

Morgan

A.A.

et al. (

2010

)

Clinical assessment incorporating a personal genome—authors’ reply

.

Lancet

,

376

,

869

–

870

.

Google Scholar

Crossref

WorldCat

O’Higgins

N.

(

2011

) MongoDB and Python: Patterns and processes for the popular document-oriented database. ‘O’Reilly Media, Inc, Sebastopol, CA.

Pabinger

S.

et al. (

2014

)

A survey of tools for variant analysis of next-generation genome sequencing data

.

Brief. Bioinformatics

,

15

,

256

–

278

.

Google Scholar

Crossref

WorldCat

Schulz

W.L.

et al. (

2016

)

Evaluation of relational and NoSQL database architectures to manage genomic annotations

.

J. Biomed. Inform

.,

64

,

288

–

295

.

Wang

K.

et al. (

2010

)

ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data

.

Nucleic Acids Res

.,

38

,

e164

–

e164

.

Xin

J.

et al. (

2016

)

High-performance web services for querying gene and variant annotation

.

Genome Biol

.,

17

,

91.

Author notes

The authors wish it to be known that, in their opinion, Amanda Birmingham, Adam M. Mark and Carlo Mazzaferro authors should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Associate Editor:

Download all slides

Month:	Total Views:
April 2018	81
May 2018	32
June 2018	41
July 2018	29
August 2018	199
September 2018	56
October 2018	17
November 2018	15
December 2018	8
January 2019	11
February 2019	5
March 2019	8
April 2019	45
May 2019	87
June 2019	19
July 2019	36
August 2019	27
September 2019	27
October 2019	30
November 2019	30
December 2019	21
January 2020	35
February 2020	89
March 2020	65
April 2020	57
May 2020	9
June 2020	58
July 2020	69
August 2020	25
September 2020	7
October 2020	26
November 2020	21
December 2020	17
January 2021	19
February 2021	14
March 2021	18
April 2021	41
May 2021	13
June 2021	23
July 2021	18
August 2021	18
September 2021	17
October 2021	24
November 2021	14
December 2021	18
January 2022	17
February 2022	25
March 2022	20
April 2022	28
May 2022	22
June 2022	18
July 2022	21
August 2022	27
September 2022	31
October 2022	34
November 2022	28
December 2022	12
January 2023	18
February 2023	24
March 2023	55
April 2023	19
May 2023	20
June 2023	11
July 2023	10
August 2023	18
September 2023	12
October 2023	11
November 2023	21
December 2023	23
January 2024	13
February 2024	19
March 2024	24
April 2024	11

Article Contents

Efficient population-scale variant analysis and prioritization with VAPr

Abstract

1 Introduction

2 Main features

3 Sample implementation

4 Conclusions

Funding

References

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

Efficient population-scale variant analysis and prioritization with VAPr

Abstract

1 Introduction

2 Main features

3 Sample implementation

4 Conclusions

Funding

References

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only