GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor

Puigdevall, Pau; Castelo, Robert

doi:10.1093/bioinformatics/bty311

Abstract

Summary

Genomewide position-specific scores, such as those estimating conservation, constraint, fitness or mutation tolerance, are ubiquitous in current genome analyses. The diversity of sources and formats of these scores, as well as their size, increase the burden to use them. We present GenomicScores, a Bioconductor package that provides efficient storage and seamless access of genomewide position-specific scores from R, facilitating their use in genome analysis workflows.

Availability and implementation

GenomicScores is implemented in R and available at https://bioconductor.org/packages/GenomicScores under the open source ‘Artistic-2.0’ license.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Genome-wide position-specific scores assign each genomic position a numeric value typically measuring conservation, constraint, fitness or mutation tolerance. Genomic scores are built on the basis of different sources of information such as sequence homology, functional domains or physicochemical changes of amino acid residues, among others. These scores are widely used to filter single nucleotide variants or assess the degree of functionality of genome features. One particular and widely used instance of genomic scores are phastCons scores providing a measure of conservation calculated from genomewide alignments (Siepel et al., 2005).

However, retrieving, parsing and integrating diverse genomic scores often becomes a cumbersome task due to the multiple sources and formats of storage, as well as the size of the files that contain them, which grows substantially in the case of genomes of higher eukaryotes. The GenomicScores package addresses this problem to facilitate an efficient storage and seamless access of these scores, and their integration into genome analysis workflows on top of R and Bioconductor (Huber et al., 2015). The GenomicScores software package, and others used in this article through the call to the R function library(), can be installed in Linux, Windows or Mac OSX, by following the instructions available at https://bioconductor.org/install. The functionality illustrated in this article forms part of the Bioconductor release version 3.7 and it is more comprehensively documented in the online manual and vignette included in the software.

2 Results

Storing and accessing genomic scores is challenging when their values cover large regions of the genome, resulting in gigabytes of double-precision numbers. This is the case, for instance, of phastCons (Siepel et al., 2005), CADD (Kircher et al., 2014) or M-CAP scores (Jagadeesh et al., 2016). We address this problem by using lossy compression, storing quantized scores with run-length encoding (RLE) compressed vectors. Lossy compression attempts to trade off precision for compression without compromising the scientific integrity of the data (Zender, 2016). Limiting numeric precision leads to a subset of quantized values much smaller than the original range of genomic scores. This enables using only one or two bytes of storage for each score, and leads to long runs of identical values along the genome that can be compressed with RLE vectors; see Supplementary Note and Figure S1.

We have processed in this way a number of genomic score sets, rounding off the original values to 1 or 2 decimal places, or significant figures; see Supplementary Tables S1–S3. The GenomicScores package exposes these scores to the user through the object class GScores, which jointly with its associated methods, is designed with three goals. First, retrieve metadata including versioning, provenance and citation data. Second, minimize the memory footprint by loading only the scores associated with the queried genomic sequences. Third, provide a common access point through the functions gscores() and score(). The following example shows a basic use case where we query the phastCons average score for the genomic range 117, 232, 380–117, 232, 384 in the human chromosome 7.

> library(GenomicScores)

> library(phastCons100way.UCSC.hg19)

> phast <- phastCons100way.UCSC.hg19

> phast

GScores object

# organism: Homo sapiens (UCSC, hg19)

# provider: UCSC

# provider version: 09Feb2014

# download date: Apr 10, 2018

# loaded sequences: chr19_gl000208_random

# number of sites: 2858 millions

# maximum abs. error: 0.05

# use ’citation()’ to cite these data in publications

> gscores(phast, GRanges(“chr7: 11834150-11834154”))

GRanges object with 1 range and 1 metadata column:

seqnames ranges strand | default

[1] chr7 11834150-11834154 * | 0.2

——-

seqinfo: 1 sequence from unspecified genome; no seqlengths

Using a GRanges object (Lawrence et al., 2013) we queried a single genomic range of positions for which the mean score was calculated, but we can retrieve other summary statistics such as minimum or median values with the argument summaryFun. By default, the gscores() method returns the scores in a GRanges object but it can return them as a numeric vector by using the function score() instead.

> score(phast, GRanges(“chr7: 11834150-11834154”))

[1] 0.2

The previous example loads a Bioconductor annotation package (phastCons100way.UCSC.hg19) that stores phastCons genomic scores. Other Bioconductor annotation packages storing genomic score sets are listed in Supplementary Table S1. Among them, those whose name starts with the MafDb prefix provide minor allele frequency (MAF) data. Some score sets may be formed by different scores populations, corresponding to different precision levels (e.g. rounding off to 1 or 2 decimal places) or different data sources from which the scores were derived (e.g. MAF data derived from different human populations). To find out the available scores populations available in each score set, we can use the function populations(), and to use a specific one we can use the argument pop where appropriate.

> populations(phast)

[1] “default” “DP2”

> score(phast, GRanges(“chr7: 11834150-11834154”), pop=“DP2”)

[1] 0.214

Additional genomic scores sets (Supplementary Table S2) are also available online as AnnotationHub resources (Morgan, 2017) that can be retrieved as follows:

> availableGScores()

[1] “cadd.v1.3.hg19” “fitCons.UCSC.hg19”

[3] “linsight.UCSC.hg19” “mcap.v1.0.hg19”

[5] “phastCons7way.UCSC.hg38” “phastCons27way.UCSC.dm6”

[7] “phastCons60way.UCSC.mm10” “phastCons100way.UCSC.hg19”

[9] “phastCons100way.UCSC.hg38” “phyloP60way.UCSC.mm10”

[11] “phyloP100way.UCSC.hg19” “phyloP100way.UCSC.hg38”

> cadd <- getGScores(“cadd.v1.3.hg19”)

> mcap <- getGScores(“mcap.v1.0.hg19”)

Genomic score sets downloaded as AnnotationHub resources are cached on disk, so that a subsequent call to getGScores() on a previously downloaded score set, will just load it from disk. We illustrate below a use case of GenomicScores to find the top-3 most conserved single nucleotide variants in coding regions with largest deleterious effects predicted by CADD (Kircher et al., 2014) and M-CAP scores (Jagadeesh et al., 2016), and their gnomAD MAF values (Lek et al., 2016). For this purpose, we use the VariantAnnotation package (Obenchain et al., 2014) and one of its sample VCF files. Figure 1 shows a graphical representation of all the retrieved scores.

Fig. 1.

Open in new tab Download slide

Relationship between CADD, M-CAP, PhastCons scores and gnomAD MAF values for variants from coding regions of chromosome 22. Values on the y-axis are jittered to improve visualization. Short lines along each axis are a 1-d representation of the axis data to visualize values that are present in one axis only. No dots are hidden behind the legends

> library(MafDb.gnomAD.r2.0.1.hs37d5)

> library(VariantAnnotation)

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)

> mafdb <- MafDb.gnomAD.r2.0.1.hs37d5

> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

> fl <- system.file(“extdata”, “chr22.vcf.gz”,

package=“VariantAnnotation”)

> vcf <- readVcf(fl, “hg19”)

> vcf <- vcf[isSNV(vcf)]

> seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

> loc <- unique(locateVariants(vcf, txdb, CodingVariants()))

> loc$PHAST <- score(phast, loc)

> loc$CADD <- score(cadd, loc, ref = ref(vcf)[loc$QUERYID],

alt = alt(vcf)[loc$QUERYID])

> loc$MCAP <- score(mcap, loc, ref = ref(vcf)[loc$QUERYID],

alt = alt(vcf)[loc$QUERYID])

> seqlevelsStyle(loc) <- seqlevelsStyle(mafdb)

> loc$MAF <- score(mafdb, loc)

> loc[order(loc\textit{PHAST}, \textit{loc}CADD, loc$MCAP, na.last = NA,

decreasing = TRUE)[1: 3], ]

GRanges object with 3 ranges and 13 metadata columns:

seqnames ranges strand | LOCATION LOCSTART

[1] 22 50699625 - | coding 226

[2] 22 50927976 + | coding 652

[3] 22 50689460 -| coding 2

LOCEND QUERYID TXID CDSID

[1] 226 5615 75274 218661

[2] 652 8788 74392 216419, 216418

[3] 2 5450 75268 218646

GENEID PRECEDEID FOLLOWID

[1] 6300 <NA> <NA>

[2] 55586 <NA> <NA>

[3] 83933 <NA> <NA>

PHAST CADD MCAP MAF

[1] 1 40 0.86 4e-04

[2] 1 40 0.21 0.001

[3] 1 30 0.82 3e-05

——-

seqinfo: 1 sequence from unspecified genome; no seqlengths

3 Conclusion

The GenomicScores package facilitates the integration of genomic scores into genome analysis workflows on top of R and Bioconductor. Additional score sets can be added on request at the Bioconductor support site (https://support.bioconductor.org). The underlying storage and representation of the serialized lossy-compressed genomic scores are RDS files (R Core Team, 2018, p. 12). This precludes the interoperability with platforms other than R, if a developer wants to access directly the content of those files from those platforms. However, because the access to the scores within R is achieved through an object class built on top of other Bioconductor data structures, improvements on compressing and distributing the scores can be made without breaking R code from end-users and other packages importing the GenomicScores functionality.

Acknowledgements

We would like to thank the Bioconductor core team for their technical support to implement some of the features of GenomicScores, and the two anonymous reviewers for their constructive comments that have improved this paper.

Funding

This work was supported by Spanish MINECO/FEDER (TIN2015-71079-P) and ‘Maria de Maeztu Unit of Excellence’ (MDM-2014-0370), and Catalan AGAUR (SGR17-1020; FI-DGR 2015).

Conflict of interest: none declared.

References

Huber

W.

et al. (

2015

)

Orchestrating high-throughput genomic analysis with bioconductor

.

Nat. Methods

,

12

,

115.

Jagadeesh

K.A.

et al. (

2016

)

M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity

.

Nat. Genet

.,

48

,

1581

–

1586

.

Kircher

M.

et al. (

2014

)

A general framework for estimating the relative pathogenicity of human genetic variants

.

Nat. Genet

.,

46

,

310

–

315

.

Lawrence

M.

et al. (

2013

)

Software for computing and annotating genomic ranges

.

PLOS Comp. Biol

.,

9

,

e1003118.

Google Scholar

Crossref

WorldCat

Lek

M.

et al. (

2016

)

Analysis of protein-coding genetic variation in 60,706 humans

.

Nature

,

536

,

285

–

291

.

Morgan

M.

(

2017

) AnnotationHub: Client to access AnnotationHub resources. R package version 2.10.1. doi: 10.18129/B9.bioc.AnnotationHub.

Obenchain

V.

et al. (

2014

)

Variantannotation: a bioconductor package for exploration and annotation of genetic variants

.

Bioinformatics

,

30

,

2076.

R Core Team

(

2018

)

R Internals

.

R Foundation for Statistical Computing

,

Vienna, Austria

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Siepel

A.

et al. (

2005

)

Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes

.

Genome Res

.,

15

,

1034

–

1050

.

Zender

C.S.

(

2016

)

Bit Grooming: statistically accurate precision-preserving quantization with compression, evaluated in the netCDF Operators (NCO, v4.4.8+)

.

Geosci. Model Dev

.,

9

,

3199

–

3211

.

Google Scholar

Crossref

WorldCat

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Associate Editor:

Download all slides

Month:	Total Views:
April 2018	75
May 2018	72
June 2018	22
July 2018	38
August 2018	8
September 2018	119
October 2018	39
November 2018	40
December 2018	18
January 2019	22
February 2019	4
March 2019	24
April 2019	47
May 2019	43
June 2019	16
July 2019	22
August 2019	9
September 2019	27
October 2019	38
November 2019	27
December 2019	31
January 2020	27
February 2020	24
March 2020	22
April 2020	24
May 2020	35
June 2020	62
July 2020	51
August 2020	23
September 2020	15
October 2020	30
November 2020	22
December 2020	14
January 2021	20
February 2021	15
March 2021	21
April 2021	27
May 2021	16
June 2021	27
July 2021	26
August 2021	15
September 2021	10
October 2021	31
November 2021	9
December 2021	15
January 2022	10
February 2022	14
March 2022	32
April 2022	7
May 2022	20
June 2022	27
July 2022	18
August 2022	20
September 2022	67
October 2022	30
November 2022	10
December 2022	13
January 2023	13
February 2023	13
March 2023	22
April 2023	18
May 2023	21
June 2023	26
July 2023	17
August 2023	14
September 2023	19
October 2023	27
November 2023	11
December 2023	23
January 2024	21
February 2024	21
March 2024	18
April 2024	10

Article Contents

GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor

Abstract

1 Introduction

2 Results

3 Conclusion

Acknowledgements

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor

Abstract

1 Introduction

2 Results

3 Conclusion

Acknowledgements

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only