SNiPA: an interactive, genetic variant-centered annotation browser

Motivation: Linking genes and functional information to genetic variants identified by association studies remains difficult. Resources containing extensive genomic annotations are available but often not fully utilized due to heterogeneous data formats. To enhance their accessibility, we integrated many annotation datasets into a user-friendly webserver. Availability and implementation: http://www.snipa.org/ Contact: g.kastenmueller@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online.

For genome annotation, we downloaded GENCODE gene data (including OMIM and DECIPHER annotations), regulatory feature clusters and regulatory motif data, as well as linked information, from the public MySQL database. The SNiPA database contains a total of 59,006 genes with 206,671 associated transcripts, 99,121 protein products, and 406,632 regulatory feature clusters, some of which are associated with transcription factor binding motifs. We also used many of the variant annotations provided by the Variant Effect Predictor (VEP) 3 . In addition, trait annotations and associations from OMIM 25 , HGMD 22 , UniProt 26 , dbGaP 23 and ClinVar 24 were fetched from the public MySQL database. Details are given in the tables below.

Variant set
SNiPA annotates all bi-allelic single nucleotide variants contained in the 1000 Genomes Project phase 1 version 3 and phase 3 version 5 datasets 21 . For each super-population (AFR, AMR, ASN, EUR in phase 1; AFR, AMR, EAS, EUR, SAS in phase 3), linkage disequilibrium data are precalculated for variant pairs with r² ≥ 0.1. The variant counts for the individual super-populations are:

Combined annotation dependent depletion (CADD)
Kircher et al. provide an annotation-aided score for genotype pathogenicity called CADD 5 . CADD-Scores for 1000 Genomes genotypes were obtained from http://cadd.gs.washington.edu/download. The downloaded file was parsed into one compressed Tabix-ready 6 file per chromosome (autosomes and X-chromosome) in General Feature Format (GFF, http://www.sanger.ac.uk/resources/software/gff/spec.html), Tabix-indexed and included in VEP annotation as custom annotation files. We used the PHRED-like transformation of the C score for variant annotation.
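As a rough illustration of the PHRED-like transformation, the scaled score is commonly defined from a variant's rank among all scored variants as -10·log10(rank/total), so that the top 10% of variants score 10 or higher and the top 1% score 20 or higher. The sketch below assumes this definition and applies it to a toy list of raw scores; the function name `phred_like` and the tie handling are illustrative choices, not part of CADD or SNiPA.

```python
import math


def phred_like(raw_scores):
    """Map raw scores to PHRED-like scaled scores: -10 * log10(rank / total).

    Rank 1 is assigned to the highest raw score. Tied scores share the
    best (lowest) rank among them, which is an assumption of this sketch.
    """
    total = len(raw_scores)
    ordered = sorted(raw_scores, reverse=True)
    rank_of = {}
    for rank, score in enumerate(ordered, start=1):
        rank_of.setdefault(score, rank)  # keep the first (best) rank for ties
    return [-10 * math.log10(rank_of[s] / total) for s in raw_scores]


# Toy data: 100 distinct raw scores in increasing order.
raw = [0.1 * i for i in range(1, 101)]
scaled = phred_like(raw)

print(round(scaled[99], 2))  # highest raw score, rank 1 of 100
print(round(scaled[90], 2))  # tenth-highest raw score, rank 10 of 100
```

In SNiPA the precomputed PHRED-like values shipped with the CADD download are used directly; the sketch only shows what the scale means.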

Thurman et al. - promoters & distal enhancers/repressors
In essence, Thurman et al. 7 used DNaseI hypersensitive sites (DHSs) and mapped them to transcription start sites (TSSs) of human transcripts. Accessible DHSs in proximity to the TSSs are classified as promoters. The accessibility patterns of more distal DHSs have been correlated with the accessibility patterns of promoters and are thus linked to the genes thought to be regulated by DHSs proximal to a TSS. After data processing, we obtained 412,798 distal elements (enhancers) and 23,749 promoters.

FANTOM5 - expressed promoters & enhancers/repressors
Two papers of the FANTOM5 consortium 8,9 describe the properties, location and transcript associations of expressed regulatory elements (promoters and enhancers). These datasets are provided at http://fantom.gsc.riken.jp/data/ and http://enhancer.binf.ku.dk/, respectively. After data processing, we included 82,420 expressed promoters and 43,002 expressed enhancers and their links to human transcripts in SNiPA.

StarBase v2.0: miRNA target sites
miRNA target sites located in RNA-binding protein (RBP) binding sites were obtained from the starBase v2.0 database (http://starbase.sysu.edu.cn/, released 09/2013, accessed 16/01/2014) 10 . We included target predictions from five prediction tools at positions that are located in experimentally identified regions bound by RBPs (n = 606,408). The downloaded file was parsed into one compressed Tabix-ready 6 file per chromosome (autosomes and X-chromosome) in General Feature Format (GFF, http://www.sanger.ac.uk/resources/software/gff/spec.html), Tabix-indexed and included in VEP annotation as custom annotation files.
This database comprises imputed association data on >2 million SNPs. Following the protocol in 11 , associations were filtered for genome-wide significance (P < 5.78×10⁻¹²). This filtered set was intersected with Kruskal-Wallis (KW) test results and filtered to feature a KW P < 10⁻¹⁰ as described by Zeller et al. 11 . SNPs were then split into cis- and trans-associations by their distance to the associated expression target (up to 1 MB apart: cis, otherwise: trans).
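The significance filtering and cis/trans split described above can be sketched as follows. Only the three thresholds come from the text; the record layout, the field names, and the rule that associations across different chromosomes count as trans are assumptions made for illustration.

```python
from dataclasses import dataclass

P_THRESHOLD = 5.78e-12   # genome-wide significance threshold (from the text)
KW_THRESHOLD = 1e-10     # Kruskal-Wallis P threshold (from the text)
CIS_WINDOW = 1_000_000   # 1 MB (from the text)


@dataclass
class Association:
    # Hypothetical record layout, not the format of the original data files.
    snp: str
    snp_chrom: str
    snp_pos: int
    target_chrom: str
    target_pos: int
    p_value: float
    kw_p_value: float


def is_significant(a):
    """Keep associations that pass both P-value filters."""
    return a.p_value < P_THRESHOLD and a.kw_p_value < KW_THRESHOLD


def classify(a):
    """Label an association as cis or trans by SNP-to-target distance."""
    same_chrom = a.snp_chrom == a.target_chrom  # assumption: other chromosome -> trans
    if same_chrom and abs(a.snp_pos - a.target_pos) <= CIS_WINDOW:
        return "cis"
    return "trans"


associations = [
    Association("snp1", "1", 1_000_000, "1", 1_400_000, 1e-15, 1e-12),  # nearby target
    Association("snp2", "1", 1_000_000, "1", 5_000_000, 1e-20, 1e-11),  # distant target
    Association("snp3", "2", 1_000_000, "3", 1_000_000, 1e-13, 1e-12),  # other chromosome
    Association("snp4", "1", 1_000_000, "1", 1_100_000, 1e-8, 1e-12),   # fails P filter
]

kept = {a.snp: classify(a) for a in associations if is_significant(a)}
print(kept)
```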

Fairfax et al., 2012 - B-cells and monocytes
Fairfax et al. investigated genotype associations with expression data from B-cells and monocytes from 288 individuals. For >600,000 SNPs, cis- (≤2.5 MB away from the probe) and trans-associations were determined at permutation (n = 1,000) P < 1×10⁻³ and Bonferroni-corrected P < 1×10⁻¹¹, respectively. All significant associations from the online supplement 14 were mapped to Illumina HumanHT-12 v4 probes using the genomic coordinates provided in the supplemental files to obtain an up-to-date mapping to the corresponding genes. For this, hg18/NCBI36 coordinates had to be converted to hg19/GRCh37 coordinates using the UCSC liftOver tool 15 . Probe mapping data were retrieved from the Ensembl public SQL database 16 .

seeQTL database - LCL and brain
The seeQTL database 17  In addition, association data from an eQTL study on human brain samples (Myers et al. 18 ) in the same file format is available and was also included.

Dixon et al., 2007 - LCL
Dixon et al. investigated genotype associations with expression data (using the Affymetrix HG-U133 Plus 2.0 chip) from LCL cell lines of 400 individuals 19 . The threshold for genome-wide significance was set to a LOD score >6.076 (equivalent to an FDR of 5%). Significant associations were extracted from the online supplement 19 . Associations with probes mapping to multiple locations in the genome were removed (n = 3,309). Associations were defined as trans if the SNP is located more than 1 MB from the probe center, and as cis otherwise.

Innocenti et al., 2011 - hepatocytes
Innocenti et al. investigated genotype associations with expression data (using Agilent 4x44K arrays) from liver tissue of 266 individuals 20 . The threshold for genome-wide significance was reported as a Bayes factor of >5. We downloaded significant cis-associations from the online supplement 20 .
SNiPA reports the P-values provided with these associations. Because significance was defined by Bayes factor, some of these P-values may not appear significant on a genome-wide level.

Phenotype data
In addition to the data obtained at Ensembl, we included the NHGRI GWAS Catalog and gene annotations from OrphaNet (details below).

General remarks
SNiPA is a variant-centered resource. There are only two additional annotation tracks available: a gene annotation track and a track consisting of regulatory elements. The latter are linked to Ensembl or to their primary sources, if available. The central content of SNiPA is the set of "SNiPA cards" containing the annotation of individual variants. As "SNiPA cards" are very detailed and thus cannot be compressed enough to be manageable when investigating large variant sets, we also provide two other displays of annotations. The first is the Block Annotation, which is basically a "SNiPA card" except that it merges the annotation of all variants specified by the user. The second is a tabular format that contains top-level annotations for the variants (one row per variant). These tables can be sorted and filtered by keywords, and individual "SNiPA cards" can then be accessed directly. Tables (as CSV) and "SNiPA cards" (as PDF) can be downloaded for later use.
Depending on the SNiPA module used, the user must provide one of the following input types: dbSNP rs-identifier(s), a gene identifier, or a chromosomal position. For convenience, we have collected various sets of gene identifiers (such as UniGene IDs, Entrez Gene IDs, HGNC gene symbols, and so on), which makes prior mapping to Ensembl gene IDs (the ID scheme used by SNiPA) unnecessary in most cases.
The approaches behind SNiPA's modules are logically separated by design. However, it is often necessary to analyze variants from different points of view. To simplify that, we have implemented a global interface ("Variant clipboard") that can be used to store variants of interest. The input forms of all SNiPA modules provide functionality to paste variants from the clipboard into the form. This enables, for instance, scrolling through the genome using the Variant Browser, selecting variants of interest, and afterwards switching to the Variant Annotation module (or the Block Annotation if all variants are located on the same chromosome) to retrieve the "SNiPA cards" for all variants at once.

Category "Unknown effect" SO terms: intergenic_variant
Annotations contained in "SNiPA cards" are grouped in the first four categories. In addition, there is another section holding information on trait annotations for variants and genes as well as one section on general information on the variant.
The categories are encoded in all visualizations via the symbol used for the single variants (symbol keys are always listed in legends).

SNiPA modules
Currently, there are eight modules that allow for retrieval of the data contained in SNiPA. In the following, we briefly introduce them to emphasize their underlying concepts. Detailed usage instructions are given in the documentation section on the SNiPA website.

Variant Browser
The SNiPA variant browser is our version of a genome browser with a variant-centered point of view (Figure 1A). Our main focus here was to enable the user to visually assess how well the variants in a locus are characterized by evidence. To achieve that, variants are plotted according to their highest effect category (see 2.2 SNiPA effect categories), meaning that the higher a variant is located in the plot, the more evidence exists for it to feature strong effects. Variants that are assigned to more than one effect category are highlighted in green; variants that have trait annotations available are highlighted in blue. Here, the symbols used for the variants and their location in the plot carry redundant information. This is because the two other interactive plotting modules of SNiPA (LD plot (Figure 1B) and regional association plot) implement the interface of the browser but position variants by other means, so that there the symbol is the only visual hint at the assigned effect categories.
The variant browser is intended for inspecting genomic loci without a background hypothesis. For hypothesis-driven analyses, other modules are better suited (see below).
An additional feature of the browser (and of all visualizations implementing the browser's interface) is that the display can be exported as vector image, PDF, or PNG.

Association Maps
To inspect variants or sets of variants that are associated with a specific trait (or a set of traits), we have implemented this module that allows for access to the data in SNiPA for variants with published associations ( Figure 1D). "SNiPA cards" of the variants can be directly accessed from the karyogram. Furthermore, variants can be added to the Variant clipboard and then be input into other modules such as the linkage disequilibrium plot for an LD-based locus inspection, the LD-based block annotation to get a summary of annotations for all correlating variants, the proxy search to retrieve a table of these variants with or without dense annotations, or the variant browser for further inspection of flanking regions of the locus.

Variant Annotation
This module provides direct access to variant annotations contained in SNiPA. Given a user-specified list of rs-identifiers, SNiPA returns a list of "SNiPA cards".

Block Annotation
SNiPA's block annotation module enables retrieval of merged annotations for a set of variants that can be specified in four different ways: a list of rs-identifiers, one rs-identifier that is first used to obtain a list of correlating variants (user-specified LD threshold), a gene identifier, or a chromosomal region. Currently, only variants located on the same chromosome can be processed by block annotation. The merged annotation can be used to characterize a whole locus and thus may also be useful for characterizing rare variants for which no annotations are available.
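The idea of a merged annotation can be sketched as a union of per-variant annotation values, guarded by the same-chromosome restriction. The data layout (dictionaries with `id`, `chrom`, and `annotations` keys), the example field names, and the function name `merge_annotations` are hypothetical; the text does not describe how SNiPA implements the merge.

```python
def merge_annotations(variants):
    """Merge the annotations of several variants into one block annotation.

    variants: list of dicts with keys 'id', 'chrom', and 'annotations',
    where 'annotations' maps a field name to a set of values.
    """
    chroms = {v["chrom"] for v in variants}
    if len(chroms) > 1:
        raise ValueError("block annotation requires variants on a single chromosome")
    merged = {}
    for v in variants:
        for field, values in v["annotations"].items():
            # Union the values so each annotation appears once in the block.
            merged.setdefault(field, set()).update(values)
    return merged


# Toy variants with made-up identifiers and annotation values.
block = merge_annotations([
    {"id": "var1", "chrom": "7", "annotations": {"genes": {"GENE_A"}, "traits": {"trait_x"}}},
    {"id": "var2", "chrom": "7", "annotations": {"genes": {"GENE_A", "GENE_B"}}},
])
print(sorted(block["genes"]), sorted(block["traits"]))
```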

Regional Association Plot
This is the classical plot for visualizing association results (locus-based Manhattan plot). Input is a user-specified list of variant/association p-value pairs. Variants are plotted by their position on the x-axis and -log10(p-value) on the y-axis. In addition, variants are colored by their correlation with the sentinel variant (by default, this is the variant with the lowest p-value, but optionally it can also be specified by the user). This plot implements the interface of the variant browser, meaning that all functionalities of the variant browser are provided except for navigating to other loci.
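The coordinate transformation and the default choice of the sentinel variant can be sketched in a few lines. The tuple layout and the function name `plot_coordinates` are assumptions for illustration, and the variant identifiers are made up.

```python
import math


def plot_coordinates(records, sentinel=None):
    """Compute (variant, x, y) points and pick the sentinel variant.

    records: list of (variant_id, position, p_value) tuples.
    sentinel: optional user-specified variant id; defaults to the
    variant with the lowest p-value.
    """
    points = [(vid, pos, -math.log10(p)) for vid, pos, p in records]
    if sentinel is None:
        sentinel = min(records, key=lambda r: r[2])[0]
    return points, sentinel


records = [
    ("var1", 1_200_000, 1e-3),
    ("var2", 1_250_000, 1e-8),
    ("var3", 1_300_000, 0.05),
]
points, sentinel = plot_coordinates(records)
print(sentinel)
```

Coloring by correlation with the sentinel would additionally require LD values between the sentinel and every other variant in the locus.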

Linkage Disequilibrium Plot
This plot is useful, for instance, for inspecting a published GWAS hit. It is common practice to select a single variant (e.g. the one with the lowest p-value) as the published representative of an association signal. LD data can be used to reproduce the reported locus, although there will always be differences because the study population will not be perfectly represented by the 1000 Genomes individuals. Input is a single rs-identifier. Variants are plotted by their position on the x-axis and their correlation (r²) to the specified variant on the y-axis. This plot implements the interface of the variant browser, meaning that all functionalities of the variant browser are provided except for navigating to other loci. Instead, the plot can be updated by selecting any contained variant as locus representative.

Proxy Search
This module allows for tabular retrieval of variants in LD with input variants. Dense annotation of the resulting variant set is possible.
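Conceptually, a proxy search is a lookup in a table of precalculated LD pairs. The sketch below assumes such a table is available as (variant, variant, r²) tuples; the function name `find_proxies`, the default threshold of 0.8, and the variant identifiers are illustrative and not taken from SNiPA.

```python
def find_proxies(ld_pairs, query, r2_min=0.8):
    """Return (proxy, r2) pairs in LD with the query variant, strongest first.

    ld_pairs: iterable of (variant_a, variant_b, r2) tuples; each pair is
    assumed to be stored once, in either orientation.
    """
    proxies = []
    for a, b, r2 in ld_pairs:
        if r2 < r2_min:
            continue
        if a == query:
            proxies.append((b, r2))
        elif b == query:
            proxies.append((a, r2))
    return sorted(proxies, key=lambda item: item[1], reverse=True)


ld_pairs = [
    ("var1", "var2", 0.95),
    ("var3", "var1", 0.85),
    ("var1", "var4", 0.40),
    ("var2", "var3", 0.90),
]
print(find_proxies(ld_pairs, "var1"))
```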

Pairwise LD
A common challenge of association studies is to find out if one locus contains more than one association signal. One possible (albeit not the optimal) approach to do so is to check the LD pattern of the variants contained in the locus which can be done using this module.
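For reference, the pairwise LD measure r² between two bi-allelic variants can be computed from phased haplotypes as D² / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA·pB. The sketch below assumes haplotypes coded as 0/1 alleles; it illustrates the standard definition and is not a description of how SNiPA precalculates its LD data.

```python
def r_squared(hap_a, hap_b):
    """Compute r^2 between two bi-allelic variants from phased haplotypes.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    haplotype, in the same haplotype order.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for x, y in zip(hap_a, hap_b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # r^2 is undefined if either variant is monomorphic
    d = p_ab - p_a * p_b
    return d * d / denom


print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated variants
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # independent variants
```

Two variants in the same locus with low pairwise r² are candidates for independent association signals, although, as noted above, this check is not the optimal approach.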

Updates / new releases
Like Ensembl, SNiPA updates will come in quarterly (or, if Ensembl updates are minor, semiannual) releases. Updates will include the incorporation of new 1000 Genomes data releases (if available) as well as complete updates of the Ensembl-based datasets. Using our custom variant annotator, the updated information will be merged with the additional datasets in the SNiPA collection. As soon as all annotations used by SNiPA are available for the GRCh38 genome assembly, we will include this assembly in SNiPA. GRCh37 data will be retained in parallel for a reasonable time.
Release notes will be listed in a corresponding section on the SNiPA website.

Documentation and feedback
SNiPA is a new resource, and we therefore depend on input from external users regarding improvements of the resource (such as the inclusion of additional datasets), the development of new modules, and the usefulness of the help texts in input forms and the documentation.
We have already created a rudimentary FAQ-like documentation, which we intend to extend, and we are open to any suggestions for improving SNiPA. We would therefore be very thankful for all hints, bug reports, questions, and suggestions. These can be sent either to the corresponding author or directly to feedback@snipa.org.