Demystifying non-coding GWAS variants: an overview of computational tools and methods

Abstract Genome-wide association studies (GWAS) have found the majority of disease-associated variants to be non-coding. Major efforts into the charting of the non-coding regulatory landscapes have allowed for the development of tools and methods which aim to aid in the identification of causal variants and their mechanism of action. In this review, we give an overview of current tools and methods for the analysis of non-coding GWAS variants in disease. We provide a workflow that allows for the accumulation of in silico evidence to generate novel hypotheses on mechanisms underlying disease and prioritize targets for follow-up study using non-coding GWAS variants. Lastly, we discuss the need for comprehensive benchmarks and novel tools for the analysis of non-coding variants.


Introduction
Charting the regulatory function of the non-coding genome has been an ongoing effort for over a decade. Although the ENCODE project's claim that 80% of the whole genome is functional is heavily debated (1,2), a plethora of regulatory features, ranging from small binding motifs to changes in chromatin packing, has been discovered in what was once considered to be 'junk' DNA (3,4). These mechanisms reveal a strong involvement of the non-coding genome in gene regulation (5).
The role of protein-coding variants in disease is relatively clear. Loss- or gain-of-function mutations can disrupt normal protein function and are therefore able to exert potentially detrimental effects on a phenotype (6). The role of a non-coding variant is less obvious. Genome-wide association studies (GWAS) have found an abundance of statistical associations between both coding and non-coding variants and disease. How these associated variants may impact biological functions provides insight into the genetic background of disease susceptibility. When considering fine-mapped GWAS hits, a strong enrichment of coding variants and a small depletion of non-coding variants are observed when compared with the expected distribution of GWAS hits given the sizes of the coding and non-coding genome (7). This indicates that a given coding variant is more likely to be statistically associated with phenotypic change than a non-coding variant. Nevertheless, since the vast majority of the genome is non-coding, approximately 95% of high-confidence fine-mapped SNPs are located in non-coding and flanking regions (7). This implies a substantial role for non-coding variation in disease.
Research into the regulatory functions of non-coding DNA has allowed for the development of a host of computational tools that aid in the interpretation of disease-associated non-coding variants. Here we provide a non-exhaustive overview of current tools and methods which can be used to interpret and generate hypotheses on the role of non-coding disease-associated variants identified by GWAS.

Exploring GWAS results in an era of data abundance
As the number of GWAS performed has steadily increased, so has the number of resources at a researcher's disposal to interpret GWAS results. This is particularly helpful for analyzing non-coding variants, where there are no obvious protein alterations that may explain phenotypic effects. Instead, these non-coding variants impact the phenotype by altering regulatory elements such as enhancers (8,9), transcription factor binding sites (10) and chromatin state (11). To give an example in a disease context: multiple studies have linked point mutations in the promoter sequence of the TERT gene to cancer (12-15). Relevant resources on the regulatory functions of non-coding regions include the ENCODE (3,16), FANTOM5 (17), Roadmap Epigenomics (18) and GTEx (19) projects. Yet the abundance of information available from these resources renders manual annotation of variants of interest time-consuming, inefficient and error-prone.
To address this problem, specific tools have been developed that automatically annotate variants and cross-reference them with relevant data repositories. For example, ANNOVAR (20), FUMA (21), GEMINI (22), HaploReg (23), RegulomeDB (24) and VEP (25) annotate variants with a broad range of sources, including the resources mentioned before (Table 1). GWAS results can be visually explored using LocusZoom (26) or FUMA (21). FUMA annotates and visualizes GWAS risk loci and allows for interactive investigation of GWAS results. LocusZoom visualizes Manhattan plots of risk loci and their underlying linkage-disequilibrium structure. FUMA provides a demonstration case of the relevance of these annotation tools for investigating non-coding variants in disease: several non-coding variants associated with BMI are located in an intronic region of the FTO gene, which was thought to be the causal gene. Annotation of these variants revealed them to be expression quantitative trait loci (eQTLs) for IRX3, which functional studies revealed to be the actual causal gene (27), exemplifying the use of variant annotation tools in non-coding variant analysis (21). Most of these tools were not developed exclusively for the exploration of non-coding variants, yet they are a useful first step in the analysis of disease-associated non-coding variants.
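At their core, such annotation tools cross-reference variant coordinates against large collections of regulatory-feature intervals. The minimal sketch below illustrates that operation; the feature names and coordinates are invented for illustration and are not real ENCODE or FANTOM5 annotations.

```python
# Minimal sketch of interval-based variant annotation, as performed internally
# by tools like ANNOVAR or VEP. All intervals below are illustrative only.

def annotate_variants(variants, features):
    """Return, for each (chrom, pos) variant, the names of overlapping features.

    variants: list of (chrom, pos) tuples (1-based positions)
    features: list of (chrom, start, end, name) half-open [start, end) intervals
    """
    # Group features per chromosome so each variant only scans its own chromosome.
    by_chrom = {}
    for chrom, start, end, name in features:
        by_chrom.setdefault(chrom, []).append((start, end, name))

    annotations = []
    for chrom, pos in variants:
        hits = [name for start, end, name in by_chrom.get(chrom, ())
                if start <= pos < end]
        annotations.append((chrom, pos, hits))
    return annotations

# Hypothetical regulatory features and variants.
features = [
    ("chr5", 1_295_000, 1_295_500, "TERT_promoter"),        # illustrative coords
    ("chr16", 53_700_000, 53_800_000, "FTO_intron_enhancer"),
]
variants = [("chr5", 1_295_113), ("chr16", 53_750_000), ("chr1", 100)]
for chrom, pos, hits in annotate_variants(variants, features):
    print(chrom, pos, hits or ["no_feature"])
```

Production tools use sorted interval indexes for speed and pull features from dozens of databases, but the lookup logic is the same.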

Predicting non-coding variant effects
When interpreting common or rare variants in a disease risk locus, it is often unclear which non-coding variants may have phenotypic consequences. Although the tools mentioned in the previous paragraph help annotate data with a range of functional genomic features, it can still be difficult to identify whether a polymorphism within a feature will have a phenotypic effect. For example, if ANNOVAR annotates a variant to be within an enhancer region, this does not indicate that the variant alters the enhancer function. To this end, a range of tools has been developed that aim to infer the effect of input variants (Table 2). These tools predict whether a variant has a functional effect or is pathogenic based on the local genetic sequence and features. These predictions are mostly made outside of a specific context, such as cell type or disease. In other words, a variant predicted to be pathogenic/functional will most likely have a phenotypic effect, but it is unknown what the precise effect may be on the trait of interest. Nevertheless, these tools can be useful for identifying potential causal variants from a list of disease-associated variants for further investigation.
The host of variant pathogenicity prediction tools makes use of different strategies and functional data; current methods commonly apply one or more of three general strategies (28).
The diversity of methods attests to the competitiveness of pathogenicity prediction algorithm development. With the large variety of tools utilizing different methods and background data, it can be difficult to select one that best suits a given research project.
Benchmarking this variety of predictive tools is not a trivial task. Typically, a small selection of tools is chosen to predict variant pathogenicity and benchmarked against one or two reference databases (e.g. ClinVar or COSMIC). In a recent benchmark of 24 pathogenicity prediction tools, Wang et al. showed that tool performance varies widely depending on the chosen benchmarking dataset. In this benchmark, four ground-truth datasets were constructed using rare somatic cancer variants from COSMIC, rare germline variants from ClinVar, regulatory variants from multiple eQTL databases and disease-associated GWAS variants. LINSIGHT and FunSeq2 were the best performers across all four benchmarking datasets (40). Where FATHMM-MKL was the clear winner in a previous benchmark (41), the newer FATHMM-XF was the top performer only on the ClinVar benchmarking set. The earlier benchmark also reports higher predictive accuracy by GWAVA for variants in the COSMIC database than any of the 24 tools benchmarked by Wang et al. (41). It remains relatively unclear how the motif-focused deep learning tools JARVIS and DeepSEA perform compared with other tools across different datasets. From this brief overview of recent benchmarking efforts, it is evident that no tool consistently outperforms all others across every benchmarked reference dataset, suggesting that each tool has different sensitivities to the underlying genetic architecture most represented in each reference database. Cooper et al. also demonstrate that variant pathogenicity prediction methods correlate poorly with each other, and further show that they correlate poorly with the results of massively parallel reporter assays for non-coding variants associated with Alzheimer's disease and progressive supranuclear palsy. This implies the need for experimental validation after prioritization using pathogenicity prediction tools (42).
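The core of such a benchmark is a ranking comparison between predictor scores and ground-truth labels, usually summarized as ROC AUC. The sketch below computes AUC via the Mann-Whitney statistic for two hypothetical predictors; all scores and labels are invented and do not reflect the performance of any real tool.

```python
# Hedged sketch: scoring pathogenicity predictors against a labeled
# reference set (e.g. pathogenic/benign calls from ClinVar or COSMIC).

def rank_auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic (ties receive average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Compare two hypothetical predictors on the same labeled variants.
labels = [1, 1, 0, 0, 1, 0]          # 1 = pathogenic in the reference set
tool_a = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]
tool_b = [0.6, 0.9, 0.5, 0.1, 0.2, 0.8]
print("tool_a AUC:", rank_auc(tool_a, labels))  # → 1.0 (perfect separation)
print("tool_b AUC:", rank_auc(tool_b, labels))
```

As the benchmarks discussed above show, the ranking of tools by such a metric can flip entirely when the reference set changes, which is why the choice of ground-truth dataset matters as much as the metric.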
Despite their shortcomings, these tools have proven useful for the prioritization of causal disease variants for follow-up studies. Examples of their practicality in disease research are many: FATHMM and LINSIGHT have helped pinpoint novel functional mutations in PMS2 in cancer genomes (43), and CADD has aided in prioritizing pathogenic variants associated with Alzheimer's disease (44). Given the large variation in performance of the currently available tools, a one-size-fits-all pathogenicity prediction tool is clearly still far off. As such, the 2013 clinical guidelines for categorizing a non-coding variant as pathogenic, which call for the use of at least three different computational tools, still seem relevant today (45). When selecting algorithms, we encourage consulting recent benchmarks performed on a dataset whose variants have a genetic architecture similar to that expected of the input variants, and choosing the best-performing tools there. Overall, when combined and selected with care, pathogenicity prediction algorithms are a powerful tool for extracting potentially causal variants from a list of variants of interest.
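The at-least-three-tools guideline can be operationalized as a simple consensus rule, sketched below. The tool names, score scales and cutoffs are hypothetical placeholders, not the published thresholds of any real predictor.

```python
# Illustrative consensus rule in the spirit of the 2013 guidelines:
# flag a variant only when a majority of predictors call it deleterious.
# Tool names, scores and thresholds are hypothetical placeholders.

THRESHOLDS = {"tool_x": 20.0, "tool_y": 0.8, "tool_z": 0.5}  # assumed cutoffs

def consensus_call(variant_scores, thresholds=THRESHOLDS, min_agree=2):
    """Return True if at least `min_agree` tools score at or above their cutoff."""
    calls = sum(variant_scores[tool] >= cut for tool, cut in thresholds.items())
    return calls >= min_agree

scores = {"rs_hypothetical_1": {"tool_x": 25.1, "tool_y": 0.91, "tool_z": 0.2},
          "rs_hypothetical_2": {"tool_x": 3.2, "tool_y": 0.10, "tool_z": 0.6}}
prioritized = [v for v, s in scores.items() if consensus_call(s)]
print(prioritized)  # rs_hypothetical_1 passes 2 of 3 cutoffs
```

Given how poorly predictors correlate with each other, such a majority vote trades some sensitivity for robustness against the idiosyncrasies of any single tool.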

Quantitative trait loci: a black box approach
One of the core difficulties in non-coding variant analysis is the identification of the mechanisms and biological pathways through which a variant might impact a phenotype. Linking variants to effector genes allows us to shift the analysis to a more interpretable unit of study. One approach to link variants to genes is colocalizing variants with molecular quantitative trait loci (molQTLs). MolQTLs are associations between the presence of a variant and a molecular measurement, such as RNA expression levels (eQTL), protein abundance (pQTL) or differential splicing (sQTL). Although molQTLs are blind to the precise mechanism of action, they provide a direct link from variant to gene. To incorporate these data in the analysis of non-coding variants, colocalization methods such as COLOC (46) and ezQTL (47) have been developed to test whether the overlap between the GWAS and molQTL signals is statistically significant (Table 3). A different molQTL approach is taken by the machine learning classifiers FIRE (48) and TIVAN (49) (Table 3). Both methods are trained on annotated cis-regulatory eQTL variants and aim to predict whether input variants are QTLs for nearby genes, which can aid causal variant interpretation. Note that these tools were not trained to predict trans effects, which might arise from chromatin interaction, for example.

Annotation and exploration tools are listed in Table 1. Pathogenicity prediction tools are listed in Table 2. Tools and methods for molQTL analysis are listed in Table 3. Finally, tools and methods for functional mapping and integration of experimental SNP-to-gene data can be found in Table 4.

MolQTL analyses have increased the interpretability of non-coding risk alleles of diseases such as type 1 diabetes (T1D) and schizophrenia. Through eQTL analysis, it was found that two T1D risk alleles converge in upregulating interferon-γ response genes (50). Dysregulation of the genes FURIN, TSNARE1, CNTN4, CLCN3 and SNAP91 has been implicated by eQTL analysis of schizophrenia GWAS hits (51), and these genes were later separately prioritized by chromatin interaction analysis of schizophrenia risk variants (52). Despite their practicality, QTLs are far from perfect. One should be aware that molQTLs can be tissue or cell-type specific. If the investigated trait is analyzed based on molQTLs from non-relevant tissues or cell types, this could lead to incorrect prioritization of genes (53). Preliminary results from a recent investigation show that GWAS hits and eQTLs have a different makeup of genomic features, suggesting that a large part of GWAS hits cannot be explained by eQTLs (54). Still, QTLs can be a powerful tool for interpreting the role of non-coding variants in disease when appropriate tissue types from well-powered samples are combined with additional lines of evidence.
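The statistical idea behind COLOC-style colocalization can be sketched with Wakefield's approximate Bayes factors: score each SNP's association evidence in both the GWAS and the molQTL data, then ask which SNP most plausibly drives both signals. The sketch below is a strong simplification that conditions on a single shared causal variant and omits COLOC's full five-hypothesis comparison; all effect sizes are invented.

```python
# Hedged, simplified sketch of the approximate-Bayes-factor machinery used
# by colocalization tools such as COLOC. Not a substitute for the real method.
import math

def log_abf(beta, se, prior_sd=0.15):
    """Wakefield log approximate Bayes factor for association vs. the null."""
    v, w = se ** 2, prior_sd ** 2
    z2 = (beta / se) ** 2
    r = w / (v + w)  # shrinkage factor
    return 0.5 * (math.log(1 - r) + r * z2)

def shared_snp_posterior(gwas, eqtl):
    """Posterior over SNPs for a single shared causal variant: softmax of the
    summed per-trait log ABFs, conditional on colocalization being true."""
    joint = [log_abf(*g) + log_abf(*e) for g, e in zip(gwas, eqtl)]
    m = max(joint)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in joint]
    total = sum(weights)
    return [w / total for w in weights]

# Three SNPs as (beta, se); SNP 0 is strongly associated in both traits.
gwas = [(0.30, 0.05), (0.05, 0.05), (0.02, 0.05)]
eqtl = [(0.50, 0.10), (0.10, 0.10), (0.01, 0.10)]
post = shared_snp_posterior(gwas, eqtl)
print([round(p, 3) for p in post])  # probability mass concentrates on SNP 0
```

The real COLOC model additionally weighs the competing hypotheses of no association, single-trait association and two distinct causal variants, which is what makes its output interpretable as evidence for or against colocalization.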

Linking variants to genes and regulatory mechanisms
Most of the earlier-mentioned methods pool a variety of genomic annotations and features to make in silico predictions of pathogenicity or prioritize genes. It can therefore be hard to interpret why a specific variant is predicted to be pathogenic in a specific disease context. Linking variants to genes or epigenetic effects allows for easier integration with known information on the studied disorder. Hi-C (55), enhancer studies (17,56,57), chromatin accessibility (58,59), DNase I hypersensitivity site-gene promoter correlation (60) and the activity-by-contact model (61) can all be used to connect the genomic region containing the variant of interest to a gene it potentially regulates. This does not immediately imply that the variant alters normal gene regulation. Recently, massively parallel reporter assays (42,62) and CRISPR/Cas9 (63) have been used to assess whether variants at a single genomic position alter gene regulation. Using data related to a single (epi)genetic mechanism to link non-coding variants to genes allows for easier interpretation than molQTLs and provides more concrete evidence of the biological functionality of the variant of interest than in silico pathogenicity predictions. The downside of using functional data for prioritizing potentially causal genes in a non-coding locus is that these datasets might not cover the relevant tissue or cell type and that different datasets/strategies often indicate different causal genes (72). This makes integrating multiple data sources for the prediction of the causal gene behind a GWAS signal difficult. Recent attempts to solve this integration problem through machine learning and regression models that predict causal genes from coding and non-coding variants seem to be relatively accurate (64,65).
Owing to the exhaustive preprocessing and statistical fine-mapping steps required, combined with the data needed to annotate variants, these functionally informed gene prioritization methods have yet to be made available as user-friendly, broadly applicable tools.
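Of the SNP-to-gene strategies above, the activity-by-contact model has a particularly transparent formulation: an element's score for a gene is its regulatory activity times its contact frequency with the gene, normalized over all candidate elements for that gene. The sketch below uses made-up activity and contact values to illustrate the calculation.

```python
# Hedged sketch of the activity-by-contact (ABC) scoring idea for linking
# candidate enhancers (and the variants inside them) to a target gene.
# Activity and Hi-C contact values below are invented for illustration.

def abc_scores(elements):
    """elements: {name: (activity, contact_with_gene)}.
    Returns each element's normalized ABC score for the gene."""
    products = {name: activity * contact
                for name, (activity, contact) in elements.items()}
    total = sum(products.values())
    return {name: p / total for name, p in products.items()}

# Candidate elements around a hypothetical gene; a variant sits in "enh_1".
elements = {"enh_1": (8.0, 0.20), "enh_2": (3.0, 0.10), "promoter": (5.0, 0.30)}
scores = abc_scores(elements)
print({name: round(s, 3) for name, s in scores.items()})
```

In the published model, activity is estimated from chromatin accessibility and H3K27ac signal and contact from Hi-C, with a threshold on the score deciding which enhancer-gene links to report; this sketch keeps only the arithmetic core.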
Additionally, there are methods for prioritizing genes in non-coding risk loci, such as the Polygenic Priority Score (PoPS) (66) or the simple nearest-gene method, that do not utilize experimentally derived SNP-to-gene data. PoPS is a ridge regression framework that leverages the polygenic signal in coding regions to identify enriched shared features, such as pathways, protein-protein interaction networks and co-expression modules, and uses these to prioritize genes in all risk loci (including non-coding loci). The nearest-gene method simply states that the gene body nearest to the (lead) variant is most likely the effector gene. This is underpinned by the recent whole exome sequencing (WES) of 454 787 UK Biobank participants, which showed that genes containing rare coding variants significant at P ≤ 10⁻⁷ are 45.5 times more likely to be the closest gene to a GWAS lead variant for the same trait than expected by chance (67). Weeks et al. demonstrate that when the nearest gene agrees with the PoPS prioritization, or when either agrees with a prioritization based on functional data, causal gene prediction accuracy is around 80% (66). Overall, these methods are less interpretable than causal gene prediction based on functional data, yet when combined, they can prioritize causal genes in non-coding risk loci with decent accuracy.
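The nearest-gene heuristic is simple enough to sketch directly: assign the lead variant to the gene body at the smallest genomic distance, counting a variant inside a gene body as distance zero. The gene coordinates below are illustrative placeholders, not taken from a real annotation build.

```python
# Minimal sketch of the nearest-gene heuristic for lead GWAS variants.
# Gene coordinates are illustrative only.

def distance_to_gene(pos, start, end):
    """0 if the position falls inside the gene body, else distance to its edge."""
    if start <= pos <= end:
        return 0
    return min(abs(pos - start), abs(pos - end))

def nearest_gene(chrom, pos, genes):
    """genes: list of (chrom, start, end, name); returns the closest gene's name."""
    candidates = [(distance_to_gene(pos, start, end), name)
                  for c, start, end, name in genes if c == chrom]
    return min(candidates)[1]

genes = [("chr16", 53_703_963, 54_121_941, "FTO"),
         ("chr16", 54_320_000, 54_330_000, "IRX3_hypothetical_coords")]
print(nearest_gene("chr16", 53_800_000, genes))  # lead SNP inside the FTO body
```

The FTO/IRX3 case discussed earlier is exactly where this heuristic fails on its own, which is why combining it with PoPS or functional SNP-to-gene evidence improves accuracy.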

Recommendations
Throughout this review, we have showcased multiple computational tools for the interpretation of non-coding variants in a disease GWAS context. At its core, in silico analysis of GWAS results remains a highly complex task, especially when analyzing non-coding variants. Eventually, a researcher must be able to integrate multiple lines of converging evidence to form a plausible hypothesis of which implicated variants might indeed be causal, and how these variants disrupt normal cell function. Here we provide a proposed workflow that can be used for the post-GWAS analysis of non-coding variants (Fig. 1).
The workflow highlighted above allows for the accumulation of multiple lines of evidence in order to identify putative causal variants and their implications for disease. Ultimately, no in silico evidence is as robust as in vivo/in vitro evidence. Nevertheless, the in silico analysis of disease-associated variants is an invaluable resource for identifying which variants, genes, pathways and cell types should be considered for functional follow-up studies.

Future developments
The variety of tools and strategies for interpreting non-coding variants in a disease context touched upon in this review highlights one underlying truth: there is currently no perfect, one-size-fits-all tool or method to find causal SNPs from GWAS findings. As the costs of WES and whole genome sequencing (WGS) decrease, it is feasible that rare and deleterious coding variants of large effect size will allow us to more easily pinpoint causal genes for a variety of disorders. This luxury is absent for non-coding variants. Therefore, we foresee the need for continued improvement of non-coding variant analysis tools. The accumulation of massively parallel reporter (42,62) and CRISPR/Cas9 (63) assays may, in the coming years, be a valuable additional source of data for the improvement of predictive algorithms.

Conclusions
The number of databases, roadmaps and atlases (3,16-19) that have been developed over recent years makes this an exciting time for investigating the complex and diverse regulatory mechanisms in the non-coding part of the genome. There is a broad range of tools available that allow researchers to accumulate evidence for regulatory effects of non-coding variants that are expected to be causally associated with a disease phenotype (Tables 1-4). Finding the correct tool for a given research question can be a challenge. Both the prediction of causal/pathogenic variants and the prediction of effector genes and pathways of non-coding variants display a large amount of variation in performance. More frequent, comprehensive benchmarks would allow researchers to make more informed choices about which tool suits their research best.
Despite differences in performance across tools, the abundance of data allows for the development of hypotheses on disease mechanisms through the accumulation of functional evidence across multiple tracks from non-coding GWAS hits. Currently, there is no one-size-fits-all tool for identifying causal genes and mechanisms from disease-associated non-coding variants. We therefore strongly encourage the continued development of tools that prioritize causal genes and variants, capitalizing on novel data and insights. These tools should be easy to use for researchers without a (bio)informatics background and provide interpretable prioritization metrics.
In the end, the tools and suggested workflow highlighted in this review are merely instruments for the development of novel insights and hypotheses into the mechanisms driving a specific disease. Although their results can be used as supporting evidence, in vitro/in vivo experimental validation remains necessary to truly establish a causal relationship between non-coding variants and disease.