Proficiency of data interpretation: identification of signaling SNPs/specific loci for coronary artery disease

Abstract Coronary artery disease (CAD) is a complex disorder involving both genetic and non-genetic factors. Genome-wide association studies (GWAS) have identified hundreds of single nucleotides polymorphisms (SNPs) tagging over > 40 CAD risk loci. We hypothesized that some non-coding variants might directly regulate the gene expression rather than tagging a nearby locus. We used RegulomeDB to examine regulatory functions of 58 SNPs identified in two GWAS and those SNPs in linkage disequilibrium (LD) (r2 ≥ 0.80) with the GWAS SNPs. Of the tested 1200 SNPs, 858 returned scores of 1–6 by RegulomeDB. Of these 858 SNPs, 97 were predicted to have regulatory functions with RegulomeDB score of < 3. Notably, only 8 of the 97 predicted regulatory variants were genome-wide significant SNPs (LIPA/rs2246833, RegulomeDB score = 1b; ZC3HC1/rs11556924, CYP17A1-CNNM2-NT5C2/rs12413409, APOE-APOC1/rs2075650 and UBE2Z/rs46522, each with a RegulomeDB score = 1f; ZNF259-APOA5-APOA1/rs964184, SMG6/rs2281727 and COL4A1-COL4A2/rs4773144, each with a RegulomeDB score = 2b). The remainder 89 functional SNPs were in linkage disequilibrium with GWAS SNPs. This study supports the hypothesis that some of the non-coding variants are true signals via regulation of gene expression at transcription level. Our study indicates that RegulomeDB is a useful database to examine the putative functions of large number of genetic variants and it may help to identify a true association among multiple tagged SNPs in a complex disease, such as CAD. Database URLs http://www.regulomedb.org/; https://www.broadinstitute.org/mpg/snap/


Background
Most human DNA sequence is non-coding (98%) and hence only small portion (2%) of human genome encodes proteins (1). Although the pathogenesis of monogenic disorders is largely explained, it has been difficult to determine the underlying mechanisms of complex disorders like coronary artery disease (CAD). Before the development of genome-wide association studies (GWAS), only the APOE*4 allele showed consistent association with the risk of CAD across many populations (2)(3)(4)(5).
The hypothesis-free GWAS approach was designed with the assumption that common DNA variants explain the bulk of the variation in common diseases (6). About 90% of GWAS-implicated variants, exert only minimal to modest effect sizes on disease phenotypes, and they are present in non-coding rather than coding regions (7). Highly sensitive molecular and computational techniques have identified different regulatory elements (DNAse hypersensitive regions, sequences affecting the binding of transcription factors and promoters or enhancers) in intergenic regions (8). Common variants located in one of these regulatory elements may affect gene expression. To predict the role of these variants in gene regulation and to differentiate between physically tagged and functional single nucleotides polymorphism (SNPs), many databases have been created (9). RegulomeDB is one of such databases that describes the role of these variants in transcriptional regulation.
Similar to many other complex diseases, GWAS have identified hundreds of risk variants associated with CAD that need to be analyzed for their functional role in gene expression (10). Recently, we have used SNAP Webportal and Regulome DB to identify potential regulatory function of variants in associated risk loci for Alzheimer's disease (11). In this study, we have applied the same approach to identify the regulatory nature of GWAS-implicated variants with CAD and those that are in linkage disequilibrium (LD) with these variants.

Objective
The objective of our study was to assess the GWASimplicated CAD variants and those variants in LD with GWAS variants for their potential regulatory effects on gene transcription using bioinformatics tools.

SNPs selection
A total of 58 SNPs within 54 CAD loci was selected, including 52 with accepted genome-wide significant threshold (P < 5 Â 10 À8 ) and 6 with suggestive associations (P > 5 Â 10 À8 ) identified in two GWAS (12,13). Detailed information on the selected 58 SNPs is provided in Supplementary Table S1.

Linkage disequilibrium
For the LD assessment of the selected 58 SNPs, we used SNAP web portal (https://www.broadinstitute.org/mpg/ snap/, accessed 13 July 2016) (14) (Supplementary Table  S2). SNAP contains data from the Northern European from Utah (CEU) population derived from the 1000 Genomes Pilot Project 1 and three different releases of the International-Hap Map Project. We used data from both the 1000 Genomes Project and HapMap 3 (release 2) to identify SNPs in strong LD (r 2 ! 0.80) with our SNPs of interest. We did not select an array bound search, and query SNPs were included in the output. We performed the search at three thresholds-r 2 ! 0.80, r 2 ! 0.90 and r 2 ! 1.0-for both SNP datasets and identified a total of 1,200 SNPs in LD with the 58 published GWAS SNPs, including the GWAS SNPs themselves. As shown in Table 1, the number of proxy SNPs decreased with the increased level of r 2 .

Functional assessment of CAD-associated SNPs
We used RegulomeDB to identify potentially functional SNPs among the 1200 SNPs of interest. Regulome DB is a database that scores SNPs functionality based upon experimental data, such as its existence in a DNAase hypersensitive site or transcription factor binding site. These regions have been characterized biochemically, and data are drawn from published literature, Gene Expression Omnibus and ENCODE project that include a total of 962 experimental datasets, covering over 100 tissues and cell lines and representing nearly 60 million annotations. The output data can be mapped to Human genome version 19. It is a user friendly and freely accessible database (http://www.regulo medb.org/accessed 17 July 2016) (15). The functional Grades 1-6 of RegulomeDB are given in Table 2. SNPs showing the strongest evidence of being regulatory (affecting the binding of transcription factor) are given a score of 1 and SNPs demonstrating the least evidence of being functional are given a score of 6.
Three variants, FES/rs1894401, LIPA/rs2246833 and VAMP8/rs1009, were strongly predicted to be functional with score of 1b. FES/rs1894401 is an intronic SNP that is an eQTL for FES in thyroid and transformed lymphoblasts, is present in the binding motif of Pax5, and affects the binding of eleven transcription factors. LIPA/ rs2246833 (RegulomeDB score ¼ 1b), located in Intron 6 of LIPA, in the DNA motif of EWSRCFLI1, is a GWAS reported SNP along with 4 other functional SNPs (of 12 tested) in this region and it and it affects the binding of CTCF. It is an eQTL in the whole blood. VAMP5-VAMP8-GGCX/rs1009 is in exon 3 of VAMP8 and affects the binding of CTCF and HSF1. rs1009 of VAMP8 is an eQTL in lymhoblasts, skeletal muscles, adipose tissue and thyroid. Of 42 SNPs analyzed in this locus, we found 8 other SNPs with RegulomeDB score < 3 (Table 3) Of 97 SNPs with RegulomeDB score < 3, 25 were in the CYP17A1-CNNM2-NT5C2 region, and one of them was a GWAS reported SNP (rs12413409). The regional LD plot of this SNP is given in Supplementary Figure S1. rs9633712 (RegulomeDB score ¼ 1e) is located in Intron 3 of NT5C2 and is an eQTL for USMG5 in monocytes. This SNP was also found in the motifs of the following transcription factors: PU1, ELF1, Sfpil, PU.1 and c-Ets-1.

GWAS SNPs
Functional proxy SNPs Regulome DB score It appears to affect the binding of SPI1. Twenty SNPs returned a score of 1f (likely to affect the binding), and 18 of them were in intronic regions. NT5C2/rs11191558 lies in HOXC series of DNA motifs, and CNNM2/rs3781285 lies between NF-kappaB and P50:50. NT5C2/rs2297787 returned a score of 2a, affecting the binding motifs of FOXI1, HNF3-alpha and FOXP1 and the binding of FOXA1. SNP rs12412038 is located in Intron 10 of NT5C2 and is in the binding motif of Irx. The remaining two SNPs, rs12219901 and rs12221064, lie in the  CNNM2-NT5C2 intergenic region and upstream of CNNM2, respectively. They are located in DNA motifs of SRF and MAZR and affect the binding of POLRA2 and CTCF/ETS. Interestingly, rs943037 resides in exon 7 of CNNM2. Nineteen of the 25 SNPs in the region of CYP17A1-CNNM2-NT5C2 are eQTLs for USMG5 (Table 4). One SNP rs2075650 lies in Intron 2 of ApoEApoC1/ TOMM40 with a RegulomeDB score of 1f. It is located in RREB1 DNA motif and is an eQTL for TOMM40 (Table 4).
In total 3 of 107 SMG6 associated SNPs, rs2281727, rs7217687 and rs9908888 had a score of 2b and they affect the binding of EP300. rs2281727 is a genome-wide significant SNP located in Intron 9 of SMG6. It is in binding motifs of SRY, Srf and Zfp105 and affects the binding of CREBBP, EP300, STAT3, TRIM28, MYC and RBBP5 ( Table 4).
The UBE2Z region had 10 functional SNPs, including a GWAS reported SNP, UBE2Z/rs46522 (RegulomeDB score of 1f). The SNP with the most evidence of regulatory function in this locus is rs4794004 with a score of 1d. It is in DNA motif of Gata5 that alters the expression of UBE2Z and ATP5G1and affects the binding of NR3C1, IN3AK20, CREB1, TAF12, CTCF, POLR2A, USF1, FOXA1, FOXA2 and RBBP5. The other 5 SNPs in this region have a score of 1f. The remaining two regulatory SNPs, rs3744608 and rs12601672, have scores of 2a and 2b, respectively. rs3744608 is located in Intron 3 of UBE2Z and it affects the binding of large number of transcription factors (Table 4).
COL4A1-COL4A2/rs473144 is a GWAS reported SNP, achieving a RegulomeDB score of 2b. This SNP lies in Intron 3 of COL4A2 between STST3:STAT3 DNA motif and affects the binding of POLR2A and EZH2 (Table 4). ZNF259-APOA5-APOA1/rs964184 is a GWAS significant SNP with a score of 1f and is an eQTL for TAGLN. This SNP is located downstream of this gene region and is present in FOXJ2 DNA motif. Another GWAS significant SNP, ZC3HC1/rs11556924 is an exonic variant and the only functional SNP (score ¼ 1f) in this locus; it is also an eQTL for ZC3HC1 (Table 4).
REST-NOA1/rs17087335 is in LD with two functional SNPs (rs2227901 and rs7687767 with RegulomeDB scores of 1f and 1d, respectively). rs768776 lies in DNA motif of Sox8 and affects the binding of FOXA1. SWAP70 has two functional SNPs, rs93138 and rs360136, each with a RegulomeDB score of 1f. SWAP70/rs93138 is an eQTL as evidenced in monocytes.
SMAD3 has three functional SNPs, SMAD3/rs17293632 and SMAD3/rs1866316 and MTERF1/rs8032739 with RegulomeDB scores of 2a, 2a and 2b, respectively. Both are in LD with a lead GWAS SNP (SMAD3/rs56062135). CDKN2BAS1/rs1333049 has one functional SNP(rs4977574) only with RegulomeDB score of 3c. It is a part of a gene cluster on chromosome 9p21 and it maps to Intron 16 of cyclin dependent kinase, an important regulator of cell cycle.

Discussion
Following the sequencing of human genome, a large number of SNPs have been identified that affect disease phenotypes, but their exact roles remain unclear (16). One possible explanation is that some variation affects disease expression at the transcriptional level other than at the protein level. For example, a base pair change in a transcription factor binding site may affect the binding affinity of transcription factors that consequently may alter the transcription of the related genes. These effects are indirect and may seem subtle, but their interactions with other genetic or environmental factors may result in the pathogenesis of common diseases.
Like other complex disorders, a large number of CAD associated risk variants have been discovered by multiple GWAS (12,13,17). ENCODE provides information regarding the functionality of human genome (18). This data requires careful interpretation and helps to define the biological function of previously termed 'junk DNA'. Using bioinformatics tools, we may generate new hypotheses about the gene regulation of complex disorders. In this study, we have used two bioinformatics tools, SNAP and RegulomeDB, in order to identify the putative roles of CAD-associated SNPs.
We examined a total 1,200 SNPs in 54 loci implicated by GWAS, including 58 genome-wide significant SNPs. Ninety-seven SNPs were predicted to have regulatory functions with a RegulomeDB score of <3, but only 8 of them were genome-wide significant. Interestingly, all 8 genomewide significant SNPs with suggested regulatory function are located either in intronic or intergenic regions, suggesting that these are true associations that regulate gene expression at the transcriptional level.
Among these eight GWAS reported functional SNPs, the SNP with the top RegulomeDB score was LIPA/ rs2246833 (Regulome DB score ¼ 1b). This variant is located in Intron 6 of lipase A (LIPA) and is an eQTL for the same gene which catalyzes intracellular triglyceride and hydrolyses cholesterol ester (19).
ZC3HC1/rs11556924 is a GWAS significant CAD associated SNP that returned a score of 1f. rs11556924 is a coding SNP located in the ZC3HC1 gene region encoding NIPA (Nuclear Interaction Partner of ALK) protein. This polymorphism is responsible for arginine-histidine amino acid alteration at position 363 (R363H). The SNP has been        associated with essential hypertension in Finnish population (20,21). The CYP17A1-CNNM2-NT5C2 gene region has the highest number of regulatory SNPs, including one GWAS significant SNP, rs12413409. This locus affects diastolic blood pressure, systolic blood pressure and body mass index. All three measures are important risk factors for CAD (22). There are 25 putative regulatory SNPs in LD with rs12413409 that are located across four genes on chromosome 10 (CNNM2, NT5C2, AS3MT and BORCS7/ASMT) but affect the expression of same protein USMG5. These findings suggest that USMG5 should be investigated as an important player for CAD pathogenesis. USMG5 (upregulated during skeletal muscle growth protein 5) is also known as diabetes-associated protein in insulin-sensitive tissues that plays a crucial role in the maintenance of ATP synthase structure in mitochondria (23). Chen et al. (24) have purified this protein from bovine heart mitochondria and suggested its role in cell energy metabolism.
APOE-APOC1/TOMM40/rs2075650 is present in the TOMM40 gene region near the APOE-C1 cluster. TOMM40 encodes TOMM40 protein, which is an important subunit 40 of outer mitochondrial membrane protein complex. rs2075650 risk allele has shown an association with low levels of CRP in CAD patients (25).
rs46522, an intronic SNP in Ubiquitin-conjugating enzyme E2Z (UBE2Z) gene region returns a RegulomeDB score of 1f. This SNP is associated with CAD in Iranian and Han Chinese populations (26,27). The exact mechanism by which genetic alteration in UBE2Z can attribute to the CAD risk is not yet clear; however, rs46522 is in strong LD with the causal SNPs in gastric inhibitory peptide (GIP) gene that encodes GIP protein, a protein that modifies the glucose and lipid metabolism potentially mediating known CAD risk factors.
ZNF259-APOA5APOA1/rs964184 is also an important regulatory SNP. ZNF259 protein polymorphism has been associated with metabolic syndrome in Chinese population. Aung et al. (28) have also shown its association with lipid levels. ZNF259 is located close to APOA5. Overexpression of APOA5 in mice reduces plasma triglyceride levels and mice lacking APOA5 have hypertriglyceridemia (29).
COL4A2/rs4773144 has been identified as functional lead SNP by RegulomeDB (score ¼ 2b). This gene controls collagen proliferation, indicating a potential functional role in atherosclerotic plaque strengthening (30).
SMG6/rs2281727 is an intronic SNP. The potential function of SMG6 in CAD is not yet established. This gene promotes the endonuclease activity and is responsible for protection of telomere ends of chromosomes (16).
Although regulatory elements are most often found in non-coding regions of the genome, we found 5 loci with exonic regulatory SNPs (VAMP8/rs1009, CNNM2/ rs943037, GIP/rs2291725, KIAA1462/rs3739998 and UBE2Z/rs15563), indicating the presence of regulatory signals inside the coding sequences as well.
RegulomeDB identified three important functional SNPs affecting CAD phenotype. Among these, REST-NOA1/rs17087335 is the lead GWAS SNP that encodes a transcription factor which suppresses the voltage gated sodium and potassium channels and it has shown to maintain vascular smooth muscle cells in non-proliferative phase (32). SWAP70/rs10840293 encodes a signaling molecule that is implicated in cell adhesion and migration and it appears to be a potential regulator of leukocyte migration and their adhesion to endothelial cells (33). SMAD3 is a major regulator of TGF-ß. A study on mice has shown that mutations in this gene lead to decreased connective tissue deposition in response to vascular injury (34).
It should be noted that 342 SNPs had returned 'No Data' when queried by RegulomeDB. This suggests that current evidence does not support a functional role for those variants. Our results also showed that some loci harbor markedly more regulatory SNPs as compared with other regions. We caution against interpreting this finding to meant that one region is more functionally relevant, as regions with 'fewer' functional SNPs may have yet to be interrogated as thoroughly and thus have fewer annotations.
Since these loci are mostly in Europeans, and only 5 of them are replicated in South Asians (35), the findings may not be as relevant to other populations as they are to Europeans as genetic effects can differ across populations. The cause of this varying association with disease phenotype may be the ethnic admixture resulting in population Figure 1. 58 GWAS ANPs in LD with 1200 SNPs. We used SNAP webportal to determine LD SNPs. These 1200 SNPs were further evaluated by RegulomeDB to identify their functional role. RegulomeDB did not provide data for 342 SNPs. A total of 858 SNPs returned the scores of 1-6 by RegulomeDB. Of those 858 SNPs, 97 returned the scores of < 3. Among 97 functional SNPs, only 8 were GWAS SNPs. Lower the RegulomeDB score, more evidence of functionality.
stratification. It is also noteworthy that robust associations of variants with different diseases have been reported in Europeans while other populations (Africans, Asians and Hispanics) failed to demonstrate those associations (36,37).
Though the cellular mechanisms underlying CAD pathogenesis are established, the molecular basis is not yet agreed upon. Comprehending the molecular basis of disease is crucial before pathogenesis is completely described. The study has identified 97 regulatory SNPs associated with CAD. In summary, our results highlight the importance of considering both disease-associated SNPs and those SNPs in LD, as well as the regulatory function of these SNPs to help identify the causal genetic mechanisms of CAD. The methods which we have implemented here can inform planning of more complete and better directed functional genomic studies.