Abstract

The identification of multiple signals at individual loci could explain additional phenotypic variance (‘missing heritability’) of common traits, and help identify causal genes. We examined gene expression levels as a model trait because of the large number of strong genetic effects acting in cis. Using expression profiles from 613 individuals, we performed genome-wide single nucleotide polymorphism (SNP) analyses to identify cis-expression quantitative trait loci (eQTLs), and conditional analysis to identify second signals. We examined patterns of association when accounting for multiple SNPs at a locus and when including additional SNPs from the 1000 Genomes Project. We identified 1298 cis-eQTLs at an approximate false discovery rate 0.01, of which 118 (9%) showed evidence of a second independent signal. For this subset of 118 traits, accounting for two signals resulted in an average 31% increase in phenotypic variance explained (Wilcoxon P< 0.0001). The association of SNPs with cis gene expression could increase, stay similar or decrease in significance when accounting for linkage disequilibrium with second signals at the same locus. Pairs of SNPs increasing in significance tended to have gene expression increasing alleles on opposite haplotypes, whereas pairs of SNPs decreasing in significance tended to have gene expression increasing alleles on the same haplotypes. Adding data from the 1000 Genomes Project showed that apparently independent signals could be potentially explained by a single association signal. Our results show that accounting for multiple variants at a locus will increase the variance explained in a substantial fraction of loci, but that allelic heterogeneity will be difficult to define without resequencing loci and functional work.

INTRODUCTION

Genome-wide association studies (GWAS) have identified many novel associations between common variants and traits. However, the amount of phenotypic variance explained by these variants remains smaller than that predicted from heritability estimates. The presence of multiple association signals at individual loci, possibly as a result of allelic heterogeneity, may explain additional phenotypic variation of common traits, and therefore account for some of the ‘missing heritability’. Allelic heterogeneity is defined as the presence of multiple alleles that act through one gene to influence a trait. Until recently, few GWAS had performed conditional and multivariable analyses to test the possibility that multiple independent common variants at a locus were associated with a trait. Exceptions include recent studies of height (1), Parkinson's disease (2) and fetal haemoglobin levels (3) and two studies of cis expression loci (4,5).

Allelic heterogeneity can be difficult to define for two main reasons. First, variants at the same locus tend to be correlated to varying degrees due to linkage disequilibrium (LD). At any one locus, this correlation often results in the association of many single nucleotide polymorphisms (SNPs) with the trait of interest. Usually, only the SNP with the strongest evidence of association is included to represent a new finding, and other SNPs are not considered independently associated if they are within a certain distance or correlated with the lead SNP above a certain r2 threshold. Secondly, even if two SNPs in the same region are identified as independently associated with a trait, it is difficult to prove that they act on the same gene.

As well as potentially explaining additional phenotypic variation, more detailed analysis of loci identified by GWAS is important for a second reason—the identification of additional alleles at a locus may help identify the causal gene in the region. An example is the association between common SNPs at the IFIH1 region and type 1 diabetes. Resequencing of genes in the region identified several low-frequency IFIH1 coding variants independently associated with type 1 diabetes (6), strongly suggesting that the common SNP acts through IFIH1 rather than another gene in that region.

To help understand the extent to which multiple signals at the same loci, possibly as a result of allelic heterogeneity, could contribute to common traits we used cis gene expression levels. Several GWAS have identified many expression quantitative trait loci (eQTLs) (4,7–11). There are several advantages to using gene expression levels to test allelic heterogeneity. First, eQTLs tend to have relatively strong phenotypic effects—especially cis-eQTLs (a variant that influences expression levels of a closely linked gene). Secondly, a large proportion of genes have cis-eQTLs. Thirdly, if analyses are limited to cis-eQTL associations, then the problem of identifying the causal gene at a locus is largely eliminated—SNPs associated with expression levels of a nearby gene are most likely to be acting directly on that gene. In contrast, some features of the genetics of cis gene expression levels may not be representative of common traits—most notably cis-effects could predominate the polygenic component.

We performed a cis-eQTL analysis using 613 individuals from the InCHIANTI study from whom fresh whole blood mRNA was available as well as genome-wide SNP data. We tested the hypothesis that multiple signals at known loci would explain more phenotypic variation and attempted to identify patterns of association consistent with allelic heterogeneity using cis gene expression phenotypes.

RESULTS

Identification of cis-SNPs in 1298 loci

We defined a cis-SNP as a SNP 1 Mb ± of a probe's start site. Using this definition, and based on HapMap-imputed genotyped data, we identified 1298 probes with at least one cis-SNP at P< 1 × 10−6 [approximate false discovery rate (FDR)<1.0%]. We termed the most strongly associated SNP at each cis-signal the ‘Index HapMap SNP’.

Conditional analysis identifies 118 (9%) cis-eQTLs with evidence of a second signal

We repeated the association analysis for each of the relevant 1298 probes, but conditioned each on the Index HapMap SNP. We found evidence of a second signal for 118 (∼9%) probes (113 separate loci), based on P< 1 × 10−6 (approximate FDR ∼1%). We termed this SNP the ‘Second HapMap SNP’.

Multivariable analyses explain additional phenotypic variance in loci with evidence of second signals

For each of the 118 probes with evidence of a second signal, we next tested the hypothesis that including SNPs representing both signals would explain more of the variance in the relevant gene expression trait. For each probe, we calculated the variance in cis gene expression explained by both SNPs and compared this figure to the variance explained by the Index HapMap SNP alone. We used a model that included both the Index HapMap SNP and the Second HapMap SNP as independent variables and the relevant cis gene expression levels as the dependent variable. This multivariable model provides an estimate of the effect of each SNP when taking into account any correlation (due to LD or interaction) with the other. For all 118 loci, including both SNPs increased the phenotypic variance explained compared with the single Index SNP (Supplementary Material, Table S1). The average phenotypic variance explained by the Index SNP alone was 17.5% (range: 3.8–63.9%) and this figure rose to 22.9% (range: 8.3–66.4%) when accounting for both SNPs and the correlation between them, an average increase of 31% (Wilcoxon P< 0.0001).

Testing two SNPs in the same statistical model increases, decreases or changes very little their strength of association compared with analyses of single SNPs

For the 118 probes with evidence of a second signal, we observed that the strength of association of the Second HapMap SNP at each cis-eQTL locus could increase, decrease or stay very similar when accounting for the LD with the Index HapMap SNP. Examples of the Second HapMap SNPs that increased the most (‘jumpers’), reduced the most (‘fallers’) and changed the least (‘stickers’) after accounting for the correlation with the Index HapMap SNP are shown in Figure 1 and Table 1. The differences in association statistics between the one-SNP (univariable) and two-SNP (multivariable) analyses were highly correlated to the extent of LD between the two SNPs—association statistics changed least when LD between SNPs was weakest (Fig. 2, and Supplementary Material, Table S2 and Fig. S1).

Table 1.

Examples of single (univariable) and two-SNP (multivariable) association statistics in loci with evidence of two signals

GeneIndex HapMap SNPSecond HapMap SNPIndex HapMap P UnivariableIndex HapMap P MultivariableSecond HapMap P UnivariableSecond HapMap P MultivariableIndex SNP—Second SNP r2
Jumpers
FN3KRPrs2243523rs118692494.56E−352.32E−554.96E−051.16E−250.14
BTNL3rs4700774rs44449301.85E−865.92E−1034.63E−055.29E−220.03
LPCAT2rs883180rs22870721.91E−1022.83E−1160.891.05E−150.11
DHRS1rs10134537rs45688.75E−241.62E−360.541.81E−140.26
PRMT2rs2070435rs119107073.26E−289.07E−392.89E−085.59E−190.06
STAT6rs324015rs20355451.05E−313.26E−428.76E−112.03E−210.03
KRT72rs626758rs6818121.27E−322.75E−420.085.66E−120.11
HOXB2rs1042815rs15537485.61E−721.38E−810.321.43E−110.05
IRF5rs10229001rs174249211.38E−1377.30E−1470.983.89E−110.05
RNASE2rs11156734rs49823864.77E−232.71E−329.04E−042.75E−130.08
Stickers
GNLYrs12151742rs18661381.56E−201.66E−203.97E−094.05E−090.00
SUPT3Hrs9395049rs77734443.09E−082.51E−081.01E−068.15E−070.00
CA2rs2930553rs25482814.32E−092.79E−092.43E−071.56E−070.00
KLHL5rs10021255rs20600053.80E−112.20E−112.11E−071.20E−070.00
EHD4rs17686769rs80349442.41E−095.74E−097.88E−081.88E−070.00
C17ORF60rs6504230rs169479563.78E−179.04E−172.97E−086.97E−080.00
NSFL1Crs6105165rs60793259.75E−252.77E−245.35E−091.46E−080.01
PLA2G7rs9472830rs100811693.74E−101.10E−093.91E−091.15E−080.00
RSPH3rs12207795rs94575327.95E−153.67E−141.83E−088.46E−080.00
CPVLrs1052200rs127009391.71E−153.66E−165.99E−121.26E−120.00
Fallers
C17ORF97rs6565724rs80775382.59E−831.82E−571.86E−381.65E−120.21
TIMM10rs2729371rs29558499.26E−805.35E−574.90E−373.36E−140.16
ZP3rs17718122rs115056884.31E−1274.72E−1061.22E−291.65E−080.15
NAAArs11732759rs119346381.19E−485.66E−343.91E−401.90E−250.10
DDTrs5751777rs117038815.48E−492.15E−384.31E−381.69E−270.04
SLCO3A1rs2270059rs64968982.49E−216.87E−111.96E−195.67E−090.15
NAAArs2242471rs119346381.51E−423.13E−325.12E−331.08E−220.08
PLA2G4Crs274883rs23072795.55E−193.79E−096.35E−194.33E−090.20
VSTM1rs10500316rs6125291.85E−256.68E−162.12E−237.82E−140.10
NQO2rs2518581rs94051881.20E−321.08E−232.43E−222.36E−130.07
GeneIndex HapMap SNPSecond HapMap SNPIndex HapMap P UnivariableIndex HapMap P MultivariableSecond HapMap P UnivariableSecond HapMap P MultivariableIndex SNP—Second SNP r2
Jumpers
FN3KRPrs2243523rs118692494.56E−352.32E−554.96E−051.16E−250.14
BTNL3rs4700774rs44449301.85E−865.92E−1034.63E−055.29E−220.03
LPCAT2rs883180rs22870721.91E−1022.83E−1160.891.05E−150.11
DHRS1rs10134537rs45688.75E−241.62E−360.541.81E−140.26
PRMT2rs2070435rs119107073.26E−289.07E−392.89E−085.59E−190.06
STAT6rs324015rs20355451.05E−313.26E−428.76E−112.03E−210.03
KRT72rs626758rs6818121.27E−322.75E−420.085.66E−120.11
HOXB2rs1042815rs15537485.61E−721.38E−810.321.43E−110.05
IRF5rs10229001rs174249211.38E−1377.30E−1470.983.89E−110.05
RNASE2rs11156734rs49823864.77E−232.71E−329.04E−042.75E−130.08
Stickers
GNLYrs12151742rs18661381.56E−201.66E−203.97E−094.05E−090.00
SUPT3Hrs9395049rs77734443.09E−082.51E−081.01E−068.15E−070.00
CA2rs2930553rs25482814.32E−092.79E−092.43E−071.56E−070.00
KLHL5rs10021255rs20600053.80E−112.20E−112.11E−071.20E−070.00
EHD4rs17686769rs80349442.41E−095.74E−097.88E−081.88E−070.00
C17ORF60rs6504230rs169479563.78E−179.04E−172.97E−086.97E−080.00
NSFL1Crs6105165rs60793259.75E−252.77E−245.35E−091.46E−080.01
PLA2G7rs9472830rs100811693.74E−101.10E−093.91E−091.15E−080.00
RSPH3rs12207795rs94575327.95E−153.67E−141.83E−088.46E−080.00
CPVLrs1052200rs127009391.71E−153.66E−165.99E−121.26E−120.00
Fallers
C17ORF97rs6565724rs80775382.59E−831.82E−571.86E−381.65E−120.21
TIMM10rs2729371rs29558499.26E−805.35E−574.90E−373.36E−140.16
ZP3rs17718122rs115056884.31E−1274.72E−1061.22E−291.65E−080.15
NAAArs11732759rs119346381.19E−485.66E−343.91E−401.90E−250.10
DDTrs5751777rs117038815.48E−492.15E−384.31E−381.69E−270.04
SLCO3A1rs2270059rs64968982.49E−216.87E−111.96E−195.67E−090.15
NAAArs2242471rs119346381.51E−423.13E−325.12E−331.08E−220.08
PLA2G4Crs274883rs23072795.55E−193.79E−096.35E−194.33E−090.20
VSTM1rs10500316rs6125291.85E−256.68E−162.12E−237.82E−140.10
NQO2rs2518581rs94051881.20E−321.08E−232.43E−222.36E−130.07

‘Jumpers’ refers to loci where pairs of SNPs exhibit a relatively large increase in significance in multivariable compared with univariable analysis, ‘Fallers’ where pairs of SNPs decrease in significance by a comparatively large amount, and ‘Stickers’ where the significance of pairs of SNPs remain similar. Details from all 118 probe-cis gene associations, including effect sizes, are given in Supplementary Material, Table S2.

Table 1.

Examples of single (univariable) and two-SNP (multivariable) association statistics in loci with evidence of two signals

GeneIndex HapMap SNPSecond HapMap SNPIndex HapMap P UnivariableIndex HapMap P MultivariableSecond HapMap P UnivariableSecond HapMap P MultivariableIndex SNP—Second SNP r2
Jumpers
FN3KRPrs2243523rs118692494.56E−352.32E−554.96E−051.16E−250.14
BTNL3rs4700774rs44449301.85E−865.92E−1034.63E−055.29E−220.03
LPCAT2rs883180rs22870721.91E−1022.83E−1160.891.05E−150.11
DHRS1rs10134537rs45688.75E−241.62E−360.541.81E−140.26
PRMT2rs2070435rs119107073.26E−289.07E−392.89E−085.59E−190.06
STAT6rs324015rs20355451.05E−313.26E−428.76E−112.03E−210.03
KRT72rs626758rs6818121.27E−322.75E−420.085.66E−120.11
HOXB2rs1042815rs15537485.61E−721.38E−810.321.43E−110.05
IRF5rs10229001rs174249211.38E−1377.30E−1470.983.89E−110.05
RNASE2rs11156734rs49823864.77E−232.71E−329.04E−042.75E−130.08
Stickers
GNLYrs12151742rs18661381.56E−201.66E−203.97E−094.05E−090.00
SUPT3Hrs9395049rs77734443.09E−082.51E−081.01E−068.15E−070.00
CA2rs2930553rs25482814.32E−092.79E−092.43E−071.56E−070.00
KLHL5rs10021255rs20600053.80E−112.20E−112.11E−071.20E−070.00
EHD4rs17686769rs80349442.41E−095.74E−097.88E−081.88E−070.00
C17ORF60rs6504230rs169479563.78E−179.04E−172.97E−086.97E−080.00
NSFL1Crs6105165rs60793259.75E−252.77E−245.35E−091.46E−080.01
PLA2G7rs9472830rs100811693.74E−101.10E−093.91E−091.15E−080.00
RSPH3rs12207795rs94575327.95E−153.67E−141.83E−088.46E−080.00
CPVLrs1052200rs127009391.71E−153.66E−165.99E−121.26E−120.00
Fallers
C17ORF97rs6565724rs80775382.59E−831.82E−571.86E−381.65E−120.21
TIMM10rs2729371rs29558499.26E−805.35E−574.90E−373.36E−140.16
ZP3rs17718122rs115056884.31E−1274.72E−1061.22E−291.65E−080.15
NAAArs11732759rs119346381.19E−485.66E−343.91E−401.90E−250.10
DDTrs5751777rs117038815.48E−492.15E−384.31E−381.69E−270.04
SLCO3A1rs2270059rs64968982.49E−216.87E−111.96E−195.67E−090.15
NAAArs2242471rs119346381.51E−423.13E−325.12E−331.08E−220.08
PLA2G4Crs274883rs23072795.55E−193.79E−096.35E−194.33E−090.20
VSTM1rs10500316rs6125291.85E−256.68E−162.12E−237.82E−140.10
NQO2rs2518581rs94051881.20E−321.08E−232.43E−222.36E−130.07
GeneIndex HapMap SNPSecond HapMap SNPIndex HapMap P UnivariableIndex HapMap P MultivariableSecond HapMap P UnivariableSecond HapMap P MultivariableIndex SNP—Second SNP r2
Jumpers
FN3KRPrs2243523rs118692494.56E−352.32E−554.96E−051.16E−250.14
BTNL3rs4700774rs44449301.85E−865.92E−1034.63E−055.29E−220.03
LPCAT2rs883180rs22870721.91E−1022.83E−1160.891.05E−150.11
DHRS1rs10134537rs45688.75E−241.62E−360.541.81E−140.26
PRMT2rs2070435rs119107073.26E−289.07E−392.89E−085.59E−190.06
STAT6rs324015rs20355451.05E−313.26E−428.76E−112.03E−210.03
KRT72rs626758rs6818121.27E−322.75E−420.085.66E−120.11
HOXB2rs1042815rs15537485.61E−721.38E−810.321.43E−110.05
IRF5rs10229001rs174249211.38E−1377.30E−1470.983.89E−110.05
RNASE2rs11156734rs49823864.77E−232.71E−329.04E−042.75E−130.08
Stickers
GNLYrs12151742rs18661381.56E−201.66E−203.97E−094.05E−090.00
SUPT3Hrs9395049rs77734443.09E−082.51E−081.01E−068.15E−070.00
CA2rs2930553rs25482814.32E−092.79E−092.43E−071.56E−070.00
KLHL5rs10021255rs20600053.80E−112.20E−112.11E−071.20E−070.00
EHD4rs17686769rs80349442.41E−095.74E−097.88E−081.88E−070.00
C17ORF60rs6504230rs169479563.78E−179.04E−172.97E−086.97E−080.00
NSFL1Crs6105165rs60793259.75E−252.77E−245.35E−091.46E−080.01
PLA2G7rs9472830rs100811693.74E−101.10E−093.91E−091.15E−080.00
RSPH3rs12207795rs94575327.95E−153.67E−141.83E−088.46E−080.00
CPVLrs1052200rs127009391.71E−153.66E−165.99E−121.26E−120.00
Fallers
C17ORF97rs6565724rs80775382.59E−831.82E−571.86E−381.65E−120.21
TIMM10rs2729371rs29558499.26E−805.35E−574.90E−373.36E−140.16
ZP3rs17718122rs115056884.31E−1274.72E−1061.22E−291.65E−080.15
NAAArs11732759rs119346381.19E−485.66E−343.91E−401.90E−250.10
DDTrs5751777rs117038815.48E−492.15E−384.31E−381.69E−270.04
SLCO3A1rs2270059rs64968982.49E−216.87E−111.96E−195.67E−090.15
NAAArs2242471rs119346381.51E−423.13E−325.12E−331.08E−220.08
PLA2G4Crs274883rs23072795.55E−193.79E−096.35E−194.33E−090.20
VSTM1rs10500316rs6125291.85E−256.68E−162.12E−237.82E−140.10
NQO2rs2518581rs94051881.20E−321.08E−232.43E−222.36E−130.07

‘Jumpers’ refers to loci where pairs of SNPs exhibit a relatively large increase in significance in multivariable compared with univariable analysis, ‘Fallers’ where pairs of SNPs decrease in significance by a comparatively large amount, and ‘Stickers’ where the significance of pairs of SNPs remain similar. Details from all 118 probe-cis gene associations, including effect sizes, are given in Supplementary Material, Table S2.

Effects of including two associated cis-eQTL SNPs in multivariable analyses. Plots show position of SNPs on the X-axis and –log10P-values for association with cis gene expression on the Y-axis. The red diamonds represent the individual (univariable) statistics for the Index HapMap SNP and the second HapMap SNP. The two green diamonds represent the associations of the same two SNPs when accounting for the correlation between the two SNPs using a multivariable model. Estimated haplotype effects are shown underneath each plot, where ‘2’ represents an allele associated with increased gene expression, and ‘1’ represents an allele associated with reduced gene expression. Alleles on haplotypes are ordered by chromosomal position. (A and B) Examples of ‘jumpers’, pairs of SNPs that both increase in significance in the multivariable compared with univariable models at the FN3KRP and STAT6 loci, respectively. (C and D) Examples of ‘stickers’, pairs of SNPs that remain very similar in significance in the multivariable compared with univariable models at the GNLY and CPVL loci, respectively. (E and F) Examples of ‘fallers’, pairs of SNPs that both reduce in significance in multivariable compared with univariable models at the C17ORF97 and TIMM10 loci, respectively.
Figure 1.

Effects of including two associated cis-eQTL SNPs in multivariable analyses. Plots show position of SNPs on the X-axis and –log10P-values for association with cis gene expression on the Y-axis. The red diamonds represent the individual (univariable) statistics for the Index HapMap SNP and the second HapMap SNP. The two green diamonds represent the associations of the same two SNPs when accounting for the correlation between the two SNPs using a multivariable model. Estimated haplotype effects are shown underneath each plot, where ‘2’ represents an allele associated with increased gene expression, and ‘1’ represents an allele associated with reduced gene expression. Alleles on haplotypes are ordered by chromosomal position. (A and B) Examples of ‘jumpers’, pairs of SNPs that both increase in significance in the multivariable compared with univariable models at the FN3KRP and STAT6 loci, respectively. (C and D) Examples of ‘stickers’, pairs of SNPs that remain very similar in significance in the multivariable compared with univariable models at the GNLY and CPVL loci, respectively. (E and F) Examples of ‘fallers’, pairs of SNPs that both reduce in significance in multivariable compared with univariable models at the C17ORF97 and TIMM10 loci, respectively.

The correlation between how pairs of SNPs change in significance between univariable (single SNP) and multivariable (two SNP) models and the LD between them.
Figure 2.

The correlation between how pairs of SNPs change in significance between univariable (single SNP) and multivariable (two SNP) models and the LD between them.

To test why some second HapMap SNPs would increase in significance, some decrease and some stay very similar, we performed haplotype analyses. For each of the 118 probes, we calculated associations between haplotypes formed by the two SNPs and cis gene expression levels. Examples of these two-SNP haplotypes are shown in Figure 1, where ‘2’ represents an allele associated with increased gene expression, and ‘1’ represents an allele associated with reduced gene expression. We observed two types of haplotype. First, we observed haplotypes where the two alleles associated with increased gene expression tended to occur on opposite haplotypes. These ‘1–2’ or ‘2–1’ haplotypes were most common in the ‘jumping’ SNPs because a multivariable analysis will adjust for the cancelling out effect of the other SNP. Secondly, we observed haplotypes where the two alleles associated with increased gene expression tended to occur on the same haplotype. These ‘2–2’ haplotypes were most common among the ‘falling’ SNPs because multivariable analysis will adjust for the correlated effect of the other SNP. ‘Sticking’ SNPs tended to have more of a mixture of both forms of haplotype (as expected due to the lower LD between them).

Allelic heterogeneity or one variant explains all?

The identification of a second signal at P< 1 × 10−6 after conditioning on the Index HapMap SNP does not necessarily mean there are two independent signals. It is possible that two apparently independent signals are tagging a third, unknown, variant. To investigate this possibility, we re-analysed the 118 probes with evidence of two signals using a denser set of SNPs—those imputed from the 1000 Genomes Project. We compared cis-eQTL association results based on HapMap-imputed data with those from 1000 Genomes Project imputed data. We termed the most strongly associated SNP from the 1000 Genome project the ‘1000G SNP’. We used this approach to test the proof of principle that previously untyped ‘novel’ variants could potentially explain two apparently independent association signals.

Adding 1000 Genomes imputed data to the 118 probes with evidence of two signals resulted in a range of patterns of association. For most probes, the evidence for two signals remained strong, but we also observed probes where including the most strongly associated 1000 Genomes SNP appeared to appreciably reduce the evidence for two independent signals. Examples of how association statistics change for the probes with strong evidence of two independent signals are shown in Figures 3A–D and 4A and Supplementary Material, Table S3. At each of these probes, the statistical strength of the association between the Second HapMap SNP and cis gene expression remains very similar when including the Index HapMap SNP and/or the 1000G SNP in the same multivariable test. This robust evidence of a second signal is consistent with the weak LD between the second signal and the other two SNPs. In contrast, the most strongly associated 1000G SNP appears to represent the same signal as the Index HapMap SNP—when these two SNPs are in the same multivariable statistical model, the evidence of association for each falls appreciably, due to the strong LD between them.

Figure 3.

Effects of including three associated cis-eQTL SNPs in multivariable analyses. Bars represent association with cis gene expression for three SNPs per locus—from left to right: the Index HapMap SNP; the Second HapMap SNP; and the 1000G SNP. Black bars represent the association of the SNP with cis gene expression without taking into account correlation (due to LD) with any other SNPs (univariable analysis). The remaining bars represent the association of the SNP with cis gene expression while taking into account any correlation with the other two SNPs, both separately in two SNP models and as a single model with all three SNPs (multivariable analyses). (AD) cis-eQTL with strongest evidence of allelic heterogeneity. (EH) cis-eQTL loci where two apparently independent signals could represent a single association signal. Estimated haplotype effects are shown underneath each plot, where ‘2’ represents an allele associated with increased gene expression, and ‘1’ represents an allele associated with reduced gene expression. Alleles on haplotypes are ordered by the Index HapMap SNP, the second HapMap SNP and the 1000G SNP.

Figure 4.

Effects of including three associated cis-eQTL SNPs in multivariable analyses shown as regional plots. Plots show position of SNPs on the X-axis and –log10P-value for association with cis gene expression on the Y-axis. The red diamonds represent the individual (univariable) statistics. The green diamonds represent the associations of the same SNPs when accounting for the correlation between each other using a multivariable model. Arrows emphasize the directional change of significance; (A) examples of two loci where inclusion of the 1000G SNP does not reduce the evidence for two independent signals; (B) examples of two loci where a single 1000G SNP appears to account for two apparently independent HapMap cis-eQTL SNP associations.

Examples of how association statistics change for the probes where two apparently independent signals are more likely to represent a single association signal are shown in Figures 3E–H  and 4B and Supplementary Material, Table S3. For all three SNPs at these probes (two HapMap SNPs and the 1000G SNP), the evidence of association with cis gene expression falls appreciably when the correlation between all three is accounted for in a multivariable statistical model. These probes are notable in that the association statistics of the HapMap SNPs also fall when each is placed into a two-SNP multivariable model with the 1000G SNP. An example is at the STAT6 locus. The two cis HapMap SNPs at the STAT6 locus increase in significance when accounting for the LD between them, but including the 1000G SNP results in a pattern more consistent with a single association signal. Each of the two HapMap SNPs at STAT6 is in moderate LD with the 1000G SNP.

Features of 1000 Genomes SNPs that better explain an association signal compared with HapMap SNPs

We collapsed the total of 1298 probes with a cis-eQTL and the 118 of those probes with evidence of two signals into 1188 and 113 genes, respectively. We examined in more detail the features of the loci where a 1000G SNP was more strongly associated with cis gene expression by greater than three orders of magnitude compared with the Index HapMap SNP. We identified 43/1188 genes that matched this criteria and observed 20/43 genes (47%) that overlapped with the 113 genes with at least two signals where we would expect approximately four under the null (Fisher's exact test P< 0.0001). This suggests stronger 1000 Genomes signals are more likely to occur in regions with evidence of two independent signals. Compared with the Index HapMap SNPs, the 1000 Genomes SNPs at these 43 gene loci were not significantly closer to the cis-gene (longest transcript identified), no different in allele frequency [1000 Genomes SNP minor allele frequency (MAF) median = 0.29; HapMap Index SNP MAF median = 0.33], and no more likely to be conserved across species [based on the Genomic Evolutionary Rate Profiling (GERP) score] (all P> 0.05).

DISCUSSION

We have used cis gene expression levels to test the hypothesis that conditional and multivariable analysis of genotypes at known loci will explain additional phenotypic variance, and by inference, more of the ‘missing heritability’ for common traits. We found evidence of a second signal in 9% of cis-eQTLs. Accounting for these second signals resulted in an average increase of 31% in phenotypic variance explained at this subset of loci. Our results are consistent with other cis-eQTL analyses (4,5) and a recent analysis of height that identified evidence of a second signal in 5–10% of loci (1). Our results add to these previously published studies because we have performed full conditional and multivariable analyses at each locus.

Whether or not a second signal represents a genuine independent allele or is in partial LD with a single causative allele does not affect our conclusion that these types of analysis can explain more of the variance (and therefore more heritability) of a trait. The distinction between two independent signals and one partially tagged signal is more important when trying to use association results to identify causal genes, or when choosing SNPs for functional studies. Therefore, a second aim of our study was to investigate patterns of association consistent with allelic heterogeneity. Our results have implications for these types of analyses in disease and other clinically relevant traits. The identification of multiple alleles associated with a clinically relevant trait could be extremely important in narrowing the search for likely causal genes. Our study suggests full deep sequencing to identify all variants at a locus will be critical to distinguish between genuine allelic heterogeneity and artefacts that appear as evidence of separate association signals. We identified loci whereby seemingly independent signals represented by SNPs in low LD, and that remain statistically significant after conditional analyses, may not be independent. Instead, these loci may be explained by partially tagging untyped variants.

We observed several patterns of association when analysing single SNPs compared with multiple SNPs at the same loci. The frequency with which trait raising alleles segregated on the same or opposite haplotypes affected the degree to which association statistics ‘fell’ or ‘jumped’ in multivariable compared with univariable analyses. These patterns suggest that in the presence of moderate LD between two functional SNPs, the ability to distinguish between both signals will depend on the haplotype structure. In an extreme example, performing conditional or multivariable analysis on two independent functional SNPs in perfect (r2= 1.0) LD will not produce any evidence of a second signal.

We examined in more detail the features of loci where a 1000 Genome imputed SNP was more than three orders of magnitude stronger than the best HapMap-imputed SNP. These loci were much more likely to be among those with evidence of two signals. We suggest this enrichment means that loci with evidence of two apparently independent loci should be examined in much more detail before deciding whether or not one, two or more signals are responsible for the association between locus and trait. The 1000 Genomes imputed SNPs were similar in frequency and conservation score to the best HapMap-imputed SNPs and so did not reveal any evidence for the ‘synthetic association caused by rare variants’ hypothesis put forward by Goldstein and colleagues (12).

There are some limitations to our study. First, we have not definitively proven or disproven cases of genuine allelic heterogeneity. We have only identified example loci where two SNPs are more likely to represent two independent signals compared with other example loci, where the results are more consistent with a single association signal. Secondly, we have largely used imputed data, which results in more error in the estimation of LD compared with directly typed variants. However, results were similar at a subset of loci where direct genotypes were available (data not shown). Thirdly, this analysis has been limited to common variants identified by the HapMap and 1000 Genomes Project. Further studies are needed to assess the variance explained, and patterns of association that occur, when low frequency and rare SNPs are included. Finally, we have not been able to measure heritability as limited family data was available. However, through explaining more phenotypic variation through multiple additive effects (assuming no population stratification), we must have explained more of the heritability.

MATERIALS AND METHODS

Samples

We used individuals from the InCHIANTI study; a study of aging from the Chianti region in Tuscany, Italy (13,14).

Expression profiling

Peripheral blood specimens were collected from 712 individuals using the PAXgene tube technology to preserve levels of mRNA transcripts. RNA was extracted from peripheral blood samples using the PAXgene Blood mRNA kit (Qiagen, Crawley, UK) according to the manufacturer's instructions.

RNA was biotinylated and amplified using the Illumina® TotalPrep(tm) -96 RNA Amplification Kit and directly hybed with HumanHT-12_v3 Expression BeadChips that include 48 803 probes. Image data were collected on an Illumina iScan and analysed using Illumina GenomeStudio software. These experiments were performed as per the manufacturer's instructions and as previously described (11).

Genotyping and imputation

Genome-wide genotyping was performed using the Illumina Infinium HumanHap550 genotyping chip. Standard quality-control procedures were used to filter out SNPs with MAF <1%, Hardy–Weinberg P< 1 × 10−4 and a call rate <99%. This resulted in 495 343 directly genotyped SNPs.

We used MACH 1.0.16 to impute missing genotypes not captured by the Illumina chip. We formed two imputed data sets using the HapMap r22 build-36 reference panel to impute 2 543 887 SNPs and the June 2010 release of the 1000 Genomes Project (15) build-36 reference panel to impute 6 858 242 SNPs. Imputed SNPs were excluded from the analysis if their MAF was <0.01 and if their r2 [a measure of imputation quality in MaCH (16)] was <0.3 or <0.5 depending on whether they were based on the HapMap or 1000 Genomes Project reference panels, respectively, as recommended and consistent with other literature (16).

Quality-control analysis of gene expression levels

The BeadChip microarrays were normalized using a cubic-spline normalization algorithm using Illumina's BeadStudio software in an attempt to minimize environmental factors during microarray processing that may affect levels of expression.

Next, each probe's expression value for a given individual that had a BeadStudio detection P< 0.01 was flagged as not differentially expressed from noise.

Using probe intensity values that were classified as above background noise, subjects were removed if their mean intensities were outside 3 standard deviations, or the proportion of probes expressed was less than 3 standard deviations. This left 698 individuals, of which 613 had genome-wide SNP data available.

Using the 698 individuals with good quality gene expression data, we removed probes that were not differentially expressed above background noise in over 5% of samples. Of the 16 571 probes that passed this threshold, we removed a further 2 682 probes that were reported to have had at least one SNP within the 50 bp probe region, based on dbSNP 129. We subsequently removed a further 68 probes as their start sites could not be identified. This left 13 821 probes with which to perform association analyses.

Cis-eQTL association testing

We transformed the post-quality-control expression levels for each probe by inverse normalization of expression residuals that adjusted for age, sex, amplification batch and hybridization batch.

To identify cis-SNP–probe associations, we initially performed association analysis of the 13 821 probes using directly genotyped data and the program PLINK. We defined a cis-SNP as a SNP 1 Mb ± of a probe's start site. Probe coordinates were mapped to HG18 using ReMOAT; a re-annotation pipeline set up for enhancing the annotation of Illumina BeadArrays (17). For association testing, we used a cis-SNP significance threshold of P< 1 × 10−6. We estimated the false discovery rate using two methods. First, we used the widely accepted criteria for GWAS that the genome contains ∼1 million independent common variants. If the genome is 3000 Mb then a cis locus of 2 Mb of sequence is ∼1/1500 of the genome and each cis locus will contain 1000000/1500 = 667 independent variants. We have tested 13 821 probes so performed 13 821 × 667 = 9.2 million tests. A P-value of ∼5 × 10−9 therefore provides a Bonferroni-based corrected P-value of 0.05 and a P-value = 1 × 10−6 the threshold at which we would expect ∼10 associations by chance. For second signals, we used the same principle but based the number of tests on the number of loci with primary signals ×667. Secondly, we selected all SNPs (best-guess genotypes for imputed SNPs) in the 1298 2 Mb regions with evidence of a cis-eQTL. We estimated the number of independent cis-SNPs in these regions using the indep-pairwise command in PLINK and an r2 cut-off of 0.5. We repeated this calculation for the subset of 118 probes with evidence of two signals. The 2 Mb regions flanking the 1298 and 118 probes contained an average of 280 and 307 ‘independent’ SNPs, respectively. These figures are still estimates because defining independence at r2= 0.5 is arbitrary and so we used the more conservative figure of 667 independent signals per cis locus.

We then followed up probes that showed potential associations with at least 1 cis-SNP P< 1 × 10−6 by repeating the analyses using the imputed HapMap data set using MACH2QTL. We termed the SNPs showing the strongest cis associations in this analysis the ‘Index HapMap SNP’. The Index HapMap SNP was placed into a univariable model for analysis (y = c + β1·SNP1) in STATA.

As a quality-control step, we also compared our cis-eQTL data to that from European HapMap samples. Of 293 cis-eQTLs reported by Stranger et al. (8), we identified 54 at P< 5 × 10−9 (Supplementary Material, Table S4) despite differences in cell type between our data (gene expression from whole blood) and the HapMap sample data (lymphoblastoid cell lines).

Conditional analyses using HapMap-imputed data

We next performed conditional analyses on each probe using the Index HapMap SNP. For each of the 1298 associations with evidence of a cis-eQTL, we performed a conditional analysis, where we included the Index HapMap SNP as a covariate in a reanalysis of the relevant chromosome, using the program MACH2QTL.

Probes with evidence of a second cis-signal associated at P< 1 × 10−6 (we termed the most strongly associated SNP from this analysis the ‘Second HapMap SNP’) were then re-analysed in a multivariable model. We placed both the Index HapMap SNP and the Second HapMap SNP into a multivariable model (y = c + β1·SNP1 + β2·SNP2) using STATA. This allowed us to assess the association between each SNP and cis gene expression, as well as the amount of variance explained by each SNP, when accounting for any correlation between the two. We compared the amount of variance explained to that when using a univariable analysis consisting of only the Index HapMap SNP. We used estimated counts of the reference allele derived from posterior probabilities, or ‘dosages’, at each SNP to represent genotypes in each individual.

Adding 1000 genomes project data

We next selected the 118 loci with two signals associated with cis gene expression at P< 1 × 10−6. We performed association testing in MACH2QTL using an imputed InCHIANTI data set based on the June 2010 release of the 1000 Genomes Project.

Using dosages derived from 1000 Genomes Project-based imputation, we performed further multivariable analyses. We included three SNPs in the statistical model: the Index HapMap SNP, the Second HapMap SNP and the most strongly associated 1000 Genomes SNP (where different from the top HapMap SNP) as independent variables and cis gene expression levels as the dependent variable (y= c+ β1·SNP1+ β2·SNP2+ β3·SNP3). We termed the most strongly associated 1000 Genomes SNP the ‘1000G SNP’. We also performed two further multivariable analyses using just two SNPs—the most strongly associated 1000 Genomes SNP with each of the two HapMap SNPs.

Haplotype analysis

We performed analyses of estimated haplotypes using UNPHASED (18) to examine their association against levels of expression. Prior to haplotype construction, we generated best-guess SNP genotypes based on dosage data and coded expression increasing alleles as ‘2’ and expression decreasing alleles as ‘1’. We examined their frequency, significance and main effects by testing each one while adjusting for all others in the same model.

Directly genotyped SNPs

We attempted to verify data from the analyses described above based on SNP dosages using directly typed SNP genotypes if available. These were put into the same two-HapMap SNP multivariable models described above where possible in order to evaluate the sensitivity of the data.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG online.

Conflict of Interest statement. None declared.

FUNDING

This work was supported by the Wellcome Trust 083270/Z/07/Z. The InCHIANTI study was supported by contract funding from the U.S. National Institute on Aging (NIA), and the research was supported in part by the Intramural Research Program, NIA, and National Institute of Health (NIH). A.R.W. supported by the Peninsula NIHR Clinical Research Facility. Funding to pay the Open Access publication charges for this article was provided by the Wellcome Trust.

REFERENCES

1
Lango Allen
H.
Estrada
K.
Lettre
G.
Berndt
S.I.
Weedon
M.N.
Rivadeneira
F.
Willer
C.J.
Jackson
A.U.
Vedantam
S.
Raychaudhuri
S.
et al.
,
Hundreds of variants clustered in genomic loci and biological pathways affect human height
Nature
,
2010
, vol.
467
(pg.
832
-
838
)
2
Spencer
C.C.
Plagnol
V.
Strange
A.
Gardner
M.
Paisan-Ruiz
C.
Band
G.
Barker
R.A.
Bellenguez
C.
Bhatia
K.
Blackburn
H.
et al.
,
Dissection of the genetics of Parkinson's disease identifies an additional association 5′ of SNCA and multiple associated haplotypes at 17q21
Hum. Mol. Genet.
,
2011
, vol.
20
(pg.
345
-
353
)
3
Galarneau
G.
Palmer
C.D.
Sankaran
V.G.
Orkin
S.H.
Hirschhorn
J.N.
Lettre
G.
,
Fine-mapping at three loci known to affect fetal hemoglobin levels explains additional genetic variation
Nat. Genet.
,
2010
, vol.
42
(pg.
1049
-
1051
)
4
Dimas
A.S.
Deutsch
S.
Stranger
B.E.
Montgomery
S.B.
Borel
C.
Attar-Cohen
H.
Ingle
C.
Beazley
C.
Gutierrez Arcelus
M.
Sekowska
M.
et al.
,
Common regulatory variation impacts gene expression in a cell type-dependent manner
Science
,
2009
, vol.
325
(pg.
1246
-
1250
)
5
Nica
A.C.
Parts
L.
Glass
D.
Nisbet
J.
Barrett
A.
Sekowska
M.
Travers
M.
Potter
S.
Grundberg
E.
Small
K.
et al.
,
The architecture of gene regulatory variation across multiple human tissues: the MuTHER study
PLoS Genet.
,
2011
, vol.
7
pg.
e1002003
6
Nejentsev
S.
Walker
N.
Riches
D.
Egholm
M.
Todd
J.A.
,
Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes
Science
,
2009
, vol.
324
(pg.
387
-
389
)
7
Dixon
A.L.
Liang
L.
Moffatt
M.F.
Chen
W.
Heath
S.
Wong
K.C.
Taylor
J.
Burnett
E.
Gut
I.
Farrall
M.
et al.
,
A genome-wide association study of global gene expression
Nat. Genet.
,
2007
, vol.
39
(pg.
1202
-
1207
)
8
Stranger
B.E.
Nica
A.C.
Forrest
M.S.
Dimas
A.
Bird
C.P.
Beazley
C.
Ingle
C.E.
Dunning
M.
Flicek
P.
Koller
D.
et al.
,
Population genomics of human gene expression
Nat. Genet.
,
2007
, vol.
39
(pg.
1217
-
1224
)
9
Myers
A.J.
Gibbs
J.R.
Webster
J.A.
Rohrer
K.
Zhao
A.
Marlowe
L.
Kaleem
M.
Leung
D.
Bryden
L.
Nath
P.
et al.
,
A survey of genetic human cortical gene expression
Nat. Genet.
,
2007
, vol.
39
(pg.
1494
-
1499
)
10
Ge
B.
Pokholok
D.K.
Kwan
T.
Grundberg
E.
Morcos
L.
Verlaan
D.J.
Le
J.
Koka
V.
Lam
K.C.
Gagne
V.
et al.
,
Global patterns of cis variation in human cells revealed by high-density allelic expression analysis
Nat. Genet.
,
2009
, vol.
41
(pg.
1216
-
1222
)
11
Gibbs
J.R.
van der Brug
M.P.
Hernandez
D.G.
Traynor
B.J.
Nalls
M.A.
Lai
S.L.
Arepalli
S.
Dillman
A.
Rafferty
I.P.
Troncoso
J.
et al.
,
Abundant quantitative trait loci exist for DNA methylation and gene expression in human brain
PLoS Genet.
,
2010
, vol.
6
pg.
e1000952
12
Dickson
S.P.
Wang
K.
Krantz
I.
Hakonarson
H.
Goldstein
D.B.
,
Rare variants create synthetic genome-wide associations
PLoS Biol.
,
2010
, vol.
8
pg.
e1000294
13
Ferrucci
L.
Bandinelli
S.
Benvenuti
E.
Di Iorio
A.
Macchi
C.
Harris
T.B.
Guralnik
J.M.
,
Subsystems contributing to the decline in ability to walk: bridging the gap between epidemiology and geriatric practice in the InCHIANTI study
J. Am. Geriatr. Soc.
,
2000
, vol.
48
(pg.
1618
-
1625
)
14
Melzer
D.
Perry
J.R.
Hernandez
D.
Corsi
A.M.
Stevens
K.
Rafferty
I.
Lauretani
F.
Murray
A.
Gibbs
J.R.
Paolisso
G.
et al.
,
A genome-wide association study identifies protein quantitative trait loci (pQTLs)
PLoS Genet.
,
2008
, vol.
4
pg.
e1000072
15
Durbin
R.M.
Abecasis
G.R.
Altshuler
D.L.
Auton
A.
Brooks
L.D.
Durbin
R.M.
Gibbs
R.A.
Hurles
M.E.
McVean
G.A.
,
A map of human genome variation from population-scale sequencing
Nature
,
2010
, vol.
467
(pg.
1061
-
1073
)
16
Li
Y.
Willer
C.J.
Ding
J.
Scheet
P.
Abecasis
G.R.
,
MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes
Genet. Epidemiol.
,
2010
, vol.
34
(pg.
816
-
834
)
17
Barbosa-Morais
N.L.
Dunning
M.J.
Samarajiwa
S.A.
Darot
J.F.
Ritchie
M.E.
Lynch
A.G.
Tavare
S.
,
A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data
Nucleic Acids Res.
,
2010
, vol.
38
pg.
e17
18
Dudbridge
F.
,
Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data
Hum. Hered.
,
2008
, vol.
66
(pg.
87
-
98
)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data