-
PDF
- Split View
-
Views
-
Cite
Cite
Alan Kwong, Matthew Zawistowski, Lars G Fritsche, Xiaowei Zhan, Jennifer Bragg-Gresham, Kari E Branham, Jayshree Advani, Mohammad Othman, Rinki Ratnapriya, Tanya M Teslovich, Dwight Stambolian, Emily Y Chew, Gonçalo R Abecasis, Anand Swaroop, Whole genome sequencing of 4,787 individuals identifies gene-based rare variants in age-related macular degeneration, Human Molecular Genetics, Volume 33, Issue 4, 15 February 2024, Pages 374–385, https://doi.org/10.1093/hmg/ddad189
- Share Icon Share
Abstract
Genome-wide association studies have contributed extensively to the discovery of disease-associated common variants. However, the genetic contribution to complex traits is still largely difficult to interpret. We report a genome-wide association study of 2394 cases and 2393 controls for age-related macular degeneration (AMD) via whole-genome sequencing, with 46.9 million genetic variants. Our study reveals significant single-variant association signals at four loci and independent gene-based signals in CFH, C2, C3, and NRTN. Using data from the Exome Aggregation Consortium (ExAC) for a gene-based test, we demonstrate an enrichment of predicted rare loss-of-function variants in CFH, CFI, and an as-yet unreported gene in AMD, ORMDL2. Our method of using a large variant list without individual-level genotypes as an external reference provides a flexible and convenient approach to leverage the publicly available variant datasets to augment the search for rare variant associations, which can explain additional disease risk in AMD.
Introduction
Age-related macular degeneration (AMD) is a multifactorial progressive disease characterized by atrophy of the retinal pigment epithelium (RPE), accumulation of different extracellular deposits (called drusen), and progressive degeneration of photoreceptors, especially in the central region of the retina (called macula) [1] but not limited to this area [2]. AMD is a major cause of blindness in the elderly worldwide, with advanced age, lifestyle and environment, and genetic susceptibility constituting the primary risk factors [1, 3]. Large drusen are a significant risk factor for progression towards advanced AMD, exhibiting two distinct late-stage outcomes: geographic atrophy (GA) and macular neovascularization (MNV). GA is characterized by loss of photoreceptors that may begin in the peripheral retina or the macula, whereas MNV involves abnormal blood vessels infiltrating the retina from the underlying tissue layer. More recently, the presence of reticular pseudodrusen (also known as subretinal drusenoid deposits) was identified as an additional risk factor for the development of AMD [4, 5]. Genome-wide association studies (GWAS) of AMD have led to the identification of over 50 associated genetic risk loci [6–13]; however, the most significant signals at the vast majority of the risk loci come from common single-nucleotide polymorphisms (SNPs) in non-coding regions of the genome with no discernable function(s). Therefore, specific molecular and cellular contributions of associated genetic variations to disease progression and AMD pathology are poorly understood.
Linking associated non-coding risk variants to a potential functional mechanism is complex since their location approximates nearby causal variants [14]. Consequently, for most of the identified AMD loci, it is currently unknown whether the gene(s) closest to the lead variant is causal for AMD. Notably, many non-coding variants can affect gene expression, as suggested by eQTL studies in the retina and RPE, thus enabling the identification of candidate target genes at AMD loci [15–17]. Transcriptome-wide association studies (TWAS) have also indicated additional candidate AMD genes [15, 18]. Furthermore, integrated “omics” analyses that incorporate cis-regulatory elements with epigenome marks and genome topology of the human retina have begun to unravel possible mechanism(s) as to how non-coding variants might influence target gene expression in tissue-specific contexts [19–21]. Genetic variants in non-coding regions can also impact DNA methylation and/or splicing, consequently altering the transcriptional landscape [22–25].
Most AMD-GWAS have included primarily common variants [10, 26] and/or focused on targeted explorations of rare variants [27–30]. A large AMD-GWAS [11] analyzed 16 144 advanced AMD cases and 17 832 controls using genotyping arrays with 439 350 markers and subsequent imputation of 11 584 480 variants from 1000 Genomes [31] to fill in the gaps from previous studies. Though this approach comprehensively analyzed common variants and identified 34 AMD-associated loci, examination of rare variants was limited to those present on the array or imputable from reference panels; thus, the analyses, despite being well powered, may have missed many potentially functionally relevant rare variants.
The discovery of associated rare variants, often included because of their predicted impact on proteins, can direct imply causality and already has implicated a major role of the complement system in AMD [29, 30, 32, 33]. AMD-associated rare variants have also been identified in regulators of WNT signaling pathway, and FRZB/SFRP3 and TLE2 genes, among others [11, 34, 35]. More recently, whole exome sequencing of cohorts of AMD families has indicated rare variants that might influence or contribute to the disease phenotype [36]. Ultra-rare and potentially pathogenic coding variants in complement factor C8 have been reported in four families with AMD [37]. A common emerging theme is that molecular mechanisms linking associated variants to disease are complex, and there remains a gulf between genome-wide association results and biological interpretation. A more comprehensive catalog of genetic variations in AMD, including very rare variants observed only in AMD cases, would be useful in elucidating the genetic architecture of AMD.
We therefore set out to identify and characterize AMD-associated rare variants, including those that could not be studied using the standard array-genotyping-and-imputation approach. To accomplish this, we have used whole-genome sequencing (WGS) which allows us to systematically assess both common and rare variants across the genome and the potential to identify loss-of-function (pLoF) variants [11, 38] and to identify sets of candidate functional variants for each of the known common AMD risk genes. By sequencing AMD cases, we increased the probability of finding AMD-associated rare variants that are not present in previous reference panels. To further enhance the power to discover gene-based associations related to very rare variants, we compared our variant list and allele counts with those from the Exome Aggregation Consortium (ExAC) [39], the Genome Aggregation Database (gnomAD) [40], and TOPMed [41]. Our results demonstrate the advantages and limitations of a low-coverage sequencing study of a complex disease and demonstrate the power of integrating several different approaches to discover genetic association signals.
Results
Variant calling
From our initial data set of 4869 samples, we removed samples with over 3% contamination (20 samples), samples of inferred non-European ancestry (32 samples), samples whose sequencing data did not match their array genotype data (11 samples), and samples that did not conform to our disease criteria (19 samples) (Supplementary Fig. S1). Our final data set contained 4787 samples, 2394 AMD cases and 2393 controls. Of the 2394 cases, 583 had Large Drusen, 419 had GA, 1122 had MNV, and 270 had mixed GA + MNV.
Using a PCR-free library prep, we obtained > 130 million reads per sample, targeting 6× coverage. Through WGS, we identified 46 946 619 variants including 43 596 300 SNPs and 3 350 319 indels. About three-quarters of the variants had an allele frequency below 0.5%, with half of all SNPs either singletons or doubletons (17 157 068 singletons, 5 129 082 doubletons), and among indels, 692 506 were singletons and 267 158 doubletons (Supplementary Table S1). We discovered 27 352 890 variants not found in dbSNP build 138, with over 98% having an allele frequency of less than 0.5%. This substantial increase in rare variants compared to the previous AMD GWAS [11] was due to our use of WGS: of our 46.9 million variants, 37 090 168 had an allele frequency of < 1%, whereas only 3 050 013 of the 12 023 830 variants tested in the previous GWAS had an allele frequency below 1%. Our study discovered 31 584 812 autosomal variants not tested in previous GWAS: 13 627 pLoF, 174 611 nonsynonymous, 112 050 synonymous, and 31 284 524 intronic and intergenic variants.
Our final data set contained 301 288 nonsynonymous variants, 165 161 synonymous variants, 19 276 755 intronic variants, and 27 203 395 variants that belonged to other categories (including intergenic, upstream, downstream, and untranslated region variants). Among the nonsynonymous variants, 282 834 were missense, 7043 were nonsense, 6805 were frameshifts, and 4606 were essential splice site variants (Supplementary Table S2). Most of these, 92.2% of stop gains and 81.7% of other pLoF variants, were very rare, i.e. had an allele frequency of less than 0.5% (Supplementary Table S3).
Single-variant association tests
When comparing 2394 AMD cases with 2394 controls in the single variant GWAS, we discovered 1854 genome-wide significant variants (P-value < 5×10−8) exclusively in four known AMD risk loci which showed strong association signals in previous studies (CFH, C2/CFB/SKIV2L, ARMS2/HTRA1, and C3; Table 1 and Fig. 1A). There was no evidence of population stratification (genomic control λGC = 1.021, Fig. 1B and C). Sequential forward selection led to 4 significant independent signals in the CFH locus, 2 in C2/CFB/SKIV2L, 2 in C3, and 1 in ARMS2/HTRA1 (Table 2).
Single-variant association tests for advanced AMD. We compared 2394 cases of AMD—Including large drusen, MNV, GA, and mixed GA/MNV—To 2393 controls using firth bias-corrected logistic regression. Alleles indicate the major and minor alleles for the given variants. P-values were obtained from firth bias-corrected logistic regression. IAMDGC top variant indicates the top variant in the given locus as described11. R2 with IAMDGC indicates the genotype R2 between our top variant and the top variant for the same locus in IAMDGC, as found in our data. Our most significant signal in the C2 locus is only in moderate linkage disequilibrium with the top variant found in IAMDGC: In the previous study, the top variant (rs116503776) was found in a haplotype analysis to tag two previously described CFB missense variants, which appear to be the risk-carrying variants (Supplementary Fig. 4 of Fritsche et al.). In the C3 locus, our most significant signal was rs2230199 (c.304C > G, p.Arg102Gly), the same one found in IAMDGC and previous studies.
Chr . | Top var. pos. . | Alleles . | Locus . | P-value . | Annotation . | IAMDGC top variant . | R2 w/IAMDGC . |
---|---|---|---|---|---|---|---|
1 | 196 684 392 | G/T | CFH | 4.62 × 10−114 | Intron-CFH | 196704632_C/A (rs10922109) | 0.995 |
6 | 31 894 355 | C/T | C2/CFB/SKIV2L | 4.04 × 10−25 | Intron-C2 | 31 930 462_G/A (rs116503776) | 0.659 |
10 | 124 214 600 | G/GGT | ARMS2/HTRA1 | 2.13 × 10−92 | Intron-ARMS2 | 124215565_T/C (rs3750846) | 0.994 |
19 | 6 718 387 | G/C | C3 | 2.55 × 10−20 | Nonsyn-C3 | 6 718 387_G/C (rs2230199) | - |
Chr . | Top var. pos. . | Alleles . | Locus . | P-value . | Annotation . | IAMDGC top variant . | R2 w/IAMDGC . |
---|---|---|---|---|---|---|---|
1 | 196 684 392 | G/T | CFH | 4.62 × 10−114 | Intron-CFH | 196704632_C/A (rs10922109) | 0.995 |
6 | 31 894 355 | C/T | C2/CFB/SKIV2L | 4.04 × 10−25 | Intron-C2 | 31 930 462_G/A (rs116503776) | 0.659 |
10 | 124 214 600 | G/GGT | ARMS2/HTRA1 | 2.13 × 10−92 | Intron-ARMS2 | 124215565_T/C (rs3750846) | 0.994 |
19 | 6 718 387 | G/C | C3 | 2.55 × 10−20 | Nonsyn-C3 | 6 718 387_G/C (rs2230199) | - |
Single-variant association tests for advanced AMD. We compared 2394 cases of AMD—Including large drusen, MNV, GA, and mixed GA/MNV—To 2393 controls using firth bias-corrected logistic regression. Alleles indicate the major and minor alleles for the given variants. P-values were obtained from firth bias-corrected logistic regression. IAMDGC top variant indicates the top variant in the given locus as described11. R2 with IAMDGC indicates the genotype R2 between our top variant and the top variant for the same locus in IAMDGC, as found in our data. Our most significant signal in the C2 locus is only in moderate linkage disequilibrium with the top variant found in IAMDGC: In the previous study, the top variant (rs116503776) was found in a haplotype analysis to tag two previously described CFB missense variants, which appear to be the risk-carrying variants (Supplementary Fig. 4 of Fritsche et al.). In the C3 locus, our most significant signal was rs2230199 (c.304C > G, p.Arg102Gly), the same one found in IAMDGC and previous studies.
Chr . | Top var. pos. . | Alleles . | Locus . | P-value . | Annotation . | IAMDGC top variant . | R2 w/IAMDGC . |
---|---|---|---|---|---|---|---|
1 | 196 684 392 | G/T | CFH | 4.62 × 10−114 | Intron-CFH | 196704632_C/A (rs10922109) | 0.995 |
6 | 31 894 355 | C/T | C2/CFB/SKIV2L | 4.04 × 10−25 | Intron-C2 | 31 930 462_G/A (rs116503776) | 0.659 |
10 | 124 214 600 | G/GGT | ARMS2/HTRA1 | 2.13 × 10−92 | Intron-ARMS2 | 124215565_T/C (rs3750846) | 0.994 |
19 | 6 718 387 | G/C | C3 | 2.55 × 10−20 | Nonsyn-C3 | 6 718 387_G/C (rs2230199) | - |
Chr . | Top var. pos. . | Alleles . | Locus . | P-value . | Annotation . | IAMDGC top variant . | R2 w/IAMDGC . |
---|---|---|---|---|---|---|---|
1 | 196 684 392 | G/T | CFH | 4.62 × 10−114 | Intron-CFH | 196704632_C/A (rs10922109) | 0.995 |
6 | 31 894 355 | C/T | C2/CFB/SKIV2L | 4.04 × 10−25 | Intron-C2 | 31 930 462_G/A (rs116503776) | 0.659 |
10 | 124 214 600 | G/GGT | ARMS2/HTRA1 | 2.13 × 10−92 | Intron-ARMS2 | 124215565_T/C (rs3750846) | 0.994 |
19 | 6 718 387 | G/C | C3 | 2.55 × 10−20 | Nonsyn-C3 | 6 718 387_G/C (rs2230199) | - |
Sequential forward selection single-variant association tests for advanced AMD. Sequential forward selection results for 2394 AMD cases and 2393 controls. Firth bias-corrected single-variant association. For association results in each locus, the top-most variant did not use any variants as covariates; each subsequent variant uses all variants above it in the same locus as covariates. The variants below represent statistically independent signals across the entire genome. Locus names correspond to names given to the equivalent loci in IAMDGC.
Lead variant . | . | . | Major/minor . | . | Allele Frequencies . | Association results . | ||
---|---|---|---|---|---|---|---|---|
dbSNP ID . | Chr . | Position . | Alleles . | Locus name . | Cases . | Controls . | OR . | Cond. P-value . |
rs6688272 | 1 | 196 684 392 | G/T | CFH | 0.201 | 0.418 | 0.36 | 2.90 × 10−113 |
rs10922094 | 1 | 196 661 505 | G/C | 0.388 | 0.619 | 0.6 | 1.03 × 10−20 | |
- | 1 | 196 024 122 | TG/T | 0.00497 | 0.000209 | 32.8 | 2.06 × 10−10 | |
rs79436252 | 1 | 196 358 288 | A/G | 0.0655 | 0.0224 | 1.95 | 5.72 × 10−9 | |
rs556679 | 6 | 31 894 355 | C/T | C2/CFB/SKIV2L | 0.051 | 0.109 | 0.48 | 2.05 × 10−25 |
rs28383438 | 6 | 32 609 038 | C/T | 0.166 | 0.123 | 1.52 | 5.81 × 10−13 | |
rs144224550 | 10 | 124 214 600 | G/GGT | ARMS2/HTRA1 | 0.414 | 0.217 | 2.44 | 5.08 × 10−91 |
rs2230199 | 19 | 6 718 387 | G/C | C3 | 0.281 | 0.2 | 1.55 | 2.58 × 10−20 |
rs181290250 | 19 | 6 722 565 | C/T | 0.0157 | 0.00251 | 6.9 | 3.46 × 10−14 |
Lead variant . | . | . | Major/minor . | . | Allele Frequencies . | Association results . | ||
---|---|---|---|---|---|---|---|---|
dbSNP ID . | Chr . | Position . | Alleles . | Locus name . | Cases . | Controls . | OR . | Cond. P-value . |
rs6688272 | 1 | 196 684 392 | G/T | CFH | 0.201 | 0.418 | 0.36 | 2.90 × 10−113 |
rs10922094 | 1 | 196 661 505 | G/C | 0.388 | 0.619 | 0.6 | 1.03 × 10−20 | |
- | 1 | 196 024 122 | TG/T | 0.00497 | 0.000209 | 32.8 | 2.06 × 10−10 | |
rs79436252 | 1 | 196 358 288 | A/G | 0.0655 | 0.0224 | 1.95 | 5.72 × 10−9 | |
rs556679 | 6 | 31 894 355 | C/T | C2/CFB/SKIV2L | 0.051 | 0.109 | 0.48 | 2.05 × 10−25 |
rs28383438 | 6 | 32 609 038 | C/T | 0.166 | 0.123 | 1.52 | 5.81 × 10−13 | |
rs144224550 | 10 | 124 214 600 | G/GGT | ARMS2/HTRA1 | 0.414 | 0.217 | 2.44 | 5.08 × 10−91 |
rs2230199 | 19 | 6 718 387 | G/C | C3 | 0.281 | 0.2 | 1.55 | 2.58 × 10−20 |
rs181290250 | 19 | 6 722 565 | C/T | 0.0157 | 0.00251 | 6.9 | 3.46 × 10−14 |
Sequential forward selection single-variant association tests for advanced AMD. Sequential forward selection results for 2394 AMD cases and 2393 controls. Firth bias-corrected single-variant association. For association results in each locus, the top-most variant did not use any variants as covariates; each subsequent variant uses all variants above it in the same locus as covariates. The variants below represent statistically independent signals across the entire genome. Locus names correspond to names given to the equivalent loci in IAMDGC.
Lead variant . | . | . | Major/minor . | . | Allele Frequencies . | Association results . | ||
---|---|---|---|---|---|---|---|---|
dbSNP ID . | Chr . | Position . | Alleles . | Locus name . | Cases . | Controls . | OR . | Cond. P-value . |
rs6688272 | 1 | 196 684 392 | G/T | CFH | 0.201 | 0.418 | 0.36 | 2.90 × 10−113 |
rs10922094 | 1 | 196 661 505 | G/C | 0.388 | 0.619 | 0.6 | 1.03 × 10−20 | |
- | 1 | 196 024 122 | TG/T | 0.00497 | 0.000209 | 32.8 | 2.06 × 10−10 | |
rs79436252 | 1 | 196 358 288 | A/G | 0.0655 | 0.0224 | 1.95 | 5.72 × 10−9 | |
rs556679 | 6 | 31 894 355 | C/T | C2/CFB/SKIV2L | 0.051 | 0.109 | 0.48 | 2.05 × 10−25 |
rs28383438 | 6 | 32 609 038 | C/T | 0.166 | 0.123 | 1.52 | 5.81 × 10−13 | |
rs144224550 | 10 | 124 214 600 | G/GGT | ARMS2/HTRA1 | 0.414 | 0.217 | 2.44 | 5.08 × 10−91 |
rs2230199 | 19 | 6 718 387 | G/C | C3 | 0.281 | 0.2 | 1.55 | 2.58 × 10−20 |
rs181290250 | 19 | 6 722 565 | C/T | 0.0157 | 0.00251 | 6.9 | 3.46 × 10−14 |
Lead variant . | . | . | Major/minor . | . | Allele Frequencies . | Association results . | ||
---|---|---|---|---|---|---|---|---|
dbSNP ID . | Chr . | Position . | Alleles . | Locus name . | Cases . | Controls . | OR . | Cond. P-value . |
rs6688272 | 1 | 196 684 392 | G/T | CFH | 0.201 | 0.418 | 0.36 | 2.90 × 10−113 |
rs10922094 | 1 | 196 661 505 | G/C | 0.388 | 0.619 | 0.6 | 1.03 × 10−20 | |
- | 1 | 196 024 122 | TG/T | 0.00497 | 0.000209 | 32.8 | 2.06 × 10−10 | |
rs79436252 | 1 | 196 358 288 | A/G | 0.0655 | 0.0224 | 1.95 | 5.72 × 10−9 | |
rs556679 | 6 | 31 894 355 | C/T | C2/CFB/SKIV2L | 0.051 | 0.109 | 0.48 | 2.05 × 10−25 |
rs28383438 | 6 | 32 609 038 | C/T | 0.166 | 0.123 | 1.52 | 5.81 × 10−13 | |
rs144224550 | 10 | 124 214 600 | G/GGT | ARMS2/HTRA1 | 0.414 | 0.217 | 2.44 | 5.08 × 10−91 |
rs2230199 | 19 | 6 718 387 | G/C | C3 | 0.281 | 0.2 | 1.55 | 2.58 × 10−20 |
rs181290250 | 19 | 6 722 565 | C/T | 0.0157 | 0.00251 | 6.9 | 3.46 × 10−14 |

Manhattan plot of AMD cases vs. controls and QQ plots of single-variant association P-values for variants inside and outside of known loci. (A) Association signals can be found in CFH (chromosome 1), C2/CFB/SKIV2L (chromosome 6), ARMS2/HTRA1 (chromosome 10), and C3 (chromosome 19) using a significance threshold of 5 × 10−8. (B) QQ Plot of single-variant association P-values for variants within known AMD risk loci. All significant signals (P-value < 5 × 10−8) were found within these previously discovered AMD risk loci. The most significant signals were all common variants (AF > 5%). (C) QQ plot of single-variant association P-values for variants outside known AMD risk loci. There was no evidence of population stratification when we considered all variants with AF > 0.1% outside of known loci (λGC = 1.021).
Similar to the previous studies, as mentioned before, the most significant signals were generally common variants without any directly implicated functional consequences on surrounding genes, except in C3 locus where the top variant was the previously known missense variant rs2230199 (19:6718387 G > C, C3 Arg102Gly, case allele frequency = 0.28, control allele frequency = 0.20, odds ratio = 1.55, P-value 2.55 x 10−20). In the CFH locus, the most significant signal was rs6688272 (1:196684392 G > T, case allele frequency = 0.20, control allele frequency = 0.42, odds ratio = 0.36, P-value = 4.62 × 10−114), an intronic variant in high linkage disequilibrium (LD) with the most significant signal in the same locus from Fritsche et al. (rs10922109) in our samples (r2 = 0.995; see Table 1). Similarly, the top variant in the ARMS2/HTRA1 locus, rs144224550 (10:124214600 G > GGT, case allele frequency = 0.41, control allele frequency = 0.22, odds ratio = 2.44, P-value = 2.13 × 10−92) was also in high LD with the top variant for the same locus as described11 (rs3750846, r2 = 0.994).
While prior studies suggested that signals in the C2/CFB/SKIV2L locus could be explained by two coding variants in CFB(11), our dataset, which had a higher resolution, highlighted an independent signal in C2. Our top variant in the C2/CFB/SKIV2L locus, rs556679 (6:31894355 C > T, case allele frequency = 0.051, control allele frequency = 0.11, odds ratio = 0.48, P-value = 4.04 × 10−25), which was previously identified by Zhan et al. as exhibiting strong evidence of association, was only in moderate LD (r2 = 0.659) with the known top variant in Fritsche et al. (rs116503776, 6:31930462 G > A, intronic in SKIV2L). Whereas previous studies suggested that variant associations in C2 could be explained by LD with causal variants in CFB [11, 42, 43], our haplotype analysis showed that rs556679 was associated with AMD independent of the two known coding variants in CFB (Table 3).
Haplotype analysis in CFB. We examined six different haplotypes that included our top variant in the CFB locus (31 894 355 C > T) along with two known nonsynonymous variants in the CFB gene to determine whether our observed variant had an effect on AMD independent of known variants. The odds ratio for haplotype H4, which contained the alternate allele for our top variant but the reference allele for both nonsynonymous variants, had attenuated effect compared to haplotypes with the alternate allele in either of the nonsynonymous variants, but was still significantly different from the null, suggesting that our top variant may be in high linkage disequilibrium with other variants causal for AMD. Odds ratios and P-values were calculated by comparing the case and control haplotype counts for H1 against the case and control haplotype counts for each of the alternate haplotypes (H2-H6) using Fisher’s exact test.
![]() |
![]() |
Haplotype analysis in CFB. We examined six different haplotypes that included our top variant in the CFB locus (31 894 355 C > T) along with two known nonsynonymous variants in the CFB gene to determine whether our observed variant had an effect on AMD independent of known variants. The odds ratio for haplotype H4, which contained the alternate allele for our top variant but the reference allele for both nonsynonymous variants, had attenuated effect compared to haplotypes with the alternate allele in either of the nonsynonymous variants, but was still significantly different from the null, suggesting that our top variant may be in high linkage disequilibrium with other variants causal for AMD. Odds ratios and P-values were calculated by comparing the case and control haplotype counts for H1 against the case and control haplotype counts for each of the alternate haplotypes (H2-H6) using Fisher’s exact test.
![]() |
![]() |
We tested the robustness of our results by repeating our single-variant association tests using the first two principal components of each sample as covariates, analyzing only GA or MNV samples with their corresponding matched samples only. The results largely agreed with our original single-variant analysis and did not change the discovered loci or significant association signals (data not shown).
Gene-based association tests
We used SKAT-O to test for evidence of an excess of rare, nonsynonymous alleles in our samples [44]. Grouping together nonsynonymous variants (including nonsense, frameshift, and splice site variants) with allele frequencies below 5% and using a Bonferroni correction for 22 502 tests (P-value significance threshold of 0.05/22502 = 2.22 × 10−6), 2 genes in the CFH locus, 4 genes the in C2/CFB/SKIV2L locus, and 2 genes in the C3 locus, including NRTN, showed significant associations (Table 4). To account for LD with nearby strongly associated variants, which could indirectly lead to significant gene-based rare variant signals, we also performed conditional analyses where we adjusted for the top common variant(s) in each of the loci. Conditioning on the top single variant signals in CFH and C2/CFB/SKIV2L, the SKAT-O signals for these genes disappeared. The two signals in the C3 locus remained after conditioning on the top common variant in that locus, and the signal for NRTN remained after conditioning on the top two variants, indicating that the NRTN signal might be independent of the signal in C3. In published GWAS [11], one of the three independent signals in the C3 locus (rs12019136, 19:5835677 G > A, intronic) was found to be near NRTN/FUT6, and their nonsynonymous variant enrichment analysis yielded a 95% credible set for this locus, which included a variant in NRTN (rs79744308, 19:5827765 G > A, p.Ala59Thr) as one of two signals with over 5% posterior probability of explaining the signal within this locus (along with a variant in FUT6). This same variant had the strongest signal in our single-variant conditional analysis in the C3 locus after conditioning on the first two independent signals, though it did not pass the threshold for genome-wide significance (conditional P = 5.85 × 10−7, OR = 0.59, case AF = 0.031, control AF = 0.051). The SKAT-O signal for NRTN disappeared when we conditioned on the NRTN nonsynonymous variant rs79744308 (P = 2.83 × 10−7 when not conditioning on rs79744308; P = 0.32 when conditioning on rs79744308), suggesting the variant indeed can explain the signal in the NRTN locus.
Gene-based tests on nonsynonymous variants for advanced AMD using SKAT-O. Significant signals from gene-based associations of 2394 AMD cases vs 2393 controls from SKAT-O were found in multiple genes in three of the four significant loci found in single-variant association tests. We used a Bonferroni correction for 22 502 tests to obtain a significance threshold of 2.22 × 10−6. No significant gene-based signal was found in the ARMS2/HTRA1 locus. Conditioning on the top variants in CFH (rs6688272, 1:196684392 G/T) and C2/CFB/SKIV2L (rs556679, 6:31894355 C/T) eliminated the SKAT-O signals in those loci, while conditioning on the top two independent variants in C3 (rs2230199, 19:6718387 G/C and rs181290250, 19:6722565 C/T, respectively) eliminated the signal in C3 but not NRTN. The NRTN signal disappeared after conditioning on the relatively common nonsynonymous variant rs79744308 (19:5827765 G/A, p.Ala59Thr, allele frequency = 4.1% in our samples), indicating that the NRTN signal was largely driven by this variant.
Chr . | Gene . | Locus . | Frac. with rare varsa . | Number of variants . | Number of singletons . | P-value . | SKAT-O Rhob . | Number of conditioning variants . |
---|---|---|---|---|---|---|---|---|
1 | CFH | CFH | 0.069 | 94 | 56 | 2.33 × 10−10 | 0 | 0 |
1 | CFHR2 | CFH | 0.12 | 16 | 2 | 4.24 × 10−9 | 0 | 0 |
6 | CFB | C2/CFB/SKIV2L | 0.23 | 80 | 47 | 3.90 × 10−11 | 0 | 0 |
6 | C2 | C2/CFB/SKIV2L | 0.11 | 54 | 38 | 8.29 × 10−10 | 0 | 0 |
6 | NOTCH4 | C2/CFB/SKIV2L | 0.23 | 114 | 68 | 7.15 × 10−9 | 0 | 0 |
19 | C3 | C3 | 0.036 | 50 | 36 | 1.70 × 10−11 | 0 | 0 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 5.67 × 10−7 | 0.4 | 0 |
19 | C3 | C3 | 0.036 | 50 | 36 | 7.37 × 10−13 | 0 | 1 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 1.34 × 10−7 | 0.5 | 1 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 2.83 × 10−7 | 0.5 | 2 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 0.318 | 1 | 3 |
Chr . | Gene . | Locus . | Frac. with rare varsa . | Number of variants . | Number of singletons . | P-value . | SKAT-O Rhob . | Number of conditioning variants . |
---|---|---|---|---|---|---|---|---|
1 | CFH | CFH | 0.069 | 94 | 56 | 2.33 × 10−10 | 0 | 0 |
1 | CFHR2 | CFH | 0.12 | 16 | 2 | 4.24 × 10−9 | 0 | 0 |
6 | CFB | C2/CFB/SKIV2L | 0.23 | 80 | 47 | 3.90 × 10−11 | 0 | 0 |
6 | C2 | C2/CFB/SKIV2L | 0.11 | 54 | 38 | 8.29 × 10−10 | 0 | 0 |
6 | NOTCH4 | C2/CFB/SKIV2L | 0.23 | 114 | 68 | 7.15 × 10−9 | 0 | 0 |
19 | C3 | C3 | 0.036 | 50 | 36 | 1.70 × 10−11 | 0 | 0 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 5.67 × 10−7 | 0.4 | 0 |
19 | C3 | C3 | 0.036 | 50 | 36 | 7.37 × 10−13 | 0 | 1 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 1.34 × 10−7 | 0.5 | 1 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 2.83 × 10−7 | 0.5 | 2 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 0.318 | 1 | 3 |
aThe fraction of samples with rare variants, defined as having an allele frequency of less than 5%.
bThe estimated weight parameter for SKAT-O, indicating the proportion of the weight assigned to the burden test. A Rho of 0 means the SKAT-O test is equivalent to a SKAT test, while a Rho of 1 means the SKAT-O test is equivalent to a burden test.
Gene-based tests on nonsynonymous variants for advanced AMD using SKAT-O. Significant signals from gene-based associations of 2394 AMD cases vs 2393 controls from SKAT-O were found in multiple genes in three of the four significant loci found in single-variant association tests. We used a Bonferroni correction for 22 502 tests to obtain a significance threshold of 2.22 × 10−6. No significant gene-based signal was found in the ARMS2/HTRA1 locus. Conditioning on the top variants in CFH (rs6688272, 1:196684392 G/T) and C2/CFB/SKIV2L (rs556679, 6:31894355 C/T) eliminated the SKAT-O signals in those loci, while conditioning on the top two independent variants in C3 (rs2230199, 19:6718387 G/C and rs181290250, 19:6722565 C/T, respectively) eliminated the signal in C3 but not NRTN. The NRTN signal disappeared after conditioning on the relatively common nonsynonymous variant rs79744308 (19:5827765 G/A, p.Ala59Thr, allele frequency = 4.1% in our samples), indicating that the NRTN signal was largely driven by this variant.
Chr . | Gene . | Locus . | Frac. with rare varsa . | Number of variants . | Number of singletons . | P-value . | SKAT-O Rhob . | Number of conditioning variants . |
---|---|---|---|---|---|---|---|---|
1 | CFH | CFH | 0.069 | 94 | 56 | 2.33 × 10−10 | 0 | 0 |
1 | CFHR2 | CFH | 0.12 | 16 | 2 | 4.24 × 10−9 | 0 | 0 |
6 | CFB | C2/CFB/SKIV2L | 0.23 | 80 | 47 | 3.90 × 10−11 | 0 | 0 |
6 | C2 | C2/CFB/SKIV2L | 0.11 | 54 | 38 | 8.29 × 10−10 | 0 | 0 |
6 | NOTCH4 | C2/CFB/SKIV2L | 0.23 | 114 | 68 | 7.15 × 10−9 | 0 | 0 |
19 | C3 | C3 | 0.036 | 50 | 36 | 1.70 × 10−11 | 0 | 0 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 5.67 × 10−7 | 0.4 | 0 |
19 | C3 | C3 | 0.036 | 50 | 36 | 7.37 × 10−13 | 0 | 1 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 1.34 × 10−7 | 0.5 | 1 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 2.83 × 10−7 | 0.5 | 2 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 0.318 | 1 | 3 |
Chr . | Gene . | Locus . | Frac. with rare varsa . | Number of variants . | Number of singletons . | P-value . | SKAT-O Rhob . | Number of conditioning variants . |
---|---|---|---|---|---|---|---|---|
1 | CFH | CFH | 0.069 | 94 | 56 | 2.33 × 10−10 | 0 | 0 |
1 | CFHR2 | CFH | 0.12 | 16 | 2 | 4.24 × 10−9 | 0 | 0 |
6 | CFB | C2/CFB/SKIV2L | 0.23 | 80 | 47 | 3.90 × 10−11 | 0 | 0 |
6 | C2 | C2/CFB/SKIV2L | 0.11 | 54 | 38 | 8.29 × 10−10 | 0 | 0 |
6 | NOTCH4 | C2/CFB/SKIV2L | 0.23 | 114 | 68 | 7.15 × 10−9 | 0 | 0 |
19 | C3 | C3 | 0.036 | 50 | 36 | 1.70 × 10−11 | 0 | 0 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 5.67 × 10−7 | 0.4 | 0 |
19 | C3 | C3 | 0.036 | 50 | 36 | 7.37 × 10−13 | 0 | 1 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 1.34 × 10−7 | 0.5 | 1 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 2.83 × 10−7 | 0.5 | 2 |
19 | NRTN | C3 | 0.082 | 12 | 8 | 0.318 | 1 | 3 |
aThe fraction of samples with rare variants, defined as having an allele frequency of less than 5%.
bThe estimated weight parameter for SKAT-O, indicating the proportion of the weight assigned to the burden test. A Rho of 0 means the SKAT-O test is equivalent to a SKAT test, while a Rho of 1 means the SKAT-O test is equivalent to a burden test.
To verify our SKAT-O results, we repeated our gene-based analysis using the variable threshold (VT) burden test [45] as implemented in EPACTS using a maximal allele frequency (AF) threshold of 5%. Only two genes remained significantly associated with AMD after Bonferroni-correction (P-value below 0.05/22502 = 2.22 × 10−6): C2 (P = 7.0 × 10−7) and NRTN (P = 4.0 × 10−7). To test the sensitivity of the analysis with respect to the variant threshold, we repeated the SKAT-O and VT analysis with a maximal AF threshold of 1%. The only gene that remained significant in the 1% SKAT-O analysis was C3 (P = 1.70 × 10−11); no genes were significant in the 1% VT analysis.
Rare pLoF rate comparison with a large external panel (ExAC)
To further test our rarer variants for enrichment in cases, we performed an analysis restricted to very rare pLoF variants (allele frequency < 0.1% in our sample) and aggregated results at the gene level. We performed this analysis by comparing our data against the variant list from the Exome Aggregation Consortium (ExAC) [39], with 60 706 population-based sequenced samples.
Using estimated rare pLoF carrier counts, we generated contingency tables using pLoF carrier counts and case-control status. First, restricting our analysis to genes found in the 34 known risk loci from the previous GWAS removed 20 538 out of 21 383 genes, narrowing the list down to 845 genes. Next, we removed genes without any rare pLoF variants in both AMD cases and controls, which accounted for an additional 538 genes, leaving us with 307. We kept only genes for which the control and ExAC samples were not significantly different, using P > 0.01 as the criterion, which removed 249 additional genes, with 289 remaining. We narrowed this list down further to include only genes in which the case and ExAC samples were significantly different, using P < 0.01 as the criterion, removing 279 genes, leaving 10 genes. We filtered this list of genes for those in which the case pLoF allele frequency was greater than the control pLoF allele frequency in our sequenced samples, which suggested an association of deleterious rare pLoF alleles with increased risk for AMD (Supplementary Table S4). This removed 7 more genes, leaving us with 3 genes with evidence of association (Table 5).
Comparing rare pLoF allele carriers in AMD cases and controls with ExAC. The 2×2 contingency tables between case-control/ExAC status and pLoF carrier status were constructed as follows: Allele carriers = pLoF count; allele non-carriers = N – pLoF count, where N = the number of samples in each data source (cases, controls, or ExAC); then two 2×2 tables can be constructed, one between cases and ExAC and one between controls and ExAC. (For example, the 2×2 table for CFH had 5 carriers and 2389 non-carriers in cases, and 9 carriers and 60 697 non-carriers in ExAC.) We used one-sided Fisher’s exact tests with null hypotheses that the odds ratio for the effect of ExAC vs. the tested cohort was less than 1 (that is, the gene had a lower pLoF frequency in the tested cohort than in ExAC). We analyzed a subset of genes for which the case-ExAC P-value were significant (P-value < 0.01) but the control-ExAC P-value were not significant (P-value > 0.01), and which lay within known AMD risk loci, totaling 3 genes. These genes had a higher case carrier frequency than control carrier frequency, consistent with what would be expected from rare deleterious loss-of-function variants.
Gene . | Case carriers . | Control carriers . | ExAC carriers . | Case vs. ExAC . | |
---|---|---|---|---|---|
. | (N = 2394) . | (N = 2393) . | (N = 60 706) . | P-value . | OR . |
CFH | 5 | 0 | 9 | 1.18 × 10−4 | 14.1 |
ORMDL2 | 3 | 0 | 6 | 3.86 × 10−3 | 12.7 |
CFI | 5 | 0 | 30 | 9.84 × 10−3 | 4.2 |
Gene . | Case carriers . | Control carriers . | ExAC carriers . | Case vs. ExAC . | |
---|---|---|---|---|---|
. | (N = 2394) . | (N = 2393) . | (N = 60 706) . | P-value . | OR . |
CFH | 5 | 0 | 9 | 1.18 × 10−4 | 14.1 |
ORMDL2 | 3 | 0 | 6 | 3.86 × 10−3 | 12.7 |
CFI | 5 | 0 | 30 | 9.84 × 10−3 | 4.2 |
Comparing rare pLoF allele carriers in AMD cases and controls with ExAC. The 2×2 contingency tables between case-control/ExAC status and pLoF carrier status were constructed as follows: Allele carriers = pLoF count; allele non-carriers = N – pLoF count, where N = the number of samples in each data source (cases, controls, or ExAC); then two 2×2 tables can be constructed, one between cases and ExAC and one between controls and ExAC. (For example, the 2×2 table for CFH had 5 carriers and 2389 non-carriers in cases, and 9 carriers and 60 697 non-carriers in ExAC.) We used one-sided Fisher’s exact tests with null hypotheses that the odds ratio for the effect of ExAC vs. the tested cohort was less than 1 (that is, the gene had a lower pLoF frequency in the tested cohort than in ExAC). We analyzed a subset of genes for which the case-ExAC P-value were significant (P-value < 0.01) but the control-ExAC P-value were not significant (P-value > 0.01), and which lay within known AMD risk loci, totaling 3 genes. These genes had a higher case carrier frequency than control carrier frequency, consistent with what would be expected from rare deleterious loss-of-function variants.
Gene . | Case carriers . | Control carriers . | ExAC carriers . | Case vs. ExAC . | |
---|---|---|---|---|---|
. | (N = 2394) . | (N = 2393) . | (N = 60 706) . | P-value . | OR . |
CFH | 5 | 0 | 9 | 1.18 × 10−4 | 14.1 |
ORMDL2 | 3 | 0 | 6 | 3.86 × 10−3 | 12.7 |
CFI | 5 | 0 | 30 | 9.84 × 10−3 | 4.2 |
Gene . | Case carriers . | Control carriers . | ExAC carriers . | Case vs. ExAC . | |
---|---|---|---|---|---|
. | (N = 2394) . | (N = 2393) . | (N = 60 706) . | P-value . | OR . |
CFH | 5 | 0 | 9 | 1.18 × 10−4 | 14.1 |
ORMDL2 | 3 | 0 | 6 | 3.86 × 10−3 | 12.7 |
CFI | 5 | 0 | 30 | 9.84 × 10−3 | 4.2 |
Two of these three genes (CFH and CFI) had been identified in previous studies to contain rare coding variants (including missense and pLoF variants) associated with AMD [38, 46, 47]. In our data, five out of 2394 case samples each carried a pLoF allele in CFH (a pLoF rate of about 1 in 500), distributed across four different variants, while none of our 2393 control samples carried any alternate alleles in the same variants. Similarly, five case samples each carried a pLoF allele in CFI, distributed across three different variants (about 1 in 500), while no control samples carried any alternate allele. In comparison, ExAC contains nine pLoF allele carriers in CFH (about 1 in 7000) and 30 in CFI (about 1 in 2000). The third gene, ORMDL2, was found in the RDH5/CD63 locus, and no previous study had examined the effect of rare variants in this locus on AMD. The credible set analysis for the RDH5/CD63 locus in Fritsche et al. did not find any nonsynonymous variant likely to explain the signal within this locus, with the lead variant (rs3138141) lying downstream of BLOC1S1, and the only exonic variant being a synonymous SNP in RDH5 (rs3138142, p.Ile14=) [11]. There was a significant difference in pLoF allele frequencies in ORMDL2 between our cases and ExAC based on pLoF variants, but by definition not for controls vs. ExAC (P-values = 3.86 × 10−3 and 0.79, respectively): three of our case samples carried a pLoF allele in ORMDL2 (about 1 in 800) with none found in our control samples, while ExAC contains six pLoF alleles in the same gene (about 1 in 2000).
Replication study of rare pLoF carrier frequencies using Regeneron samples
Comparing 1714 cases with 7356 controls in Regeneron exome sequence samples [48] and comparing pLoF carrier rates for variants with an allele frequency < 1% in these samples, we replicated our rare pLoF signal in CFH, with a significant difference between cases and controls in CFH (case allele frequency = 4.08 × 10−3, control allele frequency = 2.72 × 10−4, P-value = 2.14 × 10−4, OR = 15.1, 95% CI = 2.9–148.7; Table 6). Of the five genes tested in the Regeneron samples, only two (CFH and SLC16A8) contained rare pLoF variants (7 and 39 case carriers, and 2 and 139 control carriers, in CFH and SLC16A8, respectively). We did not detect evidence of a significant difference between cases and controls in the other four genes (CFI, ORMDL2, SLC16A8, and TIMP3) in the Regeneron samples.
Replication gene-level association study of rare pLoF variants using Regeneron samples. Predicted LoF variants in 1714 cases and 7356 controls with allele frequency < 1% were used to estimate rare pLoF carrier counts in five genes. Only two of the five genes (CFH and SLC16A8) contained rare pLoF variants in the Regeneron samples. We were able to replicate the signal in CFH. The pLoF frequency in SLC16A8 was significantly different between Regeneron controls and ExAC, indicating that the significant SLC16A8 result for case vs. ExAC could be a false positive, possibly arising from the differences in sequencing depth, variant calling algorithms, or population structure between the Regeneron and ExAC samples.
Gene . | Case alleles . | Control alleles . | ExAC alleles . | Case vs. Control . | Case vs. ExAC . | Control vs. ExAC . | |||
---|---|---|---|---|---|---|---|---|---|
. | (N = 1714) . | (N = 7356) . | (N = 60 706) . | P-value . | OR . | P-value . | OR . | P-value . | OR . |
CFH | 7 | 2 | 9 | 2.14 × 10−4 | 15.1 | 1.07 × 10−7 | 27.7 | 0.34 | 1.8 |
CFI | 0 | 0 | 30 | 1 | n/a | 1 | n/a | 1 | n/a |
ORMDL2 | 0 | 0 | 6 | 1 | n/a | 1 | n/a | 1 | n/a |
SLC16A8 | 39 | 139 | 159 | 0.29 | 1.21 | < 2.2 × 10−16 | 6.1 | < 2.2 × 10−16 | 7.3 |
TIMP3 | 0 | 0 | 33 | 1 | n/a | 1 | n/a | 1 | n/a |
Gene . | Case alleles . | Control alleles . | ExAC alleles . | Case vs. Control . | Case vs. ExAC . | Control vs. ExAC . | |||
---|---|---|---|---|---|---|---|---|---|
. | (N = 1714) . | (N = 7356) . | (N = 60 706) . | P-value . | OR . | P-value . | OR . | P-value . | OR . |
CFH | 7 | 2 | 9 | 2.14 × 10−4 | 15.1 | 1.07 × 10−7 | 27.7 | 0.34 | 1.8 |
CFI | 0 | 0 | 30 | 1 | n/a | 1 | n/a | 1 | n/a |
ORMDL2 | 0 | 0 | 6 | 1 | n/a | 1 | n/a | 1 | n/a |
SLC16A8 | 39 | 139 | 159 | 0.29 | 1.21 | < 2.2 × 10−16 | 6.1 | < 2.2 × 10−16 | 7.3 |
TIMP3 | 0 | 0 | 33 | 1 | n/a | 1 | n/a | 1 | n/a |
Replication gene-level association study of rare pLoF variants using Regeneron samples. Predicted LoF variants in 1714 cases and 7356 controls with allele frequency < 1% were used to estimate rare pLoF carrier counts in five genes. Only two of the five genes (CFH and SLC16A8) contained rare pLoF variants in the Regeneron samples. We were able to replicate the signal in CFH. The pLoF frequency in SLC16A8 was significantly different between Regeneron controls and ExAC, indicating that the significant SLC16A8 result for case vs. ExAC could be a false positive, possibly arising from the differences in sequencing depth, variant calling algorithms, or population structure between the Regeneron and ExAC samples.
Gene . | Case alleles . | Control alleles . | ExAC alleles . | Case vs. Control . | Case vs. ExAC . | Control vs. ExAC . | |||
---|---|---|---|---|---|---|---|---|---|
. | (N = 1714) . | (N = 7356) . | (N = 60 706) . | P-value . | OR . | P-value . | OR . | P-value . | OR . |
CFH | 7 | 2 | 9 | 2.14 × 10−4 | 15.1 | 1.07 × 10−7 | 27.7 | 0.34 | 1.8 |
CFI | 0 | 0 | 30 | 1 | n/a | 1 | n/a | 1 | n/a |
ORMDL2 | 0 | 0 | 6 | 1 | n/a | 1 | n/a | 1 | n/a |
SLC16A8 | 39 | 139 | 159 | 0.29 | 1.21 | < 2.2 × 10−16 | 6.1 | < 2.2 × 10−16 | 7.3 |
TIMP3 | 0 | 0 | 33 | 1 | n/a | 1 | n/a | 1 | n/a |
Gene . | Case alleles . | Control alleles . | ExAC alleles . | Case vs. Control . | Case vs. ExAC . | Control vs. ExAC . | |||
---|---|---|---|---|---|---|---|---|---|
. | (N = 1714) . | (N = 7356) . | (N = 60 706) . | P-value . | OR . | P-value . | OR . | P-value . | OR . |
CFH | 7 | 2 | 9 | 2.14 × 10−4 | 15.1 | 1.07 × 10−7 | 27.7 | 0.34 | 1.8 |
CFI | 0 | 0 | 30 | 1 | n/a | 1 | n/a | 1 | n/a |
ORMDL2 | 0 | 0 | 6 | 1 | n/a | 1 | n/a | 1 | n/a |
SLC16A8 | 39 | 139 | 159 | 0.29 | 1.21 | < 2.2 × 10−16 | 6.1 | < 2.2 × 10−16 | 7.3 |
TIMP3 | 0 | 0 | 33 | 1 | n/a | 1 | n/a | 1 | n/a |
Discussion
Our data represents the largest whole-genome sequencing study for AMD to date. We discovered several novel rare variants enriched in AMD cases, providing a valuable resource for future AMD studies. Because our data spanned the entire genome, it may be useful for future fine-mapping studies involving regions outside the exome, which remain the main location of AMD risk loci; e.g. previous studies have found regulatory roles for non-coding RNA in the complement pathway [49]. Using our disease-specific genotype data in imputation reference panels may help to impute both rare and common non-exonic variants accurately and may facilitate the discovery of regulation-specific variants in future AMD association studies since regulatory variants are likely drivers of a significant proportion of AMD GWAS signals [15, 17]. A subset of our samples was submitted to the Haplotype Reference Consortium [50], allowing future GWAS studies to impute AMD-specific variants into their samples using a web-based imputation service [51]. Additionally, we have contributed rare variant data to other AMD studies [34, 52].
In our single variant association analysis, we replicated the most significant signals from the AMD Consortium [11]. We obtained evidence supporting results from past studies that detected clinically significant risk variants in both CFB and C2 [53, 54] and that the most significant signals in both of these genes were often in high linkage disequilibrium [55].
Our results from grouped association tests generally recapitulated previous results. The elimination of our SKAT-O signals in CFH and C2/CFB/SKIV2L after conditioning on the top variants in each locus provides support for the hypothesis that the top single-variant signals can tag a larger set of rare nonsynonymous variants in LD with those top variants. Similarly, in NRTN, eliminating our SKAT-O signal after conditioning on rs79744308 (NRTN p.Ala59Thr) indicates that this variant largely drove the SKAT-O signal. The nonsynonymous nature of this variant, along with previous mouse studies which found functional relationships between NRTN and neuron development and activity in the retina, [56–58] suggest that rs79744308 in NRTN may have a protective effect on AMD independent of the effects of variants in C3. Transcriptional profiles of postmortem retinas from 453 controls and AMD cases reveal high expression of NRTN gene [15].
Using a large publicly available data set as an external panel, we were able to improve our association analysis further. Our analysis of very rare pLoF variants provides us with association evidence not found by other methods. We identified confirmatory signals in CFH and CFI. The signal in CFI was especially interesting since the gene was not indicated in single-variant or grouped association tests. These results are concordant with previous studies identifying pLoF variants in CFH to be significantly associated with early-onset macular drusen [22], whereas pLoF variants in CFI were associated with advanced AMD [8]. Additionally, we identified a similar signal in ORMDL2, which is also highly expressed in the human retina [15], but its function in eye tissues is poorly understood. Knockout mouse studies suggest that the Ormdl protein family may be important in preventing damage to the nervous system [59], making ORMDL2 an interesting candidate gene for AMD functional studies.
Our findings provide several key insights for designing and performing future sequencing studies for complex diseases, including AMD. First, using well-matched cases and controls can greatly reduce the potential problem of population stratification. In our study, we were aggressive about matching cases and controls on age, sex, and ancestry, resulting in well-controlled test statistics (λGC = 1.021). Second, our approach addresses the limitations of traditional association methods and highlights the importance of a more holistic approach in interpreting GWAS results. For example, our SKAT-O test of nonsynonymous variants showed that there was no additional significant gene-based signal in the C2/CFB/SKIV2L locus after conditioning on the top variant, rs556679, yet our haplotype analysis of the same locus indicated an additional haplotype risk effect beyond the top variant even after controlling for two known nonsynonymous variants. This suggests that the underlying signal of that locus may involve more complex interactions between different variants, including but not limited to nonlinear interaction effects between multiple variants, within enhancers, promoters, and variants not called by the sequencing pipeline. As another example, the SKAT-O signal in NRTN remained significantly significant after conditioning on the only significant (P < 5 × 10−8) independent variants in the C3 locus. Yet, the signal disappeared after conditioning on rs79744308, a nonsynonymous variant in NRTN with a single-variant association P-value of 9.26 × 10−7. Though this variant was not genome-wide significant, it drove the entire NRTN SKAT-O signal, suggesting a possible biological connection. Identification of this variant as potentially interesting was only possible after combining the results from both the single-variant and grouped tests.
Third, we leveraged a large, publicly available, population-based variant list to supplement association tests within our sequenced samples. In studies with modest sample sizes, single-variant and gene-based tests will be underpowered to find associations, even more so for rare variants. To partially mitigate this limitation, we prioritized genes within known risk loci and focused on extremely rare predicted loss-of-function variants, ones with major, well-defined effects on the translated products, with a better chance of providing biological insights for genetic associations. Though individual genotypes usually are not shared due to privacy concerns, variant lists with allele counts are often available to the public. The availability of the ExAC variant list and our focus on rare pLoF variants led to our discovery of a difference in loss-of-function allele frequencies between our cases and the samples in ExAC in the ORMDL2 gene, suggesting a possible functional role for this gene in AMD. Additionally, though we had very few rare pLoF alleles in each gene, by targeting genes in known AMD risk loci and using a less stringent FDR threshold, our method was able to recover the known CFI signal that SKAT-O was underpowered to detect. To test the robustness of our approach, we repeated our pLoF analysis with the larger gnomAD dataset [41]. The 141 456 gnomAD samples had 32 pLoF alleles in CFH (about 1 in 4400), 100 in CFI (about 1 in 1400), and 28 in ORMDL2 (about 1 in 5000), all lower than in our AMD cases (about 1 in 500, 1 in 500, and 1 in 800, respectively). The difference in CFH remained significant (P = 3.56 × 10−4 using the Exact Test), while the differences in CFI and ORMDL2 were attenuated (P = 0.031 and 0.015, respectively). A similar comparison with pLoF alleles in TOPMed data led to similar results, finding about 1 in 4000 pLoF alleles in CFH, about 1 in 900 in CFI, and about 1 in 2700 in ORMDL2. We note that using low-pass sequencing limited our ability to reliably discover and call all rare pLoF variants in our samples, likely leading to underestimating pLoF counts in our cases. Moreover, our ability to replicate the CFH signal in an independent data set from Regeneron validated our approach. As the number of sequencing studies increases, our new group-based approach should be able to leverage new data sources to uncover associations that could not be detected using traditional methods.
Finally, our approach highlights the importance of high-quality phenotypes for genomic studies. A vast majority of cases in our study obtained diagnoses from ophthalmologists with subspecialty training in screening retinal imaging for advanced AMD, the controls in AREDS were followed for 10 years to make sure there was no AMD phenotype, and our single-variant association tests confirmed four previously discovered loci with highly significant results. In comparison, the UK Biobank [60] contained a phenotype with similar power (2524 cases and 106 293 controls; data field 6148/5), which consisted of all respondents who answered “Eye problems/disorders: Macular degeneration” to the question “Has a doctor told you that you have any of the following problems with your eyes?”, obtained via self-reported touchscreen questionnaires. The association tests for this phenotype were able to confirm only the top two loci, with greatly reduced significance for the top signals in each locus (pCFH = 4.6×10−114 vs. 1×10−18 and pARMS2 = 2.1×10−92 vs. 1.9×10−24 for our study vs. UK Biobank, respectively). These results show that the high quality of phenotypes, along with age- and sex-matching cases and controls and filtering for ancestral background, are important considerations for association studies.
Our study has a few limitations that decrease our ability to find associations. First, the sample size was relatively modest compared to the published AMD GWAS [11], reducing the power to replicate known loci or discover novel associations. Second, low sequencing depth (~6×) limited our ability to call many very rare variants with confidence due to the conservative nature of our variant calling filtering criteria. Though this limited our variant count, it ensured that variants that passed all filters during variant calling are of high quality, including those within difficult-to-map homologous or repetitive regions. We believe that the minimal presence of false positives in our grouped test results provides evidence for the reliability of our variant calls. Third, even though our phenotypes were derived from high-quality expert screening from ophthalmologists, it is possible that a small number may have been misclassified cases of late-onset AMD-mimicking dystrophies, leading to a reduction in association power. Fourth, using only samples of European ancestry meant that our results can’t be generalized to other populations. Fifth, the allele frequency threshold in our replication study using samples from Regeneron was higher than that used in our initial analysis with ExAC (1% instead of 0.1%), which might lead to overestimating the number of carriers and lead to larger Type I errors in gene-level carrier tests and adding noise to the results. Finally, the differences in sequencing depths, variant calling algorithms, and phenotype definition between our case–control data, ExAC, and Regeneron samples did not allow us to analyze variants in all these data sets jointly. Despite these limitations, we have expanded on results from previous studies, provided an integrative framework for the identification of rare-variant association studies and highlighted potentially novel gene-level associations.
Materials and Methods
Experimental model and study participant details
Our primary data consisted of whole genome sequence samples for 2394 AMD cases and 2393 controls, matched by age and sex, with an average depth of 6×. Their ages ranged from 50 to 101, with an average of 75, with 55% females and 45% males. There were no significant differences in age or sex between cases and controls. In our association analysis, the cases have been combined for comparisons against controls. P-value for age was obtained via a two-sample t-test; P-value for male proportion was obtained using Pearson’s Chi-squared test with Yates’ continuity correction. Sample summary is listed in the table below.
AMD Status . | Controls . | Cases . | P-value . |
---|---|---|---|
None | 2393 | ||
Large Drusen | 583 | ||
Geographic Atrophy (GA) | 419 | ||
Macular Neovascularization (MNV) | 1122 | ||
Mixed GA + MNV | 270 | ||
Total | 2393 | 2394 | |
Age, mean (range) | 74.9 (50.0–94.2) | 75.1 (50.4–101.0) | 0.49 |
Males (%) | 45.2 | 44.9 | 0.86 |
AMD Status . | Controls . | Cases . | P-value . |
---|---|---|---|
None | 2393 | ||
Large Drusen | 583 | ||
Geographic Atrophy (GA) | 419 | ||
Macular Neovascularization (MNV) | 1122 | ||
Mixed GA + MNV | 270 | ||
Total | 2393 | 2394 | |
Age, mean (range) | 74.9 (50.0–94.2) | 75.1 (50.4–101.0) | 0.49 |
Males (%) | 45.2 | 44.9 | 0.86 |
AMD Status . | Controls . | Cases . | P-value . |
---|---|---|---|
None | 2393 | ||
Large Drusen | 583 | ||
Geographic Atrophy (GA) | 419 | ||
Macular Neovascularization (MNV) | 1122 | ||
Mixed GA + MNV | 270 | ||
Total | 2393 | 2394 | |
Age, mean (range) | 74.9 (50.0–94.2) | 75.1 (50.4–101.0) | 0.49 |
Males (%) | 45.2 | 44.9 | 0.86 |
AMD Status . | Controls . | Cases . | P-value . |
---|---|---|---|
None | 2393 | ||
Large Drusen | 583 | ||
Geographic Atrophy (GA) | 419 | ||
Macular Neovascularization (MNV) | 1122 | ||
Mixed GA + MNV | 270 | ||
Total | 2393 | 2394 | |
Age, mean (range) | 74.9 (50.0–94.2) | 75.1 (50.4–101.0) | 0.49 |
Males (%) | 45.2 | 44.9 | 0.86 |
We obtained samples of European ancestry with advanced AMD (defined as subjects diagnosed as having large drusen, macular neovascularization [MNV], geographic atrophy [GA], or mixed MNV/GA in at least one eye) from the University of Michigan Kellogg Eye Center, the AREDS1 and AREDS2 studies from the National Eye Institute, and the Scheie Eye Institute at the University of Pennsylvania Perelman School of Medicine, as described previously in indicated references [6, 11, 60–63]. The AREDS1 study included their own matched controls, while additional matched control samples for the other studies were obtained from the Michigan Genomics Initiative (MGI; as per above table) as follows: first, we calculated propensity scores for all available samples, using age and sex as covariates and treatment group as outcome; then, we used a 1-to-1 greedy matching algorithm to match each unmatched case to the best-matching MGI control sample [60]. The matched pair was removed from the selection pool, and the matching continued iteratively until all cases were matched. Samples from MGI used for matching were restricted to subjects not diagnosed with AMD at the time of sample collection. Study participants provided informed consent, and all protocols were reviewed and approved by the University of Michigan Institutional Review Board (HUM00080973) and as described previously [6, 11, 60–63].
Methods
Whole genome sequencing data
We collected DNA samples from blood, saliva, and cell culture to generate short-read next-generation sequences via the Illumina HiSeq 2500 platform utilizing Illumina’s TruSeq DNA PCR-Free library prep for multiplexed runs.
Bioinformatics analysis
We used BWA-MEM (v0.6) to align the genomes using the NCBI RefSeq hg19 human genome reference assembly, GotCloud (v1.16) to call single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels) [64, 65] and beagle4 for phasing and genotype refinement. Briefly, first, we calculated genotype likelihoods across all samples, accounting for base quality scores. Next, we created a variant list based on those joint likelihoods and multiple quality control filters. Then, we estimated posterior probabilities based on those likelihoods, which we used to call genotypes. Finally, we phased the genotypes to produce the final genomic dataset. We used the called genotypes for all summaries and downstream analyses. To annotate the variants, we used Variant Effect Predictor (VEP), build 84 [66], using the combined transcripts from both RefSeq and Ensembl as reference.
For quality control, we used LASER [67] (v1.0) to project genetic ancestry principal components (PCs) derived from our genotype data onto reference ancestry coordinates derived using samples from the Human Genome Diversity Project (HGDP) [68]. To filter out samples of non-European ancestry, we calculated an acceptance region for samples with European ancestry as follows: first, using the first two PCs, we calculated the mean coordinates for European, African, East Asian, and all HGDP samples; next, we defined a circular region using the mean European HGDP coordinates as the center, with a radius equal to one-quarter the distance from the mean European HGDP coordinates and the mean all-samples HGDP coordinates (Supplementary Fig. S1). Thirty-two samples outside of this circular region were removed from downstream analyses. Furthermore, we used VerifyBamID [64] (v1.1.1) to estimate sample-level contamination and excluded 20 samples with estimated contamination above 3%. Of our sequenced samples, 2565 had array genotypes available, which we used to identify sample labeling errors in our sequence data. The vast majority of these samples had very high concordance (> 99%) between array and WGS genotypes. For samples with low concordance, we compared their WGS genotypes against all samples with array genotypes, identifying 4 pairs of sample swaps, one duplicated sample, and 19 other samples with mismatches between sequencing and array data. We relabeled the swapped samples, combined the duplicated samples, and removed the mismatched samples.
Quantification and statistical analysis
Association analyses
We defined a single binary outcome, “advanced AMD”, which included samples with large drusen, GA, and/or MNV. We used the most likely called genotypes in our association tests. Because we used age- and sex- matched case-control samples, we did not include covariates in our association models. We used P ≤ 5 × 10−8 as our genome wide significance threshold for all single-variant tests. We tested the advanced AMD cases against control samples using Firth bias-corrected logistic regression using a likelihood ratio test [69], as implemented in EPACTS (v3.2.6) (https://github.com/statgen/EPACTS) to perform single-variant GWAS.
Next, to identify statistically independent loci, we used a chromosome-based sequential forward selection approach: first, we identified the most significantly associated variant by P-value in each chromosome; if that variant was genome wide significant (P-value < 5 × 10−8), then we performed a conditional analysis on all variants in that chromosome by adding the genotypes at that variant as a covariate. If the most significant variant in this analysis was also genome wide significant, then we added it as an additional covariate and repeated the analyses of all variants in that chromosome until no significant variants remained. Each of the variants selected in this sequential procedure was matched to the nearest locus among the reported 34 AMD-associated loci [11] (all variants were within 434 kb of one of these previously reported loci).
For gene-based tests, we used SKAT-O [70] for all predicted nonsynonymous variants with allele frequency less than 5% in each gene. Predicted nonsynonymous variants are defined as those with moderate to high severity according to Sequence Ontology [71], and includes missense, nonsense, frameshift, and essential splice site variants. We applied the Bonferroni correction for 22 502 tests to get a significance threshold of 0.05/22 502 = 2.22 × 10−6.
External data sets
We used several other data sets for gene-based pLoF comparison (detailed below). For external controls, we used pLoF allele count data from ExAC [39], consisting of 60 706 exome-sequenced samples; gnomAD [40], with 125 748 sequenced exomes and 15 708 sequenced genomes; and TOPMed [41] (via the BRAVO browser (NHLBI 2018)) with 62 784 sequenced genomes. We also obtained summary data for pLoF allele counts in five genes (CFH, CFI, ORMDL2, SLC16A8, and TIMP3) from Regeneron, with 1714 AMD cases and 7356 controls, for a follow-up replication analysis.
Gene-level comparison of rare pLoF variants with the exome aggregation consortium
We annotated both the ExAC variant list and our own variant list using Variant Effect Predictor (VEP) build 84, using the merged RefSeq/Ensembl human database as reference. We generated a subset of rare pLoF variants, defined as variants with an allele frequency of less than 0.1% in our samples and annotated as stop-gained, frameshift, splice donor, or splice acceptor, from our sequenced data. Rare pLoF variants are especially informative for genetic association studies, because they can point to very specific effector genes and disease mechanisms. To increase power, we compared frequencies of rare pLoFs in our case and controls to variant frequencies in ExAC. We have |${N}_{ctrl}$| = 2393 total controls, |${N}_{case}$| = 2394 total cases, and |${N}_{ExAC}$| = 60 706 total ExAC samples. One complication is that the ExAC data consists of variant-level allele summaries with no individual-level data. Thus, to tally the number of rare pLoF carriers in each gene, we summed the number of pLoF carriers for each variant in the gene. In this model, we assumed that all alternate alleles were present in different individuals in ExAC due to the rarity of the alleles, so that the number of rare pLoF variants represented our approximation of the number of pLoF carriers. This number could produce a small overestimate, since for some genes the same individual might carry multiple pLoF variants, leading to a larger estimated pLoF carrier count than the number of individuals with a pLoF in those genes. For the ith variant in the jth gene, we estimated the number of pLoF variant carriers by obtaining allele counts for all rare pLoF variants in three sets of samples—our controls (|${A}_{ctrl,i,j}$|), our cases (|${A}_{case,i,j}$|), and ExAC (|${A}_{ExAC,i,j}$|); next, we summed up the total number of alleles in each of our three cohorts as our estimate of the number of pLoF carriers (c) for the jth gene:
We filtered out all genes for which |${c}_{cohort,j}=0$| in all cohorts. We then calculated the number of non-carriers (w) in the jth gene for each cohort:
With these estimates, we created two 2x2 contingency tables for allele carriers vs. non-carriers: controls vs. ExAC (wctrl,j and cctrl,j vs. wExAC,j and cExAC,j) and cases vs. ExAC (wcase,j and ccase,j vs. wExAC,j and cExAC,j).
Given that our sequenced data had a lower average coverage than ExAC, we expected fewer rare pLoF alleles per sample to be called in our data compared to ExAC, leading to a lower pLoF carrier frequency. Therefore, while a higher pLoF carrier frequency in our cases compared to ExAC would provide evidence to suggest an association, we do not attempt to interpret situations where our sample shows fewer pLoF carriers than ExAc. We calculated P-values for these contingency tables using a one-sided Fisher’s Exact Test (PFETA-vs-B,j), representing the probability of our tested cohort (AMD controls or AMD cases) having a lower pLoF carrier frequency compared to ExAC, given a fixed total number of pLoF carriers between the tested cohort and ExAC. A significant P-value is therefore evidence supporting a higher pLoF carrier frequency in our tested cohort compared to ExAC. Due to differences in sequencing method, sequencing depth, variant calling algorithms, and experimental conditions, we restricted our comparison to only those genes for which the distribution of pLoF alleles in our sequenced controls was not significantly higher than that in ExAC.
We performed a simulation study by varying effect size, carrier frequency, and P-value threshold to characterize the balance between false discovery rate and power under different thresholds. We found that the less stringent threshold of 0.01 had good power to discover gene-level associations across a variety of allele frequencies and effect sizes while maintaining a low false discovery rate (Supplementary Table S4). We retained the subset of genes found in the International AMD Genomics Consortium (IAMDGC) risk loci for which |${PFET}_{CtrlExAC,j}>0.01$|, representing the set of genes in which the pLoF carrier frequencies in our control samples were not statistically significantly higher than the pLoF carrier frequencies in ExAC. Following these results, we filtered the gene subset for those with |${PFET}_{CaseExAC,j}<0.01$|.
To test the validity of using pLoF allele counts as an estimate of pLoF carrier counts, we estimated both the proportion of genes in which at least one sample had multiple pLoF variants and the proportion of samples containing multiple rare pLoF variants in a single gene in our sequencing data. Starting with 21 351 genes, we identified a subset of 8222 genes in which our AMD cases or AMD controls had at least one rare pLoF variant, where “rare” was defined as any given variant having an allele frequency of less than 0.1% (in our samples, this translated to having an allele count of 9 or fewer out of 4787 × 2 = 9574 total alleles). Of these 8222 genes, we observed two or more rare pLoF variants in the same individual in 59 out of 4787 samples (1.2%), distributed across 26 genes (0.3% of genes). None of these genes are found in our list of potentially associated genes.
Replication analysis of gene-level rare pLoF alleles from Regeneron samples
We repeated our rare pLoF association study using samples from Regeneron for the genes in which we found a difference between cases and controls in our ExAC comparison, along with genes which contained previously discovered pLoF variants with large effects, totaling five genes: CFH, CFI, ORMDL2, SLC16A8, and TIMP3. The AMD phenotypes for these samples were predicted using the eMERGE Network EMR Phenotype Algorithm [72]. We used Fisher’s Exact Test to compare rare pLoF variants in the cases vs. controls within the Regeneron samples, where rare was defined as an allele frequency less than 1%. We also compared the rare pLoF variants in Regeneron cases and controls to ExAC in the same way we compared our AMD WGS sequencing samples to ExAC above: for each gene, we compared controls to ExAC first to determine whether the two samples were different, then compared cases to ExAC if the controls were not statistically different from ExAC.
Data submission
Much of the data is available at dbGaP study accession phs002015.v2.p1 and other will be shared by the corresponding author upon request per policies of institutions and NIH. Any information required to reanalyze the data reported in this paper is available from the corresponding authors upon request.
Acknowledgements
We thank Linn Gieser and Milton English for assistance. The authors acknowledge the Michigan Genomics Initiative participants, Precision Health at the University of Michigan, the University of Michigan Medical School Central Biorepository, the University of Michigan Medical School Data Office for Clinical and Translational Research, and the University of Michigan Advanced Genomics Core for providing data and specimen storage, management, processing, and distribution services in support of the research reported in this publication.
Author contributions
Overall Conceptualization, A.K., E.Y.C., G.R.A. and A.S.; clinical and sample resources, Data collection X.Z., J.B.G., K.E.B., M.O., D.S.; Experimental work, Genetic and computational analysis, A.K., M.Z., L.G.F.; Data for replication study, T.M.T.; Writing original draft, A.K., J.A., G.R.A., A.S.; Writing, review, and editing: all authors; Funding: D.S., E.Y.C., A.S.; Supervision, G.R.A. and A.S.; Project Administration, E.Y.C., G.R.A. and A.S.
Conflict of interest statement: Alan Kwong is now an employee and shareholder of 23andMe, Inc.; Tanya M. Teslovich and Gonçalo R. Abecasis are employees and shareholders of Regeneron Pharmaceuticals, Inc. No other coauthor has any declared conflict of interest.
Funding
This research was supported by the National Eye Institute (Intramural Research Program grants ZIAEY000546 to A.S., ZIAEY000489 to E.Y.C. and extramural grant R01EY031209 to D.S.). This report also included data from the UK Biobank Resource conducted under application number 24460.