Exome Sequencing and Genotyping Identify a Rare Variant in NLRP7 Gene Associated With Ulcerative Colitis

Abstract Background and Aims Although genome-wide association studies [GWAS] in inflammatory bowel disease [IBD] have identified a large number of common disease susceptibility alleles for both Crohn’s disease [CD] and ulcerative colitis [UC], a substantial fraction of IBD heritability remains unexplained, suggesting that rare coding genetic variants may also have a role in pathogenesis. We used high-throughput sequencing in families with multiple cases of IBD, followed by genotyping of cases and controls, to investigate whether rare protein-altering genetic variants are associated with susceptibility to IBD. Methods Whole-exome sequencing was carried out in 10 families in whom three or more individuals were affected with IBD. A stepwise filtering approach was applied to exome variants, to identify potential causal variants. Follow-up genotyping was performed in 6025 IBD cases [2948 CD; 3077 UC] and 7238 controls. Results Our exome variant analysis revealed coding variants in the NLRP7 gene that were present in affected individuals in two distinct families. Genotyping of the two variants, p.S361L and p.R801H, in IBD cases and controls showed that the p.S361L variant was significantly associated with an increased risk of ulcerative colitis [odds ratio 4.79, p = 0.0039] and IBD [odds ratio 3.17, p = 0.037]. A combined analysis of both variants showed suggestive association with an increased risk of IBD [odds ratio 2.77, p = 0.018]. Conclusions The results suggest that NLRP7 signalling and inflammasome formation may be a significant component in the pathogenesis of IBD.


Introduction
Genome-wide association scans [GWAS] have been very successful in the identification of susceptibility genes for many common, complex disorders [https://www.ebi.ac.uk/gwas/]. GWAS in both major forms of inflammatory bowel disease [IBD], Crohn's disease [CD] and ulcerative colitis [UC], have been among the most productive, and provided a better understanding of the biology of these diseases. There are now around 240 robust genetic associations that have been identified [1][2][3] in IBD. However, most associations either span genomic regions that encompass multiple potential candidate genes or lie within non-coding or gene-poor regions. Their biological significance is therefore often unclear, and they explain only a modest proportion of the estimated heritability [or variation in genetic liability] of IBD. A recent study by the International IBD Genetics Consortium used a fine-mapping approach in 67 852 individuals, in an attempt to pinpoint the true disease-causing DNA variants at 94 of the top IBD loci. 4 This identified 18 single independent causal DNA variants at 14 loci with > 95% certainty, and 27 single independent genetic variants at 26 loci with > 50% certainty. Taken together, these 45 independent associations at 37 loci were enriched for 13 variants which elicit changes specifically in the protein-coding sequences of seven genes.
It is thought that some of the hidden heritability in IBD may be explained by the association of rare protein-altering variants that are likely to confer a higher risk of disease. It is unlikely that such variants would be detected by the GWAS approach alone, which is designed to identify association with common variants. In recent years, advances in DNA sequencing technology have facilitated gene sequencing on a much larger scale than could previously be attempted, enabling large-scale studies that have led to discovery of complex disease-causing variants that previously eluded GWAS studies. 5 Moreover, studies using targeted high-throughput sequencing of genes in IBD GWAS regions have, to date, detected around 30 multiple low-frequency variants associated with adult IBD 6-10 in 17 genes, including multiple variants in the NOD2 and IL23R genes.
A recent UK IBD Genetics Consortium study has undertaken low-depth sequencing of the entire genome in 4280 IBD patients, and found only one new low-frequency protein-coding [missense] variant in the ADCY7 gene which increases risk for UC. However, this study also detected evidence for an increased 'mutational-load' of rare damaging missense variants in known CD risk genes, although it was estimated that these low-frequency variants only accounted for a very small fraction of the variation in genetic liability [heritability] of IBD [< 2%] in the general population. 11 In general, the majority of genetic variants associated with common, complex disorders thus far do not lie within the coding regions of genes. However, this is not the case for some subsets of common disorders with a strong familial predisposition, such as breast cancer and Alzheimer's disease, in which rare, highly penetrant mutations have a causal role.
In this study, we hypothesised that families with large numbers of individuals affected with adult-onset IBD are more likely to have resulted from the action of rare or low-frequency damaging missense changes in one or a few genes. Such disease-causing variants would be easier to find by high-depth sequencing of coding genes in individuals from these families and, if present in genes from known IBD risk loci, might help to identify the causal genes in regions intractable to fine-mapping approaches.

Subjects
A large collection of White British IBD families with three or more affected individuals were recruited, with informed consent and institutional ethical approval as described previously. 12 Genomic DNA was prepared from 10 ml of whole blood using the salt/chloroform method described elsewhere. 12 Ten families were selected for sequencing because they had more than three affected individuals across multiple generations. Overall, seven of the selected families were affected with only CD, two with only UC, and one family had both CD-and UC-affected individuals. DNA samples from a total of 6025 IBD cases [2948 CD; 3077 UC] and 7238 controls of White British descent, described previously, 9 were used for follow-up genotyping and association studies, as well as genotype data from 16 267 UK IBD cases and 18 843 UK population controls from a recent whole-genome sequencing study. 11

Exome sequencing analysis
Whole-exome sequencing [WES] was carried out on 27 affected individuals from ten families, that is, two, three, or four individuals from each family. Our sequencing strategy was devised so that, where possible, only the most distantly related and affected family members from each branch of each family were sequenced. Using this strategy, we assumed that rare or low-frequency variants, that were shared between the exome-sequenced affected relatives in a family, were highly likely to be identical by descent, therefore optimising the information gained against the cost of WES of all the affected individuals. A total of 3 µg of genomic DNA was sheared to a mean fragment size of 150 bp [Covaris], and the fragments used for Illumina paired-end DNA library preparation and enrichment for target sequences [SureSelect Human All Exon 50Mb kit, Agilent]. Enriched DNA fragments were sequenced with 100-bp paired-end reads [GAIIx or HiSeq2000 platform, Illumina]. Sequencing reads were aligned to the reference human genome sequence [hg19] using the Novoalign software [Novocraft Technologies]. Duplicate and multiply mapping reads were excluded, and the depth and breadth of sequence coverage were calculated using custom scripts and the BedTools package. Single-nucleotide substitutions and small indels were identified with SAMtools, annotated with the ANNOVAR software, and variants called as previously described. 13 On average, seven gigabases of sequence were generated per sample; > 82% of the target exome was present at > 20-fold coverage, and > 94% was present at > 5-fold coverage.
Identified variants were prioritised for follow-up based on the following filters. i. We assumed a dominant model of inheritance and therefore selected only those variants that were heterozygous, ii. We looked for variants most likely to have a functional effect/ be protein-altering, so we focused on non-synonymous [missense, nonsense, or canonical splice site substitutions], and variants that were absent or present with a frequency of < 1% in the Exome Aggregation Consortium database [ExAc] [http:// exac.broadinstitute.org/about] or whole-exome sequence data from > 1000 in-house non-IBD individuals. iii. We selected only those variants that were present in all WES members from each family, using the strategy described above. iv. Finally, to increase the likelihood of selecting disease-causing variants, we selected only those genes in which we had identified one or more variants fulfilling criteria i-iii above in at least two different IBD families.

Genotyping
Owing to sequence homology of the coding regions of NLRP7 with its nearby paralogue NLRP2, the two rare variants in NLRP7,

Results
We conducted a screen for rare, potentially disease-causing coding variants by WES in a subset of the most distantly related affected individuals from 10 IBD families. The large number of coding variants that were identified across all individuals were prioritised for follow-up analysis by applying a series of stepwise variant filters.
First, within each family, we retained only those variants that were heterozygous, present in all WES-affected family members, and were protein-altering, i.e. those that introduce a missense, nonsense, or frameshift change in the protein structure or altered a canonical splice site, which left 21 127 variants. Next, we removed all common variants with an allele frequency reported to be equal to or greater than 1%. This filtering strategy resulted in 694 variants identified in 627 genes including 88 novel variants. Our final filter was to restrict the list of variants for follow-up to include only those genes that contained rare variants cosegregating with IBD in WES individuals in two or more IBD families [see Methods], thus providing more than one source of evidence for the implicated gene.
This resulted in 34, low-frequency, protein-altering variants in 17 genes responsible for a range of cellular functions including cell migration, metabolism, transcriptional activation, and the innate immune response [Supplementary Table 1 Figure 1A]. In family GS13, p.S361L was present in all three affected family members with CD and in two unaffected individuals, and was absent in three unaffected relatives and an unrelated spouse who coincidentally also has CD. This suggests that if p.S361L is the causal variant in this family, it is showing incomplete penetrance. In family GS64, the p.R801H variant was present in three of four affected relatives [two with CD; one with UC] and in one unaffected family member. The absence of p.R801H in one affected relative [ Figure 1A] suggests that one or more other variants may have contributed to the development of IBD in this individual.
The two low-frequency NLRP7 variants were next followed up by genotyping in a large panel of 6025 unrelated IBD cases [excluding individuals from the families that underwent WES] and 7238 population controls, to test for association with IBD. Both variants had a high call rate [> 99%] and were in Hardy-Weinberg equilibrium in both cases and controls. The minor A allele that encodes the variant leucine residue of NLRP7 p.S361L had a higher frequency in all IBD subgroups compared with controls [Freq [CD] = 0.100%, Freq [UC] = 0.200%, Freq [cont] = 0.035%] and was significantly associated with UC (P = 3.9 × 10 -3 , odds ratio [OR] = 4.79, 95% confidence onterval [CI]: 1.60 -14.00), also after correction for multiple testing [ Table 1]. The minor T allele that encodes the variant histidine residue of NLRP7 p.R801H also had a higher frequency in IBD cases compared with controls, but failed to achieve significance on its own [ Table 1]. Owing to the low frequency of both of the variant alleles in the population, we performed a combined analysis of their cumulative frequency in IBD cases compared with controls, and this was significantly different, although not after correction for multiple testing (P = 0.018, P corrected = 0.054 [OR = 2.77, 95% CI: 1.14 -6.75]) [ Table 1].
The variant p.S361L maps to the NACHT domain of the NLRP7 protein, whereas the p.R801H missense substitution is located in the leucine-rich repeat domain [ Figure 1B]. The p.S361L variant was predicted to be 'deleterious' by the CADD pathogenicity software tool [CADD score 23.4], 'damaging' by SIFT, and 'probably damaging' by PolyPhen-2. The p.R801H variant was predicted to be 'tolerated' [CADD score 13.99], 'neutral' by SIFT, and 'benign' by PolyPhen-2.
The remaining 32 variants in 16 genes that were identified by our WES and filtering strategy [Supplementary Table 1] were followed up by examination of their frequency distribution in a recent large GWAS meta-analysis of 16 267 UK IBD cases and 18 843 UK population controls. 11 Of the 10 variants that were detected in the GWAS meta-analysis study and passed quality control, two showed evidence of association with IBD [P < 2 × 10 -3 ]. These were both in the TRIM31 gene [Supplementary Table 1]. TRIM31 is located on chromosome 6p within the human major histocompatibility complex [MHC] region. One of the associated variants, rs62624473, is a splice site substitution, and the minor allele [freq = 0.9%] is associated with a decreased risk for UC and IBD [P = 0.00174 and 0.00172, respectively; OR = 0.55 and 0.86, respectively] in the GWAS-meta analysis data. This is in contrast to the observed co-segregation of the rare allele with UC in family 107, suggesting that this variant is unlikely to be causal of disease. The second variant in TRIM31 is the missense change p.C48R, rs140451451. This variant is very rare [0.07%], but shows evidence of association with UC and IBD in the GWAS meta-analysis data [p = 2.82 × 10 -05 and 5.81 × 10 -04 , respectively]. It has a low CADD score [5.416], and is predicted to be 'possibly damaging' and 'tolerated' by SIFT and Polyphen2, respectively. Also, we cannot exclude the possibility that  its association with IBD is a result of its correlation with common variants at the nearby HLA-DRB1 gene in the human MHC region, where there is known to be extensive linkage disequilibrium and multiple IBD-associated haplotypes. 15

Discussion
Whole-exome sequencing of affected individuals in IBD families identified 17 genes in which rare coding variants were present in more than one family. None of these genes have thus far been found to be mutated in monogenic disorders associated with intestinal inflammation, 16 and only one, NLRP7, is located in one of the 240 loci known to be associated with IBD. This suggests that rare, highly penetrant, coding variants in genes from GWAS loci may not play a substantial role in the UK IBD families we have sequenced. The discovery of at least one significant association of a low-frequency coding variant in NLRP7 [p.S361L] with IBD suggests that this and potentially other variants in this gene may predispose individuals to IBD. This is further supported by the prediction that p.S361L would damage the function of the NLRP7 protein. The NLRP7 variant p.R801H may be an incidental finding and not causally related to IBD in family GS64. However, the co-segregation of the variant allele with three of four individuals with the IBD phenotype in this family, and the association of both NLRP7 variants with IBD in the combined analysis, warrant further exploration of these and additional NLRP7 variants in very large case-control collections and additional IBD families.
The NLRP7 gene is located on chromosome 19 at position chr19:55,434,877-55,458,873 [hg19] within a known IBD locus. The region was initially identified by GWAS 1 to be significantly associated with IBD [p = 6.5 × 10 -11 ], and spans multiple genes including NLRP7, NLRP2, KIR2DL1, LILRB4 and 15 others. The top associated SNP in the region is rs11672983, although the two missense variants identified in this study are not in linkage disequilibrium with this SNP [r 2 = 0]. This locus was not included in the recent finemapping study by the International IBD consortium. 4 Whereas NLRP7 variants were identified in families that contained individuals affected by both CD and UC or by CD alone, we were able to detect a significant association signal only for p.S361L in UC. However, these findings do not preclude a role for NLRP7 in CD. It is possible that the lack of association in this study may be because the effect size is smaller in CD [OR = 1.49] and thus lacked sufficient power to detect the smaller effect.
It has been reported that the p.S361L variant is more prevalent in the Ashkenazi Jewish [AJ] population [MAF = 0.0438] compared with White European populations [MAF = 0.0004] by the gnomAD database [http://gnomad.broadinstitute.org/]. This is of interest, as IBD is more prevalent in the AJ population [17][18][19] ; however, there has been no reported association of NLRP7 or this region of chromosome 19 with IBD in this population to date, 20 and none of the IBD patients analysed in this association study had reported Jewish ancestry. The lack of an association of this variant with IBD in the Jewish population could reflect the fact that NLRP7 variants are challenging to genotype because of homology with the nearby gene NLRP2, or that environmental components such as commensal bacteria and other factors that act through the NLRP7 pathway are also likely to play a substantial role in the aetiology of IBD in this population. The potential role of environmental factors is also consistent with the incomplete penetrance of the NLRP7 variants in families GS13 and GS64.
The NLRP7 gene encodes the NACHT, leucine-rich repeat and pyrin domain containing 7 protein, which is a member of the nucleotide oligomerisation domain [NOD]-like receptor family of proteins.
These are a family of pattern recognition receptors which include the known CD-risk gene NOD2 and are sensors of pathogen-associated molecular patterns [PAMPs] involved in the innate immune response, apoptosis, and tissue damage. Over recent years there has been increasing evidence for the importance of these receptors in mucosal immunity 21 and IBD, 22 with colonic expression of a number of NLRs, including NLRP7, being shown to be significantly altered in patients with active IBD. 22 NLRP7 expression is induced in response to LPS and IL-1β in peripheral blood mononuclear cells [PBMC] and is detected at high levels in thymus, spleen, and bone marrow, suggesting a role in host defence. NLRP3-mediated inflammasome assembly has been demonstrated to be important in both human and mouse models of IBD, 23 and similarly it is known that the NLRP7 protein can promote both positive and negative regulation of inflammasome activity. 24 For example, it is required for bacterial acylated lipoprotein-mediated caspase-1 activation and maturation of IL-1β and IL-18, but has also been shown to inhibit NLRP3 and caspase-1-mediated IL-1β release. This contradiction may reflect a role for NLRP7 in preventing inflammasome formation and IL-1β release in quiescent cells while activating the inflammasome in response to bacterial infection. 25 In either case, this would be consistent with the association of NLRP7 mutations with immune-mediated disorders such as IBD. Mutations in NLRP7 also contribute to the development of hydatidiform mole in abnormal human pregnancies, which may involve inflammation-dependent or independent functions. 26 In conclusion, we propose that rare coding variants in NLRP7 may contribute to the development of IBD. Further work will be required to demonstrate the functional effects of these mutations in the context of intestinal inflammation, and to examine more generally the contribution of aberrant regulation of the inflammasome in the pathogenesis of IBD.