Integrating population variation and protein structural analysis to improve clinical interpretation of missense variation: application to the WD40 domain

We present a generic, multidisciplinary approach for improving our understanding of novel missense variants in recently discovered disease genes exhibiting genetic heterogeneity, by combining clinical and population genetics with protein structural analysis. Using six new de novo missense diagnoses in TBL1XR1 from the Deciphering Developmental Disorders study, together with population variation data, we show that the β-propeller structure of the ubiquitous WD40 domain provides a convincing way to discriminate between pathogenic and benign variation. Children with likely pathogenic mutations in this gene have severely delayed language development, often accompanied by intellectual disability, autism, dysmorphology and gastrointestinal problems. Amino acids affected by likely pathogenic missense mutations are either crucial for the stability of the fold, forming part of a highly conserved symmetrically repeating hydrogen-bonded tetrad, or located at the top face of the β-propeller, where ‘hotspot’ residues affect the binding of β-catenin to the TBLR1 protein. In contrast, those altered by population variation are significantly less likely to be spatially clustered towards the top face or to be at buried or highly conserved residues. This result is useful not only for interpreting benign and pathogenic missense variants in this gene, but also in other WD40 domains, many of which are associated with disease.


Introduction
Understanding the impact of missense variants in known disease genes is a major challenge for the clinical application of genomics (1,2). A handful of well-known disease genes [such as CFTR (3) and TP53 (4)] have been extremely well studied over several decades through both research and clinical genetic testing, and multiple known pathogenic missense variants have been individually characterized in silico, in vitro and in vivo. However, the rate of gene discovery has grown exponentially since the completion of the human genome sequence (5): nearly 3500 suspected disease genes are currently listed in OMIM, many of which have been discovered through exome sequencing of patients with rare diseases (6) with rare nonsense or protein-truncating mutations. Many such genes are, as yet, unstudied and very sparsely populated with known pathogenic (or benign) missense variants, so most rare missense variants identified in these genes are likely to be novel. Meanwhile, massively parallel sequencing technologies are increasingly being used for clinical genetic testing in the form of multigene panels, exome sequencing and even whole-genome sequencing (7). As a result, a plethora of previously uncharacterized missense variants are being discovered regularly in known disease genes (8)(9)(10), where the consequence for protein structure, cellular processes or disease aetiology is unclear, severely compromising their clinical utility.
The increasing availability of exome sequencing and wholegenome sequencing in research means that the pervasiveness of normal genetic variation is starting to become clear. A normal human genome contains three to four million variants, of which approximately 10 000 will be non-synonymous variants in coding exons predicted to cause a missense change, altering a single amino acid in the resulting protein (11,12). However, despite the fact that missense variation is extraordinarily commonplace, most genes still do not yet contain sufficient confirmed pathogenic and benign missense variants upon which to build detailed specific models to understand and accurately predict their relationship to human disease. Although numerous increasingly useful pathogenicity predictors exist (13)(14)(15)(16)(17)(18), they generally have low specificity (9,19) and are based on sequence alignments that often exclude detailed knowledge of three-dimensional (3D) protein structure. However, as the same structural domain is commonly present in different proteins, encoded by different genes, and associated with different diseases, a method heavily informed by protein structure analysis is likely to yield insights across multiple genes and diseases.
Sequence data on normal population variation coupled with high throughput exome/genome sequencing of patients with rare diseases offer the perfect opportunity to investigate whether there are systematic differences between pathogenic and benign missense variants at an individual gene or protein level. Here, we use novel diagnostic de novo mutations identified through the Deciphering Developmental Disorders (DDD) study (20,21) as an example to explore the application of detailed protein structure analysis to the understanding of disease. As a proof of principle, we focus here on the WD40 domain, one of the most abundant structural domains in eukaryotic genomes (22). Different WD40-containing genes have already been associated with multiple diseases (23,24), including TBL1XR1 [transducin (beta)-like 1 X-linked receptor 1], in which haploinsufficiency has recently been linked to autism spectrum disorders (25,26) and developmental delay (27)(28)(29) (OMIM no. 608628). The encoded TBL1-related protein 1 (UniProt ID Q9BZK7) is involved in a transcription signalling pathway and comprises two structural domains: an LisH domain (30) and a WD40 β-propeller domain (31). Here, we use this gene to investigate the value of integrating population variation and protein structural analysis to improve clinical interpretation of missense variation.

Results
Six children within the DDD study were found to have likely pathogenic de novo mutations in TBL1XR1, including five single nucleotide variants predicted to cause a missense change, and one 1 bp frameshift insertion predicted to result in loss of function through truncation or nonsense-mediated decay ( Table 1). Two additional likely de novo missense mutations have also been published in children affected by developmental disorders (25,28), as well as a de novo 1 bp frameshift deletion (25). A number of whole gene deletions have also been described (27,29).
Children with likely pathogenic mutations in TBL1XR1 have developmental delay often with autistic features ( Table 1). All patients have marked expressive speech and language delay as the most consistent feature, and most have special needs requiring specialist educational assistance. In addition, most of the children identified via the DDD study have gastrointestinal disturbance or constipation. Although a number of patients have dysmorphic features, a preliminary assessment of facial photographs does not suggest an identifiable facial gestalt and growth parameters were typically within the normal range (Supplementary Material, Table S1). There are no apparent differences in either the phenotypes or severity of the children with missense mutations versus those with truncating mutations and gene deletions, potentially suggesting a common loss of function mechanism.
Although TBL1XR1 is a highly constrained gene [Exome Aggregation Consortium (ExAC), Cambridge, MA, USA; http://exac. broadinstitute.org/; accessed December 2015], we were able to identify 64 unique germline population missense variants in TBL1XR1 in population controls, in which benign variants are expected to be relatively enriched and pathogenic variants relatively depleted for rare childhood onset dominant disorders with obvious phenotypes. These variants were identified using multiple databases: the ExAC (http://exac.broadinstitute.org/; ac- All five DDD missense mutations and one published likely pathogenic mutation are located within the WD40 domain of TBLR1, in addition to 33 of the population missense variants (Table 2). Interestingly, we also identified 16 likely non-pathogenic missense variants in TBL1XR1 within the DDD cohort (where the variant is in, or inherited from, an unaffected parent), all of which either lie outside the WD40 domain or have already been observed in the population.
The WD40 domain of TBLR1 has a β-propeller structure consisting of eight propeller 'blades', each formed by a four-stranded antiparallel β-sheet, which are joined by β-hairpins. The blades are arranged symmetrically about a central axis, like the staves of a barrel, and β-catenin binds to the 'top' face of the propeller to promote the transcription of Wnt target genes (33) (Fig. 1B). A number of 'hotspot residues' have been identified previously (31) on the top face of the domain (34), which are likely to be involved in the protein's interaction with β-catenin. In addition, the amino acid sequence of each blade of the β-propeller in most WD40 domains, including that in TBLR1, exhibits a recognizable pattern of residues known as the WD40 repeat motif, with certain residue types favoured in specific positions. The PROSITE sequence logo (35) for this motif is shown in Figure 2A, in which taller letters identify the highly conserved residues that are important for stabilization of the blade's structure. The TBLR1 protein has six complete tetrads and one incomplete tetrad that is missing the tryptophan residue (Fig. 2B). Of note in the logo are the histidine, serine/threonine, aspartic acid and tryptophan residues at motif positions 4, 22, 26 and 32, respectively. These form the Asp-His-Ser/Thr-Trp (DHSW) tetrad-a network of 'unusually strong' hydrogen bonds that maintains the domain's thermostability (37) (Fig. 3). The aspartic acid at motif position 26 is present in all eight blades and plays an especially important role in stabilizing the beta-hairpin structure at the top of each blade via two hydrogen bonds to the main chain nitrogen atoms of adjoining strands. An experimental study in 2010 showed that, although mutations to the tetrad residues maintained the domain's 3D structure, as evidenced by crystal structures of the mutant proteins, the stability of the proteins was severely affected (37), potentially interfering with folding or function.  The five DDD missense mutations are His213Gln, Asp328Gly, Asp370Tyr, His441Arg and Pro444Arg. The first four involve histidine and aspartic acid residues from different symmetrically repeated DHSW tetrads, at positions 4 and 26 in the WD40 motif (Fig. 2B), so that their change is likely to disrupt the stability of the protein's fold. Of particular interest is the highly conserved aspartic acid at position 26 in the WD40 motif, which can hydrogen-bond to the tetrad's histidine and also to a main-chain nitrogen on the preceding propeller blade (Thr349 in Fig. 3), and to a main-chain nitrogen two residues down (Thr372 in Fig. 3). In their native state, both are structurally stabilizing interactions, helping to hold the propeller together. The latter interaction helps maintain the beta turn that joins the two beta strands either side of the Asp. The only non-DDD likely pathogenic missense mutation identified is Leu282Pro, which is at position 21 in the WD40 motif, adjacent to a DHSW tetrad, where addition of a proline residue likely alters the packing of the strands sufficiently to alter the hydrogen bond network inside the tetrad. The fifth of the DDD mutations, Pro444Arg, occurs at position 7 in the WD40 motif (Fig. 2B). This is not a highly conserved position, although there are three proline residues at this position in TBLR1. Here, the fact that the amino acid is on the domain's top face (Fig. 4), coupled with the dramatic nature of the change, is likely to be responsible for the deleterious effect of the mutation. The mutation places a large, charged arginine at the protein-protein interface, and this potentially interferes with, or disrupts, the interaction required for the protein's function.
To evaluate the structural impact of the missense mutations in this domain further, the six amino acids with likely pathogenic missense mutations in the WD40 domain were compared with the 29 amino acids with benign population missense variation ( Table 2). The location of these amino acids along the Z-axis of the protein structure in PDB entry 4lg9 was analysed, i.e. through the middle of the β-barrel, from the top to bottom face (Fig. 4A and B), indicating that the disease-associated amino acids are clustered in 3D space and significantly different-closer to the top binding face-from those associated with presumed benign variation (P = 9 × 10 −5 , Fig. 4C). In addition, disease-associated amino acids were also predicted to be significantly different from those associated with benign variation using PolyPhen (15) (P = 2 × 10 −8 ), SIFT (18) (P = 7 × 10 −4 ), solvent-exposed surface area (38) (P = 2 × 10 −7 ) and residue conservation (39) (P = 8 × 10 −4 ), but did not differ significantly from the tetrad or top face hotspot residues.

Discussion
We have used the 3D structure of the WD40 domain encoded by the gene TBL1XR1 to understand and characterize the differences between likely pathogenic de novo missense mutations detected in children with severe developmental delay and presumed benign missense variation seen in population samples and the ExAC data set. Although the variants are predicted to result in missense changes, the true biological effect on the resulting protein is unknown. As has been observed previously across all proteins (40), the likely pathogenic mutations in TBLR1 are generally at more buried and conserved sites when compared with population variation. When the structure of the WD40 domain of this protein is considered in detail, there is notable clustering in 3D space, with likely pathogenic mutations more likely to be near the top face of the domain. Specifically, likely pathogenic mutations in TBLR1 all affect either the structural rigidity of the WD40 domain β-propeller, compromising the stability of the fold, or the physicochemical characteristics of the top face of the β-propeller, affecting the binding of β-catenin. The overlap of these diagnostic variants with the previously identified symmetrically repeating DHSW tetrads and top face hotspot residues (34) allows us to make strong predictions about the location of other likely pathogenic The WD40 domain is one of the top 10 most abundant domains in eukaryotic genomes, although rarely present in prokaryotes (22). Its primary role appears to be in making protein-protein interactions, which it can make simultaneously with several different proteins, particularly in relation to forming and regulating protein, DNA or RNA complexes (22,41). A number of diseases are known to be associated with mutations in WD40 domains (23,24), including numerous developmental phenotypes such as lissencephaly (42), short-rib thoracic dysplasia (43) and reduced neuronal migration (44). Twenty-one proteins containing such disease-associated mutations are listed in Supplementary Material, Table S2, with their corresponding locations in the WD40 motif.
As next generation sequencing of gene panels and whole exomes/genomes is increasingly applied in both research and clinical settings, more and more benign and likely pathogenic missense variants will be uncovered in known disease genes as well as in novel disease genes. Although in silico predictions alone should not be relied on as the sole basis to determine the clinical significance of missense variants in proteins, we hope that the analysis used in this study provides useful structural evidence for variant interpretation. Moreover, combining clinical and population genetics with protein structural analysis offers widely applicable in silico method for improving the clinical interpretation of novel missense variation.

Materials and Methods
The DDD study was approved by the UK Research Ethics Committee (10/H0305/83, granted by the Cambridge South REC, and GEN/ 284/12 granted by the Republic of Ireland REC), and appropriate informed consent was obtained from all participants. Patients meeting the recruitment criteria (neurodevelopmental disorder and/or congenital anomalies, abnormal growth parameters, dysmorphic features and unusual behavioural phenotypes) were recruited to the DDD study (www.ddduk.org) by their UK NHS and Republic of Ireland Regional Genetics Service, who also recorded clinical information and phenotypes using the Human Phenotype Ontology (45) via a secure web portal within the DECIPHER database (46). DNA samples from patients and their parents were analysed by the Wellcome Trust Sanger Institute using high-resolution microarray analysis (array-CGH and SNP-genotyping) to investigate copy number variations in the child and by exome sequencing to investigate single nucleotide polymorphisms and small insertions/deletions (indels). Putative de novo sequence variants of interest were validated in-house using either targeted Sanger sequencing or MiSeq sequencing. All genomic variants were annotated with the most severe consequence predicted by Ensembl Variant Effect Predictor (47) and their minor allele frequencies observed in diverse population samples. As has been described previously (20), likely diagnostic variants were fed back to referring clinical geneticists for validation in an accredited diagnostic laboratory and discussion with the family via patients' record in DECIPHER, where they can be viewed in an interactive genome browser.
In a data set of the first 4295 family trios (child, mother and father) with exome sequence data, we investigated genes already robustly implicated in developmental disorders with more than three de novo mutations in DDD children, where the consequence was predicted to result in different missense changes. We cross-referenced this list against the Protein Data Bank (48) to limit our analysis to genes with solved protein structures and further refined the list to those where all missense changes lay within a high-quality crystal structure from the human-derived protein. We further excluded metalloproteins and enzymes in which the missense variants clustered in the catalytic site, and here we limit our discussion to just one gene, TBL1XR1, a fairly recently identified developmental disorder gene (25)(26)(27)(28)(29), in which multiple likely pathogenic missense mutations were found in DDD that map onto a 3D protein domain structure.

Supplementary Material
Supplementary Material is available at HMG online.