Familial hypercholesterolemia (FH), a genetic disorder with a prevalence of 0.2%, represents a high-risk factor to develop cardiovascular and cerebrovascular diseases. The majority and most severe FH cases are associated to mutations in the receptor for low-density lipoproteins receptor (LDL-r), but the molecular basis explaining the connection between mutation and phenotype is often unknown, which hinders early diagnosis and treatment of the disease. We have used atomistic simulations to explore the complete SNP mutational space (227 mutants) of the LA5 repeat, the key domain for interacting with LDL that is coded in the exon concentrating the highest number of mutations. Four clusters of mutants of different stability have been identified. The majority of the 50 FH known mutations (33) appear distributed in the unstable clusters, i.e. loss of conformational stability explains two-third of FH phenotypes. However, one-third of FH phenotypes (17 mutations) do not destabilize the LR5 repeat. Combining our simulations with available structural data from different laboratories, we have defined a consensus-binding site for the interaction of the LA5 repeat with LDL-r partner proteins and have found that most (16) of the 17 stable FH mutations occur at binding site residues. Thus, LA5-associated FH arises from mutations that cause either the loss of stability or a decrease in domain's-binding affinity. Based on this finding, we propose the likely phenotype of each possible SNP in the LA5 repeat and outline a procedure to make a full computational diagnosis for FH.
The low-density lipoprotein (LDL) receptor (LDL-r) belongs and gives name to an ancient family of membrane receptors, including very-low-density lipoprotein receptor (VLDL-r), ApoER2, LRP1, LRP2 and LRP6 (1), that appeared early with the onset of the first metazoans and play important roles in multiple biological processes, through binding to a diverse set of partners (2,3). The receptors of the LDL-r family contain a common set of structural constituents that, from C-terminal to N-terminal, include (a) a cytoplasmic region that embodies NPxY and PPPSP motifs, (b) a single-transmembrane segment anchoring the cytoplasmic and extracellular sections to the cell membrane and (c) an extracellular region formed by an epidermal growth factor (EGF)-like domain composed by several EGF repeats and a β-propeller domain, followed by a ligand-binding region consisting of a variable number of small cysteine-rich domains (1,3), known as LDL-r type A domains (LA domains). The ligand-binding region of the human LDL-r (4) has been widely studied to uncover the mechanism of endocytic LDL internalization and release. These studies have provided a wealth of structural data corresponding to individual domains, domain pairs, the complete extracellular region (5–8) and even a low-resolution structure of the LDL–LDL-r complex (9). Domains LA1–7 (10) and, most importantly, domains LA4–5 (2,11), are key for binding of VLDL and LDL particles (12). LA domains are 40-residue, small autonomous folding units containing little secondary structure and lacking an extensive hydrophobic core, which are mainly stabilized by three disulfide bridges and a coordinated calcium ion (2,13–16). The seven LA domains are connected through small peptide linkers that provide great flexibility to the region (8).
Familial hypercholesterolemia (FH) is a genetic disorder associated to abnormally high levels of LDLs in the blood, which constitute a significant-risk factor for cardiovascular and cerebrovascular diseases (17–20), and has a prevalence of 1:500 in heterozygosis in human populations (21,22). Although FH can be caused by defects in several proteins linked to cholesterol internalization and metabolism in cells—e.g. Apo B-100 (23,24), PCSK9 (25,26) and the LDL-r (27,28)—the majority and most severe FH cases are associated to mutations in the LDL-r (29), some of which have been shown to compromise the conformational stability of specific receptor domains (30). In spite of the high-prevalence worldwide, FH is under-diagnosed and under-treated (31), probably because of the complexity for connecting genetic variations and disease phenotype. Since the recognition of the association of FH to genetic variations in the LDL-r and the discovery of the first disease-causing mutations (32–34), much information has been gathered on different types of mutations in the LDL-r gene (28,35–38). The number of known mutations for the LDL-r is between 1741 and 1835, according to current releases of the LDL-r database (28) and the Professional version of the human gene mutation database (37), respectively. Genetic variations found in this protein include large rearrangements of coding and/or intronic regions, synonymous and non-synonymous substitutions and mutations in the regulatory regions or splicing sites (28,37). Missense substitutions are by large the most frequent type of mutation (28), and are unevenly distributed along the LDL-r gene sequence. A higher accumulation of genetic variations has been reported in exons coding for the ligand-binding region, particularly in exon 4, coding for LA domains 3–5 that are key for LDL binding (2,11,12).
The development of high-scale DNA genotyping and sequencing methodologies (39–42) has stimulated an increase of cascade screening programs in partial populations of some countries and high-risk groups (40–44). Unfortunately, due to a lack of knowledge on the phenotypic effect of most mutations, standard genotype analyses are centered only in a reduced set of known pathological mutations, limiting the predictive power of these techniques (45). Furthermore, the lack of understanding of the molecular basis of the pathological effect of LDL-r mutations limits our ability for devising novel therapies for treating FH. To gain such molecular insight in vitro and in silico studies have been performed on different domains or on the complete LDL-r-binding region (2,5–7,13–16,46–49). These studies provide insights to relate the severity of mutations with structural or stability impairments in the LDL-r. However, we are still far from having a complete molecular-level description of the connection between genetic variations in the LDL-r gene and the severity of FH phenotypes.
In this work, we have performed massive atomistic simulations for predicting the fate of all possible missense mutations in the key LA5 LDL-r lipoprotein-binding domain. Thus, we have generated all the possible mutants arising from non-synonymous single nucleotide polymorphism (SNPs)—i.e. 256 SNPs coding for 227 different mutants (Supplementary Material, Fig. S1)—and have run MD simulations to assess the distortions caused by single-amino acid substitutions. When the structural fluctuations observed during MD trajectories are moderate, we classify the corresponding mutations as not destabilizing. Conversely, they are referred to as destabilizing when they cause a significant distortion of the LA5 domain structure. However, the data we use for classifying mutants in one of these two categories should not be considered as quantitative estimations of changes in the thermodynamic stability of the protein, which is not measured in this study. A total of 4.5 microseconds of MD simulations corresponding to relaxation trajectories have been analyzed using a variety of data-mining methodologies. The use of these analyses, together with our proposal of a consensus-binding site of the LA5 domain based on structural information obtained in different laboratories (8,50), allows us to explain the pathological nature of most mutations described for the LA5 domain as either arising from structural destabilization of its tridimensional structure (Supplementary Material, Fig. S2), or from replacement of binding site residues. The results presented here provide a simple approach for ab initio prediction of the putative pathological impact of new mutations, opening a way for an early diagnosis and treatment of FH.
Global conformational instability of the LA5 domain mutant variants
The 227 mutants resulting from SNPs in the LA5 domain (Supplementary Material, Fig. S1 and Table S2) were generated in silico using SCWRL (51), which was also used to find the best rotamer conformations for the side chains of the substituted residues. All mutants were minimized and equilibrated in explicit water and MD simulations were extended for a total aggregated time of 6 µs, including preparation and production trajectories (see the ‘Materials and Methods’ section). For each mutant 20 ns-long MD simulations were used to explore its conformational evolution upon mutation. Based on previous work (49), this time span was considered enough to permit significant relaxation from the initial wild-type-like structure. A parameter commonly used to evaluate the structural similarity of two proteins, or of two protein conformations along an MD trajectory, is the root mean square deviation (RMSD) or average distance between the atoms of the superimposed structures. However, some characteristics of the RMSD (it is length-sensitive and it is not normalized) are not ideal to measure structural similarity. On the other hand, the template modeling score (TM-score) (52,53), constitutes a protein-length independent metric, which allows performing thorough structural comparison for identifying topology relationships among related proteins. It has been demonstrated, from a comprehensive comparison of proteins from different folding categories, that TM-scores close to 1 correspond to proteins belonging to the same fold topology, while for proteins with different structural topologies values <0.5 are obtained (52). We, thus, followed the dynamical evolution of different mutants using the TM-score. The evolution of the TM-scores for a selection of the 227 trajectories corresponding to some of the mutations found in individuals with FH is shown in Supplementary Material, Figure S3. The extent of the conformational change varies greatly among mutations. For example, the substitutions C197(176)Y; F200(179)L, C; C204(183)S, Y; S206(185)R; D221(200)Y and D227(206)V (see Supplementary Material, Tables S1 or S2 for details on the numbering) cause great conformational change in this timescale leading to conformations containing significant structural distortions after a few nanoseconds of simulation. In contrast, other FH known mutations such as S198(177)L; C209(188)Y; H211(190)D; W214(193)S; D224(203)G; C231(210)R and C231(210)Y cause only mild or apparently no distortions to the structure of the LA5 domain during the simulations. To evaluate the degree of conformational distortion associated to each mutant we have performed principal component analysis (PCA) of all trajectories (54–56). PCA is a standard statistical procedure (see the ‘Materials and Methods’ section) for performing transformations of multivariate data for identifying correlation among variables in the initial data set. This approach has been widely used to analyze MD simulation data (54,57), allowing the decomposition of the trajectories into simple uncorrelated motions and the identification of the more relevant ones. Those essential motions (principal components) constitute highly compressed representations of entire trajectories and offer convenient ways to visually represent and compare different trajectories. When dealing with large numbers of long trajectories, analysis and comparison using principal components is much easier and more objective than using conventional structural analyses of individual trajectories. A summary of the analysis of the 227 mutant trajectories is included in Supplementary Material, Figure S4. The ‘Scree Plot’ in this figure for the 1st to 30th components shows that the first three ones describe on average 36, 16 and 9% of the total variance (i.e. they amount for >60% of the explored MD variance). Those three eigenvectors have been selected to describe the ‘essential dynamics’ of the systems (54,58–61).
After performing the PCs decomposition of all trajectories, we have made a pair-wise comparison of the average structures of different mutants. As shown in Figure 1, while many mutants display fairly similar average structures, some groups of mutants are structurally dissimilar to most others, e.g. trajectories M009–M010, M029, M049–M053, M153–M155, M161–M164 and M191–M197 (see Supplementary Material, Table S2 for the correspondence among trajectory indexes, codon change and amino acid substitutions), The conformational evolution of some of these dissimilar mutants along the trajectories, as shown by the projections of each trajectory into the first three PCs, is depicted in Figure 2, Supplementary Material, Figure S5 and Videos S1–S5. The charts in Figure 2 show the projections of the conformations visited during a simulation into the space formed by the first three PCs. The values of the projections provide a measure of the similarity of the structure at a given time step with the average structure of the simulation, located at the origin . For a Boltzmann-weighted ensemble of molecules that behave harmonically, a ‘Gaussian’ distribution of projections in each dimension is expected, which correspond to an ellipsoid centered at the origin, with the eigenvectors corresponding to the axes of the plot, and the dispersion of the ellipsoid length in the ith PC being proportional to the ith eigenvalue. Deviations from this expected behavior could be taken as a qualitative estimate of the conformational change caused by the mutation. The significant conformational change caused by some amino acid substitutions on the LDL-r LA5 structure is clear when Figure 2 (and Supplementary Material, Fig. S5 and Videos S1–S5) is compared with the plots corresponding to background mutants (Fig. 3, Supplementary Material, Fig. S6 and Videos S6–S10) that has little or no effect on the conformation of the LDL-r LA5 domain.
Clustering of LA5 mutants according to the extent of conformational instability
Among the 227 trajectories collected some are stable, but there are also unstable trajectories that do not follow a multivariate normal distribution in the PC space. In order to perform a combined analysis of all the trajectories, we have concatenated the last 10 ns of each one into a meta-trajectory. This provides a common PC space and Eigensystem where the independent trajectories can be compared. Within this new reference system, we have used a sampling methodology to compare the essential subspaces of the trajectories. By randomly comparing subsets of each simulation against each other (105 random comparisons), it was possible to obtain a statistical assessment of the mean distance between the essential subspaces visited by the different mutants and the wild-type LA5 domain. The Mahalanobis distance metric (62) was used for comparing and clustering the mutants according to the extent of conformational instability (see the ‘Materials and Methods’ section and Fig. 4), which allowed to objectively establish a link between conformational instability and the likelihood of the expression of a LDL-r variant with an impaired function towards LDLs interaction (2,11,49).
The results in Figure 4 provide an intuitive picture of the clustering of the trajectories in a 25th-dimensional space, sufficient to describe 95% of the conformational variability. As expected, the projections for the trajectories of the wild-type LDL-r LA5 domain and of different stable mutants explored an ellipsoidal subspace close to the PCs origin, and consequently they formed a stable cluster of mutants shown in green. This is the largest cluster, comprising 114 mutants (for a detailed list of the mutants see the color codes in Supplementary Material, Table S2). Three additional clusters appeared that are formed by 57 unstable mutants (orange cluster), 34 very unstable mutants (magenta cluster) and 22 highly unstable mutants (red cluster), as judged from the conformational changes experienced. So far, 50 disease-linked SNP have been found in the LDL-r LA5 domain of FH individuals (28,37). These pathological mutations appear unequally distributed among the four stability clusters. The large stable cluster only includes 17 (34%) of the 50 known FH mutations, the remaining 33 being distributed among the three unstable clusters (Fig. 4). The stability of the LA5 structure in the 17 known diseased-linked SNPs classified by us as ‘stable’ mutations can be appreciated in Supplementary Material, Figure S7 and Videos S11–S15. We observe two trends in the structural distribution of mutants among the four clusters. The percentage of buried mutations increases from 52% in the stable cluster to 72, 88 and 86% in the progressively more unstable clusters, and a similar increase is observed in mutations affecting Ca++ coordinating residues or cysteines (from 25 to 25, 50 and 63%). It appears that, in agreement with structural/energetic expectations, mutations in buried residues and in those contributing to structural loci are, on average, more destabilizing that others.
Direct assessment of the effect of all possible SNPs in the structural stability of the LDL-r LA5 domain
Conformational diseases (63–66) are pathological states arising from alterations in protein conformational or binding equilibria driven by changes in physicochemical conditions or by mutations. In FH, an ample number of mutations identified by cascade screening assays (40,41,43,44) have been reported to be genetic determinants of the disease. Although there have been attempts to experimentally assess the effect of mutations in some LDL-r domains (5–7,13–15,46–48), we are far from being able to experimentally explore the consequences of all the biologically accessible mutations. Only a small fraction of all possible mutations in the LDL-r has been catalogued and reported in genetic variation and sequence databases (28,35–38). In an effort to fill the gap between possible genetic variability and knowledge of the associated phenotype, many computational methodologies have been developed (67,68). Those methods, mainly based on indirect genetic, structural or evolutionary assumptions, cannot always anticipate the real effect of the amino acid substitution at the structural level, a key information for predicting whether the mutation might cause a perturbation in the protein conformational equilibrium, and also a crucial step to derive structural information for more efficient drug-design protocols. We present here an alternative ‘de novo’ computational strategy based on the analysis of relatively short, all-atom MD simulations of the protein under physiological conditions. This is to our knowledge, the first complete exploration of the effect of all biologically accessible mutations caused by SNPs in the structure of a protein domain: the LDL-r LA5-binding domain, to determine whether and how FH might be etiologically related to genetic variations.
We have generated all possible mutants arising from SNPs in the cDNA for the LDL-r LA5-binding domain (Supplementary Material, Fig. S1), performing for all of them atomistic MD simulations under physiological conditions. In this ‘exhaustive’ approach, instead of concentrating on explaining previously identified specific amino acid substitutions, we have explored the entire SNP mutational landscape of a key domain known to establish functional interactions with LDLs (2,11), and encoded in the LDL-r exon bearing the highest proportion of mutations identified in individuals with FH. Inspection of the 3D structure of the LA5 domain (Supplementary Material, Fig. S2) allows identifying important structural loci on which the substitution of an amino acid could be accompanied by significant conformational changes. However, our results point out to a context-dependent scenario where the specific substitution, not just the locus, determines whether the structure of the LA5 domain is significantly affected. This finding is illustrated in Supplementary Material, Figure S3 with a selection of trajectories corresponding to FH mutants (28,37). Their TM-scores (52,53) prove that, even in mutants involving residues from the calcium coordinating box or disulfide bond-forming cysteines, some simulations show stable evolutions along time while, for others, the conformational stability appears to be significantly affected. These results are not obvious and could not be easily predicted by standard trained methods (67–71).
In order to define quantitative criteria for a global comparison of the conformations visited by different types of mutants during MD simulations we performed PCA of each trajectory, and the structural differences among the corresponding average structures were calculated using the TM-score (Fig. 1). The presence of clusters of mutants structurally different from the rest of variants suggests the possibility of grouping mutants according to the extent of the instability introduced by the amino acid substitution. The PC representations depicted in Figure 2, corresponding to destabilizing mutants most of whom are associated to FH (Supplementary Material, Table S2), graphically confirms the instability caused by these mutations. In contrast, Figure 3 shows several examples of stable evolutions around the average structure, including mutations such as C209(188)W in an important structural locus, which reaffirms the need for specific mutation testing versus general predictions based on structural location.
FH mutations in the LDL-r LA5 domain due to loss of conformational stability or binding competence
A main goal of this work is to devise a quantitative way of identifying SNP mutations that could destabilize the structure of the LDL-r LA5-binding domain and to group different types of mutations according to the extent of destabilizing effects. Analysis of the meta-trajectories of the 227 mutants allows comparing, within a single Eigensystem, the conformational subspaces more probably visited by each mutant. The clusters in Figure 4 provide an objective classification of the SNPs according to their effect in structural stability, and hence their possible pathogenicity (see also in Supplementary Material, Table S2 a color-coded classification of the destabilizing effect of each mutant). Of the 50 known FH mutations, 33 appear distributed in the three unstable clusters, indicating that loss of conformational stability explains two-third of FH phenotypes. According to our simulations, there appear to be hotspots in the structure of the LDL-r LA5 domain where SNPs are more likely to lead to substitutions compromising the conformational stability—e.g. C197, E201, C204, D221, C222, D224, D227 and E228 (Supplementary Material, Fig. S8). Conversely, some positions are unlikely to derive in unstable mutant proteins due to SNP—e.g. A199, L205, G219, N230 or A232.
On the other hand, 17 FH mutations (Supplementary Material, Table S2) are classified as stable, and the individual analysis of their trajectories (see examples in Supplementary Material, Fig. S7) confirms they do not significantly alter the conformational behavior of the LA5 domain. This is a clear indication that they will cause FH by a different mechanism. The simplest reason that can explain their relationship with FH is that those mutations impair the interaction of LA5 with its binding partners, either other domains from the LDL-r or other proteins involved in cholesterol homeostasis. There are three known partners of LA5: the β-propeller domain of the LDL-r, and the apolipoproteins Apo B and Apo E present in LDL and/or VLDL particles. At low pH (8), LA5 interacts with the β-propeller domain through residues H211, S212, W214, D217, G219, D221 and K223, clustered in the LA5 convex face (Fig. 5). Recent structural data from our group (50) have determined that the interaction between LA5 and key interacting helices of Apo E and Apo B also involves essentially residues at the convex face (W214, G218, G219, D221 and D227 in the complexes with Apo B and Apo E, plus E201 in the complex with Apo E). Those two sets of residues define a common patch in the convex face (Fig. 6A) that very likely constitutes the binding site used by LA5 to interact with different partners. In fact, the homologous LDL-r domains LA3 and LA4 use a structurally equivalent patch to recognize yet another partner, the receptor-associated protein (RAP) (6).
The 17 FH mutations in the stable cluster occur in 11 residues (P196, S198, E201, C209, H211, W214, C216, D221, D227, E228 and C231), 7 of which are surface exposed and contiguous, defining an extended but overlapping version of the binding site (Fig. 6B). Three of the four remaining residues, C216 and C231 forming the C-terminal most disulfide bridge, plus E228 at the calcium cage, are also surface exposed at one end of the convex face and appear to constitute a prolongation of the known binding site (Fig. 6B). Only D227 is buried. As expected, most substitutions at this residue are greatly destabilizing (Supplementary Material, Table S2), but its replacement by similarly charged glutamic acid, although reported to cause FH, does not lead to structural perturbations during the simulations. The likely cause of the FH character of D227(206)E is that its reduced Ca++ affinity impairs folding (13). Except for this mutation, the rest of the 17 non-destabilizing FH SNPs in the LA5 domain affect binding site residues. Thus, the pathogenicity of these mutations seems related to disruption of LA5-binding compatibility with other proteins. The underlying structural reasons can be either that the substitution abolishes important contacts with partner-binding residues—e.g. W214(193)S (8)—or that the newly introduced residue modifies the steric or physicochemical compatibility of the interacting patches [e.g. H211(190)D, Y, L (8,72)].
From molecular dynamics to a strategy for computational diagnosis in conformational diseases
Our results indicate that 33 out of the 50 known FH mutations at the LA5 domain are destabilizing and that, of the remaining 17 non-destabilizing mutations, 16 occur in residues at the surface-binding site. As it appears, computational estimations of conformational instability from short-range MD simulations, combined with experimental knowledge of the LA5 domain interacting residues, makes possible to anticipate the disease phenotype that has been observed in 49 of the 50 SNPs known to be related to FH. This result shows that our method has a remarkably high sensitivity (true positives rate) of 0.98 for classifying FH-causing SNPs. We have also tried to provide a measure for the specificity of the method (true negatives rate), but the lack of sequence data on neutral mutations (see legend of Supplementary Material, Table S2) impedes it.
We would like to propose a FH computational diagnosis for all possible SNPs in module LA5 that considers both the degree of conformational instability in the simulations and the possible impairment of the interacting region (Supplementary Material, Table S2). Of all the possible SNPs—i.e. 256 non-synonymous SNPs coding for 227 different single amino acid substituted variants—only 22% have been found in FH individuals (dot tagged mutants in Supplementary Material, Fig. S1). These variants are labeled as deleterious in Supplementary Material, Table S2. For the remaining 177 mutations not reported in genetic variation databases, those belonging to any of the three unstable clusters are also classified as deleterious. For those in the stable cluster, we have considered evaluating their possible binding impairing effects using qualitative structural criteria—e.g. steric and physicochemical differences between the wild-type and substituted residues. However, such a fine evaluation would be too preliminary given the limited structural information available for LA5/partner complexes, e.g. the structure of the LDL-r complete extracellular region (8) is of low resolution, and the interactions of LA5 with Apo B and E peptides are only defined in part (50) because the exact conformation of those peptides in the complexes is not known. Therefore, we have provisionally evaluated all mutations taking place in binding site residues as ‘deleterious’, which may increase the number of false-positives in this subset of our predictions. Our phenotype predictions in Supplementary Material, Table S2 can be compared with predictions calculated using different methodologies, such as PMUT (68), and a consensus approach, CONDEL (69), integrating the predictions made using SIFT (67,73), polyphen-2 (71) and mutation assessor (70,74). We have also included the predictions obtained using polyphen-2 (71), and the calculation of stability changes upon mutation obtained with FoldX (75). The comparison reveals clear discrepancies among the different predictions, and stability estimations in key structural loci, such as in cysteines or in Ca++-binding residues—e.g. in C197(176)S or E228(207)D as well as in many other mutations in the LA5 structure (Supplementary Material, Tables S2 and S3). Thus, depending on the predictive approach used, the conclusions drawn would be different. Moreover, the true positive rates obtained with PMUT, CONDEL and polyphen-2 for the classification of FH-causing mutations are 42, 76 and 80%, respectively (Supplementary Material, Tables S2 and S3), which shows that our structure-based method outperforms all these sequence-based approaches. Furthermore, we are not only able to correctly predict almost all FH-causing mutations, but also to differentiate mutations that cause the disease through the structural instability of the LA5 domain, and others associated to residues in the interaction site with other partner proteins and LDL particles. Though undoubtedly our approach is more computationally expensive and requires more data processing and analysis than others available for predicting deleteriousness of mutations (67–71,73,74), general advances in computation speed and specific improvement in MD simulations (76–78), together with the emergence of online services for performing client-based and high-throughput MD simulations (79–81), may facilitate the generalization of the approach presented here in the near future.
Our analysis also indicates that, together with LDL-r destabilizing mutations, a significant percentage of FH phenotypes are related to impairment of structural regions mediating protein/protein interactions (e.g. LR5/β-propeller domain or LR5/apoB). In this respect, a complete picture of the interactions formed by the different domains of the LDL-r among them and with VLDL, LDL, PCSK9, RAP or additional partners yet to be discovered is lacking. Comprehensive high-throughput and high-resolution interactome structural predictions will eventually become available, but meanwhile much experimental work remains to be done. Besides, a structural and thermodynamic understanding of how the LDL-r interactions are affected by the specific and changing solution conditions in the different cellular compartments visited along its functional cycle is also needed (5–7,13–15,46–48,50,82–84). Although the structural distortions caused by mutations in the isolated LA5 domain, as identified with our approach, need not be identical to those taking place in the context of the whole LDL-r, the strong experimental evidence showing the structural independence of LA domains (8,30) suggests those distortions are likely to be very similar. Nevertheless, in cases where differences occur they will impose an upper limit to the predictive accuracy of the method. It is also important to emphasize that the evaluation and improvement of the specificity of this or any other predictive method will benefit from the search for, and documentation of, non-pathogenic mutations.
In this work, we have provided a clear structure/stability view of the complete mutational landscape of the LA5 domain, with a quantitative classification of the conformational instability caused by all biologically accessible SNP amino acid substitutions. We hope that these data would be useful for planning experimental work to measure the real extent of the structural instability associated to yet-to-study putative destabilizing mutations, and for designing screening devices for the efficient diagnosis of FH. By extension, the method here illustrated may be applied to studying how SNP may affect the structure and function of other proteins associated to other pathologies.
Examination on the entire SNP mutational space of the LDL-r LA5 domain using relaxation molecular dynamics simulations allows to accurately classify each possible mutation as either compatible with the native structure or as destabilizing. Comparison of this classification with the known SNPs associated to FH disease phenotype clearly reveals two types of FH mutations: those causing a stability defect and those impairing binding interactions with LDL-r-associated proteins. The data generated here delineate the space of putative pathogenic mutations in an important LDL-r domain and may help experimentalist to develop more comprehensive FH screening methods, and may contribute to a better understanding of FH from a structural perspective. On a larger scale and with sufficient computation power, it seems possible to make a full computational diagnosis for FH by considering both the degree of conformational instability in simulations of the LDL-r and related proteins, and the possible impairments of their interacting regions. Importantly, the structural approach followed by us can be applied to predict the deleteriousness of genetic variations in other small proteins without relying in the evolutionary assumptions characteristic of most current methods based on sequence analysis. An obvious challenge in applying this method to larger proteins is that they will require more computation power due to their larger size and also to the larger number of possible SNPs. An additional challenge may apply in the form of slower relaxation kinetics for large proteins (85), especially if they exhibit full thermodynamic cooperativity (86).
Materials and Methods
LA5 domain coding sequence, structure and complete SNP mutational map generation
From the protein sequence of the complete human LDL-r accessible in Uniprot (35) (ID: LDLR_HUMAN, AC: P01130), we extracted the DNA-coding sequence for the LA5 domain by accessing the entry for this gene in the Ensembl database (87) (ID: ENSG00000130164). The protein sequence for the LA5 domain corresponds to residues 195–233 in the sequence of the complete receptor, while the X-ray structure of the domain used as the starting point for MD simulations and structural analysis (PDB id: 1AJJ), includes residues 196–232. Thus, we just extracted the coding sequence for amino acids 196–232, leaving out the codons for the N- and C-terminal serine and valine. The DNA sequence was then processed with an ad hoc script for generating all the biologically accessible mutants arising from the substitution of a single nucleotide (Supplementary Material, Fig. S1). All the non-synonymous SNPs were identified—i.e. 256 non-synonymous SNPs coding for 227 different single-residue substituted protein variants—and the corresponding mutations in the structure of LA5 domain were generated using the program SCWRL (51), for finding the best rotamers of the mutated residue. Then, the mutants were given a specific code (Supplementary Material, Table S2) to be further processed before running the MD simulations.
Setting up the systems for molecular dynamics simulation production
Each of the 227 mutants, plus the wild-type LA5 domain, was solvated in a cubic water box with approximately 5500 TIP3 water molecules, and neutralized with Na+Cl− counter ions using the solvate package in VMD (88). We setup a thorough procedure for preparing the systems previous to running the production MD simulations, including multiple cycles of step-descending minimization/equilibration steps in a preparation phase of ∼5 ns, which encompasses: (a) short CPT dynamics of water molecules with the protein atoms fixed to eliminate possible potential strains in the water box, (b) slow release of the protein atoms by imposing decreasing elastic restraints and (c) very slow heating of the systems to the final simulation temperature (310K) using a gradient temperature ramp. The 5 ns preparation phase guaranteed the stabilization of the temperature and total energy of the systems. Then, 20 ns production MD simulations were run for each mutant using the CHARMM (89) force field (version c34b1) in NAMD (90). The simulations were run using Langevin dynamics, with periodic boundary conditions and Particle Mesh Ewald for modeling long-range electrostatic interactions with a cutoff distance of 14 Å. The Nosé–Hoover thermostat was used for pressure coupling of the system and the friction coefficients were set to 0.5 and 60 ps–1 for protein atoms, and water molecules and ions, respectively. The simulations were run mainly in the cluster of the Red Española de Supercomputación: marenostrum at the Barcelona Supercomputing Center and CaesarAugusta at the Institute for Biocomputation and Complex Systems Physics (BIFI), and also at the Terminus and Memento clusters at BIFI. The trajectories were analyzed with VMD (88) and a set of ad hoc TCL and Perl scripts.
Principal component analysis of individual MD trajectory data and of meta-trajectories
PCA, a useful procedure for capturing correlations among variables, has been extensively used for analyzing MD trajectories aiming at describing the ‘essential dynamics’ (54,58–61) of a system. Performing PCA on MD data starts by aligning the trajectory for removing the translational and rotational components of movement (91). Then, the trajectory is centered to the reference structure —e.g. the initial or an average structure—by subtracting the reference structure to the aligned snapshots, and it is represented as a matrix of the type , on which the rows are the coordinates of the N residues of the system, and the columns the number of frames or snapshots, F, of the trajectory. Subsequently, the covariance or correlation matrix is calculated from the product of the trajectory matrix by its transpose . The eigenvalue decomposition of the covariance matrix renders a set of eigenvalues and orthogonal eigenvectors organized in the form , where is the diagonal matrix of the eigenvalues and V is the matrix of the 3N − 6 eigenvectors paired to the eigenvalues. The eigenvalues are sorted in descending order with respect to the amount of variance of the original data described by the pairing eigenvectors.
Principal components representations of individual trajectories were generated by projecting the coordinates in the Cartesian space coming from the simulation into the eigenspace defined by the first 3 eigenvectors, as shown in Figures 2 and 3 and Supplementary Material, Figure S7. On the other hand, we quantitatively compared the PCA subspace explored by different mutants and by the wild-type LA5 domain as follows. Using the VMD (88) CATDCD utility, we concatenated into a meta-trajectory the last 10 ns of all the trajectories and then recalculated the complete Eigensystem to obtain the projections of the frames of each independent simulation into the meta-trajectory principal components. This approach allows to describe all the different simulations in a common PC space. Then, we quantitatively assessed the effect of mutations on the structure of the LA5 domain by calculating the distance among the subspaces explored by each mutant and the wild-type domain. To do that we used the Mahalanobis distance (62) (MDpp′), a metric routinely used in the field of multivariate statistics which, in contrast with the classic Euclidean distance, accounts for the correlations on data and is independent of data transformations. In the specific case of PCA, the Mahalanobis distance between a pair of points p and p′ in the PC space is defined as:
For obtaining the mean distance among trajectories independently of their compliance (stability) or non-compliance (instability) with the multivariate normal distribution, we set a resampling strategy in which we resampled with replacement a subset of snapshots—e.g. 5 ns for the 10 ns meta-trajectory—from each trajectory, and calculated MDpp′ for all possible pairs of points in the N-dimensional PCA space—e.g. 25 eigenvectors describing 95% of the variance. We repeated this step 105 times for each pair of simulations and obtained normal distributions for the Mahalanobis distances among points in the trajectories, with rather low-standard deviations. From this comparison, we obtained the mean MDT1,T2 among whichever two trajectories. After calculating the distance matrix among trajectories, we performed a clustering—i.e. using a complete-link clustering procedure—of the trajectories according to the PCA subspace explored in each case. All the manipulation of MD data for PCA analysis was performed with a set of ad hoc TCL and Perl scripts, alongside with the PCAZIP package (http://mmb.pcb.ub.es/software/pcasuite/pcasuite.html). We compressed all the trajectories using PCAZIP, taking into consideration only the backbone atoms of the LDL-r LA5 domain and retrieving, in each case, the number of eigenvalues and eigenvectors sufficient to describe 95% of the total variance in the system. Using a tool from the PCAZIP package, we extracted all the metrics and data used in the statistical analyses in our study—e.g. eigenvectors, projections etc. The processing of PCA data, and all the resampling, clustering and statistical analyses were done in the R statistical package (92) with a group of ad hoc R scripts.
V.E.A. was funded by Banco Santander Central Hispano, Fundación Carolina and Universidad de Zaragoza and was recipient of a doctoral fellowship awarded by Consejo Superior de Investigaciones Científicas, JAE program. M.O. acknowledge financial support from grants BIO2012–32868 (Ministerio de Economía y Competitividad, Spain), 2014SGR00134 (Grups de Recerca Consolidats, Generalitat de Catalunya, Spain) and PT13/0001/0019 (ISCIII, Spain). J.S. acknowledge financial support from grants BFU2010-16297 (Ministerio de Ciencia e Innovación, Spain), BFU2013-47064-P and BIO2014-57314-REDT (Ministerio de Economía y Competitividad, Spain) and PI078/08 (Gobierno de Aragón, Spain). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. Funding to pay the Open Access publication charges for this article was provided by grant BFU2013-47064-P (Ministerio de Economia Y competitividad).
The authors thankfully acknowledge the resources from the supercomputers Marenostrum from the Barcelona Supercomputer Center (BSC), and Memento and Terminus hosted at the BIFI, Universidad de Zaragoza and the technical expertise and assistance provided by the High Performance Computing groups both at the BSC and BIFI.
Conflict of Interest statement. None declared.