The accumulation of genome-wide information on single nucleotide polymorphisms in humans provides an unprecedented opportunity to detect the evolutionary forces responsible for heterogeneity of the level of genetic variability across loci. Previous studies have shown that history of recombination events has produced long haplotype blocks in the human genome, which contribute to this heterogeneity. Other factors, however, such as natural selection or the heterogeneity of mutation rates across loci, may also lead to heterogeneity of genetic variability. We compared synonymous and non-synonymous variability within human genes with their divergence from murine orthologs. We separately analyzed the non-synonymous variants predicted to damage protein structure or function and the variants predicted to be functionally benign. The predictions were based on comparative sequence analysis and, in some cases, on the analysis of protein structure. A strong correlation between non-synonymous, benign variability and non-synonymous human–mouse divergence suggests that selection played an important role in shaping the pattern of variability in coding regions of human genes. However, the lack of correlation between deleterious variability and evolutionary divergence shows that a substantial proportion of the observed non-synonymous single-nucleotide polymorphisms reduces fitness and never reaches fixation. Evolutionary and medical implications of the impact of selection on human polymorphisms are discussed.
Since Darwin, biologists have understood the importance of heritable variability for evolution in natural populations; more recently, it has become clear that population variability is also crucial for the study of heritable diseases. Single-nucleotide polymorphisms (SNPs) are the most frequent type of genetic variation among humans and are responsible for most of the variation in human phenotypes. SNPs were described by Kreitman in the fruit fly Drosophila melanogaster (1). In a subsequent study by McDonald and Kreitman (2), SNPs of different functional categories were related to the rate of divergence between species in order to test for the presence of natural selection. Here, we compare the densities of various functional categories of SNPs in different human genes with the divergence between these genes and their murine orthologs. This analysis allows us to identify the evolutionary forces responsible for the observed heterogeneity in polymorphism levels across human genes.
The polymorphism rate has been estimated for many human genes (3). Some genes are highly polymorphic, whereas other genes display almost no polymorphism across the human population (4–6). Theoretically, several evolutionary forces, including random drift (coalescence), heterogeneity of mutation rates, and negative and positive natural selection, can lead to this heterogeneity in the polymorphism rate. In addition to satisfying theoretical interests, understanding the heterogeneity of the polymorphism rate is important for the optimization of the design of association studies aimed at the identification of alleles responsible for human phenotypes (7–10).
Recent studies demonstrated wide variation in the polymorphism density along the genome and postulated the existence of stable haplotype blocks (11,12). Recombination hot spots have been hypothesized to contribute greatly to the heterogeneity in polymorphism density along the genome (11–15). Recombination hot spots split the genome into regions with different coalescent histories, i.e. the genome can be represented as a mosaic of blocks, each consisting of sites with a common genealogy, not broken by recombination in most individuals. In selectively neutral regions of the genome, differences in the coalescent history of genes are due to random genetic drift. In protein-coding regions, however, natural selection also affects genetic variation, which makes the patterns more complex.
Our aim was to focus on the coding regions and study all of the human genes in public databases that had SNP information and orthologous murine sequences available, in order to: (1) analyze the impact of population history, the mutation rate, and negative (stabilizing) selection on the structure of human genetic variation, and (2) study the distribution of deleterious alleles among monomorphic and variable genes.
The variability of a population is shaped by an interplay of several evolutionary forces. Consequently, several not mutually exclusive explanations may account for different polymorphism rates across genes (Table 1). Mutation with heterogeneous rates, population history (coalescence) and selection may all independently affect the polymorphism rate at each individual locus. We attempted to disentangle the impacts of these factors using data on intraspecies polymorphism and interspecies divergence. The correlation of the polymorphism and evolution rates of human genes establishes the role of factors that also affect interspecies divergence, and the correlations of the polymorphism rates between different functional categories of SNPs make it possible to distinguish the effects of mutation or coalescence from that of selection.
This analysis relies on two assumptions. First, it assumes that evolutionary constraints are similar for within-population variability and between-species divergence. For example, it has been shown that the majority of deleterious amino acid substitutions destabilize proteins (16). Stability requirements are unlikely to differ substantially for orthologous proteins of closely related species.
The second assumption is that the mutation processes are similar for human and mouse orthologs. The mutation rate at a nucleotide site depends on a number of factors, including DNA methylation sequence contexts (17,18) and other weaker factors (19,20). The nucleotide content at a locus is not significantly different in humans and mice (21); thus, we expected that the per locus mutation rate is similar in the course of the evolution of these two species.
The mutation rate influences all functional categories of substitutions equally and affects intraspecies polymorphism and interspecies divergence in the same way (22). Therefore, if mutation rate heterogeneity were a strong force responsible for the heterogeneity of the polymorphism rate across genes, the abundance of all functional types of SNPs in a gene should be strongly correlated with the number of substitutions between this human gene and its mouse ortholog (Table 1). Also, the numbers of polymorphic sites of different functional types would be correlated with each other.
To detect the impact of mutation rate heterogeneity, we need to eliminate the effects of selection and coalescent history. The only correlation affected by heterogeneity in the mutation rate, but not by selection or coalescent history, is the correlation of neutral polymorphism and divergence rates. Thus, we analyzed the correlation of the number of synonymous substitutions KS between human and murine genes with the number of synonymous polymorphisms. The polymorphism rate is expected to correlate with divergence at neutral sites. However, if the heterogeneity of the mutation rate among loci is small compared with other factors influencing the polymorphism rate, this correlation would not account for most of the differences in SNP density and therefore would not be highly significant. As shown in Figure 1, genes with high KS tend to have a higher synonymous polymorphism rate (a correlation coefficient of approximately 0.2 was recently reported in a related study) (23). This tendency is stronger for small values of KS, which is probably due to the inaccuracy of the KS values in genes with a high synonymous divergence rate. However, the dependence of the number of synonymous SNPs on KS is overall not highly significant. This suggests, in agreement with previous work (12,20,24), that mutation rate heterogeneity plays only a minor role in creating the observed heterogeneity in polymorphism rates across genes. Alternatively, fluctuations of the mutation rate along the genome may not be conserved between the human and the mouse genomes.
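The relationship between synonymous divergence and synonymous polymorphism can be summarized with a simple correlation coefficient. The following is a minimal sketch, not the procedure used in this study: the function names `pearson_r` and `snp_density`, and the idea of normalizing SNP counts by coding length, are illustrative assumptions.

```python
import math
from typing import Sequence


def snp_density(snp_count: int, coding_length: int) -> float:
    """SNPs per nucleotide of coding sequence (hypothetical normalization)."""
    if coding_length <= 0:
        raise ValueError("coding_length must be positive")
    return snp_count / coding_length


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

Given per-gene lists of KS values and synonymous SNP densities, `pearson_r(ks_values, densities)` returns a value between −1 and 1; a value near 0.2 would correspond to the weak tendency described above.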
All contemporary alleles at each locus have descended from a single allele (22). If a locus has had a long coalescent history, i.e. if the ancestral allele existed many generations ago, many sites at the locus, both functional and neutral, are expected to be polymorphic. Thus, the impact of different coalescent histories at different loci, as well as of locus-specific mutation rates, will create a correlation of densities of functional and neutral SNPs at a locus. We observe a strong and highly significant correlation between the density of synonymous and non-synonymous SNPs in a gene (Fig. 2). Because mutation rate heterogeneity has only a minor effect on polymorphism rates, the correlation between the densities of functional and neutral SNPs cannot be explained by mutation rate heterogeneity and must be due to different coalescent histories of different loci.
Heterogeneity in the coalescent histories of loci may be caused by genetic drift and recombination, or by selection through background selection (25) or hitch-hiking (26). We assume that positive selection and, therefore, hitch-hiking are too rare to substantially alter the polymorphism rate at many loci simultaneously. Then, the cause of heterogeneous coalescent histories of human loci can be either genetic drift or differences in the strength of negative selection. If background selection has influenced the polymorphism rates of modern SNPs, then genes under stronger selective constraint should have a lower density of synonymous SNPs.
To test for this influence, we related the density of synonymous SNPs to the strength of negative selection, estimated as the ratio of non-synonymous to synonymous substitution rates, KA/KS. No correlation between the density of synonymous SNPs and the KA/KS ratio was observed (Fig. 3). This suggests that genetic drift, not background selection, is primarily responsible for differences in coalescent history.
The expected effect of selection on the density of SNPs is greatly dependent on their contribution to fitness. With respect to fitness, four types of changes are possible in a protein sequence: benign (neutral), slightly deleterious, strongly deleterious and beneficial. Very deleterious and beneficial SNPs are expected to be observed only at low rates in a population because very deleterious SNPs are prevented from becoming common by negative selection and beneficial substitutions achieve fixation quickly and are not generally observed in polymorphism state. Thus, the majority of the SNPs observed in a population are probably neutral or slightly deleterious.
The impact of negative selection on a gene depends on its overall contribution to fitness and on the fraction of sites (possible mutations) that are functionally important for this gene. Both the rate of non-synonymous substitution (KA) and the number of benign SNPs are proportional to the fraction of sites that are under selective constraint; therefore, a correlation between the density of benign SNPs and KA was expected. Any neutral variation is expected to correlate with divergence. However, if different loci show a higher heterogeneity in protein evolution rates than in mutation rates, then they would show a more pronounced difference in benign amino acid variation than in synonymous variation.
Slightly deleterious mutations, those with a coefficient of selection against them below 1/4Ne (where Ne is the effective population size), contribute to both fixations and polymorphism (27). In contrast, substantially deleterious mutations with coefficients of selection above 1/4Ne almost never reach fixation. Such mutations may be present at low frequencies and, thus, contribute to polymorphism, unless they are very deleterious (with selection coefficients above approximately 100/4Ne). Among mutations that may affect a protein, the fraction of substantially deleterious mutations with selection coefficients within the range 1/4Ne–100/4Ne is likely to be substantial (28), because this range spans two orders of magnitude. Regardless of this fraction, we did not expect the density of substantially deleterious SNPs to correlate with KA.
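The thresholds above can be made concrete with a small worked example. This is an illustrative sketch only: the function name `selection_class` and the value Ne = 10 000 used in the usage note are assumptions, not quantities taken from this study, and the upper threshold is described in the text as approximate.

```python
def selection_class(s: float, ne: float) -> str:
    """Classify a mutation by the selection coefficient s acting against it.

    Thresholds follow the text: 1/(4*Ne) separates slightly from
    substantially deleterious mutations, and roughly 100/(4*Ne)
    separates substantially from very deleterious ones.
    """
    if s < 0:
        raise ValueError("s is the coefficient of selection AGAINST the allele")
    if ne <= 0:
        raise ValueError("effective population size must be positive")
    lower = 1.0 / (4.0 * ne)     # below this, fixation by drift is possible
    upper = 100.0 / (4.0 * ne)   # above this, the allele is rarely observed
    if s < lower:
        # Note: s == 0 (strictly neutral) also falls into this bin.
        return "slightly deleterious"
    if s <= upper:
        return "substantially deleterious"
    return "very deleterious"
```

With a hypothetical Ne of 10 000, the two thresholds are 2.5×10^−5 and 2.5×10^−3, so `selection_class(1e-3, 10_000)` falls in the substantially deleterious range that contributes to polymorphism but not to fixation.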
We observed a strong correlation between KA and the density of non-synonymous SNPs predicted to be benign and only a very weak correlation between KA and the density of SNPs predicted to be damaging for protein structure or function (Fig. 4). We predicted that some polymorphisms are damaging based on evolutionary sequence conservation and biochemical and protein structure characteristics of the substitution (see Materials and Methods). The polymorphisms that we predicted to be damaging by our routine may, nevertheless, prove not to affect fitness because what is ‘damaging’ from the point of view of protein structure, or even evolutionary conservation, may not have a profound effect on fitness in the modern species. However, the fact that the correlation between KA and the density of damaging SNPs is very weak demonstrates that most of the SNPs that we predict to be damaging do have a deleterious effect on fitness strong enough to prevent their fixations (29).
Since only the density of substantially deleterious SNPs should fail to correlate with KA, these results demonstrate that substantially deleterious alleles are present in the human population in large numbers; it has been suggested (16,29–31) that as many as 1/5 of all missense SNPs may be deleterious. Because the number of deleterious SNPs depends only slightly on the rate of protein evolution, estimated by interspecies sequence divergence, deleterious SNPs are evenly distributed across human genes.
Here we compared the contributions of genetic drift, mutation and selection to the distribution of human polymorphisms. Our results confirm the importance of coalescent history, shaped by genetic drift, to the structure of human haplotypes (12). However, at least in protein-coding regions, negative selection also has a profound effect on the density of polymorphisms. We found that selection lowers genetic variability by eliminating deleterious variants and that the magnitude of this effect is relatively strong in comparison to genetic drift. The very weak correlation between non-synonymous sequence divergence and the density of damaging SNPs shows that the accumulation of deleterious SNPs also has a considerable effect on the pattern of human polymorphism, such that many SNPs in the human genome appear to be substantially deleterious. While previous analyses detected high levels of deleterious SNPs in the human population (4,16,29–34), here we show that selection is strong enough to prevent their fixation.
Comparing polymorphisms and sequence divergence also has its implications for medical geneticists looking for allelic variants involved in human disorders. Sequence conservation between species has been proposed to be a useful marker to identify potentially important allelic variants. Although there is little doubt that this strategy is useful where individual amino acid sites are concerned, our study shows that the overall number of deleterious variants does not depend on the conservation of a gene as a whole (Fig. 4).
The common disease/common variant (CD/CV) hypothesis claims that genetic diseases are caused by frequent alleles (7). On the other hand, it is also possible that common multifactorial genetic diseases are mostly caused by a large number of rare alleles; this is known as the common disease/rare variant (CD/RV) hypothesis (10). From the evolutionary point of view, the CD/RV hypothesis can be explained in terms of the mutation accumulation model, which postulates that large numbers of deleterious alleles are involved in the inheritance of phenotypes. The CD/CV hypothesis, in contrast, would be most easily supported by pleiotropic models (10). Although multiple studies on specific complex phenotypes will be needed to decisively discriminate between these hypotheses, statistical analysis of allelic variation can provide indirect evidence supporting either of them. Our results support the mutation accumulation model in that a significant number of substantially deleterious allelic variants have accumulated in individual human genomes.
Both the direct and the indirect association studies assume that common diseases are caused by common SNPs (8). In addition, indirect association studies are dependent upon the existence of stable haplotype blocks in the genome and the association of common neutral polymorphisms with the disease causing variants in the haplotype blocks (8). If many deleterious SNPs contribute to the common complex phenotypes of medical interest, association studies might be ineffective (10) and novel strategies are to be sought to identify the genetic basis of complex diseases.
MATERIALS AND METHODS
Protein sequences for human and mouse genes were obtained by extracting all entries from the SWALL database (35) annotated as ‘Homo sapiens’ and ‘Mus musculus’ in the organism field. Orthologous pairs between these two species were identified via BLAST (36) as bidirectional best hits (37) that spanned at least 80% of the length of the longest protein and showed at least 60% amino acid sequence identity. This procedure yielded 11 597 orthologous pairs. For each gene in each orthologous pair, a nucleotide sequence was obtained by comparing the amino acid sequences of genes in our dataset with the coding sequences obtained from the complete human (38) and mouse (21) genomes and from complete mRNA sequences from GenBank. Only genes with greater than 99% similarity (excluding gaps longer than 20 nucleotides or probable alternative isoforms) were used to reconstruct the nucleotide sequence, resulting in 2592 gene pair alignments. For each orthologous gene pair, a nucleotide alignment was reconstructed from an amino acid alignment made with CLUSTAL (39) using default parameters.
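The bidirectional best hit criterion with the 80% coverage and 60% identity thresholds can be sketched as follows. This is a hedged illustration, not the pipeline used here: the `Hit` record, the function names, and the use of aligned length divided by the longer protein as the coverage measure are all assumptions, and parsing of actual BLAST output is omitted.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One pairwise hit (hypothetical record, not a BLAST file format)."""
    query: str
    subject: str
    score: float
    identity: float   # fraction of identical residues, 0..1
    aligned_len: int  # number of aligned residues


def best_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep the highest-scoring hit for each query."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    return best


def orthologous_pairs(
    human_vs_mouse: Iterable[Hit],
    mouse_vs_human: Iterable[Hit],
    lengths: Dict[str, int],
    min_coverage: float = 0.8,
    min_identity: float = 0.6,
) -> List[Tuple[str, str]]:
    """Return (human, mouse) pairs that are each other's best hit and
    pass the coverage and identity thresholds."""
    forward = best_hits(human_vs_mouse)
    reverse = best_hits(mouse_vs_human)
    pairs = []
    for human, hit in forward.items():
        back = reverse.get(hit.subject)
        if back is None or back.subject != human:
            continue  # not reciprocal
        longest = max(lengths[human], lengths[hit.subject])
        if hit.aligned_len / longest >= min_coverage and hit.identity >= min_identity:
            pairs.append((human, hit.subject))
    return sorted(pairs)


# Toy data: only H1/M1 is reciprocal AND passes both thresholds.
h2m = [Hit("H1", "M1", 500, 0.90, 95), Hit("H1", "M2", 300, 0.70, 90),
       Hit("H2", "M2", 400, 0.50, 98), Hit("H3", "M3", 200, 0.80, 50)]
m2h = [Hit("M1", "H1", 500, 0.90, 95), Hit("M2", "H2", 400, 0.50, 98),
       Hit("M3", "H3", 200, 0.80, 50)]
protein_lengths = {name: 100 for name in ("H1", "H2", "H3", "M1", "M2", "M3")}
example_pairs = orthologous_pairs(h2m, m2h, protein_lengths)
```

In the toy data, H2/M2 is rejected for low identity and H3/M3 for low coverage, which mirrors how the two thresholds act independently of the reciprocity requirement.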
We estimated the rate of synonymous (KS) and non-synonymous (KA) sequence divergence using the Li–Pamilo–Bianchi method (40,41). To demonstrate the independence of our results and the method of estimating KS and KA values, we repeated all computations using KS and KA estimates obtained using the Yang and Nielsen method as implemented in the PAML package (42,43) (data not shown).
SNPs from the HGVbase database, Release 13 (44) were mapped onto protein sequences using the snp2prot program (45) and were classified into three functional categories (synonymous, non-synonymous damaging and non-synonymous benign) using PolyPhen software (45). The mapping resulted in 5923 nsSNPs and 5856 sSNPs observed in 4332 human proteins.
To evaluate the statistical significance of the observed dependences, we grouped the genes according to both variables to form a 4×3 contingency table with equally populated bins. A contingency table χ² test of independence was applied to all statistical dependences considered. Since the data sample size and the number of degrees of freedom were kept constant, χ² values and corresponding P-values were directly comparable for all tests, except for the dependence shown in Figure 2. Figure 2 presents only TSC (the SNP Consortium) (46) data corresponding to the systematic study, which used a fixed population sample for SNP discovery. Several χ²-based contingency measures corrected for the sample size dependence are in common use. We used the contingency coefficient
We relied on χ² statistics on grouped data because the test is distribution-free, i.e. it does not depend on a specific hypothesis about the original distribution of the data. The results are robust with respect to the data grouping. We also repeated the analysis using the correlation coefficient and ANOVA. The results were shown to be independent of the method used to detect statistical dependence. In ANOVA, genes were grouped according to the number of SNPs per gene. For the dependence of benign nsSNP density on KA, ANOVA resulted in a P-value of 4.62×10^−27 (4.21×10^−23 for the χ² test); the same test for damaging nsSNPs had a P-value of 0.03 (0.046 for the χ² test); and the ANOVA P-value for the dependence of sSNP rate on KS was 0.64 (0.088 for the χ² test).
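The grouped χ² test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, bins are formed by rank so that tied values may be split across bins, and the P-value uses a closed-form tail that holds only for an even number of degrees of freedom (a 4×3 table has 6).

```python
import math
from typing import List, Sequence, Tuple


def equal_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Assign bin indices 0..n_bins-1 by rank so bins are equally populated.
    Ties are broken by input order, so equal values may land in different bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def contingency_table(x: Sequence[float], y: Sequence[float],
                      rows: int = 4, cols: int = 3) -> List[List[int]]:
    """Cross-tabulate genes by their bin in x (rows) and in y (columns)."""
    table = [[0] * cols for _ in range(rows)]
    for r, c in zip(equal_bins(x, rows), equal_bins(y, cols)):
        table[r][c] += 1
    return table


def chi2_independence(table: List[List[int]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, (len(row_tot) - 1) * (len(col_tot) - 1)


def chi2_p_value_even_dof(chi2: float, dof: int) -> float:
    """Upper-tail probability; this closed form is valid for even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive, even dof")
    half = chi2 / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(dof // 2))
```

Because every test uses the same table shape and sample size, the resulting χ² values can be compared directly across dependences, which is the property the text relies on.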
In order to verify that our results were not due to the comparative sequence analysis method employed, we confirmed all of the results using a set of functional predictions made on the basis of three-dimensional structure alone. To verify that the qualitative results were not seriously affected by erroneous SNPs in the database or by possible bias due to non-uniform and generally unknown sample size, all analyses were repeated on subsets of the database corresponding to SNP data annotated as ‘validated’ in dbSNP (47) and also on the data obtained by systematic genome-wide SNP screens in a fixed sample of individuals provided by TSC and the Japanese SNP database (JSNP) (48).
The data on human–mouse sequence divergence and SNP distribution in genes are available via ftp at genetics.bwh.harvard.edu/Sunyaev/snp2div/snp2div.xls.
We are grateful to Alexey Kondrashov and Alison Wellman for the careful reading of the manuscript and providing us with their valuable comments.
To whom correspondence should be addressed. Tel: +1 6175256675; Fax: +1 6177325123; Email: email@example.com
Table 1.
| Explanation | nsSNPs versus sSNPs | sSNPs versus KS | Neutral nsSNPs versus KA | Deleterious nsSNPs versus KA |
| Mutation rate differences | + | + | + | + |
| Accumulation of slightly deleterious alleles | − | − | − | − |