Abstract

Using available information on the total size of the coding region of the human genome, data on codon usage and pseudogene-derived mutation rates for different single-nucleotide substitutions, we estimated, for the human genome, the potential numbers of mutation events capable of producing: (1) nonsense; (2) missense (radical and conservative); (3) silent; (4) splice; and (5) protein-elongating (those changing a wild-type stop codon into an amino acid-encoding codon) mutations. We used the NCBI dbSNP database to retrieve data on the observed number of polymorphisms of each category. The fraction of polymorphisms in each category among all potential events in the genome depends on the strength of selection: the higher the rate of polymorphism, the weaker the selection. We used nonsense mutations as a reference group. Compared with nonsense mutations, we found that the relative selection coefficient against protein-elongating mutations was 21%, and the relative selection coefficient against missense mutations was 12%. Radical missense mutations were found to be four times more deleterious than conservative ones. Surprisingly, we found that silent mutations are, on average, not neutral: their average deleteriousness is 3% of that of nonsense mutations. Silent mutations may be deleterious when they affect splicing by creating cryptic donor-acceptor sites or by disturbing exonic splicing enhancers (ESEs). The average selection coefficient against splice mutations was 48% of that against nonsense mutations. Converting the relative selection coefficients into absolute ones using data on loss-of-function mutations in Saccharomyces cerevisiae and Caenorhabditis elegans, or by analysis of the expected frequency of mutations in the human genome, suggested that genetic drift could play a role in the population dynamics of conservative missense and silent mutations.

INTRODUCTION

Single-nucleotide polymorphisms (SNPs) are common sequence variants in the human genome (1). SNPs occur, on average, every 500–1000 bases (2–6). SNP density is lower in exonic than in intronic sequences, which suggests that purifying selection plays a role in the distribution of SNPs (7). Sunyaev et al. (8) compared synonymous and non-synonymous SNPs in human genes with the corresponding regions in murine homologues and concluded that most functional SNPs in the human genome are deleterious.

It has long been known that substitutions of an amino acid by amino acids with similar physicochemical properties are retained during evolution more frequently than substitutions by chemically different amino acids (9–11). This suggests that, on average, the substitution of an amino acid by a chemically similar one is less deleterious than substitution by a chemically different amino acid. Several classification schemes for amino acids have been developed more recently (9,12,13). One of the first classifications of amino acids is Grantham's chemical difference matrix (11), which uses a multivariate combination of residue side chain composition, polarity and volume differences to quantify the severity of amino acid changes. The classification approach used by Dagan et al. (10) combines Grantham's distance with polarity, volume and charge and has been shown to correlate with the intensity of selection.

Different categories of SNPs are expected to vary in deleteriousness; for example, missense mutations are more deleterious than silent mutations (14). It is of theoretical and practical interest to estimate the average deleteriousness of different functional categories of point mutations in coding regions of the human genome. In this study, we used the National Center for Biotechnology Information dbSNP database to estimate the relative selection coefficients against different functional categories of point mutations in coding regions of the human genome.

RESULTS

SNP retrieval

A total of 29 169 true (i.e. originated by point mutations) and frequency-validated SNPs in the coding regions of the human genome were retrieved from the dbSNP database. The numbers of the retrieved SNPs according to the functional categories of point mutations are shown in Table 1.

Average mutation rates

Relative mutation rates for the 12 variants of the point mutations derived from the analysis of processed pseudogenes are available (Table 2). We ascribed these mutation rates to each of the 576 possible wild-type (WT)-mutant codon pairs (Supplementary Material, Table S1) and computed the average relative mutation rates for the six functional categories used in our study (Table 3).

Observed and expected proportions of the functional categories

The proportions of the potential sites for the functional categories of point mutations, adjusted for codon usage and mutation rates, are shown in Table 1. The proportions given in columns 3–5 are those expected if the strength of purifying selection does not vary among the categories of mutations, before and after adjustment for codon usage and mutation rates. The observed proportions were estimated on the basis of the observed numbers of SNPs corresponding to each functional category reported in the dbSNP database.

We found that the observed proportions of the nonsense, protein-elongating and radical missense mutations were lower than expected and that the observed relative proportions of the conservative missense and silent mutations were higher than expected (Table 1 and Fig. 1A). The variations in proportions were most easily interpreted when the log of the ratio of the observed to expected proportions was used (Fig. 1B). This pattern can be explained by inter-categorical differences in the strength of purifying selection: stronger purifying selection against nonsense, protein-elongating and radical missense mutations produces a deficit of these mutations and therefore tips their proportions downwards, whereas the proportions of conservative missense and silent mutations are correspondingly inflated. Therefore, our results suggest inter-categorical variation in the strength of purifying selection.
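The ratios described above can be recomputed from the observed and expected fractions listed in Table 1. The following Python sketch does this under one assumption: the logarithm is taken to base 10, which the text does not state but which matches the tabulated values. The dictionary names are illustrative only.

```python
import math

# Expected fractions of SNPs (%), adjusted for codon usage and mutation rate,
# and observed fractions of SNPs (%), both copied from Table 1.
expected = {
    "nonsense": 3.6,
    "protein-elongating": 0.2,
    "missense": 66.9,
    "radical missense": 48.2,
    "conservative missense": 18.7,
    "silent": 29.4,
}
observed = {
    "nonsense": 0.3,
    "protein-elongating": 0.1,
    "missense": 52.3,
    "radical missense": 26.1,
    "conservative missense": 26.2,
    "silent": 47.4,
}

# Base-10 log of observed/expected (the base is an assumption of this sketch).
log_ratio = {cat: math.log10(observed[cat] / expected[cat]) for cat in expected}

for cat, value in log_ratio.items():
    print(f"{cat:22s} {value:+.2f}")
```

Negative values indicate a deficit of observed SNPs relative to expectation and positive values an excess.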

Predicted average relative selection coefficients

Equation (7) was used to compute average relative selection coefficients. Table 4 shows these coefficients and their 99% confidence intervals for each functional category. We found that relative to the average selection coefficient against nonsense mutations, the coefficient was 21% against protein-elongating mutations, 3% against silent mutations and 12% against missense mutations. The average relative selection coefficient against radical missense mutations was four times higher than that against conservative missense mutations. The average relative selection coefficient against conservative missense mutations was slightly higher than that against silent mutations.

Deleteriousness of splice mutations

The dbSNP database contains reports on 95 frequency-validated splice mutations. Therefore, the proportion of polymorphic sites among all potential splice mutations is 95/1 870 000 ≈ 5.1×10⁻⁵. Given that the proportion of polymorphic nonsense mutations among all potential nonsense mutations in the genome (\(P_{\mathrm{non}}^{\mathrm{pol}}\)) is ∼2.2×10⁻⁵, the average relative selection coefficient against splice mutations is 48% (95% confidence interval 33–63%) of that against nonsense mutations. Given that the median length of an exon is ∼120 nucleotides (15) and that there can be several sites in an exon that are important for splicing (16–19), disturbance of normal splicing by silent mutations can explain their relatively high deleteriousness (3% versus the expected 0%).

DISCUSSION

Our approach gives an unbiased estimate of the average equilibrium frequency of a mutant allele, \(\bar{q}_i\) (see Materials and Methods and Appendix). Confidence intervals for the estimated relative selection coefficients (\(s_{i,\mathrm{rel}}\)) are very narrow for all functional categories analyzed except protein-elongating mutations (see Appendix and Table 4). We found substantial variation in the average relative selection coefficients against different functional categories of point mutations in coding regions of the human genome. After nonsense mutations, the strongest selection coefficient was for protein-elongating mutations (21% of that for nonsense mutations). Protein-elongating mutations have an amino acid-encoding codon in place of a normal stop codon and are therefore likely to disturb protein structure and function, though not as much as nonsense mutations.

The relative strength of purifying selection against radical missense mutations was 17% that against nonsense mutations. Radical missense mutations change a WT amino acid into a mutant amino acid that is chemically different. These missense mutations are expected to disturb protein structure when they are located in sites important for protein folding or in functional sites. Conservative missense mutations were four times less deleterious than radical ones. The overall relative deleteriousness of missense mutations in the human genome was 12% that of nonsense mutations.

We found a relatively high deleteriousness of silent mutations in the human genome (3% of the average deleteriousness of nonsense mutations). Silent mutations do not change the amino acid sequence and are generally believed to be selectively neutral, although in some cases they can be extremely deleterious. It has been demonstrated that when silent mutations disturb a canonical splice site located near an exon–intron boundary, they can cause abnormal splicing and lead to the production of non-functional protein (16,17,20–22). Silent mutations can also disturb splicing by activating cryptic splice sites (18,23,24) or by disrupting exonic splicing enhancers (25). Aside from the effects on splicing, silent mutations can disturb mRNA processing and transport (26,27). The widespread biases in codon usage can also cause non-neutrality of silent mutations: if a silent mutation switches encoding to a codon with a low pool of corresponding tRNA, it can severely decrease the efficiency of translation and protein production (28,29). We found that the average relative selection coefficient against splice mutations was half that against nonsense mutations. This result is in good agreement with the estimates of selection coefficients for splice mutations obtained by comparing human and chimpanzee sequences [Dr A.S. Kondrashov (NCBI), personal communication]. Because there can be several regions in an exon important for splicing and because the average deleteriousness of a splice mutation was relatively high (half that of nonsense mutations), it is not surprising, after all, that silent mutations appear to confer substantial deleteriousness.

Our approach provides estimates of relative selection coefficients. To convert relative estimates into absolute ones, we need to know the absolute selection coefficient against at least one functional category. We believe it reasonable to suggest that most nonsense mutations are loss-of-function mutations. Estimates of the average drop in fitness caused by loss-of-function mutations are available for Saccharomyces cerevisiae and Caenorhabditis elegans. Knocking out the 890 metabolic genes in S. cerevisiae and testing the knockouts in eight different conditions demonstrated the average drop in fitness to be ∼10–20% (30). Eliminating the function of more than 16 000 C. elegans genes by RNA interference (RNAi) demonstrated that >90% of them have no detectable phenotypic effect (31). The proportion of the RNAi embryonic lethal phenotype was only 5.5% (32). The other two groups of phenotypes (growth defects and post-embryonic phenotypes) make up another 5% of all phenotypic effects of RNAi loss of function (32). If we assume that the proportion of essential genes in the human genome is similar to that in the yeast and C. elegans genomes, then the average absolute selection coefficient against nonsense mutations in humans will be about 0.1.

Alternatively, we can directly obtain an estimate of the absolute selection coefficient against nonsense mutations by using the proportion of polymorphic sites among all nonsense sites in the genome. We have estimated from the dbSNP data that the average mutant allele frequency for polymorphic nonsense mutations is ∼0.2. Then, the average mutant allele frequency of nonsense mutations \(q_{\mathrm{non}}\) will be \(q_{\mathrm{non}} = 0.2 \cdot P_{\mathrm{non}}^{\mathrm{pol}} + q_{\mathrm{non}}^{\mathrm{mon}} \cdot (1 - P_{\mathrm{non}}^{\mathrm{pol}})\), where \(P_{\mathrm{non}}^{\mathrm{pol}}\) is the proportion of polymorphic sites and \(q_{\mathrm{non}}^{\mathrm{mon}}\) is the true population frequency of the nonsense mutations that were never reported as polymorphisms. If we assume that \(q_{\mathrm{non}}^{\mathrm{mon}}\) equals zero, then \(q_{\mathrm{non}} = 0.2 \cdot P_{\mathrm{non}}^{\mathrm{pol}} = 0.2 \cdot 0.00002 = 4\times10^{-6}\). On the basis of formula (1), and assuming that the coefficient of dominance equals 0.2 (33–35), an estimate of the absolute selection coefficient s will be \(s = \mu/(hq) = 2\times10^{-8}/(0.2 \cdot 4\times10^{-6}) = 0.025\). This estimate of s is about 1/4 of that obtained using data on the average fitness reduction of loss-of-function mutations in S. cerevisiae and C. elegans. The estimate could be biased because there is insufficient sequencing of the genome at present to identify rare variants, so that \(q_{\mathrm{non}}\) is underestimated and s could, therefore, be overestimated. Nevertheless, the estimate that we obtain from this method is not dramatically different from the estimate obtained by the evaluation of model organisms.

Estimates of the absolute selection coefficient for polymorphic nonsense mutations show that their deleteriousness is much lower than the overall average for nonsense mutations. For an average mutant allele frequency of 0.2, the selection coefficient, based on formula (1), will be \(s = \mu/(hq) = 2\times10^{-8}/(0.2 \cdot 0.2) = 5\times10^{-7}\). This means that polymorphic nonsense mutations are virtually neutral, whereas the average absolute selection coefficient against all nonsense mutations in the human genome is of the order of 10⁻¹–10⁻². Therefore, polymorphic nonsense mutations are unlikely to explain the mutational damage in man estimated from data on consanguineous marriages (36).
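The two applications of formula (1) in the preceding paragraphs can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are illustrative, and the inputs are the rounded values quoted in the text (mutation rate 2×10⁻⁸, dominance coefficient 0.2, proportion of polymorphic nonsense sites 2×10⁻⁵, mean frequency of polymorphic nonsense alleles 0.2).

```python
MU = 2e-8              # per-nucleotide, per-generation mutation rate
H = 0.2                # dominance coefficient assumed in the text
P_NON_POL = 2e-5       # proportion of polymorphic nonsense sites (rounded)
Q_POLYMORPHIC = 0.2    # mean mutant allele frequency at polymorphic nonsense sites


def selection_coefficient(mu, h, q):
    """Formula (1) solved for s: s = mu / (h * q)."""
    return mu / (h * q)


# All nonsense sites: monomorphic sites are assumed to contribute frequency zero.
q_non = Q_POLYMORPHIC * P_NON_POL
s_all_nonsense = selection_coefficient(MU, H, q_non)

# Polymorphic nonsense sites only.
s_polymorphic = selection_coefficient(MU, H, Q_POLYMORPHIC)

print(f"q_non = {q_non:.1e}")
print(f"s (all nonsense sites) = {s_all_nonsense:.3g}")
print(f"s (polymorphic nonsense sites) = {s_polymorphic:.1e}")
```

The five orders of magnitude between the two results come entirely from the difference between the genome-wide average allele frequency and the frequency at sites already known to be polymorphic.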

Our analysis is based on the assumption that the level of polymorphism in the coding regions of the human genome is controlled by purifying selection rather than drift. Genetic drift is expected to play a role when the selection coefficient is ≪1/\(N_e\). There is general agreement that the effective population size \(N_e\) for the human population is ∼10⁴ (37). Therefore, one can expect that random drift will be important when the selection coefficient is <10⁻⁴ (38,39). The average effect of loss-of-function mutations is ∼10⁻¹, as indicated by observations in model organisms or by the direct computation we performed earlier. Considering nonsense mutations as loss-of-function mutations, we can suggest that the absolute selection coefficient against them is ∼10⁻¹. On the basis of our estimates of the relative selection coefficients, we can suggest that the absolute selection coefficient against all splice mutations is ∼10⁻¹–10⁻², that against all radical missense mutations is ∼10⁻³, and the average selection coefficient against all conservative missense and silent mutations is ∼10⁻³–10⁻⁴. Our estimates of selection coefficients do not exclude the possibility that genetic drift plays a role in the population dynamics of conservative missense and silent mutations, but drift would play a less important role for other mutations.

Fixation of conservative missense and silent mutations due to genetic drift will decrease polymorphism, which can be erroneously interpreted as an effect of purifying selection. This means that our estimates of the relative selection coefficient for conservative missense and silent mutations can be upwardly biased. We believe, however, that this bias is unlikely to affect the conclusion that the average deleteriousness of silent mutations is several percent of that of nonsense mutations. Firstly, for genetic drift to play a major role in the population dynamics of mutations, their deleteriousness should be much lower than the critical value of 10⁻⁴. Our estimates, however, show that the deleteriousness of conservative missense and synonymous mutations is close to or slightly higher than 10⁻⁴. Secondly, the average deleteriousness of silent mutations is comparable to that of conservative missense mutations, suggesting that silent mutations are on average slightly deleterious. Recently, Yampolsky et al. (40) have estimated a bird's-eye view of the distribution of missense mutations by their deleteriousness. Their analysis suggests that the selection coefficient against the majority of missense mutations should be ∼10⁻³–10⁻⁴, which is in good agreement with our estimates.

It is possible that nonsense mutations do not always lead to a complete loss-of-function phenotype. Nonsense mutations located close to the 3′ end of the coding region of the gene can retain some function. The mild effect of nonsense mutations can be relatively common in genes encoding large proteins with a long non-functional or low-functional C-terminus. An analysis of nonsense mutations in the BRCA2 gene provides an example of a nonsense mutation with a mild effect on protein function (41). Therefore, one can expect that the average deleteriousness of nonsense mutations can be lower than the deleteriousness of true loss-of-function mutations, suggesting that our estimates of absolute selection coefficients can be upwardly biased.

Our estimates can be used to predict the average fitness reduction (mutation load) attributable to new point mutations in coding regions of the human genome. There are ∼7.5×10⁷ potential sites for missense mutations in the human genome. Given that the per-nucleotide, per-generation mutation rate is ∼2×10⁻⁸ (42,43), the average number of new missense mutations per genome per generation will be ∼1.5. According to our estimates, the absolute selection coefficient against missense mutations is ∼0.01, the dominance coefficient is ∼0.2 (33–35) and the average decrease in fitness attributed to de novo missense mutations is ∼0.3% per generation. Similar calculations show that de novo nonsense mutations can cause an ∼0.2% reduction in fitness per generation, protein-elongating mutations ∼0.1% and silent mutations ∼0.2%. Therefore, the overall per-generation reduction in fitness attributable to point mutations in coding regions of the human genome is ∼1%.
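The missense figure quoted above can be checked with a short Python sketch. It assumes that the fitness reduction is the product of the number of new mutations, the dominance coefficient and the selection coefficient; this reading is inferred from the quoted numbers rather than stated explicitly, and the variable names are illustrative.

```python
MU = 2e-8                 # per-nucleotide, per-generation mutation rate
H = 0.2                   # dominance coefficient
MISSENSE_SITES = 7.5e7    # potential missense mutations in the genome
S_MISSENSE = 0.01         # absolute selection coefficient against missense mutations

# Expected number of new missense mutations per genome per generation.
new_missense_per_generation = MISSENSE_SITES * MU

# Fitness reduction, assuming each new (heterozygous) mutation costs h * s.
fitness_reduction = new_missense_per_generation * H * S_MISSENSE

print(f"new missense mutations per generation: {new_missense_per_generation:.2f}")
print(f"fitness reduction per generation: {fitness_reduction:.2%}")
```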

MATERIALS AND METHODS

The dataset

The dbSNP is the largest public database for genomic human SNP polymorphisms (44). The latest build of the database (build 125) includes 11 170 620 submissions. Different submitters target different regions of the human genome. The size of the targeted regions depends on the method used and can vary from 100 nucleotides to thousands of nucleotides (for expanding a long-template polymerase chain reaction system). The choice of the target sequence is not random, if we consider a specific submission, but it is approximately random when results of several millions of independent submissions are pooled together. Therefore, the dbSNP database can be described as a collection of polymorphisms detected in many randomly chosen and overlapping fragments of the human genome. Depending on the position of the point mutation in the codon, a SNP can result in a nonsense, missense or silent mutation. Probing all nucleotide positions in the coding regions of the human genome by all possible point mutations allows the estimation of the absolute number of potential sites for a specific functional category (e.g. nonsense mutations). In a large random sample of DNA fragments from the human genome, the proportions of potential sites for different functional categories are expected to follow the proportions of the potential sites in the whole genome.

The quality of a SNP characterization depends on the method used to detect it and varies from submitter to submitter (45). To reduce discovery errors, only frequency-validated SNPs (i.e. those reported by at least two independent submitters and at least one submission validated by a non-computational method) were used in our study. The retrieved SNPs were stratified into six functional categories: (i) nonsense mutations, which change an amino acid-encoding codon into a stop codon, (ii) protein-elongating mutations, which modify a wild-type (WT) stop codon into an amino acid-encoding one, (iii) missense mutations, which change a WT amino acid into a mutant one and (iv) silent mutations, which do not change the amino acid encoded. We stratified missense mutations as (v) radical or (vi) conservative by adopting the classification system used by Dagan et al. (10). Briefly, all amino acids were subdivided into three groups according to their charge: positive (R, H, K), negative (D, E) and uncharged (A, N, C, Q, G, I, L, M, F, P, S, T, W, Y, V). The amino acids were further subdivided by volume and polarity: special (C), neutral and small (A, G, P, S, T), polar and relatively small (N, D, Q, E), polar and relatively large (R, H, K), non-polar and relatively small (I, L, M, V), and non-polar and relatively large (F, W, Y). We considered radical missense mutations to be those that change amino acid categories (e.g. R→L) and conservative missense mutations to be those that do not change amino acid category (e.g. L→V). We also analyzed splice mutations (i.e. single-nucleotide substitutions in the two first and two last nucleotides of an intron).
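As an illustration of the radical/conservative split, the two groupings listed above can be written as lookup tables in Python. The sketch assumes that a substitution is radical when it changes either the charge class or the volume-and-polarity class; the text does not spell out how the two groupings are combined, so this rule is only one possible reading and is not guaranteed to reproduce the pair counts in Table 1. All names are illustrative.

```python
# Charge classes, as listed in the text.
CHARGE = {
    **dict.fromkeys("RHK", "positive"),
    **dict.fromkeys("DE", "negative"),
    **dict.fromkeys("ANCQGILMFPSTWYV", "uncharged"),
}

# Volume-and-polarity classes, as listed in the text.
VOLUME_POLARITY = {
    **dict.fromkeys("C", "special"),
    **dict.fromkeys("AGPST", "neutral and small"),
    **dict.fromkeys("NDQE", "polar and relatively small"),
    **dict.fromkeys("RHK", "polar and relatively large"),
    **dict.fromkeys("ILMV", "non-polar and relatively small"),
    **dict.fromkeys("FWY", "non-polar and relatively large"),
}


def is_radical(wt_aa, mut_aa):
    """True if the substitution changes charge class or volume/polarity class.

    Combining the two groupings with 'or' is an assumption of this sketch.
    """
    return (CHARGE[wt_aa] != CHARGE[mut_aa]
            or VOLUME_POLARITY[wt_aa] != VOLUME_POLARITY[mut_aa])


print(is_radical("R", "L"))  # the radical example from the text
print(is_radical("L", "V"))  # the conservative example from the text
```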

For each SNP, we extracted nucleotide and amino acid variants from the original dbSNP ASN.1 file. SNPs with no data on the type of amino acid substitutions were excluded from our analysis. In some cases, more than one mutant variant was reported for a specific rsID. For example, two nucleotide variants of rs2280279, ACG(Thr) and TCG(Ser), were reported. In these cases, each variant was treated as an independent polymorphism.

Estimation of the number of potential sites

Point mutations produce three mutant variants per nucleotide; for example, ‘A’ can mutate into ‘C’, ‘G’ or ‘T’. Thus, nine point mutations are possible per codon, and for 64 codons there will be 64 · 9 = 576 possible WT–mutant codon pairs. These 576 pairs are all possible variants of the point mutations in a codon (see Supplementary Material, Table S1). Depending on the WT and mutant codons, each pair can be assigned to one of the nonsense, protein-elongating, radical missense, conservative missense or silent mutation categories. The proportions of the different categories of WT–mutant codon pairs are equal to the proportions of the potential sites for the corresponding functional categories in the human genome, conditional on there being no differences in codon usage or mutation rate. Different codons are used, however, with different frequencies in the human genome; for example, the frequency of CTG(Leu) is 40 times higher than the frequency of TAA(Ter). To deal with the differences in codon usage, we weighted the numbers of WT–mutant codon pairs by the codon frequencies provided by Nakamura et al. (46). To adjust for the variation in mutation rates, we used Graur and Li's (47) pseudogene-derived matrix of nucleotide substitutions (Table 2). The data in Table 2 reflect mutation rates unaffected by selection because there are no selective constraints on nucleotide replacements in pseudogenes.
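The enumeration of the 576 WT–mutant codon pairs can be sketched in Python as follows. The sketch assumes the standard genetic code and uniform codon usage (it applies no weighting by the codon frequencies of Nakamura et al.), so it reproduces only the unadjusted pair counts. The per-category mean of the Table 2 rates is a plain unweighted average chosen here for illustration and is not necessarily how the values in Table 3 were obtained. All function and variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Relative substitution rates (%) from Table 2, keyed as (from, to).
RATE = {
    ("A", "T"): 3.4, ("A", "C"): 4.5, ("A", "G"): 12.5,
    ("T", "A"): 3.3, ("T", "C"): 13.8, ("T", "G"): 4.2,
    ("C", "A"): 4.2, ("C", "T"): 20.7, ("C", "G"): 4.6,
    ("G", "A"): 20.4, ("G", "T"): 4.4, ("G", "C"): 4.9,
}


def classify(wt, mut):
    """Assign a WT-mutant codon pair to a functional category."""
    wt_aa, mut_aa = CODE[wt], CODE[mut]
    if wt_aa == mut_aa:
        return "silent"              # includes stop -> stop changes
    if mut_aa == "*":
        return "nonsense"
    if wt_aa == "*":
        return "protein-elongating"
    return "missense"


def codon_pairs():
    """Yield (wt, mutant, (from_base, to_base)) for all single-nucleotide changes."""
    for wt in CODE:
        for pos in range(3):
            for base in BASES:
                if base != wt[pos]:
                    yield wt, wt[:pos] + base + wt[pos + 1:], (wt[pos], base)


counts = Counter()
rate_sum = Counter()
for wt, mut, substitution in codon_pairs():
    category = classify(wt, mut)
    counts[category] += 1
    rate_sum[category] += RATE[substitution]

# Unweighted mean relative substitution rate per category (illustrative only).
mean_rate = {cat: rate_sum[cat] / counts[cat] for cat in counts}

print(dict(counts))
print({cat: round(rate, 2) for cat, rate in mean_rate.items()})
```

The pair counts obtained this way (23 nonsense, 23 protein-elongating, 392 missense and 138 silent) agree with the "Number of WT-mutant pairs" column of Table 1.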

According to the International Human Genome Sequencing Consortium, the total length of the coding sequences in the human genome is ∼34 Mb (48). If we allow that each nucleotide can produce three point mutations, then the total potential number of point mutations in the coding regions of the human genome is ∼1.02×10⁸. On the basis of this total and the estimated proportion of potential sites for each functional category, we predicted the absolute number of potential sites for each functional category.
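A short Python sketch of this step follows. It starts from 34 Mb of coding sequence with three possible substitutions per nucleotide and uses the codon-usage-adjusted expected fractions and the observed SNP counts from Table 1. Choosing the codon-usage-adjusted column is an assumption of the sketch, made because it gives a nonsense proportion close to the ∼2.2×10⁻⁵ quoted in the Results; the variable names are illustrative.

```python
CODING_SEQUENCE_NT = 34e6     # ~34 Mb of coding sequence
SUBSTITUTIONS_PER_NT = 3      # each nucleotide can mutate into three others

total_potential_mutations = CODING_SEQUENCE_NT * SUBSTITUTIONS_PER_NT

# Expected fraction of potential sites (%), codon-usage-adjusted column of Table 1.
expected_fraction_cu = {
    "nonsense": 3.9,
    "protein-elongating": 0.2,
    "missense": 73.0,
    "silent": 22.9,
}
# Observed numbers of frequency-validated SNPs (Table 1).
observed_snps = {
    "nonsense": 88,
    "protein-elongating": 21,
    "missense": 13838,
    "silent": 15222,
}

# Absolute number of potential sites per category.
potential_sites = {
    cat: total_potential_mutations * pct / 100
    for cat, pct in expected_fraction_cu.items()
}
# Proportion of potential sites observed to be polymorphic.
p_pol = {cat: observed_snps[cat] / potential_sites[cat] for cat in potential_sites}

for cat in potential_sites:
    print(f"{cat:20s} sites={potential_sites[cat]:.3g}  P_pol={p_pol[cat]:.2e}")
```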

Computation of selection coefficients

There is general agreement that point mutations in coding regions of the human genome are under purifying selection. For deleterious mutations, there should be an equilibrium between the decrease in the number of mutations due to purifying selection and the increase due to new mutations. Let us consider a locus with two alleles, ‘A’ and ‘a’, in a large, randomly mating, sexually reproducing population. Let ‘A’ be the normal allele and ‘a’ the mutant allele. Let the genotypic values for fitness be 1 for AA, 1−hs for Aa and 1−s for aa, where h is the dominance coefficient and s (0<s<1) is the selection coefficient. At equilibrium, the frequency of the mutant allele will be
\[q{=}{\mu}/hs,\]
where q is the equilibrium frequency of mutant allele a and μ is the mutation rate (i.e. the proportion of A alleles that mutate to a alleles each generation) (49). It is widely believed that the distributions of h on metric phenotypic traits are leptokurtic and highly deleterious mutations tend to be nearly recessive (33–35). However, the distribution of h on fitness is less leptokurtic, with the coefficient of dominance similar for different mutations (50). Therefore, it is reasonable to assume that fitness mutations have the same h (close to 0.1) (51). For simplicity, we suggest that h is the same for the different functional categories of point mutations.
The average equilibrium frequency of the mutant alleles for the ith category, \(\bar{q}_i\), is the parameter of interest. We cannot use only SNPs with reported allele frequencies because the estimate will be upwardly biased. To have an unbiased estimate of \(\bar{q}_i\), one should consider all potential sites of the ith functional category in the genome regardless of whether they are polymorphic (i.e. segregating). Let \(\bar{p}_i\) be the average frequency of the WT allele A and \(\bar{q}_i\) the estimated average frequency of the mutant allele a.
\[{\bar{p}}_{i}{=}1{-}{\bar{q}}_{i}.\]
The probability that a site will be monomorphic (i.e. non-segregating) after screening n randomly chosen chromosomes is
\[P_{i}^{\mathrm{mon}}{=}(1{-}{\bar{q}}_{i})^{n}.\]
Because screening equally targets all functional categories in the targeted DNA fragment, n should be the same for different functional categories of the point mutations.
We can rewrite Eq. (3) as
\[{\bar{q}}_{i}{=}1{-}(P_{i}^{\mathrm{mon}})^{1/n}.\]
\(P_i^{\mathrm{mon}}\) can be estimated as the fraction of monomorphic sites among all potential sites of the ith category. From Eq. (4),
\[{\bar{q}}_{i}{=}1{-}(1{-}P_{i}^{\mathrm{pol}})^{1/n},\]
where \(P_i^{\mathrm{pol}}\) is the proportion of polymorphic sites of the ith category. Because \(P_i^{\mathrm{pol}}\) was very small (∼10⁻⁴) for all functional categories analyzed (Table 4), we could simplify Eq. (5) as
\[{\bar{q}}_{i}\approx 1-\left(1-\frac{P_{i}^{\mathrm{pol}}}{n}\right)=\frac{P_{i}^{\mathrm{pol}}}{n}\]
and rewrite Eq. (1) as
\[\frac{P_{i}^{\mathrm{pol}}}{n}{=}\frac{{\mu}_{i}}{h{\cdot}s_{i}}{\ }\mathrm{or}{\ }s_{i}{=}\frac{{\mu}_{i}{\cdot}n}{P_{i}^{\mathrm{pol}}{\cdot}h}\]
If we assume that the selection coefficient is highest against nonsense mutations among all point mutations and arbitrarily set it to a value of 1, then the relative selection coefficient \(s_{i,\mathrm{rel}}\) for the other functional categories can be computed as
\[s_{i,\mathrm{rel}}=\frac{s_{i}}{s_{\mathrm{non}}}=\frac{\mu_{i}\cdot n\cdot h\cdot P_{\mathrm{non}}^{\mathrm{pol}}}{P_{i}^{\mathrm{pol}}\cdot h\cdot n\cdot \mu_{\mathrm{non}}}=\frac{\mu_{i}\cdot P_{\mathrm{non}}^{\mathrm{pol}}}{P_{i}^{\mathrm{pol}}\cdot \mu_{\mathrm{non}}}.\]

Therefore, the relative selection coefficient for point mutations of the ith category equals the ratio of the proportion of polymorphic nonsense sites to the proportion of polymorphic sites of the ith category, adjusted for mutation rates. The ratio of the absolute mutation rates is the same as the ratio of the relative mutation rates; therefore, we can use the ratio \(\mu_i/\mu_{\mathrm{non}}\) from Table 3 for Eq. (7).
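Eq. (7) can be illustrated with the counts and codon-usage-adjusted fractions of Table 1. In the Python sketch below, the mutation-rate ratio is set to 1 for every category; that is a placeholder assumption, since the Table 3 values are not reproduced here, and the function and variable names are illustrative. Because only ratios of polymorphic proportions enter Eq. (7), the total number of potential mutations cancels and is omitted.

```python
def relative_selection(p_pol_i, p_pol_non, mu_ratio=1.0):
    """Eq. (7): s_i,rel = (mu_i / mu_non) * P_non^pol / P_i^pol.

    mu_ratio stands for mu_i / mu_non; the default of 1.0 is a placeholder.
    """
    return mu_ratio * p_pol_non / p_pol_i


# Observed SNP counts and codon-usage-adjusted expected fractions (%) from Table 1.
observed_snps = {
    "nonsense": 88,
    "protein-elongating": 21,
    "missense": 13838,
    "radical missense": 8378,
    "conservative missense": 5460,
    "silent": 15222,
}
expected_fraction_cu = {
    "nonsense": 3.9,
    "protein-elongating": 0.2,
    "missense": 73.0,
    "radical missense": 62.9,
    "conservative missense": 10.1,
    "silent": 22.9,
}

# Proportion of polymorphic sites up to a common constant factor, which
# cancels in the ratio of Eq. (7).
p_pol = {cat: observed_snps[cat] / expected_fraction_cu[cat] for cat in observed_snps}

s_rel = {cat: relative_selection(p_pol[cat], p_pol["nonsense"]) for cat in p_pol}

for cat, value in s_rel.items():
    print(f"{cat:22s} {value:.3f}")
```

With the placeholder ratio of 1, the sketch gives values near 0.21 for protein-elongating, 0.12 for missense and 0.03 for silent mutations; the exact values depend on the mutation-rate ratios actually used.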

Analysis of splice mutations

To validate our hypothesis of a relatively high deleteriousness of silent mutations resulting from their effect on splicing, we estimated the average selection coefficient against splice mutations in the human genome. It is well known that the first two nucleotides (donor site) and the last two nucleotides (acceptor site) in an intron are almost invariantly GT and AG, respectively. There are about 20 000–25 000 genes in the human genome (48) and an average of 7.8 exons per gene (52), that is, ∼6.8 introns per gene with four splice-site nucleotides each. Therefore, there are ∼23 000 · 6.8 · 4 ≈ 625 600 nucleotides in the splice sites of the human genome, which suggests that there are 625 600 · 3 ≈ 1 877 000 potential splice mutations in the genome. We used dbSNP data to identify SNPs located in splice sites.
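The count of potential splice mutations can be reproduced with the following Python sketch, which uses the figures given in the text (23 000 genes, 7.8 exons per gene, 95 frequency-validated splice SNPs). Taking the number of introns as exons minus one is the assumption behind the factor 6.8; the variable names are illustrative.

```python
GENES = 23_000
EXONS_PER_GENE = 7.8
SPLICE_NT_PER_INTRON = 4      # two donor-site and two acceptor-site nucleotides
SUBSTITUTIONS_PER_NT = 3
VALIDATED_SPLICE_SNPS = 95    # frequency-validated splice SNPs in dbSNP

# One fewer intron than exons per gene (assumption behind the factor 6.8).
introns_per_gene = EXONS_PER_GENE - 1

splice_site_nt = GENES * introns_per_gene * SPLICE_NT_PER_INTRON
potential_splice_mutations = splice_site_nt * SUBSTITUTIONS_PER_NT

# Proportion of potential splice mutations observed as polymorphisms.
p_splice_pol = VALIDATED_SPLICE_SNPS / potential_splice_mutations

print(f"splice-site nucleotides: {splice_site_nt:,.0f}")
print(f"potential splice mutations: {potential_splice_mutations:,.0f}")
print(f"P_splice^pol: {p_splice_pol:.2e}")
```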

SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG Online.

ACKNOWLEDGEMENTS

We thank two anonymous reviewers and also A.S. Kondrashov and S. Sunyaev for discussion and critical comments. M.K. was supported in part by the NCI grant CA 75432 and C.A. was supported by ES09912 and P01 CA34936.

Conflict of Interest statement. The authors declare no conflicts of interest.

APPENDIX

Expression for the confidence interval of the estimated relative selection coefficient

On the basis of expression (7), we see that the relative selection coefficient has the form
\[s_{i,\mathrm{rel}}=\frac{\mu_{i}\cdot P_{\mathrm{non}}^{\mathrm{pol}}}{P_{i}^{\mathrm{pol}}\cdot \mu_{\mathrm{non}}}\]
where
\[\begin{array}{l}P_{\mathrm{non}}^{\mathrm{pol}}=\dfrac{N_{\mathrm{non}}^{\mathrm{pol}}}{N_{\mathrm{non}}}\\[2ex]P_{i}^{\mathrm{pol}}=\dfrac{N_{i}^{\mathrm{pol}}}{N_{i}}\end{array}\]
and \(N_{\mathrm{non}}^{\mathrm{pol}}\) is the observed number of polymorphic (segregating) sites for nonsense mutations, \(N_{\mathrm{non}}\) is the total potential number of nonsense mutations in the human genome, \(N_i^{\mathrm{pol}}\) is the observed number of polymorphic (segregating) sites for mutations of the ith category and \(N_i\) is the total potential number of mutations of the ith category in the human genome. The frequencies \(P_{\mathrm{non}}^{\mathrm{pol}}\) and \(P_i^{\mathrm{pol}}\) have distributions that can be derived in the usual way from the binomial distributions of the numerators (53). Therefore, the standard deviation of \(s_{i,\mathrm{rel}}\) can be estimated on the basis of the delta method (53).
\[\sigma(s_{i,\mathrm{rel}})=\frac{\mu_{i}N_{i}}{\mu_{\mathrm{non}}N_{\mathrm{non}}}\sqrt{\frac{(N_{i}^{\mathrm{pol}})^{2}\,N_{\mathrm{non}}^{\mathrm{pol}}\,(1-N_{\mathrm{non}}^{\mathrm{pol}}/N_{\mathrm{non}})+(N_{\mathrm{non}}^{\mathrm{pol}})^{2}\,N_{i}^{\mathrm{pol}}\,(1-N_{i}^{\mathrm{pol}}/N_{i})}{(N_{i}^{\mathrm{pol}})^{4}}}\]
The 95% confidence intervals for si,rel will then have the form
\[\left(s_{i,\mathrm{rel}}-1.96\,\sigma(s_{i,\mathrm{rel}}),\ s_{i,\mathrm{rel}}+1.96\,\sigma(s_{i,\mathrm{rel}})\right).\]
The approximation holds whenever the standard deviation is relatively small, which seems to be an acceptable assumption in our case.
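The delta-method standard deviation and the resulting confidence interval can be sketched in Python as follows. The example inputs are illustrative: the observed counts are taken from Table 1, the potential-site totals assume 34 Mb × 3 possible mutations scaled by the codon-usage-adjusted fractions, and the mutation-rate ratio is set to 1 as a placeholder because Table 3 is not reproduced here. The function name and signature are hypothetical.

```python
import math


def relative_selection_ci(n_pol_i, n_i, n_pol_non, n_non, mu_ratio=1.0, z=1.96):
    """Return (s_rel, sigma, (lower, upper)) using the delta method.

    n_pol_i, n_pol_non: observed polymorphic sites (category i, nonsense)
    n_i, n_non:         potential sites (category i, nonsense)
    mu_ratio:           mu_i / mu_non (placeholder default of 1.0)
    z:                  normal quantile (1.96 for a 95% interval)
    """
    s_rel = mu_ratio * (n_pol_non / n_non) / (n_pol_i / n_i)
    # Binomial variances of the two counts, propagated through the ratio.
    numerator = (n_pol_i ** 2 * n_pol_non * (1 - n_pol_non / n_non)
                 + n_pol_non ** 2 * n_pol_i * (1 - n_pol_i / n_i))
    sigma = mu_ratio * (n_i / n_non) * math.sqrt(numerator / n_pol_i ** 4)
    return s_rel, sigma, (s_rel - z * sigma, s_rel + z * sigma)


# Illustrative inputs: missense versus nonsense.
s_rel, sigma, ci = relative_selection_ci(
    n_pol_i=13838, n_i=0.73 * 1.02e8,
    n_pol_non=88, n_non=0.039 * 1.02e8,
)
print(f"s_rel = {s_rel:.3f}, sigma = {sigma:.4f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the polymorphic proportions are very small, the relative standard deviation is dominated by the smaller of the two counts, here the number of polymorphic nonsense sites.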

Figure 1. Observed and expected fractions of the different functional categories of point mutations in the human genome. (A) Observed (black) and expected (gray) fractions of the SNPs. The expected fractions were computed on the basis of the null hypothesis that there are no differences in the strength of purifying selection against different categories of point mutations in the human genome. (B) Log of the ratio of the observed to the expected fractions of the SNPs. The horizontal dotted line is the ratio expected under the null hypothesis.

Table 1. Observed (Obs) and expected (Exp) fractions of the six functional categories of SNPs in the coding regions of the human genome

| Category | Number of WT-mutant pairs | Exp fraction of SNPs (%), no adjustment | Exp fraction of SNPs (%), adjusted for CU | Exp fraction of SNPs (%), adjusted for CU and μ | Obs SNPs, No. | Obs SNPs, fraction (%) | Obs fraction/Exp fraction | Log of Obs fraction/Exp fraction^a |
|---|---|---|---|---|---|---|---|---|
| Nonsense | 23 | 4.0 | 3.9 | 3.6 | 88 | 0.3 | 0.08 | −1.08 |
| Protein-elongating | 23 | 4.0 | 0.2 | 0.2 | 21 | 0.1 | 0.50 | −0.30 |
| Missense | 392 | 68.1 | 73 | 66.9 | 13,838 | 52.3 | 0.78 | −0.11 |
| Radical missense | 286 | 49.7 | 62.9 | 48.2 | 8,378 | 26.1 | 0.54 | −0.27 |
| Conservative missense | 106 | 18.4 | 10.1 | 18.7 | 5,460 | 26.2 | 1.41 | 0.15 |
| Silent | 138 | 23.9 | 22.9 | 29.4 | 15,222 | 47.4 | 1.61 | 0.21 |

CU, codon usage; μ, mutation rate.

^a Adjusted for CU and μ.
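
The counts of WT-mutant pairs in Table 1 can be reproduced by enumerating every single-nucleotide substitution in every codon, as in the Python sketch below. It assumes the standard genetic code, and it assumes that stop-to-stop changes are counted as silent, which is inferred from the total of 138 silent pairs rather than stated here. The radical versus conservative split of missense pairs is not attempted, because the classification criterion is not given in this section. All names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(wild_type, mutant):
    """Assign a WT-mutant codon pair to a functional category."""
    wt_aa, mt_aa = GENETIC_CODE[wild_type], GENETIC_CODE[mutant]
    if wt_aa == mt_aa:
        return "silent"  # assumption: stop-to-stop changes count as silent
    if mt_aa == "*":
        return "nonsense"
    if wt_aa == "*":
        return "protein-elongating"
    return "missense"


pair_counts = Counter()
for codon in GENETIC_CODE:
    for position, base in product(range(3), BASES):
        if base != codon[position]:
            mutant = codon[:position] + base + codon[position + 1:]
            pair_counts[classify(codon, mutant)] += 1

print(dict(pair_counts))
```

Each of the 64 codons has 9 single-nucleotide neighbours, so the four categories together account for 576 ordered WT-mutant pairs.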

Table 2. Relative frequencies of single-nucleotide substitutions estimated by the analysis of processed pseudogenes [after (47)]

| From \ To | A | T | C | G |
|---|---|---|---|---|
| A | – | 3.4±0.7 | 4.5±0.8 | 12.5±1.1 |
| T | 3.3±0.6 | – | 13.8±1.9 | 4.2±0.5 |
| C | 4.2±0.5 | 20.7±1.3 | – | 4.6±0.6 |
| G | 20.4±1.4 | 4.4±0.6 | 4.9±0.7 | – |

The overall mutation rate for the 12 variants is 100%.
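
As a quick consistency check on Table 2, the following Python sketch sums the 12 point estimates and reports the share contributed by transitions (A↔G and C↔T). The variable names are hypothetical, and the transition/transversion split is added here for illustration only; it is not a quantity reported in the table.

```python
# Relative substitution frequencies (%) from Table 2, keyed by (from, to).
RATES = {
    ("A", "T"): 3.4, ("A", "C"): 4.5, ("A", "G"): 12.5,
    ("T", "A"): 3.3, ("T", "C"): 13.8, ("T", "G"): 4.2,
    ("C", "A"): 4.2, ("C", "T"): 20.7, ("C", "G"): 4.6,
    ("G", "A"): 20.4, ("G", "T"): 4.4, ("G", "C"): 4.9,
}
# Purine-purine and pyrimidine-pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

total = sum(RATES.values())
transition_total = sum(rate for pair, rate in RATES.items() if pair in TRANSITIONS)
transition_share = transition_total / total

print(f"sum of all 12 rates: {total:.1f}")
print(f"transitions: {transition_total:.1f} ({100 * transition_share:.1f}% of total)")
```

The point estimates sum to slightly more than 100, which is what one would expect from independently rounded entries.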

Table 3. Average relative mutation rates (μ) for the six functional categories of SNPs in the coding regions of the human genome

| Category | Average relative μ | Average relative \(\mu/\mu_{\mathrm{non}}\) |
|---|---|---|
| Nonsense | 7.48±1.48 | 1.00 |
| Protein-elongating | 5.93±0.84 | 0.79 |
| Missense | 7.87±0.25 | 1.05 |
| Radical missense | 7.59±0.36 | 1.01 |
| Conservative missense | 8.61±0.63 | 1.15 |
| Silent | 10.20±0.59 | 1.36 |

\(\mu_{\mathrm{non}}\), μ for nonsense mutations.

Table 4. Predicted relative selection coefficients against different functional categories of SNPs in the coding regions of the human genome

| Category | NPS | OBS | \(P_{i}^{\mathrm{pol}}\) | Average relative selection coefficient (%) and 99% CI |
|---|---|---|---|---|
| Nonsense | 4 000 667 | 88 | 0.0000220 | 100 (referent) |
| Protein-elongating | 198 333 | 21 | 0.0001059 | 20.8 (13.0–28.6) |
| Missense | 74 465 667 | 13 838 | 0.0001858 | 11.8 (9.9–14.4) |
| Radical missense | 64 194 267 | 8 378 | 0.0001305 | 16.9 (13.3–20.5) |
| Conservative missense | 10 271 400 | 5 460 | 0.0005316 | 4.1 (3.1–5.1) |
| Silent | 23 335 333 | 15 222 | 0.0006523 | 3.4 (2.5–4.3) |

NPS, number of potential sites of the ith category; OBS, observed number of polymorphisms; \(P_{i}^{\mathrm{pol}}\), fraction of polymorphisms estimated as the ratio of OBS to NPS.
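
The point estimates in Table 4 can be recomputed from the NPS and OBS columns, as in the Python sketch below. It assumes that the relative selection coefficient of a category is the ratio of the nonsense polymorphic fraction to that category's polymorphic fraction, which reproduces the tabulated values to one decimal place but is inferred here rather than quoted. The confidence intervals are not recomputed, and all names are hypothetical.

```python
# (NPS, OBS) per category, copied from Table 4.
SITES_AND_SNPS = {
    "Nonsense": (4_000_667, 88),
    "Protein-elongating": (198_333, 21),
    "Missense": (74_465_667, 13_838),
    "Radical missense": (64_194_267, 8_378),
    "Conservative missense": (10_271_400, 5_460),
    "Silent": (23_335_333, 15_222),
}


def polymorphic_fraction(nps, obs):
    """Fraction of potential sites that are observed to be polymorphic."""
    return obs / nps


p_pol = {name: polymorphic_fraction(nps, obs)
         for name, (nps, obs) in SITES_AND_SNPS.items()}

# Assumed definition: selection relative to nonsense, expressed in percent.
relative_selection_pct = {name: 100 * p_pol["Nonsense"] / p
                          for name, p in p_pol.items()}

for name in SITES_AND_SNPS:
    print(f"{name:22s} P_pol = {p_pol[name]:.7f}  "
          f"s_rel = {relative_selection_pct[name]:.1f}%")
```

A higher polymorphic fraction yields a lower relative coefficient, in line with the reasoning that weaker selection allows more polymorphisms to persist.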

References

1. Collins, A., Lau, W. and De La Vega, F.M. (2004) Mapping genes for common diseases: the case for genetic (LD) maps. Hum. Hered., 58, 2–9.

2. Altshuler, D., Pollara, V.J., Cowles, C.R., Van Etten, W.J., Baldwin, J., Linton, L. and Lander, E.S. (2000) An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407, 513–516.

3. Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C.R., Lim, E.P., Kalyanaraman, N. et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet., 22, 231–238.

4. Halushka, M.K., Fan, J.B., Bentley, K., Hsie, L., Shen, N., Weder, A., Cooper, R., Lipshutz, R. and Chakravarti, A. (1999) Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet., 22, 239–247.

5. Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L. et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.

6. Wang, D.G., Fan, J.B., Siao, C.J., Berno, A., Young, P., Sapolsky, R., Ghandour, G., Perkins, N., Winchester, E., Spencer, J. et al. (1998) Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science, 280, 1077–1082.

7. Zhao, Z., Fu, Y.X., Hewett-Emmett, D. and Boerwinkle, E. (2003) Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution. Gene, 312, 207–213.

8. Sunyaev, S., Kondrashov, F.A., Bork, P. and Ramensky, V. (2003) Impact of selection, mutation rate and genetic drift on human genetic variation. Hum. Mol. Genet., 12, 3325–3330.

9. Clarke, B. (1970) Selective constraints on amino-acid substitutions during the evolution of proteins. Nature, 228, 159–160.

10. Dagan, T., Talmor, Y. and Graur, D. (2002) Ratios of radical to conservative amino acid replacement are affected by mutational and compositional factors and may not be indicative of positive Darwinian selection. Mol. Biol. Evol., 19, 1022–1025.

11. Grantham, R. (1974) Amino acid difference formula to help explain protein evolution. Science, 185, 862–864.

12. Epstein, C.J. (1967) Non-randomness of amino-acid changes in the evolution of homologous proteins. Nature, 215, 355–359.

13. Miyata, T., Miyazawa, S. and Yasunaga, T. (1979) Two types of amino acid substitutions in protein evolution. J. Mol. Evol., 12, 219–236.

14. Fay, J.C., Wyckoff, G.J. and Wu, C.I. (2001) Positive and negative selection on the human genome. Genetics, 158, 1227–1234.

15. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

16. Wicklow, B.A., Ivanovich, J.L., Plews, M.M., Salo, T.J., Noetzel, M.J., Lueder, G.T., Cartegni, L., Kaback, M.M., Sandhoff, K., Steiner, R.D. et al. (2004) Severe subacute GM2 gangliosidosis caused by an apparently silent HEXA mutation (V324V) that results in aberrant splicing and reduced HEXA mRNA. Am. J. Med. Genet. A, 127, 158–166.

17. Xie, J., Pabon, D., Jayo, A., Butta, N. and Gonzalez-Manchon, C. (2005) Type I Glanzmann thrombasthenia caused by an apparently silent beta3 mutation that results in aberrant splicing and reduced beta3 mRNA. Thromb. Haemost., 93, 897–903.

18. Denecke, J., Kranz, C., Kemming, D., Koch, H.G. and Marquardt, T. (2004) An activated 5′ cryptic splice site in the human ALG3 gene generates a premature termination codon insensitive to nonsense-mediated mRNA decay in a new case of congenital disorder of glycosylation type Id (CDG-Id). Hum. Mutat., 23, 477–486.

19. Pfarr, N., Prawitt, D., Kirschfink, M., Schroff, C., Knuf, M., Habermehl, P., Mannhardt, W., Zepp, F., Fairbrother, W., Loos, M. et al. (2005) Linking C5 deficiency to an exonic splicing enhancer mutation. J. Immunol., 174, 4172–4177.

20. Fernandez-Cadenas, I., Andreu, A.L., Gamez, J., Gonzalo, R., Martin, M.A., Rubio, J.C. and Arenas, J. (2003) Splicing mosaic of the myophosphorylase gene due to a silent mutation in McArdle disease. Neurology, 61, 1432–1434.

21. Mankodi, A. and Ashizawa, T. (2003) Echo of silence: silent mutations, RNA splicing, and neuromuscular diseases. Neurology, 61, 1330–1341.

22. Montera, M., Piaggio, F., Marchese, C., Gismondi, V., Stella, A., Resta, N., Varesco, L., Guanti, G. and Mareni, C. (2001) A silent mutation in exon 14 of the APC gene is associated with exon skipping in a FAP family. J. Med. Genet., 38, 863–867.

23. Chen, W., Kubota, S., Teramoto, T., Nishimura, Y., Yonemoto, K. and Seyama, Y. (1998) Silent nucleotide substitution in the sterol 27-hydroxylase gene (CYP 27) leads to alternative pre-mRNA splicing by activating a cryptic 5′ splice site at the mutant codon in cerebrotendinous xanthomatosis patients. Biochemistry, 37, 4420–4428.

24. Deshler, J.O. and Rossi, J.J. (1991) Unexpected point mutations activate cryptic 3′ splice sites by perturbing a natural secondary structure within a yeast intron. Genes Dev., 5, 1252–1263.

25. Zatkova, A., Messiaen, L., Vandenbroucke, I., Wieser, R., Fonatsch, C., Krainer, A.R. and Wimmer, K. (2004) Disruption of exonic splicing enhancer elements is the principal cause of exon skipping associated with seven nonsense or missense alleles of NF1. Hum. Mutat., 24, 491–501.

26. Polony, T.S., Bowers, S.J., Neiman, P.E. and Beemon, K.L. (2003) Silent point mutation in an avian retrovirus RNA processing element promotes c-myb-associated short-latency lymphomas. J. Virol., 77, 9378–9387.

27. Hellwinkel, O.J., Holterhus, P.M., Struve, D., Marschke, C., Homburg, N. and Hiort, O. (2001) A unique exonic splicing mutation in the human androgen receptor gene indicates a physiologic relevance of regular androgen receptor transcript variants. J. Clin. Endocrinol. Metab., 86, 2569–2575.

28. Akashi, H. (2001) Gene expression and molecular evolution. Curr. Opin. Genet. Dev., 11, 660–666.

29. Conrad, M., Friedlander, C. and Goodman, M. (1983) Evidence that natural selection acts on silent mutation. Biosystems, 16, 101–111.

30. Segre, D., Deluna, A., Church, G.M. and Kishony, R. (2005) Modular epistasis in yeast metabolism. Nat. Genet., 37, 77–83.

31. Conant, G.C. and Wagner, A. (2004) Duplicate genes and robustness to transient gene knock-downs in Caenorhabditis elegans. Proc. Biol. Sci., 271, 89–96.

32. Kamath, R.S., Fraser, A.G., Dong, Y., Poulin, G., Durbin, R., Gotta, M., Kanapin, A., Le Bot, N., Moreno, S., Sohrmann, M. et al. (2003) Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature, 421, 231–237.

33. Mackay, T.F., Lyman, R.F. and Jackson, M.S. (1992) Effects of P element insertions on quantitative traits in Drosophila melanogaster. Genetics, 130, 315–332.

34. Charlesworth, B. (1979) Evidence against Fisher's theory of dominance. Nature, 278, 848–849.

35. Simmons, M.J. and Crow, J.F. (1977) Mutations affecting fitness in Drosophila populations. Annu. Rev. Genet., 11, 49–78.

36. Morton, N., Crow, J.F. and Muller, H. (1956) An estimate of the mutational damage in man from data on consanguineous marriages. PNAS, 42, 855–863.

37. Fan, J.B., Gehl, D., Hsie, L., Shen, N., Lindblad-Toh, K., Laviolette, J.P., Robinson, E., Lipshutz, R., Wang, D., Hudson, T.J. et al. (2002) Assessing DNA sequence variations in human ESTs in a phylogenetic context using high-density oligonucleotide arrays. Genomics, 80, 351–360.

38. Kimura, M. (1983) The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge Cambridgeshire; New York.

39. Kimura, M. (1994) Population Genetics, Molecular Evolution, and the Neutral Theory: Selected Papers. University of Chicago Press, Chicago.

40. Yampolsky, L.Y., Kondrashov, F.A. and Kondrashov, A.S. (2005) Distribution of the strength of selection against amino acid replacements in human proteins. Hum. Mol. Genet., 14, 3191–3201.

41. Claes, K., Poppe, B., Machackova, E., Coene, I., Foretova, L., De Paepe, A. and Messiaen, L. (2003) Differentiating pathogenic mutations from polymorphic alterations in the splice sites of BRCA1 and BRCA2. Genes Chromosomes Cancer, 37, 314–320.

42. Nachman, M.W. and Crowell, S.L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics, 156, 297–304.

43. Kondrashov, A.S. (2003) Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum. Mutat., 21, 12–27.

44. Smigielski, E.M., Sirotkin, K., Ward, M. and Sherry, S.T. (2000) dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res., 28, 352–355.

45. Reich, D.E., Gabriel, S.B. and Altshuler, D. (2003) Quality and completeness of SNP databases. Nat. Genet., 33, 457–468.

46. Nakamura, Y., Gojobori, T. and Ikemura, T. (2000) Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res., 28, 292–298.

47. Graur, D. and Li, W.-H. (2000) Fundamentals of Molecular Evolution. 2nd ed. Sinauer Associates, Sunderland, MA.

48. International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature, 431, 931–945.

49. Crow, J.F. and Kimura, M. (1970) An Introduction to Population Genetics Theory. Harper & Row, New York.

50. Lyman, R.F., Lawrence, F., Nuzhdin, S.V. and Mackay, T.F. (1996) Effects of single P-element insertions on bristle number and viability in Drosophila melanogaster. Genetics, 143, 277–292.

51. Zhang, X.S., Wang, J. and Hill, W.G. (2004) Influence of dominance, leptokurtosis and pleiotropy of deleterious mutations on quantitative genetic variation at mutation-selection balance. Genetics, 166, 597–610.

52. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.

53. Sokal, R.R. and Rohlf, F.J. (1973) Introduction to Biostatistics. W.H. Freeman, San Francisco.

Supplementary data