Lin Xu, Hong Chen, Xiaohua Hu, Rongmei Zhang, Ze Zhang, Z. W. Luo, Average Gene Length Is Highly Conserved in Prokaryotes and Eukaryotes and Diverges Only Between the Two Kingdoms, Molecular Biology and Evolution, Volume 23, Issue 6, June 2006, Pages 1107–1108, https://doi.org/10.1093/molbev/msk019
Abstract
The average length of genes in a eukaryote is greater than in a prokaryote, implying that the evolution of complexity is related to changes in gene length. Here, we show that although the average lengths of genes in prokaryotes and eukaryotes differ markedly, the average gene length is highly conserved within each of the two kingdoms. This suggests that natural selection has set a strong limit on gene elongation within each kingdom. Furthermore, the average gene size provides another distinct characteristic for discriminating between the two kingdoms of organisms.
Background
Gene elongation is recognized as one of the most important steps in the evolution of functional complexities of genes (Li 1997) and in the evolution of new genes (Long et al. 2003). Zhang (2000) calculated the mean and median lengths of the proteins from 22 species, including several representative organisms such as Escherichia coli, yeast, nematode, Drosophila, humans, and Arabidopsis, whose genome sequence information was available at the time. He observed that orthologous genes are longer in eukaryotes than in prokaryotes and that eukaryote-specific proteins are longer on average than prokaryote-specific proteins. Wang, Hsieh, and Li (2005) analyzed orthologous protein data in detail by reconstructing the ancestral states among the eukaryotes in question. They found that proteins in yeast, nematode, Drosophila, humans, and Arabidopsis are, on average, longer than their orthologs in E. coli and observed conservation of protein sequence length across eukaryotic kingdoms. We present here a more general pattern of the size of the coding sequence of prokaryotic and eukaryotic genes and show that the mean length of genic coding sequence (MLGCS) is highly conserved within prokaryotes and within eukaryotes but diverges between the two kingdoms.
Results and Discussion
We surveyed almost all prokaryotic and eukaryotic species for which complete, well-annotated genome sequence data are available to date. These comprise 81 prokaryotes and 19 eukaryotes for which the predictions of the coding sequences have been validated; they are listed in Tables 1 and 2 of the Supplementary Material online. The tables also give the genome sequence size (N) in kilobase pairs, the number of predicted genes (n), and the ratio of coding sequence to genome sequence (r) for each species, together with the key reference from which the data were collected. From these parameters, we calculated the MLGCS within each species as L = N × r/n. We regressed the estimate of total coding sequence length on the estimate of the number of genes for each of the two groups of species and present the analyses in figure 1.
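The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mlgcs_bp` and `ols_fit` are hypothetical, the three example genomes are invented values rather than entries from the supplementary tables, and the use of an ordinary least-squares fit with an intercept is an assumption, because the text does not specify how the regression was fitted. Since N is given in kilobase pairs, the sketch multiplies by 1,000 to express L in base pairs.

```python
def mlgcs_bp(genome_kb, coding_ratio, n_genes):
    """Mean length of genic coding sequence in bp: L = N * r / n (N converted from kb to bp)."""
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    return genome_kb * 1000.0 * coding_ratio / n_genes


def ols_fit(xs, ys):
    """Ordinary least-squares slope and intercept of ys regressed on xs (assumed fitting method)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical genomes: (genome size N in kb, coding ratio r, gene count n).
# These are invented for illustration and are NOT taken from the paper's tables.
genomes = [(2000.0, 0.90, 2000), (4000.0, 0.88, 3800), (6000.0, 0.87, 5600)]

per_species_L = [mlgcs_bp(N, r, n) for N, r, n in genomes]
total_coding_bp = [N * 1000.0 * r for N, r, _ in genomes]
gene_counts = [n for _, _, n in genomes]

# The slope estimates the coding length added per additional gene, in bp.
slope, intercept = ols_fit(gene_counts, total_coding_bp)
print(per_species_L, slope, intercept)
```

If the points fall on a line through the origin, the slope equals the common mean coding length per gene, which is why a tight linear fit corresponds to a conserved MLGCS within a group.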

Fig. 1. Analysis of regression of total coding sequence length on the number of genes in 81 prokaryotic species and in 19 eukaryotic species.
It can be seen from figure 1 that there is a perfect linear relationship between the total coding sequence length and the number of genes in both prokaryotic and eukaryotic genomes. A perfect linear relationship also holds between the number of genes and the total genome sequence length in the prokaryotes, but not in the eukaryotes, probably because of introns, transposable elements, and junk DNA in the eukaryotic genomes (data not shown). The mean (standard error), coefficient of skewness, and coefficient of kurtosis of MLGCS were estimated as 924 (9) bp, 0.1952, and 3.3501 for the prokaryotic group and as 1,346 (28) bp, 0.1723, and 2.5661 for the eukaryotic group, respectively. The analyses indicate that the genic coding sequence has a relatively constant average length in both prokaryotes and eukaryotes, despite the remarkable variation in coding sequence length among individual genes within these genomes. The coding sequence of a gene in the eukaryote kingdom is on average 445 bp longer than that in the prokaryotes.
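The summary statistics reported above can be computed from a list of per-species MLGCS values along the following lines. This is a minimal sketch, not the authors' procedure: the function name `describe` is hypothetical, the moment-based definitions of skewness (m3 / m2^1.5) and kurtosis (m4 / m2^2, for which a normal distribution gives 3) are assumed because the text does not state which estimators were used, and the input values are invented.

```python
import math


def describe(values):
    """Return mean, standard error, moment skewness, and (non-excess) moment kurtosis."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 values")
    mean = sum(values) / n
    # Central moments with divisor n (population form); an assumed convention.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("values must not all be equal")
    # Standard error of the mean from the sample standard deviation (divisor n - 1).
    sample_sd = math.sqrt(m2 * n / (n - 1))
    std_err = sample_sd / math.sqrt(n)
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # not excess kurtosis; a normal distribution gives 3
    return mean, std_err, skewness, kurtosis


# Hypothetical per-species MLGCS values in bp, invented for illustration only.
example_mlgcs = [880.0, 905.0, 921.0, 930.0, 948.0, 975.0]
print(describe(example_mlgcs))
```

Assuming the standard error is the standard deviation divided by √n, the reported values would imply among-species standard deviations of about 9 × √81 = 81 bp for the 81 prokaryotes and 28 × √19 ≈ 122 bp for the 19 eukaryotes.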
It is widely accepted that natural selection favors shorter genic coding sequences for higher transcriptional efficiency, for efficient protein synthesis, and for avoiding the accumulation of deleterious mutations, on the one hand, but evolution seems to improve the function of a protein by elongating its coding sequence, on the other (Li 1997; Zhang 2000; Akashi 2003; Claverie and Ogata 2003; Wang, Hsieh, and Li 2005). Schneider and Ebert (2004) have recently argued that the covariation between genome size and gene length is expected to be strongest in the smallest genomes and that selection for reduced gene length becomes progressively weaker as genomes become larger. Our observation suggests that there is a stringent structural constraint on the evolution of gene size on a genomic scale. Species that diverged more than a few billion years ago in either the prokaryotic (Prochlorococcus marinus) or the eukaryotic (Ashbya gossypii) group share a relatively constant mean gene size. The mean gene size provides another distinct characteristic for discriminating between the two kingdoms of organisms. Wang, Hsieh, and Li (2005) observed a tendency toward conservation in the length of orthologous proteins among the five eukaryotic genomes, which is not very surprising given that the comparison was made between proteins that are highly evolutionarily conserved across the species. The present study considers whole-genome coding sequence, contrasts it against the number of genes in the genome, and thus reveals a more general tendency in gene length evolution.
Questions may arise about the use of the coding sequence data and the gene numbers predicted for the eukaryotic genomes because the current state of de novo gene prediction from sequence data may have various intrinsic limitations (Zhang 2002). The 19 eukaryotic genomes selected from 35 candidates for this study have gene annotations validated either through genome-wide cDNA and/or expressed sequence tag comparison or through comparison of the gene predictions between one genome and that of a closely related species (see Supplementary Material online). The reliability of these criteria has been tested on three yeast species (Kellis et al. 2003). Moreover, all 19 eukaryotes also clear the commonly recommended 70% accuracy hurdle (Bork 2000).
William Martin, Associate Editor
We thank two anonymous reviewers for their constructively critical comments, which have helped improve the presentation of this paper, and we are particularly indebted to the reviewer who pointed out the perfect linear relationship between the number of genes and the total length of genome sequence in the prokaryotes. This study was supported by China's National Natural Science Foundation (30430380) and the Basic Research Program of China (2004CB518605). Z.W.L. is also supported by research grants from the Biotechnology and Biological Sciences Research Council and the Natural Environment Research Council of the United Kingdom.
References
Claverie, J., and H. Ogata.
Kellis, M., N. Patterson, M. M. Endrizzi, B. Birren, and E. S. Lander.
Long, M. Y., E. Betran, K. Thornton, and W. Wang.
Schneider, A., and D. Ebert.
Wang, D. Y., M. F. Hsieh, and W. H. Li.
Zhang, J.
Author notes
*Laboratory of Population and Quantitative Genetics, School of Life Sciences and Institute of Biomedical Sciences, Fudan University, Shanghai, China; and †School of Biosciences, University of Birmingham, Edgbaston, Birmingham, United Kingdom