As genes originate at different evolutionary times, they harbor distinctive genomic signatures of evolutionary ages. Although previous studies have investigated different gene age-related signatures, what signatures dominantly associate with gene age remains unresolved. Here we address this question via a combined approach of comprehensive assignment of gene ages, gene family identification, and multivariate analyses. We first provide a comprehensive and improved gene age assignment by combining homolog clustering with phylogeny inference and categorize human genes into 26 age classes spanning the whole tree of life. We then explore the dominant age-related signatures based on a collection of 10 potential signatures (including gene composition, gene length, selection pressure, expression level, connectivity in protein–protein interaction network and DNA methylation). Our results show that GC content and connectivity in protein–protein interaction network (PPIN) associate dominantly with gene age. Furthermore, we investigate the heterogeneity of dominant signatures in duplicates and singletons. We find that GC content is a consistent primary factor of gene age in duplicates and singletons, whereas PPIN is more strongly associated with gene age in singletons than in duplicates. Taken together, GC content and PPIN are two dominant signatures in close association with gene age, exhibiting heterogeneity in duplicates and singletons and presumably reflecting complex differential interplays between natural selection and mutation.
Birth of new genes is associated with events of gene duplication (Betran et al. 2002; Long et al. 2003), horizontal gene transfer (Keeling and Palmer 2008), extant gene fragments (Gilbert 1978), and de novo creations from noncoding DNA/RNA (Knowles and McLysaght 2009; Toll-Riera et al. 2009). It is considered that the birth of new genes is one of several primary mechanisms underlying the evolution of novel functions in biological systems, and often facilitating adaptive evolution (Kaessmann 2010). Accordingly, genes, that are birthed and fixed into a species at specific evolutionary time, are left with distinctive age-related signature (ARS) in the genome. Therefore, deciphering ARS in molecular sequences holds great significance in better understanding molecular evolutionary processes and unveiling the underlying mechanisms that drive young genes to become indispensible integrants coupled with novel phenotypes and biological diversities (Long et al. 2013).
To date, attempts have been made to address this issue by detecting diverse ARS. These studies revealed that young genes are shorter, have fewer introns (Wolf et al. 2009), relate closely with the birth of new binding sites (Ni et al. 2012) and harbor more premature termination codon mutations (Yang et al. 2015). Furthermore, young genes possess fewer interactions with other genes (Zhang et al. 2015) and tend to play less essential functional roles compared with old genes (Chen et al. 2012). Additionally, young human genes are likely to present distinct temporal and spatial expression patterns (Long et al. 2013; Popadin et al. 2014). It is reported that young genes evolve more rapidly (Alba and Castresana 2005; Wolf et al. 2009) and experience more variable selection pressure than old genes (Vishnoi et al. 2010). Moreover, a recent study has further shown that young and old duplicates differ strikingly in their DNA methylation (Keller and Yi 2014).
Although different aged genes differ in multiple ARS as mentioned earlier, the relative significance of different ARS is not known and it remains unresolved what signatures dominantly associate with gene age. Additionally, previous studies on age identification mainly employed similarity search that have been reported to be error-prone (Alba and Castresana 2007; Tautz and Domazet-Loso 2011; Moyers and Zhang 2015, 2016) and determined gene age based on a rough evolutionary time-scale (Domazet-Loso and Tautz 2008; Wolf et al. 2009; Zhang et al. 2010). As a result, discriminating origins of divergent homologs and capturing important evolutionary events have been difficult. In this study, we provide a newly generated, comprehensive and improved gene age identification by combining homolog clustering with phylogeny inference. Accordingly, we determine gene age at an extremely refined evolutionary time-scale and categorize human genes into 26 evolutionary age classes spanning the whole tree of life. Using this age identification, we explore dominant ARS in the human genome based on a collection of 10 potential ARS and further investigate the heterogeneity of dominant ARS in duplicates and singletons.
Results and Discussion
Age Identification of Human Genes
Improving upon previous studies on age identification (Domazet-Loso and Tautz 2008; Wolf et al. 2009; Zhang et al. 2010), here we combine homolog clustering with phylogeny inference to identify gene ages (see “Methods” section) and categorize human genes into 26 age classes ranging from archaea/bacteria (age class 26) to human (age class 1), spanning an extremely long evolutionary time-scale of ∼4,000 million years (supplementary tables S1 and S2 and fig. S1, Supplementary Material online). Our classification provides the most refined evolutionary gene-age classes so far, compared with previous studies where genes were classified into seven classes in (Wolf et al. 2009), 11 classes in (Cai and Petrov 2010) and 19 classes in (Domazet-Loso and Tautz 2008). Although our refined age classification, by virtue of increased number of classes, could misclassify ages of some genes whose sequence and annotation in the current genome assemblies include errors, our results on gene age identification present three major improvements. First, previous studies for gene age identification were typically based on homolog similarity, and thus not well suited to effectively differentiate the origins of paralogs. In comparison, our study, by utilizing homolog clustering with phylogeny inference, is able to confidently identify evolutionary ages of paralogous genes.
Second, we utilize an extremely refined phylogenetic framework consisting of 26 age classes, encompassing major evolutionary events from unicellular organisms to human. Consequently, it is capable to investigate gene loss events (a gene loss event is determined and counted when a gene is present at certain evolutionary time, but absent afterwards) in a more detailed manner based on our age identification results. For instance, a previous study has reported that genes are lost after the divergence of human and rodents (Blomme et al. 2006). Contrastingly, our results show that those specific genes are heavily lost at the origination time of primates and scandentia (supplementary fig. S2, Supplementary Material online), yielding a higher resolution determination of important evolutionary events. Meanwhile, among the 26 age classes, primate-specific evolutionary time-scale is well separated into seven different age classes (namely tarsiiformes, platyrrhini, cercopithecidae, hylobatidaee, pongo, gorillae and human), which is of great significance for better understanding details of primate evolutionary processes and innovation of primate-specific genes. For instance, MYEOV (ENSG0000017292), a gene that has been reported to de novo arise from noncoding RNA in human-specific lineage (Xie et al. 2012), actually arose at age class 4, namely, hominoid-specific lineage, indicating that its transition from noncoding RNA to a coding gene is, more precisely, occurred at the origin of hominoidea.
Third, our method based on a phylogenetic framework features effective inference of evolutionary time of gene duplication events, allowing confident age assignments of paralogs. Accordingly, we find that 11% duplication events can be traced back to the origin of metazoan ∼900 million years ago (Mya) (supplementary fig. S3, Supplementary Material online) and 16% duplication events are assigned to the origin of vertebrate ∼450 Mya (supplementary fig. S3, Supplementary Material online), indicating that the origins of multicellularity and vertebrate are fundamental, presumably with key innovations for the emergence of human genes. These results are in good accordance with the hypothesis that hierarchical complexity increases at the origin of multicellularity (Rainey 2007) and that duplication events, including whole genome duplication, are major evolutionary forces underlying vertebrate genome evolution (Blomme et al. 2006).
Dominant ARS in Human Gene
As mentioned earlier, birth of genes leads to different ARS at multiple omics levels, including genomics, transcriptomics, epigenetics, etc. Based on our age identification, here we incorporate a total of 10 potential ARS, including nucleotide composition (GC/AG content) of entire CDS (coding sequence), sequence length, CUB, expression level, natural selection inferred from nonsynonymous/synonymous substitution ratio (Ka/Ks), DNA methylation, and PPIN (supplementary fig. S4, Supplementary Material online). Since these signatures are highly interdependent (Kim and Yi 2007; Park et al. 2012), correlation analysis cannot be used to identify dominant signatures associated closely with gene age (supplementary table S3, Supplementary Material online). Therefore, we perform principal component analysis (PCA), a widely used statistical method that is able to transform a set of possibly correlated variables into a set of linearly uncorrelated principal components, to decipher which signatures are highly dominant with gene age. According to the PCA results, the first four principal components account for 74% of the variance and the first two principal components are able to explain ∼44.55% of the variance (fig. 1A and supplementary table S4, Supplementary Material online). Notably, the first component is mainly dominated by GC content (41.46%) and the second component is largely determined by PPIN (22.79%). Our results clearly show that, albeit ARS are evolutionarily confounded and interrelated, GC content and PPIN are two dominant signatures associating closely with gene age. Additionally, to avoid bias due to the large number of genes in the strata of unicellular organisms (supplementary fig. S5, Supplementary Material online), we re-sample the same percentage of genes from unicellular organisms as that of multicellular organisms (supplementary table S5, Supplementary Material online) and consistently obtain the similar results that GC content and PPIN are dominant ARS.
Duplication is a main driving mechanism in the birth of new genes (Gu et al. 2002; Zhang 2003). It is reported that singletons evolve more rapidly (Jordan et al. 2004) and tend to have more consistent expression profiles than duplicates (Li et al. 2005). Given these observations, we hypothesize that dominant ARS may be different between duplicates and singletons. Therefore, we further perform PCA separately on singletons and duplicates. Our results show that in singletons (fig. 2A and supplementary table S6, Supplementary Material online), consistent with previous results, the first component is determined mainly by GC content (35.21%) and the second component is determined mainly by PPIN (26.14%). Intriguingly, in duplicates (fig. 2B and supplementary table S7, Supplementary Material online), the first component is determined mainly by GC content (35.79%) and the second component is still determined by GC content (30.63%). Together, GC content consistently dominates as a primary signature with gene age in duplicates and singletons, whereas PPIN dominates with gene age more significantly in singletons than in duplicates. These results indicate that duplicates and singletons may experience diverse evolutionary forces and yield different dominant signatures of gene age.
It is well documented that GC content, as one of the most fundamental gene features, is highly correlated with multiple factors [including mutation (Fryxell and Moon 2005), selection for specific synonymous codons for translation efficiency and accuracy (Plotkin and Kudla 2011), horizontal gene transfer (Philippe and Douady 2003), methylation modification (Bird 1986; Elango et al. 2008), and gene density (Duret et al. 1995), etc.]. It is notable that even though we separately examined factors known to associate with GC content, including gene length, expression, codon usage, selection strength and methylation, we still observe the dominant significance of GC content in the evolution. Therefore, our results indicate significant effects of GC content apart from the aforementioned factors. For example, replication dynamics correlates with GC content (Kenigsberg et al. 2016) and GC-biased gene conversion specifically in highly recombining genomic regions affects the genomes of most bacterial species (Lassalle et al. 2015) and many eukaryotes (Webster and Hurst 2012), further providing the possibility that the dominance of GC content helps shape the genome characterization universally. Consistently and strikingly, our results demonstrate that regardless of being duplicates or singletons, GC content is an overwhelmingly dominant signature associating closely with gene age. Conforming to this point, additional evidence has shown by a recent study that de novo new genes originating from long noncoding RNAs present heterogeneity in GC content (Chen et al. 2015).
It has been reported that more than one-third of known regulatory interactions in yeast (Teichmann and Babu 2004) and average 27% interaction networks for primate-specific young genes in human (Zhang et al. 2015) are inherited from their parental genes after duplication, so that duplication is a significant contributor of gene interaction network (Middendorf et al. 2005). In contrast to duplicates that have inherited PPIN from parental copies, singletons have little interactions at their early evolutionary stage, but, over time, they are gradually integrated into gene interaction networks to acquire biological functions. As genes evolve and age in the genome, therefore, singletons may experience more dramatic variations in PPIN than duplicates. Indeed, singletons do exhibit a much larger variability in PPIN compared with duplicates (supplementary fig. S6, Supplementary Material online). Consequently, even though old genes (including duplicates and singletons) tend to be highly connected in PPIN (Zhang et al. 2015), PPIN appears to be a more important signature of evolutionary age in singletons. Taken together, GC content and PPIN are two dominant signatures in close association with gene age, yet exhibiting heterogeneity in duplicates and singletons and presumably reflecting complex differential interplays between natural selection and mutation as they age.
Age Definition and Identification
For a given human gene, age was defined based on the presence of its ortholog in a wide range of species. We downloaded protein sequences from Ensembl (http://www.ensembl.org; supplementary table S1, Supplementary Material online) and obtained a collection of nonredundant proteins by only keeping longest splicing variants (supplementary table S1 and fig. S1, Supplementary Material online). BLAST searches were constructed for all nonredundant proteins (E-value < 10 −3). Furthermore, we conducted homolog clustering using the Markov Cluster algorithm (inflation value = 1.5) with OrthoMCL (Li et al. 2003) after loading BLAST results into MySQL database. Consequently, we assigned all resulting proteins into 35,948 homolog clusters. Among them, 12,493 singleton groups and 2,142 duplicate groups included human homologs. To infer the orthology relationships for duplicate group, multiple sequence alignments were conducted by MAFFT (Katoh and Standley 2013) and spurious sequences or poorly aligned regions were removed by trimAl (Capella-Gutierrez et al. 2009). Furthermore, we carried out phylogenetic inferences by phyML (Guindon et al. 2010) with bootstrap resampling tests by 100 times and utilized RIO (Resampled Inference of Orthologs; reliability values > 0.6) (Zmasek and Eddy 2002) for automated phylogeny inference to estimate the reliability of orthology assignments. As a consequence, we classified all human genes (including singletons and duplicates) into 25 age classes from the origin of eukaryotes, spanning ∼1,500 Mya (age class 25) to human. Moreover, we used PANTHER (Mi et al. 2013) to determine orthology relationships between human genes and archaeal/bacterial genes. A human gene was assigned to age class 26 originating from ∼4,000 Mya if its orthologs were detected in at least two archaeal/bacterial organisms (given the possibility of horizontal gene transfer). Detailed results of age identification for all human genes were tabulated in supplementary table S2, Supplementary Material online. In addition, gene ontology (GO) enrichment analyses were conducted and the corresponding results were summarized in supplementary table S8, Supplementary Material online.
We used homolog relationships between Homo sapiens and Mus musculus from NCBI HomoloGene database (http://www.ncbi.nlm.nih.gov/homologene) and obtained gene expression profiles across human 32 tissues from (Uhlen et al. 2015). We collected methylation data from GSE database with accession number GSE31848, including eight somatic cell samples (GSM868007-GSM868014) (Nazor et al. 2012). We retrieved PPIN data from (Cowley et al. 2012).
Estimation of Selection Pressure, Codon Usage Bias and Methylation Level
KaKs_Calculator (Zhang et al. 2006) was adopted to calculate nonsynonymous and synonymous substitution rates for human–mouse orthologs. Codon Deviation Coefficient (Zhang et al. 2012) was used to measure CUB for human genes as well as their orthologs among different species (supplementary fig. S7, Supplementary Material online). Methylation levels were estimated through an R package named Illumina Methylation Analyzer (Wang et al. 2012). For a given gene, DNA methylation level was averaged over its four regions including gene body region, promoter region, 5′- and 3′-UTR.
Principal Component Analysis
We used R for principle component analysis (package: pls). After logarithm transformation of four features including gene length, expression level, methylation level and PPIN, all features were scaled and normalized into [0, 1].
This work was supported by grants from The Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040500 to Z.Z.), National Programs for High Technology Research and Development (863 Program; 2015AA020108 and 2012AA020409 to Z.Z.), International Partnership Program of the Chinese Academy of Sciences (153F11KYSB20160008), National Natural Science Foundation of China (Grant No. 31200978 to L.M.) and the “100-Talent Program” of Chinese Academy of Sciences (awarded to Z.Z.). This study was also supported by an NSF grant (SBE-131719), an NIH grant (1R01MH103517-01A1), and a Zoo Atlanta-Georgia Tech collaborative grant to S.V.Y.