Evolution of DNA methylation in Papio baboons

Changes in gene regulation have long been thought to play an important role in primate evolution. However, although a number of studies have compared genome-wide gene expression patterns across primate species, fewer have investigated the gene regulatory mechanisms that underlie such patterns, or the relative contribution of drift versus selection. Here, we profiled genome-scale DNA methylation levels from five of the six extant species of the baboon genus Papio (4–14 individuals per species). This radiation presents the opportunity to investigate DNA methylation divergence at both shallow and deeper time scales (380,000 – 1.4 million years). In contrast to studies in human populations, but similar to studies in great apes, DNA methylation profiles clearly mirror genetic and geographic structure. Divergence in DNA methylation proceeds fastest in unannotated regions of the genome and slowest in regions of the genome that are likely more constrained at the sequence level (e.g., gene exons). Both heuristic approaches and Ornstein-Uhlenbeck models suggest that DNA methylation levels at a small set of sites have been affected by positive selection, and that this class is enriched in functionally relevant contexts, including promoters, enhancers, and CpG islands. Our results thus indicate that the rate and distribution of DNA methylation changes across the genome largely mirror genetic structure. However, at some CpG sites, DNA methylation levels themselves may have been a target of positive selection, pointing to loci that could be important in connecting sequence variation to fitness-related traits.

targeted by selection, than protein-coding changes (Stern 2000). In addition, regulatory regions 26 are believed to have larger mutational target sizes, increasing the rate at which they may evolve 27 (Landry, et al. 2007). In support of the importance of regulatory evolution, a number of studies 28 have identified regulatory changes that contribute to species-specific adaptations. For example, 29 non-coding variants that regulate the ectodysplasin and pitx1 genes underlie morphological 30 changes that separate saltwater threespine sticklebacks (Gasterosteus aculeatus) from their close 31 expression in humans also have larger integrated haplotype scores, providing evidence for recent 48 positive selection (Nédélec, et al. 2016; Kim-Hellmuth, et al. 2017). Consistent with these 49 findings, variants associated with disease risk, fecundity, and other selectively relevant traits are 50 often found within non-coding regions, and likely affect gene expression levels (Nicolae,et al. 51 2010; Wray 2013). 52 The second approach investigates patterns of gene expression across species to search for 53 cases consistent with adaptive evolution. Several patterns have emerged from this work. First, 54 overall differences in gene expression accumulate over evolutionary time, such that more closely 55 related species have more similar gene expression profiles. Global clustering approaches from 56 the same tissue thus tend to faithfully reproduce the species phylogeny (Brawand, et al. 2011; 57 Sudmant, et al. 2015), and exceptions to this pattern suggest possible cases of natural selection. 58 For example, gene expression levels in testis, but not in other tissues, group humans and gorillas 59 to the exclusion of chimpanzees and bonobos (Brawand, et al. 2011). This pattern is consistent 60 with elevated sexual selection on male reproductive physiology in chimpanzees and bonobos, 61 which are characterized by unusually large testis to body size ratios relative to other primates 62 (Schultz 1938). Second, stabilizing selection appears to constrain most gene expression levels. 63 Comparative analyses of gene expression have found that most genes are characterized by low 64 levels of intra-and inter-specific divergence, a pattern consistent with stabilizing selection 65 (Rifkin, et al. 2003;Gilad, et al. 2006a;Khaitovich, et al. 2006;Blekhman, et al. 2008  Comparative studies to date have focused most intensively on DNA methylation, an 87 epigenetic regulatory mechanism that refers to the covalent addition of a methyl group to a 88 cytosine base and that can affect transcription factor binding, chromatin accessibility, and gene   In primates, comparisons between humans, chimpanzees, and rhesus macaques suggest that 92 divergence in DNA methylation is associated with changes in gene expression (Zeng, et    To do so, we generated genome-scale bisulfite sequencing data for 4 -14 members of 120 each of five of the extant species (all but chacma baboons).

133
Genome-wide variation in DNA methylation reflects geography and phylogenetic structure 134 We generated DNA methylation profiles for 39 baboons and 5 rhesus macaques (Macaca 135 mulatta) (Table S1) (Table S1). However, because source population was not  To investigate the relationship between DNA methylation levels and genetic divergence,    Table S2). Conversely, such variation was depleted for CpG sites in gene exons. This 243 dependency on genomic context was generally consistent between sites that exhibited significant 244 genus, clade, or species-level variation. However, species-level changes were more strongly 245 enriched in unannotated regions and more clearly depleted for other functional contexts ( Fig. S4; 246 Table S3), consistent with faster divergence in regions where genetic variation is more likely to 247 be selectively neutral.

248
To identify shifts in DNA methylation associated with specific baboon taxa, we focused 249 on the set of 46,260 sites that were taxonomically structured by clade or species membership.

250
For these sites, we then applied a binomial mixed effects model (Lea, et al. 2015) to identify 251 differential methylation (i) between each target species and all other baboons, and (ii) between 252 clades (10% FDR threshold). We required a minimum 10% difference in mean DNA methylation 253 levels between the focal species and all other baboon species to call a species-specific shift, and 254 a minimum 10% difference between all between-clade species pairs, as well as rhesus macaque, 255 to call a clade-level shift. Based on these criteria, we identified 2,959 -11,189 species-specific  CpG sites that are candidates for positive selection to differentiate baboon clades or species, 321 respectively. We note that this approach is likely to retain false positives (and also miss false 322 negatives, which is common in tests for selection): thus, this set should be treated as enriched for 323 a likely history of positive selection, rather than as a definitive list of positively selected sites.

324
In the second approach, we fit Brownian motion and Ornstein-Uhlenbeck models of 325 phenotypic evolution, which include explicit parameters for the strength of selection towards a 326 phenotypic optimum or optima (Butler and King 2004). We used a modified approach that takes  Table S2). This pattern is consistent for all sets of candidate positively 351 selected sites ( Fig. S4; Table S4). 129 DMRs were associated with species-specific selection (4-  [dark blue]; p = 3.86 x 10 -12 against the same sites, but with π averaged across nonselected lineages only 384 [gray]). π was calculated separately for each species and averaged across lineages, and is log transformed 385 here for visualization purposes only. Box plots show median (black bar) and interquartile range 386 (whiskers).     472 DNA methylation data were generated for 39 baboons across five of the six recognized 473 extant species (9 anubis, 6 yellow, 14 hamadryas, 6 Guinea, and 4 Kinda baboons; Table S1). 474 We also generated RRBS data for 5 rhesus macaques as an outgroup. For P. anubis samples from  Table S1). All DNA samples were 479 extracted from whole blood with the exception of 2 P. cynocephalus, 1 P. anubis, 2 P. kindae, 480 and 1 P. hamadryas for whom samples were obtained from banked white blood cells.  Table S1).

491
To assess the efficiency of the bisulfite conversion, 1 ng of unmethylated lambda phage DNA 492 (Sigma Aldrich) was added to each sample prior to library construction.  Table S1).

504
After excluding sites for which data were missing for ≥50% of our study subjects or for 505 which mean coverage was <5x, we retained 2,450,153 CpG sites for downstream analysis. As 506 expected for RRBS data sets, these sites were enriched in functionally important regions of the 507 genome and displayed typical mammalian patterns of CpG DNA methylation (Fig. S1). To focus 508 on the sites most likely to exhibit biologically meaningful variation, we further excluded

532
Gene ontology (GO) enrichment analyses were performed using the Cytoscape module 533 ClueGO (Bindea, et al. 2009). To link differentially methylated sites to genes, we first identified 534 clusters of CpG sites with similar patterns of differential methylation (differentially methylated 535 regions or DMRs). We called DMRs when ≥3 CpG sites within a 2 kb window exhibited the 536 same type of lineage-specific change (e.g., hypo-methylation in hamadryas baboons), and 537 bounded the DMR by the first and last CpG site that exhibited lineage-specific methylation. We 538 then collapsed overlapping DMRs. We assigned a DMR to a gene when a CpG site within the 539 DMR fell within 10 kb of the gene body. To test for gene set enrichment, we analyzed GO 540 Biological Processes that fell between levels 3 and 8 of the GO tree, included at least 4 genes in 541 our data set, and for which at least 5% of genes assigned to the term were present in the test set. 542 We also collapsed GO parent-child terms with at least 50% overlap. Enrichment analyses were 543 corrected for multiple hypothesis testing using the Benjamini-Hochberg (B-H) method 544 (Benjamini and Hochberg 1995). Gene set enrichment analyses for DNA methylation data can be 545 biased if some gene sets are systematically associated with larger numbers of CpG sites than 546 others (Geeleher, et al. 2013). However, in our data set, genes associated with differentially 547 methylated sites were not associated with more tested sites than other genes (logistic regression: 548 z = 0.054, p = 0.957).

550
Covariance between genetic structure and DNA methylation patterns 551 To assess the relationship between phylogenetic structure and DNA methylation patterns 552 in our data set, we conducted principal components analysis in R (version 3.2.5; R Core Team 553 2016) on the scaled variance-covariance matrix of the DNA methylation level data. We ran the 554 PCA both including and excluding the rhesus macaque samples, and in baboons after 555 subsampling to the same number of individuals per species (n=4; Fig. 1A, 1B, and S2).

556
To test the correlation between DNA methylation levels and pairwise genetic distance 557 between samples, we used Mantel tests. We called genotypes from RRBS data for 49,607 functional compartment (gene, enhancer, CpG island, CpG shore, promoter, unannotated) and 562 mean methylation level (Fig 2A). We also tested whether windows of the genome where genetic 563 structure followed an alternate phylogeny (a consequence of incomplete lineage sorting or 564 admixture) exhibited a lower correlation between genetic and DNA methylation covariance (see 565 Supplementary Methods). 566 Finally, to investigate the relationship between DNA methylation divergence and genetic 567 divergence between species, we retained CpG sites for which each species was represented by at 568 least three individuals and a total (across individuals) of at least 10 reads (n = 438,713 CpG 569 sites). We calculated the mean DNA methylation level per species for each retained CpG site and 570 the difference in mean methylation between each species pair. We then tested whether  575 For sites in which clade or species significantly contributed to variance in DNA 576 methylation levels (n=46,260 taxonomically structured sites, identified using ANOVA and a 577 10% FDR threshold), we tested for lineage-specific shifts using the beta-binomial model where wi is a vector of fixed effect covariates including an intercept and the sample-specific 590 bisulfite conversation rate; α is a vector of coefficients for wi; xi represents species or clade 591 membership coded as 1 (for the taxon of interest) or 0 (for any other taxa) and β is the coefficient 592 for the effect of taxonomic membership; is an n-vector of independent residual error with 593 variance σ 2 ; and I is a n-by-n identity matrix. We did not model genetic non-independence in this 594 analysis; thus, the K matrix input to MACAU was an identity matrix.

595
In addition to a 10% FDR threshold (q-value: Storey & Tibshirani 2003), we required a 596 minimum difference of 10% in mean methylation between either (i) the focal species compared 597 to all other species, for species-level shifts, or (ii) for all pairwise comparisons between northern 598 clade and southern clade species, for clade-level shifts. We assigned clade-level shifts to one of 599 the two lineages based on post-hoc comparison to rhesus macaques. For example, we assigned a 600 shift to the northern clade when there was a mean difference in DNA methylation of ≥10% 601 between northern clade baboons and macaques, but not between southern clade baboons and 602 macaques.

604
Identification of candidate directionally selected sites 605 To test for positive selection using the heuristic approach, we first calculated the 606 intraspecific variance for each of the 756,262 CpG sites in our primary data set, after mean-607 centering DNA methylation levels for each species. We then binned the CpG sites into 5% 608 quantiles based on mean methylation level, and retained sites with intra-specific variance in the 609 lowest 10% quantile for each bin. We intersected these low-variance sites with the set of sites 610 that exhibited species-or clade-specific methylation, based on the criteria outlined for 611 identifying taxonomic structure with ANOVA followed by beta-binomial regression. This 612 intersection set is likely to be enriched for a history of positive selection.

613
As an alternative approach, we fit Ornstein-Uhlenbeck (OU) models of the evolutionary 614 process, based on the phylogenetic tree for baboons (Rogers, et al. in review). In OU models,