A recent study conducted the first genome-wide scan for selection in Inuit from Greenland using single nucleotide polymorphism chip data. Here, we report that selection in the region with the second most extreme signal of positive selection in Greenlandic Inuit favored a deeply divergent haplotype that is closely related to the sequence in the Denisovan genome, and was likely introgressed from an archaic population. The region contains two genes, WARS2 and TBX15, and has previously been associated with adipose tissue differentiation and body-fat distribution in humans. We show that the adaptively introgressed allele has been under selection in a much larger geographic region than just Greenland. Furthermore, it is associated with changes in expression of WARS2 and TBX15 in multiple tissues including the adrenal gland and subcutaneous adipose tissue, and with regional DNA methylation changes in TBX15.
To identify genes responsible for biological adaptations to life in the Arctic, Fumagalli et al. (2015) scanned the genomes of Greenlandic Inuit (GI) using the population branch statistic (PBS) (Yi et al. 2010), which detects loci that are highly differentiated from other populations. Using this method, they found two regions with a strong signal of selection. One region contains the cluster of FADS genes, involved in the metabolism of unsaturated fatty acids. Several of the single nucleotide polymorphisms (SNPs) with the highest PBS values in this region were shown to be significantly associated with different phenotypes including fatty acid profiles, weight, and height.
The other region contains WARS2 and TBX15, located on chromosome 1. WARS2 encodes the mitochondrial tryptophanyl-tRNA synthetase. TBX15 is a transcription factor from the T-box family and is a highly pleiotropic gene expressed in multiple tissues at different stages of development. It is required for skeletal development (Singh et al. 2005) and deletions in this gene cause Cousin syndrome, whose symptoms include craniofacial dysmorphism and short stature (Lausch et al. 2008). TBX15 also plays a role in the differentiation of brown and brite adipocytes (Gburcik et al. 2012). Brown and brite adipocytes produce heat via lipid oxidation when stimulated by cold temperatures, making TBX15 a strong candidate gene for adaptation to life in the Arctic. SNPs in or near both of these genes have also been associated with numerous phenotypes in GWAS studies—in particular waist–hip ratio and fat distribution in Europeans (Heid et al. 2010) and ear morphology in Latin Americans (Adhikari et al. 2015).
Multiple studies have shown extensive introgression of DNA from Neanderthals and Denisovans into modern humans (Green et al. 2010; Meyer et al. 2012; Prüfer et al. 2014; Sankararaman et al. 2014; Vernot and Akey 2014; Racimo et al. 2015). Many of the introgressed tracts have been shown to be of functional importance and may possibly be examples of adaptive introgression into humans, including genes involved in immunity (Abi-Rached et al. 2011; Dannemann et al. 2015), genes associated with skin pigmentation (Sankararaman et al. 2014; Vernot and Akey 2014), and EPAS1, associated with high-latitude adaptation in Tibetans (Huerta-Sánchez et al. 2014). It has been hypothesized that Archaic humans were adapted to cold temperatures (Steegmann et al. 2002). Therefore, in this paper we examine if any of the selected genes in Inuit (Fumagalli et al. 2015) may have been introduced into the modern human gene pool via admixture from archaic humans, that is, Neanderthals or Denisovans (Meyer et al. 2012; Prüfer et al. 2014). We will show that the WARS2/TBX15 haplotype that is at high frequency in GI was likely introgressed from an archaic human population. We will also show that the selection affecting this haplotype is relatively old, resulting in a high allele frequency in other New World populations and intermediate allele frequencies in East Asia. Finally, functional genomic analyses suggest that the selected archaic haplotype may affect the regulation of expression of TBX15 and WARS2.
Suggestive Archaic Ancestry in GI SNP Chip Data
We first computed fD (Martin et al. 2014) to assess putative local archaic human ancestry in candidate genes in the highest 99.5% quantile of the PBS genome-wide distribution in GI (Fumagalli et al. 2015). This statistic is intended to detect imbalances in the sharing of alleles from an outgroup panel between two sister population panels or genomes. It is robust to differences in local diversity along the genome, which tends to confound other related statistics, such as Patterson’s D (Green et al. 2010), which also use patterns of allelic imbalances (i.e., “ABBA” and “BABA” sites, see Materials and Methods section). Patterson’s D is also meant to identify excess archaic ancestry but is better suited for genome-wide analyses (Green et al. 2010; Durand et al. 2011).
For the test population we used the SNP chip data from the selection scan in GI (Fumagalli et al. 2015), obtained from 191 GIs with low European admixture (<5%). When computing fD, we used a Yoruba genome sequenced to high-coverage (HGDP00927) (Prüfer et al. 2014) as the non-introgressed genome. For the archaic genome, we used either a high-coverage Denisovan genome (Meyer et al. 2012) or a high-coverage Neanderthal genome (Prüfer et al. 2014). All sites were polarized with respect to the inferred human–chimpanzee ancestral state (Paten et al. 2008).
The only top PBS locus showing some evidence of archaic ancestry is the WARS2/TBX15 region, which has a high degree of allele sharing with the Denisovan genome (Durand et al. 2011). However, because we used SNP chip data, we found that most genes had few informative (ABBA or BABA) sites. For genes for which there are two or more informative sites (i.e., ABBA + BABA ≥ 2), TBX15 is in the 88% quantile of the genome-wide distribution of fD, and WARS2 is in the 87% quantile. For genes for which there are more than ten informative sites (i.e., ABBA + BABA ≥ 10), the quantile rankings for these genes are almost the same: TBX15 is in the 88% quantile, and WARS2 is in the 86% quantile. However, due to the small number of SNPs available for testing in GI, we could not assess with confidence whether this region was truly introgressed from archaic humans. We therefore sought to identify the selected haplotype in other samples for which full sequencing data are available.
The alleles with high PBS values and high frequency in GI are almost absent in Africa, but present across Eurasia. In figure 1A, we show the geographic distribution of allele frequencies for one of these SNPs (rs2298080) as an example, using data from phase 3 of the 1,000 Genomes Project (Auton et al. 2015) and the Geography of Genetic Variants Browser (Marcus and Novembre 2016). The high-frequency alleles in GI tend to match the Denisovan and Altai Neanderthal alleles in this region. For example, rs2298080 has an A allele at a frequency of 45.45% in Han Chinese from Beijing (CHB) and at 99.74% frequency in GI. This allele is absent or almost absent (<1% frequency) in all African populations, and the Denisovan and Altai Neanderthal genomes are both homozygous for the A allele. We observe a similar pattern when looking at the Simons Genome Diversity Project (SGDP) (Mallick et al. 2016) (fig. 1B), which contains high-coverage genomes from a wide variety of populations across the world, including San, native Papuans and Australians, and various Native American populations. Here, we also observe that the archaic allele of rs2298080 is almost absent (1.11% frequency) in Africans, but has a much higher frequency outside the African continent, especially in East Asia (46.36%), Central Asia and Siberia (64.81% frequency), and the Americas (89.13% frequency), though it is not at very high frequencies in Oceania (20.37% frequency). We therefore analyze sequencing data from Eurasians to determine if the selected alleles were truly introgressed from an archaic human population.
Excess Archaic Ancestry in Eurasians
A particularly useful way to detect adaptive introgression is to identify regions with a high proportion of uniquely shared sites between the archaic source population and the population subject to introgression (Racimo et al. 2016). This was one of the lines of evidence in favor of Denisovan adaptive introgression in Tibetans at the EPAS1 locus (Huerta-Sánchez et al. 2014). We therefore partitioned the genome into non-overlapping 40 kb windows and computed, in each window, the number of SNPs where the Denisovan allele is at a frequency higher than 20% in Eurasians but less than 1% in Africans, using the population panels from phase 3 of the 1,000 Genomes Project (Auton et al. 2015). The windows containing TBX15 and WARS2 have four and three such sites, respectively, which is higher than 99.99% of all windows in the genome (fig. 2A). We used a length of 40 kb because the mean length of introgressed haplotypes found in an earlier study (Prüfer et al. 2014) was 44,078 bp (Supplementary Information 13 in that study).
In each of the same 40 kb windows, we also computed the 95% quantile of Eurasian derived allele frequencies of all SNPs that are homozygous derived in Denisova and less than 1% derived in Africans (using EPO alignments, Paten et al. 2008, to define the ancestral and derived alleles with respect to the human–chimpanzee ancestor). This quantile is assigned as the score for each window. This second statistic is designed to detect archaic alleles that are uniquely shared with Eurasians and have risen to extremely high frequencies. Here, WARS2 and TBX15 are also strong outliers, with a quantile frequency score above 99.95% of all windows (fig. 2B). The chance that the region would randomly show such an extreme pattern of both excess number, and high allele frequency, of derived alleles shared with Denisovans and rare or absent in Africa, is exceedingly small under models of neutrality, positive selection alone or introgression alone (Racimo et al. 2016) and strongly suggests that selection has been acting on alleles introgressed from archaic humans.
Identifying the Introgression Tract
We used a Hidden Markov Model (HMM) method (Seguin-Orlando et al. 2014) for identifying introgression tracts (“HMM-tracts”), using Yoruba as the population without any introgression (
We also queried the region using the output from the conditional random field framework for detecting archaic tracts developed by Sankararaman et al. (2016), applied to the SGDP. Using this method (“CRF-tracts”), we find that the tract is generally longer (∼85,000 bp, though varying in length depending on the population) than the one inferred by the HMM, for example, extending in several East Asians from position 119,541,452 to position 119,627,434 (fig. 3). We also find that there are individuals from three populations in East and Central Asia (the Naxi, the Yakut and the Even) in which the tract is surprisingly much larger than in any other population (ranging from 181,288 bp to 628,816 bp). Although the Yakut are Siberian populations that are closely related to the Inuit, the version of the introgressed haplotype that is at high frequency in the Yakut does not appear to match better with the Inuit haplotype than other versions of the introgressed haplotype present in other populations (
We further examined the SNPs that are selected in GI and that contain archaic alleles that are uniquely present in the introgressed haplotype background, as inferred by the HMM. We queried their allelic state in different human population panels (fig. 4). We find a sharp distinction between the two most prevalent haplotypes across Eurasia, with one of them being almost identical to the Denisovan genome and the other being highly differentiated from it. As expected, the frequencies of these two haplotypes agree with the frequencies of the inferred introgressed tracts and with the allelic frequencies of the top PBS SNPs. This pattern echoes the pattern observed for another well-known case of adaptive introgression in Tibetans (Huerta-Sánchez et al. 2014), although in the present case the archaic haplotype is more widely distributed across Eurasia.
The average recombination rate per bp in the region is low (0.15 × 10−8,
The Selected Alleles Are the Putatively Introgressed Alleles
The archaic haplotype frequencies agree well with the frequencies of the selected alleles in different human populations. We therefore aimed to verify that the selected alleles were the same alleles that were uniquely shared with archaic humans. We focused on the SNPs in the TBX15/WARS2 region that are located in the 99.95% quantile of genome-wide PBS scores in the GI SNP chip data (Fumagalli et al. 2015). Noticeably, 28 out of the 33 top SNPs in the TBX15/WARS2 region lie in the region where the introgressed haplotype is located, as inferred by both the HMM and the CRF frameworks. Out of the remaining five SNPs, two of them (rs10923738, rs12567111) lie within the inferred CRF-tract, but not in the HMM-tract. For each of the SNPs that overlap the track as inferred by both methods, we checked whether the selected allele in GI was the same as: 1) the alleles present in the high-coverage Altai Neanderthal genome (Prüfer et al. 2014), 2) the alleles present in the recently-sequenced high-coverage Vindija Neanderthal genome (https://bioinf.eva.mpg.de/jbrowse), 3) the alleles in the high-coverage Denisova genome (Meyer et al. 2012), 4) the alleles present in a present-day human genome from the 1,000 Genomes Project (Auton et al. 2015) (HG00436) that is homozygous for the introgressed tract, 5) the alleles in a present-day human genome (HG00407) that does not contain the introgressed tract, and 6) the alleles in three modern human genomes obtained from ancient DNA: Ust-Ishim (Fu et al. 2014) (dated at ∼45,000 ka), Stuttgart (dated at ∼7,000 ka) and Loschbour (Lazaridis et al. 2014) (dated at ∼8,000 ka).
In all of the SNPs showing the highest evidence of selection, the present-day human genomes that are homozygous for the introgression tract are also homozygous for the favored alleles. Additionally, in all of these SNPs, the present-day human genomes lacking the introgression tract are homozygous for the allele that was not favored by selection. Furthermore, in 79% of the SNPs, the selected allele is present in homozygous form in Denisova, whereas this is only true for 64% of the SNPs in Neanderthals (table 1), regardless of whether we look at the high-coverage Altai Neanderthal or at the recently sequenced high-coverage Vindija Neanderthal. This indicates that the selected alleles are also the introgressed alleles, and that a population closer to the sequenced Denisovan at this locus is the most likely source of archaic introgression. In the six SNPs where the introgressed tract carries a different allele than Denisova, the allele in the introgressed tract is derived, suggesting these differences are due to mutations that occurred more recently than the time the introgressing lineage coalesced with the sequenced Denisovan’s lineages, and possibly more recently than the introgression event. Finally, we find that in 100% of the SNPs the Ust’-Ishim, Stuttgart and Loschbour genomes are homozygous for the non-introgressed alleles, so they did not carry the introgressed haplotype.
|CHR||POS||SNP ID||REF||ANC||NONSEL||SEL||GR FREQ||CHB FREQ||CEU FREQ||ALTAI||VIN||DEN||HG00436||HG00407||UST’-ISHIM||STUTTGART||LOSCHBOUR|
|CHR||POS||SNP ID||REF||ANC||NONSEL||SEL||GR FREQ||CHB FREQ||CEU FREQ||ALTAI||VIN||DEN||HG00436||HG00407||UST’-ISHIM||STUTTGART||LOSCHBOUR|
CHR, chromosome; POS, position (hg19); SNP ID, dbSNP rs ID number; REF, reference allele; ANC, human–chimpanzee ancestor allele (based on EPO alignments); NONSEL, non-selected allele; SEL, selected allele; GR FREQ, frequency of the selected allele in Greenlandic Inuit; CHB FREQ, frequency of the selected allele in CHB (Chinese individuals from Beijing); CEU FREQ, frequency of the selected allele in CEU (Individuals of Central European descent living in Utah); ALTAI, selected allele counts in the high-coverage Altai Neanderthal genome; VIN, selected allele counts in the high-coverage Vindija Neanderthal genome (https://bioinf.eva.mpg.de/jbrowse/); DEN, selected allele counts in the Denisova genome; HG00436, selected allele counts in a present-day human genome that is homozygous for the introgressed tract; HG00407, selected allele counts in a present-day human genome that is homozygous for the absence of the tract; UST’-ISHIM, selected allele counts in the Ust’-Ishim genome; STUTTGART, selected allele counts in the Stuttgart genome; LOSCHBOUR, selected allele counts in the Loschbour genome.
Additionally, by analyzing low-coverage ancient DNA data from various Eurasian populations (Gamba et al. 2014; Raghavan et al. 2014; Allentoft et al. 2015), we find that the selected alleles are prevalent in Eurasian steppe populations but almost absent in western Neolithic and Mesolithic European populations (Allentoft et al. 2015; Haak et al. 2015).
The Region Also Shows Selection Signatures in Native Americans
To examine whether selection in GI on the introgressed haplotype was shared with Native Americans, we performed a scan for positive selection in the latter population by measuring the genetic differentiation against a population of African descent. To correct for recent admixture in Latin America, we selected the Latin American individuals from the 1,000 Genomes project (Auton et al. 2015) showing the highest proportion of Native American ancestry in a 100 Mbp region around the introgressed haplotype, using ngsAdmix (Skotte et al. 2013). These came only from individuals from Peru (PEL) and from individuals from Los Angeles with Mexican ancestry (MXL), as these have higher proportions of Native American ancestry than other Latin American panels (
We observed a local increase of FST in the proximity of the introgressed haplotype when comparing individuals from the PEL and MXL panels against populations of African (YRI) descent (fig. 5). We did not observe high FST values when testing panels of East Asian, South Asian or European descent as the target population for selection (
To assess whether the observed values of FST around the introgressed haplotype may be explained by pure genetic drift, we calculated FST for all biallelic SNPs at the whole-genome level and compared the empirical distribution with a marker SNP (rs2298080) for the selected haplotype. We found that FST values between PEL/MXL against YRI or CEU for the introgressed haplotype are strong outliers in the genome-wide distribution (
We aimed to elucidate whether the high FST values observed between East Asians and Native Americans could be due solely to selection in the ancestral population of Native Americans and GI, or if they could be better explained by continuing selection in Native Americans after their split from GI. We simulated several demographic scenarios of the history of CHB, PEL and GI, under various PEL-GI split times and PEL population sizes (see Materials and Methods section). We then tested whether the top FST for our 20 kbp sliding window scan was an outlier under scenarios with positive selection starting in the ancestral population of PEL and GI, and either continuing in both PEL and GI, or only persisting in GI. We found that demographic scenarios involving low effective population sizes in PEL and/or very strong selection in the ancestral population could result in the observed FST value being in the 95% percentile of the distribution, without invoking continuing selection in Native Americans after their split from GI (Raghavan et al. 2015), so it is possible that selection need not have continued to operate in Native Americans after their split from GI.
Using rs2298080 as a proxy for the selected haplotype, we infer that the haplotype has a frequency of 98%, and 86% in the PEL and MXL population panels, after filtering for the individuals with the most Native American ancestry, respectively. The apparent lack of fixation of the introgressed haplotype in these populations might be explained by the residual non-Native American ancestry (ranging from 11% to 29%) for the analyzed American individuals. Clearly, selection on this allele is not unique to GI but has affected a large proportion of New World groups, likely in pre-Columbian times. This also explains the higher frequency of the introgressed tract in the SGDP Native American panel than in the 1,000 Genomes Project panel (figs. 1B and 3), as the individuals from the former project were specifically sampled from peoples with well-documented Native American ancestry.
The Introgression Tract Is More Closely Related to the Denisovan Genome than the Neanderthal Genome
The divergence between the introgressed haplotype (as defined by the HMM) and the Denisova genome (0.0008) is lower than the divergence between the haplotype and the Altai Neanderthal genome (0.0016). To further examine if the haplotype could be of Neanderthal origin, we computed the divergence between the Altai Neanderthal genome and a randomly chosen individual that was homozygous for the introgressed tract (HG00436). We compared this divergence to the distribution of divergences between the Altai Neanderthal and the Mezmaiskaya Neanderthal (Prüfer et al. 2014), computed across windows of the genome of equal size (
We also obtained the distribution of scaled divergences between the Denisovan genome and a high-coverage Yoruba genome (HGDP00936) (Prüfer et al. 2014). We compared this distribution to the divergence between the introgressed haplotype (using an individual homozygous for the haplotype), and the Denisovan, and observe that this divergence falls towards the left end of the distribution, in the 16.44% quantile (
Additionally, we simulated two archaic populations that split at different times (100,000, 300,000, or 450,000 years), under a range of effective population sizes (1,000, 2,500, 5,000), based on estimates from an earlier study (Prüfer et al. 2014). We compared these simulations to the divergence between the Neanderthal genome and the introgressed haplotype, as well as to the divergence between the Denisovan genome and the introgressed haplotype and the divergence between the introgressed haplotype and the Yoruba genome (
To understand the relationship between the introgressed haplotype and the archaic and present-day human genomes, we plotted a network of the haplotypes in the region as defined by the HMM method (fig. 6). This network shows the most parsimonious distances among the 20 most common present-day human haplotypes and the four archaic haplotypes from Altai Neanderthal and Denisova in the region. We observe two distinct clusters of present-day human haplotypes: one cluster that is distantly related to both archaic genomes, and another cluster that is closely related to Denisova and contains mostly East Asian and Native American individuals, some European and South Asian individuals and almost no Africans. The Altai Neanderthal haplotypes fall at an intermediate position between the Denisovan haplotypes and the first cluster, but share more similarities with the Denisovan haplotypes. The present-day human cluster that is closest to Denisova has a smaller Hamming distance to Denisova than does the Neanderthal haplotype (
We compared the distances between the haplotypes in TBX15/WARS2 to those observed in EPAS1 (Huerta-Sánchez et al. 2014; Huerta-Sanchez and Casey 2015). We focused on the distances among Neanderthal, Denisova and the haplotype that is most closely related to the archaic genomes, for each of the distinct present-day clusters. We observe that, although the distances among Denisova, Neanderthal and the putatively introgressed haplotypes are similar, the distance between any of these and the non-introgressed haplotype are approximately double of what is observed in the case of EPAS1 (
Regulatory Differences in Archaic and Modern Humans
We previously identified four regions in TBX15 in which the Denisovan DNA methylation patterns significantly differ from those of present-day humans (Gokhman et al. 2014). This high concentration of differentially methylated regions (DMRs) makes TBX15 one of the most DMR-rich genes in the Denisovan genome. These DMRs are found around the transcription start site (TSS) of TBX15, bear many active chromatin marks in present-day humans (DNase I hyper-sensitivity, binding by p300 and H2A.Z, and the histone modifications H3K27ac and H3K4me1), and were shown to be associated with the activity levels of TBX15 (Kron et al. 2012; Chandra et al. 2014). This suggests that the activity level of TBX15 in the Denisovan genome was different than in present-day human genomes.
The extensive differences in DNA methylation between the Denisovan and present-day human genomes, and the fact that the introgressed tract overlaps regulatory regions of TBX15 (Chandra et al. 2014) suggest that present-day individuals who carry the introgressed haplotype might display differential methylation as well. Despite the fact that the four Denisovan DMRs do not overlap the introgressed region (fig. 7A), they might reflect TBX15 activity levels determined by sequence changes at the introgressed region, as it is not uncommon that sequence changes far from a gene alter its activity (Mansour et al. 2014; Diederichs et al. 2016). Thus, we turned to investigate the link between the introgressed allele and DNA methylation around TBX15 and WARS2.
To this end, we studied a data set of skin fibroblasts, where both genes are active, and in which each individual is characterized for sequence, methylation and expression. This data set included 62 individuals, of which two are homozygous for the introgressed tract, 21 are heterozygous, and 39 do not carry the tract (Wagner et al. 2014). Due to the low number of individuals who are homozygous for the introgressed tract, we divided the individuals into two groups: individuals who do not carry the tract, and those who carry at least one copy of it. Then, we searched for CpG positions whose methylation level differs between the groups. Searching between the promoter of WARS2 (5 kb upstream to the TSS) and the transcription termination site of TBX15, we identified two such regions (FDR < 0.05, t-test,fig. 7B). Interestingly, these CpGs almost completely overlap one of the previously reported DMRs (fig. 7A). The second region is a single CpG position that is found within the introgressed region, 11 kb upstream of the TSS of TBX15 (hereinafter “upstream CpG”). Here too, introgressed individuals are hypomethylated compared with individuals who do not carry the tract (fig. 7B and
In order to further test the effect of introgression on the activity of TBX15 and WARS2, we investigated whether the methylation levels of the 16 TSS CpGs and the upstream CpG are associated with the levels of expression of TBX15 and WARS2. We found that none of these CpGs are associated with the expression of WARS2, while all of them are significantly associated with the expression of TBX15 (FDR < 0.05, Pearson correlation). In fact, both the TSS CpGs and the upstream CpG are found within two previously reported regulatory regions of TBX15, and it was shown that their methylation levels are linked to the activity levels of this gene (Chandra et al. 2014). We also detect a more general association between the introgressed haplotype and expression levels: individuals carrying the haplotype exhibit a 22% increase in the expression level of TBX15 in skin fibroblasts (Bonferroni-corrected P = 0.014, t-test, fig. 7C and
TBX15 exhibits a unique link between expression and methylation, where, in contrast to most genes, global hypermethylation of its promoter is associated with elevated activity (Chandra et al. 2014). However, the local relationship between the methylation at specific sites within its promoter and its activity level is substantially more complex, highly tissue-specific, and little understood to date. For example, unlike the rest of the promoter region, hypomethylation of the TSS region was shown to be associated with elevated expression levels of TBX15 (Chandra et al. 2014). This trend matches the one we see in the fibroblast data set, where introgressed individuals exhibit reduced methylation in the TSS CpGs and elevated expression levels of TBX15 (fig. 7C).
A sensible assumption is that the methylation patterns in the introgressed individuals would resemble those in the Denisovan. However, the trend we observe is opposite; the Denisovan is hypermethylated compared with present-day humans, while introgressed individuals are hypomethylated compared with other present-day human individuals. While counterintuitive at first glance, it is important to take into account the complex and tissue-specific regulation of TBX15, and the fact that our analysis was conducted in fibroblasts, while the Denisovan methylation levels are from bone. Both the TSS CpGs and the upstream CpG reside in regions where methylation is variable across tissues (Chandra et al. 2014). Moreover, the method of measurement of DNA methylation is different between these samples (high-resolution single site measurements in fibroblasts vs. regional reconstruction in the Denisovan bone). Therefore, further analyses are needed in order to determine how methylation patterns are affected by introgression in additional tissues, and specifically in bone.
The complex relationship between the introgressed tract and the activity levels of TBX15 and WARS2 can also be observed in the wider context of expression across tissues. To this end, we queried the GTEx database (GTEx-Consortium 2015) to examine if the introgressed SNPs are associated with an effect on human gene expression in various tissues. The sample sizes in the GTEx project do not provide enough power to detect trans-expression quantitative trait loci (eQTLs), so we could only search for a cis-eQTL relationship with genes within a ±1 Mb window of each SNP. We therefore only checked whether the 28 introgressed SNPs that had signatures of positive selection in GI were also eQTLs for TBX15 or WARS2. We found that all of these SNPs are tightly linked and therefore have almost exactly the same P-values in each of the tissues, so we only focus on one of these (rs2298080) below.
Because the haplotype is at intermediate frequencies in Europeans, we queried the GIANT consortium GWAS data (Wood et al. 2014; Locke et al. 2015; Shungin et al. 2015), which contains a number of anthropometric traits tested on a European panel. When looking at all P-values in the region, we find that there is a peak of significantly associated SNPs for three phenotypes right where the haplotype is inferred to be located (table 1). The introgressed alleles in the queried SNPs are associated with a positive effect size for all three phenotypes. However, even though some of these SNPs are significantly associated with BMI-adjusted waist-circumference (P < 10−5), they are not among the top most significant SNPs in the region for any of the three phenotypes (
Interestingly, we observed that the region overlaps a 100 kb region designated as a mouse QTL for the induction of brown adipocytes (MGI:2149993) (Xue et al. 2005).
We have identified a highly divergent haplotype in the TBX15/WARS2 region, which was likely introduced into the modern human gene pool via introgression with archaic humans. A priori, one would expect the source of introgression to be Neanderthals, due to their geographic distribution and the known admixture event(s) from Neanderthals into Eurasians (Green et al. 2010; Prüfer et al. 2014). However, the introgressed sequence is more closely related to the sequenced Denisovan genome than the sequenced Neanderthal genomes. This suggests that either the introgressing Neanderthal sequence is missing in the Neanderthals sequenced to date (and perhaps present in the sequenced Denisovan genome due to incomplete lineage sorting), or that the introgression event occurred from an unidentified population present in Eurasians that was more closely related to the Denisovan individuals found in the Altai Mountains than to any Neanderthal population.
The archaic haplotype is almost absent in Africans, present at higher frequencies in East Asians than in Europeans and South Asians, and at even higher frequencies in Native Americans and GI, where it is almost fixed (after correcting for post-Columbian admixture). This suggests there may have been a temporally and geographically extended period of selection for the archaic haplotype throughout eastern Eurasia. Population genetic differentiation values between Native Americans and Yoruba are significantly high (YRI vs. PEL: P < 10−4), marginally significant when comparing East Asians and Yoruba (YRI vs. CHB: P = 0.01), and marginally significant when comparing East Asians and Native Americans (P = 0.02 for CHB vs. PEL and P = 0.06 for CHB vs. MXL). These findings suggest that selection on this locus may have acted during the early phases of the peopling of the Americas, perhaps in the Siberian or the Beringian ancestors of both modern Native Americans and GI. Selection may have continued to operate in Native Americans after their split from the GI, but we found that one need not invoke a model of continuing selection, if selection in Beringia was strong (2 N s > ∼500) and/or the effective population size in Native Americans after their split was very small. While very strong selection is unlikely, effective population sizes for Native Americans have been estimated to be very low (Raghavan et al. 2015).
It is intriguing that the introgressed tract is extremely long in three specific populations of central Asia and Siberia: the Yakut, the Even and the Naxi. The Yakut and Even are northeastern Siberian groups that migrated from the Lake Baikal and Transbaikal regions north of Mongolia. The Naxi are a population living in the Himalayan foothills that migrated from northwestern China. It is possible that these longer versions of the tract are remnants of the originally introgressed haplotype, which may have been shortened by recombination before sweeping to high frequencies in other populations. Though the haplotype is present in Europeans, there is some evidence to suggest it may have been introduced via the eastern steppe migrations of the Late Neolithic (
The TBX15/WARS2 region is highly pleiotropic: it has been found to be associated with a variety of traits. These include the differentiation of adipose tissue (Gburcik et al. 2012), body fat distribution (Heid et al. 2010; Liu et al. 2013, 2014; Shungin et al. 2015), facial morphology (Lausch et al. 2008; Pallares et al. 2015), stature (Lausch et al. 2008), ear morphology (Curry 1959; Adhikari et al. 2015), hair pigmentation (Candille et al. 2004), and skeletal development (Singh et al. 2005; Lausch et al. 2008). Interestingly, for several of the body fat distribution studies, the introgressed SNPs lie in the middle of a region with significant genome-wide associations, although the introgressed SNPs themselves do not have genome-wide significant P-values, after conditioning for the SNP with the strongest association in the region (which is not linked to the introgressed tract).
The haplotype is located immediately upstream of TBX15, overlapping some of its regulatory regions. Using fibroblast data where individuals were characterized for sequence, methylation and expression, we have shown a three-way association between the haplotype, the levels of DNA methylation, and the levels of expression of TBX15. However, the GTEx analysis found an association between the haplotype and the expression of TBX15 only in the testis. Similarly, though we could not detect a regulatory link to the haplotype when analyzing WARS2 in the fibroblast data, the GTEx analysis revealed that the Denisovan SNPs are cis-eQTLs for WARS2 across various tissues. This included skeletal muscle and subcutaneous adipose. These contrasting results, together with the different methods used to perform measurements in these data sets, make it difficult to assess whether the downstream phenotypic changes are due to changes in the regulation of TBX15, WARS2 or both.
Altogether, our study suggests a complex multi-factorial regulation of TBX15 and WARS2. We show that the introgressed region is associated with regional changes in methylation and expression levels, but our findings also hint to other factors that affect the regulation of these genes that are yet to be elucidated.
Materials and Methods
fDStatistics in GI SNP Data
Following Green et al. (2010) and Durand et al. (2011), at site i, let Ci(ABBA) = ((1 - fYoruba) × fGreenlandic Inuit × fArchaic human), where f is the derived allele frequency (with respect to the human–chimpanzee ancestor) in either a population panel (for the GI) or a diploid genome (for Yoruba and the archaic humans). Furthermore, let Ci(BABA) = fYoruba × (1 - fGreenlandic Inuit) × fArchaic human. Then, for a set of N sites within a particular region of the genome, we computed D as follows:
Here X is defined—dynamically for each site i—as the population (either Archaic or GI) that has the highest derived allele frequency (Martin et al. 2014).
Introgressed Tracts in Eurasian Whole-Genome Data
First, we used a HMM to find the archaic introgressed tracts in the region (Seguin-Orlando et al. 2014) (Sankararaman et al. 2014, 2016). This method searches the genome for runs of archaic alleles that are of a length consistent with introgression.
Uniquely Shared Sites
We defined “Eurasian uniquely shared sites” (Racimo et al. 2016) as sites where the Denisovan genome is homozygous and where the Denisovan allele is at low frequency (< 1%) in Africans (AFR, excluding admixed African-Americans), but at high frequency (> 20%) in non-American Eurasians (EUR + EAS + SAS) from phase 3 of the 1,000 Genomes Project (Auton et al. 2015). Similarly, we defined the “derived shared quantile” statistic (Racimo et al. 2016) as the 95% quantile of all derived allele frequencies in Eurasians, for SNPs where the Denisovan allele is homozygous for the derived allele and where the derived allele is at low frequency (< 1%) in Africans. In both cases, we only used sites that were not in repeat-masked regions (Smit et al. 1996–2010) and that lied in regions with 20-bp Duke mappability equal to one (Derrien et al. 2012). The mapability track ensures that the sites are located in uniquely mappable 20-bp windows of the genome, to avoid issues that may spring from mismapped reads.
To examine the haplotypes in the region containing the introgressed tract, we computed the number of pairwise differences between every pair of haplotypes in a particular continental panel. Then we ordered the haplotypes based on their number of pairwise distances to the archaic sequence in each continent. Figure 4 is generated using the heatmap.2 function from the gplots package of the statistical computing platform R (R-Core-Team 2012).
We built a haplotype network based on pair-wise differences using R (R-Core-Team 2012) and the software package pegas (Paradis 2010). To plot the network, we used the 20 most abundant present-day human haplotypes. To make a fair comparison with published distances for EPAS1 (Huerta-Sanchez and Casey 2015), we looked at the 40 most abundant present-day haplotypes instead, and only counted differences at SNPs that were segregating in present-day humans. The network is produced using statistical parsimony, such that the most closely related haplotypes are connected first via the least number of mutations (“steps”) (Templeton et al. 1992).
FST Scan in Native Americans
We extracted sequencing data in BAM format for a 100 Mbp region surrounding the putatively introgressed haplotype from the 1,000 Genomes Project data set (Auton et al. 2015). First, we selected all unrelated individuals belonging to four population panels: African Yoruba (YRI), Central Europeans from Utah (CEU), Han Chinese (CHB) and Peruvians from Lima (PEL). To select PEL individuals that would serve as optimal representatives of Native American genetic variation, we calculated admixture proportion among all YRI, CEU, CHB and PEL individuals assuming four ancestral populations using NGSadmix (Skotte et al. 2013). We then extracted the first 30 PEL individuals showing the highest proportion of inferred Native American ancestry (> 89%). Similarly, we selected the first 30 individuals ranked by African, European and East Asian ancestry, respectively, and used each of these four 30-individual cohorts in the principal component analyses described below. For this analysis, we therefore processed a total of 120 individuals. We then repeated the same procedure using Colombian (CLM), Mexican (MXL) or Puerto Rican (PUR) panels. After calculating the ancestry proportion as described above, and given the higher European admixture proportions for these populations, only 20 individuals for MXL, CLM or PUR were chosen as representatives of Native American variation (Native American ancestry component > 64%). To visually verify whether we correctly selected unadmixed individuals in the Latin American cohorts, we performed a principal component analysis using ngsTools (Fumagalli et al. 2014). As only PEL and MXL selected individuals explained more than 70% of Native ancestry (
We computed FST, a measure of population genetic differentiation, between the Latin Americans with the highest proportion of Native American ancestry and other populations against YRI, using a method-of-moments estimator implemented in VCFtools (Danecek et al. 2011) from VCF files from the 1,000 Genomes Project (fig. 5). To identify signatures of positive selection in Native Americans, we scanned the region around the putatively introgressed haplotype using a sliding-windows approach, with window size of 20 kbp and step size of 2 kbp, and computing FST against YRI in each window. We also performed this same analysis but using all non-American, non-African populations present in the 1,000 Genomes Project, by randomly sampling 30 individuals at each population for ease of comparison (
We were also interested in determining which demographic scenarios could generate levels of FST between PEL and CHB as extreme as those observed, without invoking selection after the split between PEL and GI. We explored several combinations of effective population size for Native Americans (from 1,000 to 10,000 with a step size of 1,000) and split time between Native Americans and GI (from 8,000 to 18,000 years ago with a step size of 1,000 years). All other parameters were fixed to those inferred using dadi (Gutenkunst et al. 2009). Additionally, we assumed that selection started in the ancestral population of Native Americans and GI 19,000 years ago. After the split, selection stopped in Native Americans, and lasted in the GI for 500 more years after the split. We tried a range of selection coefficients: 0, 20, 50, 200, 500, or 1,000 (in units of 2 N s). For each scenario, we simulated 10,000 replicates of a 20 kbp region, using the software msms (Ewing and Hermisson 2010), and recorded the 95th percentile of the FST distribution between Native Americans and East Asians.
TBX15 and WARS2 Regulation
Data from SNP arrays, methylation arrays and expression arrays for the 62 fibroblast samples were downloaded from Gene Expression Omnibus (GEO accession number GSE53261). Expression values were normalized using Median Absolute Deviation (MAD) scale normalization (Fundel et al. 2008). Chromatin peaks (H3K27ac, H3K4me1 and DNase-I) were downloaded from the Roadmap integrative analysis of 111 epigenomes (Kundaje et al. 2015). For the analysis of the link between introgression and the expression level of TBX15, we corrected the t-test P-value using Bonferroni correction, taking into account the eight GTEx mesenchymal tissues in which TBX15 is expressed. Osteoblast and fibroblast RRBS maps were downloaded from the ENCODE project (GEO accession number: GSE27584).
Conditional Association Study
We used the method of conditional association testing based on summary statistics (Yang et al. 2012) that is implemented in the GCTA software package (Yang et al. 2011). To obtain information about patterns of linkage disequilibrium for the European population analyzed by GIANT—which is required by the method—we used the called genotypes for the CEU panel from phase 3 of the 1,000 Genomes Project.
We thank Montgomery Slatkin and members of the Slatkin and Nielsen labs for helpful advice and discussions. We also thank Jacob Crawford for his help and advice in inferring admixture tracts using a hidden Markov model. Additionally, we thank Sriram Sankararaman for providing us with the output of his conditional random field method in our region of interest. R.N. is supported by a National Institutes of Health grant (R01HG003229). L.C. is supported by the Israel Science Foundation FIRST individual grant (ISF 1430/13). M.F. is supported by a Human Frontier Science Program fellowship (LT00320/2014). E.H.S is supported by UC Merced start-up funds and an NSF-DEB 1557151 grant.