The Complex Admixture History and Recent Southern Origins of Siberian Populations

Although Siberia was inhabited by modern humans at an early stage, there is still debate over whether it remained habitable during the extreme cold of the Last Glacial Maximum or whether it was subsequently repopulated by peoples with recent shared ancestry. Previous studies of the genetic history of Siberian populations were hampered by the extensive admixture that appears to have taken place among these populations, because commonly used methods assume a tree-like population history and at most single admixture events. Here we analyze geogenetic maps and use other approaches to distinguish the effects of shared ancestry from prehistoric migrations and contact, and develop a new method based on the covariance of ancestry components, to investigate the potentially complex admixture history. We furthermore adapt a previously devised method of admixture dating for use with multiple events of gene flow, and apply these methods to whole-genome genotype data from over 500 individuals belonging to 20 different Siberian ethnolinguistic groups. The results of these analyses indicate that there have been multiple layers of admixture detectable in most of the Siberian populations, with considerable differences in the admixture histories of individual populations. Furthermore, most of the populations of Siberia included here, even those settled far to the north, appear to have a southern origin, with the northward expansions of different populations possibly being driven partly by the advent of pastoralism, especially reindeer domestication. These newly developed methods to analyze multiple admixture events should aid in the investigation of similarly complex population histories elsewhere.


S1A. 3-population test and TreeMix results
The 3-population test (Patterson et al. 2012) is based on allele frequency correlations across populations, and the observation of a significantly negative value of the f3 statistic is taken as clear evidence of admixture. We have applied this test (in its TreeMix implementation) to the Siberian data, and although most of the admixture signals inferFar Eastern and discussed in the manuscript are confirmed (see Table S2 for all significantly negative values of f3), some patent signals of admixture, which are confirmed by multiple other independent analyses and attested to historically are missed. In particular, the test confirms that the South Siberian populations show complex signals of admixture, and also confirms the signals of admixture in the Turkic-speaking Dolgans of northern Siberia. Yet, while historical accounts (supported by the results of other analyses based on genetic data, see Text S3A) suggest that the Dolgans trace their ancestry to Yakuts and Evenks (Stapert 2013, and references therein), Yakut ancestry in the Dolgans is not confirmed by the f3 statistic. Also, whereas the three-population test provides statistically significant evidence that the southern Tungusic-speaking populations (Oroqen, Hezhen, Xibo) trace their ancestry to populations related to the modern day Koryaks, Nganasan and Mongolians (as is suggested by other analyses in Text S3B), no such signal is detected for the northern Tungusic-speaking groups (Even and Evenks); the strong signal of recent gene flow into these groups from the neighbouring Yakuts and Koryaks (particularly evident in the analysis of IBD segments) is not identified either. Similarly, while the test supports a signal of admixture in the Chukchi when Naukan Yupik and Koryaks are used as source populations, the signal of the reciprocal gene flow from the Chukchi into the Naukan Yupik is not supported. In addition, while the Samoyedic-speaking populations (except the Nganasan) appear to be closely related based on PCA, ADMIXTURE, AHG, and the analysis of IBD segments, the three-population test fails to detect Khanty ancestry in the Selkup, while it does identify it in the Nenets groups.
We also attempted to investigate the history of admixture in the Siberian populations with the TreeMix approach (Pickrell & Pritchard 2012). This approach infers a maximum-likelihood tree and then sequentially adds migrations that reduce the residuals between the observed data and the fitted data. However, even after ten migration edges were added (Fig.S7A) the residuals were still quite large (24.9 standard errors; Fig.S7B). The results are difficult to interpret due to the close relationship between the populations being analysed, although the analysis does confirm that all present-day Siberian populations trace their ancestry to South Siberia. Some other results reported in the Main Text and Text S3 are also supported, e.g. additional gene flow into the Samoyedicspeakers from populations related to the modern day Nganasan and Khanty. Admixture events suggested by PCA, ADMIXTURE, analysis of the IBD segments and multiple regression analyses (and in many cases supported by the f3 statistic or independent genetic analyses and attested to in the historical records) that are nevertheless not inferred by TreeMix include: Finno-Ugric gene flow into Russians, recent gene flow from Yakuts into the neighbouring Even population, and reciprocal gene flow between Chukchi and Naukan. In addition, the Dolgans are not being modelled as admixed between Evenks and Yakuts, and the impact of the Xiongnu invasions (Asian admixture in the northern Turkic-and Tungusic-speakers, see Text S3E below) is also not observed. At the same time, a number of inferred migration edges are hardly plausible, e.g. the gene flow from the San population of southern Africa into the Buryats of south Siberia, or the gene flow from the Han Chinese into the San. Also unlikely is the gene flow edge inferred between the Hezhen (southern branch of the Tungusic language family) and the branch leading to all the northern Tungusic and Far Eastern populations as well as Greenland Inuit (this gene flow therefore could not have been recent, and must have preceded the migration of Paleo-Yupik-Inuit into the Bering Sea region).
While significant f3 scores are taken to indicate admixture, nonsignificant tests do not necessarily indicate its absence, as processes such as genetic drift following admixture could mask the signal (Patterson et al. 2012). This could affect for example admixture signals in the Yakuts -a population which has been shown by the analysis of the distribution of lengths of IBD segments to have recently experienced a founder effect (Text S3A). In addition, given complex admixture histories with multiple admixture events as suggested by the ADMIXTURE analysis (Fig.3), the tests based on pairwise comparisons might not be very informative. Similarly, it is not surprising that the tree inferred by the TreeMix analysis appears to be a poor fit to the data, since all other results suggest recent shared ancestry followed by complex multiple and often reciprocal admixture events; i.e. the Siberian populations did not experience a tree-like population history.

S1B. Validation of the Admixture History Graph method
In order to validate the AHG method, we generated (using R) populations with all the possible combinations of admixture proportions A, B and C such that in the populations reconstructed as they were at the time of admixture the initial ancestry component A was allowed to vary around 10%, 20%, 30%, 40% and 50% (with B=1-A), and the later contribution of C similarly varied from around 10%, 20%, 30% etc to around 90%. For each of the different admixture histories 100,000 outcomes were generated. We then ran the covariance test described in the Materials and Methods section of the main text to determine the sequence of admixture events. For sample sizes of 10 individuals the sequence of admixture events was inferred correctly for all values of A (100% of outcomes with the correct history inferred), unless the initial proportion of A or C was less than 10%, or the initial proportion of C was over 90%, bringing the final proportion of A to below 5%.
When the final proportion of ancestry A in a population dropped to below 5%, the performance of the method diminished slightly (Fig.S2A).
To determine if the test is useful for more complicated admixture scenarios we added another ancestry component (D) to the hypothetical populations already containing ancestries A, B, and C at various mixture proportions. We tested various possible mixture combinations that would allow for the final proportions of any ancestry to be above 5% (Fig.S2B). For sample sizes of 10, the admixture history is inferred correctly more than 95% of the time for a wide range of possible permissible values of A and C (0.2-0.8) and for values of D up to 70%. (Fig.S2B).
Additionally, we explored a scenario where one admixture event between two already admixed populations leads to the appearance of a population with four sources of ancestry ((AB)(CD)) ( Fig.S2C). Again, for sample sizes of 10, the admixture history is inferred correctly more than 98% of the time for a wide range of possible permissible values of A and C and for different mixing ratios of AB and CD (Fig.S2C).

S1C. Validation of AHG results for South Siberian populations
Altaians For the Altaians, the configuration most supported by the AHG calculation is (European, Western Siberian)(Central Siberian, EAsian), shown in Fig.7B. This graph is supported by three out of four sets of covariance tests across the top 30% of the ADMIXTURE runs (see Main Text and Table S3), while the remaining set of tests involving West Siberian, Asian and European ancestries suggest Western Siberian was coupled with Asian, rather than European. Because the admixture history suggested by the AHG results is different for the Altaians and Tuvans, despite their apparent similarity in ancestry composition as inferred by ADMIXTURE, we tested the admixture tree supported by the AHG for the Altaians (European,West Siberian)(Central Siberian, EAsian) against Based on the results of 100,000 simulations we estimate the probability of this error to be around 5%.

Tuvans
Because Tuvans have two out of four ancestries present at low proportion (namely Western Siberian and European), the covariance tests were largely inconclusive (Table S3), and we had to rely only on covariance tests which did not involve both of these ancestries simultaneously as part of the tested trio. Based on these more reliable tests, out of fifteen possible configurations only three were Nevertheless, analysis of ancestry block structure reveals that blocks of European and Central Siberian ancestry are the smallest and therefore are likely to be the oldest in Tuvans. However, there is clear evidence of recent additional Central Siberian gene flow into Tuvans, as in addition to small blocks they exhibit an abundance of wide blocks not seen in the Altaians (Fig.7C). On the other hand, the average width of Western Siberian ancestry blocks is only slightly smaller than that of Asian ancestry blocks, and although this implies that the Asian ancestry is more recent (in agreement with all the AHG results), the difference is too small to distinguish in terms of inferred admixture dates (the date inferred for these two ancestries is ~1,980 ya). Given this additional evidence, we conclude that the configuration best supported for Tuvans is (((European, Central Siberian)Western Siberian)EAsian), and it was this configuration that we tested against the graph inferred for Altaians (European, Western Siberian)(Central Siberian, EAsian). We tried to simulate the admixture history (AB)(CD) using as the input for A, B, C and D the sample means and variances for the European, Central Siberian, Western Siberian and EAsian ancestry components observed in the Tuvans. However, as part of the simulation we generate the parameter space of all possible combinations of A, B, C and D available for the population at the time of admixture under a given admixture model and observed admixture proportions, and the Tuvan set of parameters was outside such a parameter space, i.e. the (AB)(CD) admixture model (at least in its simplest form without additional gene flow) inferred for Altaians is not possible for Tuvans. For example, if we assume that the Central Siberian and EAsian ancestry we observe in Tuvans originated from the same source population (as in Altaians), then we can estimate the initial parameters in the admixing population. For Tuvans, let mc=0.37 (observed mean frequency of Central Siberian ancestry) md=0.39 (observed mean frequency of EAsian ancestry) ma=0.11 (observed mean frequency of European ancestry) mb=0.08 (observed mean frequency of Western Siberian ancestry) (Note that the Tuvans also have 3% Far Eastern and 2% Yupik-Inuit ancestry components, which do not enter into the calculations). From these values we can first estimate the initial frequency of the Central Siberian ancestry in the initial admixing population: mz = mc / (1-(ma+mb)).
In this example, the initial mean frequency of the component mc was 0.46. As mz is distributed in the interval (0,1) and follows a truncated gamma distribution, the maximal expected variance is ((1-mz) 2 )/3 or 0.097. We can compare this value for variance with the one that would yield the observed variances in mc and md : unrelated Yoruban (YRI) and Han Chinese (CHB) individuals and individuals of European ancestry (CEU), downloaded from the International HapMap project home page (International HapMap Consortium et al. 2007). First, we simulated admixture between YRI and CEU haploid chromosomes. Admixed genomes were constructed as described previously (Pugach et al. 2011).
Briefly, we started by building a recombination map by drawing from an exponential distribution with weight λ, such that an ancestry switch occurred with probability 1 -e -λg for each distance of g Morgans. The λ parameter reflects the time of admixture through the expected number of ancestry switches (or breakpoints) that occur given the rate of admixture and recombination (Pugach et al. 2011). In our previous study (Pugach et al. 2011) we showed that for simulated data, the observed number of breakpoints starts to deviate from the expected value beginning around 50-100 generations since the admixture event (depending on the rate of admixture and the effective population size, Ne) ( Fig.S18A), which means that the actual time of admixture the simulated number of breakpoints (parameterized by λ) corresponds to is older than λ. In other words and based on the results obtained previously (Pugach et al. 2011), recovering a λ of 80 in the hybrids would indicate power to recover admixture events which have occurred 85 generations ago, if the rate of admixture was 30%; 100 generations ago, if the admixture rate was 40%, and 115 generations ago, if 30% admixture happened in a population with small Ne (which applies to most of the Siberian populations as evident by the genome-wide pattern of LD (Fig.S11)). To generate artificial genomes we started at the beginning of each simulated chromosome and at each of the recombination points from the recombination map we sampled CEU ancestry with probability α and YRI ancestry with probability 1 -α, where the value of α was sampled from a beta distribution with mean 0.30 and standard deviation 0.10. The following values of λ were simulated: 80, 100, 120. We constructed 20 admixed chromosomes for each value of λ. These artificial chromosomes were then used as one of the parental populations for another simulated admixture event, involving CHB. New hybrids were constructed exactly as above, but the values of λ simulated for this more recent admixture event were: 10, 20, 40, and 60. As before, 20 haploid chromosomes were simulated per population. Thus, in total 12 artificial data sets with different admixture histories were simulated (each recent admixture time point is simulated three times, each time using a different admixture background). To recover these 24 admixture dates (two dates of admixture for each of these artificial populations), we first inferred ancestry along the chromosomes using PCAdmix with the same settings as for the empirical data. Wavelet decomposition of the admixture signal as inferred by PCAdmix was performed following the workflow described for the empirical data. The results are reported in Fig.S20 and demonstrate that the admixture dating protocol (Pugach et al. 2011) which we now modified to apply to dating two (or more) pulses of gene flow into the same 6 155 156 PCAdmix with the same settings as for the empirical data. Wavelet decomposition of the admixture signal as inferred by PCAdmix was performed following the workflow described for the empirical data. The results are reported in Fig.S21. As expected, additional gene flow lowers the estimated dates of admixture, as it introduces new wider ancestry blocks into the population, thereby altering older narrower block structure and erasing information on past admixture events. Higher rates of additional gene flow amplify this effect, and the underestimates are most severe for scenarios involving the most recent gene flow (λ =20), and the highest admixture rates (α = 0.4). Furthermore, the confidence intervals for the dates inferred for the earlier vs. later event of admixture overlap considerably. This means that in some situations an earlier admixture event might be inferred to be younger than a more recent admixture event.

S1F. Simulations to determine the time-depth resolution of the method
The power to correctly infer small ancestry tracts characteristic of older admixture events depends a) on the density of informative SNPs present in the data, b) on the degree of genetic differentiation of the source populations and c) on the robustness of the method to the source population being mis-specified. Although we have shown previously that in general the wavelet transform method offers a better resolution than any of the existing admixture dating methods it has been compared to (Pugach et al. 2011), the data itself could pose a limitation on how far back in time the dates of admixture could be recovered. To test this we again constructed different sets of artificially admixed genomes following the strategy described in the previous paragraph, but using the same populations (and same markers) that were used as the source groups for our admixture dating routine, i.e. Koryaks (KRY), Italians (ITA), Khanty (KHA), Nganasan (NGA) and Greenland Inuit (ESK), generating hybrid chromosomes to mimic the admixture histories observed for the Siberian dataset. For the first event of admixture we simulated 22 different values of λ ranging from 20 to 200. For each value of λ 20 admixed chromosomes were constructed. These artificial chromosomes were then used as one of the parental populations for another simulated admixture event. New hybrids were constructed exactly as above, but the values of λ simulated for this more recent admixture event ranged from 10 to 100 (but always keeping this λ smaller than the λ used to generate the background event of admixture). As before, 20 haploid chromosomes were simulated per population. In total 118 artificial data sets for each of the admixture histories were simulated. To recover the dates of admixture (two dates of admixture for each of these artificial populations) we followed the workflow described for the empirical data, we first inferred ancestry along the chromosomes using PCAdmix, and then used wavelet decomposition of the admixture signal as inferred by PCAdmix to recover the dates. The results are reported in Fig.S18 and demonstrate that most simulated admixture events are inferred to have happened not more than 4,000 years ago.
Nevertheless, most of the dates we infer for the Siberian dataset are within the bounds of the method's resolution. Furthermore, they are in good agreement with what has been suggested by other studies, e.g. Haak et al (Haak et al. 2015)and Allentoft et al (Allentoft et al. 2015) which show that large scale bi-directional migrations across the Eurasian steppe in the Bronze age produced major changes in the genetic structure of Eurasia, admixing with and replacing autochtonous populations.

S1G. Simulations to assess the robustness of the method to mis-specification of parental groups
To assess the effect of mis-specification of parental groups on the time of admixture estimates, we similarly constructed different sets of artificially admixed chromosomes (chromosome 1 only), but using CEPH panel Affymetrix 500K SNP chip data (López Herráez et al. 2009) from Siberian (Oroqen), Pakistani (Balochi) and European (French) individuals, generating hybrid chromosomes which matched the Siberian dataset in the most important characteristics such as the degree of separation of the parental groups and the density of markers (24,068 Affy SNPs vs 26,420 Illumina SNPs). First, we simulated admixture between French (FRE) and Oroqen (ORO) haploid chromosomes. We used a rate of admixture of 0.3 and simulated 19 different values of λ ranging admixture time estimation (Fig.S22C) are only observed when the proxies used are dramatically either less (experiment b) or more differentiated (experiment c) than the true source populations are ( Fig.S22D). In both examples the dates involving Burusho and Yoruba admixture are overestimated. In the first scenario, where all three proxy populations are not well differentiated for the SNPs used in the analysis, PCAdmix seems to randomly assign segments of chromosomes it does not "recognize" making the inferred ancestry blocks more uniform in length. In the second scenario one of the proxies used was so different from the true source population (Fst=0.136 between Yoruba and Balochi), that only 3% of the chromosomes were assigned to this ancestry (instead of the simulated 30%), making the ancestry blocks very small and leading to a dramatic overestimation of one of the admixture dates, while the second one was underestimated due to artificial extension of the remaining ancestry blocks. Unfortunately, most of the currently available admixture dating methods require the use of pre-defined source populations. Identification of an appropriate source group is always difficult, especially for populations that have experienced admixture a long time ago, and hence had more time to experience genetic drift. Since incorrect identification of source groups could potentially lead to erroneous conclusions, careful attention is required when identifying parental groups. These experiments however show that PCAdmix is fairly robust to the mis-specification of parental groups (as was already shown in (Brisbin et al. 2012) for admixture events involving two ancestral populations) and that the mistake in the choice of proxies would have to be considerable for it to noticeably affect the inferred admixture dates.

S1H. Addressing other concerns with dates of admixture inference
We assessed the possibility of overestimating the dates of admixture due to potential errors in ancestry estimation by PCAdmix. Indeed, if small errors are being introduced during ancestry estimation (with small blocks of "inappropriate" ancestry being inserted within long continuous blocks), such errors would cause the signal of admixture to appear older. We calculated the widths of ancestry blocks, as inferred by PCAdmix, in Anabar Dolgans and scrutinized the genomewide distribution of block widths for each individual and each ancestry. The results are summarized in inferred with high confidence (at least 0.8 probability in at least 80% of samples). Comparison of dates of admixture inferred using a full set of windows vs. only high confidence windows ( Fig.S24A) reveals no tendency to overestimate admixture dates when all the data are used, and again argues against PCAdmix being the source of systematic error in our dates of admixture inference.
Since it is known that haplotype phasing at the level of the entire chromosome is subject to phasing (switch) errors (Tang et al. 2006;Andrés et al. 2007), we wanted to assess to what extent such potential switch errors might affect our results. For two-way admixture we can directly compare results obtained from phased and unphased data (see the discussion in section S1D of similar dates obtained for admixture in Russians using two different methods). For admixture scenarios involving more than two sources of ancestry, the direct comparison is impossible, since the StepPCO approach (which uses unphased data) cannot handle more than two parental populations and PCAdmix requires phased data. Therefore, to test the effect of phasing on admixture dating in a population with more than two sources of ancestry, we re-phased part of the data using a different phasing algorithm, SHAPE-it (Delaneau et al. 2008). SHAPE-it has been shown to have a lower error rate than BEAGLE (Browning & Browning 2007), which was the phasing algorithm employed for this study. The expectation here was that if phasing switch error has a strong influence on the inferred dates of admixture, then a different switch error rate should result in a different date of admixture estimate. Our results show that dating a three-way admixture event using BEAGLEphased vs. SHAPE-it-phased data yields almost identical dates (Fig.S24B).
In another set of experiments we took the PCAdmix output and "unphased" it by collapsing each pair of windows corresponding to two phased chromosomes of each individual into one, and then running wavelet transform analysis on such "unphased" chromosomes. Although this is not the same as using unphased data (as "unphasing" is done on windows, not on genotypes), we still expect the small ancestry blocks potentially introduced by switch error to disappear and their effect on dates of admixture estimation to be removed. For example, if the ancestry inferred by PCAdmix in six consecutive windows on two phased parental chromosomes is 1 2 1 2 0 1 and 2 1 2 1 0 1 (ancestry from parental populations 0, 1 or 2), then our "collapsed" chromosome will have the following assignment for these windows: 3 3 3 3 0 1 (3 coding a window with ancestry assigned to both parental populations 1 and 2). It is easy to see that such "unphasing" converts four potentially erroneous small ancestry blocks into one longer ancestry block. If we indeed have such erroneously inferred small blocks of ancestry in our data, "unphasing" should result in a more recent admixture TEXT S3. SUPPLEMENTARY NOTE ON GENETIC HISTORY AND

ARCHAEOLOGICAL RECORD OF INDIVIDUAL POPULATION GROUPS
In the following sections we analyze signals of admixture and discuss the genetic prehistory of populations grouped either by their present-day geographical location or by their linguistic affiliation.

S3A. Yakuts and Dolgans
Although they live in central and northern Siberia, the Yakuts and Dolgans speak Turkic languages related to those spoken in the south. The Yakuts practice animal husbandry like the South Siberian Turkic-speakers, whereas the Dolgans are nomadic reindeer breeders and hunters like their neighbours, the Tungusic-speaking Evenks (Popov 1956a;Tokarev & Gurvich 1956;Vasilevich 1956). Notwithstanding the great geographic distances that separate the Yakuts and Dolgans from their linguistic relatives, their genetic relationship with the South Siberian populations emerges clearly in both the PC and ADMIXTURE analyses. In the PC analysis, the Turkic-speaking populations are not separated along PC1-PC2, and a distinction between the southern (Altaians, Tuvans) and northern (Yakuts, Dolgans) Turkic-speaking groups is first observable only at PC3 and PC4 (Fig.S3). It has been shown before that for spatially structured data the first two PCs accurately capture geography, and genetic information can thus be used to infer geographic origin (Lao et al. 2008;Novembre et al. 2008). Such a genetic trace of pre-migration origins can be discerned for the Yakuts: their position in the PC analysis would place their geographic origin to the east of Tuva and north of Buryatia; this is in good accordance with their purported origins around Lake Baikal (Tokarev & Gurvich 1956;Crubézy et al. 2010).
The hypothesis of a southern origin is also indirectly supported by the results of the IBD analysis.
The amount of the IBD segments shared within the Yakuts is relatively high in comparison to other populations (Fig.S10); for instance, they share twice as many segments within the population as the Buryats, even though the census sizes of Buryats and Yakuts are comparable 389 individuals, Yakuts -478,085 individuals; Russian 2010 census). Furthermore, the slow decay in the frequency of small IBD segments (2-10 cM) in comparison to longer segments in the Yakuts IBD segments is consistent with a recent expansion from a small founding population, which is also supported by uniparental data (Pakendorf et al. 2002;Pakendorf et al. 2006;Zlojutro et al. 2009) and by the relatively high genome-wide LD in the Yakuts (Fig.S11B).
In terms of sharing with other populations, the Yakuts share more and the longest IBD blocks with the Dolgans, who are thought to be partially descended from Yakuts (Stapert 2013, and references therein), but also with the neighbouring Even groups (Sakkyryyr, Sebjan, Tompo), Buryats, and Evenks ( Fig.5 and Fig.S12). The sharing of IBD blocks with Buryats is consistent with the idea that the Yakut ancestors once lived in close proximity to the ancestors of modern-day Buryats. The Dolgans share a large amount of long IBD segments with the Yakuts, Evenks, Nganasan and Evens ( Fig.S12). While the Yakuts and Evenks are thought to have contributed genetically to the modernday Dolgans, the affinity towards the Nganasan and Evens is probably indirect, and is best explained by these populations having themselves shared ancestry with the Evenks, although direct admixture with Nganasan cannot be excluded.
The ADMIXTURE results (Fig.3) suggest three sources of ancestry for the Yakuts and the Dolgans: Central Siberian (blue) (59% in the Yakuts, 67% in the Dolgans), EAsian (pink) (30% in the Yakuts, 18% in the Dolgans), and European (light green) (10% in the Yakuts, 11% in the Dolgans). In contrast to their South Siberian Turkic-speaking relatives, the Yakuts and Dolgans lack the Western Siberian (yellow) component. involved in an early admixture event described in Text S1E). The ethnogenesis of the Dolgans, who are linguistically very closely related to the Yakuts, but culturally very close to Evenks, has occurred in recent times and is known to have been influenced by the neighboring Yakuts, Evenks and Russians. Indeed, all our analyses confirm that the Dolgans are most closely related to the Yakuts and the Evenks (although some additional gene flow from Nganasan cannot be excluded with the genetic data), with evidence for recent additional gene flow from a European source. Speakers of Tungusic languages are spread from the Yenisey river in the west to the Kamchatka Peninsula in the east, and from the Taimyr Peninsula in the north to China in the south (Fig.3). The Evenks and Evens, whose languages belong to the Northern Tungusic branch, are traditionally highly mobile hunters and reindeer herders who are dispersed over vast territories of Siberia (Levin 1956;Vasilevich 1956), while the linguistically closely related Oroqen from northeastern China are traditionally hunters and horse herders. In contrast, the other Tungusic populations of the Amur region and northern China, such as the Hezhen and Xibo, speak languages belonging to the Southern Tungusic and Manchu branches. The Hezhen are traditionally fishers and hunters (Ivanov et al. 1956), while the Xibo are agriculturalists (Cheboksarov et al. 1965).
In the PCA, the Even and Evenk subgroups sampled all across Siberia show remarkably little genetic differentiation; the Evenk and Even populations are differentiated only along PC3 and PC4 ( Fig.2, Fig.S3) and along PC2 in the analysis comprising only the Siberian populations (Fig.S4).
Surprisingly, there is no observable structure within either of these populations, even though the samples are from locations that are up to 2700 km apart.
Analysis of the IBD blocks suggests that the Evenks and Evens have a lower effective population size than the Tungusic peoples of the south (e.g. the Oroqen), as the former share more IBD segments within the population than the latter (Fig.S10). This is also evident in the pattern of genome-wide LD, where the Oroqen have lower genome-wide LD values than the Evens or the Evenks (Fig.S11A). Overall, Evenks and Evens share a large number of long IBD segments, and they also share such segments with neighbouring populations: the Evenks with the Nganasan, Dolgans, Yukaghirs and Nenets, the Evens with the Yukaghirs, Dolgans and the Yakuts (Fig.S12).
The results of the ADMIXTURE analysis ( Fig.3) reveal that the Tungusic populations from all across Siberia, with the exception of the Xibo, trace their ancestry to three sources (Central Siberian (blue), EAsian (pink) and Far Eastern (red)). The Central Siberian component is higher in the west than in the east and south, and ranges from 85% in the Evenk subgroups and 64% in the Evens of Kamchatka to 40% in the Oroqen, 24% in the Hezhen, and 15% in the Xibo. The Far Eastern component is highest in the easternmost Even subgroups and lowest in Tungusic populations of western and southern Siberia; this ranges in frequency from 28% in the Evens of Kamchatka to 6% in the Evenks, 5% in the Oroqen, 4% in the Hezhen, and 2% in the Xibo. The EAsian component is seen at highest frequency in the southeast, where it reaches 54% in the Oroqen and 71% and 80% in the Hezhen and the Xibo, respectively, while it is present at around 10% in the Evenks and Evens.
The Tungusic peoples are further characterized by the complete absence of any European (light green) ancestry (except for the Xibo, who have 2% of this ancestry component).
Analysis of the ancestry block width inferred by PCAdmix (Fig.S15B) indicates that the blocks of Central Siberian ancestry are wider and the variance around the mean is higher in the Evenk groups, while the blocks become narrower and the variance decreases in the Oroqen, Hezhen, and Xibo.
This pattern is compatible with a scenario of additional gene flow from a population of mainly Central Siberian ancestry into the western Tungusic groups (Evenks, and Evens from Sakkyryyr, Sebjan and Tompo). In addition, it has been shown previously with mtDNA and Y chromosome data that the Sakkyryyr and Sebjan Even subgroups have experienced substantial amounts of admixture from Yakuts (Pakendorf et al. 2007;Duggan et al. 2013). Our results based on autosomal data confirm this conclusion, as the Sakkyryyr Evens are much closer to the Yakuts than to the other Tungusic groups in all measures of genetic relatedness based on IBDs, while the Sebjan and Tompo Evens appear substantially closer to the Yakuts than do the Berezovka and Kamchatka Evens ( Fig.S12). Also, the ADMIXTURE results show that Sakkyryyr Evens carry around 1.5% European ancestry (presumably contributed by the Yakuts). Assuming, that the elevated levels of EAsian ancestry in the Sakkyryyr, Sebjan and Tompo Evens in comparison to the Kamchatka and Berezovka Evens were contributed via admixture from the Yakuts, we can roughly estimate that they have received 14-40% gene flow from Yakuts (also see Fig.S5 for the ADMIXTURE results at K=7). On the other hand, the widest blocks of EAsian ancestry, and the highest variance around the mean, are observed in the Oroqen, Hezhen, and Xibo (Fig.S15B). This suggests substantial recent gene flow from a EAsian source population into the southern Tungusic peoples. Similarly, the distribution of the Far Eastern ancestry blocks is consistent with some additional contact between the Even subgroups closest to Kamchatka and the Koryaks, although this gene flow was not as abundant as the Central Siberian admixture detected in the Evenks and western Even subgroups or the EAsian admixture into the Tungusic populations of the south (Fig.S15B).
The AHGs (consistent across the top 30% of the ADMIXTURE runs) therefore yield a composite picture of the successive layers of differential admixture experienced by the Tungusic subgroups, well as the Oroqen and Hezhen have experienced the least amount of recent gene flow from a population of Central Siberian ancestry, we assume that their AHG reflects the ancestral admixture history of all of the Tungusic-speaking populations. Later substantial gene flow from a source or sources of predominantly Central Siberian ancestry in the Evenks and western Even subgroups (Sakkyryyr, Sebjan, and Tompo) will have considerably lowered the estimated age of Central Siberian admixture for these groups, leading to this being reconstructed as the most recent event (see Fig.S21 and the simulations of the effects of additional gene flow from a population that was involved in previous admixture described in Text S1E). We therefore did not date the admixture events for the Evenks and western Even subgroups, but only show the inferred AHG (Fig.S15A).
Substantial gene flow from a population of predominantly EAsian ancestry will have lowered the estimated age of the EAsian admixture in the Tungusic populations of the Amur region, but this would not have altered the AHG configuration.
The earliest ages estimated for the first admixture event are observed in the Oroqen (3,120 ya; CI 2,700 -3,630) and Hezhen (3,120 ya; CI 2,910-3,930); however, these are unlikely to reflect the true age of the ancestral admixture event, as these dates exceed the resolution limit available for the Tungusic-speaking populations of ca. 2,700 years (Fig.S18F). These dates probably reflect the difficulty of PCAdmix to correctly assign fragments of chromosomes to either Central Siberian or ancestry tracing back more than 3,200 years ago, and the Tungusic-speaking populations lack this substrate. We therefore suggest the Amur River as the place of origin of the Tungusic populations, which is also in good agreement with the archaeological evidence. Archaeological sites along the Amur are numerous, and the earliest are dated to at least 13,000 ya (Popov & Tabarev 2008). The Far Eastern and Central Siberian admixture inferred by our analysis could correspond to the latest Neolithic material culture of the region, which is dated to 4,000-3,000 ya and is associated with the appearance of fundamentally new stone-working techniques, which are ascribed to the arrival of new peoples (Derevianko & Clark 1965).

S3C. Samoyedic populations
The Samoyedic languages spoken by traditionally semi-nomadic reindeer-herding as well as hunting populations of western Siberia and the Yamal and Taimyr Peninsulas of the northernmost Russian Arctic belong to the Uralic language family, which also includes Finno-Ugric languages such as Finnish and Hungarian in Europe and Khanty in western Siberia. From historical data it is known that some Samoyedic languages were spoken in the Altai region of south Siberia, but these are now extinct (Vajda 2009). As shown by the PC analysis ( Furthermore, wider on average blocks of Western Siberian and Central Siberian ancestry, and higher variance around the mean, suggest that all populations have experienced multiple events of gene flow from these sources (Fig.S16B). Therefore, the dates of admixture reported below are composite dates, and reflect an older date of admixture as well as this more recent additional gene flow. We estimate the date of admixture between the European (light green) and Central Siberian (blue) ancestries to be 2,700 ya (CI 2,490 -3,120) in the Selkup, and 2,490 ya (CI 2,310 -3,120) in the Yamal Nenets (Fig.S16A). We did not attempt to date this signal of admixture in the Taimyr Nenets subgroup, because the estimated proportion of admixture in this group is too low for our method to work reliably (see Fig.S2A and Text S1B). The date obtained for the more recent admixture with the Western Siberian ancestral group is 1,590 ya for the Selkup and the Yamal Nenets (CIs 1,470-1,710 and 1,470-1,860, respectively); the estimated date for the Taimyr Nenets is more recent at 1,170 ya (CI 1,020 -1,260). In this group we also observe wider and more variable ancestry blocks (Fig.S16B), which is consistent with the gene flow either being more recent, or occurring over an extended period of time. Importantly, the inferred dates for the Western Siberian admixture in the Samoyedic-speakers are at the limit of the resolution (Fig.S18E) and could therefore be an underestimate of the true date.
The sequence of admixture events and the dates of admixture estimated for the Selkup and Nenets ( Fig.S16A) are relatively close to those obtained for Tuvans (see Fig.7B). Although the dates for the prehistoric mixture between the European and Central Siberian components are more recent for the Samoyedic populations (~2500-2700 ya) than the dates estimated for Tuvans (~3100 ya), this is most likely due to later additional gene flow into the Selkup and Nenets from some population of predominantly Central Siberian ancestry, as discussed in the main text, and also due to probable additional recent admixture with Russians ( Fig.S16B). On the other hand, the EAsian component we observe in Tuvans (and Altaians) is not seen in any of the Samoyedic groups, and its arrival in the Sayan plateau (and the Altai region) predates the Western Siberian admixture estimated for the Samoyeds. Our analyses therefore suggest the following history for the Selkup and Nenets: it is likely that modern-day Selkup, Nenets, and Tuvans trace their ancestry to the same source population, and either modern-day Tuvans migrated to the Sayan plateau after their split from the ancestors of modern-day Selkup and Nenets, or the Sayan plateau itself was the ancestral homeland of these populations. Before later migrations brought Asian genes into this area of South Siberia, the ancestors of the Samoyeds started to expand north (e.g. it is thought that the northward migration of the Kulai people of the southeastern part of Western Siberia, which some suggest were the ancestors Genetically, they appear to be much closer to the neighboring Tungusic speakers and the Yukaghirs than to the Selkup and even the neighboring Nenets. This leads us to conclude that the Nganasan have a different genetic history than the other Samoyedic groups and that they were linguistically (Popov 1956b;Dolgih 1957;Helimskij 2000), but not genetically, assimilated by the Samoyedic populations who around 1,000 years ago migrated from the south to the northern Arctic.

S3D. Populations of the Russian Far East: Koryaks, Chukchi and Naukan Yupik
The Chukotka and Kamchatka Peninsulas were once part of ancient Beringia, a vast late Pleistocene in their social organization. The Chukchi and Koryaks, who speak languages of the Chukotka-Kamchatkan family, used to lead a nomadic lifestyle and practice inland reindeer herding in combination with coastal hunting and fishing as a mode of subsistence. The Naukan Yupik, who live in close proximity to the Chukchi, speak a Yupik language (which belongs to the Aleut-Yupik-Inuit language family extending across Alaska, the Canadian Arctic and Greenland) and traditionally engaged solely in coastal sea mammal hunting (Antropova & Kuznetsova 1956;Menovschikov 1956).
The PCA results reveal genetic affinities between the Koryaks, the Chukchi and the Naukan Yupik.
The genetic relationship between the Chukchi and Koryaks is further confirmed by the ADMIXTURE analysis, which in addition suggests reciprocal gene flow between the Chukchi and Naukan Yupik, albeit with gene flow predominantly from the Naukan Yupik into the Chukchi.
Strikingly, both the short-range and long-range genomewide LD is substantially higher in the Naukan Yupik than in the Chukchi, who often reside in neighboring villages or even the same settlements. This would imply that in contrast to the Chukchi, the Naukan Yupik underwent an additional strong bottleneck. Overall, the populations of Chukotka and Kamchatka extensively share IBD blocks among themselves as well as with the Yukaghirs and the neighbouring Even subgroups (Fig.5, Fig.S12). The Koryaks are observed to share fewer IBD blocks with the Naukan Yupik than the Chukchi do. This is probably due to admixture between the Naukan and the Chukchi, as also indicated by the ADMIXTURE analysis. This area is outside of the region where Yupik and Inuit settlements are attested; however, genetic admixture with the Chukchi will have reduced the genetic distance between the Naukan Yupik and Chukotka and thus will also reduce the geographic distance needed to provide the best match between genetics and geography. Given that from the perspective of geographic distances the territory of origin we reconstruct lies to the east of Alaska, we can conclude that the origin of the Naukan Yupik was neither in Chukotka nor in Alaska, but somewhere further east. This conclusion should be taken with caution however, since we did not consider how strongly this result could be affected by genetic drift and admixture, both processes shown here to have played a role in the demographic history of the Naukan Yupik.
Our results support an origin of the Naukan Yupik in North America and their subsequent backmigration to Siberia. Furthermore, the dates of reciprocal admixture estimated for the Naukan Yupik and the Chukchi of ca. 1,500 ya are in good agreement with the time of the first appearance of the Neo-Yupik-Inuit culture in Siberia. Yet given that the inferred date of admixture is close to the limit of the resolution available for the Chukchi and the Naukan Yupik ( Fig.S18I) we cannot rule out the possibility that the date of admixture is actually older.

S3E. General comments
For many of the Siberian populations, we can identify and distinguish signals of admixture events shared between populations or population subgroups (e.g. the signal of European admixture shared by all the Turkic-speaking groups) from local differential gene flow (e.g. the recent Yakut admixture which affected some, but not all, Even subgroups). We observe what might be the genetic impact of the powerful Xiongnu empire of ca. 2,000 years ago on the surrounding populations. The Xiongnu empire was centred around today's Mongolia, but stretched far beyond its current borders. The early dates of EAsian admixture that we infer for the populations of the Altai and Sayan Mountains (Altaians and Tuvans) as well as the populations which trace their origin to lake Baikal (Buryats, Yakuts, and ultimately Dolgans) might well be the genetic legacy of this powerful empire, rather than reflecting the impact of the much more recent Mongol empire under Gengis Khan and his successors. Similar early EAsian admixture that might reflect Xiongnu influence is also observed in all the Tungusic-speaking populations; the fact that this ancestry is shared by all the Tungusic peoples suggests that the gene flow must have happened before Evens and Evenks (and Oroqen) diverged and expanded across Siberia. In contrast, the Mongol empire founded by Gengis Khan, which lasted from 1206 to 1368, and was the largest empire in the world, stretching from central Europe to the Sea of Japan and from southern Siberia to India and Indochina, appears to have had a genetic impact only on the Buryats -a population in immediate proximity to Mongolia. This is consistent with the idea that the Mongol empire was maintained via appointed local viceroys with little if any physical presence of the conquerors. In fact, Russian territories were in subjection to the Golden Horde (part of the Mongol empire and its successors) for nearly 300 years, yet no evidence of this was detected in Y chromosome analyses of Russians (Balanovsky et al. 2008), and the absence of this signal is confirmed here with dense autosomal data (see Text S1D for the analyses and discussion of the admixture in Russians). Interestingly, it was suggested that the expansion of     Table S3.

Results of the AHG test A/B,C B/C,A A/C,B
Tungusic speakers: To work out the order of admixture events in populations with more than three sources of ancestry, we consider each trio of ancestries separately. For example, for the Altaians we first try to resolve the ordering of A,B and C ancestries. However the results of the three covariance tests for all possible configurations of A,B and C are inconclusive, hence this ordering cannot be resolved. The next trio we consider is A,B and D. Here, the smallest covariance is observed between the ratio of B/D and A, we can therefore conclude that A is the most recent of these three ancestries. The next trio is A,D and C. For this trio the smallest absolute value of covariance is observed between A/C and D, hence A ancestry must have mixed with C ancestry before mixing with D ancestry. The covariance tests for the last trio B,C and D are again inconclusive, yet the only possible configuration for the entire admixture graph given these results is (BD)(AC).  The simulated history is ((AB)C), where ancestries A and B were already present in a population at the time of admixture with C. Each rectangle summarizes the results for 100,000 simulations performed for a given set of A and C values, and the colour intensity reflects the number of simulations in which the tree was inferred correctly. The lowest value of 70% correctly inferred graphs is observed when the final frequency of A (or B) is below 10%, and the frequency of C is around 50%. (B) The simulated history is (((AB)C)D), where an already admixed population with two ancestry components A and B first admixes with C, which is then followed by another admixture event with D. Based on the simulation results of ((AB)C) admixture histories (previous panel), the scenarios where the final frequency of A (or B) dropped to below 10% were excluded. In each panel, each rectangle summarizes the results for 100,000 simulations performed for a given set of C and D values, with the value of A given in the panel's title. The color intensity reflects the number of simulations in which the tree was inferred correctly. Grey rectangles denote impossible combinations of A, C and D. For example, if the initial frequency of A is 0.2, that of C 0.5, and that of D 0.2, then addition of C would reduce the frequency of A to 0.1, while subsequent influx of ancestry D would bring its final frequency to 0.08 (see formulae given in Figure 2), which is too low given the criteria we set for these simulations. (C) The simulated history is ((AB)(CD)), where admixture occurs between two already admixed population. In each panel, each rectangle summarizes the results for 100,000 simulations performed for a given set of A and C values, with the value given in the panel's title denoting the mixing ratio between the AB and CD populations. The color intensity reflects the number of simulations in which the tree was inferred correctly, and grey rectangles denote impossible combinations of A, C and D.      It a ly _ T u s c a n R u s s ia n s K h a n ty N e n e ts _ T a im y r N e n e ts _ Y a m a l N e n e ts _ R e i i c h S e lk u p N g a n a s a n N g a n a s a n _ R e ic h K e ts Y u k a g h ir E v e n k s _ S to n y − T u n g u s k a    FIGURE S10. Analysis of IBD blocks shared within populations. The average number of IBD blocks shared per pair of individuals belonging to the same population is plotted against the genetic length of such blocks (blocks smaller than 2 cM were excluded from the analysis). Populations are coloured according to their linguistic affiliation, and the colouring scheme is described in the legend to Figure 1 of the main text.  Table  S1). Population labels are abbreviated to the first 3 letters of the population name, except EVN = Even, EVK = Evenk, MNG = Mongolian, JPN = Japanese. Each label is color coded according to the population's linguistic affiliation as described in the legend to Figure 1 of the main text. The size of each circle is proportional to the mean number of IBD segments shared between the population marked with an asterisk and the population named in the label. The color intensity is proportional to the mean length of such shared IBD segments.    E v e n k s − S t o n y T u n g E v e n k s − T a i m  FIGURE S16. Admixture profiles for the Samoyedic-speaking populations: Selkup and Nenets. (A) Admixture history graphs and admixture dates inferred for each population and each admixture episode. Proxy parental populations for the different ancestral components (represented as circles) were as follows: Blue = Nganasan, Yellow = Khanty, Light Green = Italians. (B) Cumulative distribution of all ancestry blocks. For each population the plot captures the total abundance of blocks of each ancestry (x axis) of different genetic lengths in cM (y axis); the average width of the blocks of each ancestry and variance around the mean are also shown. FIGURE S17. Extrapolated origins of the Siberian Naukan Yupik. The extrapolation is based on the results of the regression analysis of the genetic and geographic distances ( Figure S22), which revealed that all the genetic distances between the Naukan Yupik and other populations in Siberia are elevated given the geographic distance between them. In order to improve the fit between the genetics and geography, the Naukan Yupik were assigned different sets of geographic coordinates in Alaska, Canada and Greenland, each time recalculating the regression coefficient. The best improvement was achieved when the Naukan Yupik were assigned any set of coordinates in the geographic region of Canada circumscribed here by the shaded area. . Assessment of the time-depth resolution of the study. (A) Admixture dates recovered from the results of forward simulations (100 simulations for each parameter setting) where we model dynamics of recombination in an admixed population with a certain rate of admixture and Ne. The observed number of breakpoints starts to deviate from the expected value (parameterized by λ) beginning around 50-100 generations since the admixture event (see Text S1D for details and references). Accordingly, since the WT method recovers not the simulated number of breakpoints, but the time of admixture the observed number of breakpoints corresponds to, the recovered λ deviates from the simulated λ. (B-I) Recovery of the admixture dates from artificial genomes, generated from populations used in this study as proxies for the parental groups. Solid lines show recovered admixture dates in an artificially generated hybrid population where a more recent gene flow occurs on a background of an older admixture event. The shaded area denotes the 95% confidence interval. Dotted lines denote the date of admixture actually inferred from the Siberian data. In (F) the lighter blue dotted line indicates the date of the Red-Blue admixture inferred for the Oroqen, while the darker blue line indicates the date inferred for the Berezovka Evens. Estimate of background λ, when for recent event λ is 10, 20,40 or 60 FIGURE S20. Performance of the wavelet transform analysis (adapted to include complex admixture scenarios) in recovering the simulated dates of admixture. The method was applied to 12 artificial datasets with different admixture histories. Each recent admixture time point (λ = 10, 20, 40 and 60) was simulated three times, each time using a different admixture background (λ = 80, 100 or 120). The shaded area denotes the 95% confidence interval for the inferred dates of admixture.  Distribution of block lengths of ancestry 2 ("Asian") in Dolgans (for each individual) FIGURE S23. Test for excess of small-sized ancestry blocks as inferred by PCAdmix, with the expectation that an excess of short ancestral blocks would indicate erroneous insertions by PCAdmix. Shown is the genomewide distribution of block widths of each ancestry inferred for each Anabar Dolgan individual. Block width in (cM) is shown on the x-axis, while abundance is represented on the y-axis (note that the scale here is logarithmic). . Comparison of dates of admixture inferred using a full set of windows (blue) vs. only high confidence windows (red), and on artificially "unphased" data (orange). Dating is based on simulated data from 100 simulations with a 30% migration rate. Each curve represents a single admixed population. Average WT centers calculated for 100 chromosomes drawn at random from each population at exponentially growing time points are plotted as a function of time. Vertical lines indicate the time estimate, and shaded boxes define the confidence intervals. Time estimates are based on the entire unfiltered sample of Taimyr Nenents, and the admixture rate for the simulated data is chosen so as to match the empirical data. (B). Admixture time estimates (for two episodes of admixture) based on data phased with either BEAGLE or SHAPEit. Measurements obtained for a more recent vs. an older admixture episode using the BEAGLE-phased and SHAPEit-phased data are shown by blue and green horizontal lines, respectively. Vertical lines indicate the time estimate (blue font for the BEAGLE-phased, and green font for the SHAPE-it phased data), and shaded boxes define the confidence intervals. Time estimates are based on the entire unfiltered sample of Anabar Dolgans, and the admixture rate for the simulated data is chosen so as to match the empirical data.