Abstract

The geographic origin and time of dispersal of Austroasiatic (AA) speakers, presently settled in south and southeast Asia, remain disputed. Two rival hypotheses, both assuming a demic component to the language dispersal, have been proposed. The first places the origin of Austroasiatic speakers in southeast Asia with a later dispersal to south Asia during the Neolithic, whereas the second advocates pre-Neolithic origins and dispersal of this language family from south Asia. To test the two alternative models, this study combines the analysis of uniparentally inherited markers with 610,000 common single nucleotide polymorphism loci from the nuclear genome. Indian AA speakers have high frequencies of Y chromosome haplogroup O2a; our results show that this haplogroup has significantly higher diversity and coalescent time (17–28 thousand years ago) in southeast Asia, strongly supporting the first of the two hypotheses. Nevertheless, the results of principal component and “structure-like” analyses on autosomal loci also show that the population history of AA speakers in India is more complex, being characterized by two ancestral components: one represented in the pattern of Y chromosomal and EDAR results and the other by mitochondrial DNA diversity and genomic structure. We propose that AA speakers in India today are derived from dispersal from southeast Asia, followed by extensive sex-specific admixture with local Indian populations.

Introduction

Austroasiatic is the eighth largest language family in the world in terms of the number of native speakers (104 million) (Lewis 2009). As its name implies, it is spoken in southern parts of Asia—in Vietnam and Cambodia as the main official languages and in India, Bangladesh, Nepal, Burma, Laos, Thailand, and Malaysia as the first language of many minority groups that are isolated from each other by other language speakers. Two major extant branches of the Austroasiatic language tree are Munda in eastern, northeastern, and central India and Khasi-Aslian, which stretches from the Meghalaya in the northeast of the subcontinent to the Nicobars, Malay Peninsula, and Mekong delta in southeast Asia (fig. 1A). Since the birth of historical linguistics in the 1640s, attempts have been made to explain the wide and continuous geographic spread of some language families, such as the Indo-European, Uralic, and Bantu, in contrast to the more patchy or constrained distribution of others, for example, the Basque and Khoi-San languages. Models proposed to explain the success of a few rather than many language families range from those stressing pure demic diffusion to pure cultural diffusion driven by some economic or technological advance as the key mechanism of the language spread. One of the prehistoric events that has been considered as a plausible device to fuel both demographic and cultural spread is the shift from a hunter-gatherer to an agricultural mode of subsistence thought to have occurred independently in only a few places in the world (Ammerman and Cavalli-Sforza 1984). However, the attempt at explaining the success of the ten most widely spoken language families of the world in terms of the Neolithic demic diffusion model (Diamond and Bellwood 2003)—that is, by linking the spread of languages, genes, and economy—has been challenged in almost every single case (Richards et al. 2000; Fuller 2003; Ehret et al. 2004). 
The hypothesis that the spread of the Austroasiatic language family can be traced back to rice cultivators of southeast Asia (Higham 2003; Bellwood 2005) is contested, but some relationship between early Austroasiatics and rice agriculture is a view that remains prevalent among linguists.

FIG. 1.

(A) Language tree of the major subgroups of the Austroasiatic (AA) language family according to Diffloth (2009). The branching of the hypothetical extinct para-Munda languages Melluha and Kubha-Vipas is shown by a broken line. The branching pattern of the extant languages allows for both south and southeast Asia to be considered equally as potential homelands for the initial spread of AA. According to Fuller (2007), the acceptance of the extinct para-Munda branch would support the origin of AA in the Indian subcontinent. The map depicts the geographic distribution of the AA family (adapted from Diffloth 2001 and Anderson 2007, covering southeast Asia and India, respectively) and the sampling locations (with the precision of district) for the Indian AA samples. Numbers correspond to populations as given in table 1. Note that for India, only the concentrated AA regions are highlighted. Munda speakers can be found in low frequencies throughout east India; thus, the few sampling locations outside the shown AA areas still represent AA populations. (B) Out of southeast Asia and (C) out of India dispersal models. These two models represent two alternative views to explain the spread of AA-speaking populations, all sharing rice domestication related vocabulary, in south and southeast Asia. According to model B, the AA family originated in southeast Asia. This model requires only one domestication event of rice in East Asia. In contrast, model C implies the origin of the AA family and its initial split in India. According to this model, Oryza indica and Oryza japonica rice were independently domesticated in what today are India and China. Recent gene flow between local Indian (Ind) non-AA groups and Munda speakers (Mun) in model B and between Khasi-Aslian (Kh-As) and local East Asian (EAs) derived populations is indicated by broken lines.
Depending on the extent of the recent admixture, model B allows for preservation of some southeast Asian genetic ancestry among Munda, whereas no distinguishable Indian contribution is expected among Khasi-Aslian groups of southeast Asia. Conversely, model C assumes continuity of Munda groups in India with no specific east Asian contribution to their genes (apart from secondary gene flow from local Tibeto-Burman groups of India), whereas Khasi-Aslian would be expected to represent admixture between populations derived from the Indian subcontinent and southeast Asia.

The Higham–Bellwood model (Higham 2003; Bellwood 2005) considers Indian Munda-speaking and Khasi-Aslian–speaking hunter-gatherer populations, who regardless of their current lifestyle, share rice cultivation related cognates with Khasi-Aslian–speaking populations of southeast Asia, as Neolithic immigrants in India, because traditionally a single origin of rice cultivation in China has been assumed (fig. 1B). However, as argued by Fuller (2007), the genetic evidence of independent domestications for the Oryza indica and japonica cultivars suggests a plausible alternative scenario (fig. 1C) by which the homeland of the Austroasiatic family lies in India. If O. indica rice was indeed domesticated first in India, then its spread to southeast Asia may have been coupled with the spread of Austroasiatic speakers (Fuller 2007). However, the phylogenetic evidence from genes associated with rice domestication is not unequivocal—phylogenies of some functionally important genes continue to support the single-origin model (e.g., Jin et al. 2008; Tan et al. 2008). Opposing evidence from different genes may be reconciled by a model according to which the domestication was a lengthy process extending back to and even beyond the Last Glacial Maximum, as opposed to the earlier view of a rapid transition which placed the domestication of crops to the Pleistocene/Holocene boundary (Allaby et al. 2008). However, according to current archaeological evidence, the shift to a lifestyle where rice would be an essential staple food would be younger than 7 thousand years ago (KYA) in China and even more recent in India (Fuller et al. 2009; Purugganan and Fuller 2009). In the light of the archaeobotanical, linguistic, and rice genomic evidence, the differentiation of Austroasiatic languages into their major subgroups could therefore be placed either in south or southeast Asia with their split or the latest date of contact probably being more recent than 7 KYA.

Table 1.

Detailed Description of AA and TB Samples Typed for O2a (M95), EDAR, and Illumina HumanHap 610K (WGA) in Present Study.

nr Population Indian state District Language group Number for M95 Number for EDAR Number for WGA 
1 Bonda Orissa Koraput South Munda 42 38 
2 Savara Orissa Koraput South Munda 21 38 
3 Gadaba Orissa Koraput South Munda 27 28 
4 Birhor Chhattisgarh Raipur North Munda 27 35 — 
5 Birhor Maharashtra Chandrapur North Munda 35 15 — 
6 Juang Orissa Sambalpur South Munda 54 20 
7 Baiga Orissa Kendujhar South Munda 42 21 — 
8 Mahli Jharkhand Bokaro North Munda 32 20 — 
9 Mawasi Jharkhand Gumla North Munda 27 29 — 
10 Santhal Jharkhand Gumla North Munda 20 19 
11 Kharia Chhattisgarh Raigarh South Munda 37 20 
12 Baiga Madhya Pradesh Guna South Munda 23 19 — 
13 Mawasi Madhya Pradesh Bhopal North Munda 12 10 — 
14 Ho Bihar Begusarai North Munda 45 32 
15 Khasi Meghalaya East Garo hills Khasi-Aslian 21 20 
16 Garo Meghalaya East Garo Hills Tibeto-Burman 25 20 
17 Asur Jharkhand Dhanbad North Munda 13   
18 Asur Jharkhand Ranchi North Munda 48 35 
19 Asur Jharkhand Palamau North Munda 27 — 
20 Burmese — — Tibeto-Burman — — 15 
21 Cambodians Li et al. (2008) and Xue et al. (2009) Khasi-Aslian 10 10 

NOTE.—“nr” is the code of population shown in figure 1 and supplementary figure S4, Supplementary Material online.

Genetic studies on human populations of south and southeast Asia have, hitherto, proved to be inconclusive about the two opposing models of the geographic origins of the Austroasiatic-speaking people and about the timing of the split between the two major branches in this language family. The mitochondrial DNA (mtDNA) information available so far indicates a clear distinction of Indian Munda and southeast Asian Khasi-Aslian–speaking groups, as both share their mtDNA haplogroups with their regional neighbors who speak languages other than Austroasiatic (fig. 2 and table 2). Consistent with this linguistic separation, the Khasi-Aslian–speaking Nicobarese carry almost exclusively East Asian–specific mtDNA (Thangaraj et al. 2005). Notably, Khasi speakers residing in Meghalaya state in India (the only Khasi-Aslian group of mainland India) show an admixed package of both Indian and East Asian mtDNA haplogroups (fig. 2 and table 2). Overall, the mtDNA haplogroup distributions make a clear distinction between Indian and southeast Asian Austroasiatic speakers; because of the lack of shared lineages, this evidence is not informative about any shared phase of evolutionary history of Munda and Khasi-Aslian–speaking populations. In contrast, Y chromosome haplogroup O2a occurs frequently among both Indian and southeast Asian Austroasiatic speakers (table 2) and thus appears as evidence for some degree of shared ancestry (Kivisild et al. 2003). Because all other branches of haplogroup O are largely restricted to East Asia and given the recent time depth of Y short tandem repeat (STR) variation of Indian haplogroup O2a, its recent (<10 KYA) entry from southeast Asia (fig. 1B) has been implied in some studies (Sahoo et al. 2006; Sengupta et al. 2006). On the one hand, the frequency of haplogroup O lineages in India is correlated with language boundaries and cannot be explained only by isolation by distance (fig. 2A and table 2).
On the other hand, high levels of genetic diversity of mtDNA haplogroups in Munda speakers and an independent assessment of Y-STR diversity of haplogroup O2a in India, dating its origin to ∼65 KYA, have been used to argue in favor of a model that assumes direct descent of Austroasiatic speakers from the initial settlers of India (fig. 1C) and their subsequent dispersal to southeast Asia, possibly before the Last Glacial Maximum (Basu et al. 2003; Kumar et al. 2007; Chakravarti 2009). Arguably, the more recent (<10 KYA) estimates of the age of O2a variation in India could have been deflated by limited regional sampling. It should be noted, however, that the 65 KYA dating of haplogroup O2a in India appears much older than the estimated age of its ancestral haplogroups K and NO (Rootsi et al. 2007; Karafet et al. 2008). Moreover, the southeast Asian populations have been underrepresented in all previous studies, and furthermore, no high-resolution autosomal evidence has been considered in these debates. Therefore, the genetic origins of Austroasiatic-speaking populations remain largely controversial.

Table 2.

mtDNA and Y Chromosome Haplogroup Profiles in South (S) and Southeast (SE) Asia by Population.

Columns: mtDNA (n, S Asian, SE Asian, Unresolved), then Y chromosome (n, S Asian, SE Asian, Unresolved)
South Asia         
    Nicobarese (MK\KA\AA) speakers 46 2.18 91.3 6.52 11 100 
    Khasi (KA\AA) speakers 363 39.67 38.57 21.76 465 10.11 74.62 15.27 
    Munda/AA speakers 742 75.2 24.8 1,572 26.78 60.56a 12.66 
    Indo-European speakersb 838 59.07 12.65 28.28 1,593 43.69 14.12 42.19 
    Dravidic speakers 665 59.55 0.3 40.15 1,445 62.63 2.49 34.88 
    Tibeto-Burman speakers 139 2.16 66.91 30.94 242 7.44 85.95 6.61 
SEA         
    KA/AA speakers 138 88.41 11.59 395 1.27 89.11 9.62 
    Tibeto-Burman speakers 523 0.57 75.72 23.71 387 1.55 66.93 31.52 

NOTE.—AA, Austroasiatic; KA, Khasi-Aslian; MK, Mon-Khmer; n, number of samples.

a. The southeast Asian Y chromosomal frequency of Munda and Tibeto-Burman speakers of India is due to the presence of haplogroup O2a (M95) and O3 (M122), respectively. See supplementary tables S9 and S10 (Supplementary Material online) for detailed information on the population-wise frequency.

b. Tharu and Mushar populations, who have frequent east Asian haplogroups, are included in Indo-European speakers.

FIG. 2.

Scatter plot showing southeast Asian–specific lineages among different linguistic groups of India. The geographical distribution of Munda languages in India is mainly governed by longitudinal distances; therefore, frequencies of Y chromosome (left panel) and mtDNA (right panel) haplogroups are plotted against longitudinal distances (x axis). Mushar and Tharu (who now speak Indo-European languages and show exceptional levels of east Asian haplogroups in contrast to their linguistic affiliation) are marked with arrows. South Asian haplogroups are, for mtDNA, M2–6, N5, M33–65, R5–8, and R31–32 and, for the Y chromosome, C5, F, H, L, and R2. Southeast Asian haplogroups are, for mtDNA, A–G, M7–12, R22, and N9 and, for the Y chromosome, C2, C3, D, and M–O. Unresolved haplogroups are, for mtDNA, M*, R*, and N*, including other lineages, for example, M31 and West Eurasian specific, and, for the Y chromosome, C*, G, I–K*, P*, Q, and R1. Haplogroup frequencies and associated references are given in detail in supplementary information (supplementary tables S9 and S10, Supplementary Material online).

In this paper, we sought to investigate the extent of population structure and admixture among the Indian and southeast Asian AA speakers embedded in their autosomal genomes and to combine the results obtained with data from uniparental loci and from regional selection signatures, such as that of the EDAR gene. We used Illumina HumanHap 610K genotyping chips on 45 diverse Indian samples covering three major language groups from India relevant to our study (22 Austroasiatic [19 Munda and 3 Khasi-Aslian], 19 Dravidian [Behar et al. 2010], and 4 Tibeto-Burman speakers) and 15 Burmese samples from Myanmar. These results were combined with the global data set (Li et al. 2008), generated with Illumina HumanHap 650K chips, which, among others, included a set of Pakistani populations as proxy for the Indo-European speakers of south Asia and a sample of ten individuals from Cambodia which is predominantly a Khmer-speaking country (for a full list of populations and sample sizes see supplementary table S1, Supplementary Material online).

Materials and Methods

A detailed description of experimental procedures can be found in the supplementary experimental procedures (Supplementary Material online). Genotyping of the 41 new Indian and Burmese samples on Illumina HumanHap 610K chips was carried out according to the manufacturer's specifications. We combined our newly generated data with relevant reference data sets from the Stanford HGDP SNP Genotyping Data (http://hagsc.org/hgdp/files.html) and with 19 Dravidian samples from Behar et al. (2010) (supplementary table S1, Supplementary Material online). EDAR 1540T/C, a nonsynonymous single nucleotide polymorphism (SNP) in exon 12, was genotyped by polymerase chain reaction (PCR) direct sequencing using forward (5'-3')-GTAGGTCTTAGCCCCAC (annealing temperature [T] = 54 °C) and reverse (5'-3') CATCCAGCCGCTCAATC (annealing T = 54 °C) primers. Altogether, 1,077 Indian samples were assayed for this polymorphism. In total, 1,563 Y chromosome samples were analyzed in this study (supplementary table S7, Supplementary Material online). An NRY-specific multiplex (Indian Y-Plex) PCR was designed to characterize 589 Indian AA and TB samples (supplementary table S8, Supplementary Material online). The ABI 3100 Genetic Analyzer (Applied Biosystems) was used for genetic typing. Fragment sizes were determined using the GeneMapper Analysis Software v4.0, and allele designations were based on comparison with allelic ladders included in the Y-filer kit.

Principal Component and Admixture Analyses of Genome-Wide SNP Data

We used PLINK 1.05 (Purcell et al. 2007) to filter the combined data set to include only SNPs on the 22 autosomal chromosomes with minor allele frequency >1% and genotyping success over 97%. Because background linkage disequilibrium (LD) can affect both principal component analysis (PCA) (Patterson et al. 2006) and “structure-like” analysis (Alexander et al. 2009), we thinned the data set by excluding SNPs unique to either of the two Illumina platforms and SNPs from mtDNA and the X and Y chromosomes, and by removing one SNP of each pair in strong LD (r2 > 0.4) in a window of 2,000 SNPs (sliding the window by 25 SNPs at a time). After thinning, the combined data set had data for 215,729 SNPs, which were used in subsequent analyses. For PCA, we generated an additional data set with the same filters but excluding the African samples, yielding a matrix of 631 samples by 189,512 SNPs.
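The frequency and genotyping-success thresholds described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation; the function name `passes_filters`, the 0/1/2 allele-count coding, and the toy genotypes are assumptions made only for illustration.

```python
def passes_filters(calls, min_maf=0.01, min_call_rate=0.97):
    """Decide whether one SNP passes the frequency and call-rate filters.

    calls: one entry per sample, the count (0, 1 or 2) of a chosen
    allele, or None where genotyping failed (hypothetical coding).
    """
    called = [g for g in calls if g is not None]
    if not called:
        return False
    call_rate = len(called) / len(calls)
    # Frequency of the counted allele among called genotypes
    # (two alleles per diploid sample).
    freq = sum(called) / (2 * len(called))
    maf = min(freq, 1 - freq)
    return call_rate > min_call_rate and maf > min_maf


# Hypothetical toy data: SNP name -> genotype calls for four samples.
toy_snps = {
    "snp_a": [0, 1, 2, 1],        # common allele, fully called
    "snp_b": [0, 0, 0, 0],        # monomorphic, fails the frequency filter
    "snp_c": [1, None, None, 0],  # half the calls missing, fails call rate
}
kept = [name for name, calls in toy_snps.items() if passes_filters(calls)]
print(kept)
```

In this toy case only `snp_a` is retained. LD-based thinning is a separate step that compares pairs of SNPs within a sliding window and is not shown here.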

We carried out principal component (PC) analysis using the smartpca program (with default settings) of the EIGENSOFT package (Patterson et al. 2006) to capture genetic variation described by the first ten PCs. The fraction of total variation described by a PC is the ratio of its eigenvalue to the sum of all eigenvalues (fig. 3A).
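The ratio just described is simple enough to show as a worked sketch. The eigenvalues below are hypothetical and are not taken from the smartpca output of this study.

```python
def variance_fractions(eigenvalues):
    """Fraction of total variation per PC: eigenvalue / sum of eigenvalues."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


# Hypothetical eigenvalues, for illustration only.
toy_eigenvalues = [5.0, 3.0, 1.0, 1.0]
fractions = variance_fractions(toy_eigenvalues)
for rank, frac in enumerate(fractions, start=1):
    print(f"PC{rank}: {frac:.1%} of total variation")
```

Note that the denominator is the sum over all eigenvalues, not only those of the PCs that are retained or plotted.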

FIG. 3.

(A) PCA of Indian Austroasiatic, Dravidian, and Tibeto-Burman groups in the context of other Eurasian populations. PC analysis was carried out using the smartpca program (with default settings) of the EIGENSOFT package. After filtering SNPs (see Materials and Methods for detail), the combined data set yielded a matrix of 615 samples with 189,533 SNPs. (B) Bar plot displaying individual ancestry estimates for the studied populations from a structure analysis using ADMIXTURE with K = 7.

Of the several structure-like (so named by Weiss and Long 2009) algorithms, we experimented with Frappe (Tang et al. 2005; Li et al. 2008) and ADMIXTURE 1.4 (Alexander et al. 2009), running the data set with different settings several times. Although we settled on the latter, mostly because of its faster computation time, we note that Frappe gave very similar results. In the final setting, we ran ADMIXTURE with a random seed on the LD-pruned data set one hundred times at each K from K = 2 to K = 10. Following an established procedure, we examined the log likelihood scores (LLs) of the individual runs and found that up to and including K = 9, the maximum difference between LLs in the 10% fraction of the runs with the highest LLs was minimal (<1 LL unit). Thus, we could, with some confidence, assume that these individual runs from K = 2 to K = 9 converged on the global maximum. The new version of ADMIXTURE (1.4) assists in the choice of K with a cross-validation (CV) procedure (we used a hold-out fraction of 0.1). We obtained the lowest CV scores at K = 7 (fig. 3B). This choice of K was further supported by the observation that at higher Ks, the newly emerging clusters (ancestry components) were largely restricted to one population and thus of little interest in a population comparison study. Nevertheless, plots of all converged Ks are given in supplementary figure S1 (Supplementary Material online). For plotting, we took one run from the 10% fraction of runs with the highest LLs. We note, however, that the vast majority of the runs at each K (K = 2 to K = 7) yielded very similar LLs (on the same plateau of the LL distribution), indicating a very similar (visually indistinguishable) distribution of clusters (ancestry components).
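The convergence check and the choice of K described above can be sketched in a few lines of Python. This is only an illustration of the stated criteria (LL spread within the top 10% of runs below one unit; lowest CV score); the function names and all numbers are hypothetical and do not come from the ADMIXTURE runs of this study.

```python
def top_fraction_spread(log_likelihoods, fraction=0.10):
    """Spread (max - min) of LLs within the best `fraction` of runs."""
    ranked = sorted(log_likelihoods, reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top = ranked[:n_top]
    return top[0] - top[-1]


def converged(log_likelihoods, fraction=0.10, tolerance=1.0):
    """True if the best runs agree to within `tolerance` LL units."""
    return top_fraction_spread(log_likelihoods, fraction) < tolerance


def best_k(cv_scores):
    """K with the lowest cross-validation score; cv_scores maps K -> score."""
    return min(cv_scores, key=cv_scores.get)


# Hypothetical LLs from 20 runs at two values of K.
runs_by_k = {
    2: [-1000.2, -1000.5] + [-1003.0] * 18,  # best runs agree closely
    3: [-900.0, -905.0] + [-910.0] * 18,     # best runs disagree
}
status = {k: converged(lls) for k, lls in runs_by_k.items()}

# Hypothetical CV scores.
toy_cv = {2: 0.60, 3: 0.55, 4: 0.57}
chosen = best_k(toy_cv)
print(status, chosen)
```

With 100 runs per K, as in the text, the top 10% fraction corresponds to the ten best runs.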

Using PLINK, we pruned our initial autosomal data set and excluded one SNP from each pair of SNPs with LD r2 > 0.1 in a 50-SNP window shifted at 10-SNP intervals to ensure complete data independence. This procedure resulted in a pruned data set containing 54,355 SNPs, from which we calculated mean pairwise FST differences between linguistic and continental population groups using the method of Cockerham and Weir (1984). We also calculated Hs and Ho for all autosomal SNPs, in accordance with Nei (1987). Great circle distances were calculated as in Ramachandran et al. (2005).
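As an illustration of a great circle distance, the following Python sketch uses the haversine formula on a spherical Earth. It is a generic sketch, not the exact procedure of Ramachandran et al. (2005): it omits any waypoint routing, and the Earth radius and the function name are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp against floating-point overshoot before taking the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the equator, as a simple sanity check.
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1))
```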

Statistical Analysis (Y-STR)

The number of haplotypes and the average number of pairwise differences (supplementary table S2, Supplementary Material online) of Y-STRs for the studied populations were calculated using the Arlequin 3.01 software package (Excoffier et al. 2005). DYS389I (DYS389cd) was subtracted from DYS389II and renamed 389ab. A median-joining network, resolved with the MP algorithm, was constructed using the Network package (version 4.5.0.2) (www.fluxus-engineering.com); one Steiner tree is shown in figure 5B. The M95 (O2a) variance isofrequency map was generated using Surfer8 (Golden Software Inc., Golden, CO), following the Kriging procedure. The age of M95 (O2a) was estimated from microsatellite variation within the haplogroup using the method described by Zhivotovsky et al. (2004) and updated in Sengupta et al. (2006). Moreover, different founders were identified based on Network analysis of Munda speakers. The age of these founders was estimated from the ρ statistic (the mean number of mutations from the assumed root of each and every founder), using a 25-year generation time and the TD statistic, assuming a mutation rate of 6.9 × 10−4 (Zhivotovsky et al. 2004), based on variation at 14 common Y-STR loci (supplementary table S3, Supplementary Material online).
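A rough Python sketch of a ρ-based age estimate follows. It assumes a stepwise model in which each one-repeat difference from the founder haplotype counts as one mutation, and it treats the stated rate as per locus per generation; a count read from the actual network could differ. The haplotypes, locus count, and function names are hypothetical.

```python
MUTATION_RATE = 6.9e-4     # per locus per generation, as stated in the text
YEARS_PER_GENERATION = 25  # generation time stated in the text


def dys389ab(dys389ii, dys389i):
    """Repeat count of the 389ab segment: DYS389II minus DYS389I."""
    return dys389ii - dys389i


def rho(haplotypes, founder):
    """Mean number of stepwise mutations separating haplotypes from the founder."""
    steps = [sum(abs(h - f) for h, f in zip(hap, founder)) for hap in haplotypes]
    return sum(steps) / len(steps)


def age_in_years(haplotypes, founder):
    """Convert rho to years: rho / (loci * rate) generations, times 25."""
    generations = rho(haplotypes, founder) / (len(founder) * MUTATION_RATE)
    return generations * YEARS_PER_GENERATION


# Hypothetical three-locus haplotypes (repeat counts), for illustration only.
toy_founder = [14, 12, 23]
toy_haplotypes = [
    [14, 12, 23],  # identical to the founder
    [15, 12, 23],  # one step at locus 1
    [14, 13, 23],  # one step at locus 2
    [14, 12, 22],  # one step at locus 3
]
toy_rho = rho(toy_haplotypes, toy_founder)
toy_age = age_in_years(toy_haplotypes, toy_founder)
print(toy_rho, round(toy_age))
```

The estimate scales linearly with ρ and inversely with both the number of loci and the assumed mutation rate, so the choice of rate largely determines the resulting age.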

Results and Discussion

Assessing Autosomal Population Structure and Admixture in Austroasiatic Speakers

In figure 3A, we present the PC analyses for Eurasian populations only. PC analysis (fig. 3A) resulted in a crude reflection of the geographic locations of the studied populations. We also performed PC analysis with the whole data set. Naturally, the first component there differentiates Africans from all other populations and PC2 and PC3 correspond very closely (data not shown) to PC1 and PC2 of the Eurasian PC plot (fig. 3A). However, the Eurasian PC analysis shows better resolution on the east–west and north–south axes within Eurasia, thus being better suited to answering the questions we address in the present study. For example, the two Pakistani samples (HGDP00130 and HGDP00175), which show a high level of admixture with Africans, were positioned surprisingly close to the Khasi samples on the PC2/PC3 plot in the global PC analysis (plot not shown). In the Eurasian PC analysis, the first component (explaining 5.6% of the total variation) differentiates west from east Eurasia, whereas the second component (1%) separates south Asians from the rest. None of the first ten significant PCs clustered the Munda-speaking populations from the Indian subcontinent together with Khasi-Aslian–speaking populations of southeast Asia. In the first two PCs, the Munda speakers from the eastern states of India cluster close to the Dravidian speakers while being slightly shifted toward the east Asian cluster by PC1 (fig. 3A). The Khasi-Aslian–speaking Khasi, on the other hand, are closer to East Asians than to the Dravidian speakers. The position of the Garo (Tibeto-Burman speakers) overlaps with that of the Cambodians (Khasi-Aslian), who cluster with Tibeto-Burman–speaking populations from Myanmar and China while being slightly drawn toward the Indian cluster. 
Mean genetic distances (FST) estimated over the whole genome recapitulate the pattern extracted by the first PCs, whereby Munda speakers are most closely related to Indian Dravidian speakers, whereas Khasi-Aslian and Tibeto-Burman groups from India and southeast Asia are more similar to each other, although the Indian Khasi-Aslian also have high affinity with Munda speakers (supplementary table S4, Supplementary Material online). The PC plots and genetic distance estimates support the view of southeast Asian origins of Indian Khasi-Aslian (and Tibeto-Burman)–speaking populations, whereas, in contrast, Indian Munda-speaking populations draw their genetic ancestry mainly from the source shared with Indian Dravidian speakers.

As another approach, we used the structure-like algorithm ADMIXTURE (Alexander et al. 2009), which gives a maximum likelihood estimate for the population structure of sampled individuals. It assumes a specified number of discrete ancestral populations (K) and computes respective ancestry proportions for each studied individual. The approach should be considered with the caveat that the assumption of discrete ancestral populations is generally not a realistic model of population history (Weiss and Long 2009). Regardless of these conceptual difficulties, the results of the ADMIXTURE analyses may represent a robust picture of the similarities and dissimilarities between studied samples in terms of genetic patterning within the raw data. Thus, with these limitations in mind, we note that irrespective of the number of assumed ancestral populations (2 < K < 7), the Munda speakers of India show a consistently higher proportion of the East Asian component than Dravidian or Indo-European speakers of the Indian subcontinent (supplementary fig. S1, Supplementary Material online).

At K = 7, the Munda speakers are characterized by two ancestry components (fig. 3B). The predominant “dark green” component makes up approximately three-quarters of the Munda ancestry palette. This component is most prominently apparent among the south Indian Dravidian speakers and is relatively rare among the Indo-European–speaking Pakistani populations. On the other hand, the Munda speakers lack the “light green” component that is prevalent among the Indo-European–speaking Pakistani populations and to a minor extent also present in south India, Near East, and Europe. The East and southeast Asian populations show the presence of two ancestry components: The pink component is most clearly pronounced in Oroqen and Hezhens from Northern China, whereas the orange component is overwhelming among Cambodians, as well as Burmese of Myanmar and Dai and Lahu populations from southwest China (fig. 3B). These two components reveal two contrasting patterns of East and southeast Asian admixture among south Asian populations. Consistent with their Central Asian/Mongolian ancestry, Uygurs and Hazara carry predominantly the pink ancestry component, whereas the Munda speakers exhibit membership only in the orange cluster. Garo, Burmese (both Tibeto-Burman), and, notably, also Khasi (Khasi-Aslian) appear to have both East and southeast Asian components, regardless of the absence of the pink component among the Khmer-speaking Cambodians. Although these results are thus consistent with notable (23%, standard deviation [SD] 5%) southeast Asian genetic admixture among Indian Munda speakers, in support of the model presented in figure 1B, there are also detectable traces of south Asian (dark green) admixture among the Cambodians (16%, SD 5%). This finding provides some quantitative support for the alternative model presented in figure 1C that assumed an Indian origin for the Austroasiatic language family.

The observed patterns of genetic admixture on both sides of the Bay of Bengal suggest that models assuming only one episode of unidirectional gene flow are likely to be oversimplifications in describing the historico-demographic processes underlying the origin and differentiation of the Austroasiatic-speaking populations. These patterns could, however, also be understood as a result of long-term gene flow under isolation by distance (IBD), which would be the default model to explain geographic correlations in genetic patterning among populations (Wright 1943). Arguably, a significant proportion of genetic variation in genome-wide STR and SNP diversity among worldwide populations can be explained by IBD (Handley et al. 2007). The IBD model would predict in our case that the ancestry components revealed by the ADMIXTURE analyses would apply to Indian and southeast Asian populations regardless of their linguistic affiliation. Indeed, our Illumina whole-genome data for Dravidian speakers come only from populations of Karnataka and Kerala, which are geographically distant from the Austroasiatic groups concentrated in the eastern states of Orissa and Bihar. The whole-genome genotype data from a small number of populations are robust in terms of the number of loci considered while revealing the extent of East and southeast Asian ancestry among the Indian Austroasiatic speakers. Yet, we would have to use data from a large number of populations, as, for example, in the case of the Y chromosome diversity patterns (fig. 2B), to address the question of whether the observed patterns in autosomal genes are due to IBD or dispersals.

When populations admix, alleles under positive selection are expected to proliferate more efficiently in the hybrid population than other alleles on average. A positively selected allele could therefore be used in a conservative approach to test the IBD versus dispersal hypotheses because, even in cases of limited gene flow between populations, the positively selected alleles would be expected to show higher than average penetrance, unless, of course, the selection is region specific. Scans of positive selection on genome-wide polymorphism data from global human populations have identified the EDAR (ectodysplasin-A receptor) gene as a candidate for the strongest positive selection in East Asians (Sabeti et al. 2007). EDAR is a major genetic determinant of hair thickness, and its nonsynonymous SNP rs3827760 (Val370Ala; the 1540C allele) shows high frequencies in populations of East Asian and Native American origin but is essentially absent from European and African populations (Sabeti et al. 2007; Fujimoto et al. 2008).

Interestingly, in India, we observe the 1540C allele mainly in association with AA and TB populations (fig. 4). Tibeto-Burman speakers of India have the highest (∼61%) 1540C allele frequency in south Asia, consistent with their predominantly East Asian ancestry inferred from autosomal and uniparental loci. Meanwhile, the Khasi population is characterized by a 40% frequency of the allele (table 3). Munda speakers also show a detectable presence of the allele, with a ∼5% average, in contrast to its near absence among Indo-European and Dravidian speakers (the few exceptions being the Tharu, Mushar, Hazara, and Burusho populations) (fig. 4). These results are in line with the models suggesting gene flow from southeast Asia to India, albeit more pronounced among Khasi- than Munda-speaking populations. Given the evidence for strong positive selection on this allele in East Asia, our finding of only 5% frequency among Munda is surprisingly low, possibly reflecting the fact that the 1540C allele does not carry a significant biological advantage in India.

Table 3.

The Frequency of 1540C Allele of EDAR Gene in India by Language Family.

Language group                   n      1540C
Tibeto-Burman                    57     0.61
Austroasiatic (Khasi-Aslian)     20     0.40
Austroasiatic (Munda)            379    0.05
Indo-European                    338    0.01
Dravidian                        283    0.00

NOTE.—Global frequencies of the 1540C allele are given in supplementary table S7, Supplementary Material online. n, number of samples.
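
Because the group sizes in table 3 differ widely, the sampling uncertainty of each frequency differs as well. The following Python sketch computes approximate 95% Wilson score intervals from the tabulated values. It rests on an assumption not stated in the table, namely that n counts diploid individuals, so that 2n chromosomes were scored; the function name is hypothetical, and the intervals are illustrative rather than results of the study.

```python
import math

def wilson_interval(freq, n_individuals, z=1.96):
    """Approximate 95% Wilson score interval for an allele frequency."""
    n_chrom = 2 * n_individuals  # assumption: two alleles per individual
    denom = 1.0 + z * z / n_chrom
    centre = (freq + z * z / (2 * n_chrom)) / denom
    half = z * math.sqrt(freq * (1.0 - freq) / n_chrom
                         + z * z / (4 * n_chrom * n_chrom)) / denom
    return centre - half, centre + half

# Values copied from table 3: (n, frequency of 1540C).
table3 = {
    "Tibeto-Burman": (57, 0.61),
    "Austroasiatic (Khasi-Aslian)": (20, 0.40),
    "Austroasiatic (Munda)": (379, 0.05),
    "Indo-European": (338, 0.01),
    "Dravidian": (283, 0.00),
}
for group, (n, freq) in table3.items():
    lo, hi = wilson_interval(freq, n)
    print(f"{group}: {freq:.2f} ({lo:.3f} to {hi:.3f})")
```

Under these assumptions the interval for the Khasi (n = 20) is much wider than that for the Munda speakers (n = 379), although the two intervals do not overlap.
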

FIG. 4.

(A) Geographic distribution of the EDAR 1540C allele frequency worldwide. The map was generated using Surfer8 of Golden Software (Golden Software Inc.), following the Kriging procedure. Red dots indicate sampling location. (B) Geographic distribution of the EDAR 1540C allele frequency in different groups of south and southeast Asia. The frequency is shown in proportion to the bubble size.

Overall, the genome-wide autosomal evidence is therefore consistent with bidirectional gene flow between India and southeast Asia restricted mainly to Austroasiatic (and Tibeto-Burman)-speaking populations. The analysis of geographic and linguistic patterns in the distribution of the 1540C allele of the EDAR gene in 49 Indian populations (fig. 4) shows that linguistic affiliation is a significant predictor of allele frequency and therefore, at least in the case of this gene, the IBD model can be rejected. However, our analyses of autosomal variation did not inform us about the timing of the dispersal events.

Dating of the Genetic Variation in Y Chromosome Haplogroup O2a

The autosomal genetic evidence above appears to support previous claims made on the basis of Y chromosome evidence for the existence of a shared genetic component among Indian and southeast Asian Austroasiatic speakers. However, our analyses did not provide a date estimate for these shared elements of population history and furthermore suggested multidirectional gene flow. Genotyping of 12 SNP markers in 553 Y chromosome samples representing 13 Indian Austroasiatic populations sampled from 15 locations revealed the presence of 8 distinct haplogroups among Munda speakers, 7 of which they share with other Indo-European–speaking and Dravidian-speaking Indian populations (supplementary fig. S2, supplementary table S8, Supplementary Material online). Consistent with previous studies (Basu et al. 2003; Metspalu et al. 2004; Sahoo et al. 2006; Sengupta et al. 2006; Kumar et al. 2007), the eighth, O2a (M95), appears as the most frequent haplogroup among most Munda-speaking populations (supplementary fig. S2 and supplementary table S5, Supplementary Material online). Khasi (Khasi-Aslian) and Garo (Tibeto-Burman) populations of Northeastern India have two additional hg O subclades, that is, O3 (M122) and O*, the latter found only in Garo (supplementary fig. S2, Supplementary Material online). The presence of M122 at moderate frequency in Khasi is consistent with the autosomal data considered above and can be explained by their close geographic proximity to, and likely admixture with, Tibeto-Burman–speaking populations (e.g., Garo) among whom the O3 lineage is predominant (Cordaux et al. 2004; Reddy et al. 2007).

Previous Y chromosome studies have provided conflicting dates for the shared O2a lineage, owing either to different sampling or to different genotyping approaches. To avoid these issues, we genotyped a wide range of samples from both India and southeast Asia with the same widely used approach (AmpFℓSTR Y-filer kit). Using data from 14 Y chromosomal STR loci (supplementary table S3, Supplementary Material online), we estimate the age of all Y chromosomes from India and southeast Asia with the M95 mutation as ∼20 (±2.7) KYA (table 4). This estimate is significantly younger than the 65 KYA estimate of Kumar et al. (2007) but similar to the estimates of other haplogroup O subclades (Shi et al. 2005). O2a coalescent times appear to be significantly higher in southeast Asian populations than in India, in contrast to genome-wide heterozygosity patterns (supplementary fig. S3, Supplementary Material online), suggesting that the long-term effective population size of Munda Y chromosomes in India has been lower than that of Khasi-Aslian speakers in southeast Asia (fig. 5C and table 4). However, the lack of clear regional clustering in the STR-based phylogenetic network (fig. 5B) makes a simple founder-effect scenario unlikely to explain the lower diversity in India. If southeast Asia is the source of Indian O2a variation, more than one founding lineage would need to have been involved in the migration, and the differentiation time of Indian O2a lineages would have to be considered as the upper boundary of the migration rather than as the migration time itself (table 4). The Shompen remain outliers, equidistant from the other populations, consistent with the view of their linguistic isolation (fig. 5B).

Table 4.

Y Chromosome Age Estimates for Population Groups of India and Southeast Asia.

Group                            Sample size (n)    Age (KYA)
India (overall)                  178                15.9 ± 1.6
    North Munda                  87                 12.4 ± 1.3
    South Munda                  79                 18.4 ± 2.4
    Garo                                            5.8 ± 2.7
    Khasi                                           10.6 ± 2.6
Southeast Asia (overall)         142                22.4 ± 4.9
    Island (southeast Asia)      120                20.8 ± 4.9
    Mainland (southeast Asia)    22                 23.8 ± 4.2
Nicobarese                       11                 16.9 ± 5.9
Shompen                          10                 15.3 ± 10.8
O2a (M95) overall                331                19.5 ± 2.7

FIG. 5.

Surfer maps showing (A) the frequency and (C) the mean microsatellite variance distributions of haplogroup O2a (M95) in south and southeast Asia. Surfer maps were generated using Surfer8 of Golden Software (Golden Software Inc.), following the Kriging procedure. (B) Phylogenetic network relating Y-STR haplotypes within haplogroup O2a (M95). The network was constructed using a median joining with MP (maximum parsimony) algorithm as implemented in the Network 4.5.0.2 program. The size of the circles is proportional to the number of samples.

Our coalescent time estimate of 15.9 ± 1.6 KYA for Indian M95 carriers is more than 2-fold greater than the age estimated by Sengupta et al. (2006), while being more than 4-fold smaller than the one reported by Kumar et al. (2007). All three studies used different sets of STR loci and different ranges of sampling but the same phylogenetic calibration of the molecular clock. The difference between our estimate and that of Sengupta et al. (2006) can mainly be ascribed to the difference in geographic sampling: When applying the coalescent calculations to the subset of Ho and Santhal samples in our data, we observe a value (7.3 ± 1.5 KYA) which is not significantly different from the estimate (8.8 ± 2 KYA) reported for these same populations by Sengupta et al. (2006). It should also be noted that all eight STR loci shared between the two studies showed identical median haplotypes. Conversely, the age difference between our study and that of Kumar et al. (2007) cannot be explained by differences in the range of geographic sampling, as both studies cover a wide assortment of Austroasiatic-speaking tribes from India (supplementary fig. S4, Supplementary Material online). Overall, due to the apparent lack of geographic clustering of Indian Austroasiatic O2a Y-STR haplotypes in the phylogenetic network, our 15.9 ± 1.6 KYA age estimate for the Indian subset should not be taken as a genetic estimate of the dispersal time of Austroasiatic groups to India; rather, it can be considered as the upper boundary for any dispersal event(s) to India that involved the O2a lineage.
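
The logic of an STR-based age estimate can be sketched in a few lines of Python. The exact procedure used for table 4 is not restated in this section, so the sketch rests on stated assumptions: the age is taken as the average squared difference (ASD) in repeat number between each haplotype and the median (founder) haplotype, averaged over loci and divided by an effective mutation rate of 6.9 × 10^-4 per locus per 25-year generation, as proposed by Zhivotovsky et al. (2004). The function name and the toy haplotypes are hypothetical.

```python
from statistics import median_low

MU_PER_GENERATION = 6.9e-4  # assumed effective mutation rate per locus
YEARS_PER_GENERATION = 25   # assumed generation length

def str_age_years(haplotypes):
    """Age estimate from Y-STR repeat counts (rows = chromosomes)."""
    n_loci = len(haplotypes[0])
    # Founder haplotype: per-locus median repeat count.
    founder = [median_low([h[l] for h in haplotypes]) for l in range(n_loci)]
    # Average squared distance to the founder over chromosomes and loci.
    asd = sum((h[l] - founder[l]) ** 2
              for h in haplotypes for l in range(n_loci))
    asd /= len(haplotypes) * n_loci
    return asd / MU_PER_GENERATION * YEARS_PER_GENERATION

# Hypothetical repeat counts: four chromosomes typed at two STR loci.
toy_haplotypes = [[10, 12], [11, 12], [10, 13], [10, 12]]
print(str_age_years(toy_haplotypes))
```

Because the estimate scales with the variance around the founder haplotype, a subset drawn from few, closely related populations can yield a younger age than a geographically broad sample, which is the effect invoked above for the Ho and Santhal subset.
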

mtDNA Evidence for Sex-Specific Local Admixture among Indian Austroasiatic Speakers

Similarly to autosomal and Y chromosome data, the mtDNA evidence shows that Munda speakers of India have a substantial overlap with their local Dravidian-speaking and Indo-European–speaking neighbors in their mtDNA haplogroup composition. However, in contrast to the inferences based on other loci, there is no detectable evidence in >700 DNA samples from the Munda-speaking populations for a shared ancestry component with other Austroasiatic groups from southeast Asia (table 2).

The mtDNA haplogroup composition of Munda speakers is similar to that of Dravidian and Indo-European speakers of the Indian subcontinent (Basu et al. 2003; Metspalu et al. 2004; Chaubey et al. 2007; Chaubey, Metspalu, et al. 2008; Chaubey, Karmin, et al. 2008; Thangaraj et al. 2009). We carried out a high-resolution analysis of those haplogroups of Munda speakers that account for >4% of their maternal gene pool. All seven maternal haplogroups found frequently in Munda speakers are autochthonous to India (supplementary fig. S5, Supplementary Material online; Chandrasekar et al. 2009 and references therein), accounting altogether for 57% of the maternal gene pool of present Munda speakers. The extensive analysis of these haplogroups revealed relatively recent most recent common ancestors (MRCA) shared between AA and non-AA speakers within these groups, suggestive of admixture; a similar result was observed recently for hg R7, which is the most frequent among these in AA speakers (Chaubey, Karmin, et al. 2008). The fact that the mtDNA lineages of Munda speakers do not cluster in basal parts of the tree (to founder haplogroups M, N, or R) but are spread among the derived branches that date to <10 KYA (supplementary fig. S5, Supplementary Material online) suggests that the mtDNA diversity found in contemporary Munda speakers is the result of admixture from neighboring populations of India.

In sharp contrast, among the geographically proximate Khasi-Aslian–speaking Khasi population, approximately one-third of the mtDNA lineages have southeast Asian ancestry (fig. 2 and table 2). Notably, the Khasi are known historically to have been matrilocal. This pattern of sex-specific gene flow is perhaps not unexpected considering the patrilocality that most Munda-speaking groups practice today. Previous studies, though, have noted that the genetic effect of patrilocal practice in India is significantly different from that in southeast Asia due to different degrees of permeability in the marital boundaries (Kumar et al. 2006).

Conclusions

Thus, our analyses of genetic data from uniparentally and biparentally inherited loci provide a range of estimates of gene flow across geographic and linguistic borders. The analysis of autosomal data suggests bidirectional gene flow across the Bay of Bengal restricted to Austroasiatic-speaking and Tibeto-Burman–speaking populations. The presence of a significant (approximately one-quarter) southeast Asian genetic component among Indian Munda speakers is consistent with this model, implying their recent dispersal from southeast Asia followed by extensive admixture with local Indian populations. The strongest signal of southeast Asian genetic ancestry among Indian Austroasiatic speakers is maintained in their Y chromosomes, with approximately two-thirds falling into haplogroup O2a. Geographic patterns of genetic diversity of this haplogroup are consistent with its origin in southeast Asia approximately 20 KYA, followed by more recent dispersal(s) to India. Comparison of mtDNA and Y chromosome data reveals that the “import of local genes,” at least in the case of the Munda speakers of India, has likely been biased toward the female sex, resulting in a situation where the southeast Asian ancestry signal in the mtDNA lineages of Indian Munda speakers has been entirely lost. Further sampling of southeast Asian Austroasiatic-speaking populations and genome-wide sequence data along with in silico simulations would be required in the future to assess the demographic parameters of population dispersals between south and southeast Asia in explicit time frames.

Supplementary Material

Supplementary figures S1 to S6 and supplementary tables S1 to S10 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

We thank Ille Hilpus, Tuuli Reisberg, Viljo Soo, and Lauri Anton for technical assistance. We also thank Professor Thomas L. Rost, who allowed us to use the rice plant picture (for fig. 1) from http://www-plb.ucdavis.edu/labs/rost/Rice/RICEHOME.HTML. This work was supported by the EU European Regional Development Fund through the Centre of Excellence in Genomics, Estonian Biocentre, and University of Tartu, Estonian Basic Research grant SF0182474 to R.V.; Tartu University (PBGMR06901) to T.K.; Estonian Science Foundation (7858) to E.M. and (7445) to S.R.; UKIERI grant RG47772 to T.K., M.M.L., K.T., and L.S.; British Academy BARDA-48208 to M.B.R.; Estonian Ministry of Education and Research (0180142s08) and European Commission grant nr. 245536 (OPENGENE) to M.N.; European Commission (ECOGENE 205419) to E.B.C.; European Commission, Directorate-General for Research for FP7 Ecogene (205419) to D.M.B.; CSIR, Government of India to L.S. and K.T.; and The Wellcome Trust to Y.X. and C.T.-S.

References

Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals, Genome Res, 2009, vol. 19 (pg. 1655-1664)
Allaby RG, Fuller DQ, Brown TA. The genetic expectations of a protracted model for the origins of domesticated crops, Proc Natl Acad Sci U S A, 2008, vol. 105 (pg. 13982-13986)
Ammerman AJ, Cavalli-Sforza LL. The Neolithic transition and the genetics of populations in Europe, 1984, Princeton (NJ): Princeton University Press
Anderson GDS. The Munda verb: typological perspectives, 2007, Berlin: Mouton De Gruyter
Basu A, Mukherjee N, Roy S, et al. (11 co-authors). Ethnic India: a genomic view, with special reference to peopling and structure, Genome Res, 2003, vol. 13 (pg. 2277-2290)
Behar DM, Yunusbayev B, Metspalu M, et al. (21 co-authors). The genome-wide structure of the Jewish people, Nature, 2010, vol. 466 (pg. 238-242)
Bellwood PS. First farmers, 2005, London: Wiley-Blackwell
Chakravarti A. Human genetics: tracing India’s invisible threads, Nature, 2009, vol. 461 (pg. 487-488)
Chandrasekar A, Kumar S, Sreenath J, et al. (20 co-authors). Updating phylogeny of mitochondrial DNA macrohaplogroup m in India: dispersal of modern human in South Asian corridor, PLoS One, 2009, vol. 4 pg. e7447
Chaubey G, Karmin M, Metspalu E, et al. (31 co-authors). Phylogeography of mtDNA haplogroup R7 in the Indian peninsula, BMC Evol Biol, 2008, vol. 8 pg. 227
Chaubey G, Metspalu M, Karmin M, et al. (14 co-authors). Language shift by indigenous population: a model genetic study in South Asia, Intern J Hum Genet, 2008, vol. 8 pg. 41
Chaubey G, Metspalu M, Kivisild T, Villems R. Peopling of South Asia: investigating the caste-tribe continuum in India, Bioessays, 2007, vol. 29 (pg. 91-100)
Cockerham CC, Weir BS. Covariances of relatives stemming from a population undergoing mixed self and random mating, Biometrics, 1984, vol. 40 (pg. 157-164)
Cordaux R, Weiss G, Saha N, Stoneking M. The northeast Indian passageway: a barrier or corridor for human migrations?, Mol Biol Evol, 2004, vol. 21 (pg. 1525-1533)
Diamond J, Bellwood P. Farmers and their languages: the first expansions, Science, 2003, vol. 300 (pg. 597-603)
Diffloth G. Tentative calibration of time depths in Austroasiatic branches, 2001. Paper presented at the Colloque Perspective sur la Phylogenie des Langues d'Asie Orientales at Perigueux, 30 August 2001.
Diffloth G. More on Dvaravati Old Mon, 2009. Paper presented at the Fourth International Conference on Austroasiatic Linguistics. Mahidol University: Salaya (India)
Ehret C, Keita SOY, Newman P. The origins of Afroasiatic, Science, 2004, vol. 306 pg. 1680; author reply 1680
Excoffier L, Laval G, Schneider S. Arlequin (version 3.0): an integrated software package for population genetics data analysis, Evol Bioinform Online, 2005, vol. 1 (pg. 47-50)
Fujimoto A, Kimura R, Ohashi J, et al. (14 co-authors). A scan for genetic determinants of human hair morphology: EDAR is associated with Asian hair thickness, Hum Mol Genet, 2008, vol. 17 (pg. 835-843)
Fuller D. An agricultural perspective on Dravidian historical linguistics: archaeological crop packages, livestock and Dravidian crop vocabulary. In: Bellwood P, Renfrew C, editors. Examining the farming/language dispersal hypothesis, 2003, Cambridge: The McDonald Institute for Archaeological Research (pg. 191-214)
Fuller D. Non-Human Genetics, Agricultural Origins and Historical Linguistics in South Asia. In: Petraglia M, Allchin B, editors. Vertebrate Paleobiology and Paleoanthropology, 2007, The Netherlands: Springer (pg. 393-443)
Fuller DQ, Qin L, Zheng Y, Zhao Z, Chen X, Hosoya LA, Sun GP. The domestication process and domestication rate in rice: spikelet bases from the Lower Yangtze, Science, 2009, vol. 323 (pg. 1607-1610)
Handley LJL, Manica A, Goudet J, Balloux F. Going the distance: human population genetics in a clinal world, Trends Genet, 2007, vol. 23 (pg. 432-439)
Higham C. Languages and farming dispersals: austroasiatic languages and rice cultivation. In: Bellwood P, Renfrew C, editors. Examining the farming/language dispersal hypothesis, 2003, Cambridge: The McDonald Institute for Archaeological Research (pg. 223-233)
Jin J, Huang W, Gao JP, Yang J, Shi M, Zhu MZ, Luo D, Lin HX. Genetic control of rice plant architecture under domestication, Nat Genet, 2008, vol. 40 (pg. 1365-1369)
Karafet TM, Mendez FL, Meilerman MB, Underhill PA, Zegura SL, Hammer MF. New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree, Genome Res, 2008, vol. 18 (pg. 830-838)
Kivisild T, Rootsi S, Metspalu M, Metspalu E, Parik J, Kaldama K, Usanga E, Mastana S, Papiha SS, Villems R. The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations, Am J Hum Genet, 2003, vol. 72 (pg. 313-332)
Kumar V, Langstieh BT, Madhavi KV, Naidu VM, Singh HP, Biswas S, Thangaraj K, Singh L, Reddy BM. Global patterns in human mitochondrial DNA and Y-chromosome variation caused by spatial instability of the local cultural processes, PLoS Genet, 2006, vol. 2 pg. e53
Kumar V, Reddy ANS, Babu JP, Rao TN, Langstieh BT, Thangaraj K, Reddy AG, Singh L, Reddy BM. Y-chromosome evidence suggests a common paternal heritage of Austro-Asiatic populations, BMC Evol Biol, 2007, vol. 7 pg. 47
Lewis MP. Ethnologue: languages of the world, 2009, Dallas (TX): SIL International [Internet; cited 2010 Sep]. Available from: http://www.ethnologue.com/.
Li JZ, Absher DM, Tang H, et al. (10 co-authors). Worldwide human relationships inferred from genome-wide patterns of variation, Science, 2008, vol. 319 (pg. 1100-1104)
Metspalu M, Kivisild T, Metspalu E, et al. (16 co-authors). Most of the extant mtDNA boundaries in south and southwest Asia were likely shaped during the initial settlement of Eurasia by anatomically modern humans, BMC Genet, 2004, vol. 5 pg. 26
Nei M. Molecular evolutionary genetics, 1987, New York: Columbia University Press
Patterson N, Price AL, Reich D. Population structure and eigenanalysis, PLoS Genet, 2006, vol. 2 pg. e190
Purcell S, Neale B, Todd-Brown K, et al. (11 co-authors). PLINK: a tool set for whole-genome association and population-based linkage analyses, Am J Hum Genet, 2007, vol. 81 (pg. 559-575)
Purugganan MD, Fuller DQ. The nature of selection during plant domestication, Nature, 2009, vol. 457 (pg. 843-848)
Ramachandran S, Deshpande O, Roseman C, Rosenberg N, Feldman M, Cavalli-Sforza L. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa, Proc Natl Acad Sci U S A, 2005, vol. 102 pg. 15942
Reddy BM, Langstieh BT, Kumar V, Nagaraja T, Reddy ANS, Meka A, Reddy AG, Thangaraj K, Singh L. Austro-Asiatic tribes of Northeast India provide hitherto missing genetic link between South and Southeast Asia, PLoS One, 2007, vol. 2 pg. e1141
Richards M, Macaulay V, Hickey E, et al. (36 co-authors). Tracing European founder lineages in the Near Eastern mtDNA pool, Am J Hum Genet, 2000, vol. 67 (pg. 1251-76)
Rootsi S, Zhivotovsky LA, Baldovic M, et al. (20 co-authors). A counter-clockwise northern route of the Y-chromosome haplogroup N from Southeast Asia towards Europe, Eur J Hum Genet, 2007, vol. 15 (pg. 204-211)
Sabeti PC, Varilly P, Fry B, et al. (99 co-authors). Genome-wide detection and characterization of positive selection in human populations, Nature, 2007, vol. 449 (pg. 913-918)
Sahoo S, Singh A, Himabindu G, et al. (11 co-authors). A prehistory of Indian Y chromosomes: evaluating demic diffusion scenarios, Proc Natl Acad Sci U S A, 2006, vol. 103 (pg. 843-848)
Sengupta S, Zhivotovsky LA, King R, et al. (14 co-authors). Polarity and temporality of high-resolution y-chromosome distributions in India identify both indigenous and exogenous expansions and reveal minor genetic influence of Central Asian pastoralists, Am J Hum Genet, 2006, vol. 78 (pg. 202-221)
Shi H, Dong YL, Wen B, Xiao CJ, Underhill PA, Shen PD, Chakraborty R, Jin L, Su B. Y-chromosome evidence of southern origin of the East Asian-specific haplogroup O3-M122, Am J Hum Genet, 2005, vol. 77 (pg. 408-419)
Tan L, Li X, Liu F, et al. (10 co-authors). Control of a key transition from prostrate to erect growth in rice domestication, Nat Genet, 2008, vol. 40 (pg. 1360-1364)
Tang H, Peng J, Wang P, Risch NJ. Estimation of individual admixture: analytical and study design considerations, Genet Epidemiol, 2005, vol. 28 (pg. 289-301)
Thangaraj K, Nandan A, Sharma V, et al. (11 co-authors). Deep rooting in-situ expansion of mtDNA Haplogroup R8 in South Asia, PLoS One, 2009, vol. 4 pg. e6545
Thangaraj K, Sridhar V, Kivisild T, et al. (20 co-authors). Different population histories of the Mundari- and Mon-Khmer-speaking Austro-Asiatic tribes inferred from the mtDNA 9-bp deletion/insertion polymorphism in Indian populations, Hum Genet, 2005, vol. 116 (pg. 507-517)
Weiss KM, Long JC. Non-Darwinian estimation: my ancestors, my genes’ ancestors, Genome Res, 2009, vol. 19 (pg. 703-710)
Wright S. Isolation by distance, Genetics, 1943, vol. 28 (pg. 114-138)
Xue Y, Zhang X, Huang N, et al. (14 co-authors). Population differentiation as an indicator of recent positive selection in humans: an empirical evaluation, Genetics, 2009, vol. 183 (pg. 1065-1077)
Zhivotovsky LA, Underhill PA, Cinnioğlu C, et al. (16 co-authors). The effective mutation rate at Y chromosome short tandem repeats, with application to human population-divergence time, Am J Hum Genet, 2004, vol. 74 (pg. 50-61)

Author notes

Associate editor: Anne Stone