Abstract

South Asia is home to more than 1.5 billion people, almost one-quarter of humanity, representing many diverse ethnic, linguistic and religious groups. Modern humans arrived here soon after their departure from Africa ∼50 000–70 000 years before present (YBP), and several subsequent human migrations and invasions, as well as the unique social structure of the region, have helped shape the pattern of genetic diversity currently observed in these populations. Over the last few decades population geneticists and molecular anthropologists have analyzed DNA variation in indigenous populations from this region in order to catalog their genetic relationships and histories. The emphasis is gradually shifting from the study of population origins to high resolution surveys of DNA variation to address issues of population stratification and genetic susceptibility or resistance to diseases in genome-wide association surveys. We present a historical overview of the genetic studies carried out on populations from this region in order to understand the influence of geographic, linguistic and religious factors on population diversity, and discuss future prospects in light of developments in high throughput genotyping and next generation sequencing technologies.

INTRODUCTION

South Asia encompasses the Indo–Pak sub-continent and includes India, Pakistan, Bangladesh, the Kingdoms of Nepal and Bhutan and the islands of Sri Lanka, the Maldives and the Chagos Archipelago. The land mass is bordered by Iran and Afghanistan on the west, China in the north, and Myanmar on its eastern fringes. The Indian Ocean borders its entire southern coastline. Its population of around 1.5 billion individuals constitutes approximately one-fourth of humanity. Over the centuries this area has witnessed many invasions and migrations, mainly from the West. Genetically, it contains the site of what was once called ‘the grandest genetic experiment ever performed on Man’ and a population that stands out in several surveys of worldwide diversity as a sixth grouping of humanity, comparable in these analyses to the major continental groupings [1, 2]. In this review, we will consider how our understanding of the historical, cultural and environmental factors that have shaped its inhabitants has developed, something of what we have learned, and prospects for future developments.

This now-densely occupied land was encountered by the first populations of modern humans that ventured out of Africa more than 50 000 years ago. It is suggested that they arrived in this part of Asia via a southern coastal route and continued to Southeast Asia. The fossil and archaeological evidence for human settlements in this part of the world prior to 9000 YBP is, unfortunately, sparse although settlements dating much further back are now beginning to emerge, including from Patne in western India and Batadomba-lena in Sri Lanka, dated to between 30 000 and 34 000 YBP [3–5]. There is evidence for indigenous animal and plant domestication in many places in the region, the earliest being found at a Neolithic site in Mehrgarh, in Southwest Pakistan [6]. Subsequent epochal events included the development and decline of the Indus Valley civilization's Harappan culture, the arrival of Indo-European speakers from central or west Asia, linked to the introduction of the caste system in India and the possible displacement of Dravidian speakers to their current location in south central India. More recent historical events have included Alexander's invasion in 327 BC, the Arab invasion and subsequent Muslim conquest and rule of India, followed by the British Raj that lasted until the partition in 1947 [7].

More than 4000 well-defined population groups, including ∼500 tribal and approximately 30 hunter-gatherer groups, reside in this region [8]. The overwhelming majority are endogamous. The endogamous Hindu caste system, and the presence of large consanguineous families in the region, especially among the Muslim populations, provide unique resources for unraveling the genetic basis of disease. As elsewhere, demographic events such as genetic bottlenecks, population expansions and admixture have shaped the genetic diversity that is observed in this region today.

Linguistically the region comprises major groups of Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic speakers, minor groups like the Andamanese, and even language-isolate groups like the Hunza Burusho in northern Pakistan and Nihali in Madhya Pradesh, India (Figure 1). More than 70% of the population speak Indo-European languages. Tibeto-Burman speakers are present in the north and northeast and form a majority in Bhutan. Dravidian languages are prevalent in south India, Sri Lanka and, intriguingly, in Balochistan province of Pakistan. Austro-Asiatic languages are spoken by many Indians in the south and close to the border with Myanmar [9]. The region is home to followers of many religions, the major among them being Islam, Hinduism, Buddhism and Sikhism. The population also includes sizeable Christian, Jewish and Zoroastrian minorities. All have contributed to the genetic and cultural diversity found here.

Figure 1: Linguistic boundaries in South Asia.

During the past few decades molecular geneticists and anthropologists have analyzed DNA variation among human populations in order to catalog their genetic relationships and to glean information about recent human evolution. Several such studies have been carried out in populations from South Asia, especially Pakistan, which is well-represented in the Human Genome Diversity Project (HGDP), and more recently India, in order to study their origins and their susceptibility, or resistance, to disease. We first present an overview of the genetic studies carried out on populations from this region in order to understand the influence of geographic, linguistic and religious patterns on population diversity in this region and go on to discuss future prospects for analyses of genetic variation in this region.

PAST

Throughout the 1970s and 1980s classical serological markers such as human blood groups, leukocyte antigens (HLA), glucose-6-phosphate dehydrogenase and other isozymes were used to study a limited number of populations from South Asia [10]. The advent of polymorphic DNA markers such as restriction fragment length polymorphisms (RFLPs), variable number of tandem repeats (VNTRs), short tandem repeats (STRs) or microsatellites, and single-nucleotide polymorphisms (SNPs), together with the technological breakthroughs that made it relatively easy to genotype large numbers of markers in thousands of samples, permitted extensive analyses of genetic variation in humans. Initially the laborious techniques of RFLP and hybridization were applied to analyze a few hundred samples, but the development of polymerase chain reaction (PCR) amplification made it possible to analyze DNA variation in thousands of samples.

Several initial studies analyzed mitochondrial DNA and the Y chromosome in selected populations from Pakistan and India in order to obtain glimpses into their female and male ancestry, respectively. Sequence differences among individual mitochondrial DNAs separate them into haplotypes and haplogroups that provide a snapshot of population origins from the female perspective [11]. Similarly, the male-specific part of the Y chromosome is passed down from father to son without change, except for the gradual accumulation of mutations, which appear as DNA polymorphisms and provide a male perspective on human evolution. Human Y chromosomes are delineated into distinct haplogroups, defined by unique event or bi-allelic polymorphisms alone, and haplotypes, defined by these polymorphisms in combination with Y-STRs [12]. A number of other studies analyzed autosomal STRs and Alu insertion polymorphisms to gain an overall view of autosomal genetic diversity in the region and to address the issue of population origins in light of the historical context [13–16].

Migrants

Populations from Pakistan and north India share a sizable proportion of variation with European populations, and Central and West Asians have been major contributors to the gene pool of this region, consistent with the arrival of migrants from Northwest Asia [13, 14, 17]. In Pakistan the Karakoram Mountain ranges, which are part of the Himalayan Mountains in the north, have been a major barrier to gene flow from China, although this does not appear to be the case in India, where Sino-Tibetan populations entered through the north-east corridor and Nepal [18].

Most Y-chromosomal lineages found in both tribal and non-tribal populations from South Asia date back to the Paleolithic period. Common haplogroups (markers) include C5 (M356), F* (M89), H1 (M52), H2 (Apt), L1 (M27; M76), R1a1 (M17) and R2 (M124) [19, 20]. All except for R1a1, which constitutes 55% of Y chromosomes in Pakistan, have long coalescent times and appear to be indigenous. Y haplogroup J2* (M172) which is prevalent in Pakistan and north and west India represents a West Asian contribution to the genetic diversity of the sub-continent (although the migrants would of course have carried other lineages as well) [21]. O2* (P31), O3* (M122) and Q* (M242) lineages in India and Pakistan are probably representative of migrants from East and Southeast Asia [21, 22]. Generally haplotype variation between populations is higher, especially between tribes [23, 24].

More than 60% of mitochondrial lineages in this region are represented by haplogroup M, and mitochondrial variation, especially in haplogroup M, is largely local in origin with estimated coalescence times in the Paleolithic [18]. Approximately 30% of Indian mtDNA haplotypes belong to West Eurasian haplogroups. Haplogroup R7 and U2 lineages are specific to India and the variation in their distribution also suggests an ancient origin [25, 26]. It has been argued that the presence of Eurasian and Australasian mitochondrial haplogroups M, N and R and Y haplogroups C* (M130), D* (M174) and F* in South Asia supports the migration of humans out of East Africa via a southern coastal route through the sub-continent [27]. The Andaman and Nicobar Islanders, who comprise six tribal populations and lie on the southern coastal route to Oceania, show a high level of population differentiation, with unique mitochondrial sequences in the Andamanese probably arising by genetic drift in an isolated population. However, the Nicobarese share genetic relatedness with Southeast Asians and are probably more recent migrants in comparison with their neighbors [28]. The presence of haplogroups B5a and F1 near the border between India and Nepal suggests movement across the Himalaya Mountains, and this possibility is also supported by the detection of the Indian haplogroup R6 in the Nepalese [18].

The Indian caste system represents a unique socio-economic hierarchy associated with the Hindu religion that broadly distinguishes upper (Brahmins/priests), middle (Kshatriyas/warriors; Vaishyas/farmers and traders; Sudras/laborers) and lower (Panchama/untouchables) or scheduled castes. It is associated with the spread of Indo-European languages after ∼3500 YBP, but the extent of gene flow from outside remains a matter of debate. Early studies were interpreted to suggest extensive gene flow from West and Central Asia, but recent data, obtained using a larger number of markers, have been used to argue that most DNA variation was indigenous in origin [20, 29]. Male substructure has been observed in the higher (priest and warrior) classes from Jaunpur District in central India, and male gene flow between the castes was low (<1% per generation) there, as expected from the social rules [24]. Some studies show a genetic affinity between the lower castes and tribal populations, and the frequency of West Eurasian haplotypes is proportional to caste rank, the highest frequency being found in the upper castes [13, 30]. Upper caste Indians share 10–20% of variation with European populations [13, 14].

Several populations stood out as being genetically distinct in the initial studies based upon Y chromosomal and autosomal STRs. In the Kalash population from Pakistan, referred to at the beginning of this review as a sixth grouping of humanity, this was most likely due to drift in a geographically and religiously isolated group that has undergone a population bottleneck during their recent migration to their present day settlements in the Hindu Kush Mountain valleys in northern Pakistan from where the individuals were sampled. A larger survey that includes populations from their ancestral homeland in Nuristan, Afghanistan, would provide more insights about their unique genetic structure. They have a major West Eurasian mitochondrial component along with certain population-specific Y lineages (L3a; PK3) with little Y-STR variation and no genetic affiliation with East Asian populations [21, 26, 31].

The origins of the Parsi are well documented and, although only a few thousand now live in Pakistan, there are many in India. They migrated to Gujarat in India after the collapse of the Sassanian empire, and their name, meaning ‘from Iran’, relates to their geographic origin. Their Y chromosomes are closer to those of populations from present-day Iran, but 60% of their maternal gene pool belongs to South Asian haplogroups not found in Iran, highlighting their affinities with local Gujarati women [17, 26].

A number of East African slaves were brought to this region as involuntary migrants, and among such groups are the Makranis, who have physical features typical of African populations. The contribution of sub-Saharan African Y chromosomes to this population was estimated to be ∼12%; combined with the presence of sub-Saharan mitochondrial haplogroups, this represents the genetic legacy of the East African slave trade that existed in this region [17, 26].

Preliminary data on the Nepalese and Bhutanese populations using autosomal and Y-STRs show significant differences in comparison with their geographic neighbors, consistent with genetic drift in small geographically isolated populations, the differentiation being more striking in the Bhutanese [32–34]. More detailed genetic analyses of the samples collected under the ‘Languages and Genes in the Greater Himalaya Region’ Project should provide a clearer picture.

Invaders

A genealogical approach based on the hierarchical use of Y chromosomal markers in ethnic groups from this region has provided some interesting insights into historical events. In particular, analyses of Y lineages in the Hazara population from Pakistan established the presence of a ‘star haplotype’ that could be directly linked to Genghis Khan or his male ancestors and that was spread by his sons throughout Eurasia [35]. An examination of their mitochondrial DNA suggests that they were accompanied by women of East Asian ancestry, and their autosomal DNA also shows genetic relatedness with East Asians [2, 26].

Although several populations from northern Pakistan claimed that they were the descendants of Greek soldiers left behind in this region by Alexander the Great, this was not generally borne out by genetic analyses. Only a small proportion of haplogroup E1b1b1a (M78) Y chromosomes in the Pathan population of Pakistan provided strong evidence of a small Greek contribution [31].

Similarly, the Muslim invasions do not appear to have left a readily detectable genetic imprint in India. Islam seems to have spread by conversions and cultural diffusion: overall the Muslims in India are genetically closer to their non-Muslim geographical neighbors than to other Muslim populations and no correlation exists between genetic variation and religious beliefs [36]. This is also true historically. Muslims constitute more than 400 million of the population of this region and since their advent in the seventh century they have included many ethnicities from Arabia, Turkey, Persia and Central Asia. Some studies [37] have attempted to analyze these Muslim populations along the sectarian divide between the two major Muslim sects (Shias and Sunnis), but genetic analyses are not appropriate for investigating recent political and religious events, and any differences that are observed are likely to relate to the ethnic or geographic origins of the source populations.

Linguistics

Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic languages are spoken in this region and overall genetic relationships, as ascertained by STRs, SNPs and other genetic markers, are dictated primarily by geographic proximity rather than linguistic origin [9, 38].

Indo-European languages form the predominant language group and the genetic relatedness between the European and South Asian populations indicates that they may well have shared a common language superfamily, Dene-Caucasian. The Indo-European family may have spread to South Asia 6000–10 000 YBP, replacing the languages spoken earlier almost everywhere [39, 40]. The Y haplogroup R1 (M173) is often referred to as an Indo-European marker and its associated haplogroup R1a1 is present at high frequency in many regions where Indo-European speakers live. The worldwide distribution of this haplogroup indicates frequency peaks in Eastern Europe and West and South Asia, which fits in with historical records of nomadic settlements in Europe and India. However, its presence in 15% of Dravidian speakers in India argues against a simple correlation. Although several new SNPs that refine this branch of the tree have recently been identified, they are restricted to a few individuals in the same population or show genetic exchange across the Gulf of Oman [21, Underhill et al., unpublished data].

An elite dominance model of the Indo-European speakers partly explains the genetic similarities observed between the Dravidian and Indo-European groups and the seclusion of Dravidians in southern India and parts of Sri Lanka, but it does not explain the enigma of the Brahui. This Dravidian-speaking population resides in the Balochistan province in southwestern Pakistan and is surrounded on all sides by Indo-European speakers. Quintana-Murci and co-workers argued that they arrived in South Asia from southwestern Iran with the expansion of Dravidian-speaking farmers, although this was contested by later studies using extensive Y markers [20, 29].

The Austro-Asiatic languages may be the most ancient in the region, and preliminary analyses of mitochondrial DNA and a limited number of Y-chromosomal and autosomal markers had suggested that the Austro-Asiatic tribes appear to be descendants of earlier settlers and share genetic affinity with Tibeto-Burman populations [14]. Subsequent analyses using a higher resolution of Y haplotypes in a larger sample of the three major Austro-Asiatic groups of India (Mundari, Khasi-Khmuic and Mon-Khmer) demonstrated a strong paternal genetic link between these populations and those from Southeast Asia. The authors estimated that the protohaplogroup O2a originated in the Indian Austro-Asiatic populations ∼65 000 years ago and entered Southeast Asia via the Northeast Indian corridor. Mitochondrial DNA varied between Austro-Asiatic groups from Southeast Asia and India, reflecting a difference in the history of the sexes [25]. Since some estimates of the MRCA of all extant Y chromosomes are as recent as ∼59 000 YBP, such time estimates should be treated with caution [41].

Several Tibeto-Burman groups reside in north and north east India, and in Pakistan the Balti population from the Karakoram Mountains also speaks a Tibeto-Burman language. Y-chromosomal analyses have been carried out on only a limited number of these Balti speakers, and they are not noticeably different from their neighbors in northern Pakistan [17].

The Hunza Burusho were of particular genetic interest because their language, Burushaski, is one of the few remaining language isolates in the world [9]. However, they are genetically close to their geographic neighbors in Pakistan [38]. Their isolation in the Karakoram Mountains habitat may have preserved their language but any differences between them and their neighbors, who have acquired new languages, have been greatly diluted by genetic exchange. Analyses of the language isolate Nihali speakers in India will show whether they exhibit similar genetic affinity with their geographic neighbors.

Disease Genetics

Families from South Asia have been invaluable in understanding the genetic basis of several Mendelian disorders. The high rate of consanguineous marriages, large family sizes and contemporary inbreeding among tribes, clans and ethnicities make them suitable for linkage analyses [42, 43]. The genetic basis of several single gene disorders leading to syndromic and non-syndromic blindness, deafness, thalassemias, and skeletal, hair, skin and nail disorders has been unraveled in families and populations from this region [44–49]. Several of these mutations are family- or population-specific [44, 50]. The immediate benefits of these analyses include genetic testing to exclude such disease in the unborn child. However, the real challenge is to translate these findings into public health education programs that will benefit these families and communities.

DNA-based genotyping has determined HLA frequencies in ethnic groups from this region at higher allelic resolution than was achieved by serological methods, and studies of their association with diseases like malaria, tuberculosis, leprosy and rheumatic heart fever have identified several high risk alleles [51–53]. Genetic association studies using single, or a few, genetic variants have also been carried out, but their associations have not been replicated across populations, possibly due to ascertainment bias, choice of markers, insufficient statistical power, population stratification, or differences in linkage-disequilibrium patterns in patients and controls.

PRESENT

As a consequence of advances in genotyping technologies, the focus of current studies is shifting from investigating population origins using small numbers of loci to analyzing large numbers of SNPs and structural and copy number variants (CNVs) to address issues of population sub-structure and group membership. These will have practical applications in the design of disease association studies, in rationalizing the use of medicines tailored to an individual's genetic make-up, and in DNA-based forensic analyses. The discovery of a single high risk variant in the cardiac myosin binding protein C gene (MYBPC3) that is restricted to 4% of South Asians and predisposes strongly to heart failure in later life is testament to both the promise of such studies and the distinct genetic features of the region [50].

The wide availability of DNA samples of ethnic populations from Pakistan through the Foundation Jean Dausset's HGDP-Centre d’Etude du Polymorphisme Humain (CEPH) Human Genome Diversity Cell Line Panel has permitted extensive analyses of STRs, SNPs and CNVs in these populations [2, 54–57]. Although some similar work has subsequently been carried out on tribal and caste populations from India and on Indian expatriates settled in the USA, the lack of readily available cell lines or DNA from indigenous Indian populations has greatly hindered large-scale studies of other parts of South Asia [58, 59].

Using samples in the HGDP panel, Conrad et al. demonstrated that HapMap populations capture common haplotypes well in non-HapMap South Asian populations and that HapMap data could be used for imputing missing genotypes in these populations [60]. South Asians were expected to have intermediate levels of linkage disequilibrium (LD) between Europeans and East Asians, two populations that were part of the first phase of the HapMap Project. Overall, Indian and Pakistani populations are tagged most effectively by European populations but optimal HapMap mixtures increased ‘the fraction of polymorphic non-tag SNPs in a target population that are in LD with at least one tag SNP above a specified cut off point’ [61].
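The tagging measure quoted above can be illustrated with a minimal Python sketch. It is not the method of the cited studies: the function names, the toy haplotypes and the cut off of 0.8 are assumptions made for illustration, and r² is used as the LD statistic.

```python
def r_squared(a, b):
    """Squared allelic correlation (r^2) between two biallelic SNPs.

    a, b: equal-length lists of 0/1 alleles, one entry per phased haplotype.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("allele lists must be non-empty and of equal length")
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                         # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # If either SNP is monomorphic r^2 is undefined; this sketch returns 0.
    return 0.0 if denom == 0 else d * d / denom


def tagged_fraction(haplotypes, tag_snps, cutoff=0.8):
    """Fraction of polymorphic non-tag SNPs with r^2 >= cutoff to some tag SNP.

    haplotypes: dict mapping a SNP id to its list of 0/1 alleles.
    tag_snps:   set of SNP ids chosen as tags (hypothetical choice).
    """
    targets = [
        snp for snp, alleles in haplotypes.items()
        if snp not in tag_snps and 0 < sum(alleles) < len(alleles)
    ]
    if not targets:
        return 0.0
    tagged = sum(
        1 for snp in targets
        if any(r_squared(haplotypes[snp], haplotypes[t]) >= cutoff for t in tag_snps)
    )
    return tagged / len(targets)


# Toy data: eight phased haplotypes at four hypothetical SNPs.
toy = {
    "snpA": [0, 0, 1, 1, 0, 1, 0, 1],  # used as the tag SNP
    "snpB": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to snpA
    "snpC": [0, 1, 0, 1, 0, 1, 0, 1],  # weakly correlated with snpA
    "snpD": [0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, so not counted
}
print(r_squared(toy["snpA"], toy["snpC"]))   # 0.25
print(tagged_fraction(toy, {"snpA"}))        # 0.5
```

In this toy panel one of the two polymorphic non-tag SNPs is captured by the tag, so the fraction is 0.5; a tag set drawn from a better-matched reference population would be expected to raise it.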

Analyses of populations from Pakistan in the HGDP Panel using the Illumina HumanHap 550K and 650K bead chips and expatriate populations from India, Pakistan and Sri Lanka using Affymetrix GeneChip Mapping Array 500K gave broadly similar results to an earlier study that had employed STR variation in this panel [56, 57, 62]. The South Asians were separated as a distinct cluster when regional identity was inferred for six groups using 650 000 SNPs, a better resolution than was obtained through analyses of autosomal STRs or the 550K chip in these populations, which could not clearly distinguish between populations from South Asia, Europe and West Asia. As expected, the Hazara shared ancestry with East Asians and there was a small East Asian contribution to the gene pool of Burusho, Pathan and Sindhi populations that was not apparent with use of STR datasets [2, 57]. Although the Kalash individuals reportedly harbored more CNVs than average, this was not replicated in the most recent survey of CNV in human populations using the Illumina 650Y arrays [56, 63].

Examination of variability in Indian expatriates in the USA using 471 insertion/deletion polymorphisms and autosomal STRs revealed low levels of genetic divergence, but these samples, like the Indian Gujarati population from Texas that is included in the HapMap 3 sample collection, are hardly representative of the diversity of extant Indian populations [59]. This problem is being addressed by the Indian Genome Variation Consortium, which recently analyzed more than 400 SNPs in 1871 indigenous samples representing numerous geographical, linguistic and religious groups from India [58]. They observed relatively low genetic differentiation overall, with a mean Fst value of 0.03, although this value is still higher than in Europe (∼0.01), but these results could also reflect marker or sampling bias. The maximum genetic differences were observed among tribal populations speaking Indo-European languages, and certain populations and isolated ethnic groups clustered on the basis of ethnicity or language, suggesting that care should be taken in the selection of cases and controls while designing Genome Wide Association Studies (GWAS) in these populations.
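To give a feel for what an Fst of 0.03 means, the following Python sketch computes Wright's Fst for biallelic SNPs from subpopulation allele frequencies as (H_T − H_S)/H_T. The estimator used by the consortium is not stated here, so this simple heterozygosity-based form, the equal weighting of subpopulations and the function names are all assumptions.

```python
def fst_biallelic(freqs):
    """Wright's Fst for one biallelic SNP.

    freqs: frequency of one allele in each subpopulation.
    Subpopulations are weighted equally (an assumption of this sketch).
    """
    if not freqs:
        raise ValueError("need at least one subpopulation frequency")
    k = len(freqs)
    p_bar = sum(freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)                  # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0                                 # SNP fixed everywhere: no differentiation
    h_s = sum(2 * p * (1 - p) for p in freqs) / k  # mean within-subpopulation heterozygosity
    return (h_t - h_s) / h_t


def mean_fst(snp_freqs):
    """Unweighted mean of per-SNP Fst values (a 'mean of ratios')."""
    values = [fst_biallelic(f) for f in snp_freqs]
    return sum(values) / len(values)


# Two hypothetical populations whose allele frequencies differ by 0.2
# at a SNP with a pooled frequency of 0.5.
print(fst_biallelic([0.4, 0.6]))
```

With frequencies of 0.4 and 0.6 the value is roughly 0.04, so a genome-wide mean near 0.03 corresponds to fairly modest frequency differences at a typical common SNP. Published surveys often use sample-size-corrected estimators, which this sketch omits.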

FUTURE

Following the comprehensive documentation of genetic stratification within South Asia, a next logical step would be association mapping to identify genetic variants associated with complex multi-genic diseases and with resistance to microbial pathogens such as the malarial parasite, viral hepatitis, HIV, or Mycobacterium tuberculosis and Mycobacterium leprae that afflict a large section of South Asians. The increase in non-infectious diseases such as cardiovascular disease, type 2 diabetes mellitus and metabolic disorder in this region also places an enormous burden on these developing economies and their already stretched health care systems.

The development of dense SNP chips such as Illumina's Human1M-Duo BeadChip and Affymetrix v6.0 will enable an even denser and more uniform genomic coverage of both SNPs and CNVs to be obtained. GWAS will benefit from the identification of haplotypes that better describe association than single SNPs. The recent success of the Wellcome Trust Case Control Consortium (WTCCC) genetic association studies in identifying SNPs associated with cardiovascular disease, diabetes and other disorders, which are being replicated by the National Human Genome Research Institute (NHGRI) and other research centres, provides an ideal model for conducting collaborative large scale disease association studies in South Asian populations with a common set of controls [64, 65]. Besides raising ethical concerns and needing strict quality control, these studies are expensive and require very large sample sizes. The presence of large South Asian diasporas in Europe and the USA should also facilitate these collaborative studies. The Pakistan Risk of Myocardial Infarction Study (PROMIS), being carried out as part of an international collaboration, plans to analyze ∼20 000 myocardial infarction patients and appropriately matched controls and is one of the largest such studies on populations from South Asia, providing one model for future studies [66]. Like many studies from Pakistan, it includes Mohajirs, or Urdu ethnicities, which encompass diverse ethnic groups from India that migrated to Pakistan after independence and whose only commonality is the language (Urdu) that they speak. In such studies careful consideration must be given to ethnicity and geographic origins to ensure that unsuspected population structure does not confound genetic analysis.
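How unsuspected population structure can confound an association test is easy to show with made-up numbers. In the Python sketch below, all allele counts are invented for illustration and the Mantel-Haenszel estimator stands in for whatever stratification correction a real study would use. Two hypothetical groups differ in both allele frequency and case proportion; the allele has no effect within either group, yet pooling them suggests one.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table of allele counts.

    a: case alternate alleles,    b: case reference alleles,
    c: control alternate alleles, d: control reference alleles.
    Zero cells are not handled in this sketch.
    """
    return (a * d) / (b * c)


def pooled_odds_ratio(strata):
    """Odds ratio after summing the tables, i.e. ignoring population structure."""
    a, b, c, d = (sum(col) for col in zip(*strata))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across strata (populations)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Invented counts: (case_alt, case_ref, control_alt, control_ref) per group.
strata = [
    (160, 640, 40, 160),   # group 1: allele frequency 0.2, mostly cases
    (120, 80, 480, 320),   # group 2: allele frequency 0.6, mostly controls
]
for table in strata:
    print(odds_ratio(*table))                  # 1.0 within each group
print(pooled_odds_ratio(strata))               # well below 1: a spurious signal
print(mantel_haenszel_odds_ratio(strata))      # close to 1 once stratified
```

The pooled odds ratio of about 0.36 arises only because the group with the lower allele frequency contributes most of the cases, which is the situation that careful matching of cases and controls by ethnicity and geographic origin is meant to avoid.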

An increasing number of studies are also focusing on the role that selection may have played in recent human evolution and studying the constraints imposed on genetic variation by the local environment of pathogens, nutrients, toxins and climate. A majority of genes that are under selection due to pathogens are involved in host invasion (e.g. FY), innate (CASP12, CD40, TLR) or adaptive (HLA, interleukins) immunity [67, 68]. Others include those that aid adaptation to climate, exposure to toxins and dietary nutrients [68]. The genetic variant responsible for lactose tolerance (LCT –13910 T allele), which has been under positive selection in response to dietary milk in European populations, appears to have a recent origin in South Asia, being frequent in pastoral groups and present at low frequency in non-pastorals in Pakistan [69]. Analyses of variation in drug metabolizing enzymes such as alcohol dehydrogenase and cytochrome P450 enzymes are of immediate clinical significance.

The technological advances in DNA sequencing will soon make it economically feasible to resequence entire genomes of chosen individuals [70]. The current 1000 Genomes Project lacks South Asian samples, but the HapMap3 Gujarati in Houston sample meets the necessary international criteria for consent and availability and, in the absence of more suitable indigenous samples, may provide the first insights into South Asian diversity from whole-genome resequencing. We hope that indigenous South Asian samples will soon be available for resequencing as well.

The complexity of human genetic variation in South Asians and its role in gene regulation, expression and influencing disease and non-disease phenotypes in diverse populations from this region has yet to be fully unraveled. The challenge is to obtain clinically significant results that can translate into benefits for the general population leading eventually to early diagnosis, prevention and therapeutic intervention. Only then can this quarter of humanity truly benefit from the promise of the ‘genetic revolution’.

Key Points

  • South Asia is home to a quarter of humanity and harbors many diverse ethnicities, linguistic and religious groups.

  • We examine the influence of geographic, linguistic and religious factors on genetic variation in this region and discuss the importance of population stratification and its implication for genome-wide association surveys of South Asian populations.

  • Future prospects are discussed in light of developments in high throughput genotyping and next generation sequencing technologies.

FUNDING

The Wellcome Trust.

Acknowledgements

The authors would like to thank Daniel MacArthur and the reviewers for their helpful comments and Ambareen for the illustration.

References

Dobzhansky T. Genetic Diversity and Human Equality, 1973. New York: Basic Books.
Rosenberg NA, Pritchard JK, Weber JL, et al. Genetic structure of human populations. Science, 2002, vol. 298 (pg. 2381-5).
Petraglia MD, Allchin B. Human evolution and culture change in the Indian subcontinent. In: Petraglia MD, Allchin B (eds). The Evolution and History of Human Populations in South Asia: Inter-disciplinary Studies in Archaeology, Biological Anthropology, Linguistics and Genetics, 2007. Dordrecht: Springer (pg. 1-22).
Mellars PA. Going East: new genetic and archaeological perspectives on the modern human colonization of Eurasia. Science, 2006, vol. 313 (pg. 796-800).
Deraniyagala SU. The Prehistory of Sri Lanka: An Ecological Perspective, 1992. Colombo: Department of Archaeological Survey, Government of Sri Lanka.
Jarrige JF. Mehrgarh: its place in the development of ancient cultures in Pakistan. In: Jansen M, Mulloy M, Urban G (eds). Forgotten Cities on the Indus. Early Civilization in Pakistan From the 8th to the 2nd Millennia BC, 1991. Mainz: Verlag Philipp von Zabern.
Wolpert S. A New History of India, 1997. Oxford: Oxford University Press.
Majumder PP. People of India: biological diversity and affinities. Evol Anthropol, 1998, vol. 6 (pg. 100-10).
Gordon RG. Ethnologue: Languages of the World, 2005. 15th edn. Dallas: Summer Institute of Linguistics International.
Cavalli-Sforza LL, Menozzi P, Piazza A. The History and Geography of Human Genes, 1994. Princeton: Princeton University Press.
Underhill PA, Kivisild T. Use of Y chromosome and mitochondrial DNA population structure in tracing human migrations. Annu Rev Genet, 2007, vol. 41 (pg. 539-642).
Jobling MA, Tyler-Smith C. The human Y chromosome: an evolutionary marker comes of age. Nat Rev Genet, 2003, vol. 4 (pg. 598-612).
Bamshad M, Kivisild T, Watkins WS, et al. Genetic evidence on the origins of Indian caste populations. Genome Res, 2001, vol. 11 (pg. 994-1004).
Basu A, Mukherjee N, Roy S, et al. Ethnic India: a genomic view, with special reference to peopling and structure. Genome Res, 2003, vol. 13 (pg. 2277-90).
Mansoor A, Mazhar K, Khaliq S, et al. Investigation of the Greek ancestry of populations from northern Pakistan. Hum Genet, 2004, vol. 114 (pg. 484-90).
Watkins
WS
Thara
R
Mowry
BJ
, et al.  . 
Genetic variation in South Indian castes: evidence from Y-chromosome, mitochondrial, and autosomal polymorphisms
BMC Genet
 , 
2008
, vol. 
9
 pg. 
86
 
Qamar
R
Ayub
Q
Mohyuddin
A
, et al.  . 
Y chromosomal DNA variation in Pakistan
Am J Hum Genet
 , 
2002
, vol. 
70
 (pg. 
1007
-
24
)
Thangaraj
K
Chaubey
G
Kivisild
T
, et al.  . 
Maternal footprints of southeast Asians in North India
Hum Hered
 , 
2008
, vol. 
66
 (pg. 
1
-
9
)
Karafet
TM
Mendez
FL
Meilerman
MB
, et al.  . 
New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree
Genome Res
 , 
2008
, vol. 
18
 (pg. 
830
-
8
)
Sengupta
S
Zhivotovsky
LA
King
R
, et al.  . 
Polarity and temporality of high resolution Y-chromosome distributions in India identify both indigenous and exogenous expansions and reveal minor genetic influence of Central Asian pastoralists
Am J Hum Genet
 , 
2006
, vol. 
78
 (pg. 
202
-
20
)
Mohyuddin
A
Ayub
Q
Underhill
PA
, et al.  . 
Detection of novel Y SNPs provides further insights into Y chromosomal variation in Pakistan
J Hum Genet
 , 
2006
, vol. 
51
 (pg. 
375
-
8
)
Carvalho-Silva
D
Zerjal
T
Tyler-Smith
C
Ancient Indian roots?
J Biosci
 , 
2006
, vol. 
31
 (pg. 
1
-
2
)
Carvalho-Silva
D
Tyler Smith
C
The grandest genetic experiment ever performed on man? A Y-chromosomal perspective on genetic variation in India
Int J Hum Genet
 , 
2008
, vol. 
8
 (pg. 
21
-
9
)
Zerjal
T
Pandya
A
Thangaraj
K
, et al.  . 
Y-chromosomal insights into the genetic impact of the caste system in India
Hum Genet
 , 
2007
, vol. 
121
 (pg. 
137
-
44
)
Chaubey
G
Karmin
M
Metspalu
E
, et al.  . 
Phylogeography of mtDNA haplogroup R7 in the Indian peninsula
BMC Evol Biol
 , 
2008
, vol. 
8
 pg. 
227
 
Quintana-Murci
L
Chaix
R
Wells
SR
, et al.  . 
Where West meets East: the complex mtDNA landscape of the Southwest and Central Asian corridor
Am J Hum Genet
 , 
2004
, vol. 
74
 (pg. 
827
-
45
)
Kivisild
T
Rootsi
S
Metspalu
M
, et al.  . 
The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations
Am J Hum Genet
 , 
2003
, vol. 
72
 (pg. 
313
-
332
)
Thangaraj
K
Chaubey
G
Kivisild
T
, et al.  . 
Reconstructing the origin of Andaman Islanders
Science
 , 
2005
, vol. 
13
 pg. 
996
 
Quintana-Murci
L
Krausz
C
Zerjal
T
, et al.  . 
Y-chromosome lineages trace diffusion of people and languages in southwestern Asia
Am J Hum Genet
 , 
2001
, vol. 
68
 (pg. 
537
-
42
)
Thanseem
I
Thangaraj
K
Chaubey
G
, et al.  . 
Genetic affinities among the lower castes and tribal groups of India: inference from Y chromosome and mitochondrial DNA
BMC Genetics
 , 
2006
, vol. 
7
 pg. 
42
 
Firasat
S
Khaliq
S
Mohyuddin
A
, et al.  . 
Y-chromosomal evidence for a limited Greek contribution to the Pathan population of Pakistan
Eur J Hum Genet
 , 
2007
, vol. 
15
 (pg. 
121
-
6
)
Kraaijenbrink
T
van Driem
GL
Opgenort
JR
, et al.  . 
Allele frequency distribution for 21 autosomal STR loci in Nepal
Forensic Sci Int
 , 
2007
, vol. 
168
 (pg. 
227
-
31
)
Tshering of Gaselô K
Kraaijenbrink
T
van Driem
GL
, et al.  . 
Allele frequency distribution for 21 autosomal STR loci in Bhutan
Forensic Sci Int
 , 
2007
, vol. 
170
 (pg. 
68
-
72
)
Parkin
EJ
Kraayenbrink
T
van Driem
GL
, et al.  . 
26-Locus Y-STR typing in a Bhutanese population sample
Forensic Sci Int
 , 
2006
, vol. 
161
 (pg. 
1
-
7
)
Zerjal
T
Xue
Y
Bertorelle
G
, et al.  . 
The genetic legacy of the Mongols
Am J Hum Genet
 , 
2003
, vol. 
72
 (pg. 
717
-
21
)
Gutala
R
Carvalho-Silva
DR
Jin
L
, et al.  . 
A shared Y-chromosomal heritage between Muslims and Hindus in India
Hum Genet
 , 
2006
, vol. 
120
 (pg. 
543
-
51
)
Terroros
MC
Rowold
D
Luis
JR
, et al.  . 
North Indian Muslims: enclaves of foreign DNA or Hindu converts?
Am J Physical Anthropol
 , 
2007
, vol. 
133
 (pg. 
1004
-
12
)
Ayub
Q
Mansoor
A
Ismail
M
, et al.  . 
Reconstruction of human evolutionary tree using polymorphic autosomal microsatellites
Am J Physical Anthropol
 , 
2003
, vol. 
122
 (pg. 
259
-
68
)
Balter
M
Search for the Indo-Europeans
Science
 , 
2004
, vol. 
303
 (pg. 
1323
-
6
)
Diamond
J
Bellwood
P
Farmers and their languages: the first expansions
Science
 , 
2003
, vol. 
300
 (pg. 
597
-
603
)
Thomson
R
Pritchard
JK
Shen
P
, et al.  . 
Recent common ancestry of human Y chromosomes: evidence from DNA sequence data
Proc Natl Acad Sci USA
 , 
2000
, vol. 
97
 (pg. 
7360
-
5
)
Hagen
H-E
Carlstedt-Duke
J
Building global networks for human diseases: genes and populations
Nat Med
 , 
2004
, vol. 
10
 (pg. 
665
-
7
)
Bittles
AH
Endogamy, consanguinity and community genetics
J Genet
 , 
2002
, vol. 
81
 (pg. 
91
-
8
)
Khaliq
S
Ioseliani
OR
Consanguineous Marriages and the Genetics of Eye Disorders in Pakistani Families
New Developments in Eye Research
 , 
2006
New York
Nova Science Publishers Inc.
(pg. 
241
-
67
)
Ahmad
W
Faiyaz ul Haque
M
Brancolini
V
, et al.  . 
Alopecia universalis associated with a mutation in the human hairless gene
Science
 , 
1998
, vol. 
279
 (pg. 
720
-
4
)
Ahmed
ZM
Riazuddin
S
Riazuddin
S
, et al.  . 
The molecular genetics of Usher syndrome
Clin Genet
 , 
2003
, vol. 
63
 (pg. 
431
-
44
)
Kumaramanickavel
G
Joseph
B
Vidhya
A
, et al.  . 
Consanguinity and ocular genetic diseases in South India: analysis of a five-year study
Community Genet
 , 
2002
, vol. 
5
 (pg. 
182
-
5
)
Dhavendra
K
Genetic Disorders of the Indian Subcontinent
 , 
2004
Dordrecht
Kluwer Academic Publishers
Abid
A
Ismail
M
Mehdi
SQ
, et al.  . 
Identification of novel mutations in SEMA4A gene associated with retinal degenerative diseases
J Med Genet
 , 
2006
, vol. 
43
 (pg. 
378
-
81
)
Dhandapany
PS
Sadayappan
S
Xue
Y
, et al.  . 
A common MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia
Nat Genet
 , 
2009
, vol. 
41
 (pg. 
187
-
91
)
Mohyuddin
A
Ayub
Q
Khaliq
S
, et al.  . 
HLA polymorphism in six ethnic groups from Pakistan
Tissue Antigens
 , 
2002
, vol. 
59
 (pg. 
492
-
501
)
Raja
A
Immunology of tuberculosis
Indian J Med Res
 , 
2004
, vol. 
120
 (pg. 
213
-
32
)
Rehman
S
Akhtar
N
Ahmad
W
, et al.  . 
Human leukocyte antigen (HLA) class II association with rheumatic heart disease in Pakistan
J Heart Valve Disease
 , 
2007
, vol. 
16
 (pg. 
300
-
4
)
Cann
HM
de Toma
C
Cazes
L
, et al.  . 
A human genome diversity cell line panel
Science
 , 
2002
, vol. 
296
 (pg. 
261
-
2
)
Rosenberg
NA
Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives
Ann Hum Genet
 , 
2006
, vol. 
70
 (pg. 
841
-
7
)
Jakobsson
M
Scholz
SW
Scheet
P
, et al.  . 
Genotype, haplotype and copy-number variation in worldwide human populations
Nature
 , 
2008
, vol. 
451
 (pg. 
998
-
1003
)
Li
JZ
Absher
DM
Tang
H
, et al.  . 
Worldwide human relationships inferred from genome-wide patterns of variation
Science
 , 
2008
, vol. 
319
 (pg. 
1100
-
4
)
Indian Genome Variation Consortium
Genetic landscape of the people of India: a canvas for disease gene exploration
J Genet
 , 
2008
, vol. 
87
 (pg. 
3
-
20
)
Rosenberg
NA
Mahajan
S
Gonzalez-Quevedo
C
, et al.  . 
Low levels of genetic divergence across geographically and linguistically diverse populations from India
PLoS Genet
 , 
2006
, vol. 
2
 pg. 
e215
 
Conrad
DF
Jakobsson
M
Coop
G
, et al.  . 
A worldwide survey of haplotype variation and linkage disequilibrium in the human genome
Nat Genet
 , 
2006
, vol. 
38
 (pg. 
1251
-
60
)
Pemberton
TJ
Jakobsson
M
Conrad
DF
, et al.  . 
Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India
Ann Hum Genet
 , 
2008
, vol. 
72
 (pg. 
535
-
46
)
Auton
A
Bryc
K
Boyko
A
, et al.  . 
Global distribution of genomic diversity underscores rich complex history of continental human populations
Genome Res
 , 
2009
, vol. 
19
 (pg. 
795
-
803
)
Itsara
A
Cooper
GM
Baker
C
, et al.  . 
Population analysis of large copy number variants and hotspots of human genetic disease
Am J Hum Genet
 , 
2009
, vol. 
84
 (pg. 
148
-
61
)
The Wellcome Trust Case Control Consortium
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
Nature
 , 
2007
, vol. 
447
 (pg. 
661
-
78
)
Manolio
TA
Collaborative genome-wide association studies of diverse diseases:programs of the NHGRI's office of population genomics
Pharmacogenomics
 , 
2009
, vol. 
10
 (pg. 
235
-
41
)
Saleheen
D
Zaidi
M
Rasheed
A
, et al.  . 
The Pakistan risk of myocardial infarction study: a resource for the study of genetic, lifestyle and other determinants of myocardial infarction in South Asia
Eur J Epidemiol
 , 
2009
, vol. 
24
 (pg. 
329
-
338
)
Cooke
GS
Hill
AV
Genetics of susceptibility to human infectious disease
Nat Rev Genet
 , 
2001
, vol. 
2
 (pg. 
967
-
77
)
Hancock
AM
Di Rienzo
A
Detecting the genetic signature of natural selection in human populations: models, methods, and data
Annu Rev Anthropol
 , 
2008
, vol. 
37
 (pg. 
197
-
217
)
Enattah
NS
Jensen
TG
Nielsen
M
, et al.  . 
Independent introduction of two lactase-persistence alleles into human populations reflects different history of adaptation to milk culture
Am J Hum Genet
 , 
2008
, vol. 
82
 (pg. 
57
-
72
)
Mardis
ER
Next-generation DNA sequencing methods
Annu Rev Genomics Hum Genet
 , 
2008
, vol. 
9
 (pg. 
387
-
402
)