Genome-Wide Analysis in Brazilians Reveals Highly Differentiated Native American Genome Regions

Despite its population, geographic size, and emerging economic importance, disproportionately little genome-scale research exists into genetic factors that predispose Brazilians to disease, or the population genetics of risk. After identification of suitable proxy populations and careful analysis of tri-continental admixture in 1,538 North-Eastern Brazilians to estimate individual ancestry and ancestral allele frequencies, we computed 400,000 genome-wide locus-specific branch length (LSBL) Fst statistics of Brazilian Amerindian ancestry compared to European and African; and a similar set of differentiation statistics for their Amerindian component compared with the closest Asian 1000 Genomes population (surprisingly, Bengalis in Bangladesh). After ranking SNPs by these statistics, we identified the top 10 highly differentiated SNPs in five genome regions in the LSBL tests of Brazilian Amerindian ancestry compared to European and African; and the top 10 SNPs in eight regions comparing their Amerindian component to the closest Asian 1000 Genomes population. We found SNPs within or proximal to the genes CIITA (rs6498115), SMC6 (rs1834619), and KLHL29 (rs2288697) were most differentiated in the Amerindian-specific branch, while SNPs in the genes ADAMTS9 (rs7631391), DOCK2 (rs77594147), SLC28A1 (rs28649017), ARHGAP5 (rs7151991), and CIITA (rs45601437) were most highly differentiated in the Asian comparison. These genes are known to influence immune function, metabolic and anthropometry traits, and embryonic development. These analyses have identified candidate genes for selection within Amerindian ancestry, and by comparison of the two analyses, those for which the differentiation may have arisen during the migration from Asia to the Americas.


Introduction
In the last three decades, Brazil has undergone a rapid transition from low-middle income to a burgeoning high-income country, and this economic development has led to a public health paradox: the diseases of poverty such as infectious diseases and malnutrition, although in decline, still exist (Paim et al. 2011; Victora et al. 2011), but now co-exist with an increasing incidence of "western" lifestyle metabolic diseases (de Carvalho Vidigal et al. 2013). Genetic analysis of susceptibility is an important tool to understand diseases of both poverty and wealth, but to apply genetics to Brazilian sub-populations, an understanding of the structure of the underlying genetic admixture is needed.
Twenty-first century Brazil has emerged as a genetic melting pot of races and ethnic groups reflecting its successive history of conquest, slavery, and migration. Three trans-continental population groups, Europeans, Africans, and native American Indians (Amerindians), substantially contribute to the variable ancestry within Brazil's population. Although exact estimates differ, there were at least three million indigenous Amerindians in Brazil when Portuguese explorers first landed in Bahia in 1500, but their numbers dropped precipitously during the next centuries through enslavement and forced labor, conflict with the invading colonists, and European-borne disease epidemics. Their numbers have rebounded somewhat since the mid-20th century, to a 2010 census enumeration of 820 thousand (Ricardo and Ricardo 2011; Fundação Nacional do Indio, http://www.funai.gov.br; last accessed October 13, 2015). In the period after initial discovery by Europeans, migration gradually increased so that by 1760, approximately 700 thousand Europeans had migrated to Brazil (Venâncio 2000). At the same time, to supplement and ultimately largely replace indigenous Amerindians as a workforce, Africans were imported as slave labor to work on colonial plantations, mostly for sugar production, but also for wood and other agricultural products (Schwartz 1978). African slave trafficking brought to Brazil an estimated total of 4.9 million slaves, 40% of all Africans shipped to the Americas (Bergad 2007).
In comparison to its population size and economic status, relatively little genetic work has been performed on Brazilian populations using genome-wide genetic panels for analysis (Giolo et al. 2012; Kuhn et al. 2012; Kehdy et al. 2015; Lima-Costa et al. 2015). Much of the previous research in Brazil, even during the current genomic era, has used genetic panels with limited numbers of SNPs leading to regionally biased estimates of population genetic parameters with high variance (Cardena et al. 2013; Manta et al. 2013; Vieira et al. 2013; Durso et al. 2014; Ruiz-Linares et al. 2014; Magalhaes da Silva et al. 2015). The majority of the genetics research has been aimed at estimating admixture and correlation with self-reported or socially perceived race groups, with little focus on genetic variants influencing disease risk (Guindalini, Colugnati, et al. 2010; Guindalini, Lee, et al. 2010; Suarez-Kurtz 2010; Suarez-Kurtz et al. 2012). As a necessary precursor to our work to identify genetic factors underlying infant growth and development, we undertook a genome-wide and locus-specific analysis of admixture in our study populations recruited in North-Eastern Brazil, centered on the capital city of Ceará state, Fortaleza, but also including participants from neighboring Paraíba, Pernambuco, and Piauí states. We developed improved estimates of locus-specific admixture and allele frequencies, and used these to identify highly differentiated outlier SNPs in the Amerindian component of the Brazilian ancestry as candidates for SNPs and genes under selection pressure.

Results
We genotyped 2,010 DNA samples on the Affymetrix Axiom LAT-1 Latin American Array. These samples were drawn from six studies in the North-Eastern region of Brazil, centered on the city of Fortaleza, Ceará state, but also including six other towns or cities (fig. 1). After quality control, we were left with 1,538 samples unrelated to second degree, and 755,801 SNPs (table 1).

FIG. 1. The inset shows the location of study recruitment centers in the North-Eastern region of Brazil. The location of Fortaleza is indicated by the yellow star icon, and other state color-coded locations are: Picos (Piauí, dark blue); Ouricuri (Pernambuco, light blue); Crato (Ceará, green); Cajazeiras, Sousa and Patos (Paraíba, red). For scale, the distance from Fortaleza to Picos (1.) or from Fortaleza to Ouricuri (2.) is approximately 300 miles.

We recapitulated the PCA results in an unsupervised analysis using ADMIXTURE 1.23 (Alexander et al. 2009; Alexander and Lange 2011). We tested all models from K = 1 to 10 for the Brazil samples and found that K = 3 minimized the cross-validation error, suggesting a latent three ancestral cluster model best fitted the data (supplementary fig. S1, Supplementary Material online). We computed pairwise Hudson Fst values (Bhatia et al. 2013) between the three inferred ancestral components and 1KG populations.

Figure 3 distinguishes the degree of Eurasian admixture in the sub-Saharan African populations, with more Eurasian admixture in The Gambia (GWD) and Kenya (LWK) than in the Mende of Sierra Leone (MSL) or YRI/ESN (Bhatia et al. 2011; Gurdasani et al. 2015). The Brazilians (and ACB and ASW) are smeared along an axis from the centroid of the non-African 1KG cluster towards the West-Central African YRI/ESN, representing more recent admixture with African populations that derived from near modern day Nigeria and the Gulf of Guinea, rather than further West towards Sierra Leone, Senegal, and Gambia. PC3 segregated the Mende MSL from the other West and Central African populations.

Proxy Population Samples for the Amerindian and African Components of Ancestry in North-Eastern Brazil
The admixture estimates from an unsupervised admixture analysis are known to be potentially biased and inaccurate (Tang et al. 2005; McVean 2009), so including samples of the source admixing populations in a supervised admixture analysis would likely yield more accurate estimates (Alexander and Lange 2011). The analysis above suggested the Iberian Spanish (IBS) samples as a proxy for the European ancestry, but in the absence of a large enough sample of known, nonadmixed native Amerindian samples available as a proxy, we sought to develop an ancestral proxy for the native Amerindian ancestry using available public genome-wide genotype data. We ranked the 1000 Genomes reference samples by their increasing PC2 coordinate in figure 2 and selected Amerindian proxy samples from the samples with the most negative PC2 values, that is, containing the smallest proportion of non-Amerindian admixture, up to a maximum of N(proxy) total samples. As seen in figure 2, the most extreme PC2 samples are predominantly Peruvians from Lima (PEL) with smaller numbers of Mexicans (MXL) and Colombians (CLM). We used a similar strategy to select African proxy samples. Although the 1KG

Estimates of Admixture Proportions in the Brazil Samples Using Supervised Admixture Analysis
Having established the best 1KG proxy samples, we identified the optimal number of proxy samples to use for the supervised analysis to maximize the precision and minimize the bias of the estimates of ancestral proportions. We varied the number of ancestral samples, kept equal across the proxy groups, in supervised ADMIXTURE analyses with bootstrap estimates of the standard error of ancestry proportions, reasoning that the standard error would decrease (and precision increase) with the addition of more proxy samples until a minimum was reached, after which addition of more samples would increase the standard error. In supplementary figure S2, Supplementary Material online, we show that a minimum was reached with N = 30 samples of each of the three ancestral proxy groups, although the minimum is broad and little changed with N = 20-50, so that even a relatively small number of supervising samples can be valuable in reducing variance of the estimates.

There was no significant difference in the mean ancestry between any pair of the five study groups located in Fortaleza after multiple testing correction (all P-values > 0.05/15), but the mean ancestry in the Recodisa case control group (enrolled in six other cities across four North-Eastern states) differed from all five Fortaleza groups (P < 1 × 10⁻¹⁰). The participants in Recodisa had slightly higher mean European (61% vs. 52-55%) and African ancestry (24% vs. 21-22%) and were more variable in these components, while their mean Amerindian ancestry was lower (15% vs. 24-26%) and less variable.

... relative to a hypothetical single ancestral population from which the European, Amerindian, and African admixture components emerged. This is monotonically related to the population branch statistic method as described in Materials and Methods. The overall distribution of the 400,150 SNP Fst values was exponential-like in the right hand tail, with mean, median, and 75th percentile Fst values of 0.079, 0.041, and 0.126, respectively.

FIG. 4. The proportion of continental ancestry within the Brazil samples, estimated using supervised ADMIXTURE analysis. The K = 3 ancestral components are labeled "Eur" predominantly European; "Amr" predominantly Amerindian; "Afr" predominantly African. In each panel, each individual sample along the x-axis is a narrow vertical bar with three color intervals along the y-axis that are proportional to the percentage of the three ancestries and sum to 100% (y = 1.0). In panel a, the Brazil samples are sorted along the x-axis from lowest to highest fraction of Eur ancestry (blue); in panel b, sorted by fraction of African ancestry (grey); in panel c, by fraction of Amerindian ancestry (yellow). The table shows the mean proportion of each ancestry with 95% confidence interval, and range.

... Lahore (PJL) the next closest in the second cluster. We ran TREEMIX 1.12 to better understand the apparent relationship between BEB and the BRN2(Amr) component (Pickrell and Pritchard 2012). We included BRN2(Amr), all Asian 1KG populations and YRI as an outgroup and varied the number of migrations from 0 to 8. Since the 5-migration model was only slightly worse than a 6-migration model based on the change in log likelihood and a distinct flattening in the log likelihood change profile (supplementary fig. S8, Supplementary Material online), and the 5-migration model generated acceptable residuals (supplementary fig. S9, Supplementary Material online), we accepted this as the preferred model. The phylogenetic plot (supplementary fig. S10) illuminates the source of the similarity in BEB and shows that the Amerindian component is most genetically similar to an admixed Central-East Asian ancestral group containing later admixture between a descendant group of the North-Indian subcontinent clade (PJL, GIH, and BEB) and an older South-East Asian/Japan lineage, albeit with evidence of later reverse gene flow between the Indian and East Asian clades.
Similar to the LSBL test, the distribution of the Fst values was also exponential-like, with mean, median, and 75th percentile Fst values of 0.083, 0.045, and 0.123, respectively.
The top 10 most highly differentiated loci by Fst between the BRN2(Amr) component and the BEB population are ...

To better understand the genetic history of the top differentiated SNPs and test for the possibility that the differentiation arose from founder effects within Asia, we computed the allele frequencies and locus-specific Fst values for ten of the 1KG populations: seven Asian populations resident within Asian countries; and Gujarati from Houston (GIH), LWK (Kenyans from Webuye), and TSI (Italians from Tuscany) as proxies for the ancestral African and European populations (table 6). TSI was chosen as the geographically closest population to the Levantine migration routes out of Africa, although IBS (Spaniards) gave very similar results. In the five SNPs, we found three distinct patterns in the Fst and allele frequencies across the populations. SNPs in ADAMTS9 (rs7631391, chr3), ARHGAP5 (rs7151991, chr14), and CIITA (rs45601437, chr16) showed a trend from high Fst and low frequency in Africa, with incremental Fst decreases and allele frequency increases in Asian populations, and with the largest change occurring in the BRN2(Amr) component. The DOCK2 region (rs77594147, chr5) showed a low Fst in Africa (0.541 allele frequency) but higher Fst and frequency <0.10 in all other populations except Brazil Amerindian. The third pattern, in SLC28A1 (rs28649017, chr15), was an increasing Fst and increasing allele frequency from Africa to BEB and PJL and then a dramatic decrease in Fst and allele frequency in the rest of Asia and BRN2(Amr).

Discussion
Brazil poses a complex methodological problem for genetic analysis due to extensive recent admixture between individuals of European, African, and Native American Indian descent, combined with a complex history of migration and forced slavery. This complexity is an advantage for disease gene mapping because it allows the interrogation of wider genetic variation and resulting clinical and biological effects. Our purpose in this study was to develop accurate estimates of ancestry from genome-wide SNP data and use the jointly fitted SNP allele frequencies in a genome-wide scan for the most highly differentiated loci in the Amerindian ancestry component. Although not proven, these loci are strong candidates for having been under selection pressure. We chose to focus on Amerindian locus differentiation within our Brazil population since less work has been possible on Native Amerindian population genetics due to community sensitivities and fewer publicly available large Amerindian genome-wide data sets. By ranking SNPs in the upper tail of the genome-wide distribution of Fst values from a locus-specific branch length (LSBL) test of reconstructed Amerindian ancestry in Brazilians, we found five genome regions containing the top 10 most differentiated SNPs, which are therefore candidates for selection (table 4 and fig. 5). The LSBL analysis against the single ancestral root population is a model for testing SNP locus differentiation but is not intended to be a literal model of the archaic history of the continental populations and does not model complex ancestral history, replacement, or migrations between Africa, Europe, and the Americas.
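The differentiation statistics discussed above can be illustrated with a minimal sketch. It assumes the per-SNP ratio form of the Hudson Fst estimator given in Bhatia et al. (2013) and the usual three-population definitions of LSBL and the population branch statistic (PBS); it is not the study's actual pipeline. The function names, the chromosome counts `n1` and `n2`, and the clipping of Fst before the log transform are illustrative assumptions, because the ancestral allele frequencies in this study were fitted jointly by ADMIXTURE rather than counted in discrete samples.

```python
import math


def hudson_fst(p1, n1, p2, n2):
    """Per-SNP Hudson Fst from two sample allele frequencies.

    p1, p2: allele frequencies in populations 1 and 2.
    n1, n2: numbers of sampled chromosomes (hypothetical inputs here).
    """
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    # A SNP fixed for the same allele in both populations carries no signal.
    return num / den if den > 0 else 0.0


def lsbl(fst_ab, fst_ac, fst_bc):
    """Locus-specific branch length of population A in the triple (A, B, C)."""
    return (fst_ab + fst_ac - fst_bc) / 2.0


def pbs(fst_ab, fst_ac, fst_bc, max_fst=0.999999):
    """Population branch statistic for A, using T = -log(1 - Fst).

    Fst is clipped to [0, max_fst] so the logarithm is defined; the
    clipping bound is an assumption made for this sketch.
    """
    def branch(fst):
        clipped = min(max(fst, 0.0), max_fst)
        return -math.log(1.0 - clipped)

    return (branch(fst_ab) + branch(fst_ac) - branch(fst_bc)) / 2.0


def top_snps(scores, k=10):
    """Return the k SNP identifiers with the largest statistic."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy allele frequencies for one SNP in Amerindian, European and African components.
p_amr, p_eur, p_afr = 0.80, 0.15, 0.10
f_ae = hudson_fst(p_amr, 200, p_eur, 200)
f_af = hudson_fst(p_amr, 200, p_afr, 200)
f_ef = hudson_fst(p_eur, 200, p_afr, 200)
print(lsbl(f_ae, f_af, f_ef), pbs(f_ae, f_af, f_ef))
```

Because the log transform is monotone in Fst, ranking SNPs by either statistic emphasizes the same kind of branch-specific outlier, which is the property the scan relies on.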
The top differentiated SNP (rs6498115) was located within the proximal promoter of the CIITA gene, 6 kb upstream of the start site of CIITA transcription, within prominent H3Kme1 and H3K27Ac epigenetic marks in a region of transcriptional enhancement, and is a strong positional and functional candidate as a transcriptional regulator of CIITA expression. The LD region for this SNP extended from EMP2 (5′) to the PRM1/RMI2 gene cluster (3′). The second SNP was in an intron of the structural maintenance of chromosomes 6 gene SMC6 (rs1834619), a gene that is obligate for normal development and chromosome structure but without prior clinical research results or GWAS signals to suggest possible beneficial changes in associated human phenotypes. Other SNPs were located in an intron of kelch-like family member 29 (KLHL29; rs2288697); in a gene desert region of chromosome 16 (rs2866065, 75.8 Mb), approximately 5 kb from a localized isolated genome region conserved in mammals and with a cluster of transcription factor sites identified by ChIP-Seq experiments (http://genome.ucsc.edu; last accessed October 16, 2016); and in an intron of Meis homeobox 2 (MEIS2; rs16964480).
In a second similar analysis of pairwise Fst values computed between the Brazilian Amerindian admixture component and the closest Asian population, Bengalis in Bangladesh (BEB), we found evidence of extreme differentiation of the top 10 SNPs in eight regions in the reconstructed Amerindian branch compared with a single ancestral root population (table 5). By comparing these two analyses we hoped to gain insight into where and when in the complex history of the Amerindian population the locus-specific differentiation may have occurred, before or during the Asia-to-Americas migration. The only region that contained SNPs ranking in the top 10 SNPs of both analyses was the CIITA region, and as table 6 showed, the greatest component of the differentiation is most likely to have occurred during the geographical migration beginning in Asia and ending in the North-Eastern region of Brazil, although it may already have begun in Asia. The LSBL test region SMC6 (rs1834619, chromosome 2) also contained SNPs within the top 500 of the Amerindian vs. Asia analysis (second green track, supplementary fig. S4, Supplementary Material online), which provides suggestive evidence for differentiation during the peopling of the Americas. We also found four other interesting regions in the Amerindian vs. Asian analysis that did not appear in the top 10 LSBL SNPs, although the ADAMTS9 (rs7631391, chromosome 3), DOCK2 (rs77594147, chromosome 5), and ARHGAP5 (rs7151991, chromosome 14) regions also contained SNPs that were ranked in the top 500 of the LSBL test (top green track, supplementary figs. S11-S13, Supplementary Material online, respectively).
From comparison of the pairwise Fst and allele frequencies across the 10 populations with the Brazil Amerindian in table 6, ADAMTS9 and ARHGAP5 showed a smooth trend of decreasing Fst and increasing allele frequency from Africa to Asia, with the largest change in Fst in the transition to Amerindian, also suggesting these are good candidates for New World selection. Of the other top gene regions, DOCK2 showed a much higher allele frequency in Africa than in the European or Asian populations, and higher still in Brazil Amerindian. One possible explanation is that the gene variant was originally at low frequency in the founding African migration, but experienced selection pressure separately and independently in Africa and the Americas. This pattern explains why the LSBL test for the SNP did not yield an extreme Fst statistic yet the SNP was highly differentiated in the Asian comparison. The final gene region, SLC28A1 (rs28649017, chromosome 15), contains a SNP that rose in Fst and allele frequency from Africa to Europe to Bangladesh and Pakistan but which subsequently drifted or experienced downward allele frequency selection pressure within Asia, resulting in a similar frequency in Brazil Amerindian (table 6); this change therefore appears to have occurred pre-migration. The reason for this is unknown.
The genes we have implicated have known functions that span biological processes that could potentially influence reproductive or survival fitness in different and fascinating ways. CIITA regulates MHC class II gene transcription and has been called "the master control factor" for expression of these genes. CIITA has been implicated in immune function through association with autoimmune diseases or very recently with leprosy (Liu et al. 2015). The gene complex including CIITA and neighboring DEXI/CLEC16A has been shown to be associated with multiple autoimmune diseases (Bronson et al. 2011; Gyllenberg et al. 2012, 2014; Leikfoss et al. 2015). DOCK2 is predominantly expressed in hematopoietic cells, regulates migration and activation of neutrophils through Rac activation (Nishikimi et al. 2013) and is associated with early-onset invasive infections (Dobbs et al. 2015). ADAMTS9 has been shown to be associated with body fat distribution (Liu et al. 2013) and other anthropometry/metabolic traits including type 2 diabetes (Zeggini et al. 2008; Heid et al. 2010; Randall et al. 2013), as well as age-related macular degeneration (Fritsche et al. 2013) and other traits. ARHGAP5 is one of the RhoGTPase family important in embryonic development (Heckman et al. 2007) and in modulating myometrial contractility in uterine smooth muscle, including during pregnancy (O'Brien et al. 2008). SLC28A1 codes for a concentrative nucleoside transporter primarily recovering pyrimidines from urine in kidney (Elwi et al. 2006), but may also have a role in immunity and macrophage activation (Loffler et al. 2007).
Genome-Wide Analysis in Brazilians . doi:10.1093/molbev/msw249 MBE

Our unsupervised analysis of genetic ancestry within our North-Eastern Brazil samples showed that an admixture model of three continental populations, Africa, Europe, and Amerindian, was sufficient to explain the most important ancestral structure, although if we had included other diverse Latin American samples in our admixture and PCA, we undoubtedly would have found finer structure of admixture (Johnson et al. 2011; Moreno-Estrada et al. 2013, 2014) but this was not the goal of the study, and would have been problematic for accurate supervised ancestry estimation at fine resolution. The unsupervised admixture analysis also showed that the African component was closest to African Americans in the US South-West (ASW) and African Caribbeans in Barbados (ACB) but this was most likely due to imperfect partitioning of genetic variance between admixing continental components and European admixture retained within the inferred African component in the absence of supervising proxy samples in the ADMIXTURE analysis, rather than significant recent differentiation (Bhatia et al. 2014). Among populations in Africa, the unsupervised African component was actually most similar to the Luhya in Kenya, but this was also probably biased by similarity to older East African variation through residual European or Amerindian variation. The Africa-centric PCA clearly showed that the recent North-Eastern Brazil admixture arose from a population genetically closer to the West African Yoruba/Esan populations near the Bight of Benin, modern day Nigeria. This is consistent with the history of slave importation into Brazil. Three broad periods of slave importation are recognized, roughly corresponding to the 16th century (Senegambia/Upper Guinea); 17th century (a switch to importation from Central/West Africa, modern day Congo and Angola); and the 18th century (Mina Coast/Lower Guinea) (Sweet 2003).
Before 1700 only 13% of total African slaves came from the Bight of Benin, while in the period 1700-1850 approximately 55% of the slaves that landed in Bahia province-the major landing point in North-Eastern Brazil-came from the Bight (Klein and Luna 2010). The European admixture component is most likely derived from Southern Europe with latter day Spanish being the closest match of the available 1000 Genomes populations, which probably reflects earlier Portuguese influence, although Spaniards and Dutch were also present as explorers and colonial powers.
Inclusion of proxy samples for the component ancestries in a supervised ADMIXTURE analysis resulted in significantly different estimates of individual admixture. Similar results were found in one of the very few and limited genetic studies of North-Eastern Brazilians (768 SNPs), with the use of differing pseudo-ancestral populations (Magalhaes da Silva et al. 2015). We found that as few as 30 proxy samples for each ancestry were sufficient, and of the 1000 Genomes populations, the closest proxies for the predominantly European, Amerindian, and African admixed components were respectively Spanish (IBS); Peruvians (PEL) and a few Mexicans (MXL); and Esan/Yorubans (ESN/YRI). The North-Eastern Brazilians had a mean ancestry of 57% European, 20% Amerindian, and 23% African, although with considerable ranges of individual ancestries within each component (19-95%, 2-49%, and 4-67%, respectively). The mean ancestries of the five study groups located in coastal Fortaleza, the capital of Ceará state, did not differ, but the Recodisa case control study group, which was enrolled in six other noncoastal cities across four North-Eastern states including Ceará, had slightly higher and more variable percentages of European and African ancestries, but about 10% lower and less variable Amerindian ancestry.
The advantages of including proxies to identify components of admixture have been described previously, but using different methods to assess likely accuracy of ancestral reconstruction (Falush et al. 2003; Tang et al. 2005; Alexander et al. 2009; Alexander and Lange 2011). Although we believe these differences are an improvement in the estimates of admixture proportions, a possible alternative explanation is that supervised ancestry estimates are biased due to European admixture in the Amerindian proxies, and to a lesser degree, in the African samples. This illustrates the challenges in accurate admixture analysis and identifying suitable nonadmixed reference or proxy samples for supervised ancestral deconvolution.
Previous studies of urban and regional Brazilian populations have shown that the predominant admixture components are European, African, and Amerindian with systematic variation in the proportions between the five major regions in Brazil, although the accuracy may have been limited by small SNP panels and unsupervised or joint admixture estimation (Ruiz-Linares et al. 2014). More recent genome-wide panels with thousands of samples have found similar structure (Giolo et al. 2012;Kehdy et al. 2015;Lima-Costa et al. 2015) but this is the first study using Brazil samples that has attempted to carefully select best matching proxies and derive supervised genome-wide estimates of admixture components based on ancestral similarity. No other studies have attempted to interrogate the latent ancestry in Brazil for putative selection.
Based on the Fst genetic differentiation results and TREEMIX analyses, we found that the Bengali population is the closest proxy of the Asian 1KG populations for the source ancestral Asian population of migrants into the Americas, believed to be from North-Eastern Siberia (Zakharov et al. 2004; Achilli et al. 2013). The TREEMIX phylogenetic plot (supplementary fig. S8, Supplementary Material online) showed that the Amerindian component is most genetically similar to an admixed Central-East Asian ancestral group containing later admixture between a descendant group of the North-Indian subcontinent clade (PJL, GIH, and BEB) and an older South-East Asian/Japan lineage, albeit with evidence of later reverse gene flow between the Indian and East Asian clades. The topology of the plot is consistent with very recent results from reconstruction of ancestral relationships and admixture events within the Central/East Asia region and an emerging model of an early Southern route migration out of Africa through South Central and Eastern Asia (Duggan and Stoneking 2014; Qin and Stoneking 2015). There is evidence from Bronze Age specimens of Central-East Asian admixture in regions in Siberia proposed as a possible source of the migratory proto-Amerindian population (Hollard et al. 2014), which now seems to have occurred as a single migration wave, approximately 23 kya (Raghavan et al. 2015).

Mychaleckyj et al. . doi:10.1093/molbev/msw249 MBE
In conclusion, we have identified multiple differentiated regions in the Amerindian ancestral component of the North-Eastern Brazil population, using samples drawn from six separate studies. These regions contain SNPs located in genes involved in immune function, metabolism, embryonic development, and other diseases and traits. We recognize that our results could be biased by the genetic panel used as the source of the SNP genotyping data, although the panel is informative for Latin American populations; that we have only investigated the ancestry derived from a single country, albeit one with a high degree of admixture; and that we have not proven that the most differentiated SNPs or genes are functionally under selection. Further work is needed to replicate these findings in other studies and to understand the health implications of the results.

Study Populations
The genetic samples analyzed in this work were drawn from six cohorts/studies conducted on populations in North-Eastern Brazil and centered on Fortaleza, Ceará state. They have been previously described in detail and will only be reviewed briefly here. The Gonçalves Dias cohort was recruited in the Gonçalves Dias favela in Fortaleza between 1989 and 1993 to study the epidemiology, nutritional impact and causes of persistent diarrhea in early childhood (Lima et al. 2000). The Malnutrition and Enteric Disease Network (Mal-ED) Birth Cohort enrolled 242 children within 17 days of birth between 2010 and 2014; an additional 101 infants recruited under the ICIDR (International Center for Infectious Disease Research) program and evaluated by the same procedures as Mal-ED are included in the cohort. The prospective Mal-ED case-control (MCC) study enrolled 401 children 6-18 months of age between 2010 and 2014. Both Mal-ED study groups were enrolled in Fortaleza (Lima et al. 2014). The Recodisa prospective case-control study enrolled 1200 children aged 2-36 months between 2010 and 2014 from hospitals or clinic facilities in six semiarid countryside cities of North-Eastern Brazil to study the etiology of diarrhea. The cities were Crato (Ceará state), Cajazeiras, Sousa, and Patos (Paraíba state), Ouricuri (Pernambuco state), and Picos (Piauí state) and had >50,000 inhabitants in states with >50% area localized inside the Brazilian Semiarid region. Target enrollment was 100 cases and controls from each city.
The Parque Universitário Zinc-Vitamin A clinical trial cohort enrolled 324 children between 2000 and 2006, and the Parque Universitário Zinc-Arginine clinical trial cohort enrolled 349 infants between 2006 and 2010, both from the Parque Universitário favela in Fortaleza (Lima et al. 2013). All families gave informed consent for genetic research into diseases and traits linked to malnutrition. The study protocols were approved by the Federal University of Ceará Committee for Ethics in Research and the University of Virginia Institutional Review Board for Health Sciences Research. Although this study contained incidental analysis of ancestry and anthropology, this was performed in so far as it was required to construct correctly-adjusted statistical tests to identify regions of the genome and SNPs that might be linked to disease susceptibility and to enable future tests of association. All participants were de-identified to the analysis and no interpretation of the ancestry of specific identified participants was performed.

Genome-Wide Genotyping and Quality Control
Saliva samples from all children were collected using Oragene DNA kit G-250 (DNA Genotek, Ontario, Canada). Briefly, the sample collector was mixed gently and incubated at 50 °C for 1 h in a water bath. Unabsorbed liquid was transferred to a conical 15 mL centrifuge tube and the barrel of a 5 mL disposable syringe, containing collected sponges, was also placed inside the tube and centrifuged at 200 × g for 10 min at 20 °C. After centrifugation, the syringes were removed and the DNA was manually extracted from 4.0 mL of Oragene DNA/saliva according to published vendor protocols. All samples were genotyped on the Affymetrix Axiom Latin America Array (LAT-1) with 818,154 SNPs and Indels specifically informative for Hispanic and Latin American populations. A total of 2,119 Brazil sample CEL files were processed using Affymetrix Power Tools (APT 1.16.0), applying the vendor's best practices quality control (QC) criteria for samples and SNPs. More details are available in supplementary table S1, Supplementary Material online. After Affymetrix quality control, 1,659 samples and 755,801 SNPs were available for genetic ancestry analysis. Additional sample quality control for cryptic relatedness up to degree 2 was performed using KING (Manichaikul et al. 2010), and after removing related and sex-misclassified samples, 1,538 samples remained. Further SNP QC, dropping SNPs with a call rate <99% and/or minor allele frequency (MAF) <5%, resulted in a total autosomal chromosome SNP count of 410,172 SNPs. Finally, these SNPs were thinned to reduce residual linkage disequilibrium (LD), so that the maximum inter-SNP r² was 0.3, resulting in 199,654 SNPs. Plink 1.07 was used for genetic data management and to calculate LD (Purcell et al. 2007).
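The SNP-level filters described above (call rate, MAF, and LD thinning) can be illustrated with a minimal Python sketch. This is not the Plink 1.07 procedure used in the study: the function names `snp_qc_mask` and `ld_thin`, the genotype coding (alternate-allele counts 0/1/2 with NaN for missing calls), the mean imputation, and the `window` parameter are all assumptions made for illustration, and the greedy thinning is a simplified stand-in for Plink's windowed pruning.

```python
import numpy as np

def snp_qc_mask(genotypes, min_call_rate=0.99, min_maf=0.05):
    """Boolean mask of SNPs (columns) passing call-rate and MAF filters.

    genotypes: (n_samples, n_snps) float array of alternate-allele counts
    (0, 1, 2), with np.nan marking missing calls (assumed coding).
    """
    n_called = (~np.isnan(genotypes)).sum(axis=0)
    call_rate = n_called / genotypes.shape[0]
    alt_count = np.nansum(genotypes, axis=0)
    # Allele frequency among called genotypes (two alleles per sample).
    with np.errstate(divide="ignore", invalid="ignore"):
        alt_freq = np.where(n_called > 0, alt_count / (2 * n_called), np.nan)
    maf = np.minimum(alt_freq, 1 - alt_freq)
    return (call_rate >= min_call_rate) & (maf >= min_maf)

def ld_thin(genotypes, max_r2=0.3, window=50):
    """Greedy LD thinning (simplified; not Plink's algorithm).

    A SNP is kept only if its r^2 with each of the last `window` kept SNPs
    is <= max_r2. Missing calls are mean-imputed for the correlation.
    Returns the column indices of the kept SNPs.
    """
    col_mean = np.nanmean(genotypes, axis=0)
    g = np.where(np.isnan(genotypes), col_mean, genotypes)
    g = g - g.mean(axis=0)
    norm = np.sqrt((g ** 2).sum(axis=0))
    kept = []
    for j in range(g.shape[1]):
        if norm[j] == 0:
            continue  # monomorphic SNP carries no LD information
        recent = kept[-window:]
        if recent:
            r = g[:, recent].T @ g[:, j] / (norm[recent] * norm[j])
            if np.max(r ** 2) > max_r2:
                continue
        kept.append(j)
    return np.array(kept, dtype=int)
```

In such a sketch the mask would be applied first (`g[:, snp_qc_mask(g)]`) and the thinning run on the surviving columns, mirroring the order of the steps in the text.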

1000 Genomes Project Data and Quality Control
The 1000 Genomes Project (1KG) phase 3 release data (version date 2013/05/02) was downloaded (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502; last accessed January 30, 2015) and contained 2,504 samples from 26 populations. KING was used to identify residual relatedness up to degree 2, inferring eight parent-offspring, four full-sib and three 2nd degree relative pairs, with one family of size 3. After dropping one sample from each related pair and retaining SNPs with MAF > 1%, the total data set was 2,490 samples × 30.7M variants. Intersection of the post-QC Brazil SNPs (MAF ≥ 0.05) with the 1KG data resulted in 400,150 total autosomal SNPs for locus testing. Merging the Brazil low-LD data set with the 1KG data resulted in 195,090 autosomal SNPs. This data set was used for the Brazil + 1KG joint PCA and supervised admixture analyses.
Genome-Wide Analysis in Brazilians. doi:10.1093/molbev/msw249 MBE

Principal Component and Admixture Analysis
Admixture analysis was performed using ADMIXTURE v1.23 in supervised and unsupervised modes as described in the main text, and was run with 10 cross-validation folds (--cv=10) (Alexander et al. 2009; Alexander and Lange 2011). Principal component analysis was performed using the EIGENSOFT package v5.0.1 (Price et al. 2006). Unsupervised analysis used the Brazil post-QC LD-thinned SNP set of 199,654. For supervised analyses, the number of proxy reference samples was constrained to be equal for each component ancestry to reduce the bias in estimation of ancestry composition through different likelihood weighting, or the distortion of principal component axes (McVean 2009).
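The equal-sized reference panels described for the supervised analyses can be sketched as a simple subsampling step. This is an illustrative assumption about how such balancing might be done, not the authors' procedure: the function name `balance_reference_panels`, the random subsampling, and the ancestry labels used in the usage note are hypothetical.

```python
import random
from collections import defaultdict

def balance_reference_panels(sample_to_ancestry, seed=0):
    """Subsample proxy reference samples so each ancestry has equal size.

    sample_to_ancestry: dict mapping sample ID -> ancestry label.
    Returns a dict mapping ancestry label -> sorted list of retained IDs,
    each list as long as the smallest reference panel.
    """
    by_ancestry = defaultdict(list)
    for sample, ancestry in sample_to_ancestry.items():
        by_ancestry[ancestry].append(sample)
    target = min(len(ids) for ids in by_ancestry.values())
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    return {
        ancestry: sorted(rng.sample(sorted(ids), target))
        for ancestry, ids in by_ancestry.items()
    }
```

For example, given hypothetical panels of 5 "EUR", 3 "AFR" and 4 "AMR" proxy samples, the function would retain 3 samples from each, so that no single reference ancestry dominates the likelihood or the principal component axes.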

Statistical Genetic Analysis
Tests of equality of mean ancestry between the six study groups were performed using Hotelling's test applied to each pair of study groups, testing bivariate equality of the means of two independent ancestry proportions (%AMR, %AMR) out of the three, which are constrained to sum to one. The significance level was adjusted for 15 tests (α = 0.05/15). Genetic differentiation between ancestral admixture components and populations was measured using Hudson's Fst (Nei 1973; Hudson et al. 1992). This was computed using custom R functions according to the algorithm and estimator described in Bhatia et al. (2013). 95% confidence intervals for Fst estimates were generated using 10,000 bootstrap resamples and the bootstrap percentile method (Efron and Tibshirani 1993). The genetic differentiation of SNP loci for the Amerindian-specific ancestral component of the Brazil samples was assessed under two scenarios. The first scenario used the branch-specific Fst (Shriver et al. 2004), the locus-specific branch length (LSBL) of the Amerindian component relative to the European and African components. This statistic is monotonically related to the population branch statistic, which has an identical form but with the scaled population divergence time, estimated as T = −log(1 − Fst) (Cavalli-Sforza 1969), substituted for each pairwise Fst. The second scenario identified the most highly genetically differentiated loci comparing the closest proxy Asian 1KG population, Bengalis in Bangladesh (BEB), to the Brazil Amerindian BRN2(Amr) component, using direct pairwise Fst (BRN2(Amr) vs. BEB). R version 3.0.3 was used for all other statistical analyses (R Core Team 2014).
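The differentiation statistics named above can be sketched in Python as follows. The study used custom R functions that are not reproduced here, so this is only an illustration under stated assumptions: the per-SNP Hudson estimator follows the form given in Bhatia et al. (2013), the LSBL for a population A against B and C is taken in its usual form (Fst_AB + Fst_AC − Fst_BC)/2, and the population branch statistic substitutes T = −log(1 − Fst) for each pairwise term. The function names, and the treatment of `n1` and `n2` as numbers of sampled alleles, are assumptions (for ADMIXTURE-estimated ancestral frequencies the appropriate effective sample size is not stated in the text).

```python
import numpy as np

def hudson_fst_terms(p1, n1, p2, n2):
    """Per-SNP numerator and denominator of Hudson's Fst estimator.

    p1, p2: allele frequencies in the two populations (arrays).
    n1, n2: numbers of sampled alleles in each population (assumed meaning).
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den

def lsbl(fst_ab, fst_ac, fst_bc):
    """Locus-specific branch length for population A against B and C."""
    return (fst_ab + fst_ac - fst_bc) / 2.0

def pbs(fst_ab, fst_ac, fst_bc):
    """Population branch statistic: LSBL form with T = -log(1 - Fst)."""
    t_ab, t_ac, t_bc = (-np.log(1.0 - f) for f in (fst_ab, fst_ac, fst_bc))
    return (t_ab + t_ac - t_bc) / 2.0

def bootstrap_fst_ci(num, den, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the ratio-of-averages Fst over SNPs."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    rng = np.random.default_rng(seed)
    m = len(num)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, m, size=m)  # resample SNPs with replacement
        estimates[b] = num[idx].sum() / den[idx].sum()
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(estimates, [tail, 100 - tail])
    return lower, upper
```

Per-SNP Fst is `num / den`, while a genome-wide estimate is the ratio of the summed numerators to the summed denominators, which is the quantity the bootstrap above resamples. Ranking SNPs by `lsbl` or by `pbs` gives the same order only where the pairwise terms behave monotonically together, so the two are best read as closely related rather than interchangeable.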

TREEMIX Analysis
The relationship between the Amerindian component BRN2(Amr) and 1KG Asian populations was analyzed using TREEMIX 1.12 (Pickrell and Pritchard 2012). We included the Yorubans (YRI) as an outgroup and varied the number of migrations from 0 to 8. The BRN2(Amr) genotype counts were estimated as 2 × mean proportion of BRN2(Amr) admixture (0.2) × ADMIXTURE estimated allele frequency, rounded to the nearest integer. We compared the plots of residuals and tested the change in final composite log maximum likelihood using 10,000 bootstrap replicates of the SNPs with the same seed to estimate the bias-corrected 95% confidence interval on the log likelihood (-bootstrap -seed options in TREEMIX). We compared the 1-sided 95% confidence interval test of the model log likelihood with k migration events vs. k − 1 events to identify the most parsimonious migration model, beyond which further increase in the migration parameter led to insignificant improvement in the maximum likelihood.

Supplementary Material
Supplementary tables S1-S5 and figures S1-S28 are available at Molecular Biology and Evolution online.