Erica Bianco, Guillaume Laval, Neus Font-Porterias, Carla García-Fernández, Begoña Dobon, Rubén Sabido-Vera, Emilija Sukarova Stefanovska, Vaidutis Kučinskas, Halyna Makukh, Horolma Pamjav, Lluis Quintana-Murci, Mihai G Netea, Jaume Bertranpetit, Francesc Calafell, David Comas, Recent Common Origin, Reduced Population Size, and Marked Admixture Have Shaped European Roma Genomes, Molecular Biology and Evolution, Volume 37, Issue 11, November 2020, Pages 3175–3187, https://doi.org/10.1093/molbev/msaa156
Abstract
The Roma Diaspora—traditionally known as Gypsies—remains among the least explored population migratory events in historical times. It involved the migration of Roma ancestors out-of-India through the plateaus of Western Asia ultimately reaching Europe. The demographic effects of the Diaspora—bottlenecks, endogamy, and gene flow—might have left marked molecular traces in the Roma genomes. Here, we analyze the whole-genome sequence of 46 Roma individuals pertaining to four migrant groups in six European countries. Our analyses revealed a strong, early founder effect followed by a drastic reduction of ∼44% in effective population size. The Roma common ancestors split from the Punjabi population, from Northwest India, some generations before the Diaspora started, <2,000 years ago. The initial bottleneck and subsequent endogamy are revealed by the occurrence of extensive runs of homozygosity and identity-by-descent segments in all Roma populations. Furthermore, we provide evidence of gene flow from Armenian and Anatolian groups in present-day Roma, although the primary contribution to Roma gene pool comes from non-Roma Europeans, which accounts for >50% of their genomes. The linguistic and historical differentiation of Roma in migrant groups is confirmed by the differential proportion, but not a differential source, of European admixture in the Roma groups, which shows a westward cline. In the present study, we found that despite the strong admixture Roma had in their diaspora, the signature of the initial bottleneck and the subsequent endogamy is still present in Roma genomes.
Introduction
Roma people, also known by the misnomer Gypsies, are the largest transnational minority in Europe, accounting for 10–15 million people dispersed across the continent (Liégeois 1994; Fraser 1995); yet, little is known about the details of the Roma Diaspora. According to anthropological, linguistic (reviewed in Liégeois 1994; Fraser 1995), and genetic studies (Gresham et al. 2001; Kalaydjieva et al. 2001; Morar et al. 2004; Mendizabal et al. 2011, 2012; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016; Melegh et al. 2017), Roma left the North Western part of the Indian subcontinent ∼15 centuries ago, crossed the Iranian, Armenian, and Anatolian plateaus, and reached Europe in the 9th–10th century CE. In Europe, Roma migrated within the continent in different waves, associated with the Romani dialects spoken by the different groups (Liégeois 1994; Fraser 1995; Hancock 1995; Gresham et al. 2001). There are four main migrant groups: Balkan and Vlax Roma, living in the Balkan Peninsula; Romungro Roma, living in central Europe; and North/Western Roma, living in Northern and Western Europe (Liégeois 1994; Fraser 1995).
Despite the recent origin of Roma people (Fraser 1995; Gresham et al. 2001; Chaix et al. 2004; Morar et al. 2004; Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013), all studies performed so far revealed a highly complex demographic history, at different levels (Chaix et al. 2004; Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016; García-Fernández C, Font-Porterias N, Kucinskas V, Sukarova-Stefanovska E, Pamjav H, Makukh H, Dobon B, Bertranpetit J, Netea MG, Calafell F, et al., unpublished data). The first level of complexity is represented by the aforementioned recent origin of Roma people, together with the series of bottlenecks and splits experienced during their Diaspora (Chaix et al. 2004; Mendizabal et al. 2012), and different levels of endogamy (Liégeois 1994; Fraser 1995; Chaix et al. 2004; Kalaydjieva et al. 2005; Mendizabal et al. 2012), which have left traces of low intragroup diversity and high intergroup heterogeneity (Peričić et al. 2005; Malyarchuk et al. 2006; Irwin et al. 2007; Gusmão et al. 2008; Klarić et al. 2009; Zalán et al. 2011; Salihović et al. 2011; Martínez-Cruz et al. 2016). The low effective population size (Ne) of Roma people, due to bottlenecks, founder effects, and endogamy, is thought to have resulted in the occurrence of a number of disease-linked variants that are private to the Roma groups (Kalaydjieva et al. 2001; Morar et al. 2004; Mendizabal et al. 2013). An additional level of complexity is generated by the different admixed ancestry components found in the present Roma groups. Whereas the South Asian component was likely present in proto-Roma people, represented by their Ancestral South Indian component (ASI), the West Eurasian component can be traced back either to the Ancestral North Indian (ANI) component of the proto-Roma (Reich et al. 2009; Moorjani, Patterson, et al. 2013) or to their recent admixture with other non-Roma European populations (termed “Europeans” from now on), following their arrival in Europe (Chaix et al. 2004; Peričić et al. 2005; Malyarchuk et al. 2006; Gusmão et al. 2008, 2010; Mendizabal et al. 2011, 2012; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016; Font-Porterias et al. 2019; García-Fernández C, Font-Porterias N, Kucinskas V, Sukarova-Stefanovska E, Pamjav H, Makukh H, Dobon B, Bertranpetit J, Netea MG, Calafell F, et al., unpublished data). Such recent admixture events strongly influenced the present-day Roma West Eurasian component (Moorjani, Patterson, et al. 2013; Melegh et al. 2017), but its origins have only recently started to be explored (Font-Porterias et al. 2019).
So far, incomplete genomic data, lack of representative samples, and weak definition of Roma groups have precluded a deep analysis of the Roma genomic history. To explore at high resolution the history of the Roma Diaspora, and its consequences on the genomic landscape of current Roma groups, we generated whole-genome sequences, at high coverage (∼30×), from 40 Roma volunteers from five different countries and belonging to the four main migrant groups. These new data have been compared with genomes of non-Roma surrounding populations with the aims of: 1) analyzing the origins and substructure of Roma migrant groups; 2) defining the admixture patterns and the population origins of the genetic components present in the Roma; 3) exploring the degree of endogamy of Roma groups; and, 4) providing a demographic framework of the Roma Diaspora.
Results
Population Structure of Roma
We analyzed 40 newly sequenced Roma complete genomes from unrelated volunteers, belonging to the four main migrant groups (Balkan, Vlax, Romungro, and North/Western) from five different countries, together with six Roma Vlax individuals from Romania (Dobon B, Horst R, Laayouni H, Mondal M, Bianco E, Comas D, Ioana M, Bosch E, Bertranpetit J, Netea MG, unpublished data), within the landscape of Europe, the Middle East, and the Indian subcontinent (see fig. 1A for a map of the sampling locations and supplementary tables 1 and 2, Supplementary Material online, for sample details) (Mallick et al. 2016; Mondal et al. 2016; Serra-Vidal et al. 2019). In total, we analyzed 155 genomes with an average coverage of 12–35×, containing 7,838,351 SNPs, which were reduced to 1,445,921 SNPs after filtering for missing data and pruning for linkage disequilibrium.

Map of the samples, principal component analysis (PCA) and admixture. (A) Map of the samples used in this study. Sampling locations are approximated. EUR, European non-Roma; INDN, North India; INDS, South India; ME, Middle East; PAK, Pakistan; ROMAB, Balkan Roma; ROMAN, North/Western Roma; ROMAR, Romungro Roma; and, ROMAV, Vlax Roma. (B) PCA of Roma samples together with the rest of the data set and the 1000 Genomes Project samples from AFR, EAS, EUR, and SAS. (C) Admixture analysis of Roma samples together with the rest of the data set and the 1000 Genomes Project samples from AFR, EAS, EUR, and SAS, showing Roma with their two main ancestral components, SAS and EUR. For population codes, see supplementary table 2, Supplementary Material online.
We first assessed the genetic structure of Roma using principal component analysis (PCA, fig. 1B and supplementary fig. 1, Supplementary Material online) (Patterson et al. 2006; Price et al. 2006), and showed that Roma people form a cluster that falls in a cline between Europe and South Asia, in agreement with the South Asian origin of Roma and their subsequent admixture with Europeans (Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013; Melegh et al. 2017). Within Roma, individuals from the same Roma group tend to cluster together in the PCA. We found a clear separation within the North/Western Roma, between Spanish and Lithuanian Roma, and between North/Western Roma and the other migrant groups (supplementary fig. 1, Supplementary Material online).
Admixture analysis (Alexander et al. 2009) showed the ancestral component composition of Roma (fig. 1C and supplementary fig. 2A and B, Supplementary Material online). At K = 4 (the lowest cross-validation value, supplementary fig. 2B and C, Supplementary Material online), Roma showed two main ancestry components: West Eurasian and South Asian, confirming again the South Asian origin of Roma people and the recent European admixture (Mendizabal et al. 2012; Rai et al. 2012; Moorjani, Patterson, et al. 2013; Melegh et al. 2017; Font-Porterias et al. 2019). From K = 5 onward (supplementary fig. 2A, Supplementary Material online), Roma people showed their own genetic component without any migrant group differentiation. Both PCA and admixture analyses identified an individual with full European ancestry (V5, Hungarian Vlax), who was removed from subsequent analyses (see also supplementary fig. 2B, Supplementary Material online).
Allele Sharing and Gene Flow in Roma Groups
To detect the signatures left by the European gene flow into the Roma gene pool, we analyzed the allele sharing using the outgroup f3-statistic in the form of f3(YRI; Roma, X), where Yoruba (YRI) from Africa were used as an outgroup, and X stands for the set of samples tested for comparison to Roma (Reich et al. 2009). The outgroup f3-statistic values show a cline from Europe to the Indian subcontinent linked to the route of the Roma Diaspora (fig. 2 and supplementary fig. 3, Supplementary Material online). European populations are the non-Roma group that shared the most alleles with Roma, regardless of the migrant group (supplementary fig. 3, Supplementary Material online), in agreement with the recent extensive European gene flow in Roma (Font-Porterias et al. 2019). Eastern Asians (i.e., Han Chinese [CHB]), North Africans (i.e., Egyptians [EGY]), Southern Indian populations (i.e., Irula [ILA] and Vellalar [VLR]), together with other populations outside the putative Roma Diaspora route, were the groups that shared the fewest alleles with Roma.
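As a rough illustration of what the outgroup f3-statistic measures, the sketch below computes f3(O; A, B) as the average over SNPs of (o − a)(o − b), where o, a, and b are allele frequencies in the outgroup and in the two test populations. The function name and the toy frequencies are hypothetical and are not taken from the study; the finite-sample bias correction and the block-jackknife standard errors applied by dedicated software are deliberately omitted.

```python
def outgroup_f3(freq_out, freq_a, freq_b):
    """Naive outgroup f3(O; A, B): mean over SNPs of (o - a) * (o - b).

    Larger values indicate more genetic drift shared by A and B
    after their separation from the outgroup O.
    """
    if not (len(freq_out) == len(freq_a) == len(freq_b)) or not freq_out:
        raise ValueError("need equal-length, non-empty frequency lists")
    products = [(o - a) * (o - b) for o, a, b in zip(freq_out, freq_a, freq_b)]
    return sum(products) / len(products)


# Toy allele frequencies at four SNPs (illustrative values only).
yri = [0.5, 0.5, 0.5, 0.5]          # outgroup
roma = [0.9, 0.1, 0.8, 0.2]
close_pop = [0.8, 0.2, 0.9, 0.1]    # drifts in the same direction as roma
distant_pop = [0.6, 0.4, 0.5, 0.5]  # shares little drift with roma

f3_close = outgroup_f3(yri, roma, close_pop)
f3_distant = outgroup_f3(yri, roma, distant_pop)
print(f3_close > f3_distant)
```

In this toy setting the population that shares more drift with Roma obtains the higher f3 value, which is the ordering that the cline in fig. 2 summarizes across real populations.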

Outgroup f3-statistics. The f3-statistics were calculated for Roma in the form f3 (ROMA, X; YRI), where X is any population on the Y axis and YRI is the outgroup (Yoruba). For population codes, see supplementary table 2, Supplementary Material online.
Among European populations, the f3-statistics showed no significant differences in the allele sharing by migrant group. We then tested for differences in the source of gene flow using D-statistics (supplementary note 1, Supplementary Material online) in the form D (E1, E2; CHB, Roma), where E1 and E2 are two European populations, Roma is any migrant group, and CHB is the Han Chinese as an outgroup. We found no evidence of specific European populations as the source of gene flow to a specific migrant group, although Southwestern and Southeastern Europeans appeared to have contributed more to the gene flow to Roma (supplementary note 1 and table 3, Supplementary Material online). Moreover, we tested whether migrant groups had differential European gene flow using D-statistic in the form D (E, CHB; MG1, MG2), where MG1 and MG2 are any two Roma migrant groups, and E is any European group. Our results showed a westward cline of increasing European gene flow from Balkan to North/Western Roma (supplementary note 1 and table 4, Supplementary Material online).
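To make the logic of these tests concrete, the following minimal sketch computes a D-statistic of the form D(P1, P2; P3, P4) from per-SNP allele frequencies, using the frequency-based estimator that sums (p1 − p2)(p3 − p4) over SNPs and normalizes by the sum of (p1 + p2 − 2·p1·p2)(p3 + p4 − 2·p3·p4). The function name and the toy frequencies are hypothetical, the sign convention is one of several in use, and no standard error or Z score is computed here.

```python
def d_statistic(p1, p2, p3, p4):
    """D(P1, P2; P3, P4) from per-SNP allele frequencies.

    A value near zero means P1 and P2 are symmetrically related to
    P3 and P4; a clear deviation from zero suggests excess allele sharing.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in zip(p1, p2, p3, p4):
        num += (a - b) * (c - d)
        den += (a + b - 2 * a * b) * (c + d - 2 * c * d)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den


# Toy frequencies at two SNPs (illustrative values only).
e1 = [0.2, 0.7]    # first European population
e2 = [0.4, 0.5]    # second European population
chb = [0.1, 0.9]   # outgroup-like population
roma = [0.3, 0.6]  # Roma migrant group

d_forward = d_statistic(e1, e2, chb, roma)
d_swapped = d_statistic(e2, e1, chb, roma)  # swapping E1 and E2 flips the sign
d_null = d_statistic(e1, e1, chb, roma)     # identical populations give zero
```

The same function covers the second test in the text, D(E, CHB; MG1, MG2), by passing the populations in that order.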
Among Northern Indian populations, the Punjabi (PUN) consistently shared more alleles with Roma individuals than any other Northern Indian population, and as many as or more than Pakistani populations (fig. 2 and supplementary fig. 3, Supplementary Material online), in agreement with previous observations that pointed to Punjab as the proto-Roma area of origin (Mendizabal et al. 2012; Font-Porterias et al. 2019). Although other Pakistani populations showed higher or similar allele sharing than the Punjabi, it has been shown that these populations experienced gene flow from Europe after the Roma Diaspora (Hellenthal et al. 2014), which could increase the allele sharing between Pakistani populations and Roma due to common European gene flow. To test this hypothesis, we compared the allele sharing of Pakistani and Northern Indian populations with Roma and non-Roma Europeans in the form f3 (X, EUR; YRI), where X is any Pakistani or Northern Indian population, and EUR represents Europeans (supplementary fig. 4A, Supplementary Material online). Pakistani populations (BAL, BRA, BUR, HAZ, KAL, MAK, and SIN) and Uttar Pradesh Brahmins (UBR) from Northern India either shared more alleles with Europeans than with Roma (higher f3 value) or showed no difference in allele sharing between Roma and Europeans, suggesting that the allele sharing found in the outgroup f3-statistic between Roma and Pakistani populations may be inflated by the common gene flow from Europeans rather than being just the result of a common origin.
Among Middle Eastern populations, Armenians shared the most alleles with Roma people, followed by Turks, with no differences between migrant groups (fig. 2 and supplementary fig. 3, Supplementary Material online). As for Pakistani populations, the higher allele sharing between Roma and some Middle Eastern populations than with Northern Indian populations is partly explained by the high allele sharing between these populations and Europeans (supplementary fig. 4B, Supplementary Material online). Furthermore, despite the strong allele sharing with Europeans, the genetic fingerprint of Middle Eastern populations in the Roma genomes confirms the route of the Roma Diaspora from Northern India through this part of the Middle East, before reaching Europe (Bánfai et al. 2019; Font-Porterias et al. 2019).
Within Roma, the f3-statistic values show that the amount of shared alleles between migrant groups was systematically higher than between a migrant group and any other tested population (supplementary fig. 3, Supplementary Material online), confirming the single and common origin of Roma groups (Gresham et al. 2001; Kalaydjieva et al. 2005; Martínez-Cruz et al. 2016). No significant differences in the allele sharing between migrant groups were observed, with North/Western Roma being the group that shared the least with the other groups due to their larger admixture with Europeans (supplementary fig. 2B and table 4, Supplementary Material online).
Identity-by-Descent between Roma and Other Populations
We estimated the identity-by-descent (IBD) segments shared between individuals from different areas and Roma people (Browning and Browning 2013a). For each comparison, we excluded all segments that overlap with segments in IBD between Roma and any individual of another population group, so that the comparison includes only the IBD segments uniquely shared between Roma and the area of interest.
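The filtering step described above can be sketched as an interval-overlap check. The example below is a simplified illustration under stated assumptions: segments are represented as hypothetical `(chromosome, start, end)` tuples with half-open coordinates, and the helper names are invented for this sketch rather than taken from the software used in the study.

```python
def overlaps(seg_a, seg_b):
    """True if two (chrom, start, end) half-open segments share any position."""
    return seg_a[0] == seg_b[0] and seg_a[1] < seg_b[2] and seg_b[1] < seg_a[2]


def uniquely_shared(target_segments, other_segments):
    """Keep segments shared with the area of interest that do not overlap
    any segment shared between Roma and another population group."""
    return [
        seg for seg in target_segments
        if not any(overlaps(seg, other) for other in other_segments)
    ]


def cumulative_length(segments):
    """Total length covered by a list of (chrom, start, end) segments."""
    return sum(end - start for _, start, end in segments)


# Toy segments (illustrative coordinates only).
roma_vs_europe = [("1", 100, 200), ("1", 500, 900), ("2", 100, 300)]
roma_vs_other_areas = [("1", 150, 250), ("2", 400, 500)]

unique = uniquely_shared(roma_vs_europe, roma_vs_other_areas)
unique_total = cumulative_length(unique)
```

Here the first Europe segment is discarded because it intersects a segment shared with another area, and the cumulative length is computed over the remaining two segments.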
We found that the IBD load between Roma and Europeans was consistently higher than the IBD load between Roma and any other population (fig. 3A, by migrant group in supplementary fig. 5, Supplementary Material online), with the exception of the Punjabi (PUN), supporting the strong, very recent gene flow between these two populations, also shown in the f3-statistic results (fig. 2). We found no differences that suggest a single or main European source of admixture in the Roma.

Identity-by-descent (IBD) and runs of homozygosity (RoH). (A) Average pairwise cumulative length of uniquely shared IBD segments between Roma individuals and the individuals from that specific population, excluding the segments that intersected with populations of other areas (EUR, in green; IND, in turquoise; ME, in brown; and PAK, in pink, supplementary table 2, Supplementary Material online). (B) Average pairwise cumulative length of segments in IBD uniquely shared between Roma and Europe, by migrant group. (C) Average pairwise cumulative length of segments in IBD uniquely shared between Roma and Indian populations, by migrant group. In (B) and (C), the letters on top indicate whether the distributions are not significantly different in a rank pairwise comparison: same letter means no significant difference. (D) Number and total length of runs of homozygosity (RoH) tracts >1 Mb, per population: ME* = ARM, EGY, IRN, IRQ, IRJ, BED, DRU, PAL, SAM, JOR, TUR; PAK* = BAL, BRA, BRU, MAK, HAZ, KAL, PAT, SIN. *Individuals from different populations were grouped together, which may decrease the average load and length of IBD fragments.
Comparing the unique IBD load between migrant groups and Europeans, we found that North/Western Roma shared significantly more IBD with Europeans than the other migrant groups. In agreement with the results of the D-statistics (supplementary table 4, Supplementary Material online), this result confirms the higher gene flow from Europeans to North/Western Roma (fig. 3B and supplementary fig. 2B and C, Supplementary Material online). We did not detect significant differences between Balkan and Romungro Roma. Although the only Roma individual we found to have full European ancestry was Vlax, Vlax Roma showed the lowest IBD sharing with Europeans (ROMAV in fig. 3B).
Among Northern India/Pakistan populations (fig. 3A and supplementary fig. 5, Supplementary Material online), we found the Punjabi (PUN) to be a clear outlier, having significantly more IBD (overall length) than the rest of the populations. The IBD sharing between Roma and Punjab was comparable with the IBD sharing between Roma and Europeans. This result confirms Punjab as the putative region of origin of Roma ancestors within the Indian subcontinent (Mendizabal et al. 2012; Font-Porterias et al. 2019). Moreover, we found no differences among the migrant groups either when comparing the IBD load between Roma and all Indian populations (fig. 3C) or between Roma and PUN (supplementary fig. 6A, Supplementary Material online), a footprint of the single founding event of the Roma population (Gresham et al. 2001; Kalaydjieva et al. 2005; Martínez-Cruz et al. 2016) that also excludes gene flow after the Diaspora between Roma and populations from the Indian subcontinent.
Finally, we compared the IBD load between Roma and populations from the Middle East (fig. 3A). The overall IBD load was higher than that of Northern India (except with PUN), in agreement with the occurrence of gene flow between Roma and populations from this area (Bánfai et al. 2019; Font-Porterias et al. 2019). The homogeneity of the IBD load between Middle Eastern populations and Roma migrant groups suggests no admixture between these two groups after Roma arrived in Europe (supplementary fig. 6B, Supplementary Material online).
Estimates of Endogamy
We evaluated the extent of Roma endogamy by analyzing the length and number of runs of homozygosity (RoH) within Roma individuals and the IBD segments between individuals of the same population and migrant group.
Roma showed a higher RoH number and total length than Europeans or Northern Indian populations, a clear signature of the founder effect and subsequent bottlenecks that Roma suffered in their demographic history (fig. 3D and supplementary fig. 7, Supplementary Material online). The only two groups that show higher RoH loads than the Roma were the South Indian populations Irula (ILA, tribal population) and Vellalar (VLR, Dravidian nontribal), who are known to present high levels of endogamy (fig. 3D and supplementary fig. 7, Supplementary Material online) (Juyal et al. 2014; Mondal et al. 2016). Within Roma, Balkan Roma (ROMAB) showed fewer and shorter RoH. In the analysis by length category, we found significant differences between ROMAB and the other migrant groups in long RoH (length >5 Mb), a possible signature of the additional bottlenecks that non-ROMAB experienced during their history, after the split from ROMAB (supplementary fig. 7, Supplementary Material online) (Mendizabal et al. 2012).
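The per-individual quantities compared here (number and total length of RoH tracts, overall and in the long class) can be tabulated as in the minimal sketch below. It assumes that RoH tracts have already been called and are given as lengths in Mb; the function name, the dictionary keys, and the toy lengths are hypothetical, while the 1 Mb and 5 Mb thresholds follow the text.

```python
def summarize_roh(lengths_mb, min_mb=1.0, long_mb=5.0):
    """Count and sum RoH tracts longer than min_mb, and separately
    those in the long class (longer than long_mb)."""
    kept = [length for length in lengths_mb if length > min_mb]
    long_tracts = [length for length in kept if length > long_mb]
    return {
        "n_tracts": len(kept),
        "total_mb": sum(kept),
        "n_long": len(long_tracts),
        "long_total_mb": sum(long_tracts),
    }


# Toy tract lengths in Mb for one individual (illustrative values only).
summary = summarize_roh([0.4, 1.2, 3.0, 6.5, 8.0])
```

Plotting `n_tracts` against `total_mb` per individual gives the kind of comparison shown in fig. 3D, and the long-class totals correspond to the length-category analysis.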
Additionally, we tested endogamy within migrant groups by evaluating the average cumulative length of the genome that was in IBD between two individuals from the same group (supplementary fig. 8, Supplementary Material online). We found the cumulative IBD length of the Roma population to be comparable to the pattern observed in PUN and in the endogamic populations of Southern India (ILA and VLR) (Mendizabal et al. 2011; Mondal et al. 2016; Nakatsuka et al. 2017). Among migrant groups, we did not detect any differences in the load of IBD between individuals of the same migrant group. Together with the differences we found in the RoH load of ROMAB, the same amount of IBD load in Roma is consistent with the very recent and weaker additional bottlenecks non-ROMAB have suffered (Mendizabal et al. 2012). The signature left on the genome by such recent, and weaker, bottlenecks can be detected only in the within-individual diversity (RoH), and not in the between-individuals diversity (IBD) (Severson et al. 2019). Moreover, the distribution of the total length of IBD fragments was not different from that of Roma people taken together or Southern Indian populations (supplementary fig. 8, Supplementary Material online).
Defining the Roma Demographic History
We finally analyzed the demographic history of Roma by testing different scenarios and using three main approaches: 1) assessing the changes in the effective population size from the IBD pattern; 2) analyzing the complex admixture history based on allele sharing between populations; and 3) estimating the best fitting demographic model to Roma people using an approximate Bayesian computation (ABC) approach (which also includes the two first approaches in its calculation).
Changes in the Effective Population Sizes throughout Generations
We analyzed the historical effective population size (Ne) of the last 200 generations inferred by the IBD pattern between individuals of the same population using IBDNe (Browning and Browning 2015) (supplementary fig. 9, Supplementary Material online). By analyzing all Roma groups together, we found that Roma Ne overlapped with the Ne of Northern Indian individuals until ∼125 generations ago, in agreement with the Roma origins in this region of the subcontinent. Starting ∼125 generations ago, Roma Ne suffered a constant reduction and started to differentiate from Northern Indian Ne. Northern and Southern Indian groups did not exhibit this dramatic reduction in population size. We found the minimum of Roma Ne to be 1,000 (871–1,160 95% CI), and it occurred between 1159 CE and 1333 CE (between 23 and 29 generations ago, assuming a generation time of 29 years). In recent times, Ne slightly increased and decreased again in the last generations before present. Overall, Roma Ne in the last 125 generations was lower than that of Northern Indians, and in the last 50 generations was also lower than that of the Southern Indian Dravidic groups (Juyal et al. 2014; Mondal et al. 2016) (supplementary fig. 9, Supplementary Material online). At the time of the out-of-India event, which was previously estimated to occur between 35 and 50 generations ago (Liégeois 1994; Fraser 1995; Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016), we found Ne to be 1,500 (1,380–1,640 95% CI) and 4,600 (3,850–6,120 95% CI), respectively.
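The calendar dates quoted for the Ne minimum follow from the generation counts by simple arithmetic. The sketch below reproduces them with the 29-year generation time stated in the text; the reference year of 2000 CE for "present" is an assumption of this sketch, chosen because it recovers the reported dates, and the function name is hypothetical.

```python
def generations_to_ce(generations_ago, generation_time=29, reference_year=2000):
    """Convert generations before present into a calendar year (CE).

    reference_year is an assumed value for "present", not stated in the text.
    """
    return reference_year - generations_ago * generation_time


recent_bound = generations_to_ce(23)  # 2000 - 23 * 29
older_bound = generations_to_ce(29)   # 2000 - 29 * 29
print(older_bound, recent_bound)      # 1159 1333
```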
Admixture History of Roma
The Roma West Eurasian component is the result of the recent gene flow on the Northern Indian background of proto-Roma, who carried Ancestral Northern Indian (ANI) and Ancestral Southern Indian (ASI) components, like any other Indian population (Reich et al. 2009; Moorjani, Patterson, et al. 2013; Pathak et al. 2018; Yelmen et al. 2019). We tested different complex admixture scenarios to investigate the strength of recent admixture between Roma and Europeans using the admixture graph approach (Patterson et al. 2012). Our analyses found two scenarios that fit our data (fig. 4 and supplementary fig. 10, Supplementary Material online): both have an Ancestral Northern Indian population (ANI) and an Ancestral Southern Indian population (ASI) that admixed into the ancestors of proto-Roma, confirming the admixed nature of this population, but the proportion of ANI/ASI differs between the two scenarios. In the first scenario, the ancestors of proto-Roma were the ancestors of the PUN population, which presents 51% of ANI (shown in fig. 4 and in agreement with Mendizabal et al. [2012] and Font-Porterias et al. [2019]). In the second scenario, the ancestors were an unknown Northern Indian-like population, with 67% of ANI (supplementary fig. 10, Supplementary Material online). In both scenarios, two admixture events from a West Eurasian source were necessary for the scenario not to be discarded: a first admixture event between ANI and ASI populations to form the ancestors of the proto-Roma (i.e., the ancestors of PUN or the unknown sister population, depending on the model), and a second admixture event between proto-Roma and Europeans. Since both ANI and Europeans are of West Eurasian origin, the inferred proportion of the subsequent European admixture depends on the ANI ancestry of the ancestral population of Roma: 67% in the first scenario (fig. 4) and 53% in the second (supplementary fig. 10, Supplementary Material online).

Admixture graph model for the Roma complex admixture scenario. The estimated Z score of this model, between observed and expected F statistics, was −1.444. Populations in capital letters, sampled populations; lowercase, unsampled populations. Straight arrows, drift; dashed arrows, admixture, with corresponding admixture proportions between the two populations.
A byproduct of the recent admixture with Europeans was the reduction of the ANI and ASI components of Roma ancestors in present-day Roma ancestry (Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013; Melegh et al. 2017). In our admixture graph analysis (fig. 4 and supplementary fig. 10, Supplementary Material online), the relative proportion of the ANI component depends on the population of origin of the Roma, but, in both models, the ASI component is ∼16%. Although we found that two scenarios fit our data, the first scenario with the ancestral PUN (a_pun) as a proxy population for Roma ancestors (fig. 4) seems most likely, since it considers a lower number of “ghost” populations and it is in agreement with our IBD analyses (fig. 3A) and previous data (Mendizabal et al. 2012; Font-Porterias et al. 2019).
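The near-identical ASI fraction in the two models can be checked with a back-of-the-envelope calculation using the proportions reported above. The sketch assumes, as a simplification that is not stated in the text, that proto-Roma ancestry is entirely ANI plus ASI and that European admixture dilutes both components proportionally; the function and dictionary names are hypothetical.

```python
def diluted_component(source_fraction, european_admixture):
    """Fraction of an ancestral component left after European admixture,
    assuming proportional dilution of the pre-admixture ancestry."""
    return (1.0 - european_admixture) * source_fraction


# ANI and European proportions as reported for the two admixture graph scenarios.
scenarios = {
    "a_pun": {"ani": 0.51, "european": 0.67},
    "unknown_north_indian": {"ani": 0.67, "european": 0.53},
}

asi_today = {
    name: diluted_component(1.0 - s["ani"], s["european"])
    for name, s in scenarios.items()
}
for name, value in asi_today.items():
    print(name, round(value, 2))
```

Under these assumptions both scenarios give a present-day ASI component of roughly 0.16, which matches the ∼16% reported for both models.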
Roma Demographic Model
We applied an ABC approach (Beaumont et al. 2002; Beaumont 2010) to discriminate between 12 demographic scenarios, which included two different branching structures and an increasing number of asymmetrical bidirectional gene flow events between Roma migrant groups and Europeans (supplementary fig. 11 and table 5, Supplementary Material online, for model’s parameters, supplementary note 2, Supplementary Material online). The highest posterior probability was obtained for the model with four migration events and the two-branch structure (2b4m, supplementary fig. 12, Supplementary Material online, average posterior probability = 0.566 using neuralnet, supplementary table 6, Supplementary Material online). This model involved a first split between migrant groups in the Balkan Peninsula (Balkan and Vlax common ancestors) and migrant groups outside of the Balkan Peninsula (Romungro and North/Western common ancestors), and two more recent splits, one between Romungro and North/Western Roma and the other between Balkan and Vlax Roma.
Among the parameters that could be accurately estimated (table 1 and supplementary fig. 13, Supplementary Material online, see also supplementary note 2, Supplementary Material online), our ABC analysis indicated that the split between Roma and their Indian ancestors occurred ∼1.6–2 ka (TbotBAIND, 95% CI 475–3,700 ya, assuming a generation time of 29 years, table 1 and supplementary fig. 14, Supplementary Material online). The bottleneck reduced the Ne to ∼44% of that of the ancestral population (95% CI 33–59%, reciprocal of bot1a), bringing the Ne of Roma ancestors down to 1,536 (N1, 95% CI 188–2,387, table 1 and supplementary fig. 14, Supplementary Material online).
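The relationship between bot1a, the ∼44% figure, and N1 can be verified numerically from the Neuralnet column of table 1. The sketch below reads the reported percentage as 1/bot1a and the founder size as NIND/bot1a, as the parameter definition N1=NIND/bot1a indicates; the function names are hypothetical. The point value obtained from the two posterior means is close to, but not identical with, the tabulated N1 of 1,536.

```python
def bottleneck_fraction(bot):
    """Fraction of the ancestral Ne remaining after the bottleneck (1 / bot)."""
    return 1.0 / bot


def founder_ne(n_ancestral, bot):
    """Founder effective size, N1 = NIND / bot1a."""
    return n_ancestral / bot


# Neuralnet posterior mean and 95% CI bounds of bot1a from table 1.
fraction_mean = bottleneck_fraction(2.28)
fraction_low = bottleneck_fraction(3.02)   # upper bound of bot1a gives the lower fraction
fraction_high = bottleneck_fraction(1.69)  # lower bound of bot1a gives the higher fraction

# Founder Ne from the posterior means of NIND and bot1a.
n1_from_means = founder_ne(3497, 2.28)

print(round(100 * fraction_mean), round(100 * fraction_low), round(100 * fraction_high))
```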
Table 1.
Parameter | Description | Neuralnet: Mean | Neuralnet: 2.5% CI | Neuralnet: 97.5% CI | Neuralnet_LogTransform: Mean | Neuralnet_LogTransform: 2.5% CI | Neuralnet_LogTransform: 97.5% CI
---|---|---|---|---|---|---|---
NBEA | Ne common ancestor of all Eurasians | 3,187 | 3,109 | 3,251 | 2,427 | 2,212 | 2,601
NIND | Ne Northern Indian population (Punjabi) | 3,497 | 317 | 7,207 | 2,027 | 1,590 | 2,506
N1=NIND/bot1a | Ne Roma founders at the out of India | 1,536 | 188 | 2,387 | 977 | 1,091 | 853
bot1a | Reciprocal of the bottleneck Roma had in the out of India | 2.28 | 1.69 | 3.02 | 2.07 | 1.46 | 2.94
TbotBAINDᵃ | Time of split from common ancestors of Roma and Northern Indian population | 2,126 | 475 | 3,760 | 1,632 | 1,103 | 2,099
TsplitEUINᵃ | Time of split from the common ancestors of Ancestral Northern Indian and Europeans | 31,901 | 23,821 | 40,087 | 11,656 | 8,478 | 14,675
TsplitEUISᵃ | Time of split from the common ancestors of Ancestral Southern Indians and Europeans | 31,509 | 28,583 | 37,273 | 38,813 | 35,717 | 44,392
Note: Mean and 95% CIs of the posterior distributions were estimated using neural network logistic estimation algorithm, with and without log transformation.
ᵃ Years, generation time = 29 years.
Ancestry of Described Mendelian Variants in Roma
The high level of endogamy evident in Roma genomes might have increased the probability of carrying deleterious mutations and risk alleles associated with diseases (Kalaydjieva et al. 2001). Among the 48 variants that have been associated with Mendelian diseases and previously described in Roma (summarized in supplementary table 7, Supplementary Material online), 11 were found in the Roma genomes analyzed here (supplementary table 8, Supplementary Material online). The local ancestry analysis of these variants, performed with RFMix (Maples et al. 2013), shows that nine variants had European ancestry, pointing to the major European component of the Mendelian load in present-day Roma. Only two variants show South Asian ancestry: variant rs121918355, associated with primary congenital glaucoma (Azmanov et al. 2011), and variant rs104894396, associated with deafness (Álvarez et al. 2005), which had mixed ancestry.
Discussion
The analysis of the demographic history of Roma people presents two main levels of complexity. First, the Roma population emerged very recently, as they left the Northwestern part of the Indian subcontinent ∼1–1.5 ka, 35–50 generations ago (Liégeois 1994; Fraser 1995; Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016). Second, despite being considered an isolated population, with extensive bottlenecks, splits, and endogamy, Roma are an admixed population, because during their Diaspora they experienced gene flow from non-Roma groups (Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016; Bánfai et al. 2019; Font-Porterias et al. 2019).
An outstanding question of Roma history concerns their origins before their Diaspora toward Europe from the Indian subcontinent. Tackling this question is complicated by the admixed nature of both Roma and the populations of the Indian subcontinent (Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013; Moorjani, Thangaraj, et al. 2013; Pathak et al. 2018; Font-Porterias et al. 2019; Tätte et al. 2019; Yelmen et al. 2019). Indian groups have been described as a mixture of two main ancestral components named ANI and ASI due to their higher frequencies in Northern and Southern India, respectively, although both components are found in most Indian subcontinent populations (Reich et al. 2009; Moorjani, Thangaraj, et al. 2013). The ANI component has a Western Eurasian origin and seems to have reached the Indian subcontinent in more recent times compared with the ASI component (Reich et al. 2009). The proportion of ANI and ASI components in the proto-Roma in our admixture graph analysis (fig. 4) confirms the Northwestern part of the Indian subcontinent to be the area of origin of Roma (Mendizabal et al. 2012; Melegh et al. 2017; Font-Porterias et al. 2019), and it is supported by our IBD and outgroup f3-statistic results (fig. 2 and supplementary fig. 3, Supplementary Material online and fig. 3A and supplementary fig. 5, Supplementary Material online). The populations from this area showed the highest f3-statistic and IBD load among the Indian and Pakistani populations. Among them, our results support the Punjabi region as the region of origin of the Roma Diaspora. Indeed, the PUN population showed the highest IBD load, presented one of the highest outgroup f3-statistic values for Northern Indian populations, and shared the most recent common ancestor with Roma in the best fit scenario in the admixture graph analysis.
Moreover, the Punjab as the region of origin of Roma is in agreement with previous linguistic data, because of the similarity between Romaní (the language of the Roma people) and the languages of that area, and with genetic studies that pointed to Punjab, despite the population heterogeneity found in the region (Fraser 1995; Gresham et al. 2001; Kalaydjieva et al. 2001; Mendizabal et al. 2011, 2012, 2013; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016; Melegh et al. 2017; Alfonso-Sánchez et al. 2018; Font-Porterias et al. 2019).
Our demographic analysis within an ABC framework, using the Punjabi as a proxy for Roma ancestors (NIND in supplementary fig. 12, Supplementary Material online), showed that the ancestors of the present-day Roma split from the Punjabi ∼1.6–2 ka (table 1 and supplementary fig. 14, Supplementary Material online, TbotBAIND). Historical records (Liégeois 1994; Fraser 1995) and previous genetic analyses (Mendizabal et al. 2012) pointed to slightly more recent times (1.2–1.5 ka), which still fall within our 95% CI, and the overlap increases when correcting for the generation time estimation (25 years in Mendizabal et al. [2012], 29 years in our analyses). Thus, our analyses suggest that the ancestors of the Roma and the Punjabi were already two differentiated populations within the Indian subcontinent, some generations before the Roma Diaspora started. The split of two populations from a common ancestor is rarely an instantaneous process (i.e., it does not occur in a single generation). Although it is likely to have been overestimated by pooling Northern Indian individuals from different populations and geographic areas, the time it took for the ancestors of the Roma and of the Northern Indians to split is suggested by the trend of the historical effective population size of the two populations across generations, which started to differentiate before the Roma Diaspora began (∼125 generations ago, supplementary fig. 9, Supplementary Material online).
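The generation-time correction mentioned above can be checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis; the function names are hypothetical, and the only inputs are the figures quoted in the text and in table 1 (25 and 29 years per generation, 1.2–1.5 ka, and a posterior mean of 2,126 years).

```python
# Illustrative arithmetic only; function names are hypothetical.
GEN_TIME_THIS_STUDY = 29  # years per generation (note to table 1)
GEN_TIME_MENDIZABAL = 25  # years per generation (Mendizabal et al. 2012)


def years_to_generations(years, generation_time):
    """Convert a date in years into a number of generations."""
    return years / generation_time


def rescale_years(years, old_generation_time, new_generation_time):
    """Re-express a date after swapping the assumed generation time."""
    return years / old_generation_time * new_generation_time


# Earlier estimate of 1.2-1.5 ka, re-expressed with 29 years per generation.
low = rescale_years(1200, GEN_TIME_MENDIZABAL, GEN_TIME_THIS_STUDY)
high = rescale_years(1500, GEN_TIME_MENDIZABAL, GEN_TIME_THIS_STUDY)
print(round(low), round(high))  # 1392 1740

# Posterior mean of the split time (Neuralnet column), in generations.
split_generations = years_to_generations(2126, GEN_TIME_THIS_STUDY)
print(round(split_generations, 1))  # 73.3
```

Under these assumptions, the earlier 1.2–1.5 ka range corresponds to roughly 1.4–1.7 ka at 29 years per generation, which moves it closer to the ∼1.6–2 ka estimate reported here.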
The complexity of the peopling history of India, where many questions remain to be answered (Reich et al. 2009; Moorjani, Thangaraj, et al. 2013; Pathak et al. 2018; Narasimhan et al. 2019; Tätte et al. 2019; Yelmen et al. 2019), further complicates the search for the area of origin of the Roma. Moreover, among the other South Asian populations, the outgroup f3-statistic results indicated that other Pakistani populations share a similar amount of alleles with Roma as the Punjabi do (fig. 2 and supplementary fig. 3, Supplementary Material online), but the IBD results show a lower amount of IBD sharing between Roma and these populations (supplementary fig. 6, Supplementary Material online). Since the geographical border between India and Pakistan was redefined in the last century, and the Punjabi region is shared between modern India and Pakistan (Melegh et al. 2017), we explored whether other Pakistani populations contributed to the Roma genomic landscape. We tested whether the high allele sharing between Roma and Pakistani populations (fig. 2) could be due to a common ancestor or to recent gene flow from non-Roma Europeans into both groups, which occurred after the Roma Diaspora (Hellenthal et al. 2014). In a further outgroup f3-statistics analysis (supplementary fig. 4, Supplementary Material online), we found non-Punjabi Pakistani populations to share more alleles, or a similar amount, with Europeans than with Roma. This observation could reflect a signature of the recent gene flow from Europeans to Pakistani groups, which would have produced the signal of allele sharing between Roma and Pakistani populations after Roma left India (Hellenthal et al. 2014).
The present genome analysis of the Roma suggests that, during the Diaspora, the route followed by their ancestors included the Armenian highlands and Anatolia, as shown in the f3-statistic and IBD analyses, in agreement with historical, linguistic (reviewed in Liégeois 1994; Fraser 1995; Hancock 1995), and genetic data (Bánfai et al. 2019; Font-Porterias et al. 2019). On the other hand, our data clearly rule out a North African origin of Roma groups or a route through North Africa, represented in our analysis by Egyptians, in contrast to previous historical hypotheses (Fraser 1995) and in agreement with previous genetic data (Gresham et al. 2001; Mendizabal et al. 2012; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016; Font-Porterias et al. 2019) (fig. 2 and supplementary fig. 3, Supplementary Material online). Nonetheless, the major genetic contribution to present-day Roma comes from the recent gene flow from European groups, making Roma genetically more similar to Europeans than to their Indian parental population (Font-Porterias et al. 2019; Dobon B, Horst R, Laayouni H, Mondal M, Bianco E, Comas D, Ioana M, Bosch E, Bertranpetit J, Netea MG, unpublished data). Indeed, the recent gene flow between Roma and other populations in Europe accounts, on average, for >50% of Roma genomes (fig. 4 and supplementary fig. 10, Supplementary Material online).
The strength of European gene flow is also shown in the amount of allele sharing and IBD fragments between Roma and the rest of European groups, which were higher than those between Roma and Northern Indian populations (figs. 3A and 4). Previous studies found the IBD load was higher between Roma and European groups from Eastern Europe (Moorjani, Patterson, et al. 2013; Melegh et al. 2017) with strong influence from the Balkan peninsula (Font-Porterias et al. 2019). In our analysis, we found Southwestern and Southeastern European groups to have provided more gene flow to Roma compared with other European groups (supplementary table 3, Supplementary Material online), but no differences were found in the outgroup f3-statistic (supplementary fig. 3, Supplementary Material online), whose power can be limited due to drift (Patterson et al. 2012). Despite being settled in different areas of the European continent, the signal of the admixture between Roma and Europeans is shared across migrant groups according to D-statistics analysis (supplementary table 3, Supplementary Material online). Besides the strength of the European gene flow, higher in North/Western Roma and lower in Vlax Roma, no consistent differences were found among Roma migrant groups (fig. 3B and supplementary table 4, Supplementary Material online), pointing to a recent common origin and lack of differentiation, possibly due to the short time span since the split of Roma groups. This result contrasts with some differentiation found in the analysis of some uniparental lineages (Gresham et al. 2001; Martínez-Cruz et al. 2016), which might be explained by the higher drift of uniparental genomes.
The strong admixture Roma experienced with European groups did not erase, however, the signature of the strong bottleneck undergone by the Roma at the beginning of their Diaspora, which represented ∼44% of the proto-Roma parental population (table 1), in agreement with previous data (Mendizabal et al. 2012). Despite the increase in effective population size (Ne) Roma had in the generations after the bottleneck (supplementary fig. 9, Supplementary Material online), which might be explained by the extensive gene flow from Europeans (Font-Porterias et al. 2019), the degree of genetic homogeneity in the Roma genomes is comparable to that of populations of Southern India who are known to follow cultural practices of marriages between close relatives (Mondal et al. 2016; Nakatsuka et al. 2017). At its lowest point, the Ne of Roma was even lower than the Ne of Southern Indian populations (supplementary fig. 9, Supplementary Material online), making Roma more prone to carry deleterious alleles at higher frequencies (supplementary table 8, Supplementary Material online) (Kalaydjieva et al. 2001). It is noteworthy that the Mendelian associated variants found in the present data set are located in genome tracts of European origin, pointing to the extensive very recent admixture with Europeans. In the few generations after the admixture event that introduced the deleterious variants, natural selection might have been less effective in removing these harmful alleles from the population, mainly because of the low effective population size of Roma (Mendizabal et al. 2013). Furthermore, the effects of the multiple bottlenecks and consanguinity of Roma left a signature on both the high number of long RoH in Roma (fig. 3E and supplementary fig. 7, Supplementary Material online) and on the long IBD chunks between Roma individuals (supplementary fig. 8, Supplementary Material online) (Mendizabal et al. 2011, 2012; Moorjani, Patterson, et al. 2013; Martínez-Cruz et al. 2016; Melegh et al. 2017; Font-Porterias et al. 2019). The effects of endogamy are stronger on RoH than on IBD (Severson et al. 2019), suggesting ROMAB as the migrant group that suffered fewer bottlenecks than the other migrant groups, in agreement with the Balkan peninsula being the cradle of Roma people currently living in Europe (Liégeois 1994; Fraser 1995; Gresham et al. 2001; Mendizabal et al. 2012; Martínez-Cruz et al. 2016).
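The ∼44% figure can be related to the bottleneck parameter in table 1. The sketch below assumes, following the row label N1=NIND/bot1, that bot1 is the factor by which the parental Ne is divided; the function name is hypothetical and the calculation only uses the posterior means reported in the table.

```python
def remaining_fraction(bottleneck_factor):
    """Fraction of the parental Ne left after division by the bottleneck factor.

    Assumes N1 = NIND / bot1, as the row label in table 1 suggests.
    """
    if bottleneck_factor <= 0:
        raise ValueError("bottleneck factor must be positive")
    return 1.0 / bottleneck_factor


# Posterior means of bot1 from table 1.
print(round(remaining_fraction(2.28), 2))  # 0.44 (Neuralnet)
print(round(remaining_fraction(2.07), 2))  # 0.48 (Neuralnet_LogTransform)
```

Dividing the posterior mean of NIND by the posterior mean of bot1 need not reproduce the tabulated posterior mean of N1 exactly, because the mean of a ratio generally differs from the ratio of the means.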
Our analysis of complete genomes sheds light on the Roma demographic history and their Diaspora from India to Europe. The small group of Roma ancestors that left Northwestern India were already a different population from their common ancestor with Punjabi when the Diaspora started. At the beginning of the Diaspora, proto-Roma underwent a strong bottleneck that reduced their effective population size to less than half their ancestral effective population size, but, along the route to Europe, Roma ancestors admixed with host populations of Middle Eastern highlands. In a few generations, 50% of the Roma ancestral component was replaced by recent European admixture, increasing their effective population size and thus compensating for their previous loss of genetic diversity. The signature of their South Asian origin is still present, despite the strong admixture, as well as the signature of the initial and following bottlenecks, reflected by the high long RoH and IBD load of Roma. It is common to consider Roma as an isolated population, with reduced or no genetic and cultural exchanges with their close neighbors, but our study showed that, at least in the recent past, Roma people have admixed at a high rate with non-Roma people all along their Diaspora route.
Materials and Methods
Samples
DNA samples were collected from 40 volunteers, self-defined as Roma, who belong to four main migrant groups (10 Balkan, 5 Vlax, 10 Romungro, and 15 North/Western Roma) and come from five countries (supplementary table 1, Supplementary Material online). All volunteers declared that their eight grandparents were self-defined as Roma. We tested for relatedness using Vcftools (Danecek et al. 2011) --relatedness and retained all the individuals who were less related than third degree cousins. Geographical positions of these samples are shown in figure 1A and detailed in supplementary table 1, Supplementary Material online. All samples were collected with informed consent from the participants under the approval of the IRB of the CEIC-Parc Salut Mar 2016/6723/I. Whole-genome sequencing data were obtained using Illumina HiSeqX sequencing platform, at average coverage of ∼30×. We merged our data with other samples of Roma (6 additional Romanian Vlax individuals) and non-Roma (109 individuals) origin (see supplementary table 2, Supplementary Material online). After quality control (FastQC, Babraham Bioinformatics, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), all the individuals were mapped against the human reference genome GRCh38 using BWA mem (version 0.7.15) (Li and Durbin 2009). After filtering according to the GATK Best Practices guide (DePristo et al. 2011), variant calling was performed using GATK (McKenna et al. 2010). To deal with differences in the average coverage of the samples, we applied an extra per-individual coverage filter using a home-made python script, retaining only those variants with coverage between half and twice the individual average coverage.
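The per-individual coverage filter can be illustrated as follows. This is a minimal sketch and not the authors' script, which is not reproduced in the article; the function names, the dictionary input, and the choice of inclusive bounds are all assumptions made for the example.

```python
def within_coverage_bounds(depth, mean_depth):
    """True if depth lies between half and twice the individual's mean depth.

    Inclusive bounds are an assumption; the article does not specify them.
    """
    return 0.5 * mean_depth <= depth <= 2.0 * mean_depth


def filter_sample_calls(depths_by_site):
    """Keep the sites of one individual whose depth passes the filter.

    depths_by_site: hypothetical mapping of site identifier -> read depth.
    """
    if not depths_by_site:
        return {}
    mean_depth = sum(depths_by_site.values()) / len(depths_by_site)
    return {
        site: depth
        for site, depth in depths_by_site.items()
        if within_coverage_bounds(depth, mean_depth)
    }


# Toy example: mean depth is 35, so the accepted range is 17.5 to 70.
example = {"chr1:100": 30, "chr1:200": 10, "chr1:300": 70, "chr1:400": 30}
kept = filter_sample_calls(example)
print(sorted(kept))  # ['chr1:100', 'chr1:300', 'chr1:400']
```

Computing the bounds from each individual's own mean, rather than from a cohort-wide mean, is what lets the same rule apply to samples sequenced at different depths.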
Population Structure
We used vcftools (Danecek et al. 2011) to extract biallelic SNPs. LD pruning (window size of five SNPs, step size of five SNPs, and variance inflation factor of 2) and missing data filtering were performed using plink 1.9 (Chang et al. 2015). For the analysis including the data of the 1000 Genomes Project (The 1000 Genomes Project Consortium et al. 2015), our data were lifted over to GRCh37 human reference genome using picardtools v1.139 LiftoverVcf (Picardtools, Broad Institute, http://broadinstitute.github.io/picard). To explore the relationship between populations, we used the EIGENSOFT smartpca (Patterson et al. 2006; Price et al. 2006) to run PCA. To explore Roma ancestry components, we ran the unsupervised clustering algorithm of ADMIXTURE (Alexander et al. 2009). All plots were produced using R (R Core Team 2018); maps were drawn using ggmap (Kahle and Wickham 2013).
Endogamy and Ne Estimation
We estimated RoH as in Serra-Vidal et al. (2019) using plink 1.9 (Chang et al. 2015), with the parameters --homozyg --homozyg-snp 50 --homozyg-kb 100 --homozyg-density 40 --homozyg-gap 2000. We calculated the cumulative RoH length and number of RoH fragments per individual, and compared the distribution within and between populations using R (R Core Team 2018).
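The per-individual RoH summaries were computed in R; the same aggregation can be sketched in Python. The example below is illustrative only: it assumes the whitespace-delimited layout of a plink 1.9 `.hom` file with `IID` and `KB` columns, uses made-up rows, and the function name is hypothetical.

```python
from collections import defaultdict


def summarize_roh(hom_lines):
    """Return {IID: (number of RoH segments, cumulative RoH length in kb)}.

    hom_lines: iterable of text lines, header first, with IID and KB columns
    (assumed to match the plink 1.9 .hom layout).
    """
    lines = iter(hom_lines)
    header = next(lines).split()
    iid_col, kb_col = header.index("IID"), header.index("KB")
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        entry = totals[fields[iid_col]]
        entry[0] += 1
        entry[1] += float(fields[kb_col])
    return {iid: (count, kb) for iid, (count, kb) in totals.items()}


# Made-up rows for two hypothetical individuals.
toy_hom = """FID IID PHE CHR SNP1 SNP2 POS1 POS2 KB NSNP DENSITY PHOM PHET
F1 R1 -9 1 rs1 rs2 1000 501000 500.0 120 4.167 0.99 0.01
F1 R1 -9 2 rs3 rs4 2000 252000 250.0 80 3.125 1.0 0.0
F2 R2 -9 1 rs5 rs6 5000 1205000 1200.0 300 4.0 0.98 0.02
""".splitlines()

print(summarize_roh(toy_hom))  # {'R1': (2, 750.0), 'R2': (1, 1200.0)}
```

Individuals with no called segment do not appear in the output and would need to be added with zero values before comparing distributions between populations.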
We identified IBD fragments between individuals, after removing all variants with >20% missing data (--geno 0.2) and without pruning for LD. To call IBD blocks, we used IBDSeq r1206 (Browning and Browning 2013b), with default values on the data lifted over to GRCh37, and using the genetic map provided by the authors on GRCh37 human genome reference. The distribution of the individual pairwise IBD total length was estimated and plotted with R (R Core Team 2018). We extracted IBD segments uniquely shared between two specific groups, excluding all those fragments that intersected between Roma and any other continental area (Europe, EUR; India, IND; Pakistan, PAK; and Middle East, ME) using bedtools options (Quinlan and Hall 2010).
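The exclusion of intersecting fragments can be sketched in plain Python. The article does not state which bedtools options were used, so this is only an illustration of the idea (similar in spirit to reporting intervals with no overlap); the tuple layout, the half-open coordinate convention, and the function names are assumptions.

```python
def overlaps(seg_a, seg_b):
    """True if two (chrom, start, end) half-open intervals share a position."""
    return (
        seg_a[0] == seg_b[0]
        and seg_a[1] < seg_b[2]
        and seg_b[1] < seg_a[2]
    )


def unique_segments(target_segments, other_segments):
    """Keep target segments that intersect none of the other segments."""
    return [
        seg
        for seg in target_segments
        if not any(overlaps(seg, other) for other in other_segments)
    ]


# Hypothetical segments shared between Roma and one group (target) and
# between Roma and the remaining continental areas (others).
target = [("1", 100, 200), ("1", 300, 400), ("2", 100, 200)]
others = [("1", 150, 250), ("2", 200, 300)]
print(unique_segments(target, others))
# [('1', 300, 400), ('2', 100, 200)]
```

The quadratic comparison is adequate for a toy input; genome-wide segment sets are better handled with sorted intervals or a dedicated tool.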
We estimated the changes in the effective population size (Ne) in the last 200 generations, with 95% CI, using the IBDseq r1206 (Browning and Browning 2013b) and IBDNe v. 19Sep19.268 (Browning and Browning 2015). We set the minimum IBD segment length to 2 cM, and the maximum number of estimated generations to 200. To increase the number of samples of Indian groups in historical Ne calculation, we merged all Northern Indian (BEN, PUN, UBR, RAJ) and all Southern Indian (ILA, VLR) samples despite being from different, unrelated populations.
Roma Admixture History
We modeled the admixture population history of Roma using the admixture graph algorithm implemented in qpGraph, Admixtools package (Patterson et al. 2012). We ran the admixture graph model analysis on our data set including YRI and CHB populations from the 1000 Genomes Project (The 1000 Genomes Project Consortium et al. 2015). For all scenarios tested, we first generated a skeleton graph that included ten randomly selected YRI as outgroup, EUR (our European individuals), ten randomly selected CHB, and populations from North (Punjabi, PUN) and South (Irula, ILA) India. We then added Roma and refined the graph in order to find a model that best fits our data. To fit the data, the model must be without outliers, so that none of the expected F statistics differ significantly from the observed F statistics (|Z score| < 3).
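The acceptance criterion for a graph can be written as a simple check. The sketch below assumes the Z scores of the differences between fitted and observed F statistics have already been collected into a list; the function names and input values are hypothetical and no qpGraph output format is implied.

```python
def graph_fits(z_scores, threshold=3.0):
    """True if every |Z| is below the threshold, i.e., the graph has no outliers."""
    return all(abs(z) < threshold for z in z_scores)


def worst_outlier(z_scores):
    """Return the Z score with the largest absolute value."""
    return max(z_scores, key=abs)


# Made-up Z scores for two candidate graphs.
print(graph_fits([0.5, -2.9, 1.2]))     # True
print(graph_fits([0.5, -3.4, 1.2]))     # False
print(worst_outlier([0.5, -3.4, 1.2]))  # -3.4
```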
Outgroup f3-Statistic and D-Statistic
To test the shared drift between Roma and the other populations from a common outgroup, we applied the three-population test (f3) using the same data set as in the admixture graph. We used qp3pop package in Admixtools (Patterson et al. 2012) in the form of outgroup f3-statistic: f3 (YRI; Roma, X), where YRI are ten randomly selected Yoruba individuals from the 1000 Genomes Project (The 1000 Genomes Project Consortium et al. 2015), X is any other population in the data set and Roma are the Roma as a single population or the four Roma migrant groups.
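The quantity behind the outgroup f3-statistic can be illustrated with population allele frequencies. This is a simplified sketch, not the Admixtools implementation: it averages (o − a)(o − b) over SNPs and omits the finite-sample correction and the block jackknife standard errors that the real software provides. The frequencies are made up and the function name is hypothetical.

```python
def outgroup_f3(outgroup_freqs, pop_a_freqs, pop_b_freqs):
    """Average of (o - a) * (o - b) across SNPs.

    Larger values indicate more drift shared by A and B since their split
    from the outgroup. No sampling correction is applied in this sketch.
    """
    if not (len(outgroup_freqs) == len(pop_a_freqs) == len(pop_b_freqs)):
        raise ValueError("frequency vectors must have the same length")
    if not outgroup_freqs:
        raise ValueError("at least one SNP is required")
    products = [
        (o - a) * (o - b)
        for o, a, b in zip(outgroup_freqs, pop_a_freqs, pop_b_freqs)
    ]
    return sum(products) / len(products)


# Made-up frequencies at three SNPs for f3(YRI; Roma, X).
yri = [0.1, 0.5, 0.9]
roma = [0.4, 0.5, 0.6]
pop_x = [0.5, 0.4, 0.7]
f3_value = outgroup_f3(yri, roma, pop_x)
print(round(f3_value, 4))
```

Ranking the populations X by this value is what the comparison across groups in figure 2 amounts to, under the simplifications stated above.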
D-statistics were calculated using the qpDstat package in Admixtools (Patterson et al. 2012) in the form D (E1, E2; CHB, R) or in the form D (E, CHB; R1, R2), where E stands for European group and R for Roma migrant group, and CHB are ten randomly selected Chinese from Beijing individuals from the 1000 Genomes Project (The 1000 Genomes Project Consortium et al. 2015).
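A frequency-based form of the D-statistic, as we understand the definition in Patterson et al. (2012), can be sketched as below. It is an illustration rather than the qpDstat implementation: standard errors and Z scores from the block jackknife are omitted, the frequencies are made up, and the function name is hypothetical.

```python
def d_statistic(w, x, y, z):
    """D(W, X; Y, Z) from per-SNP allele frequencies of four populations.

    Numerator:   sum of (w - x) * (y - z)
    Denominator: sum of (w + x - 2wx) * (y + z - 2yz)
    """
    if not (len(w) == len(x) == len(y) == len(z)):
        raise ValueError("frequency vectors must have the same length")
    numerator = sum(
        (wi - xi) * (yi - zi) for wi, xi, yi, zi in zip(w, x, y, z)
    )
    denominator = sum(
        (wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
        for wi, xi, yi, zi in zip(w, x, y, z)
    )
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator


# Made-up frequencies at two SNPs for D(E1, E2; CHB, R).
e1, e2 = [0.2, 0.6], [0.3, 0.5]
chb, roma_group = [0.1, 0.4], [0.5, 0.2]
d_value = d_statistic(e1, e2, chb, roma_group)
print(round(d_value, 3))
```

Swapping the first two populations flips the sign of D, so the direction of any excess allele sharing depends on the order in which the groups are listed.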
Ancestry Identification of Disease-Associated Mutations
We performed a literature search for SNPs related to diseases in Roma people using the following terms: “Roma,” “Gypsy,” “mendelian,” “disease,” “genetic,” and “metabolic syndrome.” The genomic position of the mutations was determined using the UCSC Genome Browser. We estimated the frequency of each mutation in Roma and non-Roma using vcftools (Danecek et al. 2011). To estimate the local ancestry around the 11 mutations associated with Mendelian diseases that we found in our Roma data set, we performed a local ancestry analysis. First, we phased the variants using Beagle version 09Feb16.2b7 (Browning and Browning 2013a), with default parameters. Then, we assigned local ancestry using RFMix v1.5.4 (Maples et al. 2013), with five EM iterations and the forward–backward threshold set to 0.8, on the whole chromosome, and extracted the information for the region around the mutation. The ancestry of the haplotypes was inferred setting Europeans and South Asians (supplementary table 2, Supplementary Material online) as source individuals and Roma as target individuals.
Roma Demographic Inference
We explored Roma demographic history using the Approximate Bayesian Computation approach (ABC) (Beaumont et al. 2002; Beaumont 2010). We tested 12 scenarios that varied in the branching structure within Roma (a series of subsequent bottlenecks or a first split followed by two further splits between migrant groups) (supplementary fig. 11, Supplementary Material online). We ran 100,000 simulations per model using fastSimCoal V2.6 (Excoffier and Foll 2011). Summary statistics (supplementary table 9, Supplementary Material online) were calculated using plink 1.9 (Chang et al. 2015), Admixtools (Patterson et al. 2012) and R (R Core Team 2018) on both simulated and observed data. Accuracy and parameter estimation were calculated in R using the approach implemented in the abc library from Csilléry et al. (2012). See supplementary note 2, Supplementary Material online, for extended Materials and Methods on the ABC approach.
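The logic of ABC can be conveyed with a toy rejection sampler. The sketch below is a deliberately simplified stand-in: it uses plain rejection rather than the neural network regression adjustment applied in this study, a one-line made-up simulator instead of coalescent simulations, and a single invented summary statistic. All names, the prior, and the noise level are assumptions.

```python
import random


def rejection_abc(observed, simulate, sample_prior, n_sims=2000, n_keep=100, seed=0):
    """Keep the n_keep prior draws whose simulated statistic is closest to observed."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        theta = sample_prior(rng)
        distance = abs(simulate(theta, rng) - observed)
        draws.append((distance, theta))
    draws.sort(key=lambda pair: pair[0])
    return [theta for _, theta in draws[:n_keep]]


def toy_prior(rng):
    """Hypothetical uniform prior on a bottleneck factor."""
    return rng.uniform(1.0, 5.0)


def toy_simulator(bottleneck_factor, rng):
    """Made-up summary statistic: remaining Ne fraction plus Gaussian noise."""
    return 1.0 / bottleneck_factor + rng.gauss(0.0, 0.02)


accepted = rejection_abc(0.44, toy_simulator, toy_prior)
posterior_mean = sum(accepted) / len(accepted)
print(len(accepted), round(posterior_mean, 2))
```

The accepted draws approximate the posterior distribution of the parameter; regression-based methods, such as the one used in the study, additionally adjust the accepted values for the residual distance between simulated and observed statistics.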
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
All data generated during the current study are available upon request at EGA repository under the accession number EGAS00001004287.
Acknowledgments
We would like to thank all the DNA donors and volunteers who made this study possible. We also thank Mònica Vallés for technical support. This work was supported by the Spanish Ministry of Economy and Competitiveness (Grant No. PID2019-106485GB-I00 and CGL2016-75389-P, MINEICO/FEDER, UE) and “Unidad de Excelencia María de Maeztu” (funded by AEI, CEX2018-000792-M) to D.C. and F.C.; and Agència de Gestió d’Ajuts Universitaris i de la Recerca (Generalitat de Catalunya, Grant No. 2017SGR00702). N.F.-P. was supported by a FPU17/03501 fellowship. All samples were collected with informed consent from the participants under the approval of the IRB of the CEIC-Parc Salut Mar 2016/6723/I. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the article.
Author Contributions
D.C. and E.B. conceived the study. E.B. and C.G.-F. did the preprocessing and quality control of the data. R.S.V. performed the local ancestry and diseases analysis. E.B. analyzed the data. B.D., E.S.S., V.K., H.M., H.P., M.G.N., and J.B. provided samples. G.L. designed and validated the ABC analysis. E.B. wrote the first draft of the article. N.F.-P., D.C., F.C., C.G.-F., and E.B. interpreted the results and provided comparative discussions. All authors contributed to the writing and editing of the final article. All the authors approved the final version of the article.
References
R Core Team.
The 1000 Genomes Project Consortium,