Simulation-Based Evaluation of Three Methods for Local Ancestry Deconvolution of Non-model Crop Species Genomes

Hybridizations between species and subspecies represented major steps in the history of many crop species. Such events generally lead to genomes with mosaic patterns of chromosomal segments of various origins that may be assessed by local ancestry inference methods. However, these methods have mainly been developed in the context of human population genetics, with implicit assumptions that may not always fit plant models. The purpose of this study was to evaluate the suitability of three state-of-the-art inference methods (SABER, ELAI and WINPOP) for local ancestry inference under scenarios that can be encountered in plant species. For this, we developed an R package to simulate genotyping data under such scenarios. The tested inference methods performed similarly well as long as representatives of source populations were available. As expected, the higher the level of differentiation between ancestral source populations and the lower the number of generations since admixture, the more accurate the results were. Interestingly, the accuracy of the methods was only marginally affected by i) the number of ancestries (up to six tested); ii) the sample design (i.e., unbalanced representation of source populations); and iii) the reproduction mode (e.g., selfing, vegetative propagation). If a source population was not represented in the data set, no bias was observed in inference accuracy for regions originating from represented sources, and regions from the missing source were assigned differently depending on the method. Overall, the selected ancestry inference methods may be used for crop plant analysis if all ancestral sources are known.

Over the past twenty years, the development of high-density genotyping and sequencing technologies has promoted the development of accurate approaches to infer the genetic ancestry of individuals from genotyping data. Historically, the first proposed methods aimed at characterizing individual ancestries on a genome-wide scale by estimating the relative contributions of a given number of underlying ancestries. The most popular of these methods are based on unsupervised clustering approaches, as introduced by Pritchard et al. (2000) in the Structure software, where clusters were interpreted as proxies for ancestries. An extension of this work by Falush et al. (2003) further made it possible to perform local ancestry inference (LAI), i.e., to infer the ancestral origin at a local chromosome scale in individual genomes. Since then, and as reviewed in Geza et al. (2019), more than 20 LAI methods have been published, extending this pioneering work in particular by scaling up to high-throughput genotyping data or by leveraging phased data for more accurate inferences.
Most LAI approaches have been developed in the context of human genetics studies, for which their properties have been extensively characterized (Liu et al. 2013; Padhukasahasram 2014; Hui et al. 2017; Geza et al. 2019). Human studies relying on LAI approaches usually aim at assessing admixture between two or three populations and may benefit from rich genetic resources (The International HapMap Consortium 2005; The Wellcome Trust Case Control Consortium 2007) with, in particular, dense haplotype data for reference populations and/or admixed samples. For plant species, large-scale sequencing or genotyping resources are also increasingly available for some crops such as rice (The 3,000 rice genomes project 2014) or barley (Milner et al. 2019). However, for many other species of interest such resources remain scarce, which implies that haplotype data may not yet be fully accessible, particularly for non-autogamous species. In addition, ancestries may be multiple, as exemplified by the cacao tree (Theobroma cacao) germplasm, which is composed of 10 major genetically differentiated groups with up to six-way admixed individuals (Cornejo et al. 2018), or pineapple (Ananas comosus), with up to four-way admixed individuals between cultivar groups and varieties (Chen et al. 2019). Moreover, populations representative of contributing ancestries may be unavailable or represented by only a few individuals. In this respect, the case of banana is particularly illustrative since hybridization events involving well-differentiated Musa acuminata subspecies are predicted to be involved in the formation of some major cultivars (Perrier et al. 2009, 2011). Yet, some of these subspecies are represented by only a few individuals (Christelová et al. 2017) and some contributors may not be represented in available germplasm (Sardos et al. 2016).
Moreover, it remains unclear how some features of the reproduction modes encountered in plant models (e.g., selfing or vegetative reproduction) may affect the performance of LAI methods. For instance, in fruit crops such as citrus or banana, individuals resulting from inter(sub)specific hybridizations were further multiplied by vegetative propagation (also termed clonal propagation). Thus, they do not form a population but rather a collection of individuals, sometimes of different origins, with ancestry mosaics of relatively large blocks depending on the number of sexual generations they may have undergone. Data sets of vegetatively propagated individuals may thus be heterogeneous in terms of ancestry structure and in terms of time in generations since admixture events. The number of sexual generations since admixture is a parameter that is often required by LAI programs (Geza et al. 2019), and it can be difficult to estimate correctly for plants that have been vegetatively propagated, sometimes for hundreds of years. On the other hand, high selfing rates result in increased levels of homozygosity and generally in reduced diversity levels compared to outcrossing species (Brandvain et al. 2013; Barrett et al. 2014), while introducing additional levels of structuring of haplotype diversity when selfing and outcrossing populations are analyzed together. Finally, polyploidy, which is a feature of many crop plants (e.g., wheat, sugarcane, potato, and major banana cultivars), is still a complex case to handle for LAI as genotypes are difficult to infer.
The purpose of this study was to evaluate the accuracy of LAI approaches for ancestry deconvolution, based on genotyping data simulated under scenarios that can be representative of plant species models. Given the lack of available methods dealing with polyploids, we only considered diploid individuals. Among the 22 LAI approaches recently reviewed by Geza et al. (2019), we chose to evaluate three methods, SABER (Tang et al. 2006), ELAI (Guan 2014) and WINPOP (Paşaniuc et al. 2009), because they do not require prior phasing of the data and can cope with more than two ancestries. We developed an R package to perform simulations and we focused our evaluation on the influence of i) the level of divergence between the source populations; ii) the number of generations since admixture for the admixed populations; iii) the number of contributing ancestries and their representation in the analyzed data sets; and iv) the mode of reproduction such as selfing (in a source representative population) or vegetative propagation (in the admixed population).

Simulation tool
We developed an R package (named plmgg for plant-like mosaic genome generator) to simulate individual chromosome-wide genotyping data from an arbitrary number of populations $P$ deriving from $S$ differentiated source populations under scenarios that may include hybridization events and modes of reproduction representative of plant model evolution (i.e., selfing or vegetative propagation). The simulation approach is depicted in Figure 1 and consisted of the three following successive steps: i) coalescent simulation of a sample of founder chromosomes from $S$ differentiated source populations; ii) forward-in-time simulation of $P$ populations deriving from the $S$ source populations with complex demographic scenarios involving various modes of reproduction and admixture; and iii) sampling of individuals from the $P$ populations to generate the genotyping data sets. For coalescent simulations, we relied on the scrm algorithm (Staab et al. 2015) implemented in the R package coala (Staab and Metzler 2016) to generate $n^{(h)}_S = \sum_{s=1}^{S} n^{(h)}_s$ founder chromosomes (where $n^{(h)}_s$ is the number of chromosomes from source population $s$). Sources were assumed to derive from a single ancestral population under a pure-drift model of divergence with a star-shaped history. The divergence scenario of the source populations was specified with three parameters: i) the divergence time $\tau$ measured in units of $4N_e$ (i.e., $\tau = t/(4N_e)$ where $t$ is the number of generations since the ancestral population and $N_e$ is the haploid effective source population size, assumed to be the same for the $S$ source populations); ii) the scaled mutation rate $\theta = 4N_e\mu$ (where $\mu$ is the mutation rate per site and per generation); and iii) the scaled recombination rate $\rho = 4N_e r$ (where $r$ is the recombination rate per site and per generation).
For the purpose of this study, $\tau$ was varied to control the level of differentiation among the source populations (see below) while both $\theta$ and $\rho$ were set equal to $10^{-4}$ (as obtained for instance if one assumes $\mu = 2.5 \times 10^{-8}$ mutation and $r = 2.5 \times 10^{-8}$ recombination per site and per generation in a population of haploid size $N_e = 10^{3}$).
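As a quick check of the arithmetic above, the following minimal Python sketch recomputes the scaled mutation and recombination rates from the per-generation values quoted in the text. The variable names and the helper `generations_since_split` are illustrative only and are not part of plmgg.

```python
# Scaled coalescent parameters from the per-generation rates quoted in the text.
N_e = 1_000   # haploid effective population size
mu = 2.5e-8   # mutation rate per site and per generation
r = 2.5e-8    # recombination rate per site and per generation

theta = 4 * N_e * mu   # scaled mutation rate
rho = 4 * N_e * r      # scaled recombination rate


def generations_since_split(tau, n_e=N_e):
    """Convert a divergence time expressed in units of 4*N_e into generations."""
    return tau * 4 * n_e


print(f"theta = {theta:.1e}, rho = {rho:.1e}")
print(f"tau = 0.20 -> {generations_since_split(0.20):.0f} generations")
```

Under these assumptions, a divergence time of 0.20 in coalescent units corresponds to about 800 generations for a haploid size of 1,000.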
In the second simulation step, $n_P = \sum_{p=1}^{P} n_p$ diploid individuals belonging to $P$ populations (where $n_p$ is the number of diploid individuals from population $p$) were first generated by randomly sampling two chromosomes (without recombination and with replacement) among the $n^{(h)}_S$ founder ones according to the $S$ pre-defined source contributions (in a $P \times S$ ancestry proportion matrix). The $n_P$ individuals were further reproduced over $G$ generations, in a forward-in-time process, by specifying for each generation $g$ (in a $P \times G$ matrix) the proportions of the four following possible population-specific modes of reproduction: i) within population random mating; ii) across population random mating; iii) selfing; and iv) vegetative reproduction (consisting of randomly reproducing an individual's chromosome pair from one generation to the next). For sexual reproduction events (i.e., random mating and selfing), parental gametes were generated by randomly distributing one crossing-over between the two parental chromosomes, which amounts to assuming a chromosome map length of 1 Morgan. In addition, mutations that only affected existing variant positions (switch to the alternate SNP allele) were introduced at each generation at the rate $\mu$ defined above (whatever the reproduction mode). In other words, no new segregating sites appeared after the initial coalescent phase of the simulation (first simulation step described above).
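The gamete formation rule described above (one crossing-over per meiosis) can be illustrated with a short Python sketch. This is not the plmgg implementation: the function names are hypothetical, chromosomes are represented as plain lists of per-SNP ancestry labels, and the breakpoint is drawn uniformly over SNP intervals rather than over physical positions, which is a simplifying assumption.

```python
import random


def make_gamete(chrom_a, chrom_b, rng=random):
    """Return one recombinant gamete built from two parental chromosomes.

    Chromosomes are lists of per-SNP labels of equal length (at least 2).
    Exactly one crossover is placed between two adjacent SNPs.
    """
    n = len(chrom_a)
    if n != len(chrom_b) or n < 2:
        raise ValueError("chromosomes must have the same length (>= 2)")
    # Choose at random which parental chromosome starts the gamete.
    if rng.random() < 0.5:
        first, second = chrom_a, chrom_b
    else:
        first, second = chrom_b, chrom_a
    # Breakpoint k: SNPs [0, k) come from `first`, SNPs [k, n) from `second`.
    k = rng.randint(1, n - 1)
    return first[:k] + second[k:]


def mate(parent1, parent2, rng=random):
    """Produce a diploid offspring; each parent is a pair of chromosomes."""
    return (make_gamete(*parent1, rng=rng), make_gamete(*parent2, rng=rng))


def self_fertilize(parent, rng=random):
    """Selfing: both gametes come from the same individual."""
    return mate(parent, parent, rng=rng)


def propagate_vegetatively(parent):
    """Vegetative reproduction: the chromosome pair is copied unchanged."""
    return (list(parent[0]), list(parent[1]))
```

For example, `mate((["A"] * 10, ["B"] * 10), (["C"] * 10, ["C"] * 10))` returns an offspring whose first chromosome carries a single A/B junction, while vegetative propagation leaves existing junctions untouched.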
In the third and last step of the simulation, $o_P = \sum_{p=1}^{P} o_p$ diploid individuals belonging to the $P$ populations were sampled to generate the data set to be analyzed (where $o_p$ represents the number of genotyped diploid individuals from population $p$ that are randomly sampled with replacement from the corresponding $n_p$ individuals available at generation $G$). After filtering out monomorphic SNPs, the simulation output consists of i) the genotyping data set in vcf (Danecek et al. 2011) and plink ped (Purcell et al. 2007) formats; ii) the true local ancestry for each individual at each SNP position, which may be displayed with plotting functions; and iii) summary statistics including pairwise population $F_{ST}$ (Weir and Cockerham 1984), population heterozygosities and ancestry block sizes.
Figure 1 Overview of the admixture simulation process with plmgg. The coalescent step produces $S$ source populations (here, three sources are represented in blue, red and yellow) that differentiated at $\tau$. In the forward step, source-representative populations and admixed populations are generated by sampling from the source populations. Then, each population follows, for a number of generations $g$, a user-defined reproduction process that allows reproduction modes to be selected and combined (within population random mating, across population random mating, selfing, vegetative reproduction). In the last step, each population of the forward step is sampled to generate a data set for analysis.
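The sampling and filtering step can be sketched as follows. This is an illustrative Python outline rather than the package code: the function names are hypothetical, and genotypes are assumed to be coded as alternate-allele counts (0, 1 or 2) per SNP.

```python
import random


def sample_individuals(population, n_sampled, rng=random):
    """Draw n_sampled individuals from a population, with replacement."""
    return [rng.choice(population) for _ in range(n_sampled)]


def polymorphic_snps(genotypes):
    """Return indices of SNPs that still segregate in the sampled data set.

    genotypes: list of individuals, each a list of alternate-allele
    counts (0, 1 or 2), one per SNP. A SNP is monomorphic when all
    sampled allele copies are identical.
    """
    n_ind = len(genotypes)
    n_snps = len(genotypes[0])
    kept = []
    for j in range(n_snps):
        alt_count = sum(ind[j] for ind in genotypes)
        if 0 < alt_count < 2 * n_ind:
            kept.append(j)
    return kept
```

Note that a SNP at which every sampled individual is heterozygous is kept, since both alleles are present in the sample.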

Simulated scenarios
Six scenarios detailed in Table 1, each replicated 50 times, were considered for this study. The number of founder chromosomes was set to $n^{(h)}_s = 300$ for each source population (thereby mimicking bottlenecks caused by the domestication process from a small number of wild relatives) and the number of diploid individuals was set to $n_p = 150$ for all the populations (i.e., the source and admixed populations). Forward simulations were run for $G = 50$ generations, maintaining $S$ non-admixed populations as source population proxies and two populations originating from an admixture event between three or more ancestries that occurred from $t_{adm} = 5$ to 50 generations ago. Unless otherwise stated, the sampled data set consisted of $o_p = 20$ diploid individuals for each ancestry representative population and $o_p = 40$ individuals for each admixed population. The scenarios were split into three groups to investigate the effect of i) the ancestry representative sample size; ii) the number of sources; and iii) the reproduction modes. First, the DiffGenSam scenarios (Table 1) aimed at evaluating the impact of the amount of differentiation between $S = 3$ source populations (with $\tau$ varying from 0.05 to 0.40); the number of generations since admixture for the two admixed populations (from $t_{adm} = 5$ to 50); and the sample size (from $o_p = 5$ to 40) of each of the three ancestry-representative populations. Five other scenarios were subsequently considered to address specific points while setting $\tau = 0.20$ and $t_{adm} = 50$ (Table 1). The SamBal scenario aimed at evaluating the impact of unbalanced sample sizes among three ancestry representative populations (i.e., two with 20 sampled individuals and the remaining one with 2 to 20 sampled individuals).
We also considered two scenarios to address the impact of the number of source populations (SrcNum with $S = 3$ to 6 source populations equally contributing to the admixed populations) or the presence of a non-sampled source population contributing to the admixed individuals (SrcMiss). In the latter case, $S = 4$ source populations were simulated, but only three of them had representatives in the final data set. The contribution of the "missing" source population to the admixed populations varied from 0.05 to 0.15, the three other sources having equal contributions. Finally, the SrcSelf and AdmxVegProp scenarios aimed at investigating the impact of alternative modes of reproduction. In the SrcSelf scenario, we assumed that one of the three source populations was reproducing with a selfing rate varying from 0 (i.e., no selfing) to 0.99. The AdmxVegProp scenario modeled 10 admixed populations (with $n_p = 100$ and $o_p = 10$ for each admixed population) that switched from exclusive within population random mating to exclusive vegetative propagation $t_{veg}$ generations ago, with $t_{veg}$ varying from 0 (i.e., no vegetative propagation) to 45 for the 10 populations. Note that the realized number of SNPs (after filtering steps) in the different simulated data sets ranged from $10^{4}$ to $2.7 \times 10^{4}$ (File S1).

LAI methods
As mentioned in the introduction, we retained the three LAI methods respectively implemented in the programs SABER, WINPOP and ELAI, which do not require prior phasing of the data and can cope with more than two ancestries. The two methods SABER (Tang et al. 2006) and ELAI (Guan 2014) both rely on hidden Markov models (HMM). More precisely, SABER (Tang et al. 2006) extended the HMM by Falush et al. (2003) to account for background LD existing in ancestral populations by modeling the joint distribution of alleles from consecutive markers within each ancestral population. In addition, SABER allows modeling an arbitrary number of ancestral groups that may admix at different times, estimated by a likelihood maximization algorithm (saberML function, here initialized with the simulated values). Individual SNP-specific ancestry estimates were calculated as the posterior probabilities obtained with the forward-backward algorithm implemented in the pipeline function.
ELAI (Guan 2014) implements a two-layer HMM to model two different scales of LD: the admixture LD (between alleles from different source populations) and a shorter-range LD existing between alleles within each source population. This is achieved by introducing a local structuring of haplotypes into i) upper-layer clusters that represent different groups (interpreted as source populations); and ii) lower-layer clusters that represent group-specific haplotypes. Here, we set the number of upper clusters to the number $S$ of simulated sources; the number of lower clusters to $5S$ as recommended; and the time since admixture (also required by ELAI) to the corresponding simulated one. Model fitting was carried out with the default Expectation-Maximization (EM) algorithm.
WINPOP, included in LAMP ≥2.3 (Sankararaman et al. 2008), is a model-based LD-free method that focuses on ancestry informative markers (AIM) to assign local haplotype blocks to their originating source populations (Paşaniuc et al. 2009). WINPOP works with variable-size overlapping windows along the chromosomes, and uses a clustering method to assign ancestries in each window, based on estimates of global ancestry proportions. WINPOP was used with the simulated recombination rate and default parameters for the configuration files, including an LD pruning cutoff of $r^2 = 0.1$ and a fraction of sliding window overlap of 20%.
Both WINPOP and SABER required estimates of global ancestry proportion. These were obtained by running the default unsupervised hierarchical clustering algorithm implemented in the ADMIXTURE software (Alexander et al. 2009) setting the number of clusters to S, the number of simulated source populations.

Evaluation of the performance of LAI methods
To evaluate the performance of the LAI methods, we defined an accuracy metric a to quantify the overall differences between simulated and inferred local ancestries. Let z  (Table 1) in which one source representative population was missing, SNP positions with the corresponding missing ancestry were excluded from the computation of a. According to our definition, a always lies between 0 and 1 (the higher the a value, the more accurate the inference). For calibration purposes, we also computed a minimal value of a as would be obtained by randomly inferred local ancestries under the assumption of equal contribution of the sources (i.e., setting $x^{(i)}_{s,m} = 1/S$ for all $s$). Alternative metrics, such as the coefficient of determination (i.e., squared sample correlation coefficient between the inferred and true local ancestries) or mean square errors, were also evaluated but are not presented since they led to the same conclusions regarding the ranking of LAI methods.
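The two alternative metrics mentioned above can be computed as in the following Python sketch. It does not reproduce the accuracy metric a used in this study; it only illustrates mean square error and the squared correlation between true and inferred ancestry dosages, where a "dosage" (an assumption of this example) is the number of chromosome copies, between 0 and 2, assigned to a given source at each SNP.

```python
def mean_square_error(truth, inferred):
    """Mean squared difference between true and inferred ancestry dosages."""
    if len(truth) != len(inferred) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - x) ** 2 for t, x in zip(truth, inferred)) / len(truth)


def squared_correlation(truth, inferred):
    """Squared Pearson correlation between true and inferred dosages."""
    if len(truth) != len(inferred) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    mean_t = sum(truth) / n
    mean_x = sum(inferred) / n
    cov = sum((t - mean_t) * (x - mean_x) for t, x in zip(truth, inferred))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_x = sum((x - mean_x) ** 2 for x in inferred)
    if var_t == 0 or var_x == 0:
        return float("nan")  # undefined when one series is constant
    return cov * cov / (var_t * var_x)


# Toy example: dosage of one source along five SNPs of one individual.
true_dosage = [2, 2, 1, 0, 0]
inferred_dosage = [1.9, 1.6, 1.1, 0.3, 0.0]
print(mean_square_error(true_dosage, inferred_dosage))
print(squared_correlation(true_dosage, inferred_dosage))
```

In practice such values would be averaged over individuals and sources; how that aggregation is done is left open here.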
We finally evaluated computational efficiency of the different LAI programs by recording for each run of analysis on our computer grid, both the memory usage and the system computing time (max_vmem and ru_wallclock, respectively) available from the Sun Grid Engine user notification.

Source differentiation and number of generations since admixture
The impact of the level of differentiation among the sources and the number of generations since admixture on the performance of the three LAI methods was assessed with the DiffGenSam scenarios (Table 1). The analysis of the generated data sets showed that both the level of differentiation among sources and the number of generations since admixture had a strong impact on the performance of LAI methods (Figure 2, File S2). Indeed, the accuracy a decreased with an increasing number of generations after admixture (i.e., when ancestry block sizes became smaller) and with decreasing levels of differentiation between source populations (Figure 2). Although the three evaluated LAI approaches performed similarly overall, at the lowest levels of differentiation ($\tau \leq 0.10$), ELAI and WINPOP were more accurate than SABER for more recent admixture events ($t_{adm} \leq 20$) (Figure 2, File S2). In the most favorable situations of high differentiation among the source populations (i.e., $\tau \geq 0.3$), the accuracy a tended toward 1 (i.e., no error) with decreasing time since admixture for all three LAI methods.

Number of individuals from the source representative populations
The impact of the number of sampled representative individuals for each of the three source populations was also evaluated within the DiffGenSam scenarios (Table 1). As shown in Figure 3, for a given time since admixture (here $t_{adm} = 50$, see Figure S1 and File S2 for alternative $t_{adm}$ values), decreasing the number of individuals representative of the source populations (e.g., from $o^{(s)}_p = 20$ as in Figure 2 to $o^{(s)}_p = 5$) had a higher impact on accuracy for ELAI compared to WINPOP and SABER. Conversely, except for the highest level of differentiation among source populations, increasing the number of source representative individuals improved ELAI performance.
Figure 2 Accuracy of LAI methods with varying levels of differentiation and number of generations (DiffGenSam simulation). The accuracy (a) of the LAI methods (y-axis) is plotted for different levels of differentiation that vary from 0.05 to 0.4 (vertical tiles) and a number of generations after admixture that varies from 5 to 50 (x-axis). The sample size is set to 20 for the sources and the admixed populations. Each dot is the mean value of 50 repetitions of each simulation. Error bars indicate the standard deviation. ELAI, WINPOP and SABER scores are plotted in blue, red and yellow, respectively. Accuracy of random inference (proportion of ancestry fixed at 1/3) is plotted in gray.
We evaluated the robustness of the three LAI methods to unbalanced sample sizes of source-representative populations by analyzing data sets simulated under the SamBal scenarios, where the number of samples was reduced for one of the three sources (Figure S2, File S3). The accuracy of ELAI was lower than that of both WINPOP and SABER when sampling was reduced for the third source (e.g., for 2 representatives instead of 20, accuracy of 0.720 for ELAI vs. 0.815 for WINPOP and 0.806 for SABER, File S3), but it increased when sampling was more balanced, reaching an accuracy of 0.870 for a completely balanced setting. For WINPOP and SABER the accuracy was only marginally improved, reaching up to 0.850.
Based on the results above, to allow better discrimination of the LAI methods in relatively challenging conditions, we chose to perform the remaining evaluations with a number of generations after admixture set to 50, a level of differentiation among sources of $\tau = 0.2$ and 20 individuals per source representative population.

Number of source populations and absence of source representative individuals
With the SrcNum scenarios (Table 1), data sets were simulated for admixture events involving up to six source populations. The analysis of LAI results showed that the accuracy decreased with increasing numbers of sources for all three evaluated LAI approaches (Table 2). However, the magnitude of the decrease in accuracy from $S = 3$ to $S = 6$ source populations remained moderate, with rates equal to 2.7%, 8.7% and 11% for ELAI, WINPOP and SABER, respectively (to be compared with the 45% decrease observed with the random inference) (Table 2). We further assessed the impact of the absence of individuals from one out of four source representative populations using data sets simulated under the SrcMiss scenario (Table 1). Different contributions of this unrepresented source to admixed populations were tested (5, 10 and 15%) and accuracy was measured by excluding regions contributed by the missing source population. As shown in Table 3, the accuracy for all methods was stable in regions without the unknown ancestry, whatever the global proportions (at a data set level) of unknown ancestry. This suggested that the absence of individuals from a source representative population in the analyzed data sets did not introduce biases in inferring local ancestries of the represented source populations. Visual inspection of local ancestries inferred in regions containing the missing ancestry did not reveal any particular pattern (e.g., a higher switching rate among the other represented ancestries). As an example, Figure 4 shows the inferred local ancestry mosaic of one individual from a simulated data set with a 10% contribution of the unrepresented source population. In general, the chromosomal regions originating from the missing source population tended to be assigned to different represented ancestries, the assignment also varying according to the LAI method used.
Figure 3 Accuracy of LAI methods with varying levels of differentiation and source-representative sample size (DiffGenSam simulation). The accuracy (a) of the LAI methods (y-axis) is plotted for different levels of differentiation that vary from 0.05 to 0.4 (vertical tiles) and the size of source-representative sample that varies from 5 to 40 individuals (x-axis). The source sample size is set to 20. Each dot is the mean value of 50 repetitions of each simulation. Error bars indicate the standard deviation. ELAI, WINPOP and SABER scores are plotted in blue, red and yellow, respectively. Accuracy of random inference (proportion of ancestry fixed at 1/3) is plotted in gray.

Selfing and vegetative propagation
Table 4 gives the accuracy of the different LAI approaches on data sets simulated under the SrcSelf simulation (Table 1), in which the third source representative population reproduced with a varying extent of selfing. For the three LAI approaches, increased proportions of selfing in the third source representative population resulted in a small decrease in accuracy. Indeed, the decrease in accuracy between selfing rates of 0 and 99% was equal to 0.92%, 6.6% and 3.5% for ELAI, WINPOP and SABER, respectively. Figure 5 plots the accuracies of LAI approaches estimated on data sets simulated under the AdmxVegProp scenarios (Table 1), consisting of individuals from three source representative populations and 10 admixed populations that switched to an exclusive vegetative propagation mode $t_{veg}$ generations ago ($t_{veg}$ varying from 0 to 45 for the different populations). Note that the larger $t_{veg}$, the larger the ancestry block sizes (since fewer post-admixture recombinations have occurred). For both WINPOP and ELAI based inference, the accuracy increased with increasing values of $t_{veg}$, as expected given larger ancestry block sizes. However, the accuracy of SABER, being very similar for individuals with $t_{veg} = 45$ and $t_{veg} = 0$, was mostly not influenced by $t_{veg}$, although a slight decrease was observed at $t_{veg} = 30$. As this decrease appeared for higher numbers of generations of vegetative propagation, it may be linked to the fact that SABER performs its own estimation of time since admixture. To investigate this, a second run of SABER was performed without using the time since admixture estimation method (saberML function), but with a time since admixture fixed at $t_{adm} = 50$ as for WINPOP and ELAI (Figure 5). SABER accuracy was higher with this fixed number of generations but a decrease at $t_{veg} = 30$ was still observed.

Computational performances of LAI methods
Computational performance was measured for all the analyses performed on the simulated data sets. For the DiffGenSam scenario ($S = 3$ sources), memory consumption for the different methods ranged from 0.5 GB to 2 GB of RAM and was not highly variable across scenario variations (Figure S3). WINPOP was the fastest of the three LAI methods, with a mean running time ranging from 20 s to 60 s in the DiffGenSam data sets (with three source representative populations), while SABER runs lasted between 30 min and 60 min and ELAI runs between 50 min and 4 h (Figure S4). The analysis of data sets simulated under the SrcNum scenarios showed that the number of sources had the most significant impact on resource consumption (Figure 6), particularly for ELAI, which used up to 10 GB and 30 h with $S = 6$ sources. This corresponded to a 20-fold increase in memory and a 38-fold increase in computing time as compared with $S = 3$ sources (Figure 6), whereas the overall number of individuals (70 vs. 55) and the number of SNPs remained similar. Although memory usage increased steadily for WINPOP and SABER (from 0.7 GB to 3.75 GB), the computing time remained low for WINPOP (20 s to 3 min 20 s) and intermediate (up to 5 h) for SABER.
Table 3 Mean accuracy a, accuracy confidence interval (0.95) and accuracy standard deviation (sd) of ELAI, WINPOP and SABER on simulated data with different percentages of a fourth source population participating in the admixture event are indicated. Accuracy was computed after removal of the unknown population segments in the admixed individuals, to measure the impact on well represented segments. Simulations were conducted with 50 repetitions, $\tau = 0.2$, 50 generations after admixture and 20 individuals sampled from each population. Random inference ($1/S$ for each ancestry) was evaluated like LAI methods.

DISCUSSION
The approaches evaluated in this study (implemented in the SABER, WINPOP and ELAI programs) were mostly developed for applications in human populations. The purpose of our study was to carry out a detailed evaluation of the accuracy of these three LAI approaches on data simulated under scenarios with features that may be encountered in studies of plant domestication or diversification involving admixture. For instance, the three methods we considered here were originally tested on data simulated by resampling haplotypes from two to three human populations in scenarios consisting of two-way or three-way admixture with up to a few tens generations post-admixture and including from 100 to 200 genotyped individuals per source representative populations in the analysis (Tang et al. 2006;Pas xaniuc et al. 2009;Guan 2014).
We developed an R package (plmgg) to simulate genotyping data under a wider range of scenarios and sample designs that include plantlike features. Even if this simulator has some limitations (it does not simulate recombination hotspots, multiple recombination per chromosomes nor selection), it allowed us to assess the influence on LAI accuracy of the level of differentiation, of multiway admixture with up to six ancestries and of limited sampling of source populations. In addition, the impact of two plant reproduction modes was also evaluated: selfing (in a source representative population) and vegetative propagation (in the admixed population).
Overall, the two main factors that contributed to improve accuracy of all the three tested LAI approaches were the level of divergence between source populations (the higher, the better) and the number of generations since admixture (the smaller, the better) which was not surprising given their expected influence on the complexity of genome mosaics. Indeed, due to both mutations and recombination, divergence between source populations leads to increased differences among their originating haplotypes that facilitates their discrimination. Similarly, increasing the number of generations since admixture, results in shorter ancestral chromosome segment tracks, which are then more difficult to identify. However, it should be noticed that in scenarios with the most extreme level of differentiation among the source populations we considered here (t ¼ 0:4 which corresponds to a F ST ≃1 2 e 2t ≃0:33 in the pure-drift model of divergence we simulated), LAI accuracy remained acceptable even for the oldest admixture events (50 generations since admixture). In Citrus, average F ST values of 0.44 up to 0.85 were found between the four ancestral taxa depending on studies or marker types (Curk et al. 2015(Curk et al. , 2016. In the cacao tree or in pineapple, pairwise F ST ranges between genetic groups were of 0.16 to 0.65 (Cornejo et al. 2018) and 0.28 to 0.94 (Chen et al. 2019), respectively. The lowest part of these ranges are covered in our simulations and higher values of F ST will actually facilitate LAI even with older admixture events. For closely related source populations, LAI approaches only performed well if admixture events were very recent (i.e., below 10 generations). The three methods tested behaved roughly similarly, although WINPOP tended to be superior when source populations were more closely related whereas for more differentiated sources and between 20 and 50 generations after admixture, ELAI tended to be more accurate. 
This result is consistent with the WINPOP paper (Paşaniuc et al. 2009), which showed that WINPOP performed well with closely related populations thanks to its improved modeling of recombination and an adaptive window length that takes into account local genetic distances between ancestral populations. As for ELAI, its two-layer HMM helps resolve the short ancestry segments that can result from an increasing number of generations after admixture (Guan 2014). In practice, differentiation among the source populations may be estimated from the genotyping data available for the source-representative individuals, even when few individuals are available (Willing et al. 2012).
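As a quick numerical check of the correspondence between the drift parameter t and F_ST quoted above, the following minimal Python sketch evaluates the pure-drift approximation F_ST ≈ 1 - exp(-t) and its inverse. The function names are hypothetical and are not part of the plmgg package; the sketch only assumes that t is expressed in the same drift units as in the simulations.

```python
import math


def fst_from_drift_time(t: float) -> float:
    """Approximate F_ST reached after a drift time t under pure drift.

    Uses the approximation F_ST ~ 1 - exp(-t) quoted in the text.
    """
    if t < 0:
        raise ValueError("drift time t must be non-negative")
    return 1.0 - math.exp(-t)


def drift_time_from_fst(fst: float) -> float:
    """Invert the same approximation: t ~ -ln(1 - F_ST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("F_ST must lie in [0, 1)")
    return -math.log(1.0 - fst)


if __name__ == "__main__":
    # Differentiation levels used in the simulations discussed here.
    for t in (0.2, 0.4):
        print(f"t = {t:.1f} -> F_ST ~ {fst_from_drift_time(t):.2f}")
```

The inverse function can be used to place an empirical F_ST estimate on the same scale as the simulated t values, with the caveat that this mapping holds only under the pure-drift approximation.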
Table. Mean accuracy a, accuracy confidence interval (0.95) and accuracy standard deviation (sd) of ELAI, WINPOP and SABER on simulated data with varying selfing proportion in the third source-representative population. Simulations were conducted with 50 repetitions, $t = 0.2$, 50 generations after admixture and 20 individuals sampled from each population. Random inference ($1/S$ for each ancestry) was evaluated like the LAI methods.
Figure 5. Accuracy of LAI methods with varying number of generations of vegetative propagation (AdmxVegProp simulation). The accuracy of the LAI methods (y-axis) is plotted for different numbers of generations of vegetative propagation after admixture ($t_{veg}$), varying from 0 to 45 (x-axis). The source sample size is set to 20, the differentiation to 0.2 and the total number of generations after the admixture event to 50. Each dot is the mean value of 50 repetitions of each simulation. Error bars indicate the standard deviation. ELAI, WINPOP and SABER scores are plotted in blue, red and yellow, respectively. The SABER score with a fixed number of generations after admixture is plotted in darker yellow. Accuracy of random inference (proportion of ancestry fixed at 1/3) is plotted in gray.
The timing of admixture events, required by both ELAI and WINPOP, may also be a difficult parameter to provide in practice, especially for populations reproducing by vegetative propagation. Also, as we fixed this parameter to its true simulated value when running the ELAI and WINPOP programs, our evaluation of these two methods may be overly optimistic. Yet, results obtained on the AdmxVegProp scenarios, which include several generations of vegetative propagation, suggest that both ELAI and WINPOP remain robust to (at least) upwardly biased estimates of the timing of admixture. In practice, however, it may be valuable to check the sensitivity of the results obtained with these methods to a biologically sound range of (exponentially) varying values for this parameter. On the other hand, the timing of admixture events may also be estimated as proposed in the SABER framework. We nevertheless observed that in our settings the SABER estimates were inaccurate (see Figure S5), which suggests in turn that LAI relying on SABER is also robust to biased estimates of the timing of admixture events. Other approaches may thus be preferable to that end, for example those modeling LD decay on a whole-genome basis, provided sampling allows it (e.g., Loh et al. 2013). Recently, Chen et al. (2019) estimated an average of 37 generations since the onset of admixture events for 22 (primarily) vegetatively propagated pineapple (var. comosus) hybrids, with a range of 21-55 generations.
Interestingly, we found that selfing (in a source-representative population) or vegetative propagation (in the admixed population) had only a small impact on inference accuracy. Selfing in a source population is of particular interest for banana, as one of the M. acuminata subspecies contributing to banana hybrids is predicted to be frequently self-pollinated (Simmonds 1962). Reproduction by vegetative propagation is favored for many fruit tree crops (Miller and Gross 2011). Depending on the number of generations of sexual reproduction after admixture, vegetative propagation of admixed individuals can result in different levels of fragmentation of the mosaic structures. As mentioned above, this type of setting, with an overestimation of the generation number parameter, had a minor impact on both ELAI and WINPOP, but a more notable impact on SABER inference for individuals where the overestimation was the highest.
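The sensitivity check suggested above for the admixture-timing parameter could be organized as in the following Python sketch, which builds a geometrically (exponentially) spaced grid of candidate generation numbers and measures how much two sets of ancestry calls agree. Both function names and the bounds 5 and 80 are illustrative assumptions; actually running ELAI or WINPOP for each grid value is left out, since it depends on each program's own interface.

```python
from typing import List, Sequence


def generation_grid(low: int, high: int, n_values: int) -> List[int]:
    """Return distinct integers geometrically spaced between low and high.

    Hypothetical helper: each value is a candidate "generations since
    admixture" setting to pass to an LAI program.
    """
    if low < 1 or high <= low or n_values < 2:
        raise ValueError("need 1 <= low < high and n_values >= 2")
    ratio = (high / low) ** (1.0 / (n_values - 1))
    # Rounding can create duplicates for narrow ranges, hence the set.
    return sorted({round(low * ratio ** i) for i in range(n_values)})


def concordance(calls_a: Sequence[int], calls_b: Sequence[int]) -> float:
    """Fraction of markers given the same ancestry label in two runs.

    A simple agreement measure chosen for illustration; it is not the
    accuracy criterion used in the study.
    """
    if len(calls_a) != len(calls_b) or len(calls_a) == 0:
        raise ValueError("call vectors must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(calls_a, calls_b))
    return same / len(calls_a)


if __name__ == "__main__":
    # Illustrative bounds only; choose a biologically sound range in practice.
    print(generation_grid(5, 80, 5))
```

Comparing the ancestry calls obtained at neighbouring grid values with `concordance` gives a rough indication of how strongly the inference depends on the assumed timing.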
Increasing the number of source populations (up to six tested) only marginally affected the accuracy of the tested LAI methods, particularly for ELAI. Nevertheless, it also increased the computational burden, which became substantial for the ELAI program, presumably due to the higher number of model parameters. Hui et al. (2017) developed a tool (LAIT) to run four LAI methods, including WINPOP and ELAI, on a data set. They used LAIT to compare LAI methods on two-way and three-way admixture, and showed that ELAI performed better than WINPOP at the cost of increased resource consumption, which is consistent with our results.
Our results also showed that LAI methods perform similarly well for moderate to high levels of differentiation among source populations, even when the number of source-representative individuals is small. This may have favorable practical consequences, as it is not always possible to access large numbers of source representatives. Yet the three methods behaved differently given an unbalanced data set, with a minor impact on SABER and WINPOP compared to ELAI. This may be explained by the two-layer model of ELAI, which ties haplotype structure to ancestries, so that clustering is hindered by low haplotypic variability. More generally, and in practice, assessing the number of source populations and assigning individuals to them might not be an easy task. Unsupervised clustering approaches (Pritchard et al. 2000; Alexander et al. 2009; Frichot et al. 2014) might be viewed as a reference choice (Stift et al. 2019) provided the source populations are differentiated enough and evenly represented in the data set (Puechmaille 2016). The Chromopainter method (Lawson et al. 2012) allows ancestry sources to be determined without individuals assigned as source representatives, provided that phased data are available.
A critical issue regarding LAI performance was the absence of representative individuals for a given source. The results obtained on the SrcMiss simulations showed no particular bias in attributing the missing population to known ancestries. This result may come from the fact that, in our simulations, the population tree relating the four sources is star-shaped. In practice, a star-shaped tree is uncommon: one known population may be closely related to the missing population, and bias cannot be excluded in this case. Some empirical and specific sampling procedures have been proposed to circumvent the absence of source representatives in the case of large proportions of unrepresented ancestry in admixed populations (Zhou et al. 2016). Recently, a promising and more generic alternative has been developed in the MOSAIC model of Salter-Townshend and Myers (2019) for haplotype data, which allows information on source populations to be extracted from related (and possibly admixed) individuals. Yet phased data, which we purposely kept out of consideration, may not be accessible for many crop species. Moreover, it has been shown that switch errors that can occur with statistical phasing (Scheet and Stephens 2006; Browning and Browning 2011) reduce LAI accuracy (e.g., Guan 2014). However, haplotype-based LAI approaches that include switch error modeling, such as RFMix (Maples et al. 2013), LOTER (Dias-Alves et al. 2018) and MOSAIC (Salter-Townshend and Myers 2019), have demonstrated that, if properly modeled, inaccurate phasing is becoming less of a threat to LAI accuracy.
LAI on phased data may also be particularly well suited to deal with polyploidy, as ploidy is highly variable in crop species (e.g., pineapple 2x, cacao tree 2x, banana 2x and 3x, citrus up to 4x, sugarcane up to 12x), although statistical phasing might be challenging. Alternatively, HMM-based methods such as those proposed by Corbett-Detig and Nielsen (2017) for Pool-Seq data may also be of value.
The evaluation of the accuracy and performance of LAI methods with the plmgg R package showed that LAI methods are usable in crop genetics, with caution, particularly when a source population is missing. WINPOP seems suited to cases where source populations are closely related and admixture events are recent. ELAI could be particularly well adapted to well-differentiated and relatively well-represented sources, to selfing in source populations, to vegetative propagation settings, and to multiway admixture, although for the latter computational performance might be a limiting factor. Other parameters more specific to different plant/crop models might be evaluated using the plmgg package.