Madeiran Arabidopsis thaliana Reveals Ancient Long-Range Colonization and Clarifies Demography in Eurasia

Abstract
The study of model organisms on islands may shed light on rare long-range dispersal events, uncover signatures of local evolutionary processes, and inform demographic inference on the mainland. Here, we sequenced the genomes of Arabidopsis thaliana samples from the oceanic island of Madeira. These samples include the most diverged worldwide, likely a result of long isolation on the island. We infer that colonization of Madeira happened between 70 and 85 ka, consistent with a propagule dispersal model (of size ≥10), or with an ecological window of opportunity. This represents a clear example of a natural long-range dispersal event in A. thaliana. Long-term effective population size on the island, rather than the founder effect, had the greatest impact on levels of diversity and rates of coalescence. Our results uncover a selective sweep signature on the ancestral haplotype of a known translocation in Eurasia, as well as the possible importance of low phosphorus availability in volcanic soils, and of altitude, in shaping early adaptations to island conditions. Madeiran genomes, sheltered from the complexities of continental demography, help illuminate ancient demographic events in Eurasia. Our data support a model in which two separate lineages of A. thaliana, one originating in Africa and the other in the Caucasus, expanded and met in Iberia, resulting in a secondary contact zone there. Although previous studies inferred that the westward expansion of A. thaliana coincided with the spread of human agriculture, our results suggest that it happened much earlier (20–40 ka).


Supplementary methods
δaδi
To test our demographic inference, we used the composite-likelihood approach implemented in δaδi (Gutenkunst et al. 2009). To mitigate the confounding effect of sampling and population structure on the joint site frequency spectrum of the Madeiran clade, we eliminated nearly identical samples (one from each pair whose pairwise differences per base pair were 10³ times lower than the average comparison within Madeira), as well as single samples collected in isolated parts of the island. These two groups accounted for the majority of the signal of excess doubletons and excess singletons, respectively. We based the analyses on the joint site frequency spectrum computed on intergenic sites only, assuming that these evolve mostly neutrally. We replicated the analyses 200 times independently for each demographic model, each time with different, randomly chosen starting values for each parameter within predefined ranges, and with a maximum of 50 iterations.
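The removal of nearly identical samples described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the toy sequences, the choice of which member of a pair to drop, and the use of the mean over all pairs as the "average comparison" are assumptions made for the example.

```python
from itertools import combinations


def pairwise_diff_per_bp(a, b):
    """Fraction of sites at which two equal-length haploid sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def drop_near_identical(samples, fold=1e3):
    """Return sample names after removing one member of each near-identical pair.

    samples: dict mapping sample name -> sequence (hypothetical input format).
    A pair counts as near-identical when its distance is at least `fold`
    times lower than the mean pairwise distance (assumed definition).
    """
    names = sorted(samples)
    if len(names) < 2:
        return names
    dist = {
        (p, q): pairwise_diff_per_bp(samples[p], samples[q])
        for p, q in combinations(names, 2)
    }
    threshold = (sum(dist.values()) / len(dist)) / fold
    dropped = set()
    for (p, q), d in sorted(dist.items()):
        # Drop the second member of the pair, unless one was already removed.
        if d <= threshold and p not in dropped and q not in dropped:
            dropped.add(q)
    return [n for n in names if n not in dropped]


# Toy data (not from the study): A and B are identical, C is distinct.
toy = {"A": "ACGTACGT", "B": "ACGTACGT", "C": "TTGTACAA"}
kept = drop_near_identical(toy)
```

With the toy input, `kept` retains one representative of the identical pair together with the distinct sample.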
We modelled a number of possible demographic histories of increasing complexity. In all models the Madeiran clade splits from the Iberian relicts at some time T_split, followed either by constant population sizes within demes (the "Simple split" model in supplementary table S2) or by exponential changes in N_e(t) (the "Exp.growth" model). We also tested the possibility of a colonisation bottleneck in Madeira, either constrained to coincide with the split (the "Bot.split" model) or allowed at any time between the split and the present (the "Bot.free" model). In all models, the parameter boundaries for optimization were set to (10⁻³; 20) for N_e and (0; 10) for the split time. These ranges include, and are larger than, those suggested in δaδi's manual (Gutenkunst et al. 2009).
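The multi-start scheme (200 replicates per model, each with random starting values inside the stated ranges) can be sketched as below. The bounds are those given in the text. The sampling distributions (log-uniform for N_e, uniform for times), the function names, and the seed are assumptions, since the text says only that starting values were randomly chosen within predefined ranges.

```python
import math
import random

NE_BOUNDS = (1e-3, 20.0)  # bounds for N_e parameters, as stated in the text
T_BOUNDS = (0.0, 10.0)    # bounds for the split time, as stated in the text


def _clamp(x, lo, hi):
    """Guard against floating-point overshoot at the interval edges."""
    return min(max(x, lo), hi)


def draw_start(n_size_params, n_time_params, rng):
    """Draw one starting vector: sizes first, then times.

    Sizes are drawn log-uniformly (an assumption, chosen because the range
    spans more than four orders of magnitude); times are drawn uniformly.
    """
    log_lo, log_hi = math.log(NE_BOUNDS[0]), math.log(NE_BOUNDS[1])
    sizes = [
        _clamp(math.exp(rng.uniform(log_lo, log_hi)), *NE_BOUNDS)
        for _ in range(n_size_params)
    ]
    times = [rng.uniform(*T_BOUNDS) for _ in range(n_time_params)]
    return sizes + times


def draw_starts(n_size_params, n_time_params, n_replicates=200, seed=0):
    """Return one independent starting vector per replicate."""
    rng = random.Random(seed)
    return [draw_start(n_size_params, n_time_params, rng) for _ in range(n_replicates)]


# Example: a model with two size parameters and one time parameter
# (parameter counts chosen for illustration only).
starts = draw_starts(n_size_params=2, n_time_params=1)
```

Each starting vector would then be passed to the optimiser with the iteration cap described above, and the best-scoring replicate retained.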

MSMC
We explored the possibility of multiple colonisation events to Madeira through simulations. With a first ancient colonisation at 85.4 kya and a second, more recent round of migration at 64 or 48 kya, the decay in cross-coalescence rate (CCR) extends from the colonisation to the more recent migration event. However, both scenarios still produce a faster decay in CCR than in real data. Conversely, a second migration event as recent as 36.6 kya produced a marked spike in CCR, which is absent in real data. We also investigated whether the minimum in
δaδi
All models produced reasonable estimates for the effective population size before the split, which was inferred to be between 120K and 140K. The inferred time to the split was also relatively constant across models, varying between 75 and 87 kya and agreeing very closely with inferences based on MSMC. In the model with constant effective population sizes after the split, N_e in Madeira was optimised to 24K, and in the model allowing for exponential changes in N_e(t) it decayed from around 50K to 19K, in both cases broadly agreeing with inferences from MSMC (long-term N_e in Madeira fluctuating around 30K). Both bottleneck models had more parameters and a lower likelihood than the exponential model, so overall they were outperformed by simpler scenarios. When we constrained the bottleneck to happen at the split, the optimised N_e right after the split was actually greater than the long-term effective population size in Madeira, consistent with results from MSMC and inconsistent with a colonisation bottleneck at the split. When the bottleneck was not bound to happen at the split, the optimised timing and N_e (between 32 and 1 kya, N_e = 14.7K) broadly coincided with the "ice age" period inferred by MSMC (between 40 and 15 kya, N_e = 10K).
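The comparison logic used above (a model with more parameters and a lower likelihood is outperformed by the simpler one) can be made explicit with a small sketch. The dominance check mirrors the argument in the text. The AIC ranking is an added illustration, not a method stated in the source, and AIC is only approximate for composite likelihoods. All numbers and model labels below are placeholders, not values from the study.

```python
def is_dominated(candidate, reference):
    """True if `candidate` has at least as many parameters as `reference`
    and no higher log-likelihood, i.e. extra complexity buys nothing.

    Each model is a (log_likelihood, n_params) tuple.
    """
    cand_ll, cand_k = candidate
    ref_ll, ref_k = reference
    return cand_k >= ref_k and cand_ll <= ref_ll


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def rank_by_aic(fits):
    """Return model names sorted from best (lowest AIC) to worst."""
    return sorted(fits, key=lambda name: aic(*fits[name]))


# Placeholder fits (NOT from the paper): name -> (log-likelihood, n_params).
fits = {
    "simpler_model": (-1000.0, 4),
    "richer_model_1": (-1010.0, 5),
    "richer_model_2": (-1005.0, 6),
}
ranking = rank_by_aic(fits)
```

When a richer model is dominated in this sense, any penalised criterion that increases with the number of parameters will also rank it below the simpler model.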

The McDonald-Kreitman test
The McDonald-Kreitman test (McDonald and Kreitman 1991) resulted in 15 genes with signatures consistent with positive selection.
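For readers unfamiliar with the test, the sketch below shows one common way to evaluate it for a single gene: a 2×2 table of fixed versus polymorphic, nonsynonymous versus synonymous sites, assessed with a two-sided Fisher's exact test and summarised by the neutrality index. The choice of Fisher's exact test, the function names, and the counts are assumptions for illustration; the source does not state which significance test was applied.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality_index, p_value).

    dn, ds: fixed nonsynonymous / synonymous differences.
    pn, ps: polymorphic nonsynonymous / synonymous sites.
    All of dn, ds and ps must be non-zero for the index to be defined.
    A neutrality index below 1 indicates an excess of nonsynonymous
    divergence, the pattern expected under positive selection.
    """
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, fisher_exact_two_sided(dn, ds, pn, ps)


# Illustrative counts only (not from this study).
ni, p = mk_test(dn=7, ds=17, pn=2, ps=42)
```

In a genome-wide scan, the resulting p-values would normally be corrected for multiple testing before genes are reported.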
Figure S1: Population structure in Madeira. (a, b) Neighbour-joining tree of Madeirans, with and without the three recent migrants; (c, d) principal component analysis of Madeirans, with and without the three recent migrants. Sample IDs are described in table S1. Triangles represent the three recent migrants (P1-3); circles represent native Madeirans (11 samples). Among native Madeirans, different colours represent different geographic regions (as in table S1). At the scale of subfigures a and c, two of the recent migrants are almost identical, and so are the native Madeirans in subfigure c.
[Figure S2 plot. Y-axis: effective population size (×1000). Legend: Madeira (2 haps), Madeira (8 haps), Iberian relicts (2 haps), Eurasian non-relicts (2 haps).]
Figure S2: Effective population size as a function of time (N_e(t)) in Madeira (two shades of red), Iberian relicts (green) and Eurasians (blue). N_e(t) is shown smoothed with a cubic spline across the median N_e for each time segment used in MSMC. Shaded areas represent confidence intervals (± 1.96 · SE). Because of the small N_e in the recent past, Madeiran genomes exhaust haplotype information earlier, so MSMC inference is no longer reliable for times older than about 200 kya in 8-haplotype mode and 300 kya in 2-haplotype mode.
Figure S4: Two rounds of migration and variable mutation rate. CCR decay with two rounds of migration to the island (a first colonization at 85.4 kya and a second migration event at 64.4, 48.6 and 36.6 kya, respectively), and CCR decay with increased variance in the mutational process. Simulations with instantaneous splits in the same time frame are shown for comparison, as well as real data on the split between Madeira and Iberian relicts.
Figure S8: The six best-fitting models. Three demographic models produced changes in N_e(t) consistent with real data, with or without a colonization bottleneck of severity 0.1, for a total of six models. The first was the baseline model, which followed the N_e(t) inferred by MSMC. The second and third models assumed a carrying capacity in Madeira of N_e = 30K and an ice age bottleneck (N_e = 10K between 40 and 15 kya), with a sudden and a smooth recovery to carrying capacity, respectively. Dashed and dotted lines of different colors depict the corresponding simulated trajectories in N_e(t).
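As a rough illustration of the second and third models in Figure S8, the N_e(t) trajectories can be written as simple piecewise functions. The carrying capacity, bottleneck size, and bottleneck interval come from the caption. The linear shape of the "smooth" recovery and its 10 kyr duration are assumptions made for this sketch, because the caption does not specify them.

```python
CARRYING_CAPACITY = 30_000    # N_e at carrying capacity (from the caption)
BOTTLENECK_NE = 10_000        # N_e during the ice age bottleneck (from the caption)
BOTTLENECK_START_KYA = 40.0   # bottleneck begins (from the caption)
BOTTLENECK_END_KYA = 15.0     # bottleneck ends (from the caption)


def ne_sudden(t_kya):
    """N_e at time t (thousand years ago) with an instantaneous recovery."""
    if BOTTLENECK_END_KYA <= t_kya <= BOTTLENECK_START_KYA:
        return BOTTLENECK_NE
    return CARRYING_CAPACITY


def ne_smooth(t_kya, recovery_kyr=10.0):
    """N_e at time t with a linear recovery after the bottleneck.

    `recovery_kyr` is a hypothetical parameter: the time taken to climb
    back from the bottleneck size to carrying capacity.
    """
    if t_kya > BOTTLENECK_START_KYA:
        return CARRYING_CAPACITY
    if t_kya >= BOTTLENECK_END_KYA:
        return BOTTLENECK_NE
    elapsed = BOTTLENECK_END_KYA - t_kya
    fraction = min(1.0, elapsed / recovery_kyr)
    return BOTTLENECK_NE + fraction * (CARRYING_CAPACITY - BOTTLENECK_NE)


# Sample both trajectories every 5 kyr from 50 kya to the present.
times = [50 - 5 * i for i in range(11)]
sudden = [ne_sudden(t) for t in times]
smooth = [ne_smooth(t) for t in times]
```

Trajectories of this kind could serve as input to a coalescent simulator to generate the simulated curves that are compared with the MSMC estimates.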