Transposable elements (TEs) are drivers of evolution resulting in episodic surges of genetic innovation and genomic reorganization (Oliver KR, Greene WK. 2009. TEs: powerful facilitators of evolution. Bioessays 31:703–714.), but there is little evidence of the timescale in which this process has occurred (Gingerich PD. 2009. Rates of evolution. Ann Rev Ecol Evol Syst. 40:657–675.). The paleontological and archaeological records provide direct evidence for how evolution has proceeded in the past, which can be accessed through ancient DNA to examine genomes using high-throughput sequencing technologies (Palmer SA, Smith O, Allaby RG. 2011. The blossoming of plant archaeogenetics. Ann Anat. 194:146–156.). In this study, we report shotgun sequencing of four archaeological samples of cotton using the GS 454 FLX platform, which enabled reconstruction of the TE composition of these past genomes and species identification. From this, a picture of lineage specific evolutionary patterns emerged, even over the relatively short timescale of a few thousand years. Genomic stability was observed between South American Gossypium barbadense samples separated by over 2,000 miles and 3,000 years. In contrast, the TE composition of ancient Nubian cotton, identified as G. herbaceum, differed dramatically from that of modern G. herbaceum and resembled closely the A genome of the New World tetraploids. Our analysis has directly shown that considerable genomic reorganization has occurred within the history of a domesticated plant species while genomic stability has occurred in closely related species. A pattern of episodes of rapid change and periods of stability is expected of punctuated evolution. This observation is important to understanding the process of evolution under domestication.
Genomes of domesticated plant species have experienced extensive genome rearrangement (Bennetzen 2005; Baucom et al. 2011). Some of this may have resulted from periods of evolutionary stress upon introduction to new environments or cultural practices during domestication, and some may pertain to evolutionary events deeper in time prior to domestication. Genome rearrangement has been facilitated by transposable elements (TEs), which in turn has generated variation that has been subject to natural selection (Oliver and Greene 2009; Baucom et al. 2011; Tenaillon et al. 2011). Episodic surges of TE activity as adaptative responses fit the pattern of punctuated equilibrium seen in paleontology (Oliver and Greene 2009). It is thought that the rate at which TEs can infest genomes and be fixed in lineages is rapid (Bennetzen 2005), but an estimated constant rate of transposition based on phylogenies of extant taxa is usually used to place these evolutionary events in history (Venner et al. 2009). The time frame in which genomes of crop species have evolved under domestication remains unclear. Archaeogenomics allows the direct examination of taxa in their natural temporal succession so facilitating a resolution of evolutionary events not possible with modern genomics (Schindewolf 1948; Palmer et al. 2011).
Cotton is an interesting species for the study of genome evolution because it is both an economically important crop and has experienced recent genome divergence across the genus. The cotton genus, Gossypium, includes 13 genomes comprising ∼40 diploid and tetraploid species, many of which have undergone large and independent expansions of gypsy-like, copia-like, and LINE TEs (Hawkins et al. 2006; Hu et al. 2010), as well as episodes of rapid DNA loss (Hawkins et al. 2009). Furthermore, Zaki and Ghany (2004) demonstrated that TE activity in this genus is ongoing. Cotton is the most important domesticated plant species for fabric technologies, such as textiles, lines, and nets, and is present in the archaeological record in the Old and New Worlds. Domesticated cotton has complex origins involving two genomes (A and D) and four independent domestication events. In the Old World, two different diploid species of the A genome type, Gossypium herbaceum and G. arboreum, were domesticated in the African and South Asian continents, respectively. An A genome species resembling G. herbaceum was translocated to the New World, where hybridization with an indigenous species of D genome type, G. raimondii, occurred giving rise to wild allotetraploids of AD genome type. Senchina et al. (2003) estimated a date of 1.5 Ma for this hybridization event. Subsequently, two different AD allotetraploid species, G. hirsutum and G. barbadense were domesticated in the North and South American continents, respectively (Fryxell 1979; Wendel 1989; Senchina et al. 2003).
In this study, we directly examine recent genome evolution in Gossypium by using ancient DNA extracted from archaeological samples of cotton. We used the approach of shotgun high-throughput sequencing because it is the most suitable methodology available to reflect the genomic content of these samples (Tenaillon et al. 2011) and has not been previously applied to archaeological samples from low latitudes (Palmer et al. 2011). We selected a cross section of the cotton archaeobotanical record, with one archaeological sample from the Old World from the site at Qasr Ibrim (Egypt) (Clapham and Rowley-Conwy 2009) and three samples from the New World from sites at Samaca (Peru), Loreto Viejo (Peru), and the Boquette Caves (Brazil), respectively (Junqueira and Malta 1981; Freitas et al. 2003; Beresford-Jones et al. 2009; Owen 2009) (fig. 1). Species of the Gossypium genus are not discernable from the remains commonly found in the archaeobotanical record, so attribution of species was necessary for both the archaeology and comparative genomic analysis, which we achieved through the analysis of archaeogenomic data.
Materials and Methods
Methods are outlined in more detail in the Supplementary Material online.
The four samples used in this study represent a cross section of the cotton archaeobotanical record and are detailed as follows: BQT90 was a boll excavated from Boquete Caves in Januaría, Brazil, dated to 750 years BP by radiocarbon dating of other samples from the same site (Junqueira and Malta 1981; Freitas et al. 2003); P1 was seeds excavated from Samaca H-13 site in the Ica Valley in Peru, stratigraphically dated to the Middle Horizon, 800–1,000 years BP (Beresford-Jones et al. 2009); P2 was seeds excavated from Loreto Viejo, Lower Osmore Valley, Peru, radiocarbon dated to 3,820–3,630 years cal BP (Owen 2009), and QI92 was seeds excavated from the site of Qasr Ibrim in Egypt, stratigraphically dated to 1,600 years BP ± 50 years (Clapham and Rowley-Conwy 2009) (for images, see fig. 1).
Ancient DNA Extraction and Sequencing
Ancient DNA was extracted from archaeobotanical samples using a modified cetyltrimethylammonium bromide and silica column purification protocol using stringent precautions to avoid contamination by foreign materials and ensure authenticity. DNA samples (50–120 ng per sample) were made into standard 454 GS FLX shotgun libraries following the manufacturer's low molecular weight protocol (Roche Applied Science, Mannheim, Germany) and sequenced using the 454 GS FLX platform on two LR70 plates using a four region gasket. Reads were filtered for microsatellites and duplicate reads using a scripts developed in house for this purpose. All scripts used in these analyses are available for download from http://www2.warwick.ac.uk/fac/sci/lifesci/research/archaeobotany/downloads/.
Sequencing reads were treated as a metagenomic data set and assigned to taxa by highest scoring alignments to any known sequence in the NCBI nt, est, or gss databases using megablast and MEGAN (v 3.8). Metagenomic analysis was carried out in order to determine endogenous DNA content and to isolate the data sets assigned to Gossypium for further analyses. The repetitive and nonrepetitive DNA from this data set was separated for species assignation analysis and repetitive DNA analysis.
The data set that was assigned to the nonrepetitive portion of the Gossypium genome was utilized to determine the species of each sample by binomial analysis of the number of sequences assigned to the respective Gossypium species in the metagenomic analysis. The representation of each species in the NCBI nt database was used to calculate the unbiased probability of randomly selecting a particular species of Gossypium over any other, out of all the Gossypium species found in the metagenomic analysis. In the case of each sample, there was an expectation of two possible species based on the archaeology of the region and the geographical ranges of extant cotton; these were G. barbadense or G. hirsutum for samples BQT90, P1 and P2, and G. arboreum or G. herbaceum for sample QI92. The probabilities of the observed assignations for each of the Gossypium species were calculated using a posterior distribution. The reliability of the species identification was further assessed by the calculation of beta distributions of the expected and observed numbers of assignations to establish the extent of their separation. Parameters for the beta distribution used were α equal to p + 1, where p was the number of specific species assignations made in the metagenomic analysis, and the β parameter equal to q + 1, where q was equal to the total remaining assignations to members of the Gossypium genus.
Repetitive Element Analysis
The data set assigned to the repetitive portion of the Gossypium genome was interrogated against a custom database of published TEs using megablast and categorized to TE type and subtype according to annotation of the reference sequence. Retrotransposon frequencies were calculated for each sample for comparison with those already published for modern accessions of Gossypium A-, D-, and AD genomes. At the time of writing, retrotransposon profiles for the following species were available: G. herbaceum (Hawkins et al. 2006), G. raimondii (Hawkins et al. 2006) and G. hirsutum (Guo et al. 2008), respectively. Absolute values of archaeogenomic sequences assigned to the respective retrotransposon types were weighted according to retrotransposon lengths of 9.7, 5.3, and 3.5 Kbp for gypsy, copia, and LINE, respectively (Hawkins et al. 2006), to give an estimate of the genomic contribution of each retrotransposon type. Posterior probability distributions were calculated for each set of retrotransposon observations using a beta distribution. The parameter α equal to p + 1, where p was the number of counts observed for that retrotransposon type, and the β parameter equal to q +1, where q was the number of times a different retrotransposon type was observed.
Theoretical tetraploid TE profiles were calculated using the published TE profile of modern G. raimondii, with modern G. herbaceum (Hawkins et al. 2006) and the archaeological G. herbaceum (QI92), respectively, weighted by relative genome size. The posterior distributions of the theoretical tetraploid TE profiles were generated from the posterior distributions of the respective diploid genome progenitor genomes. In this process, the compositions of the diploid genomes were randomly sampled from their respective posterior distributions and the resulting tetraploid composition was calculated weighting the profiles by relative genome size. The process was repeated 10,000 times to produce a frequency profile that was converted to a probability distribution.
GS FLX 454 shotgun sequencing generated 16.1 Mbp of unique reads with an average length of 72.7 bp from BQT90, 6.4 Mbp with average length 57.9 bp from P1, 5.3 Mbp with average length 38.6 bp from P2, and 8.5 Mbp with average length 37.4 bp from QI92 (supplementary table S1 and fig. S1, Supplementary Material online). Sequence lengths decreased with sample age, which is consistent with other studies on samples of similar age and origin (Freitas et al. 2003; Palmer et al. 2009) (supplementary fig. S2, Supplementary Material online). Metagenomic analysis indicated proportions of endogenous DNA consistent with the context and integrity of each sample. Despite the relatively poor coverage of Gossypium genomes on publically available databases, the proportions of endogenous DNA from the assignable sequences were calculated to be 45% for BQT90, 64% for P2, and 95% for QI92 (fig. 1); however, it is likely that considerably more would have been assigned to Gossypium if a complete reference genome were available. From the sample P1, less than 4% DNA was assignable as Gossypium. This sample was of visibly poor integrity and our estimate of the base substitution error rate (see supplementary methods, Supplementary Material online) was higher from this sample than the others. We conclude that the endogenous DNA sequenced from this sample was too severely damaged for accurate assignation to taxa of origin. The high level of endogenous DNA in QI92 is comparable to that obtained from permafrost samples of human hairs (Rasmussen et al. 2010) and considerably higher than previous paleontological metagenomic studies of a permafrost preserved mammoth, a Neandertal and a caprine (Green et al. 2006, 2010; Poinar et al. 2006; Ramirez et al. 2009). Metagenomic analysis attributed only trace presence of bacteria from this sample, which is consistent with previous observations of remarkable preservation at the Qasr Ibrim site (O'Donoghue et al. 1996; Palmer et al. 2009).
Species Assignations of Cotton Samples
In the metagenomic analysis, reads from each sample were each assigned to one of a number of species of the Gossypium genus (supplementary fig. S3, Supplementary Material online). The unbiased probability of observing the number of metagenomic assignations or more based on database size is given in figure 2A. In this test, a probability value of 0.5 indicates no bias for or against selection of a particular species and probabilities below 0.5 indicate an increasing bias in favor of the associated species. For species in which the unbiased probability was below 1 × 10−5 the information value of the result was assessed using beta distributions of the observed values compared with the unbiased values expected if no one species was more likely to be assigned (fig. 2B). From these analyses, samples BQT90 and P2 were unambiguously assigned as G. barbadense and QI92 as G. herbaceum. The sample P1 was also assigned with statistical significance to G. barbadense, but beta distributions showed very little separation between observed and expected assignations, indicating that there were too few DNA sequences to be sure that the G. barbadense assignations of this sample were not drawn from the expected unbiased distribution.
This study represents the first identification of archaeobotanical cotton remains to a species level and demonstrates the power of archaeogenomic analysis for this purpose. The identification of the sample QI92 as G. herbaceum rather than G. arboreum indicates that it is of African origin as opposed to the alternative possibility of an imported plant from the Indian subcontinent (Clapham and Rowley-Conwy 2009), which has occurred for a number of domesticate species (Fuller and Boivin 2009). The identification of the South American samples as G. barbadense rather than G. hirsutum confirms the expectation for these samples based on biogeography (Beresford-Jones et al. 2011). Specifically for this study, the species identifications enabled comparative studies of genomic architecture in samples spanning the domestication history of cotton.
Repetitive Element Analysis
The repetitive DNA content of archaeogenomes ranged from 48% to 65% for BQT90, P2 and QI92, which is close to previous estimates for Gossypium of 40–65% (Hawkins et al. 2006; Guo et al. 2008; supplementary table S2, Supplementary Material online). The percentage of the QI92 archaeogenome assigned as repetitive DNA was 65%, concurrent with previous estimates for G. herbaceum of a minimum of 60% of the genome occupied by repetitive sequences (Hawkins et al. 2006). While there have been no direct estimates reported for modern G. barbadense, 53% and 48% of the BQT90 and P2 archaeogenomes, respectively, were repetitive in this study, so modern G. barbadense is likely to have about 50% repetitive DNA. The 5% difference in repetitive DNA composition between these two G. barbadense samples is unlikely to be the result of sampling error because the posterior distributions of genomic repetitive element proportion of the two samples are well separated (supplementary fig. S4, Supplementary Material online).
Gossypium retrotransposons accounted for the majority of archaeogenomic repetitive DNA (84% from BQT90, 100% for P1, 85% for P2, and 88% for QI92, supplementary table S3, Supplementary Material online) and a high proportion of which assigned to Gossypium retroelement gypsy elements, known as Gorge TEs (supplementary table S4, Supplementary Material online). Gorge TEs comprise a large portion of Gossypium genomes, in some lineages up to half of the genome, and only occur in members of the Gossypium genus (Hawkins et al. 2006). Hawkins et al. (2006) describe three phylogenetically distinct Gorge subtypes, of which Gorge3 has undergone significant expansion in lineages of the A genome. The relative abundance of Gorge TEs in the archaeogenomes were similar to estimates of modern Gossypium A and AD genomes, with a higher proportion of the Gorge3 type occurring in QI92 (supplementary table S5, Supplementary Material online) than the G. herbaceum accession of Hawkins et al. (2006). The evidence of Gorge TEs supported our findings of a high endogenous DNA content because Gorge TEs would not be recovered from any contaminant DNA originating from other species.
The retrotransposon profiles of the AD archaeogenomes were similar to that of the published G. hirsutum (Guo et al. 2008; fig. 3; supplementary tables S6 and S7, Supplementary Material online). In contrast, the profile of the G. herbaceum archaeogenome differed greatly to that of the modern published G. herbaceum profile (Hawkins et al. 2006), with a much greater dominance of gypsy-like elements occurring in the archaeogenome. The extent of differentiation of the retrotransposon profiles, calculated by posterior distributions for the probabilities underlying the observed retrotransposon counts observed, showed distinct separation between samples for both the gypsy- and copia-type retrotransposons (fig. 4; supplementary table S8, Supplementary Material online), confirming that the differences in counts between samples reflect differentiated genome architectures for these retrotransposon types and are not the result of sampling artifacts. For gypsy- and copia-like retrotransposon types, the tetraploid genomes overlap, distinct to the distributions of G. raimondii and G. herbaceum. In the case of the LINE-type retrotransposons, the overlap between the distributions of most samples is likely to be due to little activity of this retrotransposon type in Gossypium lineages, supporting the current theory (Hawkins et al. 2008). The retrotransposon data sets were additionally interrogated against a data set of reverse transcriptase (RT) genes from long terminal repeat (LTR) and non-LTR types (Hu et al. 2010) to ensure that the gene counts of the respective TE types were reflected in the archaeogenomic profiles described above. Even though the number of archaeogenomic alignments to RT genes limited the analysis, no increase in the proportion of LINE retroelements was discovered (supplementary table S9, Supplementary Material online). The analysis or RT genes presented an interesting ambiguity in the relative abundance in gypsy-like retrotransposon proportions in the QI92 sample. The RT gene analysis estimates a lower proportion of gypsy-like elements, in particular the Gorge3 subtype. Nested TE regions would be underrepresented by the RT analysis, and interpreted this way, our data show that one in every two Gorge3 elements in the QI92 lineage are nested.
Posterior probability profiles calculated for the theoretical AD genome tetraploid constructs demonstrated that the theoretical tetraploid based on the A archaeogenome from QI92 is significantly similar to that of modern G. hirsutum for all three retrotransposon types, as opposed to the profile constructed from the modern A genome, which sits outside the probability range observed in G. hirsutum (fig. 4).
Punctuated Evolution in Old World Cotton in the Holocene
This study shows a punctuation event in G. herbaceum that occurred over a timescale that is within the history of plant domestication and cultivation. Comparative analysis of retrotransposon profiles from archaeological and modern G. herbaceum genomes demonstrated significantly divergent genomic composition, indicating that the evolutionary development of the domesticate form of G. herbaceum has been ongoing to a hitherto unappreciated level during the course of human history.
The proportion of archaeogenomic sequences which aligned to published RT genes (Hu et al. 2010) support a Gorge3 expansion in the QI92 sample, an observation which is consistent with previous reports of Gorge3 subtype TE proliferation in the modern A genome (Hawkins et al. 2006; Hu et al. 2010). It is possible that the QI92 lineage may, like other crop species found at this site (Palmer et al. 2009), have been locally adapted to extreme environmental stress and that the proliferation of Gorge3 TEs observed in the QI92 lineage may be the signature of a transcriptional TE burst in response to dehydration stress. The difference between the modern and archaeogenome can be explained by a very recent copia expansion in the modern G. herbaceum lineage (Hawkins et al. 2008) that had not yet occurred at the time of the QI92 sample. These findings support a scenario of dynamic genome evolution within the Holocene, under which domesticates have actively responded to their environment and become locally adapted.
Genomic Composition of Extinct A Genome Donor to New World Cottons
It was surprising to observe such differentiation at the genomic level in the A genome lineage. Gossypium herbaceum is thought to be a relatively young species (Fryxell 1979) and the nearest extant relative of the A genome progenitor of the AD genome tetraploids (Wendel and Cronn 2003). The relationship between the contrasting G. herbaceum genomes was explored as possible A genome progenitors to the New World tetraploids by constructing theoretical tetraploids based on the retroelement composition from the D genome donor G. raimondii and each of the G. herbaceum genomes, respectively. The striking resemblance of the tetraploid construct using the archaeogenome as the A donor to modern G. hirsutum (figs. 3 and 4) led us to the conclusion that the profile represented by QI92 is more likely to reflect the true A genome donor than does the published modern G. herbaceum. This direct ancient DNA evidence of an extinct A genome ancestor answers questions that could only be speculated on using comparisons of modern genomes (Hu et al. 2010).
Genomic Stability of New World Cottons since Polyploidization
The three archaeological samples from South America, identified as G. barbadense, show retrotransposon composition close to that of the modern AD genome tetraploid G. hirsutum (Guo et al. 2008) (fig. 3) deviating in gypsy/copia content between the archaeological and modern tetraploid samples by 4–8%. In the posterior distribution analysis, the distributions of the AD genome tetraploid samples overlap and are distinct of both A and D genomes (fig. 4). BQT90 and P2 have distributions that overlap closely, indicating that they are likely to be drawn from highly similar underlying populations. The G. hirsutum distributions, for each retrotransposon type, encompass the archaeological G. barbadense distributions of BQT90 and P2. Gossypium barbadense and G. hirsutum are thought to have had a common ancestor about 1.5 Ma (Senchina et al. 2003). Our analysis reflects the minimal differentiation between these two tetraploid species and indicates a gradual genomic change over this time. The similarity that occurs amongst the G. barbadense archaeological samples despite separation by more than 2,000 miles in distance and 3,000 years in time implies stability in the G. barbadense lineage, and among the tetraploids since formation. The profile of the theoretical AD construct 2 differed in gyspy-like retrotransposon composition from G. hirsutum by 2.7% and differed to the archaeological G. barbadense by 7–10%. While the deviation in TE composition between the tetraploid lineages has been small, these data show there has been greater genomic change in the G. barbadense lineage than the G. hirsutum lineage since tetraploid formation.
This low divergence of the tetraploids at a genomic level is also reflected in the analysis of the nonrepetitive portion of the archaeogenomes. Beta distribution analysis of reads assigned to nonrepetitive DNA indicated that the older G. barbadense sample, P2, was less divergent from the null expected distribution than the younger G. barbadense sample, BQT90, which would be expected if the latter was in a more advanced state of lineage sorting (fig. 2B; supplementary SI1, Supplementary Material online). If this is representative of 3,000 years of lineage sorting in the G. barbadense lineage, and the null expected distribution representative of the point of speciation between G. barbadense and G. hirsutum, then these two species could be younger than has hitherto been realized.
In conclusion, the archaeogenomic analysis we present here shows that the New World cotton tetraploids were formed from an A genome donor that was ancestral to the QI92 genotype, rather than a genotype resembling the modern G. herbaceum. Both the QI92 and modern G. herbaceum lineages experienced rapid bursts of TE activity resulting in a punctuated change in genome content. This punctuation was complimented by long periods of genomic stability seen in the tetraploid lineages, giving an overall picture of evolutionary change reminiscent of punctuated equilibrium. It is possible that this observation may be extended to other plant domesticates, which would have important implications for the conservation of genetic resources. If there has been extensive genome modification in many landraces, which may be associated with local adaptation to various conditions, then these germplasms may represent an even more valuable resource relative to wild genetic resources than is currently appreciated.
This work was funded by the Natural Environment Research Council (NE/F000391/1). We thank the Egypt Exploration Society for access to the material from Qasr Ibrim and Dr Andre Prous at the Museu de História Natural Universidade Federal de Minas Gerais, for access to material from the Boquete Caves. We also thank J. Wendel, C. Grover, and J. Hawkins for supplying data and sequencing of modern Gossypium retrotransposon studies and additional G. arboreum sequencing and Gérard Sécond for useful contacts and discussions surrounding the Loreto Viejo Peruvian samples. Thanks also for useful comments from three anonymous reviewers. This research was carried out at School of Life Sciences, The University of Warwick, Coventry, United Kingdom.