Origin of the Aromatic Group of Cultivated Rice (Oryza sativa L.) Traced to the Indian Subcontinent

Abstract The aromatic group of Asian cultivated rice is a distinct population with considerable genetic diversity on the Indian subcontinent and includes the popular Basmati types characterized by pleasant fragrance. Genetic and phenotypic associations with other cultivated groups are ambiguous, obscuring the origin of the aromatic population. From analysis of genome-wide diversity among over 1,000 wild and cultivated rice accessions, we show that aromatic rice originated in the Indian subcontinent from hybridization between a local wild population and examples of domesticated japonica that had spread to the region from their own center of origin in East Asia. Most present-day aromatic accessions have inherited their cytoplasm along with 29–47% of their nuclear genome from the local Indian rice. We infer that the admixture occurred 4,000–2,400 years ago, soon after japonica rice reached the region. We identify aus as the original crop of the Indian subcontinent, indica and japonica as later arrivals, and aromatic a specific product of local agriculture. These results prompt a reappraisal of our understanding of the emergence and development of rice agriculture in the Indian subcontinent.


Introduction
Rice (Oryza sativa L.) is one of the oldest and globally most important staple crops. The tradition of rice cultivation spans several millennia, a multitude of cultures and diverse ecogeographic regions from Iran, across the Indian subcontinent, to East and Southeast Asia. Phenotypic diversity found in rice cultivar groups and traditional landraces mirrors this richness of agroecological settings and the complex population histories. Two major groups of rice-Hsien and Keng-have been recognized in China at least since the Han dynasty ($1,800 years B.P.) (Oka 1988), corresponding to the subsp. indica and subsp. japonica, respectively. Generally, these two groups can be distinguished by their grain shape and amylose content (the indica grain being usually longer; the japonica grain having low amylose content, making it sticky after cooking) and some agroecological characteristics (e.g., cold tolerance typical for japonica) (Oka 1988). Genetically, the indica and japonica groups are well-differentiated; however, several additional groups can also be recognized (Glaszmann 1987;Garris et al. 2005;Zhao et al. 2011). Besides a further subdivision of japonica into tropical and temperate ecotypes, two other genetically distinct groups are cultivated extensively. These are aus and aromatic, first differentiated in population genetic terms by Glaszmann (1987), who designated them as "group II" and "group V," respectively. Genetic distinctiveness of the aromatic group is further documented by limited cross-compatibility with indica and japonica (Engle et al. 1969), possibly driven by cytoplasm-nucleus interactions (Virmani et al. 1981).
Recent findings suggest that the indica, japonica, and aus groups originated from three different gene pools of wild Oryza rufipogon Griff., although the degree to which the different types were domesticated independently or arose as a result of gene flow between cultivated populations and/or wild rice remain controversial (Huang et al. 2012; Civ a n et al. 2015; Choi et al. 2017; Civ a n and Brown 2017Brown , 2018. In contrast, less attention has been directed at the origin of aromatic rice. This is probably due to the apparent genetic proximity of the aromatic and japonica types (Zhao et al. 2011;Wang et al. 2018) and the implicit assumption that aromatic is a subtype of japonica. Nonetheless, the population history of the aromatic group appears to be more complex than a simple postdomestication divergence from japonica. The first observation hinting at this complexity is a discrepancy in geographic distribution. While japonica rice originated in Southeast China (Huang et al. 2012; Civ a n et al. 2015; Civ a n and Brown 2018) and is a traditional crop across East and Southeast Asia, aromatic rice has its center of diversity along the Himalayan foothills and is traditionally grown in Pakistan and India, with its easternmost occurrence in Myanmar (Glaszmann 1987;Khush 2000). Secondly, the pattern of genetic diversity displayed by aromatic rice seems to indicate affiliation with the aus group. This is indicated by the main axes of genetic variation (Zhao et al. 2011), and also by analysis of 31 selective sweep regions colocated in all groups of cultivated rice (Civ a n et al. 2015).
The earliest written record of rice that likely belonged to the aromatic group is from the Sanskrit text Susruta Samhita dated around 400 BC (Singh 2000). The aromatic group owes its name to the presence of fragrance in the grain, which is one of the most valued characteristics of rice. When combined with other quality traits (e.g., amylose content, grain length, grain elongation after cooking, texture of cooked rice), fragrant varieties are considered superior and have high cultural significance in many regions of Asia (Singh 2000). In most cases, enhanced fragrance is caused by BADH2 protein deficiency resulting from an 8-bp deletion that introduces a premature stop codon in the seventh exon of the Badh2 gene (the badh2.1 allele) (Bradbury et al. 2008;Chen S et al. 2008).
Although group V, according to Glaszmann's classification (Glaszmann 1987), is "aromatic" rice, the presence of aroma or fragrance is neither a universal nor an exclusive trait of this group. Many group V varieties lack fragrance and the badh2.1 allele frequency in this population was estimated at 0.6 (Kovach et al. 2009). Moreover, fragrant varieties can be found among indica (most notably Jasmine rice) and tropical japonica, although at much lower frequencies (Kovach et al. 2009). Since the "group V" designation has not been universally accepted, recent publications inconsistently describe this group as aromatic (Zhao et al. 2011;Civ a n et al. 2015), Basmati/Sadri (3,000 Rice Genomes Project 2014), or circum-Basmati (Wang et al. 2018). The "Basmati" designations refer to the most produced and commercially successful varietal group, and do not fully appreciate the genetic diversity present in group V. Because it is practical to identify group V as an infraspecific taxonomic rank of O. sativa (as subspecies, varietas, or forma), the group name should follow the code of nomenclature and the conventions for infraspecific epithets (Turland et al. 2018). Here, we use the epithet "aromatic" (italicized) when referring to the whole group V population irrespective of fragrance, and the term "fragrant rice" when referring to the aroma trait.
In this article, we unravel the genomic history of the aromatic population by analyzing a unique genome-wide sativarufipogon diversity data set. We identify the source populations and present the most parsimonious explanation for the origin of aromatic rice. In doing so, we bring a new perspective on the origins of rice agriculture on the Indian subcontinent.

Remapping of the Wild Data Set
Raw data for 461 wild rice accessions (Huang et al. 2012) were obtained from the European Nucleotide Archive database. Read mapping and SNP-calling was carried out using the pipeline implemented in TOGGLE v0.3.3 (Monat et al. 2015). Paired-reads were mapped to the IRGSP-1.0 Nipponbare reference genome using the aln and sampe commands in BWA (Li and Durbin 2009). Alignments were sorted with picardToolsSortSam (http://broadinstitute.github.io/picard/) and samToolsView. The tool realignertargetcreator was used to define intervals to target for local realignment and indelrealigner to perform local realignment of reads around indels, both in the GATK package (McKenna et al. 2010). Markduplicates from picardtools was used to remove duplicates. Output bam files were divided into per chromosome bam files with bamtools. SNPs were called for each chromosome with GATK haplotypecaller (McKenna et al. 2010) using the badcigar filter. High-confidence SNPs were identified using GATK variantfiltration with the parameters DP > 10 and QUAL > 30. The per-chromosome vcf files were then combined into a single vcf file.

Construction of the Merged SNP Matrix
The LD pruned 404k CoreSNP data set was downloaded from http://oryzasnp-atcg-irri-org.s3-website-ap-southeast-1.amazonaws.com/ (last accessed October 2017) and converted to ped format using PLINK v1.90 (Chang et al. 2015; www.coggenomics.org/plink/1.9/). PCA was performed with smartpca (Patterson et al. 2006), using the lsqproject option without outlier removal. This sample set contained a large number of intermediate and apparently misclassified accessions (supplementary fig. S1, Supplementary Material online). To extract "typical" representatives of indica, japonica, aus, and aromatic (instead of maximizing the diversity within those groups), the medians for the top ten PCs in each group were identified and those accessions falling within 1 SD around the median in all cases extracted. This procedure produced a subset of 283 indica, 154 japonica, 124 aus, and 34 aromatic accessions (only accessions where reclassification was not needed were kept; supplementary table S1, Supplementary Material online). This subset was then extracted from the 18 million Base SNP data set that had been created by the 3k RGP by removing SNPs with excessive heterozygous calls from the complete biallelic SNP data set (http://oryzasnp-atcg-irri-org.s3-website-ap-southeast-1.amazonaws.com/; last accessed October 2017), and converted to vcf format with PLINK v1.90. From this 3k RGP subset and the newly created wild SNP matrix, all biallelic sites with identical position, reference, and alternative identifiers were extracted (with bash pipelines), creating compatible subsets. We noted that the reference variants of the 3k RGP data set did not always match the IRGSP-1.0 reference. We corrected those positions accordingly in a separate file (swapping the REF and ALT columns together with the called variants, using bash pipelines), added them to the compatible subset, and merged the two compatible subsets with GATK combinevariants command. Finally, all polymorphic sites that had at least one nonmissing data point in the wild subset were extracted, yielding a vcf file containing $2.4 million positions scored for 595 cultivated and 461 wild rice accessions.

Analyses of the Merged Data Set
PCAs were performed as above; however, due to the high proportion of missing data in the wild subset, the axes of variation were calculated using the domesticated accessions only, with subsequent mapping of the wild accessions onto these axes. Allelic frequencies, nucleotide diversity and pairwise F ST scans were calculated with VCFtools 0.1.15 (Danecek et al. 2011). Neighbor-Net networks (Bryant and Moulton 2004) were constructed in SplitsTree4, using 1-ibs (identityby-state) distance matrix calculated in PLINK v1.9. The joint allele frequency spectrum was calculated from a table of allelic frequencies in LibreOffice Calc, using basic operations together with the functions FREQUENCY, LOG10, and conditional formatting. For the purpose of the individual-based tracking of the non-japonica genomic fraction in aromatic, aus, and indica, only major alleles (group frequency >0.5) with <0.01 frequency in japonica were considered. Sites with >1/3 missing data per group were ignored, since the major variant at those sites can be a deletion. A complete set of such alleles was identified from a table of allelic frequencies, extracted from the vcf file, and proportions of sites matching the group's major allele was calculated for each individual in a spreadsheet.

Reconstruction of the Complete Chloroplast Genomes
This part of our work was previously posted, with additional details, on a preprint server (Civ a n and Brown 2016). Raw sequencing data for 508 indica, 482 japonica, and 460 wild accessions (ERP001143, ERP000729, ERP000106) (Huang et al. 2012) plus 124 aus and 34 aromatic accessions selected from the 3,000 Rice Genomes Project (2014) (ERP005654) were downloaded from the Sequence Read Archive (http:// ncbi.nlm.nih.gov/Traces/sra/) using the fastq-dump command from sratoolkit 2.3.5 (additional details in supplementary table S2, Supplementary Material online). For each of the 1,608 accessions, reads significantly matching known rice chloroplast genomes (E value <1e-5) were extracted using the fil-ter_by_blast script from seq_crumbs-0.1.9. Adapter contamination and low-quality regions were removed from the matching reads by Trimmomatic-0.33 (Bolger et al. 2014). The filtered and trimmed data sets were imported into Geneious 6.1 (http://www.geneious.com) and individually mapped onto the Nipponbare chloroplast genome (KM088016) used as a reference (5 mapping iterations; maximum of 5% mismatches and 10% gaps per read; maximum gap size set to 100; index word length 13; only paired reads matching within the expected distance used). The second copy of the inverted repeat region (the duplicated fraction of the chloroplast genome) was removed and the sequences were aligned in Geneious. Treating gaps as missing data, all SNPs with frequencies >0.005 were extracted, yielding a data matrix with 215 positions and 0.6% missing data points, and a median-joining network was built in Network 4 (Bandelt et al. 1999).

Local Ancestry Inference
Phased haplotypes were imputed from the merged data set separately for the aromatic group and its source populations using Beagle 5.0 (Browning and Browning 2007) with default parameters. We noticed that changing the ne parameter by orders of magnitude does not have an obvious impact on the subsequent local ancestry inference. However, the sample size of the putative source populations does; therefore, we randomly selected 30 japonica accessions to match the size of the non-japonica source sample (supplementary table S1, Supplementary Material online). Using these two samples as the putative source populations, the local ancestry of the aromatic haplotypes was inferred with Loter (Dias-Alves et al. 2018). Loter does not require specifications of uncertain biological parameters (genetic maps; recombination and mutation rate; average ancestry coefficients; number of generations since admixture) and has been shown to outperform other tools for local ancestry inference of ancient admixture events (Dias-Alves et al. 2018). Nonetheless, we tested the inference accuracy on our data. Seven aus individuals from the non-japonica source population were recombined in silico with seven randomly chosen japonica individuals (supplementary table S1, Supplementary Material online), using one recombination point per chromosome to split each chromosome into two equal ancestry blocks. These 14 in silico recombinants were phased and their local ancestry was inferred as above.

Merged Data Set of Nucleotide Diversity
Recently, 76 aromatic genomes have been sequenced on the Illumina platform in depths sufficient for reference-based genome reconstruction. The resulting diversity data are publicly available within the 3,000 Rice Genome Project data set (a.k.a. 3k RGP) (3,000 Rice Genomes Project 2014; Wang et al. 2018). However, the 3k RGP data set does not contain data from O. rufipogon-the wild progenitor of cultivated rice-and is not directly compatible with previous data sets of wild diversity due to different reference sequences being used (IRGSP4 in Huang et al. 2012 and IRGSP-1.0 in 3,000 Rice Genomes Project 2014). We therefore remapped the wild rice data of Huang et al. (2012) onto the newest Nipponbare reference (IRGSP-1.0) and created a new single nucleotide polymorphism (SNP) data set by merging the mapped wild data with a subset of the 3k RGP data. This new IRGSP-1.0-based genome-wide nucleotide diversity matrix comprises 595 domesticated accessions (283 indica, 154 tropical and temperate japonica, 124 aus, 34 aromatic) (3,000 Rice Genomes Project 2014) and 461 wild rice accessions (Huang et al. 2012) and consists of 2,365,188 biallelic positions. The domesticated and wild subsets are heterogeneous in respect to the proportion of scored data points ( fig. 1a), which can be mainly attributed to different sequencing depths in the two subsets. While the wild subset was constructed from data with mean sequencing depth $2Â, the mean mapping depth of the 3k RGP subset is 14Â (7.5Â and 27.2Â for 5th and 95th percentile, respectively), indicating that the missingness in the latter group is to a large extent caused by presence-absence variation (i.e., indels). Per individual proportions of missing data vary between 0.32-0.97 in wild rice and 0.07-0.54 in domesticated rice. The distribution of allelic frequencies in cultivated rice is markedly skewed toward the extreme categories, with 42% of variants approaching fixation (minor allelic frequency 0.05), compared with 31% in wild rice ( fig. 1b). This difference is likely to be a consequence of the domestication bottleneck, selection, and crop homogenization, facilitated by the predominantly self-pollinating mode of reproduction in O. sativa. The mean proportion of heterozygous sites is not significantly different in wild and domesticated rice ( fig. 1c), although this observation is probably biased by the low sequencing depths in wild rice (the proportion of heterozygous sites in wild rice is expected to be higher due to more frequent cross-pollination; Ishii et al. 2013). The SNPs are distributed across all twelve rice chromosomes, with average density of 6.3 SNP/kb (supplementary fig. S2, Supplementary Material online) and a modal category 400-500 SNPs per 100-kb window ( fig. 1d).

Population Structure and Genome-Wide Diversity Scans
Principal component analysis (PCA) of the 2,365,188 polymorphic sites separates O. sativa accessions into five statistically strong groups: tropical japonica, temperate japonica, indica, aus, and aromatic. Analysis of variance shows that the PC values of each of these five groups are distinct with high statistical significance (P value <1e-6) along the first four PCs jointly explaining 33% of the total variation, with the exception that tropicalþtemperate japonica are not distinguished along PC3, and indicaþaus are not distinguished along PC4 ( fig. 2a and b). The diversity of the aromatic group is clearly defined and separated from other cultivated groups along PCs1-3. Nonetheless, the position of the aromatic cluster along the first two PCs suggests ambiguous relationships to japonica and aus, and there is a strong overlap between aromatic and the indicaþaus clusters along PC4. We further investigated these relationships through diversity and pairwise F ST scans across the entire genome. These scans confirmed that parts of the aromatic genome have nearzero pairwise F ST with japonica, these parts often coinciding with regions of low diversity in aromatic (supplementary fig.  S3a and b, Supplementary Material online). Such observations suggest that parts of the aromatic genome are shared with japonica, and that some of those segments were subjected to selection in aromatic. However, a similar relationship between aromatic and the other cultivated groups is not apparent in the pairwise F ST scans (supplementary fig. S2c and d, Supplementary Material online), providing no indications for japonicaÂaus or japonicaÂindica hybridization in the ancestry of aromatic.
The Neighbor-Net network clearly separates indica, japonica, aus, and aromatic into four clusters in the context of wild diversity (supplementary fig. S4, Supplementary Material online). High network complexity (1,056 accessions; 214,146 edges) obscures the origins of aromatic, which become clearer after exclusion of the wild populations (supplementary fig. S5, Supplementary Material online). This second network shows that aromatic shares unique edges with japonica, separating these two groups from the rest, but also with aus, separating aus and aromatic from japonica and indica.
In order to check whether aromatic's variation is merely a subset of japonica's variation, or instead contains variants that are absent from japonica, we summarized the major aromatic alleles (allelic frequency >0.5) at 1,807,643 sites on a joint allele frequency spectrum ( fig. 3a). The spectrum shows that a substantial fraction of variants common in aromatic are rare or absent in japonica. Specifically, 100,016 of the major aromatic alleles have 0.05 frequency in japonica, and 58,168 of those can be considered absent in japonica (frequency <0.01). Subsequent PCA computed from the 100,016 sites revealed that aromatic is more closely associated with aus, indica, and some wild accessions when this "non-japonica" fraction of the genome is considered. The aromatic cluster partially overlaps with aus, indica, and wild rice along the first PC, and with a few wild accessions along the second PC ( fig. 3b).

Origin of the Non-Japonica Fraction of the Aromatic Gene Pool
We found that the majority of the 58,168 major aromatic alleles that are absent in japonica can be found in other rice groups at >0.05 frequencies. Specifically, 55.4%, 63.0%,  Civ a n et al. and 86.8% of these alleles are found in indica, aus, and wild rice, respectively. This indicates that the majority of the nonjaponica fraction of the aromatic gene pool comprises variants that did not emerge in aromatic postdomestication, but have a specific ancestor instead. We searched for this specific ancestor by complementing two lines of evidence: 1) perindividual distribution of the major aromatic alleles that are absent in japonica and 2) comparison of chloroplast haplotypes.
First, we extracted the 58,168 sites from the diversity matrix, and for each wild, indica, aus, and japonica individual we calculated the proportion of sites that match the aromatic major allele (supplementary table S1, Supplementary Material online). This effectively measures similarity of those individuals to aromatic within the selected genomic fraction. We identified a group of accessions that carry the non-japonica aromatic variants in distinctly higher proportions (supplementary fig. S6a, Supplementary Material online), consisting of 90 aus and 62 wild accessions. Within these accessions mostly coming from the Indian subcontinent, 89.9% of the 58,168 aromatic non-japonica alleles can be found. Analogically, we identified wild subpopulations distinctly similar to aus and indica (supplementary fig. S6b and c, Supplementary Material online) and found that the subpopulations associated with aromatic and aus have a strong overlap.
Reconstruction of the complete chloroplast genomes revealed that six (17.6%) aromatic individuals have chloroplast haplotypes that are very common in japonica (haplotypes A01 and A02; fig. 4 and supplementary fig. S7 and table S2, Supplementary Material online). However, the other 28 aromatic individuals (82.4%) carry chloroplast haplotypes A04 or A09 that are not found in japonica, and are not derived from the japonica haplotypes. This observation provides further evidence for hybrid origin of aromatic. Apart from the aromatic group, the A04 haplotype is also found in 55 wild, 8 aus, and 3 indica accessions. In relative proportions per group, the A04 haplotype is common in wild rice (in 12% of the wild accessions), can also be found in aus (6.5%), but is very rare in indica (0.6%). This leads us to conclude that most of the aromatic individuals inherited their chloroplast genomes (hence the cytoplasm) from the non-japonica ancestor, which was most likely a specific wild subpopulation or a member of the aus group. Therefore, we searched for the A04 haplotype among the accessions that are distinctly similar to the non-japonica aromatic fraction. We found the A04 haplotype in seven aus and 23 wild accessions, with a geographic distribution centred on the Indian subcontinent ( fig. 5). We conclude that these accessions represent the extant populations derived from the most probable non-japonica ancestors of the aromatic group.

Local Ancestry Inference
Identification of the extant lineages representing the source populations of the non-japonica component of aromatic allows per-individual inference of local ancestry across the entire genome, using modeling tools such as HAPMIX (Price et al. 2009) andLoter (Dias-Alves et al. 2018). We first tested the ability of Loter to correctly reconstruct local ancestry of in silico recombinants created from the putative source populations of the aromatic group. We found that Loter is able to capture the overall ancestry pattern, albeit with fine-scale errors that are often consistent across samples (supplementary fig. S8, Supplementary Material online). These in silico reconstructions revealed a systematic bias leading to underestimation of the non-japonica ancestry by a factor of 1.22, likely resulting from the differences in the proportions of missing data between the source populations.
The local ancestry inference of the aromatic individuals indicates that a significant portion of their genomes originated from the non-japonica source ( fig. 6). Following the bias correction, the results suggest that on an average, 36% of the aromatic genome can be traced to this source, the proportion ranging from 29% to 47% among individuals.
The highly consistent pattern of the local ancestry across individuals indicates that the whole aromatic group stems from a single admixture event, although the original genomic footprint has in some lineages been disrupted by subsequent recombinations.
Inspection of the local ancestry at the loci related to the domestication phenotype revealed that most of the examined genes in the aromatic group have either japonica or mixed origins (supplementary table S3 online). Among the genes often considered crucial for the basic domestication characteristics, the Rc gene (responsible for white pericarp; Sweeney et al. 2006 and reduction of seed dormancy in cultivated rice Gu et al. 2011), the Prog1 gene (erect growth; Tan et al. 2008) and the LABA1 gene (short and barbless awns; Hua et al. 2015) were uniformly inherited from the japonica ancestor. Several genes controlling grain size, shape, and yield also have complete japonica ancestry. Importantly, most but not all aromatic accessions carry japonica-like haplotypes of the S5 gene involved in the reproductive separation of indica and japonica (Chen J et al. 2008), which is the likely cause of the limited cross-compatibility between aromatic and indica. The ancestry of the Badh2 gene (fragrance; Bradbury et al. 2008;Chen S et al. 2008) is traced to the japonica ancestor in all fragrant members of the aromatic group, in accordance to the conclusions by Kovach et al. (2009). Interestingly, in our set of 34 aromatic representatives, only eight accessions carry the badh2.1 deletion, and other mutations previously reported as interrupting the reading frame of the Badh2 gene (Kovach et al. 2009) were not found. Out of the 76 accessions classified as aromatic in the full 3k RGP data set, only 26 carry the fragrance-causing badh2.1 allele, highlighting that fragrance is not the defining feature of the aromatic group.
Genes influencing starch synthesis, amylose content and gelatinization temperature were inherited from both source populations. The Sh4 gene (nonshattering ear; Li et al. 2006) also shows mixed ancestry in aromatic, with two thirds of the examined accessions inheriting the locus from the non-japonica ancestor. Only one of the examined genes has a higher level of non-japonica ancestry-the PCR1 gene influencing grain weight and Zn 2þ accumulation (Song et al. 2015) is uniformly inherited from this ancestor.

Discussion
The origins of aromatic rice have been uncertain. Its morphology and distribution on the Indian subcontinent led to an initial association with indica (Glaszmann 1986), but aromatic has distinctive features (phenol reaction, opaque kernel appearance, intermediate amylose content and medium gel consistency) that prompted suggestions that it might be intermediate between indica and japonica (Ahuja et al. 1995;Bhattacharjee et al. 2002). Later studies of genetic diversity revealed that aromatic is closer to and possibly a subgroup of japonica (Garris et al. 2005;Kovach et al. 2009;Zhao et al. 2011;3,000 Rice Genomes Project 2014). As japonica clearly originated in East Asia (Huang et al. 2012; Civ a n et al. 2015; Civ a n and Brown 2018), but aromatic is not found in China, it appears that aromatic emerged after japonica cultivation reached the Indian subcontinent, possibly following hybridization between japonica and a local variety of rice. Analysis of genomic regions with reduced diversity in all groups of cultivated rice supported this scenario, although only a small sample of aromatic accessions was examined (Civ a n et al. 2015). This study revealed that among such regions, which arguably were targeted by artificial selection, about a quarter of aromatic haplotypes are associated with japonica, another quarter are unique, while about one-fifth are associated with aus, indicating that aromatic could have emerged from hybridization between japonica and aus.
To make a more comprehensive analysis of aromatic origins we merged genomic data for cultivated and wild rice (Huang et al. 2012;3,000 Rice Genomes Project 2014), and analyzed this data set with a variety of population genomics tools. PCA of population diversity placed the aromatic accessions between japonica and the other cultivated groups along the top two axes of variation ( fig. 2a), while aromatic was FIG. 6.-Local ancestry inference for 34 aromatic individuals. Each row corresponds to a single chromosome, that is, each individual is represented by two rows. Columns represent polymorphic sites; the x-axis indicates the number of sites examined (not their physical location). Purple-japonica ancestry; yellow-non-japonica (Indian) ancestry; (a-l) chromosomes 1-12. Civ a n et al.
clearly separated from all other rice along the third axis of variation ( fig. 2b), in agreement with previous observations (Zhao et al. 2011). Based on the PCA, it is difficult to conclude whether the position of the aromatic cluster is due to genetic interaction between two groups, or due to the limitations of 2D projections of a multidimensional space. Genome-wide scans of nucleotide diversity and pairwise F ST confirmed that parts of the aromatic and japonica genomes are almost identical (supplementary fig. S3, Supplementary Material online), indicating common ancestry. Nonetheless, the F ST scans and the joint allele frequency spectrum ( fig. 3) showed that a substantial fraction of aromatic genome is dissimilar from japonica and of unknown provenance.
The maternal lineage record of the chloroplast genomes has previously demonstrated deep divergence between indica and japonica (Kawakami et al. 2007;Kumagai et al. 2016;Tong et al. 2016). Our results show that most aromatic accessions carry a chloroplast lineage that is not present in or derived from japonica, but is relatively frequent in other rice groups (supplementary table S2 and fig. S7, Supplementary Material online), strengthening the hypothesis that japonica is not the sole ancestor of aromatic. Further analyses of the nuclear variation revealed that the major aromatic alleles that are absent in japonica can often be found in indica and aus (accounting for 55.4% and 63% of such alleles, respectively), but their highest fraction is present in the wild superpopulation (86.8%). After integrating the chloroplast and nuclear diversity data, we identified a group of 30 accessions that best represent the extant lineages of the non-japonica ancestor of aromatic rice ( fig. 5). This group consists of some wild accessions sampled mainly from the Indian subcontinent, together with a few aus accessions. Treating this subset of 30 accessions as the non-japonica source of the admixed aromatic gene pool, per-individual local ancestry suggested that 7 [29-47% (mean 36%)] of the aromatic genome is derived from the non-japonica source. Local ancestry at selected domestication loci (supplementary table S3, Supplementary Material online) indicates that the non-japonica ancestor did not have a domesticated phenotype, since aromatic accessions inherited most of the crucial haplotypes from japonica. Interestingly, the Sh4 gene is an exception from this pattern, as it is assigned non-japonica ancestry in the majority of the aromatic accessions. This result is consistent with the observation of multiple haplotypes at the domestication loci in O. sativa (Wang et al. 2018) and suggests that the recessive sh4 allele was already present in the non-japonica ancestor. It is possible that the seed shattering locus was not under selection during the admixture event and became fixed later (Ray and Chakraborty 2018).
Our population genomics approach therefore shows that aromatic rice arose from hybridization between japonica and a wild rice, related to aus, that was local to the Indian subcontinent. A recent preprint by Choi et al. (2018) reported a phylogenomic approach to resolve the origin of the aromatic cultivar Basmati 334, by comparison with a selection of highquality Oryza sp. genomes. By counting alternative topologies of gene trees across the genome in a three-species phylogeny, they found that 51.3% of the informative gene trees group Basmati 334 with cv. Nipponbare (japonica), while 39.7% group Basmati 334 with an O. rufipogon accession. When the three-species phylogenetic reconstruction involves Basmati 334 with the representatives of aus (cv. N22), and indica (cv. R498), 53.4% of the informative gene trees group aus with indica, while 26.5% group aus with Basmati 334. The authors concluded that this indicates admixture events between Basmati 334, aus and O. rufipogon. Although this phylogenomic approach, involving only a handful of genomes, cannot provide detailed insights into the demography and geography of the proposed admixture events, the results of Choi et al. (2018) are consistent with our findings.
The results that we report, together with our previous work on the population origins of indica and aus (Civ a n et al. 2015; Civ a n and Brown 2018) offer a new perspective on the genetic identity of the early rice cultivated on the Indian subcontinent. According to the "proto-indica hypothesis" (Fuller et al. 2010;Fuller 2011), early rice exploitation on the Indian subcontinent was based on predomesticated indica grown in the Ganges region, with the fully domesticated indica emerging as a result of hybridization with domesticated japonica, following the spread of the latter from East Asia (Sang and Ge 2007;Huang et al. 2012;Choi et al. 2017). However, the hypothesis that the native Indian rice was indica-like is inconsistent with recent results showing that indicaspecific variants are usually found in wild populations from Thailand and the Brahmaputra valley, while the aus group is more similar to the central Indian wild population (Civ a n et al. 2015; Civ a n and Brown 2018) (supplementary fig. S6c, Supplementary Material online). Evidence of japonica introgressions in the indica genome is also ambiguous (Civ a n and Brown 2018; Wang et al. 2018). Importantly, archaeobotanical identification of indica relies on the grain length/width ratio , which is a poor proxy for classification of modern germplasm (Morishima and Oka 1981). Moreover, archaeobotanists do not distinguish aus from indica (Fuller et al. 2016), and the differentiation of aus and the morphologically diverse aromatic group (Islam et al. 2016;Lahkar and Tanti 2017) from indica and japonica is beyond the discriminating power of rice morphometrics. These archaeobotanical limitations lend further uncertainty in the proto-indica hypothesis.
The results presented here and previously (Civ a n et al. 2015; Civ a n and Brown 2018) indicate that the wild rice populations south of the Himalayas were ancestral to aus and the nonjaponica fraction of aromatic, but not to indica. We conclude that the domestication phenotype of aus arose independently on the Indian subcontinent (Civ a n and Brown 2018), followed by a later influx of the japonica population some 4,000 years ago (Petrie et al. 2016;Bates et al. 2017), the latter accompanied by the emergence of aromatic through hybridization with wild rice along the foothills of the Himalayas. Japonica also interacted with aus, though to a much lesser extent (e.g., donating a large portion of the chromosome 7) (Civ a n and Brown 2018) and the diversity of cultivated rice in the region was further enriched by the spread of indica, which entered the Indian plains either from the Brahmaputra valley or from Southeast Asia. Each these groups largely maintained their genetic integrity during the following millennia of cultivation, probably due to cross-compatibility barriers and differences in agroecological settings.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.