Evolution of Aspergillus oryzae before and after domestication inferred by large-scale comparative genomic analysis

Abstract Aspergillus oryzae is an industrially useful species, of which various strains have been identified; however, their genetic relationships remain unclear. A. oryzae was previously thought to be asexual and unable to undergo crossbreeding. However, recent studies revealed the sexual reproduction of Aspergillus flavus, a species closely related to A. oryzae. To investigate potential sexual reproduction in A. oryzae and evolutionary history among A. oryzae and A. flavus strains, we assembled 82 draft genomes of A. oryzae strains used practically. The phylogenetic tree of concatenated genes confirmed that A. oryzae was monophyletic and nested in one of the clades of A. flavus but formed several clades with different genomic structures. Our results suggest that A. oryzae strains have undergone multiple inter-genomic recombination events between A. oryzae ancestors, although sexual recombination among domesticated species did not appear to have occurred during the domestication process, at least in the past few decades. Through inter- and intra-cladal comparative analysis, we found that evolutionary pressure induced by the domestication of A. oryzae appears to selectively cause non-synonymous and gap mutations in genes involved in fermentation characteristics, as well as intra-genomic rearrangements, with the conservation of industrially useful catalytic enzyme-encoding genes.


Introduction
Aspergillus oryzae is an industrially important species mainly used in the manufacture of fermented foods in East Asia because of its strong amylase and protease activities. 1 Particularly, in Japan, tane-koji (seed rice malt) manufacturers use various strains that are sold to companies producing fermented foods. These strains show diversity in colour and fermentation function and are managed for different applications such as in sake, miso (soybean paste), and shoyu (soy sauce). However, the relationship between the diversity of A. oryzae species and genetic factors remains unclear.
In 2005, the whole genome of A. oryzae RIB40, a wild-type strain, was sequenced. 2 Comparative genomic analysis of the whole genomes of Aspergillus nidulans and Aspergillus fumigatus revealed that the A. oryzae genome was 7-9 Mb larger. 2,3 However, genes in newly acquired regions are only minimally expressed under normal conditions, 4 and most of their functions remain unknown, particularly for genes not directly involved in fermentation.
Aspergillus flavus and A. oryzae are genetically very closely related, with their genomes showing 99.5% similarity in coding regions, 5 and numerous comparative analyses have been performed between these species. Aspergillus flavus is an important species linked to food safety, as some strains produce fungal toxins, particularly aflatoxin, 6,7 and it has historically been distinguished from A. oryzae based on morphological differences and toxicity. 7,8 In addition, some A. oryzae strains contain all or parts of the aflatoxin biosynthetic gene cluster, although they are non-aflatoxigenic. 6 Some researchers suggested that A. oryzae can be detoxified and differentiated from A. flavus by domestication. 1,9,10 Based on the phylogenetic analysis of 11 genes, 9 comparative analysis of the aflatoxin gene cluster, 11,12 and single-nucleotide polymorphism (SNP) analysis of the whole genome, 13 A. oryzae was shown to form a monophyletic clade derived from one clade of A. flavus.
Aspergillus oryzae and A. flavus have long been considered asexual species with no sexual reproduction cycle. 14 However, recent studies of A. flavus revealed that sexual reproduction occurs in laboratory and field environments. 15,16 Genome analysis also showed that the two species contain a nearly complete gene set necessary for sexual reproduction. 3,17 All strains of A. oryzae and A. flavus possess one mating type (MAT type) locus in the genome, at which either MAT1-1 or MAT1-2 is encoded. 17,18 However, complete sexual reproduction has not been confirmed in A. oryzae. Breeding currently carried out by tane-koji manufacturers utilizes a single strain with mutations or recombination, but crossbreeding has not been successful. Genome analysis suggested that recombination occurred between the ancestors of A. oryzae based on the linkage disequilibrium between MAT types and the phylogeny of a single gene. 19 In this study, to uncover genomic diversity and evolutionary relationships among A. oryzae isolates, we acquired 82 industrial strains from five independent Japanese tane-koji manufacturers in different locations and conducted whole-genome sequencing to determine their draft genomes. For the classification of these strains, we performed orthologue clustering of predicted genes from each genome, phylogenetic tree inference of the chromosomal genome, and chromosome recombination analysis. Through these analyses, we hypothesized that A. oryzae strains have undergone multiple inter-genomic recombination events between A. oryzae ancestors, and that evolutionary pressure by A. oryzae domestication is extremely limited to intra-genomic mutations and rearrangements. Moreover, we identified genes that are mutated/duplicated/deleted within clades, which might reflect the fact that Japanese tane-koji manufacturers have passaged their strains to prevent changes in industrially useful traits in parallel with breeding.

Materials and methods
A full description of the methods, including software versions and parameters, is available in Supplementary Data ('supplementary_ methods.pdf').

Sample collection and DNA preparation
For genomic sequencing, 82 A. oryzae and three Aspergillus sojae (as an out group) industrially used strains were collected from five independent tane-koji manufacturers in Japan (Supplementary Table S1). Tane-koji manufacturers have their own isolates and have not shared them for several decades. Whole genomic DNA was extracted using 'Extraction method5'. 20 Yatalase was used for some samples (Supplementary Table S1).

Genome sequencing and assembly
For genome assembly, fragmented genome libraries were prepared based on 350 bp (for run no. 1) and 550 bp (for run no. 2-5) on average and sequenced on an Illumina HiSeq2500 system using 150 bp (for run no. 1) and 250 bp (for run no. 2-5) paired-end runs. Quality filtering and assembly of the paired-end reads were performed with Platanus. 21 The scaffolds aligned to bacterial genomes or the mitochondrial genome of RIB40 were removed. The reference primer sequences for the MAT type 17 were mapped to the genome sequences with bowtie2. 22

Gene prediction and orthologous clustering
Next, 152 genomic scaffolds (85 from our samples and 67 from NCBI GenBank) of the newly sequenced or NCBI GenBank Aspergillus strains were used (Supplementary Table S2). Gene coding regions were predicted using two methods, namely, GeneMark-ES 23 and AUGUSTUS 24 for ab initio prediction, and GMAP 25 for referencebased prediction, and these were combined with EVidenceModeler. 26 The predicted protein sets were evaluated with BUSCO. 27 Orthologue clustering was performed with OrthoFinder. 28 Orthogroups (OGs) were annotated with GhostKOALA 29 for protein function and InterProScan 30 for protein motifs/domains.

Comparative genomics
Alignments of degapped gene sequences (DGSs) of single-copy OGs (SCGs), which were common to 152 protein sets, were generated with MAFFT 31 and tandemly concatenated. Maximum likelihoodbased phylogenetic inference was performed with RAxML. 32 Similarly, a concatenated gene tree of 19 SCGs in the aflatoxin biosynthetic cluster was generated. To test the neutrality of the mutation, revised coding sequences were generated from the alignments by clade/species and synonymous, non-synonymous, and gap mutations were counted. Chromosomal duplications and deletions were inferred by direct read mapping to the RIB40 genome with bowtie2. Depths were calculated with samtools. 33 Table S2). The two draft genome sizes were not significantly different from those reported previously. 2,34 All samples were sequenced at depths of 100, but the depths of some samples from runs no. 3, no. 4, and no. 5 showed relatively low average coverage of the final scaffolds because the genomic DNA of Actinobacteria (Corynebacterium) used to produce yatalase was contaminated.

Gene prediction and evaluation
Gene prediction for our sample showed that the number of predicted genes of A. oryzae and A. sojae were 11, 196-11, 716 and 13, 309-13, 317, respectively (Supplementary Table S2, column S). To confirm the accuracy of assembly and gene prediction, the predicted gene set and reference gene set were analyzed with BUSCO. In the complete genome of RIB40 (GCF_000184455.2), the score of the gene set was 98.6% in our prediction pipeline.

Phylogenetic tree with concatenated genes
The length of the sequence of the concatenated DGSs of 4,361 SCGs was 5,677,852 columns (bp), corresponding to approximately 15% of the total genome length. We confirmed that A. oryzae was monophyletic and nested in a clade of A. flavus ( Fig. 1, original full figure: Supplementary Fig. S1). This is consistent with the results of previous studies. 9,[11][12][13] In contrast, some putative A. flavus strains (WRRL1519, NRRL35739, IFM54693, IFM57535, IFM59975, IFM60655, and 2017 Washington T4) were nested in the A. oryzae clade. However, WRRL1519, 35 NRRL35739, and the IFM strains 36 were confirmed as non-aflatoxigenic with A. oryzae-type aflatoxin gene clusters (also confirmed in this study). Thus, considering the location in the phylogenetic tree and toxigenicity, these strains were reclassified as A. oryzae.
Industrial strains of A. oryzae formed several clades, showing an intra-cladal DGS dissimilarity within 0.01%. For further analysis, we defined clades A-H to which Japanese industrial strains of A. oryzae belong. Although koji manufacturers do not share their strains, interestingly, many strains clustered in the same clade. Particularly, clade E included several strains provided by four different tane-koji manufacturers, one strain collected in Thailand (BCC7051), 37 and one strain as a clinical isolate from Japan. In contrast, some Chinese and Korean industrial strains, namely, AS3.951, 100.8/3.042 (China), 38 BP2-1, and the two SRCM strains (Korea), formed a group with a relatively large distance from the Japanese industrial strains. These strains were very closely related to each other but should be classified into different clades because of their different MAT types.
Industrial A. oryzae strains were not clustered based on their industrial uses. However, all strains in clades D, E, and H (and a single strain TK-29) were those used in shoyu production. Strains used for other purposes such as in miso, sake, sake/miso, mirin, or other products were mixed in the same clade, which is consistent with classification by appearance and enzymatic activity.

MAT type
Of the 152 genomic sequences, MAT regions were uniquely detected as either MAT1-1 or MAT1-2 from 146 sequences but not from six downloaded sequences (Fig. 1). We examined the presence/absence of genes in each strain and identified OG0012491/ AO090020000089 as MAT1-1 and OG0012281 þ OG0012282 as the MAT1-2 gene (these two were contiguous, and thus possibly merged into one gene). The identified region of MAT1-1 was consistent with that reported previously. 19 The genes unique to each MAT type were only these MAT genes. All strains included in the same clade showed the same MAT type. Linkage disequilibrium was observed between the topology of the phylogenetic tree of concatenated genes and MAT type (e.g. A&B and G&H). This strongly suggests that clade divergence was caused not by mutation but rather by the recombination of different strains.

Putative genetic recombination
In addition to MAT type, as an example of linkage disequilibrium, we also found that the phylogenetic tree of the concatenated gene showed a different topology from those of individual genes. For example, on comparing the phylogenetic tree of ytk6/ OG0002894/AO090023000584 to that of mdlB/OG0004275/ AO090701000644, the combinations of clades with the same sequence were different ( Supplementary Fig. S2-1A and B). In addition, the phylogenetic trees of adjacent genes showed a similar topology.  Fig. S3-5). This means that clades closer on the concatenated gene tree had larger proportions of homologous regions, and the pattern of syntenic regions depended on the clade to be compared. We also detected locally exclusively homologous regions among distant clades ( Supplementary Fig. S2-2). This suggests that clade divergence was caused by multiple recombination events in multiple strains, at least as many times as the number of clades, and not by passage mutation in only one ancestral strain.
In the A. oryzae clade, most genome regions were represented as a mixture with other A. oryzae clades; particularly, some were highly homologous to the A. flavus clade and vice versa. Thus, we cannot completely exclude the possibility of recombination between two species after speciation. However, because the exclusively homologous regions between the two species were very small, the strain used in this study might have been insufficient. Because our samples were biased toward practically used strains, information about wild-type strains of A. oryzae is lacking. Thus, the genomic mosaicism shown in this study does not directly represent the frequency of recombination. By analyzing more strains, the mosaic structure might be simplified and recombination processes in the clades could be clarified.
In addition, because our samples were human-managed strains, it might be possible to track whether hybridization occurred among them. However, none of the clades appeared to be expressed as a mixture of two other clades. In contrast, there were some A. flavus strains for which genomes were represented by a mixture of two or three other strains ( Supplementary Fig. S3-2). Interestingly, the genome of A. oryzae TK-27, a strain maintained by a tane-koji manufacturer for more than six decades and the use of which started in the 2000s, had an unusual structure; 85% or more of the genome was homologous to clade G, to which the MAT types were also identical, while some regions were closer to those of the other clades ( Supplementary Fig. S3-3). A Chinese strain, AS3.951, and the strains in clade K showed a similar pattern, although they had different MAT types ( Supplementary Fig. S3-4). However, they were not a clear mixture of any strain pair. Based on these results, we consider that the domestication process influenced the evolution of A. oryzae mostly through rearrangements within a single genome, and rarely via sexual recombination.

Evolutionary hypothesis
Our results highlighted the fact that the ancestor of A. oryzae underwent multiple complex recombination events. In contrast, considering that no simple recombination mixtures between strains from the tanekoji manufacturer were observed and many strains belonged to the same clade with wild-type RIB40, clinical isolates, or the strain from Thailand, humans might have chosen strains from nature as suitable for brewing and maintained them without crossbreeding. Therefore, the influence of domestication on the evolution of A. oryzae likely appeared only after clade divergence, suggesting that the domestication process does not contribute to genetic recombination.
In previous reports, vegetative compatibility group (VCG) divergence in A. flavus was estimated to have occurred 50,000-189,000 years ago. 19 VCG is a self-identification system, and there are at least 13 VCGs in A. flavus. 39 Considering that A. oryzae is monophyletic and nested in one of the clades of A. flavus based on the phylogenetic tree, and that A. oryzae is one type of VCG, the speciation of A. oryzae and A. flavus might have occurred contemporaneously. Domestication and industrial utilization of koji (rice malt) began in China over 3,000-2,000 years ago, and stocking and selling of these products began in Japan 700-500 years ago. 1 Therefore, domestication likely began influencing the evolution of A. oryzae in the last 3,000 or 700 years after clade divergence (Fig. 3).
In a previous study, mutational pressure was estimated by comparing the SNP frequencies of A. flavus and A. oryzae 13 or by comparative genomics of RIB40 and RIB326. 40 According to such studies, mutations tend to accumulate in non-synteny blocks (NSBs) and sub-telomeric regions. However, these estimates are considered to reflect the influence of selection pressure in nature during the period from species or clade divergence to domestication. Therefore, we focused on recently accumulated mutations or gene duplications/ deletions by comparing them within each clade.

Intra-cladal gene variants
We calculated the numbers of inter-and intra-cladal mutations in coding sequences for 14,711 SCGs present in 101 strains with BUSCO scores of 96 or more (Supplementary Method 2.4.5). There were a few intra-cladal mutations; in total, 265 synonymous mutations, 528 non-synonymous mutations, and 93 gap mutations were found in 201, 453, and 70 genes, respectively (Supplementary Table  S4, 1-3), of which synonymous mutations were not prevalent enough for neutrality tests within each gene (Supplementary Table  S4-4). Remarkably, however, the odds of the total number of intracladal non-synonymous/synonymous mutations were significantly higher than those of inter-cladal mutations in A. oryzae and much higher than those of intra-species mutations in A. flavus, with a similar tendency for gap/synonymous mutations (Table 1). This tendency was almost the same between synteny blocks (SB) and NSBs with A. nidulans/A. fumigatus (Supplementary Table S4 -5).
These results suggest that the influence of domestication, reflected by intra-cladal mutations and somewhat by inter-cladal mutations, causes the accumulation of non-neutral mutations changing gene functionality, such as loss of function. Even though there were no genes with sufficient number of inter-cladal mutations for the calculation of odds or P values (Supplementary Table S4-4), we found 1-4 intra-cladal non-synonymous or gap mutations among all the named genes ( Table 2). For example, both intra-cladal non-synonymous and gap mutations were found in two annotated genes, specifically wA/OG0003175/AO090102000545 and laeA/ OG0001970/AO0900030000489. A. oryzae wA is an industrially important polyketide synthase gene, an orthologue of A. nidulans wA, which is required for the synthesis of a green pigment. 41 The down-regulation of the A. oryzae wA gene leads to the production of white conidia. 42 Non-synonymous or gap mutations in wA were only found in some white mutants of clade A, C, and G used for miso production wherein the control of product color is important.  laeA is a global regulator of secondary metabolism in Aspergillus, 43 and it is also involved in the production of kojic acid. 44 Interestingly, a few mutations in catabolic enzymes were considered industrially useful; moreover, no intra-cladal non-synonymous or gap mutations were found in known proteolytic enzyme-encoding genes. In contrast, there were several non-synonymous/gap mutations in genes involved in expression, signaling, secondary metabolites, secretion/transporters, and cell traits (Table 2). Therefore, the domestication process might have contributed to altering the traits of strains while maintaining the activity of industrially useful catalytic enzyme-encoding genes by applying selective pressure to those peripheral genes rather than enzyme-encoding genes. This might reflect the fact that Japanese tane-koji manufacturers have continued to passage their strains to prevent changes in their traits in parallel with breeding.

Intra-cladal gene duplication
We estimated gene duplication/deletion by direct read mapping and calculation of the normalized depth, equal to the copy number of genes (Supplementary Table S5). As a result, we found 221 OGs (117 SCGs) with different estimated copy numbers in at least one clade and 179 OGs (20 SCGs) in more than two clades. We performed statistical analysis on KEGG BRITE annotation but found no significant feature (number of total detection > 1, P ¼ 0.05, Fisher's exact test), suggesting that selection pressure for gene duplication is not explained by gene function. The estimated copy numbers of OG0000041/PF03221/Tc5 transposase and OG0000321/PF14529/ endonuclease-reverse transcriptase, which are transposon-derived genes, had changed more than twice as compared with that in RIB40; in particular, the clade B and G strains showed 15-20 times the estimated copy number. Furthermore, rRNA genes exhibited a wide range in estimated copy number (e.g. it varied 13-48-fold among eight strains of clade C without bacterial contamination from yatalase). However, the selective pressure on rRNA gene duplication is hard to estimate, because a previous report showed straindependent copy number variation in rRNA genes in A. fumigatus by quantitative PCR, 45 suggesting that the duplication of rRNA genes is likely to occur in the natural environment. The estimated copy number of the tRNA gene also tended to vary, and intra-cladal change was observed at 102/276 tRNA gene loci.
As a general trend, changes in the sequence depth were more frequent in the sub-telomeric region ( Supplementary Fig. S4-1, Chr. I, Fig. S4-2). Change in depth was also observed in the sub-centromeric regions, which represents changes in the number of non-coding repeats in the unassembled region. In contrast, in some strains, a 60-70-kbp region containing tRNA genes (e.g. Supplementary Fig. S4-1, Chr. III/V) was duplicated, which might have drastically altered transcription or translation.
Three copies of a-amylase (amyA/amyB/amyC/OG0011956) have been detected in RIB40. 2 We found both intra-cladal and inter-cladal variation in the duplication number of OG0011956 (a-amylase) as follows: three to four copies in clade A, two to three copies in clade D/E, two copies in clade B, and four copies in clade G. Similarly, we found one copy in TK-24/TK-29 and four copies in TK-27. Moreover, the A. sojae strains TK-83, TK-84, and TK-85 had one copy, which is consistent with previous studies. 34,46

Aflatoxin biosynthetic gene cluster
The types of aflatoxin biosynthetic gene clusters in A. oryzae, including those in all of our samples (confirmed as non-aflatoxigenic) and strains estimated to be A. oryzae in this study, were classified into three groups as defined in a previous study. 6 Moreover, we found that Kusumoto Group 3 was nested in Group 2, while Group 1 was located far from the other two ( Supplementary Fig. S5). Aflatoxin cluster sequences were not clustered by their toxigenicity, suggesting that detoxification in A. oryzae/A. flavus had occurred in parallel. This is consistent with a previous report inferring that the selective pressure against toxins was lost in nature. 19 Because the non-toxicity of A. oryzae is phylogenetically guaranteed, tane-koji manufacturers might have distinguished A. oryzae from A. flavus based on their growth ability on rice and incidentally selected aflatoxicity.

Concluding remarks
We acquired 82 industrial strains from five Japanese tane-koji manufacturers and conducted whole-genome sequencing to determine their draft genomes. Through phylogenetic tree-based inferences of the chromosomal genome and chromosome recombination analysis, we showed that A. oryzae strains have undergone multiple intergenomic recombination events between A. oryzae ancestors. However, sexual recombination among domesticated species did not appear to have occurred during the domestication process, at least in the past few decades; therefore, we hypothesized that evolutionary pressure introduced by the domestication of A. oryzae is extremely limited to intra-genomic mutation and rearrangements. Through intra-and inter-cladal comparative analysis, we showed that the evolutionary pressure of domestication selectively caused nonsynonymous and gap mutations and intra-genomic recombination. Our results suggest that the domestication process might have contributed to altering strain traits while maintaining the activity of industrially useful catalytic enzyme genes by applying selective pressure to peripheral genes involved in fermentation rather than the enzyme-encoding genes themselves.
Our study provides suggestions on the relationship between the evolution and domestication of A. oryzae, and importantly, the whole genomic data and phylogenetic tree will help to develop breeding methods based on sexual reproduction using industrial strains.