The Distant Siblings—A Phylogenomic Roadmap Illuminates the Origins of Extant Diversity in Fungal Aromatic Polyketide Biosynthesis

Abstract In recent years, the influx of newly sequenced fungal genomes has enabled sampling of secondary metabolite biosynthesis on an unprecedented scale. However, explanations of extant diversity which take into account both large-scale phylogeny reconstructions and knowledge gained from multiple genome projects are still lacking. We analyzed the evolutionary sources of genetic diversity in aromatic polyketide biosynthesis in over 100 model fungal genomes. By reconciling the history of over 400 nonreducing polyketide synthases (NR-PKSs) with corresponding species history, we demonstrate that extant fungal NR-PKSs are clades of distant siblings, originating from a burst of duplications in early Pezizomycotina and thinned by extensive losses. The capability of higher fungi to biosynthesize the simplest precursor molecule (orsellinic acid) is highlighted as an ancestral trait underlying biosynthesis of aromatic compounds. This base activity was modified during early evolution of filamentous fungi, toward divergent reaction schemes associated with biosynthesis of, for example, aflatoxins and fusarubins (C4–C9 cyclization) or various anthraquinone derivatives (C6–C11 cyclization). The functional plasticity is further shown to have been supplemented by modularization of domain architecture into discrete pieces (conserved splice junctions within product template domain), as well as tight linkage of key accessory enzyme families and divergence in employed transcriptional factors. Although the majority of discord between species and gene history is explained by ancient duplications, this landscape has been altered by more recent duplications, as well as multiple horizontal gene transfers. The 25 detected transfers include previously undescribed events leading to emergence of, for example, fusarubin biosynthesis in Fusarium genus. Both the underlying data and the results of present analysis (including alternative scenarios revealed by sampling multiple reconciliation optima) are maintained as a freely available web-based resource: http://cropnet.pl/metasites/sekmet/nrpks_2014.


Introduction
The last 5 years of fungal genomics have been fruitful. We see mass characterization of fungal genomes, with increasing coverage given to both economically important and evolutionarily divergent lineages of eukaryotic microorganisms (Grigoriev et al. 2012;http://1000http:// .fungalgenomes.org, last accessed May 30, 2015. Availability of next-generation sequencing methods sped up the pace of novel genomic projects, whereas the increased awareness and use of deletion libraries and heterologous expression systems allow characterization of larger sets of genes (e.g., Bergmann et al. 2007;Nielsen et al. 2011;Ahuja et al. 2012). However, our knowledge of evolutionary basis underlying the extant genetic diversity is still incomplete.
In particular, the origins of diversity in fungal polyketide synthesis have been the subject of intense investigations, even prior to the advent of whole-genome sequence analysis. Previously, the landmark paper by Kroken et al. (2003) has postulated that horizontal transfers are not required to explain major parts of observed variability in fungal iterative polyketide synthases. However, since then the increased sampling of genomic data from diverse taxonomic groups has enabled researchers from multiple groups to consider evidence in support of individual HGT (horizontal gene transfer) scenarios. Salient examples include the aflatoxin cluster (Slot and Rokas 2011), the fumonisin cluster (Khaldi and Wolfe 2011), as well as the ancient origin of partially reducing polyketide synthases from transfer originating in bacteria (Schmitt and Lumbsch 2009). Thus, the availability of data and advances in phylogenomics have permitted elucidation and study of individual scenarios contributing to the present-day diversity of fungal secondary metabolism. This "classic" approach to proposing and corroborating evolutionary scenarios typically (Schmitt and Lumbsch 2009;Khaldi and Wolfe 2011;Slot and Rokas 2011;Proctor et al. 2013) proceeds by thorough analysis of multiple individual reconstructions of gene histories in context of multiple reference species. The analysis is then followed by serial likelihood ratio tests aiming to partition the adjacent genes according to their associated phylogenies supporting one or more origin scenarios (e.g., duplication, horizontal transfer, and loss). The discussion of evolutionary events leading to extant diversity is thus limited to supporting the cases separately. Conversely, larger scale statistical modelling does not capture particular, individual events leading to known biosynthetic activities (e.g., CAFE analysis of gene family expansions/contractions in multiple lineages; Bie et al. 2006;Bushley and Turgeon 2010).
Consequently, what is still lacking are "phylogenetic roadmaps"-resources that strive to reconcile species and gene histories explicitly. These, by definition, will not provide a conclusive proof in favor of one or the other scenarios but can show the parts of family evolutionary history that are better explained with or without recourse to horizontal transfer versus duplication. A large scale inference followed by reconciliation can enumerate well-supported scenarios given different assumptions as to the cost of particular evolutionary events (i.e., transfer, duplication, loss). Based on maximum parsimony reconciliations we can identify monophyletic clades clearly dominated by speciation events (where orthologous relationships likely point to a conserved chemical structure of the base polyketide chain), as well as indicate potential transfers and pinpoint alternative donor-recipient scenarios. Fast reconciliation methods are now available (Doyon et al. 2011;Bansal et al. 2012), with scalable algorithms for efficiently sampling the space of optimal solutions at random (Bansal et al. 2013), as well as for finding the most likely scenario in a probabilistic framework (ALE, Amalgamated Likelihood Estimation; Szollosi, Rosikiewicz, et al. 2013).
Our work's goal was to provide such a "proof-of-concept" roadmap for a large set of nonreducing polyketide synthases (NR-PKSs), involved in biosynthesis of diverse, well-characterized aromatic compounds (e.g., Cox 2007;Sanchez et al. 2012). This reconstruction and reconciliation was further supplemented by statistical analysis of genomic context and gene structure. For this, it was necessary to consider tradeoffs between scalability and accuracy of methods-choosing methods which produce best results in reasonable computation time (Anisimova et al. 2013) and provide a well-laid foundation for continuous inclusion of new (or corrected) data.

Materials and Methods
The entire workflow of conducted analysis is summarized on figure 1. Individual stages are described in detail below. All of the sequences and other supplementary material (i.e., trees, alignments, and annotations) referenced in this study can be found in the supplementary material, Supplementary Material online, as well as on the MetaSites database website, maintained by the authors: http://cropnet.pl/metasites/sekmet/ nrpks_2014.

Preparation of Reference Genomes
Genomes of 149 model fungi were gathered from the National Center for Biotechnology Information (NCBI)/ GenBank (Benson et al. 2015), JGI-DOE/MycoCosm (Grigoriev et al. 2012), and Ensembl/Fungi (Flicek et al. 2014) repositories. This reference set has also included the PKS lacking genomes (mainly Saccharomycetes and Schizosaccharomycetes). The genome sequence of Blumeria graminis f. sp. hordei was taken from BluGen powder mildew genome database (http://www.blugen.org; Spanu et al. 2010, last accessed January 4, 2015. The complete list of genomes and data sources can be found in the supplementary table S1, Supplementary Material online. All of the reference genomes were curated to ensure that reading frames and gene structure matched the recorded protein translation for all proteincoding genes.

Annotation of Domain Architecture
All protein-coding genes within the reference genomes were annotated with Hidden Markov Models, (HMMs) from Pfam database version 27 (Finn et al. 2014). Additionally, the mixed kingdom (Foerstner et al. 2008) and the fungi-specific (Delgado et al. 2012) models were created on basis of the alignments supplied with the original papers. These covered the typical NR-PKS domains (ketoacyl synthase, starter and main acyl transferase, thioesterase, methyltransferase, phosphopanthotein binding site, NAD-binding reductase). The HMM searches were conducted with the HMMer 3.0 hmmscan program (Eddy 2009) and were followed by double checking against the web-based version of NCBI Conserved Domain Database (Marchler-Bauer et al. 2015). The product template (PT) domain assignment was based solely on checking against hot dog fold superfamily signature in NCBI/CDD.

Prescreening for NR-PKS Subset within Reference Genomes
The selection of full NR-PKS complements from model fungal genomes was supported by three independent criteria: The clustering of ketoacyl synthase sequences based on all versus all Basic Local Alignment Search Tool (BLAST) comparisons (pairwise similarities between individual sequences), the

GBE
presence of architectural similarities (based on matching domain fingerprints), and inclusion in a monophyletic clade with all reference NR-PKS sequences during the preliminary phylogeny reconstruction step.
Initially, the set of ketoacyl synthases was selected from above-mentioned model genomes (including Caenorhabditis elegans as model outgroup), as well as PDB (Rose et al. 2013) and UniProt/SwissProt (UniProt Consortium 2014) sequences. The PDB (432 sequences) and SwissProt (611 sequences) subsets were included only during the clustering and initial selection of fungal PKS cluster, prior to preliminary phylogenetic tree reconstruction.
In the prescreening, only the sequences with at least one ketoacyl synthase domain occurrence (based on either fungal or mixed kingdom HMM matches, of length exceeding the threshold of 50 amino acids) were retained (3,067 in total). To further distinguish between the fragmentary sequences of incomplete PKSs (whether due to mistakes in gene prediction/ annotation or due to evolutionary events) and the loosely associated unrelated ketoacyl thiolases with a partial KS domain signature, we conducted unsupervised clustering analysis with CLANS (Frickey and Lupas 2004). Clustering used a 10 À5 inclusion threshold for hit P values. Based on membership of functionally and structurally annotated PKSs, the single cluster containing both C. elegans FAS and all reference PKS sequences (bacterial, fungal and protistan, 1,659 sequences) was chosen for further analysis.
To ensure the correct selection of the NR-PKS subset, we then employed an additional step based on the FastTree (Price et al. 2010;multithreaded version). One hundred bootstrap iterations were done on replicates obtained with SEQBOOT (Felsenstein 1989). This extended majority rule consensus tree was based solely on the ketoacyl synthase domain. MAFFT-LINSI v. 7.2 (Katoh and Standley 2013) was used to align the sequences, the resulting alignment was pruned with "trimal" (Capella-Gutié rrez et al. 2009) at 70% occupancy threshold for columns.
The final set of NR-PKS homologs was chosen by extracting the monophyletic fungal clade delineated by lowest common ancestor of all reference (experimentally validated) nonreducing polyketide synthases (node with full support from bootstrap). The 414 representatives from model fungal genomes (discarding PDB and SwissProt sequences, but including the fumonisin HR-PKS outgroup sequence) were retained for gene tree inference based on presence of conserved KS-AT module.
No outliers or additional sequences were indicated by either phylogeny or domain architecture, save for Ustilago maydis PKS1/UM04105 (and its two counterparts from Ustilago hordei and Sporisorium reilianum) which lacks an acyl transferase domain and thus was not included in the final analysis. For a single experimentally characterized NR-PKS dbaI (ANIA_07903; Gerke et al. 2012), the sequence was manually updated to match revised AspGD gene model which translates to a full length PKS protein instead of the truncated version.

Gene Tree Construction
The final NR-PKS alignment was constructed by aligning the core KS-AT module with MAFFT-LINSI (Katoh and Standley 2013). This alignment was curated with T-COFFEE transitive consistency score analysis (Chang et al. 2014; default exhaustive "proba_pair" setting-columns scoring 2 or above were kept in the final alignment, in keeping with the results of the above-mentioned article). The filtered alignment (available as supplementary material S2, Supplementary Material online, in NEXUS format; mappings of sequence identifiers to loci/gene names are available as additional supplementary file S3, Supplementary Material online) was used for final phylogeny reconstruction.
As the protein sequence identity levels can be low and homoplasies are expected in the data set, Bayesian inference (BI) was conducted with PhyloBayes-MPI 4 chains 200,000 each, first 40,000 trees discarded as burn-in, every fifth tree sampled; two chains of best convergence chosen for further analysis; CAT-Poisson parameterization was used due to computational resource constraints). Chain convergence was assessed with pairwise comparison using "bpcomp" tool (Lartillot et al. 2009). The analysis was also carried out in maximum-likelihood (ML) framework with IQTREE 0.9.6 (Minh et al. 2013; 1,000 ultrafast bootstrap replicates; exhaustive nearest neighbor search setting "-nni5"; LG+G4+I+F model selected with IQTREE internal model testing). The Bayesian consensus tree and the underlying sampled trees were selected for reconciliation analysis with ALEml and DTL-RANGER; the resulting ALEml tree is hereafter referred to as "amalgamated." Reconstructed gene tree topologies are available in supplementary material, Supplementary Material online (ML, Bayesian, and amalgamated in supplementary files S4-S6, Supplementary Material online). The amalgamated tree with annotated gene structure and domain architecture is shown in supplementary figure S1, Supplementary Material online.
Throughout the text, we refer to the nodes of the amalgamated gene tree by the "g<number>" notation derived from postorder traversal of tree structure. Where the original BI tree or the species tree is referenced, the analogous "go<number>" or "so<number>" notations are used.

Species Tree Construction
The species tree was constructed based on 23 best scoring single-copy orthologs (see supplementary table S2, Supplementary Material online) with best topological scores (over 95%) as reported in FUNYBASE (Marthey et al. 2008). The topological scores were previously introduced by Aguileta et al. (2008) and measured concordance of ML trees based on individual orthologs with a reference supertree inferred on basis of over 120 single copy orthologs identified in that study. The ortholog protein sequences were aligned with MAFFT-LINSI and resulting alignments concatenated (the concatenated alignment is available in the supplementary file S7, Supplementary Material online, in NEXUS format). The gapped positions were curated using 70% occupancy threshold (-gt 0.7) in "trimal," resulting in 15,314 columns in the final alignment (available as supplementary file S2, Supplementary Material online). BI was conducted with PhyloBayes-MPI 1.5a . The inference consisted of 4 chains of 40,000 iterations, first 5,000 trees from each chain were discarded as burn-in, every fifth tree was sampled. As in the case of gene tree reconstruction, two chains of best convergence were chosen for further analysis, CAT-Poisson parameterization was used due to computational resource constraints.
Additional ML reconstruction was conducted with IQTREE. Individual protein models were chosen for each of the 23 genes based on ProtTest v3 (Darriba et al. 2011) results (corroborated based on IQTREE's own model testing). The reconstruction used ultrafast bootstrap with default stopping criterion and exhaustive NNI (Nearest Neighbor Interchange) search (-nni5 option), resulting in convergence after 100 ultrafast bootstrap iterations. Caenorhabditis elegans represented the outgroup used to root the tree.
To obtain the approximate chronogram, the dating was performed on PhyloBayes-MPI consensus tree. Relaxed, log-normal autocorrelated clock with soft bounds under a birth-death prior was used (as implemented in PhyloBayes 3.3f; Lartillot et al. 2009). The dating constraints were introduced based on Fungi/Animalia split dated at 983 Ma (based on Douzery et al. 2004), as well as number of additional constraints based on subsequent inquiries into fungal phylogeny (Sung et al. 2008;Gueidan et al. 2011;Ohm et al. 2012;O'Donnell et al. 2013

Tree Reconciliation
The resulting chronogram (dated species tree) and the ensemble of Bayesian gene trees were reconciled using the ALE approach, as implemented in ALE v.0.3 (40,000 trees discarded as burnin, every fifth tree sampled from both runs). To minimize numerical errors on the large data set, scaled version of the program was compiled with floating point calculations carried out on 128 bit numbers (as implemented by "boost::multiprecision" library "float128" type). The inferred transfers are summarized in table 1.
Additional support for transfers as well as alternative scenarios was explored by sampling the multiple optimal reconciliations with DTL-RANGER (Bansal et al. 2012). Here, the support for duplication/speciation/transfer events as well as mapping of individual events on the species tree was annotated on basis of 1) frequency of inferred events/ mappings across different transfer cost values and 2) highest transfer cost where transfer event is indicated (only for inferred HGT events), with fixed duplication and loss costs (DTL-RANGER dated version with parameters L = {1, 2, 3}, Á = 4, e{5, . . . , 40}). The rationale here is that the horizontal transfer is best supported where "regrafts" between cotemporaneous parts of species tree are both most evident (appear at highly penalized transfer cost) and most consistent (as supported for a range of decreasing cost thresholds). For the purpose of sampling, the reconciliations at random, the algorithms from Bansal et al. (2013) were used with a random sample size of 1,000 optimal solutions per each cost combination.

Synteny Analysis
To establish the enrichment or depletion of specific gene families/subfamilies in the syntenic context of NR-PKS core genes, we introduced an approach based on multiple applications of the Fisher exact test.
First, all homologs within +20/À20 genomic context were extracted and subjected to exhaustive, all against all BLASTP searches. The BLASTP expectation values were then used for clustering with MCL (Enright et al. 2002). Only (bidirectional) BLAST hits with E value less than 1E-10 and majority coverage (>50%) of the longer sequence were considered. The MCL inflation threshold was set to 1.4. The clustering parameters were chosen as a compromise between stringent clustering (E value) and observed property of high inflation thresholds erroneously breaking up the clusters for highly diverse sequences (Chan et al. 2013). For our analysis, the inflation threshold setting was also supported in the average values of silhouette width (Rousseeuw 1987), a cluster quality measure independent of predefined class labels which consistently presented a slight peak at the 1.4 setting.
The 2Â2 contingency tables were set up to contrast inclusion of NR-PKS gene in a given clade (vs. the rest of the tree), with presence/absence of a candidate homolog group member in the vicinity of the said gene. Thus, the resulting candidate homolog groups numbering ten or more sequences (133 clusters) were iteratively tested for association with all possible subtrees containing five or more leaves. As remarked above, this was done with Fisher exact test, corrected for multiple testing with Bonferroni correction (based on total number of tests for all candidate homolog groups). The local P-value minima were marked and test results corresponding to the most significant, nonoverlapping subtrees reported for each homologous group. The corrected P-value threshold for inclusion in reported results was set to 0.0001. The detected associations are summarized in table 2 and an example demonstrating the detection of one of the strongest associations (b-lactamase accessory enzymes) is included in supplementary figure S4, Supplementary Material online. An analogous approach was employed for the gene structure, where exon positions were mapped to the respective protein domain sequence alignments and counts calculated for consensus positions based on the said alignments. The starter acyltransferase (SAT), ketoacyl synthase (KS), main acyltransferase (AT), and PT domains were analyzed based on the realignment (MAFFT-LINSI) of corresponding sequence fragments detected earlier. Due to highly divergent sequences of C-terminal domains (methyltransferase, thioesterase, NADdependent reductase) and the apparent incompleteness of a fraction of gene models, these were excluded from splicing site analysis.

Visualization
The trees/dendrograms and annotation (gene structure, domain architecture, reconciled events) were visualized using custom Python scripts dependent on BioPython (Cock et al. 2009;Talevich et al. 2012) and ETE2 (Huerta-Cepas et al. 2010). Where final, annotated versions of published genomes had yet to appear in NCBI/GenBank, annotated gene names were derived following the practice of combining existing gene numbers with organism-specific prefix (locus prefix tag) based on codes available through genome sequencing project pages (http://www.ncbi.nlm.nih.gov/bioproject/, last accessed January 4, 2015).
The functional annotation for experimentally verified and described nonreducing polyketide synthases was gathered from available scientific literature manually (for complete list and descriptions, see supplementary table S4, Supplementary Material online). Classification of cyclization activities (C2-C7, C6-C11, C4-C9) originates from nomenclature used by Li et al. (2010). The structural model of PT domain was visualized with PyMol 1.7.2 (www.pymol.org, last accessed May 30, 2015).

Inferred History of NR-PKS Complement Is Strongly Supported
The reconstructed species tree (see fig. 2 for simplified representation showing predicted evolutionary events), based on 23 single copy orthologous genes, is largely in accord with previously published results, in particular the comprehensive phylogenomic analysis by Ebersberger et al. (2012) and alignment-free composition vector-based comparison by Wang et al. (2009). Notably, the set of selected orthologs supports the alliance of Dothideomycetes and Transfer events are marked through their respective numbers from table 1 (filled outlines indicate donors, hollow outlines indicate transfer acceptors). Salient broad taxa within higher fungi are highlighted by colored backgrounds (in clockwise direction: Basidiomycota, Leotiomycetes, Sordariomycetes, Dothideomycetes, Eurotiomycetes). Subscript on branches denotes predicted numbers of, respectively, duplications/deletions/genes for respective branch. For ease of reference, some nodes were collapsed (thickened branches, proportional to the number of species/strains). Duplications within the Agaricomycetes clade are not shown due to space constraints (see fig. 4). Eurotiomycetes postulated by both. The individual splits are well supported by both inference methods (BI and ML), with majority of splits within Bayesian consensus tree supported by both methods (see supplementary material, Supplementary Material online).
The reconstruction of NR-PKS phylogeny (the gene tree) was based on the conserved KS-AT fragment of the megasynthases. Again, majority of splits are well supported by both Bayesian and ML results. Notably, although approximately 15% of splits (60/413) present in the Bayesian consensus tree are challenged during ALE reconciliation, the subdivision into major clades (as depicted on fig. 3) is not modified.

Inferred Events Are Largely Supported by Both Reconciliation Methods
Within the limits of comparability, the predictions obtained with DTL-RANGER and ALE align well (see supplementary tables S5-S7, Supplementary Material online, for details). Expectedly, there is a marked decrease of reconciliation cost on the amalgamated topology (which resolves uncertainties in the original Bayesian consensus in favor of the species tree) with reported minimum costs on average 10% lower, across all investigated parameter combinations (see supplementary table S5, Supplementary Material online, for detailed comparison).
Based on the amalgamated topology (where direct comparison is possible) predicted duplications and speciations are largely the same for DTL-RANGER (across varying cost thresholds). In particular, for the following parameter combinations there is over 99% identity in predictions (costs given in <D, T, L> format for, respectively, duplication, transfer, and loss): <4, 10-12, 1>, <4, 18-20, 2>, and <4, 25-28, 3>. Analogous results were observed for mapping of the events to the corresponding nodes of the species tree for the abovementioned parameter values. Overall, although predictions for speciation and duplication nodes correspond to ALE results in overwhelming majority (peaking at 98-99%), this correspondence is worse for transfer predictions (peaking at 92%, considerably worse for many parameter values). The DTL-RANGER predictions on the original tree (compared over the 352 bipartitions shared between the original Bayesian consensus and the amalgamated topology) also align well, showing that simple reconciliation without rearrangement can capture at least part of the events correctly (see supplementary tables S6 and S7, Supplementary Material online, for details).
Pertinently, both the inferred existence and the tentative topological dating of majority of duplication events (~70%) are unperturbed regardless of the assumed parameters. Thus, our prediction of the ancestral nature of most underlying duplications is largely independent of detected transfers.

Biosynthesis of Orsellinic Acid Is an Ancestral Trait of Higher Fungi
The obtained results support ancient origins of biosynthesis of compounds derived from modification of one or more molecules of orsellinic acid, in all three ancestral clades (a, b, and g). The earliest diverging clade a (see figs. 3 and 4) encompasses both basidiomycete and ascomycete sequences, including ones with experimentally established propensity toward orsellinic acid biosynthesis (pks1-Coprinus cinereus, pks14-Fusarium graminearum). Likewise, clade b presents multiple examples of biosynthesis of derivatives of orsellinic acid as the first intermediate metabolites (6-MOS, tropolone, violaceol, mitorubrinic acid-clades a and b). A final argument is provided by the unequivocal placement of Ustilago/Sporisorium clade within the g clade of NR-PKSs (by both BI and ML inference), as well as the confirmed orsellinic acid biosynthesis (Aspergillus nidulans gene orsA, predicted to be of transferred origin) in the early diverging g.C2-C7a clade. Notably, although grouping of these three clades is resolved differently by original gene trees (b as an outgroup to a/g-notion supported by divergent signatures of C-terminal thioesterase, as well as presence of the methyltransferase domain throughout b), the support for the groups themselves is strong and unchallenged.

Ancient Duplications Underlie a Patchwork of Distant Siblings
As remarked above, the predicted events giving rise to the present-day diversity of aromatic compounds are likely ancient (see fig. 2, for graphical summary). The amalgamated likelihood approach predicts them to be mostly duplications followed by extensive losses (849 predicted loss events according to ALE). By this view, extant nonreducing polyketide synthases constitute a patchwork set of distantly related sibling groups.
In particular, repeated emergence of novel cyclization specificities within the g clade (C6-C11-emodins, asperthecin, monodictyphenone; and C4-C9-sterigmatocystin, aflatoxins, fusarubin) is shown to be monophyletic and ancient. The topological dating of relevant nodes provided by reconciliations places the origin in the common ancestral lineage of four major classes of filamentous fungi (Leotiomycetes, Sordariomycetes, Dothideomycetes, and Eurotiomycetes).
Of the total 87 duplication events inferred by ALE reconciliation, 62 are predicted to have occurred in this common ancestral lineage. Furthermore, three ancestral duplication events are predicted to predate the split between Basidiomycetes and Ascomycetes (the two duplications underlying division into a, b, and g clades of NR-PKSs and a singular duplication within the a clade itself). Four duplications are tentatively dated at Dothideomycetes-Eurotiomycetes split. Slightly higher numbers of more recent duplications are inferred in Dothideomycetes and Eurotiomycetes (mostly in phytopathogenic Pleosporales-2 and geni Aspergillus-5 and Talaromyces-3). This seems related to their rich secondary metabolite repertoires and a relatively dense sampling of the closely related genomes from the respective species. Notably, such patterns cannot be seen for other extensive secondary metabolite producers such as the saprobic/pathogenic Metarhizium species. In Metarhizium, the high number of NR-PKS genes seems to stem from less extensive selection and not late duplications (only a singular duplication has affected the Hypocreales lineage-resulting in the doubling of aurofusarin ancestor gene).
By the same token, the extensive thinning of NR-PKS repertoire (874 losses) started after the divergence of filamentous fungi (estimated losses for Sordariomycetes-Leotiomycetes common ancestor-23, for Dothideomycetes-Eurotiomycetes common ancestor-4) and the losses continue following subsequent divergences. As indicated in the previous paragraph, the selective process did not affect all resulting species equally, with some lineages retaining more of the original diversity.

Transfers Are a Source of Additional Diversity
A ranked list of the top detected transfers with their ALEbased mappings are summarized in Table 1 (alternative scenarios found by DTL-RANGER can be examined on the results website). Briefly, 14 of the 25 transfers (56%) are detected consistently even prior to amalgamation (i.e., the bipartition is present in both amalgamated and original gene trees, the transfer is predicted on both topologies by DTL-RANGER across multiple cost thresholds). The detailed results, obtained from sampling of multiple optimal reconciliations at random, are available through the gene tree visualization FIG. 3.-The divergence of major NR-PKS clades, which enabled diversification of biosynthesized aromatic compounds, was facilitated by close linkage with key accessory enzymes and transcriptional factors. The simplified phylogeny (based on amalgamated gene tree) shows selected groups of syntenic homologs as colored shapes (also indicated by numbers in box shape, which reference descriptions in table 2).
Most predicted transfers are fairly recent in comparison to duplications ( fig. 2), in keeping with majority of the transfers being lost over time. The key predicted transfer events are predominantly in clade g (21 transfers) and include several cases of transfers involving well-characterized core biosynthetic genes: HGT of the norsolorinic acid NR-PKS into Podospora anserina genome ( fig. 5A), as well as transfers of pigment-related NR-PKSs (fusarubins- fig. 5B, acquisition of a putative pksP homolog by Aspergillus terreus from Metarhizium, fig. 5C). Very few transfers are predicted in clade b and no transfers are indicated (postamalgamation) for the early-diverging clade a.
The former example of sterigmatocystin biosynthetic cluster in P. anserina constitutes perhaps the best documented individual case of horizontal transfer involving a highly toxic and large secondary metabolite biosynthetic cluster (Slot and FIG. 4.-The phylogeny of ancestral clade a, which predates basidiomycete-ascomycete split, contains conserved NR-PKSs biosynthesizing orsellinic acid (pks14, pks1). Support noted below branches in form BI/ML/C (BI, support from BI; ML, support from ML ultrafast bootstrap; C, whether bipartition is present in Bayesian consensus tree; * denotes full support, % denotes support below 1% level). Domain architecture and gene structure are visualized on the right side (colors: red, SAT; blue, KS; green, AT; purple, PT; orange, ACP; pink, TE; brown, R). Color of the tree edges and letters denote predicted events (D, duplication; TD, transfer donated; TA, transfer accepted). Tree is not drawn to scale.
Rokas 2011). In our analysis, both the support for the event itself and its mappings are consistent. Interestingly, the prediction indicates transfer donor in Dothideomycetes, within R. rufulum lineage. Notably, this origin as unfragmented cluster in Dothideomycetes has been postulated in a recent paper (Bradshaw et al. 2013).
In some cases, the ALE has resulted in a significantly different scenario from simply sampling the optimal reconciliations based on the original tree topology. For a singular Tuber melanosporum NR-PKS (GSTUM_00003432001), its placement in the original BI tree (at 74% BPP, Bayesian Posterior Probability) has introduced an apparent HGT from early Pezizomycotina to Basidiomycetes. This scenario is rejected by ALE where an alternative grouping is incorporated in the final tree instead (bipartition g31 supported at 26% BPP; see fig. 4). This rearrangement shows that consideration of alternative rather than majority bipartitions is required for validating transfer events, even at most restrictive cost settings.
As the sampling of taxa is not even across the species tree, one can expect that increasing coverage of undersampled species will reveal more lineage-specific duplications and transfer events. This is well demonstrated by Eurotiales clade. For this reason, we refrain from making quantitative calls about the singular numbers of events. Still, some observations stand out. In particular, the branch leading to Glarea lozoyensis (Helotiales) appears to be a rather frequent donor (four donated transfers). This is partially explained by both the length of the corresponding branch of the species tree (by the dated chronogram the split between Glarea and B. graminis lines occurred~250 Ma) as well as the other represented species being plant pathogenic rather than saprobic throughout their lifecycle. Also, a high number of transfers (8) are predicted within the g.C2-C7a subclade (represented by orsA and zea1) which presents a sparse distribution of homologs from distantly related species. Although artifactual origins are possible (genes strongly differ in structure), HGT was previously (Xu et al. 2014) invoked as an explanation of the observed diversity of macrocyclic lactones. Additionally, the highly reducing polyketide synthases (indicative of the biosynthesis of lactone part not derived from orsellinic acid molecule) are strongly enriched for the entire g.C2-C7a clade (as well as part of the b clade containing the asperfuranone biosynthetic NR-PKS).

Speciation, Duplication, and Transfer Differentially Affected the Genomic Context
We investigated the overall differences in conservation of genomic context (+20/À20 genes) at each bipartition (in the amalgamated tree) by looking at the average number of conserved homologs shared between neighborhoods of any two members of the respective descendant clades (left and right subtrees of inner nodes in the tree). This estimate was then contrasted with event labels ascribed to bipartition by ALE. Duplications were found to disrupt the synteny most (11% of shared homolog groups on the average are retained), with significantly (Wilcoxon ranked sum test) higher retention for speciation (40%; P < 10 À5 ) and slightly higher for transfers (21%, not significant). The number of the putative counterparts decays with the depth of the node (Pearson correlation between the ratio of shared homologs vs. depth of the node in the tree r = À0.22, P = 5 Â 10 À6 ), confirming that context is also lost over multiple events (in part due to most of the deeper nodes being duplications, while many recent ones are speciations). Notably, we think that these estimates err on the strong side as counterparts were identified on basis of membership in the same CHG (gene family or subfamily) rather than precise orthology/paralogy relationships.
Although for some of the predicted transfer cases the genomic context is partially preserved (e.g., sterigmatocystin in Podospora-17% genes have a counterpart in Rhytidhysteron), there are five cases (g361, g511, g664, g731, g791), including the strongly supported transfer of the pksP into A. terreus (g664), where genomic neighborhood is not shared between predicted donor and acceptor branches at all.

Conserved Splice Junctions and Overrepresented Accessory Enzymes Are Associated with Emergence of Different Cyclizations
The gene groups significantly overrepresented in association with particular monophyletic clades are summarized in table 2 (graphic summary of selected associations is shown on fig. 3). Out of 132 homolog groups, 40 display significant (P < 0.0001) enrichment for at least one monophyletic clade in the tree (possibly indicating the point of origin for acquiring the accessory enzyme). Strikingly, no enrichment of any particular gene family was noted in association with the early diverging clade a (which contains sequences from both Basidiomycetes and Ascomycetes). This trend was not mirrored in exon junction positions which are conserved relative to consensus sequences of KS and AT domains encoded by a clade members.
In terms of splicing site conservation, all three clades (a/b/g) demonstrate specific patterns (see supplementary table S8, Supplementary Material online). In group a, the member genes typically present higher overall number of exons due to multiple splice junctions embedded within the above-mentioned KS and AT domains. In group b, the core NR-PKS domains are typically encoded on a single large exon, with a clear exception of the clade (g182) associated with biosynthesis of meroterpenoid products (terretonin, austinol). The meroterpenoid NR-PKSs show their own characteristic pattern of splice junctions within SAT, KS, and AT domains.
The g clade of NR-PKSs encompasses synthases capable of a number of divergent modes of cyclization (C2-C7, C6-C11, C4-C9; Li et al. 2010). This plasticity of function is perhaps facilitated by the observed tendency toward exon gain within both the SAT and the PT domains. Notably, by relating to available protein structure of the PT domain (norsolorinic acid synthase from Aspergillus parasiticus, C4-C9 cyclization type) one can see that the highest degree of fragmentation results in four individually encoded pieces, each of which contributes to the gating of the cyclization chamber (see fig. 6the model of PksA aflatoxin NR-PKS with delineated structural elements corresponding to coding sequence parts separated by splice junctions).
The subfunctionalization toward different cyclization schemes and/or end products is also shown to be related to recruitment of specific accessory enzymes. The manganesedependent b-lactamases are unambiguously associated with g.C6-C11 subclade (g552, corrected P = 8.6e-43), with the notable exception of endocrocin sister clade (g439; where b-lactamase is not always present in the vicinity of NR-PKS). Taken as an exemplary association, the endocrocin example demonstrates that strong linkage follows recruitment of the accessory enzyme and thus provides an upper constraint on the timing of the original involvement in the biosynthetic pathway.
As with most of the predicted duplications, the topological dating places this tight linkage with b-lactamases (Li et al. 2011), prior to the divergence of major classes of filamentous fungi. As in other clades, the syntenic relationships persist through multiple predicted horizontal transfer events, involving among others: The acquisition of capacity to synthesize alternariol/isocoumarins by A. nidulans/oryzae/flavus (pkgA homologs, predicted to originate from Sordariomycetes, HGT supported solely by ALE) and spread of asperthecin biosynthetic PKS aptA homologs to Arthroderma sp. as well as Metarhizium sp. (with the original donor in Aspergillus fischerii/fumigatus clade-a likely ortholog of the extant neosartoricin NR-PKS nscA involved in prenylated xanthone biosynthesis).
The enrichment analysis suggests that diversification of activity was also associated with recruitment of conserved but divergent groups of transcription factors (of the zincbinding finger variety). Of the five transcription factor groups, only one (ch43) is enriched for two separate subclades of g. The only other case, where a homolog group is strongly enriched in more than one part of the tree, concerns the highly reducing polyketide synthases mentioned in the previous subsection.
As a sidenote, there are multiple cases where accessory enzymes strongly associated with a monophyletic clade are nevertheless present (in singular numbers) around NR-PKS genes from different clades (see table 2, e.g.: Several aflatoxin biosynthetic cluster genes, aurofusarin biosynthesis dehydratase aurZ). Depending on the respective gene family histories, the presence of such "outliers" raises the salient questions of frequency and modes of exchange of accessory enzymes between different clusters, particularly if biosynthetic activities of individual clusters are linked by common end product further down the line (e.g., Andersen et al. 2013).

The Evolution of Pigment Biosynthesis Presents Both Conservation and Diversification
Perhaps, the most striking examples of both conservation and variation concern the evolution of melanin biosynthesis (see fig. 7) and splitting of the biosynthesis of alternative pigments and toxins from this conserved branch. First, the common ancestor of g.C6-C11, g.C2-C7b/c, and g.C4-C9 ( fig. 3) acquired an accessory enoyl-CoA reductase (of the short chain dehydrogenase superfamily). This likely enabled biosynthesis of reduced intermediate compounds (such as versicolorin during the biosynthesis of aflatoxins, tetra/ trihydrohydroxynaphthalene during biosynthesis of melanins, monodictyphenone during the biosynthesis of xanthones). Next, the linkage with copper-dependent laccase enzymes enabled oxidative polymerization (e.g., in biosynthesis of melanins, aurofusarin) in the common ancestor g.C2-C7b/c and g.C4-C9.
Movement between different components of the (twospeed) fungal genome is evidenced in the placement of melanin biosynthetic NR-PKSs (including the genes involved in biosynthesis of melanin through naphtopyrone intermediate) in the vicinity of confirmed housekeeping genes (see table 2). Conservation is further underscored by inheritance of the tetrahydroxynaphtalene synthase (core NR-PKS) as an ortholog in the majority of filamentous fungi ( fig. 7-most of Leotiomycetes, Sordariomycetes, and Dothideomycetes). In select Sordariomycetes (Hypocreales) as well as most Eurotiomycetes, pigment biosynthesis is instead carried out through modified naphtopyrone routes (through YWA1 heptaketide intermediate product-e.g., Chiang et al. 2011). Notably both the speciation patterns and the conserved gene structure of the core melanin biosynthetic NR-PKS (the tetrahydroxynaphthalene synthase) suggest possible application as a source of additional barcode markers across the relevant taxa.
The basic building blocks (accessory enzymes) partaking in biosynthesis of melanins have been further retained for biosynthesis of other compounds (and are present in the extant clusters). Furthermore, possibly due to the increased toxicity of C4-C9 cyclization products (such as sterigmatocystin and aflatoxins, javanicin, and fusarubins) that subclade is associated FIG. 7.-The phylogeny of tetrahydroxynaphthalene synthases (the core enzyme of melanin biosynthesis) mirrors speciation explicitly in the majority of Dothideomycetes, Leotiomycetes, and Sordariomycetes. Visual conventions (branch support, exon, domains) analogous to figure 4. Gene models for MBM_00260 and M7I_3853 were truncated from 5 0 and 3 0 sides, respectively. with significantly overrepresented retention of accessory methyltransferases in the close neighborhood of the core megasynthase.
The analysis of top-scoring, consistently predicted transfer events suggests that biosynthesis of aromatic pigments, following the divergence of genes involved in melanin biosynthesis through the DHN route, could have been influenced by key transfer events (e.g., origin of Pgl1/Fsr1 orthologs in Fusarium as a horizontal transfer from early Pleosporales, reintroduction of a putative pigment biosynthesis gene into A. terreus as a transfer from Metarhizium). As an example of finer resolution, due to the increased taxon sampling of our analysis (relative to earlier phylogenomic analysis by Brown et al. [2012]), the core genes of bikaverin and aurofusarin biosynthesis are now shown to be nonorthologous in origin. These NR-PKSs are now revealed to belong to sister clades which diverged early during the evolution of filamentous Ascomycota and (in case of aurofusarin NR-PKS) underwent an additional Hypocreales' specific duplication. The aurofusarin biosynthetic gene is however shown to be orthologous to experimentally characterized Trichoderma PKS4 (pigment synthesis PKS; Atanasova et al. 2013), following the said duplications ( fig. 8).
Last point of note concerns the replacement of thioesterase domain with NAD-binding reductive domain leading to aldehyde end products of cyclization. This event has likely occurred twice during the evolution of NR-PKSs (respectively, in g.C4-C9 clade leading to fusarubin subclade- fig. 5B-and in the part of the b clade associated with biosynthesis of azaphilones and tropolone-derived compounds). Based on the gene structure and similarity of protein sequences of affected domains, a likely possibility is the recombination (or fusion) between a donor gene from b clade and an acceptor gene from g clade (or vice versa). Regardless of whether the domain has been acquired from an outside or inside source, the implied scenarios show the need for considering mosaic ancestry of multidomain proteins, where different regions of the protein/gene could have conflicting origins.

Toward Phylogenomic Roadmaps of Secondary Metabolite Biosynthesis in Fungi
The reconstruction of evolutionary history of fungal NR-PKSs faced a number of obstacles. Foremost among these was the quality of gene tree inference for a large and diversified gene (sub)family across multiple species. First, the data are strongly saturated worsening the risks associated with long branch attraction and incorrect resolution of deeper nodes in the tree. Second, due to being focused on a conserved fragment of limited length (the KS-AT core modules and intervening region) attempts to put conservative filters on data, commonly accepted in species-focused phylogenomics, would likely introduce stochastic bias (particularly) in the resolution of rapid successions of events. Nevertheless, we have taken several precautions against the common artifacts of phylogeny reconstruction (use of protein sequences, filtering protein alignments based on transitive consistency score, use of different methods for phylogeny reconstruction including models able to take into account the heterogeneity of rates across informative sites).
In its present iteration, the reconstruction uncertainty is also alleviated by conscious inclusion of species trees in phylogeny reconstruction process (so in the absence of prevailing evidence, neutral scenario implied by species tree topology is considered by ALE). The contemporary developments in probabilistic models for inferring both species and gene phylogeny simultaneously ) and for inferring the gene phylogeny based on simultaneous reconciliation with a putative species tree (Szollosi, Rosikiewicz, et al. 2013) formalize this approach in a probabilistic framework. In our example, the application of ALE has also given an added benefit of explicitly modelling the two-step transfers through intermediary donors from extinct lineages .
The results demonstrate the utility of maximally parsimonious reconciliations and phylogenetic inference in uncovering the sources of extant diversity in a highly diverged gene family (duplication, transfer, and selective loss). As a proofof-concept our analysis demonstrates that in considering the evolution of secondary metabolism in microbial Eukaryotes, the scientists have to take into account transfers (at varying degrees) as a viable origin hypothesis for present metabolite diversity. This is even though, in this case, majority of molecular innovation seems ancient in origin and (throughout the history of multiple extant fungal lineages) has been subjected mostly to selective losses. The use of probabilistic methods (ALE) allowed us to address the parameter choice (costs of different events) in an unsupervised way-with additional support lent to early diversification of biosynthetic mechanisms and a limited, but not negligible, number of predicted transfer events.
In the future, the increased sampling of previously uncharacterized taxa of higher fungi will likely increase the number and improve the resolution of individual events. The biases in taxonomic coverage resulting from uneven sampling of divergent taxa are liable to influence what is perceived as an incorrect outlier (errant bipartition due to inclusion of a gene from a rogue taxa) rather than accumulated evidence from multiple, more distantly related species (Szollosi et al. 2015). The information about spread and retention of biosynthetic potentials in different lineages may thus reveal stronger and/or different trends among donor and receiver groups, as well as contrast these across ecological niches occupied by related organisms. The resolution of transfers, in particular can be challenged, should novel organisms from related clades (the increasing public coverage of previously unsampled lineages, such as Chaetothyriales, Lecanoromycetes or Xylonomycetes) imply different distributions of genes. This is one of the reasons why roadmap nature of the resource should be emphasized-one taking into account the "landmarks" (species) available at the time of its creation.
Likewise, the continuous development of novel computational approaches taking advantage of high-performance computing infrastructure results in more accessible methods increasingly able to account for some of the unavoidable biases in the data. For example, both rate heterogeneity and misleading signals, arising due to saturation, are better handled by improved Bayesian and ML methods, as evidenced by a number of studies (e.g., Boussau and Gouy 2006;Lartillot et al. 2009;Husník et al. 2011).
Last but not least aspect of the presented analysis lies in providing, for the first time, a phylogeny-based annotation of NR-PKS core genes involved in fungal secondary metabolism. With increasing number of inquiries into the phylogenetic basis of eukaryotic secondary metabolism based on functional experiments (knockout, deletion mutant libraries) and growing genomic coverage, similar initiatives should complement large tools and databases (Khaldi et al. 2010;Blin et al. 2013;Wang et al. 2014). In particular, a phylogeny-centric resource should facilitate research focused on chemotaxonomy, as well as further dissection of the evolutionary fates of specific compounds and clusters (e.g., adaptation of pigment molecules for protection from both abiotic and biotic stresses, emergence and spread of high toxigenic potential).

Conclusions
We present, for the first time, a phylogeny-based analysis elucidating the origins of diversity in one of the largest and most important groups of core genes involved in fungal secondary metabolism-the nonreducing polyketide synthases. The results support ancient duplications of a limited number of NR-PKSs into subsets catalyzing divergent chemical reactions. The highly diverse representation of these genes in extant fungal genomes is revealed to be a result of subsequent, large-scale selective losses, moderated by a number of more recent duplications and horizontal transfers (associated with, e.g., emergence of fusarubin biosynthesis in fusaria). The complementary analysis of phylogeny-associated traits (genomic neighborhood, splice junctions) shows gain of tightly linked accessory genes and modularization of key PKS domains (PT domain), as crucial traits accompanying the functional diversification.