A treasure trove of 1034 actinomycete genomes

Abstract Filamentous Actinobacteria, recently renamed Actinomycetia, are the most prolific source of microbial bioactive natural products. Studies on biosynthetic gene clusters benefit from or require chromosome-level assemblies. Here, we provide DNA sequences from >1000 isolates: 881 complete genomes and 153 near-complete genomes, representing 28 genera and 389 species, including 244 likely novel species. All genomes are from filamentous isolates of the class Actinomycetia from the NBC culture collection. The largest genus is Streptomyces with 886 genomes including 742 complete assemblies. We use this data to show that analysis of complete genomes can bring biological understanding not previously derived from more fragmented sequences or less systematic datasets. We document the central and structured location of core genes and distal location of specialized metabolite biosynthetic gene clusters and duplicate core genes on the linear Streptomyces chromosome, and analyze the content and length of the terminal inverted repeats which are characteristic for Streptomyces. We then analyze the diversity of trans-AT polyketide synthase biosynthetic gene clusters, which encodes the machinery of a biotechnologically highly interesting compound class. These insights have both ecological and biotechnological implications in understanding the importance of high quality genomic resources and the complex role synteny plays in Actinomycetia biology.


Introduction
The bacterial class Actinomycetia (recently reclassified as Actinobacteria) is hugely successful, with species commonly inhabiting a wide range of niches, both host-associated and environmental.Environmental Actinomycetia species, especially those inhabiting soil, often have a complex biology and a filamentous growth pattern, which makes laboratory studies difficult.
In the past century, there has been great scientific interest surrounding the class Actinomycetia.Many essential medicines such as antibiotics (e.g.tetracycline, erythromycin and vancomycin), immunosuppressants (e.g.rapamycin), anthelmintics (e.g.avermectin), and anticancer drugs (e.g.actinomycin D and daunorubicin), but also important insecticides (e.g.spinosyns) and food preservatives (e.g.ε -polylysine) either directly come from or are derivatives of compounds produced by Actinomycetia strains ( 1 ,2 ).
The biosynthesis of these natural products, often referred to as secondary or specialized metabolites, is encoded in genomic regions called biosynthetic gene clusters (BGCs).These BGCs code for a variety of enzymes that catalyze the biosynthesis of highly complex chemicals from 'simple' precursors from primary metabolism.In addition to the biosynthetic function, the BGCs often also contain genes for regulation ( 3 ), molecular transport / export ( 4 ) and self-resistance ( 5 ).Due to a high degree of conservation of the key enzymatic steps in the biosynthesis of these specialized metabolites, regions of the genome encoding such BGCs can be identified and annotated using bioinformatics tools such as antiSMASH ( 6 ), which classifies a BGC based on a set of rules specific to each BGC type ( 7 ).Anti-SMASH then compares the identified BGC to both known specialized metabolite pathways and existing genomic sequences, and suggests possible precursors.These features have established genome mining as a complementary approach to the bioactivity and compound-centered discovery of novel bioactive molecules ( 8 ).As some of these BGCs, such as the gargantulide BGC ( 9 ), have sizes of > 200 kb, high quality genome assemblies have a strong influence on the quality of the genome mining analyses, as contiguous sequence data allows the identification of complete gene clusters and not just fragments-a challenge that is prevalent with low to medium quality draft genome and many metagenomic datasets ( 10 ).
Most bacterial genomes consist of a single circular chromosome and maybe some smaller circular plasmids.Some genera of the Actinomycetia also have circular chromosomes, but members of the genus Streptomyces have large, linear genomes, similar to how most eukaryotic genomes are organized.The linear Streptomyces genome is usually capped with terminal inverted repeats (TIRs) ( 11 ), which can be > 1 Mb long and can encode BGCs (e.g. ( 12)).The TIRs contain short telomeres which are capped with covalently bound terminal proteins ( 13 ).The TIRs are highly dynamic and have been speculated to play a role in gene dose modulation by duplicating genomic regions, thereby potentially increasing compound production or diversity if BGCs are duplicated ( 14 ).The TIRs also present a special challenge in genome assembly, as even long read information regularly is not of sufficient length to cover the TIR from the chromosome start to the most terminal non-repeat regions.However, because of the repetitive nature, it is still possible to resolve TIRs and generate complete genome sequences.Information about telomeres, TIRs and chromosome arms is not a commonly recorded statistic for bacterial genome assemblies.It is possible that the duplicate TIRs are missing, placed on their own contig, or otherwise misrepresented in otherwise complete genome sequences of Streptomyces isolates in public databases, as large scale errors relating to the TIRs are often observed artifacts introduced by sequence assembly software optimized for circular bacterial genomes.The linear Streptomyces genome is widely hypothesized to be composed of a central 'core', where the basic cellular functions are encoded, and distal 'arms', which are dynamic and more likely to encode BGCs and accessory functions and often contain mobile genetic elements.Several studies have reported this genomic compartmentalization in smaller datasets ( 15 ) or using e.g.rRNA genes to mark the core region ( 16 ).
Previously, the high GC content (65-75%) of many Actinomycetia species hampered the ability to sequence complete genomes.But with the advent of the GC-agnostic nanopore sequencing technology ( 17 ), and implementation of lab protocols minimizing GC bias from illumina sequences ( 18 ), high GC-content is less of an obstacle.It is now possible to obtain complete genomes using technologies such as PacBio or Nanopore DNA sequencing, but currently available genomic resources are still too scarce to cover the biological diversity of Actinomycetia in nature ( 19 ), and large-scale efforts are often focused on quantity over completeness and chromosomal organization of the individual genome.In a recent massive collaborative study involving both the U.S. Department of Energy Joint Genome Institute (JGI) and the German Collection of Microorganisms and Cell Cultures (DMSZ), the lack of highquality resources was suggested to severely limit both the biological understanding of Actinomycetia species and the development of new bioactive compounds from this clade ( 19 ).
We report the whole genome sequencing of 1034 Actinomycetia strains from the New Bioactive Compounds (NBC) Collection of the Novo Nordisk Foundation Center for Biosustainability at the Technical University of Denmark; 881 strains with complete genomes and 153 near-complete genomes.We provide analysis of their phylogeny, genomic organization, BGC content and provide an example of how this large dataset can be used to identify novel BGCs.

Materials and methods
For basic statistics on the G1034 genomes, please refer to Supplementary Table S1 .

Biological resources: isolation, growth and DNA extraction of the NBC collection
Soil samples were collected by digging approximately 7-10 cm below the surface and placing a small scoop of soil into plastic bags or 50 ml Falcon tubes.Once back in the lab, the samples were stored at 4 • C for a maximum of 3 weeks before processing.Detailed metadata of the samples can be found in the individual BioSample records.
To isolate actinomycete strains, we mixed 1 g of soil with 3 ml 6% yeast extract, incubated the sample in a 60 • C water bath for 20 min, and then diluted this mixture 1:49 with sterile water.We then spread 200 μl of this dilution onto three types of agar media : ISP4, Actinomycete Isolation Agar Media (AIM) with Avicel, and AIM with xylose in 94 × 16 mm petri dishes containing approximately 25 ml of agar media.Actinomycete Isolation Agar, Avicel PH-101 and d -xylose were all purchased from Sigma-Aldrich and ISP4 was purchased from Becton Dickinson (Difco).AIM-Avicel contained 22 g / l AIM and 10 g / l Avicel.Similarly, AIM-xylose contained 22 g / l AIM and 10 g / l xylose.ISP4 was used at the recommended concentration.No antifungal compounds or inhibitors against Gram-negative bacteria were used since the carbon sources of the three media (ISP4: starch, Avicel: cellulose and xylose) selectively enrich for Actinomycetes.The plates were incubated at 30 • C. Fast-growing fungi that appeared after two or three days incubation were removed by cutting out the section of agar on which they grew.After 5, 10, 15 and 20 days incubation, we picked single colonies based on colony morphology, trying to avoid colonies with similar morphologies from each soil sample.The colonies were restreaked onto ISP2 plates (BD Difco).Actinomycetes have a distinct morphology with aerial mycelia.The colonies are often wrinkly or bumpy and sometimes colorful.In addition, actinomycete colonies are nearly always solid such that intact pieces break off when probed with a toothpick.Actinomycetes do not have the same amorphous morphology that bacteria such as E. coli and B. subtilis have.The ISP2 plates were again incubated at 30 • C. Once actinomycete growth was observed on the ISP2 plates, a chunk of the bacteria was lifted from the agar and placed into triple-baffled flasks containing ISP2 liquid media.The cultures were incubated at 30 • C with shaking.After robust growth, an aliquot was mixed 1:1 with 50% glycerol to make frozen stocks, which were stored at −80 • C.
A detailed description of the growth conditions and DNA extraction protocol can be found in the protocol Alvarez-Arevalo, M. et al. ( 18 ).Briefly, strains were grown in liquid ISP2 or YEME medium in baffled flasks at 30 • C with constant stirring.Then the cells were harvested and washed with PBS, and DNA extracted with a modified Qiagen Genomic Tip 20 protocol.For strains with an aggregated growth pattern, physical shearing in liquid nitrogen using a mortar and pestle was applied.The concentration, purity, and amount of short DNA fragments of the resulting purified DNA was evaluated with Qubit BR dsDNA, nanodrop, and 0.7% agarose gel electrophoresis.The length of DNA fragments using this method is generally around 50 000 bp ( 18 ).Samples with many short DNA fragments were further purified using either the Short Read Eliminator kit (PacBio, San Diego, C A, US A), or 0.4-0.5 × volume KAPA PURE SPRI beads (Roche, Basel, Switzerland).

Illumina sequencing
Illumina data was either generated in-house on an Illumina NextSeq500 with the KAPA HyperPlus protocol (Roche, Basel, Switzerland), or at Novogene Inc (Beijing, China) on an Illumina Novaseq machine with the libraries either prepared using the NEBNext® Ultra™ II DNA Library Prep Kit or the Novogene NGS DNA Prep Set kit, both latter using a special 6 PCR cycle protocol to minimize GC bias in the sequencing results.All libraries were sequenced using 2 × 150 nt paired end sequencing kits.Raw illumina reads were trimmed with Trim Galore v0.6.4_dev ( 20 ) in paired mode with the minimum length set to 100 bp and the quality cutoff set to Q20.

Nanopore sequencing
Nanopore sequencing was performed on the Nanopore Min-ION platform, using v9.4.1 flowcells with either the SQK-RBK004 or SQK-RBK110-96 kit, both of which are transposase based.Eight to 16 strains were pooled per run, and flowcells were generally washed and reused after 24H, making sure to use the barcodes only for the same strains on subsequent runs.The protocol was modified as described in Alvarez-Arevalo et al. ( 18 ).Briefly, the DNA to transposase ratio was modified to allow for longer fragments and a post tagmentation size selection was performed to remove small fragments.All basecalling was performed with Guppy (v.5.0.17 + 99baa5b, client-server API version 7.0.0),which also was used for demultiplexing and adapter trimming.

Assembly and preliminary analysis
The workflow AAA-Actino-Assembly-and-Annotation (v.1.6.7) was used for all assembly and preliminary analysis (DOI10.11583/ DTU.25133513).Default parameters were used when not otherwise mentioned.The workflow consists of the following steps: The nanopore reads were assembled using Flye (v.2.9) ( 21 ) with the following settings: five polishing iterations and using the -nano-raw switch.The N50 and total number of bases in reads were extracted from the Flye statistics after assembly, and thus represent the dataset used for the assembly.The assembly graphs were manually inspected in Bandage v.0.9.0 ( 22 ), and in cases where Flye was unable to completely resolve an unambiguous graph, we applied the simplistic NPGMcontigger (DOI10.11583/ DTU.25133828).The topology of each chromosome is reported based on the manual inspection of the repeat graph and of the contigging.In cases where the assembly graph was ambiguous, we first subsampled the dataset using Filtlong ( https:// github.com/rrwick/ Filtlong ) to remove the shortest and lowest quality reads according to Supplementary Table S1 , and if a complete genome was still not obtained, deposited the genome as the lower contiguity 'WGS' rather than 'chromosome'.Coverage plots of both illumina (Bowtie2 ( 23 )) and nanopore (minimap2 ( 24)) data mappings were manually inspected for sudden coverage changes to half or double the surrounding area coverage (see Supplementary Figure S1 for coverage plot examples).The coverage plot inspection revealed large scale assembly artifacts of some of the chromosomes, resulting in 11 genomes, which needed to be reassembled after Filtlong removal of half the raw nanopore data.These errors related to the TIRs of Streptomyces genomes.

Sequence sanity c hec ks
Illumina reads were mapped on the nanopore assembly using Bowtie2 v. 2.3.4.3 ( 23 ).This sanity check was failed if < 60% of illumina reads mapped, and the sample was excluded from further analysis.This step is taken to ensure that the nanopore and illumina datasets come from the same sample.

Assembly polishing
All genomes were first polished using the nanopore data with the Flye polisher, this includes both samples contigged by Flye and by the NPGM-contigger.A total of five iterations of nanopore read polishing was used.Then, samples were polished with the illumina reads using Polypolish v.0.5.0 ( 25 ), and then using POLCA (part of MaSuRCA v.4.0.5 ( 26 )), as described in Wick et al. ( 25 ).

Gene and functional annotation, phylogenetic placement, assembly quality and antiSMASH annotation
Gene and functional annotation was performed at NCBI using the PGAP pipeline.Phylogenetic assignment was done using GTDB-Tk v.1.7.0 (reference data version r202) ( 27 ,28 ), and later verified at NCBI.In the handful of cases where the taxonomic assignment between NCBI and GTDB did not agree, or a genus was differently named, the NCBI assignment was kept in the genbank annotation.For the strain NBC_01635, GTDB-Tk repeatedly failed.However, based on the presence of core genes, the linear chromosome organization with TIRs, and the NCBI classification, this strain was classified as Streptomyces sp .
BUSCO is primarily a tool for evaluating genome completeness and consists of a set of gene model datasets and software to compare the datasets to a sequence of interest.However, we used the BUSCO ( 29 ) gene model datasets both to evaluate genome completeness, and to represent different levels of evolutionarily conserved genes.BUSCO ( 29 ) uses sets of HMMs of universal (defined as > 90% of species) single-copy orthologous genes to estimate the completeness and duplication levels of eu-and prokaryotic genome assemblies.It has become nearly universally adapted for assessing assemblies, and 193 BUSCO datasets currently exist for domain (e.g Bacteria), phylum (e.g.Actinomycetota), class (e.g.Actinomycetia), order (e.g.Streptomycetales), and more classification levels.BUSCO v.5.1.2( 29 ) with the databases bacteria_odb10 and actinobacteria_class_odb10 was used to estimate the gene placement on the genomes and genome completeness, respectively.

DnaA flipping
For the analysis of complete Streptomyces genomes, all sequences were oriented so that the dnaA gene was in the forward direction using the biopython script (Blin, Kai (2024)).Orienting linear chromosomes by the dnaA gene.Technical University of Denmark.Software.https:// doi.org/ 10.11583/ DTU.25662438 ).In total, 368 genome sequences (49.6%) were reversed and complemented, while 347 (50.4%) were not.

Plasmid identification
We manually curated a list of 170 plasmid specific PFAM domains (Jørgensen, Tue Sparholt (2024).List of PFAM domains used for plasmid classification.Technical University of Denmark.Online resource.https:// doi.org/ 10.11583/ DTU.25662450 ) and used it to classify extrachromosomal elements as plasmids if any of the plasmid-specific PFAM domains were present.This methodology is similar to a previously developed method and classified 85% of extrachromosomal elements as plasmids ( 30 ).The accuracy of this methodology was not experimentally confirmed.Elements without plasmid-like traits were classified as 'extrachromosomal'.

gbk to faa and gff3
For conversion from genbank format file to GFF3, we used bioconvert ( 31 ).For conversion from gbk to amino acid fasta, we used the script at https:// www.biostars.org/p/ 151891/ written by Istvan Albert.

TIR identification
To investigate the size of linear inverted repeats on the 742 Streptomyces linear chromosomes, we performed a BLAST search ( 32 ) of the Streptomyces chromosome level chromosomes against itself with the following settings: blastn v2.10.0+,gapopen 2 gapextend 1, hit starting on the first 99 bp of the query and ending on the last 99 bp of the subject to account for end polishing artifacts.

Submission to NCBI
All raw reads used for assembly, as well as the assemblies, were deposited at NCBI under project number PRJNA747871.

Data visualization: kernel density plots
To represent each gene and each BGC equally regardless of the length, we used the midpoint position of each gene model or BGC protocluster.This ensures that long and short BGCs have equal weight in the analysis.The Seaborn kernel density estimate plot 'kdeplot' was used on core genes, duplicate core genes and protocluster antiSMASH categories and types, respectively, with a smoothing of 0.5 (bw_adjust = .5).This means that the density of each plot is independent, and as such the number of genes or clusters necessary for a certain density is not constant between plots, so they can not be directly compared, however the position (x-axis) is constant and directly comparable between plots.The smoothing value used was chosen as it was found to minimize noise while maintaining clarity, however especially in the distal parts of the chromosome, artifacts are observed as the density will have a tendency to approach 0 even when the data does not show basis of this.As the 742 complete Streptomyces genomes each have 124 bacterial BUSCO gene models, and as the secondary BUSCO copy has been filtered out in their own plot, there are 742 observations for each BUSCO gene model, meaning that the area in the density plot is equal for each gene model, except for two genomes which each had 1 BUSCO gene missing, which means that only 741 observations are available for those two BUSCO gene models.For the duplicate core genes kernel density plot and the antiSMASH category plots, the number of observations of each type vary, and the area of each density is proportional to the number of observations, which can then be compared to each other .However , it is important to repeat that the densities can not be compared between subplots in the kernel density plots.Violin type plots visualize data distributions, with the secondary axis showing the unitless density distribution.

Comparison of genome quality with respect to the existing public dataset
The genus Streptomyces was the most abundant in the presented dataset and thus was selected for comparison of genome quality with respect to existing datasets in the public repositories.We retrieved 2938 genomes of the Streptomycetaceae family that were present in the NCBI RefSeq database as of 30 June 2023.These genomes were assigned taxonomic definitions according to the GTDB database ( 28 ,34 ).A total of 462 genomes were not assigned to any of the GTDBdefined species.We calculated the MASH ( 35 ) distance metric across these genomes to assess the diversity based on whole genome sequence similarity.We used a cutoff of 95% similarity (0.05 MASH-distance) to reconstruct a genome similarity network.To calculate the diversity we used the Louvain community detection algorithm that assigns different communities in the network.These communities are designated as different MASH-based species ( Supplementary Table S1 ).The genomes belonging to the genus Streptomyces were selected for further analysis (Figure 1 C).
The genomes were classified into three different categories based on the completeness of the assemblies.The complete or chromosome level assemblies were classified as high-quality (HQ), contig or scaffold level assemblies with ≤100 contigs were classified as medium-quality (MQ), whereas rest were low-quality (LQ).We selected the genomes of high-quality to evaluate the progress of sequencing high-quality genomes of the genus Streptomyces over the years and distribution across country of origin (Figures 1 A and B, see Supplementary Figure S2 for a list of countries of origin).

Phylogenetic tree reconstruction
Two different phylogenetic trees were reconstructed here using GTDB-TK ( 28 ).The dataset of complete genomes of Streptomyces contained 1230 genomes with 488 genomes from NCBI and 742 from this study.The NBC_01635 genome was assigned to the genus Streptomyces according to NCBI but not GTDB and thus was ignored.The resulting 1229 genomes were processed using GTDB-TK's de novo tree construction algorithm.Kitasatospora GCF_000269985.1 was used as an outgroup.Another tree was reconstructed using all 1034 genomes presented in this study.For this, Mycobacterium tuberculosis GCF_000195955.2 was used as an outgroup.Tree visualizations were generated using iTOL.For scripts and metadata used to construct the panels in Figure 1 , please refer to Mohite, Omkar Satyavan (2024).Metadata for Figure 1 .Technical University of Denmark.Online resource.https:// doi.org/ 10.11583/ DTU.25663026 .

Trans-AT PKS BGC mining and analysis
In analysis of trans-AT Polyketide Synthase (PKS) BGCs, we initiated our analysis by clustering all BGC regions identified by antiSMASH using the tool BiG-SCAPE ( 36 ), with the parameters -cutoff 0.3 -include_singletons -hybrids-off -mibig.This initial clustering utilizes a strict cutoff and does not use the -mix parameters for quick and conservative identification of gene cluster families (GCFs) as proposed previously ( 37 ).The nodes in the resulting network were then colored using the predicted compound with highest ranked similarity score against known reference BGCs in the MIBiG database using the ClusterBlast function in antiSMASH.We examined each connected component within the graph, selecting only those containing regions or MIBiG BGCs identified as trans-AT PKSs for further analysis.The BGCs in the filtered subgraph were then re-analyzed using BiG-SCAPE with the parameters set to -mode glocal -include_singletons -mix -clansoff -cutoffs 0.4.The network was visualized using Cytoscape (v3.10.1) ( 38 ) and the results were then manually examined in detail to confirm the presence of a trans-AT PKS producing BGC, based on the domain organization.Subsequently, Clinker (version 0.028) ( 39 ) was used to produce comparison plots to manually validate transAT-PKS classification.This step identified similarities between the trans-AT PKS BGCs confirmed experimentally in the MIBiG secondary metabolite database and the BGC regions in the G1034 dataset.In cases where multiple product classes were present in one re-gion,the known-cluster-blast feature in antiSMASH was employed for manual validation to accurately identify the trans-AT PKS BGC.Moreover, we evaluated the BGCs by comparing their domain organizations to that of the reference BGCs, which were similar and supported by experimental evidence from the literature.
Finally, based on these comprehensive efforts, we categorized the BGCs into three groups: trans-AT PKS BGCs producing a putative known molecule, unknown trans-AT PKS producing BGCs or BGCs discarded as unverified trans-AT PKSs.Analysis of this section can be found in: Sterndorff, Eva Baggesgaard; Nuhamunada, Matin (2024).Metadata for transAT-PKS analysis.Technical University of Denmark.Online resource.https:// doi.org/ 10.11583/ DTU.25663296 .

Results and discussion
The total number of bacterial genome sequences is steadily increasing over the last 15 years.However, the number of complete genomes is still relatively scarce.For the Family Streptomycetaceae, as of 30 June 2023, only 501 complete genomes were deposited in the NCBI database (Figure 1 A).
Here, we present and analyze a dataset which more than doubles this number of publicly available, high-quality and complete genomes from important genera of filamentous Actinomycetia (Figure 1 A and B).We then use the dataset to show how high quality assemblies can be used for analysis which were not previously possible with fewer or more fragmented genomes.

Sampling and basic statistics on raw data and assemblies
We isolated almost 2000 strains of filamentous Actinomycetia from soil, and sequenced > 1000 of them, with ∼90% of them assembling into complete chromosomes.The 1034 Actinomycetia strains presented here were primarily sampled in Denmark, with 575 strains from Denmark having GPS coordinates, and 155 strains not having recorded GPS coordinates (Figure 2 A, map and inner green dots, respectively).The soil samples were taken from all parts of the country of Denmark including islands such as Anholt, Bornholm and Samsø.In addition, there are 304 strains in the dataset which stem from soil samples collected outside of Denmark (Figure 2 A, outer green dots), primarily from Germany, USA, and the Netherlands ( Supplementary Figure S2 ).
Two technical factors are crucial for the completeness of a long read genome assembly: the length of reads and the total amount of data.Using an optimized, transposase based nanopore library kit protocol ( 18 ), we achieved a median nanopore read N50 of 15376 bp (s.d.= 5720, n = 881) for strains which assembled completely to chromosome level, and a median read N50 of 13055 bp (s.d.= 6721 bp, n = 153) for strains which did not fully assemble ('WGS' level in NCBI terminology).The median total amount of nanopore data per strain was 427 595 032 bp for chromosome level assemblies (s.d.= 393 977 543 bp, n = 881), and 549 676 990 bp (s.d.= 469 155 351 bp, n = 153) for near-complete assemblies (Figure 2 B, each dot represents one strain).The N50 of chromosome level assemblies is longer than that of near complete assemblies (median of 15 kb vs 12 kb).The six genomes which did not yield complete assemblies when the N50 was longer than 25 kb were due to manual filtration as the coverage plots for those strains could not be explained or the issues resolved.The median total nanopore data amount per sample is actually higher in near-complete than in chromosome level assemblies, which reflects extra sequencing effort for strains which did not fully assemble after the initial data generation.In cases where the full dataset assembled into an ambiguous chromosome, we subsampled to remove the shortest and poorest quality reads and reassembled (see Supplementary Table S1 for the subsampling metrics), which in some cases resolved the chromosome.For a comprehensive examination of genome assembly pitfalls, see ( 18 ,40 ).
Each genome was polished with illumina data.The median number of read pairs for the 1034 illumina datasets are 5 802 158 read pairs (SD: 2 191 734, q1: 4 580 544, q3: 6 971 980 read pairs), which is the rough equivalent of coverage 175 for a 10 Mb genome.However, for a single genome, the illumina coverage was below 30 × (coverage 8, NBC_00443, illumina read pairs: 246 947).From the Polypolish report we calculated the median number of bases not covered by the illumina reads to be 167 bp or 0.0019% of the chromosome length (SD: 9709 bp, q1: 59 bp, q3: 462 bp, uncovered bases on the longest contig in each assembly).However, in a few cases, the number of uncovered bases was larger, up to 208 161 bp (2.2918%) for the chromosome of strain NBC_01738, which could affect the polishing quality ( Supplementary Table S1 ).A large number of uncovered bases can be caused by poor library quality, which again can be caused by fragmentation, insufficient amount of input DNA, or excessive PCR amplification.Since a median of only a few 100 bp are not covered by illumina data, we consider the quality of the generated illumina data optimal for polishing, however certain types of errors will remain when using short reads for polishing ( 25 ).In line with previous findings on GC bias in illumina data ( 17 ), more uncovered bases are seen for genomes with higher GC % ( Supplementary Table S1 ).
A key metric for genome quality is identification of essential, conserved genes, often referred to as core genes.We explored this using the Benchmarking Universal Single-Copy Orthologs (BUSCO) system with the most specific dataset containing all 1034 genomes: the actinobacteria_class_odb10 dataset of 356 gene models.As seen in Figure 2 C, almost all genomes have more than 99% complete BUSCO genes from the Actinobacteria_class dataset, with a minimum of 95.8% complete BUSCO gene models (NBC_01737).Though a median of 99.7% complete BUSCO gene models (SD 0.33%, q1 99.5%, q3 99.7%) can be considered close to perfect, closer inspection of the 'fragmented' and 'missing' BUSCO genes reveal that out of the 1466 total missing and fragmented BUS-COs in 1034 genomes, 886 of them (60%) have the same BUSCO gene marked as 'fragmented' (334658at1760, 'coldshock protein').This potential artifact particularly affected the genus Streptomyces , where 858 out of 886 genomes (97%) were found to have the gene 'fragmented'.This could be due to an adaptation in the Streptomyces genus, or a technicality in the actinobacteria_odb10 BUSCO dataset.We do not use the BUSCO category 'duplicated' as a quality metric, as the validity could be obscured by BGCs carrying resistant versions of the core gene product they are targeting ( 41 ), or simply by non-lethal gene duplications.We did, however, use the duplication score to exclude genomes from the dataset if > 100 duplicated genes were identified.For three genomes, (NBC_00217, NBC_00887 and NBC_00899), the duplication score was high (16.9%,26.7% and 27.2%, respectively), and is in all three cases caused by unusually long TIRs duplicating a part of the core genome, possibly as a result of recent and potentially unstable genomic rearrangements.

Comparative genomics reveals a high taxonomic di ver sity of the G1034 dataset
In order to estimate the diversity of the G1034 dataset, we performed taxonomic analysis at the whole genome level.GTDB-based taxonomic assignment showed that 1034 genomes were distributed across 27 genera including Streptomyces (n = 885), Micromonospora ( n = 36), Kitasatospora ( n = 27) and Nocardia ( n = 21).This study significantly increased the number of complete genomes for several of these genera.One of the strains failed GTDB-Tk run (NBC_01635) and another (NBC_01309) was not assigned to any of the GTDB genera.This strain NBC_01309 with the assigned family Streptomycetaceae represents a potentially novel and unique genus based on genome sequence data.The taxonomic novelty was determined using the RED-score of 0.84.These two strains are not included in the analysis in this section.
We further note that 570 of the 1032 genomes could be assigned to one of the 145 GTDB-defined species.The remaining 462 genomes were not assigned to any of the taxonomically or computationally defined species.Thus, they likely represent new species that have not been described before and thus expand the existing phylogenetic diversity of the Streptomyces and other genera.We carried out MASH-based analysis ( 35 ) to detect diversity among these 462 genomes.A whole genome similarity network was reconstructed with 95% similarity as a cutoff for edges.Using a community detection algorithm, we identified 244 communities in the network that can be computationally assigned as different species with 95% similarity cutoff.Some of these MASH-based species had multiple genomes (e.g. 28, 16, 13, 10, etc.).As many as 171 of the 244 MASH-based species were represented by only a single genome, whereas 58 of the 145 GTDB-assigned species were represented by a single genome.Overall these statistics highlight the diversity of the sequenced collection.See Supplementary Table S1 for NCBI, GTDB-Tk and MASH taxonomic assignment for each strain.
Next, we inferred phylogenetic trees of the G1034 dataset and the dataset of the genus Streptomyces with publicly available genome (Figure 1 C, D) using GTDB-Tk ( 42 ).The different species in the G1034 dataset are widely distributed and include representatives of most sub-phyla.The phylogenetic tree of all complete Streptomyces genomes further illustrated the diversity of the dataset in the context of existing public data.We observed that there are certain clades of the tree that are overrepresented and very few that do not contain strains from our collection.It is likely that there is still high potential to isolate novel species from this genus and thereby expand the phylogenetic diversity of public databases.
As expected, the G1034 dataset is a rich source of specialized biosynthetic gene clusters.In an analysis of anti-SMASH7 annotations, 29 024 regions encoding BGCs were identified in the genomes.As also observed in other studies ( 43 ), there are clear correlations between genome size and the number of BGCs in the various genera in the dataset (Figure 3 A).Many Actinomycetia BGCs are hybrids between several biosynthetic types (e.g., type I PKS and NRPS).To generate BGC abundance statistics, we used the antiSMASH protoclusters ( 44 ), as the protocluster annotation level takes into account the composition of the hybrid BGCs.BGCs coding for the biosynthesis of terpenoids (5866), NRPS (5291) and type I PKSs (4099) are most abundant in the strains, but also siderophores (2396), RiPPs (1986), type III PKS (1848), butyrolactones (1766) and type II PKS (1545) containing BGCs are widespread.

Streptomyces chromosomal organization
While most bacterial genomes are circular, most Streptomyces genomes are linear and carry a linear inverted repeat on the ends of the chromosome arms ( 45 ), overlapping with, but not identical to the chromosome arms.Of the 1034 genomes in this publication, 886 belong to the genus Streptomyces , of which 742 are chromosome level assemblies.We used the assembly repeat graph from Flye ( 21 ) to manually determine the topology of each individual genome.In Figure 2 D, six chromosome layouts are shown, exemplified by flye assembly repeat graphs.Both circular and linear topologies are found among the genomes presented here.The linear genomes often but not always were found to have large unresolved Terminal Inverted Repeats (TIRs), of up to 2.8 Mb in length.The assembly repeat graphs has been suggested to be a useful guide in correctly assembling a chromosome where the repeats vastly outsize the read length, as is the case for most of the linear inverted repeats of Streptomyces ( https:// github.com/fenderglass/ Flye/ issues/610#issuecomment-1629027346 ).

Streptomyces show strong variability in the length of the terminal inverted repeats
Out of a total of 742 chromosome level Streptomyces genomes, TIRs could be identified in 79% (587 / 742), while in 21% (155 / 742), no TIR was identified using our BLAST based methodology.The strains without identified TIRs could either not contain them, or perhaps the assembly or identification process somehow fell short for these strains, for example the nanopore or illumina polishing could have truncated the extreme chromosome ends.Plotting the TIR lengths both as absolute values and as a fraction of the total chromosome length (including both TIRs in a genome) reveal that the size of the TIRs range from less than 100 bp to almost 3 Mb, and from < 0.01% to > 50% of the total chromosome size (Figure 2 E and F, see Supplemental Table 1 for the data used for the figures).Most of the TIRs are in the 10 000-150 000 bp range or 0.3-10% of the total genome size (median 50 618 bp, q1: 14 006 bp, q3: 147 244 bp, SD: 274 156 bp).The huge size range of > 1000-fold showcases the highly dynamic nature of the TIR.We then analyzed the distribution within the named species with more than 10 members (Figure 2 E and F, colored dots), and found that even within species, the length of the TIR is highly dynamic ranging for example from few 100 bp to more than 100 000 bp in the most abundant species in the dataset Streptomyces anulatus (Figure 2 E and F, red dots) .This dynamic TIR correlates to observations of Streptomyces strains undergoing large changes in the TIR during growth in a laboratory, and to frequent reports of the loss of one or both chromosome arms due to double strand breaks ( 11 ,45 ).Understanding these processes also has very important implications for engineering Streptomyces , i.e. in knockout experiments, where a duplicate gene can be harder to modify or in the worst case even can cause megabase-sized deletions.Since BGCs are commonly examined and often found on TIRs, there is a high risk of misinterpreting experimental results if the TIR is not correctly accounted for.

Streptomyces show conserved core gene placement
In order to explore the genome organization of the 742 linear Streptomyces genomes, we first oriented all genomes according to the direction of the dnaA gene.For circular genomes, this gene is commonly used to rotate and orient the genome sequence ( 46 ).For four strains, there were two annotated dnaA genes, however in all cases the dnaA genes were located on the same strand.Around 50% of genome sequences were reverse complemented, as would be expected from a random initial genomic direction.We used the 124 HMM gene models in the bacteria_odb10 (v.2020-03-06) BUSCO dataset to identify single copy essential genes conserved in all Bacteria, as these can be considered the most conserved core genes.The position of each gene was then extracted from the genbank files and used for the probability density estimates in Figure 4 A and B, with the middle of gene coordinates being used for calculations.This revealed that the placement of the core genes in linear Streptomyces genomes is extremely structured: the dnaA gene is the most centrally placed conserved gene in the complete Streptomyces genomes, encoding the DNA replication initiation protein.It is located exclusively near the mathematical center of the linear chromosome with a median distance of −21 173 bp, just upstream of the chromosome center (q1: −130 158 bp, q3: 77 679 bp, SD: 214 661 bp).An additional 11 genes are consistently located around the chromosome center, among them genes encoding ribosomal proteins and tRNA ligases (Figure 4 A, red and orange hues).
Generally, all 124 bacterial core gene models seem to have a preferred distance to the chromosome center (see Supplementary Figure S3 for complete legend for Figure 4 A. See Supplementary Table S2 for the tables used for Figure 4 ).From the small ribosomal subunit, 10 out of 17 genes are in the chromosome arms 12-25% of the arm length distance from the chromosome center (Figure 4 A, green and yellow hues) in a clear focal point likely reflecting a cluster of ribosomal operons.In the same focal point, 14 out of 25 genes from the large ribosomal subunit are found.Other ribosomal proteins such as L20, L35 and S4 are positioned halfway towards the chromosome ends (Figure 4 A, purples).Generally, only a few genes shared by all of Bacteria are placed outside of the inner 75% of the chromosome, which can be considered the central genomic region.For genes unaffected by the direction of replication, the density on either side of the chromosome center should be of equal size.However, there seem to exist a preferred position of a group of genes relating to the direction of the dnaA gene, where several of the ribosomal genes are more likely to be found downstream than upstream of dnaA .This effect can be observed for most core genes, but it seems that the effect decreases with the distance to the chromosome center.This observation of core gene placement is in agreement with previous observations on a smaller Streptomyces dataset ( 15 ,47 ) and several genome organization descriptions ( 48 ,49 ), where core genes were found close to the central chromosome in a 'central compartment'.
In most genomes, a few BUSCO genes are found in duplicate.A total of 2179 genes were found in duplicate from 742 complete Streptomyces genomes, giving an average duplicate count from the bacterial BUSCO dataset of approximately three genes per genome (Figure 4 B).Most of the duplicated genes are positioned away from the chromosome center with a single prominent exception of the ribosomal protein S18 gene (1940575at2), which have densities both in the central chromosome and on the chromosome arms.For the remaining secondary duplicate genes, the density estimates are shifted towards the ends of the chromosome arms, away from the chromosome center.This observation is in line with the observation that BGCs often carry a resistant version of the target of the specialized metabolite (41).The observation that duplicate core genes are placed towards the chromosome ends is affected by the methodology: the primary and duplicate gene copies are distinguished by the distance to the chromosome center, where the most central copy is assumed to be the primary copy.This forces the 'secondary' copy to be more distal than the 'primary' copy.The number of duplicated genes is only a small subset of the total number of genes visualized (Figure 4; 92 006 versus 2179 genes).

BGC are not placed randomly on linear Streptomyces genomes
Having documented the conserved placement of core genes along the linear Streptomyces chromosome, we next investigated the placement of BGCs.The observation that biosynthetic gene clusters often are located towards the linear chromosome ends has been suggested since the first complete Streptomyces genome sequences became available (14).anti-SMASH (v.7.0.0) operates with 81 rules, each describing the core biosynthetic requirements of a specific type of BGC, e.g.Type 1 Polyketide Synthase (T1PKS) or lassopeptide.Each cluster type has a predefined neighborhood on either side of the core cluster (e.g.20 kb on either side of T1PKS core genes), which together constitutes a protocluster.Overlapping protoclusters are annotated as a 'region', the primary output from the antiSMASH software.Each protocluster is annotated as one of six categories, responsible for production of the most common groups of compounds: non-ribosomal peptide synthase (NRPS), polyketide synthase (PKS), terpenes, ribosomally synthesized and post-translationally modified peptides (RiPP), Oligosaccharide, and 'other'.Depending on the purpose of the analysis, we either analyze protocluster or region in this paper.For a comprehensive overview of the annotation layers generated by antiSMASH, please refer to ( 6 ,44 ).For an overview of the protocluster types, categories, and counts, please refer to Supplementary Table S3 .
The global distribution of BGCs on linear Streptomyces chromosomes was explored using the six antiSMASH categories (Figure 4 C).In all cases except for 'other', the BGC category is preferentially located towards the end of the chromosome, confirming previous observations that most specialized metabolite clusters are placed distally on the linear Streptomyces chromosome.The 'other' category includes a diverse range of BGCs, which could explain the lack of a clear pattern in the distribution.For the category oligosaccharide, the density is skewed to the right arm, however this could be an artifact from a low number of protocluster observations and an uneven taxonomic distribution ( n = 103).In total, this analysis is based on 29 933 Streptomyces protoclusters.We chose to represent the annotation level 'protocluster' to get a comprehensive overview of BGCs not impacted by adjacent BGCs which antiSMASH did not resolve.However, since some clusters are true hybrids consisting of more than one type of protocluster, this approach risks overestimating the total number of clusters in a dataset.Conversely, had we used the an-tiSMASH annotation level 'region', the number of common cluster types would certainly be underestimated, as clusters physically near-but unrelated to-other clusters would be in the same region and thus only be represented once, severely reducing the dataset near the linear chromosome ends.
We then proceeded to analyze the placement of a few common antiSMASH cluster types, (T1PKS, T2PKS, T3PKS, trans-AT PKS), or cluster types which had a clear distribution along the linear Streptomyces chromosomes (NI-siderophore, NRP-metallophore, Ectoine, Melanin, Arylpolyene).Plots for the remaining cluster types (with enough observations for a density plot) can be found in Supplementary Figure S4 .Some BGCs encoding the biosynthetic machinery for small compounds (e.g.furan and phosphonate) are located centrally on the linear chromosomes indicating that they may be conserved as part of the core genome.This raises the question of whether they are classical secondary metabolites, or have fundamental roles.Possibly some of these secondary metabolites are as important for survival as primary metabolites, or act multifunctionally.Phosphonates, for example, have innate bioactivities as antibiotics and have been suggested to be part of the signalling machinery ( 50 ,51 ).This suggests that there is no clear cut between primary and secondary metabolism, but more of a spectrum of importance.

Polyketides
For each of the four PKS types in Figure 5 , there is a mostly symmetrical distal distribution, with very few protoclusters found in the central region of the genome.Trans-AT PKSs seem to be heavily skewed to the 'right' chromosome arm, however the number of observations is low (139) which means that the distribution could be skewed by a few common cluster families.

Metallophores
Metal ions, especially Fe(II) / Fe(III), are essential cofactors for many cell functions, and organisms across the whole tree of life possess highly selective uptake systems to acquire these ions from the environment, frequently utilizing chelator-based sequestration.In bacteria, biosynthesis of these metallophores can be distinguished to be either dependent or independent of an NRPS (for an overview, see ( 52 )).In Streptomyces genomes both the NRPS dependent type (called NRP-metallophore, Figure 5 , middle right) and the NRPS independent type (called NI-siderophore, Figure 5 , center) can be found.Notably, NIsiderophores tend to be placed more centrally on the chromosomes whereas NRP-metallophore clusters occur more distally.

Ectoines
Ectoine (Figure 5 , lower left) is a compound which plays roles in protecting cells against osmotic stress.The BGCs responsible for production of Ectoine have a mostly symmetric distribution approximately halfway between the linear chromosome center and the chromosome ends, in an area where core genes are often found.This very localized placement suggests a conservation in the distribution along the linear Streptomyces chromosome.

Melanines
Melanin-type protoclusters (Figure 5 , lower middle) are placed all the way along the linear chromosome, with peaks both in the distal parts and the more central parts.Melanintype clusters are responsible for production of dark or colorful pigment compounds, with several subtypes performing different roles.The best known role for a melanin compound is UV protection in both eu-and prokaryotes, but some melanintype clusters encode compounds with antibacterial and antiviral activity ( 53 ), hinting that the distribution of melanins along the linear chromosome could be more located if the BGC type was broken further down into cluster families.

Arylpolyenes
Arylpolyenes (Figure 5 , lower right), which are pigments with functions similar to carotenoides, is encoded by a special subtype of T2PKS type BGCs ( 54 ).Arylpolyene BGCs have a prominently central location on the linear chromosome, potentially reflecting a role in primary metabolism.
For the BGC types CDPS, NAPAA, NRPS, NRPS-like, hlgE-KS, terpene, lanthipeptide-class-iv, linaridine, redox-cofactor, RiPP-like, amglyccycl (Aminoglycoside / aminocyclitol), aminopolycarboxylic acid, betalactone, blactam, indole, nucleoside, and 'other', a distal distribution is also seen, whereas for the remaining types, the distribution is either central or spread out over the whole genome (See Supplemental Figure 4 for distribution plots of all cluster types with > 50 observations).A treasure trove for specialized metabolite biosynthetic gene clusters using trans-AT PKS BGCs as an example While statistics on the BGC-content give an excellent overview of the biosynthetic potential of the individual strains, they don't indicate the diversity of BGCs across the dataset.To analyze the biosynthetic potential of a dataset, BGCs can be assigned to gene cluster families (GCFs).GFCs contain similar BGCs that most likely are involved in the biosynthesis of identical or chemically closely related natural products.A common approach for GCF identification includes reconstruction of a similarity metric between BGCs and then grouping BGCs into GCFs based on clustering of the similarity networks.For example, BiG-SCAPE generates BGC-class specific combined distance metric based on similarity metrics (36).Inclusion of BGCs for known natural products from the MIBiG reference database allows the assessment if the identified GCFs contain BGCs involved in the biosynthesis of known molecules or if they code for potentially novel pathways ( 55 ).A similarity network analysis with BiG-SCAPE indicated that the 29 024 BGC regions of G1034 can be assigned to as many as 7193 GCFs.190 of these could be associated with a BGC in the MIBiG reference dataset and thus likely code for known compounds or derivatives of these.We note that 4400 of the BIG-SCAPE GCFs are unique to individual isolates in the dataset.
As the diversity of BGCs and GCFs in the G1034 dataset is enormous, we chose to focus on the BGC type trans-AT PKS as an example of how to query the dataset.Trans-acyltransferase polyketides constitute a highly diverse group of bacterial multimodular specialized metabo-lites.These compounds are biosynthesized through diverse biosynthetic pathways, posing significant challenges for their characterization using computational prediction tools.Unlike cis-acyltransferase polyketide synthases ( cis-AT PKSs), where acyltransferase domains are integrated within the modules, trans-AT PKSs are characterized by the fact that the acyltransferase is encoded as a separate gene.The polyketide product often diverges from this expected module-based structure, however, common traits of trans-AT PKSs have aided the prediction of these intricate pathways ( 56 ).One example is the catalytic ketosynthase (KS) domains in the trans-AT PKS pathway, which can be grouped into clades correlating with distinct substrate specificity ( 56 ,57 ).For a detailed description of the trans-AT PKS biosynthesis see ( 58 ).These conserved features make detection of trans-AT PKSs possible, however, subsequent manual curation of the BGCs is beneficial when comparing novel BGCs as they may be assigned to a MIBiG reference because they share genes adjacent to the biosynthetic core but not necessarily synthesize similar molecules.
In our analysis of the G1034 dataset, we identified 247 trans-AT hits matching the antiSMASH trans-AT classification rule.Of these, 122 BGCs showed high similarity to 14 BGCs of known compounds.This similarity was confirmed through manual inspection of each BGC by identifying the described core genes in the biosynthesis pathway, as detailed in the methods section.Remarkably, out of the 17 described trans-AT PKS reference BGCs found in the MIBiG database from the phylum Actinomycetia, 14 are found in the G1034 dataset.The 247 transAT-PKS BGCs stem from 87 different species which are represented with fewer than ten trans-AT PKS BGCs, and two species, Streptomyces sp003846175 with 26 trans-AT PKS BGCs and Streptomyces anulatus with 56 transAT-PKS BGCs.These two species are also among the most well represented in the G1034 dataset with 26 and 25 strains, respectively.Furthermore, our analysis revealed 41 trans-AT PKS BGCs that did not resemble a MIBiG reference with similar biosynthetic genes.These BGCs were categorized into 7 gene cluster families and 14 singletons with unique domain organization.Given their distinct profiles, these GCFs are presumed to be putative novel BGCs, potentially representing new natural products (see Figure 6 B).The remaining 84 BGCs were not included in this analysis as they were found to be part of larger hybrid cluster regions, which if included would interfere with GCF assignment.
In the network representation of likely trans-AT PKS producing BGCs (Figure 6 A) the similarity cutoff 0.40 was chosen to best describe the relationship between BGCs.This cutoff highlighted substantial similarities among the BGCs, despite divergent features primarily due to neighboring genes to the biosynthetic core causing the BGCs to deviate from the main network cluster with BGCs that share biosynthetic core properties.A notable example of this scenario is the 5 kirromycin-like BGCs, which, despite a gene rearrangement and distinct flanking regions between its BGCs, share similar biosynthetic core regions, but are split into two network clusters (Figure 6 B, yellow).Accession numbers can be found in Supplementary Figure S5 and the manually curated overview of trans-AT PKS BGCs can be found in Supplementary Table S4 .

Conclusions
We have collected, sequenced and analyzed 1034 genomes from filamentous actinomycetes, in a dataset which more than doubles the available HQ complete genomes from the important genus Streptomyces .In this dataset, we identified more than 400 genomes from potential new species, as well as genomes from 145 existing species.We here provide the first large scale analysis of the terminal inverted repeats of Streptomyces genomes.We analyzed the placement of core genes and BGCs along the linear Streptomyces genome and found a conserved pattern, where features are not only located within the genomics compartments, but each feature is also organized with a certain distance to the chromosome center.Finally, we investigated trans-AT PKSs as an example of the BGC-focus analysis possible, and found that 14 out of 17 known actinomyces derived trans-A T PKS' s could be identified in the dataset, along with several new trans-AT PKS cluster families.With this study, we show the possible analysis which can only be performed on chromosome resolved complete genomes, and provide a valuable source of genomic information to be analyzed in future studies.

Figure 1 .
Figure 1. ( A ) Tax onom y and quality assessment of 2938 Streptom y cetaceae genomes and the 1034 actinom y cetes genomes from this study.T he GTDB assignment sho w ed 3696 of these genomes belonging to the genus Streptom y ces in total (middle panel of Sank e y diagram).( B ) Cumulativ e number of HQ Streptom y ces genomes in the NCBI R efSeq database (as of 30 J une 2023).T he y ello w bar represents the complete or chromosome le v el assemblies of Streptom y ces genomes collected in the present study.The slight difference between the Streptomyces numbers in panel A (501) and B (488) are due to differences in the taxonomic assignment between NCBI and GTDB.( C ) Phylogenetic tree of all high quality complete Streptomyces genomes reconstructed using GTDB-Tk de no v o method.Colors represent the genomes from either NCBI or this study (with or without GTDB species assignment).The bar chart on the outer ring denote the number of BGCs.( D ) Phylogenetic tree of the 1034 genomes presented in this study, reconstructed using GTDB-Tk de no v o method.Colors represent select genera, whereas the bar chart on the outer ring denotes genome length.

Figure 2 .
Figure 2. ( A ) Map of Denmark showing the sampling locations of the strains presented in this study.The green dots on the inner border show the 155 Danish strains without GPS coordinates, and the 304 dots in the outer border show the strains collected outside of Denmark.For an overview of the genomes stemming from outside of Denmark, please refer to Supplementary Figure S2 .( B ) Nanopore data used for assemblies, with N50 and total length of raw nanopore reads shown for chromosome level assemblies (blue) as well as near-complete assemblies (orange).( C ) Core gene presence measured using the BUSCO actinobacteria_class set of gene models and visualized as a violin (density) plot with a box plot o v erla y sho wing the first and third quartiles as well as the median.( D ) six examples of topologies from circular to linear with increasingly large terminal inverted repeat.The colored bars represent genomic sequences, and the black lines represent connections between them.( E , F ) Overview of the size and chromosome fraction taken up by the terminal inverted repeat.The colored dots correspond to the named species with > 10 members and show that within species, the TIR size and fraction varies 100-fold.

Figure 4 .
Figure 4. Kernel density plots of core genes and BGC protoclusters along the 742 complete linear Streptomyces genomes in the present study.( A ) Placement of 124 bacterial core genes from the BUSCO Bacteria dataset, showing central and highly organized placement in relation to the chromosome center.( B ) Secondary copy core genes showing a distal placement for all but one core gene.( C ) The six antiSMASH categories of protoclusters, showing a distal placement for all except for the 'other' category.

Figure 5 .
Figure 5. Kernel density plots showing the placement of specific antiSMASH7 protocluster types along the linear Streptomyces chromosomes.While the PKS type clusters are distally placed, other types such as NI-siderophore and arylpolyene are centrally placed.The heavily right skewed transAT-PKS distribution is dominated by cycloheximide which in 88% of cases is located on the right chromosome arm, and constitutes 48 out of 139 of the total number of transAT-PKS clusters, primarily from members of two species.

Figure 6 .
Figure 6.GCF network and BGC region comparison of selected identified GCFs.( A ) network representation of 163 trans-AT PKS BGCs generated with BIG-SCAPE and positioned by hierarchical clustering.Nodes are colored by putative production of a known trans-AT PKS or described as an unknown trans-AT PKS. ( B ) Clinker gene comparison plots displaying trans-AT PKS BGC regions with similar domain str uct ure as experimentally validated BGCs from the MIBiG secondary metabolite database.The MIBiG entry is on top in each plot.Furthermore, gene comparison plots displaying GCF of trans-AT PKS BGCs with no similar reference are shown on the right.Genome accession numbers and the full network are available in the Supplementary Figure S5 .