BGCFlow: systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets

Abstract Genome mining is revolutionizing natural products discovery efforts. The rapid increase in available genomes demands comprehensive computational platforms to effectively extract biosynthetic knowledge encoded across bacterial pangenomes. Here, we present BGCFlow, a novel systematic workflow integrating analytics for large-scale genome mining of bacterial pangenomes. BGCFlow incorporates several genome analytics and mining tools grouped into five common stages of analysis such as: (i) data selection, (ii) functional annotation, (iii) phylogenetic analysis, (iv) genome mining, and (v) comparative analysis. Furthermore, BGCFlow provides easy configuration of different projects, parallel distribution, scheduled job monitoring, an interactive database to visualize tables, exploratory Jupyter Notebooks, and customized reports. Here, we demonstrate the application of BGCFlow by investigating the phylogenetic distribution of various biosynthetic gene clusters detected across 42 genomes of the Saccharopolyspora genus, known to produce industrially important secondary/specialized metabolites. The BGCFlow-guided analysis predicted more accurate dereplication of BGCs and guided the targeted comparative analysis of selected RiPPs. The scalable, interoperable, adaptable, re-entrant, and reproducible nature of the BGCFlow will provide an effective novel way to extract the biosynthetic knowledge from the ever-growing genomic datasets of biotechnologically relevant bacterial species.


Table of Contents
Table S1.List of 15 main computational tools to select the large-scale genome mining analysis using BGCFlow ..

Figure S2. Job scheduling and monitoring using Snakemake and panoptes
Screenshot of the dry run for BGCFlow on the mq_saccharopolyspora with all jobs to run for the selected rules.The number of minimum threads for each jobs can be updated using configure file for the thread profiling.Also shown the monitoring of the Snakemake runs and individual jobs using panoptes.

Figure S11. Co-Phylogenetic Analysis of representative BGC members of BiG-FAM model GCF_201888
Phylogenetic comparison of genome tree (right) built using AutoMLST (6) and BGC tree (left) built using getphylo (26).The tanglegram was built using the R package Paco (33), with colors highlighting the family of where the genome/BGCs belongs to.Genomes of Saccharopolyspora which do not have the Staphylobactin-like BGCs are also included.The three orthologs chosen by getphylo encodes for the gene annotated for Staphyloferrin B transporter, N-((2S)-2-amino-2-carboxyethyl)-L-glutamate dehydrogenase, and N-(2-amino-2-carboxyethyl)-L-glutamate synthase.

Figure S1 .
Figure S1.Project configuration and metadata to setup BGCFlow a) Global configuration file (yaml) with path to two of the PEPs used in the demonstration.b) Example of the mq_saccharopolyspora PEP configuration file with selected rules to run.c) List of 28 NCBI accession IDs in the samples table used in the project (samples.csv).d) List of selected high-quality genome annotations of actinomycetes as priority of prokka annotation.(prokka-db.csv)

Figure S4 .
Figure S4.Example of Jupyter-based markdown reportsa) Screenshot of the home page of the BGCFlow reports of the mq_saccharopolyspora project with selected rules.b) Example report of the antismash rule's report page with a summary and the table with a list of genomes.The links for the genomes IDs in the table correspond to antiSMASH-generated HTML reports with details on all detected BGCs.

Figure S5 .
Figure S5.Overview of timeline, quality, and taxonomic placement of 42 Saccharopolyspora genomes a) Cumulative bar chart of the number of genomes over the last 15 years with different assembly qualities.b) Distribution of contamination vs completeness metrics calculated using CheckM, where colors represent the assembly qualities.c) Sankey diagram representing the species assignment differences between NCBI and GTDB.d) Scatterplot representing the distribution of N50 values vs the number of contigs.The cutoff of 50 contigs is used to filter the low-quality genomes, whereas 5 Mbp of N50 value cutoff was used to define high-quality genomes.The remaining genomes were defined as medium-quality.

Figure S6 .
Figure S6.Report of genome clustering using Mash distances.

( a )
Suggested optimal clustering based on K-means clustering with minimum adjusted inertia.(b) MASH-similarity network with nodes represented by genomes from different species and edge width represented by the similarity of 85% to 100%.The 8 optimal clusters are represented by different colors of nodes.

Figure S7 .
Figure S7.Gene comparison of staphylobactin-like BGCs in SaccharopolysporaComparison of staphylobactin-like BGCs which are connected through a BiG-FAM node GCF_201888, which has a Shannon index of ~0.3 and contained 12,444 BGCs which are distributed across 43 genera with the majority belonging to Staphylococcus (~94.2%).All 8 genes responsible for staphylobactin biosynthesis can be mapped through CBlaster except for NZ_FOZX01000003, with only 7 matches.

Figure S8 .
Figure S8.Comparison of BiG-SCAPE network with different cutoffs assignment with the enriched network

Figure S9 .
Figure S9.Gene comparison of spinosyn-like BGCs in SaccharopolysporaComparison of BGCs with BiG-SCAPE and KnownClusterBlast similarity to spinosyn.All regions matched with spinosyn belong to phylogroup 2. Of the 5 BGCs, only 1 has an exact structure with spinosyn MIBIG BGC (NZ_PJNB01000001).Other BGCs showed variation in the PKS modules and might even lose the whole biosynthesis pathway (NZ_CP061007).Annotations in all BGCs showed matches to specific tailoring enzymes found in spinosyn biosynthesis for sugar attachment (cog_03probable rhamonsyltransferase).

Figure S10 .
Figure S10.Median Beta-RD distribution of Saccharopolyspora GCFsSummary of median Beta-RD distribution of BiG-SCAPE GCFs in Saccharopolyspora calculated using lsabgc(30).Only GCFs with non-null values are shown.Highlighted in red are GCFs with similarity to Staphylobactin (Staphyloferrin B) based on query result to BiG-FAM GCF model GCF_201888.Detailed beta-rd distribution and population genetic analysis is available from Data S8.

Table S1 . List of 15 main computational tools to select the large-scale genome mining analysis using BGCFlow Rule Name Tool Name Version Description Link Refs
(20)s://github.com/zellerlab/GECCO(20)

Table S3 . Additional sub-workflows available in BGCFlow
This figure compares the number of connected component assignments between an enriched and non-enriched BiG-SCAPE network with different cutoffs.BiG-SCAPE networks were enriched with ARTS2 hits, BiG-FAM hits, and KnownClusterBlast hits.BiG-FAM models202087, 200946, 210179,  213140, 201682, 201608, 205957, 215277, 201830, and202082 were removed because the assigned top genus in the model is below 30%.A) BiG-SCAPE sequence similarity network with cutoff 0.3 resulting in 328 connected components (GCFs) with 206 singletons.A total of 4 GCFs with 14 BGCs can be assigned to 6 MIBIG entries.B) Enriched BiG-SCAPE network with cutoff 0.3 resulting in 202 connected components with 44 singletons.This increases the number of BGCs in the 4 BiG-SCAPE GCF with MIBIG hits into 29 BGCs.Additionally, a total of 105 BGCs can be assigned to 12 GCFs with MIBIG KnownClusterBlast similarity >= 80% and left only 122 BGCs in 70 GCFs without connection to MIBIG and BiG-FAM nodes.C) BiG-SCAPE sequence similarity network with cutoff 0.4 resulting in 280 connected components (GCFs) with 162 singletons.A total of 10 GCFs with 80 BGCs can be assigned to 19 MIBIG entries.D) Enriched BiG-SCAPE network with cutoff 0.4 resulting in 190 connected components with 41 singletons.This increases the number of BGCs in the 10 BiG-SCAPE GCF with MIBIG hits into 109 BGCs.Additionally, a total of 53 BGCs can be assigned to 8 GCFs with MIBIG KnownClusterBlast similarity >= 80% and left only 115 BGCs in 64 GCFs without connection to MIBIG and BiG-FAM nodes.E) BiG-SCAPE sequence similarity network with cutoff 0.5 resulting in 259 connected components (GCFs) with 144 singletons.A total of 15 GCFs with 115 BGCs can be assigned to 110 MIBIG entries.F) Enriched BiG-SCAPE network with cutoff 0.5 resulting in 173 connected components with 36 singletons.This increases the number of BGCs in the 15 BiG-SCAPE GCF with MIBIG hits into 184 BGCs.Additionally, a total of 44 BGCs can be assigned to 7 GCFs with MIBIG KnownClusterBlast similarity >= 80% and left only 114 BGCs in 59 GCFs without connection to MIBIG and BiG-FAM nodes.Red box highlights the location of spinosyn-like BGCs in the network.