BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters

Abstract

Background: Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs).

Results: Here, we introduce BiG-SLiCE, a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion. We used BiG-SLiCE to analyze 1,225,071 BGCs collected from 209,206 publicly available microbial genomes and metagenome-assembled genomes within 10 days on a typical 36-core CPU server. We demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. BiG-SLiCE also provides a “query mode” that can efficiently place newly sequenced BGCs into previously computed GCFs, plus a powerful output visualization engine that facilitates user-friendly data exploration.

Conclusions: BiG-SLiCE opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry. BiG-SLiCE is available via https://github.com/medema-group/bigslice.

We were aware of these approaches during the design phase of BiG-SLiCE, but indeed overlooked them in our writing. We have now added a paragraph discussing the ProtVec and Pfam2Vec approaches to our "BGC feature extraction" subsection, and an additional forward-looking perspective to our "Conclusions and Future Perspectives" subsection.
(2) Input folder: Even though the wiki page about the input folder is very useful, a downloadable example of an input folder would make it very easy to test, and users could familiarize themselves with the output of the tool. Response: We already had a downloadable input folder example in our GitHub repository (https://github.com/medema-group/bigslice/tree/master/misc/input_folder_template). We now specifically mention this folder for a "minimal test run" on our README page.
(3) README file (Quick start / 3. Fetch the latest HMM models): The name of the script in the instruction is "download_bigslice_hmmdb.py" while the installed script does not have the ".py" extension. Response: We have fixed the specified typo on our README page.
(4) The minimum information that a cluster's GenBank file must contain to be processed must be provided, i.e., which features/annotations should the file have? This will be helpful when an alternative tool to antiSMASH is used to detect BGCs, or in cases where the cluster borders are manually refined. Response: At this moment, we rely on curation-based tools and databases such as antiSMASH and MIBiG in order to tune our feature extraction and algorithm on well-studied BGC classes. Technically speaking, we rely on the output format of antiSMASH and MIBiG to assign BGC (sub-)classes and to separate (putatively) complete and fragmented BGCs. In the future, we plan to accept file formats produced by other, more "exploratory" tools and databases. We have added some extra sentences in our "Conclusions and Future Perspectives" to communicate this to readers. antiSMASH cluster GenBank files with manually refined cluster borders will be accepted by BiG-SLiCE as inputs without any extra requirements. Furthermore, to support people who want to use their own custom GenBank file inputs, we have now added an adaptor script (https://github.com/medema-group/bigslice/tree/master/misc/generate_antismash_gbk) along with short instructions on our README page (section: "Custom GenBank input") on how to transform a regular GenBank file into a valid antiSMASH5-like cluster GenBank file suitable for use in BiG-SLiCE.
Reviewer #2 points

(1) BiG-SLiCE is quite rigidly designed around input data coming (exclusively) from antiSMASH (it only seems to accept antiSMASH-modified GenBank files); we could not get it to accept more standard GenBank files or ones generated by other tools such as DeepBGC. Response: Indeed, for its initial release, we limited BiG-SLiCE usage to curation-based tools and databases like antiSMASH and MIBiG, as we wanted to be able to carefully optimize and benchmark its feature extraction strategy (i.e., the biosynthetic and sub-Pfams) and clustering algorithm (i.e., using antiSMASH's and MIBiG's metadata to separate complete and fragmented BGCs) on well-studied, experimentally supported BGC classes. Looking into the future, it is perfectly viable (and under our consideration) to expand its coverage to other, more "exploratory" tools and databases like DeepBGC. We have added some extra sentences in our "Conclusions and Future Perspectives" to communicate this to readers. To support people who want to use their own custom GenBank files, we have now added an adaptor script (https://github.com/medema-group/bigslice/tree/master/misc/generate_antismash_gbk) along with short instructions on our README page (section: "Custom GenBank input") on how to transform a regular GenBank file into a valid antiSMASH5-like cluster GenBank file suitable for use in BiG-SLiCE.
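For readers curious what such an adaptor does conceptually, the following is a minimal, hypothetical sketch (not the actual generate_antismash_gbk script): it wraps a plain GenBank record with an antiSMASH 5-style "region" feature spanning the whole record, so that a parser expecting antiSMASH cluster files can pick up a product class and contig-edge flag. The feature and qualifier names follow the antiSMASH 5 convention; for real data, the script in the repository should be used instead.

```python
# Hypothetical sketch of an antiSMASH5-like GenBank adaptor.
# Assumption: the record length appears as the third field of the LOCUS line,
# as in the standard GenBank flat-file layout.
def add_region_feature(gbk_text, product="unknown", contig_edge=False):
    lines = gbk_text.splitlines(keepends=True)
    # Take the record length from the LOCUS line.
    length = next(l.split()[2] for l in lines if l.startswith("LOCUS"))
    feature = (
        f"     region          1..{length}\n"
        f'                     /product="{product}"\n'
        f'                     /contig_edge="{str(contig_edge)}"\n'
    )
    out = []
    for line in lines:
        out.append(line)
        if line.startswith("FEATURES"):
            out.append(feature)  # insert right below the FEATURES header
    return "".join(out)
```

This is only meant to convey the idea of the transformation; the real adaptor also has to handle multi-record files, sequence data, and existing feature tables correctly.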
(2) BiG-SLiCE results are all stored in the SQL database and export is not straightforward. Programmatic access to the database is not (well) documented either. A more flexible design for BiG-SLiCE, in particular export of results as flat-text files for easy import to enable quick follow-up analysis, would be highly desirable. I feel that while GUI access may be most important for tools such as antiSMASH, most users interested in really large-scale BGC analyses will (have to) possess more advanced computational skills and would therefore likely value programmatic access over graphical outputs. As a side note, I'd encourage the authors to consider more lightweight designs for future versions of BiG-SLiCE, as these may also scale better. Fig 5A to me suggests that further gains in efficiency (necessary for ever larger analyses of MAGs etc.) cannot be expected from algorithmic improvements (clustering currently only takes ~10% of the whole runtime), but from improved software engineering (e.g., leaner IO). Response: One key reason to center our design around SQL-based data storage is that BiG-SLiCE can target both advanced and basic users at the same time. We envision that BiG-SLiCE will be used to process at least double the ~1.2M BGCs analyzed in the manuscript, especially if we consider taking inputs from more "exploratory" BGC detection tools like DeepBGC that predict larger numbers of clusters. For that, SQL(ite)-based data storage provides a simple yet powerful way for advanced users to process the data for downstream analyses (including generating any custom tabular and text-based files on demand). In the paper's supplementary data, we give some example SQL queries and Python scripts that can be reused to familiarize users with programmatic access to BiG-SLiCE data. We have now made these instructions clearer on our README page (section: "Programmatic Access and Postprocessing").
In the future, we plan to make a more comprehensive wiki and guidelines on how to perform such operations, as we are aware that, although SQL is starting to gain increased adoption in the field, not everybody is familiar with it from the start. Furthermore, the same SQL database will support flexible development of even more extensive and/or customized features on top of its Flask-based user interface engine. Although the one-time clustering process can indeed be performed by (slightly) more advanced users, its resulting output can be hosted locally (i.e., on an intranet), so that regular users can access it via their web browsers. We find that this automated visualization strategy is important to enable detailed exploration and expert-based interpretation of the large-scale clustering results, and that it would be appreciated by both target audiences. We give an example of this "scenario" in our newly online BiG-FAM database (https://bigfam.bioinformatics.nl/, 10.1093/nar/gkaa812), which was built upon BiG-SLiCE's SQLite- and Flask-based output. As for improving BiG-SLiCE's runtime efficiency, although there is a lot to gain over its current SQL-based implementation (i.e., at this moment, all SQL operations are naively done in series, which could be significantly improved by adopting a more sophisticated parallelization strategy), we do not see throwing the architecture away in favor of, e.g., a tabular file-based approach as a good trade-off, for the reasons described above.
(3) https://github.com/medema-group/bigslice: hyperlink to pre-processed result on ~1.2M microbial BGCs from the NCBI database produces 404. Response: We have fixed the broken link on our README page.

(4) https://github.com/medema-group/bigslice/wiki/Input-folder: GTDB-toolkit and NCBI taxonomy script hyperlinks produce 404. Response: We have fixed the broken links on our README page.

(5) https://github.com/medema-group/bigslice/wiki/Program-parameters: program parameters are not documented. Response: We have now put up an explanation of each program parameter on our wiki page.

(6) The `bigslice --version` command does not print the version. Response: Due to the tricky behavior of Python's argparse, the previous version of BiG-SLiCE required an input folder to be specified (in this case "bigslice --version ."; mind the extra dot). We have now implemented a workaround to enable the mentioned command to correctly print the software version (this will only be available through PyPI in the next stable release).

(7) Page 1. The sentence "Due to the sheer size of microbial and enzymological biodiversity, there exists a vast repertoire of potentially useful compounds remains to be unearthed" needs rewording. Response: We agree that this sentence was too wordy and hard to follow. We have rephrased it into a simpler and clearer one.

(8) Page 5. Fig 2C: Separation between distributions could be quantified by receiver-operating-characteristic (ROC) curves (this also applies to Fig 3). Using a varying distance cutoff, pairs from the same class above the cutoff are counted as true positives and below the cutoff as false negatives; pairs from different classes above the cutoff are counted as false positives, and below the cutoff as true negatives. Response: We do not intend to turn the underlying questions into a classification problem. In Figure 2C, BGCs can have multiple classes, and even within the same antiSMASH class, the domain content (and thus the observed distance) can vary.
In Figure 3, although we can say that the MIBiG groups can serve as a "ground truth", the manually described families should only be taken as a reference (to objectively judge BiG-SLiCE's results) rather than as a classifier benchmark.

(9) Fig 2D: consider changing the scale to 0-100 for "percent" identity, as I am assuming this refers to 0-100% and not 0-1.0%.

Response: The left y-axis shows Pearson correlation values (R), as opposed to %-identity. We thought the title and
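As an aside for readers, the ROC construction the reviewer proposes in point (8) above can be sketched in a few lines of pure-stdlib Python. The scores and labels below are made up for illustration; following the reviewer's convention, a pair scoring above the cutoff is predicted "same class" (flip the comparison when working with distances, where small values mean similarity). The sketch assumes both classes are present.

```python
# Sketch of the reviewer's proposed ROC construction over labelled pairs.
def roc_points(scores, same_class):
    """Return (FPR, TPR) points, sweeping the cutoff over observed scores.

    scores: per-pair similarity score; same_class: True if the pair comes
    from the same class (the "positive" label).
    """
    n_pos = sum(same_class)
    n_neg = len(same_class) - n_pos
    points = []
    for cutoff in sorted(set(scores)):
        # Above the cutoff = predicted same-class (reviewer's convention).
        tp = sum(1 for v, s in zip(scores, same_class) if v > cutoff and s)
        fp = sum(1 for v, s in zip(scores, same_class) if v > cutoff and not s)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

A perfectly separated set of pairs would yield the point (0.0, 1.0) somewhere along the curve; the area under such a curve is the quantitative separation measure the reviewer has in mind.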