Linking biosynthetic and chemical space to accelerate microbial secondary metabolite discovery

ABSTRACT Secondary metabolites can be viewed as a chemical language, facilitating communication between microorganisms. From an ecological point of view, this metabolite exchange is in constant flux due to evolutionary and environmental pressures. From a biomedical perspective, the chemistry is unsurpassed for its antibiotic properties. Genome sequencing of microorganisms has revealed a large reservoir of Biosynthetic Gene Clusters (BGCs); however, linking these to the secondary metabolites they encode is currently a major bottleneck to chemical discovery. This linking of genes to metabolites with experimental validation will aid the elicitation of silent or cryptic (not expressed under normal laboratory conditions) BGCs. As a result, this will accelerate chemical dereplication, our understanding of gene transcription and provide a comprehensive resource for synthetic biology. This will ultimately provide an improved understanding of both the biosynthetic and chemical space. In recent years, integrating these complex metabolomic and genomic data sets has been achieved using a spectrum of manual and automated approaches. In this review, we cover examples of these approaches, while addressing current challenges and future directions in linking these data sets.


INTRODUCTION
Recent improvements in the identification of BGCs have revolutionised our capacity to understand secondary metabolite production. Over the last few years, there has been a significant effort to link genomic data to secondary metabolite data for microorganisms, in particular bacteria. The first section of this review focuses on the creation of both biosynthetic and chemical data sets used for this purpose. The term linking in this review covers all aspects of associating BGCs with the metabolites they encode. The linking process is divided into two main sections. The first covers targeted approaches, comprising two subsections: targeted genome mining linking, whereby strains are selected for further chemical analysis based on BGC information, and multi-targeted linking, which combines genome mining with bioactivity, metabolomics and proteomics approaches. The second covers automated approaches, which are further divided into correlation-based and feature-based. Correlation-based approaches identify putative links via correlation of strain inclusion in clusters of spectra and BGCs. Feature-based approaches score individual spectra against individual BGCs based on shared properties. At the same time, we appreciate that studies rarely fall into these binary categories and that, in reality, linking is often a spectrum using both approaches. In this review, we have selected recent studies to exemplify the discovery angle of each approach. We conclude with examples of experimental validation of these links through synthetic biology methods and a section on current challenges.

BIOSYNTHETIC AND CHEMICAL DATA SETS FOR LINKING
This section covers how data sets are made from the prediction of BGCs and the detection of metabolites. This is a crucial step, as the quality of the data sets will directly impact the success of linking across them. Predicting BGCs from bacterial genomes is a fairly mature discipline. Tools such as antiSMASH (Blin et al. 2017) and SMURF (Khaldi et al. 2010) provide BGC predictions by matching curated statistical models based on sequences of protein family domains (PFAM domains) to genomic sequences (Coggill, Finn and Bateman 2008). Such techniques typically exhibit high specificity (low numbers of false positives) at the expense of sensitivity (high numbers of false negatives). For users wishing for higher sensitivity, more speculative algorithms such as ClusterFinder (Cimermancic et al. 2014), MIDDAS-M (Umemura et al. 2013) or MIPS-GC (Umemura, Koike and Machida 2015) exist. An in-depth review of BGC detection is beyond the scope of this article; please refer to Chavali and Rhee (2018) for a detailed review. [...] as are the community-driven comparative metabolomics platforms such as molecular networking, based on tandem mass spectrometry data (Wang et al. 2016), and databases such as NP Atlas (www.npatlas.org). While cross-referencing databases has been the status quo for secondary metabolite identification, facilitating this through automation would greatly accelerate discovery.

Targeted genome mining approaches to linking
This section discusses strains prioritised for chemical investigation based on genome mining information, for example the presence of specific BGCs. This linking is often a manual process and requires specialist biosynthetic and chemical knowledge. The first step involves dereplication, that is, strain prioritisation (or strain elimination due to the presence of previously discovered metabolites) within complex biosynthetic and chemical data sets. Dereplication has been greatly aided by the data analysis platforms mentioned previously, resulting in new metabolite discovery (Fig. 1) (Duncan et al. 2015; Kaweewan et al. 2017; Schneider et al. 2018; Son et al. 2018; Ueoka et al. 2018; Xu et al. 2018). For example, two new peptides, the halogenated curacomycin (1) and its dechlorinated derivative dechlorocuracomycin (2), produced by Streptomyces curacoi and Streptomyces noursei respectively, were discovered through a genome mining approach using antiSMASH to identify the presence of tryptophan halogenase genes in proximity to a nonribosomal peptide synthetase (NRPS) BGC. Further genomic investigation of S. curacoi led to the discovery of an additional new cytotoxic peptide, curacozole, produced by a gene analogous to that of curacomycin (Kaweewan et al. 2019). Using a similar approach, the novel lanthipeptide tikitericin (3) was isolated from a thermophilic bacterium, Thermogemmatispora sp., by detecting a lanthionine synthetase homologous to a class II lanthipeptide BGC. A putative BGC consisting of 10 genes and encoding a new lasso peptide was identified through genome mining of the rare actinomycete Actinokineospora spheciospongiae and led to the isolation of actinokineosin, a new peptide with promising antibacterial activity (Takasaka et al. 2017). In another study, the product of the lasso peptide BGC uld from the genome of Streptomyces sp. KCB13F003 was targeted using a One Strain Many Compounds (OSMAC) approach, in which each strain was grown in multiple media in an attempt to elicit a wide range of BGC expression, resulting in the isolation of the new metabolite ulleungdin (4) (Son et al. 2018). Recently, a plant-associated Gynuella sunshinyii strain was prioritised based on the presence of several unassigned BGCs and six trans-AT PK clusters. Further investigation led to the isolation of four metabolites, of which three, the polyketides lacunalide A and B (5, 6) and the cyclodepsitripeptide sunshinamide (7), represented novel scaffolds. Further manual genome mining resulted in two ergoyne analogues, ergoyne A and B, revealing the importance of complementing automated genome mining with manual curation.

Multi-targeted linking
In the last five years, comparative metabolomics has been linked with comparative genome mining, proteomics and bioactivity to accelerate discovery. For example, genome mining of a lichen-associated Streptomyces sp. with metabolites bioactive against Bacillus subtilis revealed a BGC encoding a lantibiotic. By connecting the inactivation of the BGC to the loss of the observed bioactivity, a new 35-membered macrocyclic thiopeptide antibiotic, geninthiocin B (8), was isolated (Schneider et al. 2018). Natural product proteomining, a quantitative proteomics platform, was introduced by Gubbens et al. for the identification of BGCs for targeted secondary metabolites by combining the OSMAC approach, metabolomics and quantitative proteomics. This approach allowed correlations between quantitative metabolomics or bioactivity data and protein expression profiles. A new juglomycin derivative was isolated from a soil-isolated Streptomyces sp. by applying this quantitative multi-omics approach (Gubbens et al. 2014).
Using pattern-based BGC genome mining combined with comparative metabolomics through molecular networking, the relationship between BGCs and the corresponding metabolites across 35 Salinispora strains was assessed. This resulted in an uncharacterised PKS BGC being linked to the previously reported metabolite arenicolide A, in addition to linking an uncharacterised NRPS BGC (NRPS40) to retimycin A (9), a new quinomycin-like depsipeptide (Duncan et al. 2015). The increasing complexity and scale of these combined data sets, often consisting of tens to thousands of both genomes and spectra, has resulted in the need to automate this process.

AUTOMATED LINKING
Two major tracks can be observed in the automated linking of BGCs to secondary metabolites. The first approach, feature-based linking, involves linking chemical features predicted from genomic information, for example neutral losses indicative of amino acid residues, with the observed metabolomic data. The second approach, correlation-based linking, makes use of data sets where genomic and metabolomic data are available for a large number of strains. Related BGCs assembled into gene cluster families (GCFs) can be correlated with spectra belonging to molecular families (MFs) based on the occurrence of their source strains across the data set, with the assumption that true links would have high source strain correlations.
Figure 1. Curacomycin (1), dechlorocuracomycin (2), tikitericin (3), ulleungdin (4), lacunalide A and B (5, 6), sunshinamide (7), geninthiocin B (8) and retimycin A (9).

Feature-based linking
A fruitful approach for linking BGCs to secondary metabolites has been to predict the structural properties of the molecules from genomic information and to detect the corresponding features directly in mass spectra (Fig. 2). For example, tools such as SANDPUMA can predict substrate specificity for adenylation domains in NRPS BGCs. These predictions can be matched to amino acid residue-derived MS/MS fragmentations. This approach, which uses molecular properties predicted from genomic information to guide the search in chemical space and was termed peptidogenomics by Kersten et al. (2011), has been used to link peptidic natural product BGCs to metabolites. These include stendomycin I (10) and the ribosomal lantipeptide. A similar technique has been described by Panter, Krug and Müller (2019) for polyketides, which facilitated the discovery and structure elucidation of fulvuthiacene A and B (11, 12). Other examples of tools intended to predict detectable features from genomic information include RODEO (Tietz et al. 2017), which is focused on RiPPs, although this has not yet been integrated with mass spectrometry data; PRISM (Skinnider et al. 2017), which focuses on NRPs and type I and II PKSs but is limited to LC-MS (but not LC-MS/MS) data; GNP (Johnston et al. 2015), which links NRPS and PKS BGCs to LC-MS/MS spectra; Pep2path, which focuses on peptidic natural products; and RiPPquest (Mohimani et al. 2014a) and NRPquest (Mohimani et al. 2014b), which detect RiPP and NRPS BGCs, respectively, and predict possible fragmentation patterns for their products.
Also relevant are tools that have been developed for dereplication. For example, DEREPLICATOR (Mohimani et al. 2017, 2018) predicts possible fragmentation patterns for peptidic natural products. By linking spectra to peptides, and identifying the BGCs responsible for the production of those peptides in databases such as MIBiG (Medema et al. 2015), similar BGCs from the organism can be tentatively linked to the spectra. This approach was used, for example, by Mohimani et al. (2018) to link the polyketide antibiotic C35H56O13, which is structurally similar to chalcomycin, to its producing BGC.
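The matching of predicted residues to MS/MS fragmentations described in this section can be illustrated with a minimal sketch. It is not the implementation of any of the tools named above: the function names, the tolerance and the scoring rule (fraction of predicted residues supported by the spectrum) are assumptions made for illustration, and the sketch assumes singly charged fragment ions. Only the monoisotopic residue masses are standard values.

```python
# Monoisotopic residue masses (Da) for a few proteinogenic amino acids.
# Leu and Ile are isobaric, so "Leu" stands for either here.
RESIDUE_MASS = {
    "Gly": 57.02146,
    "Ala": 71.03711,
    "Ser": 87.03203,
    "Pro": 97.05276,
    "Val": 99.06841,
    "Thr": 101.04768,
    "Leu": 113.08406,
    "Phe": 147.06841,
}


def observed_residues(peaks, tol=0.01):
    """Return residues whose mass matches the gap between any two peaks.

    peaks: iterable of fragment m/z values (assumed singly charged).
    tol:   absolute mass tolerance in Da (hypothetical default).
    """
    found = set()
    mzs = sorted(peaks)
    for i, low in enumerate(mzs):
        for high in mzs[i + 1:]:
            diff = high - low
            for name, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    found.add(name)
    return found


def feature_score(predicted, peaks, tol=0.01):
    """Fraction of genome-predicted residues supported by the spectrum."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & observed_residues(peaks, tol)) / len(predicted)


# Illustrative spectrum: consecutive gaps correspond to Ala, Val and Phe.
example_peaks = [200.0, 271.03711, 370.10552, 517.17393]
# Hypothetical adenylation-domain predictions for one NRPS BGC.
example_prediction = ["Ala", "Val", "Ser"]
example_score = feature_score(example_prediction, example_peaks)
```

In this toy case two of the three predicted residues (Ala and Val) are supported by mass gaps in the spectrum, so the BGC-spectrum pair would rank above pairs sharing fewer residues. Real tools additionally consider residue order, modifications and charge states.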

Correlation-based linking
Another major approach is based on matching patterns of source strain occurrence between GCFs and MFs. The assumption that similar BGCs in different strains will produce similar molecules can be used to compute a score for the link between a GCF and an MF. This builds upon early work by Lin and coworkers (Lin, Zhu and Zhang 2006) on clustering homologous BGCs (structurally similar clusters that have a shared ancestry), which was then verified by comparison with known clusters of homologous genes. Later work on clustering BGCs (Cimermancic et al. 2014; Doroghazi et al. 2014; Navarro-Muñoz et al. 2018) has usually involved explicit verification of the GCF by heterologous expression or gene knockout, i.e. confirmation that the BGCs being grouped together are, in fact, producing related metabolites, but follows a similar pattern of defining novel distance functions between BGCs and constructing a clustering based upon these distances. The distances are usually defined at least partly in terms of the composition of BGCs by protein family domains (Lin, Zhu and Zhang 2006).
The clustering of spectra into molecular families by molecular networking is incorporated into tools such as GNPS (Wang et al. 2016). The similarity between any two spectra in a data set is computed using a modified cosine similarity score, which accounts for certain structural modifications; two spectra are then taken to belong to the same MF if their score exceeds a user-defined threshold.
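To make the idea of a modified cosine score concrete, the following is a simplified sketch, not the GNPS implementation. It assumes that a pair of peaks may match either directly or after shifting by the difference in precursor m/z (which is how a single structural modification can be tolerated), that peaks are paired greedily by intensity product rather than by an optimal assignment, and that raw intensities are used without scaling. The function name, tolerance and example spectra are hypothetical.

```python
import math


def modified_cosine(spec_a, spec_b, precursor_a, precursor_b, tol=0.02):
    """Greedy modified cosine between two MS/MS spectra.

    spec_a, spec_b: lists of (m/z, intensity) tuples.
    precursor_a/b:  precursor m/z values of the two spectra.
    tol:            absolute m/z tolerance for a peak match (assumed value).
    """
    shift = precursor_a - precursor_b
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            direct = abs(mz_a - mz_b) <= tol
            shifted = abs(mz_a - mz_b - shift) <= tol
            if direct or shifted:
                candidates.append((int_a * int_b, i, j))

    # Greedy assignment: take the strongest pairs first, each peak used once.
    candidates.sort(reverse=True)
    used_a, used_b, dot = set(), set(), 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product

    norm_a = math.sqrt(sum(inten ** 2 for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten ** 2 for _, inten in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two toy spectra differing by a +14 Da modification on one fragment.
spectrum_a = [(100.0, 1.0), (150.0, 2.0)]
spectrum_b = [(100.0, 1.0), (164.0, 2.0)]
with_shift = modified_cosine(spectrum_a, spectrum_b, 300.0, 314.0)
without_shift = modified_cosine(spectrum_a, spectrum_b, 300.0, 300.0)
```

With the precursor difference taken into account, the shifted fragment is matched and the two toy spectra score as identical; if the precursor masses are treated as equal, only the unshifted peak matches and the score drops. A user-defined threshold on this score then decides whether the two spectra are placed in the same MF.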
Once the BGCs have been clustered based on the distance measurements, the shared source strains of GCFs and MFs can be used as a starting point to correlate between BGCs and products (Doroghazi et al. 2014; Goering et al. 2016) in an approach known as metabologenomics. A linking score between a GCF and an MF is computed, which depends on the degree of strain overlap and penalises strains that are present on one side (GCF or MF) and not the other. Since the presence of a BGC in a strain does not guarantee that it will be active in all circumstances (cryptic BGCs), this penalty is often asymmetric, with a low penalty applied for strains that contribute to the GCF but not to the spectra, while strains that contribute to the spectra and not to the GCF are highly penalised. A new chlorinated antiproliferative compound named tambromycin (13) was discovered by applying this approach to a set of 178 actinomycete strains. Moreover, a new class of natural products and their BGC were discovered when metabologenomics was combined with molecular networking. As a result, six tyrobetaines (14) bearing an unusual N-terminal trimethylammonium were identified and their BGC was confirmed through heterologous expression (Parkinson et al. 2018).
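The asymmetric strain-overlap score described above can be sketched in a few lines. The weights below are illustrative assumptions rather than values taken from the cited studies, and the function and variable names are hypothetical; the point is only that a strain contributing spectra without carrying the BGC is penalised heavily, while a strain carrying a possibly silent BGC without contributing spectra is not.

```python
def link_score(gcf_strains, mf_strains, all_strains,
               both=10.0, neither=1.0, gcf_only=0.0, mf_only=-10.0):
    """Strain-correlation score for one GCF-MF pair.

    gcf_strains: strains containing a BGC from the gene cluster family.
    mf_strains:  strains contributing spectra to the molecular family.
    all_strains: every strain in the data set.
    The four weights are assumed values chosen for illustration.
    """
    gcf, mf = set(gcf_strains), set(mf_strains)
    score = 0.0
    for strain in all_strains:
        in_gcf, in_mf = strain in gcf, strain in mf
        if in_gcf and in_mf:
            score += both        # evidence for the link
        elif in_gcf:
            score += gcf_only    # BGC may simply be silent: no penalty
        elif in_mf:
            score += mf_only     # metabolite without the BGC: strong penalty
        else:
            score += neither     # consistent absence: weak support
    return score


# Toy data set of six strains.
strains = ["A", "B", "C", "D", "E", "F"]
example = link_score(gcf_strains=["A", "B", "C"],
                     mf_strains=["A", "B", "D"],
                     all_strains=strains)
```

Here strains A and B support the link, C carries the BGC without the metabolite, D produces the metabolite without the BGC, and E and F have neither. Scoring every GCF-MF pair this way and ranking the results gives a prioritised list of putative links for follow-up.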

Hybrid approaches
Even though we have described two distinct approaches, they are not mutually exclusive. Clustering BGCs can be used in conjunction with previously established links to infer the product from the known link to the other BGCs in the cluster (Nguyen et al. 2013). This can also be done with databases of known BGCs, such as MIBiG, to determine which BGCs are likely to produce already known compounds (Helfrich et al. 2018). Clustering can therefore complement the matching of BGCs and metabolites based on predicted spectral features.
Similarly, the mutual strain information between a cluster of BGCs and a metabolite is not enough on its own to establish a correspondence between the two, especially for novel secondary metabolites. Instead, the strain content has been used to prioritise the potential links for further verification. This verification has taken several forms: predictions of common structural elements of the products, which were then searched for in the metabolomic data (Parkinson et al. 2018); the creation of knockout strains, used to verify the linking of macrobrevin (15) to its BGC; or the correspondence of parts of the BGCs with known parts of the pathway for the product from other organisms (Helfrich et al. 2018). An example of the last approach is the discovery of indolmycin (16) (Maansson et al. 2016), where publicly available databases and machine learning were used to create an integrated mining approach for linking gene clusters, biosynthetic pathways and secondary metabolites for 13 closely related strains of Pseudoalteromonas luteoviolacea. In these strains, close to 10% of the total genes encode secondary metabolites. This percentage is considerably higher than in studies conducted on other Pseudoalteromonas strains (Médigue et al. 2005; Thomas et al. 2008) and is corroborated by the high degree of chemical complexity reported in Maansson et al. (2016), as only 2% of the molecular features were shared between the investigated strains. Indeed, novel analogues of thiomarinols were detected in the molecular network of strains that were characterised as biosynthetically diverse.
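The first hybrid strategy in this section, inferring products for uncharacterised BGCs from a previously established link within the same cluster, amounts to a simple propagation step. The sketch below assumes hypothetical identifiers and data structures (a mapping from GCF to member BGCs, and a mapping from BGC to a known product); it is not taken from any of the cited tools.

```python
def propagate_links(gcf_members, known_links):
    """Suggest products for unlinked BGCs from linked members of their GCF.

    gcf_members: dict mapping a GCF identifier to a list of BGC identifiers.
    known_links: dict mapping a BGC identifier to its validated product.
    Returns a dict mapping each unlinked BGC to a set of putative products.
    """
    inferred = {}
    for bgcs in gcf_members.values():
        products = {known_links[b] for b in bgcs if b in known_links}
        if not products:
            continue  # nothing to propagate in this family
        for bgc in bgcs:
            if bgc not in known_links:
                inferred[bgc] = set(products)
    return inferred


# Hypothetical families: only bgc_1 has an experimentally validated product.
families = {
    "GCF_1": ["bgc_1", "bgc_2", "bgc_3"],
    "GCF_2": ["bgc_4"],
}
validated = {"bgc_1": "product_X"}
suggestions = propagate_links(families, validated)
```

The output is a set of hypotheses, not confirmed links: as the text notes, such suggestions still need verification, for example by searching for predicted structural features in the metabolomic data or by genetic manipulation.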

VALIDATION OF LINKS
In this section, we focus on approaches to experimentally validate links between a BGC and a secondary metabolite using synthetic biology techniques, such as genetic manipulation of BGCs. One of the most common approaches to validate the link between a BGC and its metabolite is heterologous expression, the experimental details of which are outside the scope of this review. The reader is referred to Huo et al. (2019) for a detailed description of heterologous expression of bacterial secondary metabolite pathways.
New methods such as CRISPR/Cas9-based editing (Tao et al. 2018), λ-red mediated recombination (Gust et al. 2003), overexpression of positive regulators (Bergmann et al. 2010) and promoter engineering (Myronovskyi and Luzhetskyy 2016) have recently been applied to GC-rich actinomycetes. For instance, Gomez-Escribano et al. engineered S. coelicolor M145 strains specifically for the heterologous expression of BGCs to simplify the metabolite profiles and eliminate antimicrobial activity. This was achieved by deleting the actinorhodin, prodiginine, CPK and CDA BGCs and adding point mutations into the rpoB and rpsL genes to increase the production of secondary metabolites. The point mutations in rpoB and rpsL increased the production of chloramphenicol and congocidine by 40- and 30-fold, respectively, therefore validating the BGC-metabolite link (Gomez-Escribano and Bibb 2011). In another example, genome mining was recently used to confirm the presence of a lasso peptide BGC in the genome of a marine Streptomyces sp. SCSIO ZS0098 strain that was known to produce the antimicrobial type I lasso peptide aborycin. In this study, strain engineering was used to validate this link through the heterologous expression of the candidate aborycin BGC in S. coelicolor M1152 (Shao et al. 2019).
Genetic manipulations can also be applied in combination with genome mining to induce metabolite production. For example, genome mining of a marine Streptomyces strain previously known to produce anthracenes and xiamycin A also revealed an ansamycin BGC. After removal of the anthracene and xiamycin A BGCs, the mutant strain was found to produce two new naphthoquinone macrolides, olimycin A and B (Maansson et al. 2016; Sun et al. 2018). A conserved set of five regulatory genes previously characterised by Sidda et al. was used as a query to search for and identify atypical BGCs in Streptomyces sclerotialus NRRL ISP-5269 (Sidda et al. 2014; Alberti et al. 2019). This approach identified an atypical scl BGC, which was transferred and heterologously expressed in Streptomyces albus, resulting in the production of scleric acid, a secondary metabolite with moderate activity against Mycobacterium tuberculosis and inhibitory activity on the cancer-associated enzyme NNMT (Alberti et al. 2019). An additional example of genome mining combined with heterologous expression and chemical analysis was a study of the fungal strains Arthrinium sp. NF2194 and Nectria sp. Z14-w, which resulted in the isolation of eight new meroterpenoids, two of which exhibited immunosuppressive bioactivity.

CURRENT CHALLENGES
While genome mining and the application of genomic techniques have hugely benefited the genome-led secondary metabolite discovery pipeline, there are still important challenges that need to be addressed in order to maximise the potential that these approaches offer. Arguably, the bottleneck in discovery is our narrow understanding of the total biosynthetic and chemical space of microbial secondary metabolites. While prediction pipelines like SMURF (Khaldi et al. 2010) and antiSMASH (Medema et al. 2011; Blin et al. 2017) greatly facilitate the characterisation of BGCs, our knowledge of secondary metabolites is impeded because up to 90% of BGCs are cryptic or silent (Abdelmohsen et al. 2015; Rutledge and Challis 2015; Baltz 2017; Machado, Tuttle and Jensen 2017). The lack of global transcriptome and translation data therefore makes it difficult to distinguish between BGCs that are transcriptionally silent and those that are actively transcribed but lack a link with their products (Jeong et al. 2016). A recent comparative transcriptomics study focused on four Salinispora strains to assess the effect of gene expression on the production of secondary metabolites. Only 13 of the 49 BGCs had previously been linked to their products, whereas the remainder were considered cryptic gene clusters. However, global transcriptome analyses at exponential and stationary phase revealed that more than half of the BGCs were in fact expressed (Amos et al. 2017). Further knowledge of protein translation will enable greater understanding of transcription levels and metabolite detection.
Another issue commonly encountered in secondary metabolite research is the lack of metabolite detection due to extraction constraints or the limitations of the analytical technique. For instance, studies have demonstrated the impact of extraction solvent on the detection of metabolites (Floros et al. 2016; Crüsemann et al. 2017). Although advances in instrument sensitivity (mass spectrometry and NMR spectroscopy) could arguably remedy problems of detection (Bouslimani et al. 2014), the use of limited experimental conditions can greatly impact the number and diversity of metabolites detected. These include, for example, culture conditions, media composition, growth stage and extraction solvent (Romano et al. 2018). If not taken into consideration, these variables could lead to the biosynthetic potential of the studied organism being underestimated, complicating the linking process further.
The high rediscovery rate of molecules is another setback commonly encountered in secondary metabolite research. The efficient prioritisation of strains and extracts using combined comparative genomic and metabolomic approaches has proven to be a useful strategy to avoid this. For example, using an integrated approach of combining metabolomic and genomic techniques, Ong and co-workers identified novel metabolites with anti-quorum sensing activity from five bacterial strains isolated from subtidal marine samples (Ong et al. 2019). Their work is a good example of the application of molecular networking-based dereplication in the discovery of secondary metabolites. Effective dereplication greatly relies on the availability of comprehensive, curated, chemical databases and several commercially available databases to this effect are already in place, including AntiBase (Laatsch 2017), Dictionary of Natural Products (Buckingham 1994) and MarinLit ('MarinLit'). However, data analysis using this approach is often complex and manual. The recent development of the Global Natural Products Social Molecular Networking (GNPS) platform represents a step-change that facilitates community-driven data curation, enabling open access analysis and sharing of MS/MS spectra (Wang et al. 2016). The expansion of such data sets will greatly facilitate our understanding of chemical space.

FUTURE DIRECTIONS AND CONCLUDING REMARKS
Currently, researchers are biased towards the study of putative BGCs that encode variants of already known compounds or biosynthetic pathways, consequently biasing discovery towards analogues of known natural products. Efforts to overcome this include a trend towards the creation of datasets built upon increasing numbers of strains. Recently, a dataset consisting of genomic and metabolomic data for 363 bacterial strains was published and it is likely that more datasets of increasing size will become available (Navarro-Muñoz et al. 2018). As they do, the performance of automated linking approaches will improve.
Data sets are also likely to increase in terms of the data modalities they cover. Already, large datasets connecting bioactivity with genomics exist. For example, a recently published study linked genomic data for 224 bacterial strains (found on the leaves of Arabidopsis plants) to bioactivity data for the same strains (Helfrich et al. 2018). As high-throughput bioactivity screening becomes standard (Pye et al. 2017), it is likely that data sets combining metabolomics, genomics and bioactivity will become available, in turn, motivating the development of new computational techniques capable of analysing them. Crypticity of BGCs will always be a challenge in this domain. Transcriptomic analysis can indicate activity of BGCs (Amos et al. 2017), and the coupling of transcriptomic data with metabolomics, genomics and bioactivity is now possible. This would be particularly powerful when coupled with data generated using the OSMAC approach.
In conclusion, as data sets increase in strain coverage and modalities, increasingly advanced bioinformatics tools are required for their analysis. We believe that modern computing techniques, such as machine learning and artificial intelligence, have a key role to play in elucidating the links between genomes, transcriptomes, metabolomes and phenotypes. Computational tools to date have largely focused on modular secondary metabolites (e.g. NRPS and RiPPs), reflecting the relatively repetitive nature of their biosynthesis. The creation and continued growth of ground-truth data sets such as MIBiG (Medema et al. 2015) provide the necessary infrastructure for the development of tools, based upon recent advances in machine learning, that are able to learn mappings between genomic information and molecular structure (as observed in mass spectrometry data). Current research is biased towards areas of the BGC space for which much is known about biosynthesis. Machine learning tools may be able to uncover patterns that help us illuminate larger, unknown areas of both the biosynthetic and chemical space.