Genome mining for drug discovery: cyclic lipopeptides related to daptomycin

Abstract The cyclic lipopeptide antibiotics structurally related to daptomycin were first reported in the 1950s. Several have common lipopeptide initiation, elongation, and termination mechanisms. Initiation requires the use of a fatty acyl-AMP ligase (FAAL), a free-standing acyl carrier protein (ACP), and a specialized condensation (CIII) domain on the first NRPS elongation module to couple the long chain fatty acid to the first amino acid. Termination is carried out by a dimodular NRPS that contains a terminal thioesterase (Te) domain (CAT-CATTe). Lipopeptide BGCs also encode ABC transporters, apparently for export and resistance. The use of this mechanism of initiation, elongation, and termination, coupled with molecular target-agnostic resistance, has provided a unique basis for robust natural and experimental combinatorial biosynthesis to generate a large variety of structurally related compounds, some with altered or different antibacterial mechanisms of action. The FAAL, ACP, and dimodular NRPS genes were used as molecular beacons to identify phylogenetically related BGCs by BLASTp analysis of finished and draft genome sequences. These and other molecular beacons have identified: (i) known, but previously unsequenced lipopeptide BGCs in draft genomes; (ii) a new daptomycin family BGC in a draft genome of Streptomyces sedi; and (iii) novel lipopeptide BGCs in the finished genome of Streptomyces ambofaciens and the draft genome of Streptomyces zhaozhouensis.

Cyclic lipopeptide antibiotics produced by actinomycetes were first discovered in the 1950s, and daptomycin was the first to be approved for treatment of Gram-positive infections, including methicillin-resistant Staphylococcus aureus (MRSA) (Baltz, 2009; Eisenstein et al., 2010). By the mid-2000s, NRPS BGCs encoding daptomycin (Dpt), A54145 (Lpt), and calcium-dependent antibiotic (CDA) had been cloned and sequenced (Hojati et al., 2002; Miao et al., 2005; Miao, Brost, et al., 2006). These lipopeptides have 10-membered ring structures with identical chirality (Hojati et al., 2002; Miao et al., 2005; Miao, Brost, et al., 2006; Gu et al., 2011), and all three NRPS multienzymes utilize phylogenetically related dimodular NRPS termination proteins (CAT-CATTe) that insert the final two amino acids, 3mGlu-Kyn (DptD), 3mGlu-Ile (LptD), or 3mGlu-Trp (CDA-PSIII), then cyclize and release the final products by thioesterase (Te) domains. Early combinatorial biosynthesis studies at Cubist Pharmaceuticals demonstrated that a dptD deletion mutant of Streptomyces roseosporus could be complemented by the lptD and CDA-PSIII genes from Streptomyces fradiae and Streptomyces coelicolor, respectively, to produce daptomycin analogs containing Ile or Trp in the terminal amino acid position. These findings were followed by a series of studies on combinatorial biosynthesis in S. roseosporus (Nguyen, Ritz, et al., 2006; Coëffet-Le Gal et al., 2006; Doekel et al., 2008) and S. fradiae (Nguyen et al., 2010; Alexander et al., 2010, 2011) that generated many active lipopeptide antibiotics related to daptomycin and A54145, including several with highly improved efficacy in a Streptococcus pneumoniae murine lung infection model, while maintaining the high antibacterial activity against multiple Gram-positive pathogens and low toxicity of daptomycin (Baltz, 2014b, 2014c).
The NRPS genes (dptA, dptBC, and dptD) in the daptomycin BGC are preceded by dptE and dptF, which encode a fatty acyl-AMP ligase (FAAL) and a free-standing acyl carrier protein (ACP) involved in initiation of lipopeptide assembly by coupling long chain fatty acids to the N-terminal Trp (Wittmann et al., 2008; Baltz, 2014b). Initiation of A54145 biosynthesis is carried out in a similar manner by an apparently fused FAAL-ACP encoded by lptEF (Miao, Brost, et al., 2006). However, recent genome mining studies indicate that the fragmented A54145 BGCs in draft genome assemblies of four other Streptomyces species encode free-standing FAALs and ACPs (Baltz, 2018; This report). CDA biosynthesis does not use a FAAL-ACP mechanism for initiation of lipopeptide assembly (Hojati et al., 2002).
Additional cyclic lipopeptide BGCs have been sequenced and annotated more recently, and several employ initiation and termination mechanisms similar to those utilized for daptomycin and A54145 assembly (Müller et al., 2007; Wang et al., 2011; Yamanaka et al., 2014; Fu et al., 2015; Johnston et al., 2016; Liu et al., 2016; Hover et al., 2018; Reynolds et al., 2018). This report explores the evolutionary relationships between these structurally diverse but evolutionarily related cyclic lipopeptides (Baltz, 2008b), particularly as they relate to biosynthetic features that can be exploited by recent advancements in synthetic biology, genome mining, and combinatorial biosynthesis for drug discovery (Baltz, 2018; Katz et al., 2018).

Strains and Lipopeptide BGC Sequencing Status
The DNA sequencing status of select actinomycete strains and uncultured bacteria, and their lipopeptide BGCs are summarized in Table 1.

Searches for Cryptic Lipopeptide BGCs
Initial BLASTp searches of genome sequences in NCBI were carried out using various molecular beacons (Supplementary Table S1), including genes from the daptomycin BGC, and homologs from other lipopeptide producers. Putative lipopeptide producers were also surveyed for the presence of MbtH homologs related to those of known lipopeptide producers by BLASTp analysis with a 24-mer MbtH multiprobe (Baltz, 2014a, 2017a). Other BLASTp searches were carried out with pathway-specific genes from other lipopeptide BGCs to help distinguish between known and novel lipopeptide BGCs. Putative lipopeptide BGCs from finished genomes were analysed by antiSMASH 4.0 or 5.0 (Blin et al., 2017, 2019).
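The beacon-based triage described above can be illustrated as a post-processing step on BLASTp results. The following is a minimal sketch, not the procedure used in this report: it assumes hits were saved in the standard 12-column BLAST tabular format (qseqid, sseqid, pident, ...), that each subject identifier is prefixed with a genome label followed by "|", and that a 40% identity cutoff is applied. The genome labels, protein names, hit values, and the cutoff are all hypothetical.

```python
from collections import defaultdict

def genomes_with_all_beacons(blast_lines, beacons, min_identity=40.0):
    """Return genomes that have at least one hit for every beacon query.

    blast_lines: iterable of tab-separated BLAST tabular lines.
    beacons: query identifiers that must all be matched (e.g. FAAL, ACP, DptD).
    min_identity: hypothetical percent-identity cutoff (an assumption).
    """
    found = defaultdict(set)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if query in beacons and pident >= min_identity:
            genome = subject.split("|", 1)[0]  # assumed "genome|protein" naming
            found[genome].add(query)
    return sorted(g for g, hits in found.items() if hits == set(beacons))

# Hypothetical hits: genomeA matches all three beacons, genomeB lacks the ACP.
example_hits = [
    "DptE\tgenomeA|p1\t62.0\t580\t220\t3\t1\t580\t1\t578\t1e-150\t600",
    "DptF\tgenomeA|p2\t48.5\t88\t45\t0\t1\t88\t1\t88\t1e-20\t90",
    "DptD\tgenomeA|p3\t57.1\t2300\t980\t9\t1\t2300\t1\t2295\t0.0\t2500",
    "DptE\tgenomeB|p7\t51.0\t575\t280\t4\t1\t575\t1\t573\t1e-120\t480",
    "DptD\tgenomeB|p9\t35.0\t2200\t1400\t20\t1\t2200\t1\t2190\t1e-90\t900",
]
candidates = genomes_with_all_beacons(example_hits, {"DptE", "DptF", "DptD"})
print(candidates)
```

Requiring co-occurrence of several beacons, rather than any single hit, mirrors the idea of using the FAAL, ACP, and dimodular NRPS genes together to flag candidate lipopeptide BGCs.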

Cyclic Lipopeptide BGCs for Comparative Analysis
The structures of daptomycin and A54145 are shown as examples of cyclic lipopeptide antibiotics in Fig. 1. For comparative analysis, key elements of lipopeptide assembly machines for daptomycin, taromycin, A54145, friulimicin, laspartomycin/glycinocin, malacidin, and telomycin are shown in Fig. 2a and b. The key elements include molecular parts and devices for initiation, elongation, and termination/release of the finished lipopeptides. Other conserved accessory devices include ABC transporters for export and resistance, and MbtH-like chaperones. The status of assembly of the genomes and BGCs encoding these lipopeptides is presented in Table 1, and background information on each molecule is provided below.
Taromycin is closely related to daptomycin, and differs in the tridecapeptide by a single amino acid substitution (d-Ala11 for d-Ser11), by chlorination of l-Trp1 and l-Kyn13, and it has a C8 fatty acid side chain unsaturated in two positions (Reynolds et al., 2018; Yamanaka et al., 2014). Its Ca2+-binding tetrapeptide is identical to that of daptomycin (Fig. 2b). A cryptic taromycin-like BGC, but lacking the tryptophan chlorinase gene involved in chlorination of Trp and Kyn in taromycin, is encoded by Saccharomonospora viridis DSM 43017, a causative agent of Farmer's Lung Disease (Pati et al., 2009; Baltz, 2010b, 2018).

Laspartomycin
Laspartomycin is a 10-membered Ca2+-dependent cyclic lipopeptide antibiotic produced by Streptomyces viridochromogenes ATCC 29814. It was first described in the 1950s, and its BGC has been sequenced (Wang et al., 2011). It is a member of the glycinocin family. It has a single exocyclic amino acid coupled to a mono-unsaturated long chain fatty acid. The laspartomycin cyclic peptide backbone has the same chirality as daptomycin, taromycin, A54145, and CDA, and has a canonical Ca2+-binding tetrapeptide, Asp-Gly-Asp-Gly (Fig. 2b).

Telomycin
Telomycin was first described in the 1950s by scientists at Bristol-Myers (Misiek et al., 1957-1958). It is a 9-membered cyclic depsipeptide with a two amino acid exocyclic tail lacking a lipid side chain. Recent studies indicate that telomycin biosynthesis initiates with the coupling of a long chain fatty acid to the first amino acid and that the lipid is removed after the cyclic lipopeptide is released from the NRPS multienzyme (Fu et al., 2015). Two strains of Streptomyces canus were deposited by Bristol-Myers to support patent applications, and the telomycin BGCs have been sequenced from both strains (Fu et al., 2015; Johnston et al., 2016; Liu et al., 2016). The telomycin BGCs were chosen for inclusion in this analysis because they encode homologs to DptE, DptF, and DptD for initiation and termination of assembly, but the final cyclic peptide has no apparent Ca2+-binding tetrapeptide (Fig. 2b).

Malacidin
Malacidin is a cyclic lipopeptide antibiotic recently discovered from an uncultured bacterium (Hover et al., 2018). Malacidin has an 8-membered amino acid heterocycle and a two amino acid exocyclic tail coupled to a di-unsaturated fatty acid. Its lipopeptide assembly apparatus, including the use of DptE, DptF, and DptD homologs, is similar to those of friulimicin and laspartomycin (Fig. 2a and b), but it lacks a canonical Ca2+-binding tetrapeptide. Nonetheless, it requires high levels of Ca2+ for antibacterial activity (Hover et al., 2018).

Components for Cyclic Lipopeptide Assembly
From a synthetic biology perspective, cyclic lipopeptide antibiotic assembly requires a number of parts and devices to build the assembly machines in microbial host chassis. In addition, lipopeptide biosynthesis often requires the coordinated acquisition of accessory devices for lipid or amino acid modifications, activation of ACPs and peptidyl carrier proteins (PCPs or T domains) by phosphopantetheinyl transferases (PPTases), MbtH chaperone function, host resistance, and transport. Typical Ca2+-dependent cyclic lipopeptides require an additional tetrapeptide device within the peptide ring to bind Ca2+ ions. Therefore, key components for lipopeptide assembly include: (i) fatty acid to amino acid coupling devices; (ii) multiple types of amino acid to amino acid coupling devices; (iii) parts to set chirality; (iv) devices to impart Ca2+-binding; (v) devices to cyclize and release lipopeptides from the giant multi-modular, multi-subunit NRPS assembly machines; and (vi) multiple accessory devices to facilitate the process. As the individual lipopeptide assembly functions are modular, they lend themselves to combinatorial evolutionary processes that can be accelerated by many orders of magnitude in the laboratory by combinatorial biosynthesis (Baltz, 2014b). In the following sections, I discuss evolutionary relationships that can be deduced from the analysis of the BGCs from structurally diverse, but evolutionarily related lipopeptide antibiotics produced by actinomycetes or uncultured bacteria.

Activation and Coupling of Fatty Acids to Amino Acids (Initiation)
At the front end of lipopeptide assembly is the attachment of a long chain-length fatty acid to the first amino acid to initiate assembly. The evolution of this process was undoubtedly a key element in the evolution of lipopeptide assembly machines. Bioinformatic analysis of the daptomycin BGC identified three NRPS genes, dptA, dptBC, and dptD. Just upstream of dptA are dptE and dptF, which were initially annotated as acyl-CoA ligase and free-standing ACP, respectively. Subsequent biochemical studies (Wittmann et al., 2008) showed that DptE has two activities that do not involve acyl-CoA intermediates. DptE activates certain long chain fatty acids with ATP to form fatty acyl-AMP intermediates; the fatty acids are then transferred to a holo-ACP (DptF) for subsequent coupling to l-Trp1 by a specialized CIII condensation domain of the first module of DptA. DptE therefore has two activities, FAAL and acyl-ACP synthetase (AAS) (Wittmann et al., 2008). For simplicity, I refer to this type of enzyme as FAAL, as it is typically annotated in NCBI. Mechanistic studies also showed that DptE requires DptF for FAAL activity (Wittmann et al., 2008). The dptE and dptF genes are transcribed along with the dptABCD genes as a single long transcript from a promoter upstream of dptE (Coëffet-Le Gal et al., 2006). This FAAL mechanism to activate long chain fatty acids by DptE is similar to the FAAL mechanism (FadD32) involved in mycolic acid biosynthesis in Mycobacterium tuberculosis (Kuhn et al., 2016). It differs in that the activated long chain fatty acid is transferred to an ACP in a small PKS (ACP-KS-AT-Te) in M. tuberculosis. DptE and FadD32 show 34% sequence identity in BLASTp analysis, indicating that they are distantly related evolutionarily.
The FAAL:ACP:CIII mechanism to initiate lipopeptide assembly is observed in other cyclic lipopeptide BGCs (Fig. 2a), including A54145, taromycin, friulimicin, laspartomycin/glycinocin, malacidin, telomycin, and enduracidin (not shown).
The FAAL:ACP:CIII mechanism utilizes free-standing holo-ACPs activated by PPTases (Wittmann et al., 2008). These ACPs differ mechanistically from the holo-ACPs and holo-PCPs involved in PKS and NRPS chain elongation in that they interact with upstream FAAL enzymes and downstream specialized CIII domains. As such, they might be more closely related phylogenetically to each other than to the more common ACPs and PCPs embedded in PKS or NRPS enzymes. However, they process fatty acids of different chain length and degree of saturation, and couple the fatty acids to different amino acids, so we might expect divergent ACP amino acid sequence relationships based on substrate and coupling partner preferences. The same might hold true for FAAL enzymes that must select fatty acids of correct chain length and degree of unsaturation from primary metabolic FA pools to bind and activate. Table 2 shows the amino acid sequence similarities between FAAL enzymes involved in lipopeptide assembly. Table 2 includes compounds validated chemically and others from BGCs identified bioinformatically. The four apparent LptE orthologs from Streptomyces exfoliatus, Streptomyces griseoluteus, Streptomyces pini, and Streptomyces barkulensis showed 79-90% amino acid sequence identities to the LptE domain of LptEF from S. fradiae, which otherwise showed only 46-52% sequence identities to FAAL sequences from other lipopeptide producers. The two LipA FAALs from friulimicin producers share 92% sequence identity, and LipA from A. friuliensis showed the highest sequence identities to paralogous FAALs encoded by strains harboring laspartomycin/glycinocin BGCs (59-62%). The four FAALs from strains identified bioinformatically to encode laspartomycin/glycinocin BGCs showed 73-78% sequence identities to LipA from the laspartomycin BGC in S. viridochromogenes, suggesting that they may be orthologs. The strains harboring telomycin BGCs encode FAALs sharing 75-99% sequence identities with Tem18 from S. canus, and are not closely related to any others. MlcG from the uncultured malacidin producer shows 95% sequence identity with a FAAL from a predicted malacidin BGC from another uncultured bacterium, but only 45-55% sequence identities with other FAALs. A particularly interesting pair of FAALs are those from the daptomycin and taromycin BGCs. Even though these BGCs encode very similar tridecapeptides, DptE shows only 46% sequence identity to Tar4. Curiously, DptE shows higher sequence identities to LptE orthologs encoded by five different A54145 BGCs (51-52%). These data likely reflect the large divergence in fatty acid preferences displayed by daptomycin and taromycin (Fig. 2b).

Free-standing ACPs
ACPs and PCPs (T domains) are very important in the assembly of polyketide (PK) and nonribosomal peptide (NRP) secondary metabolites by PKS-I and NRPS multienzymes. In these cases, they are embedded in multimodular, multisubunit megaenzymes (Weissman, 2015; Marahiel, 2016; Süssmuth & Mainz, 2017; McErlean et al., 2019). The stand-alone ACPs involved in coupling fatty acids to amino acids in lipopeptide assembly present a striking contrast. Typical ACPs and PCPs have multiple protein-protein interactions in PK and NRP assembly, but differ in specificity from the stand-alone ACPs involved in lipopeptide assembly. The latter interact with FAAL enzymes and specialized CIII domains (Miao, Brost, et al., 2006; Baltz, 2014b) involved in initiation of lipopeptide assembly. As such, they show little amino acid conservation with typical ACP and PCP domains in PKS-I and NRPS BGCs. This aspect of stand-alone ACPs, coupled with their small sizes (∼90 amino acids), makes them attractive molecular beacons to help identify known, related, and novel lipopeptide BGCs in finished and draft genomes. Table 3 shows the results of BLASTp analyses of different actinomycetes with DptF (ACP) homologs from seven lipopeptide BGCs that use both FAAL:ACP:CIII initiation and CAT-CATTe dimodular termination mechanisms. It is apparent that the ACP proteins are much more divergent than the FAAL enzymes (Table 2) and the other proteins involved in lipopeptide assembly discussed below. This high level of amino acid sequence divergence has been exploited by generating an ACP multiprobe that can be used to help identify known, related, and novel lipopeptide BGCs in finished and draft genomes. Table 4 shows results from ACP multiprobe analyses of the free-standing ACPs from the 19 lipopeptide BGCs that make up the multiprobe, and six others.
The multiprobe is a concatenation including (sequentially) ACPs from the following BGCs: one from daptomycin, two from taromycins, five from A54145s, two from friulimicins, three from laspartomycin/glycinocins, two from malacidins, and four from telomycins. The degree of sequence similarity is reflected in the numerical code, from the highest (4) to the lowest (1) (see Materials and Methods). The first three ACP codes considered are those of highly related daptomycin, taromycin, and a taromycin-like cryptic lipopeptide encoded by S. viridis. Daptomycin has the simplest code: 4-33-33333-33-333-33-3333. In contrast, the code for the taromycin ACP differs from that of daptomycin at 15 positions. This may be due to the relatively short, di-unsaturated lipid starter processed by the taromycin FAAL:ACP (C8:Δ2,4) versus the branched C12-13 lipids preferred by the daptomycin FAAL:ACP. Both couple fatty acids to l-Trp1 of these highly related tridecapeptides (Fig. 2b). The ACP code from the taromycin-like cryptic BGC from S. viridis differs from that of authentic taromycin at 12 positions, but differs from the daptomycin code at only 7 positions. This divergence pattern suggests that the cryptic BGC from S. viridis may encode initiation with a longer chain length, di-unsaturated fatty acid (see below). This could be tested by expressing the cryptic BGC in a Streptomyces expression host (Baltz, ...).
The ACP multiprobe codes for the A54145 BGC from S. fradiae and cryptic A54145 BGCs from S. exfoliatus, S. griseoluteus, S. pini, and S. barkulensis are closely related, but show some variation at 5 positions (Table 4). All of the variation resides in positions 2-3 (taromycins) and 16-19 (telomycins). The two friulimicin ACP codes differ from each other in positions 16 and 17 (telomycin). Other code differences within otherwise highly related BGCs may reflect differences in fatty acid chain length specificities (e.g., S. sp. 1331.2 and S. formicae KY5). These ACP codes are useful in identifying known, related, and novel lipopeptide BGCs (see below).
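Counting the positions at which two multiprobe codes differ, as done above for the daptomycin, taromycin, and S. viridis ACPs, can be sketched in a few lines. This is only an illustration: the daptomycin code is the one quoted in the text, while the second code is a made-up example and not a value from Table 4. Hyphens and spaces are treated as group separators and ignored.

```python
def parse_code(code):
    """Return the per-probe scores of a multiprobe code, ignoring separators."""
    return [int(ch) for ch in code if ch.isdigit()]

def differing_positions(code_a, code_b):
    """Return the 1-based probe positions at which two codes differ."""
    a, b = parse_code(code_a), parse_code(code_b)
    if len(a) != len(b):
        raise ValueError("codes must cover the same number of probes")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]

daptomycin_code = "4-33-33333-33-333-33-3333"    # quoted in the text (19 probes)
hypothetical_code = "3-44-33333-33-333-33-2233"  # invented for illustration only

diff = differing_positions(daptomycin_code, hypothetical_code)
print(len(diff), diff)
```

Because each position corresponds to one ACP in the concatenated multiprobe, the list of differing positions also indicates which BGC families account for the divergence (here, positions 1-3 fall in the daptomycin and taromycin probes and positions 16-17 in the telomycin probes).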

Fatty acid dehydrogenations
The cyclic lipopeptides have fatty acids ranging from C8 to C15 chain lengths. Some are saturated (e.g., daptomycin and A54145), and others have one or two double bonds. Laspartomycin has C15:Δ2; friulimicin C13-15:Δcis3; taromycin C8:Δ2,4; and malacidin C10-11:Δ2,4. The fatty acid chain length and degree of unsaturation can influence the biological activities of cyclic lipopeptide antibiotics, and are thus important targets for combinatorial biosynthesis as well as chemical semi-synthesis (Baltz, 2014b, 2014d). FAAL and ACP genes are contiguous and just upstream of the first NRPS genes in the daptomycin, A54145, and telomycin BGCs (Fig. 2). They are displaced by one or two genes (depicted as a space between FAAL and ACP genes in Fig. 2) in BGCs encoding lipopeptides with lipid side chains containing one or two double bonds. These intervening genes are annotated as members of the acyl-CoA dehydrogenase (ACAD) family, and they encode enzymes that insert double bonds into the lipid starter units (Heinzelmann et al., 2005). Since there are no acyl-CoA intermediates involved in the FAAL:ACP:CIII lipopeptide initiation mechanism (Wittmann et al., 2008), it seems likely that these enzymes act on fatty acids bound to FAALs or to holo-ACPs, but no mechanistic studies have been reported for these enzymes. Table 5 shows BLASTp analyses of the enzymes responsible for catalyzing the fatty acid dehydrogenations. LipB was demonstrated to carry out the Δcis3 dehydrogenation in friulimicin biosynthesis by gene disruption analysis (Heinzelmann et al., 2005). LipB has 60% sequence identity to the laspartomycin homolog Orf22 (Wang et al., 2011) that inserts the Δ2 double bond. Both LipB and Orf22 share higher sequence identities with the Tar5 and MlcH enzymes from the taromycin and malacidin pathways than with Tar6 and MlcI, the second fatty acid dehydrogenases encoded in the taromycin and malacidin BGCs (Yamanaka et al., 2014; Hover et al., 2018; Reynolds et al., 2018). Therefore, Tar5 and MlcH likely insert the Δ2 double bonds, and Tar6 and MlcI likely insert the Δ4 double bonds. Tar5 and Tar6 homologs are also encoded by the cryptic taromycin-like BGC in S. viridis (Table 5) (Baltz, 2010b).
Tar5 and Tar6 paralogs share 32% sequence identities, and MlcH and MlcI share 30%. Even though Tar5 and MlcH, and Tar6 and MlcI appear to have similar functions, they have diverged substantially, presumably to accommodate different fatty acid chain length preferences, and the associated divergences in FAAL and ACP amino acid sequences (Tables 2-4). These proteins can be used in conjunction with other molecular beacons to analyse lipopeptide BGCs for similarities and novelty (see below).

Dimodular termination devices
A second important biosynthetic device for lipopeptide assembly is the dimodular NRPS with a CAT-CATTe organization for termination and release of completed lipopeptides (Fig. 2a). When combinatorial biosynthetic studies were initiated at Cubist Pharmaceuticals in the early 2000s, only three cyclic lipopeptide BGC sequences were available, those for daptomycin, A54145 (Miao, Brost, et al., 2006), and CDA (Hojati et al., 2002). These BGCs were chosen because they appeared to be evolutionarily related, as witnessed by conserved amino acid chirality in the ten-membered rings and by the conservation of dimodular NRPS genes that inserted 3mGlu-Kyn, 3mGlu-Ile, and 3mGlu-Trp, respectively. All three NRPS genes also encoded terminal Te domains. This type of NRPS dimodule is also used for biosynthesis of friulimicin/laspartomycin type lipopeptides (Fig. 2) (Müller et al., 2007; Wang et al., 2011). With recent publications of telomycin (Fu et al., 2015; Johnston et al., 2016; Liu et al., 2016) and malacidin BGC sequences (Owen et al., 2013; Hover et al., 2018), it is now apparent that the dimodular CAT-CATTe mechanism for termination of lipopeptide assembly is used among these structurally diverse lipopeptide BGCs, and likely derives from very ancient origins (Baltz, 2010b). Table 6 shows the sequence identities among the eight DptD homolog types. The taromycin, A54145, CDA, friulimicin, telomycin, and malacidin clades show at least 80% sequence identities to likely orthologs within the clades, and less than 60% identities to paralogs in other clades. The apparent laspartomycin/glycinocin clade shows over 70% sequence identities within the clade. Interestingly, DptD shows less than 60% sequence identity with the two members of the taromycin clade, in spite of the fact that they all have CAT-CATTe dimodules that insert 3mGlu12-Kyn13 or 3mGlu12-4clKyn13 into nearly identical tridecapeptides (Fig. 2b). This divergence may be due to the substantial differences in fatty acid structures that need to be accommodated during thioesterase cyclization and release (termination).

Amino acid binding pocket analysis of dimodular termination devices
The amino acid binding pockets in A domains determine which amino acids are bound, activated, and incorporated during peptide assembly (Challis et al., 2000; Stachelhaus et al., 1999). Table 7 shows that phylogenetic relationships between amino acid binding pockets in CAT-CATTe dimodules can be used to help distinguish between known, related, and novel lipopeptides (see below). Daptomycin, A54145, and CDA have 3mGlu incorporated at position one of the NRPS termination dimodules, and Kyn, Ile/Val, and Trp at position two. There are three related binding codes for 3mGlu: DLGKTGVINK for daptomycin; DLGKTGVVNK for two taromycins and five A54145s; and DQGKTGVGHK for four CDAs. The daptomycin DptD module-1 pocket differs from those of two taromycins and five A54145s by a single conservative change at position 8 (I for V). The CDA 3mGlu pocket differs from those of daptomycin, taromycin, and A54145 at positions 2, 8, and 9. For module-2, Kyn has two pocket codes: DAWTTTGVGK for daptomycin and cryptic taromycin from S. viridis; and DAWTTTGVAK for taromycin from Saccharomonospora sp. CNQ490. These differ by a conservative substitution at position 9 (G or A). All A54145 module-2 pockets for Ile/Val are identical (DGLFVGIAVK), as are all CDA module-2 pockets for Trp (DGWAVASVCK). The friulimicin, laspartomycin/glycinocin, and malacidin lipopeptide families have termination dimodules that insert Val-Pro (Fig. 2). Among them, they use four different, but somewhat related amino acid binding codes for insertion of Val, and four different, but related codes for Pro. The telomycin termination dimodule inserts Ile-Pro. The Ile binding code is identical for four Tem22 orthologs, but is substantially different from the Val modules of friulimicins, laspartomycin/glycinocins, and malacidin, and the Ile/Val modules of A54145s. The amino acid binding codes for these lipopeptide dimodules establish a baseline to help triage and characterize
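The position-by-position comparison of 10-residue binding pocket codes used in this section can be reproduced with a short helper. This is a minimal sketch for illustration; the function name is hypothetical, and only pocket codes quoted in the text are used as inputs.

```python
def pocket_differences(code_a, code_b):
    """Return (position, residue_a, residue_b) for each mismatch (1-based)."""
    if len(code_a) != len(code_b):
        raise ValueError("pocket codes must have the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(code_a, code_b))
        if a != b
    ]

# 3mGlu pockets: daptomycin versus the taromycin/A54145 code.
glu_diff = pocket_differences("DLGKTGVINK", "DLGKTGVVNK")
# Kyn pockets: daptomycin/S. viridis versus Saccharomonospora sp. CNQ490.
kyn_diff = pocket_differences("DAWTTTGVGK", "DAWTTTGVAK")

print(glu_diff)  # [(8, 'I', 'V')]
print(kyn_diff)  # [(9, 'G', 'A')]
```

The same comparison could be applied to pocket codes extracted from a candidate BGC to judge whether its termination dimodule matches a known family or appears novel.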

Activation and Sequential Coupling of Amino Acids to Amino Acids (Elongation)
Sandwiched between the initiation and termination devices for lipopeptide assembly are the elongation devices of variable composition. These are NRPS proteins that generally contain multiple modules to catalyze sequential amino acid couplings. The elongation process provides a fertile evolutionary "workshop" to test different combinations of amino acids and peptide lengths with varying chirality for activities that impart survival advantages for the producing microorganisms. The evolutionary changes in primary amino acid sequence can be coupled with modifications of fatty acid chain length and degree of unsaturation, and amino acid modifications as discussed below. Some of these NRPS multimodular proteins can be used to confirm known BGC types, and to identify new and novel BGCs in finished genomes (see below). Because of their generally large sizes with repetitious functional domains, they are often misassembled in draft genomes (Baltz, 2017b; Baltz, 2019; Goldstein et al., 2019; Klassen & Currie, 2012), and are generally not suitable for use as primary molecular beacons.

Amino Acid Modifications
During the course of lipopeptide pathway evolution, changes in amino acid composition have occurred. In some cases, these include amino acid modifications. Aside from the many examples of the use of d-amino acids, additional examples are the inclusion of 3mGlu in daptomycin, taromycin, A54145, and CDA; hAsn in A54145 and CDA; moAsp in A54145; hAsp in malacidin; mGly (Sar) in A54145; mTrp in telomycin; and mAsp in malacidin, friulimicin, amphomycin, and parvuline (Fu et al., 2015; Johnston et al., 2016; Hover et al., 2018) (Fig. 2). These are important for subtle alterations in biological activity and can be manipulated for combinatorial biosynthesis by simple gene deletions, as demonstrated in combinatorial manipulations of the A54145 pathway to generate highly active antibiotics with improved properties (Nguyen et al., 2010; Alexander et al., 2011; Baltz, 2014b). Several of these amino acid modifying enzymes have been used as molecular beacons for genome mining (Supplementary Table S1), and can be used to help triage known and novel BGCs (Baltz, 2010b, 2018) (see below).

MbtH Chaperones
Many NRPS-based BGCs include mbtH homologs that encode small nonenzymatic chaperones that enhance certain adenylation reactions (Baltz, 2011; Baltz, 2014a). MbtH homologs have diverged substantially in different NRPS BGCs, and can be considered as orthologs, paralogs, or "ortho-paralogs," proteins with similar functions but different protein-protein interactions (Baltz, 2018). Among the MbtH homologs, a high degree of sequence similarity is observed within BGCs encoding similar products. The degree of MbtH divergence can also be assessed by BLASTp analysis with a 24-mer multiprobe consisting of the most conserved 60 amino acid segments from 24 diverse MbtH homologs (Baltz, 2014a). Supplementary Table S4 shows the MbtH multiprobe codes for lipopeptide BGCs. The 24-digit numerical codes are highly related, with a consensus of 332-333-322-333-322-222-223-312. Of the 16 MbtH homologs analysed, five have consensus sequences (one taromycin, two CDAs, one laspartomycin, and one telomycin), 7 deviate at 1 position, 3 at 2 positions, and 1 at 4 positions, giving an average deviation from consensus of 4.4%. Supplementary Table S5 shows the MbtH consensus codes for six diverse NRPS BGC families. The bleomycin family includes tallysomycin and zorbamycin BGCs, and the five producers span four actinomycete genera; the five griseobactin MbtH homologs are encoded by Streptomyces sp.; the two nikkomycin BGCs are encoded by Streptomyces sp.; the two nocardicin A BGCs are encoded by Nocardia uniformis and Actinosynnema mirum; and the pacidamycin family includes napsamycin and sansanmycin BGCs from Streptomyces sp. (Baltz, 2014a). The deviations from consensus within families range from 1.7% for griseobactin to 6.9% for the bleomycin family (Supplementary Table S5). The consensus MbtH codes for the five diverse BGC families deviate from that of the lipopeptide family by ∼70-95%. The exception is the glycopeptide consensus code, which differs by only 8%.
These data suggest that the MbtH-like chaperone function has been conserved in the lipopeptide family of BGCs, and indicate that BLASTp analysis with individual MbtH homologs and the MbtH multiprobe can be used in conjunction with other molecular beacons to identify genomes encoding known, related, and novel lipopeptides. The MbtH multiprobe can also triage other more distantly related NRPS BGCs, and help identify novel BGCs encoded in finished or draft actinomycete genomes (Baltz, 2014a).
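The average deviation from consensus quoted above for the lipopeptide MbtH codes follows directly from the counts given in the text. A short worked calculation, using only those counts (16 homologs, 24 code positions each), is shown here as a check:

```python
# Number of MbtH homologs observed at each count of deviating positions,
# as stated in the text: 5 match consensus, 7 differ at 1 position,
# 3 at 2 positions, and 1 at 4 positions.
homologs_by_deviation = {0: 5, 1: 7, 2: 3, 4: 1}
code_length = 24  # digits in each multiprobe code

n_homologs = sum(homologs_by_deviation.values())
deviating_positions = sum(k * n for k, n in homologs_by_deviation.items())
average_deviation_pct = 100 * deviating_positions / (n_homologs * code_length)

print(n_homologs, deviating_positions, round(average_deviation_pct, 1))
# prints: 16 17 4.4
```

That is, 17 deviating positions out of 384 total positions gives approximately 4.4%.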

Resistance and Transport
Incremental resistance to daptomycin in low G+C pathogenic Gram-positive bacteria is mediated by mutations in a number of genes that result in alterations in cell membrane charge or cell wall thickness (Baltz, 2009, 2014d). None of the genes involved in these mechanisms are observed in lipopeptide BGCs. The high G+C Gram-positive actinomycetes tend to be intrinsically resistant to daptomycin by expressing hydrolases that cleave the fatty acid tail or the depsipeptide bond, and many show secondary peptide bond cleavages (D'Costa et al., 2006, 2012; Baltz, 2014d). These mechanisms often result in MICs > 256 μg/ml. Genes encoding these hydrolase mechanisms are not observed in lipopeptide BGCs. Instead, it is likely that resistance to lipopeptide antibiotics in the producing microorganisms is mediated by transport mechanisms, as accumulation of high intracellular lipopeptide concentrations could be highly toxic.
The DptM, DptN, and DptP proteins showed >70% sequence identities within individual clades. DptM and DptN homologs from the taromycin, A54145, friulimicin, laspartomycin, and malacidin pathways showed >50% sequence identities in pairwise BLASTp analyses, but DptMN homologs showed only 29-34% sequence identities (DptM) or no detectable sequence identity (DptN) with those of the CDA and telomycin pathways. The DptM and DptN ABC transporter counterparts from the CDA and telomycin pathways showed 65-66% (DptM-like) and 41-42% (DptN-like) identities with each other. This suggests that two distinct lines of evolution have contributed to the ABC transporters for lipopeptide antibiotics. DptP is interesting in that it is found only in daptomycin/taromycin and A54145 BGCs, and that DptP from the daptomycin BGC shows 91% sequence identity with LptP from the A54145 pathway, suggesting possible horizontal gene transfer. When dptP was integrated into the chromosome of S. ambofaciens, an unusual streptomycete that is susceptible to daptomycin, the recombinant strain became resistant to daptomycin (Baltz, 2008b), suggesting that DptP may normally interact with DptMN to export daptomycin, and may interact with close homologs of DptMN in S. ambofaciens to express the DapR phenotype (see below).
From an evolutionary perspective, the general use of a molecular target-agnostic ABC transporter mechanism for resistance and export of lipopeptide antibiotics facilitates natural combinatorial biosynthesis of molecules with different target specificities and mechanisms of action (MOA), as is the case for cyclic lipopeptide antibiotics (Baltz, 2009; Johnston et al., 2016; Hover et al., 2018). This aspect of lipopeptide BGCs should facilitate successful combinatorial biosynthesis of compounds with improved antibacterial properties and toxicity profiles, and possible beneficial changes in MOA, without jeopardizing the viability of the recombinants producing the new molecules. This may help explain the high success rates of producing novel lipopeptides by combinatorial biosynthesis at Cubist Pharmaceuticals (Baltz, 2014b, 2014c, 2014d). So far, little is known about the substrate specificities and possible cross-resistance patterns expressed by lipopeptide ABC transporters.

Use of Molecular Beacons for Genome Mining
Over 2,300 genome sequences from filamentous actinomycetes were publicly available on the NCBI website (https://www.ncbi.nlm.nih.gov/genome) in July of 2020. A large majority (∼90%) are in draft form, which is problematic for the identification of complete lipopeptide BGCs because of frequent misassembly of multimodular NRPS genes (Klassen & Currie, 2012; Baltz, 2017b, 2019; Goldstein et al., 2019). To accommodate productive genome mining of both finished and draft genomes, two strategies can be implemented. Gifted microorganisms (Baltz, 2014a, 2017a, 2017b) can be surveyed for the presence of genes (molecular beacons) directed at conserved functions by BLASTp to identify strains encoding targeted SM-BGC classes. Supplementary Table S1 summarizes examples of molecular beacons for lipopeptide antibiotics, some of which have been used previously, and others exemplified here. For finished genomes, molecular beacon analysis can be followed directly by antiSMASH (5.0 at the time of this writing) to identify draft lipopeptide BGCs, and to predict the NRPS subunits, amino acid binding specificities, gene composition, and BGC organization. For unfinished genomes, the most promising microorganisms can be identified by molecular beacon analysis and sequenced to completion, and the targeted BGCs can then be analysed by antiSMASH.
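As a rough illustration of the triage step described above, the following Python sketch selects genomes that have a BLASTp hit to every initiation and termination beacon. It assumes the BLASTp results have already been reduced to (genome, beacon, percent identity) rows; the genome names, identity values, and the 40% cutoff are hypothetical placeholders and are not taken from this study.

```python
from collections import defaultdict

# Beacons named in the text for initiation (FAAL, ACP) and termination (CAT-CATTe).
REQUIRED_BEACONS = ("FAAL", "ACP", "CAT-CATTe")


def best_hits(rows):
    """Keep the highest percent identity per (genome, beacon) pair."""
    best = defaultdict(dict)
    for genome, beacon, pident in rows:
        if pident > best[genome].get(beacon, 0.0):
            best[genome][beacon] = pident
    return best


def triage(rows, min_identity=40.0):
    """Return genomes whose best hit to every required beacon meets the cutoff.

    min_identity is a placeholder, not a threshold taken from the paper.
    """
    selected = []
    for genome, hits in best_hits(rows).items():
        if all(hits.get(b, 0.0) >= min_identity for b in REQUIRED_BEACONS):
            selected.append(genome)
    return sorted(selected)


# Hypothetical, made-up rows for illustration only.
example_rows = [
    ("genome_A", "FAAL", 60.0),
    ("genome_A", "ACP", 55.0),
    ("genome_A", "CAT-CATTe", 70.0),
    ("genome_B", "FAAL", 48.0),
    ("genome_B", "ACP", 31.0),
    ("genome_C", "CAT-CATTe", 52.0),
]
print(triage(example_rows))  # only genome_A has all three beacons above the cutoff
```

Genomes that pass such a filter would then go to antiSMASH directly if finished, or be prioritized for finished sequencing if in draft form.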

Identification of common lipopeptide BGCs
Many cyclic acidic lipopeptide antibiotics were reported in the 1950s and 1960s, so it is likely that their BGCs will dominate the molecular beacon analyses because of their relatively high natural abundance. Examples include amphomycin, aspartamycin, laspartomycin/glycinocin, friulimicin, parvuline, tsushimycin, zaomycin, glutamycin, and related lipopeptides. Of these, the BGCs of only two (friulimicin and laspartomycin) have been reported (Müller et al., 2007; Wang et al., 2011). Friulimicin and laspartomycin have two fundamental biosynthetic features in common with daptomycin, taromycin, A54145, telomycin, and malacidin: FAAL:ACP for initiation, and CAT-CATTe di-modules for termination (Fig. 2). They also have unique features that distinguish them from other lipopeptide BGCs: trans NRPS tri-domain (ATTe) proteins, PstA and LpmA, that interact with the CTs embedded in tri-modular NRPS (CAT-CT-CATE) proteins, PstB and LptB. This mechanism is also used by the more recently discovered malacidin (Hover et al., 2018) (Fig. 2). The PstA and LpmA proteins are small and lack repetitive domains, so it is likely that pstA and lpmA homologs will be assembled correctly in draft genomes. Friulimicin and the closely related molecules amphomycin, parvuline, and tsushimycin insert 3mAsp just upstream of the Asp-Gly-Asp-Gly Ca2+-binding sequence (Fig. 2). In the friulimicin producer, A. friuliensis, 3mAsp is biosynthesized by the two-subunit glutamate mutase, GlmA/GlmB (Heinzelmann et al., 2003). Thus PstA/LpmA and GlmA/GlmB are useful as molecular beacons to triage known lipopeptide BGCs.
One example is S. canus ATCC 12237, a known producer of amphomycin. The amphomycin peptide differs from friulimicin at position 1 (l-Asp vs. l-Asn in friulimicin). BLASTp analysis of the draft genome of S. canus ATCC 12237 indicated that it encodes FAAL, ACP, and CAT-CATTe NRPS enzymes (Tables 2, 3, and 6), and has an ACP multiprobe code similar to those of friulimicin/laspartomycin/glycinocin (Table 4). It encodes GlmA and GlmB homologs needed for biosynthesis of 3mAsp (not shown), and a LipB homolog needed to insert the Δ3 double bond in the fatty acid (Table 5). It has a typical lipopeptide MbtH homolog with high sequence similarity to those of the friulimicin and laspartomycin/glycinocin BGCs (Supplementary Tables S3 and S4), and an ABC transporter most closely related to that of laspartomycin (Supplementary Tables S6 and S7). It has a PstA (ATTe) homolog (Supplementary Table S9), but PstB is missing, likely due to inadequate assembly of this larger NRPS in the draft genome. Importantly, its DptD homolog termination dimodule has amino acid specificity for Val-Pro (Table 7). The Val binding code is identical to those of friulimicins, and the Pro binding code is identical to those of several in the laspartomycin/glycinocin group. This strain of S. canus is a good candidate for finished genome sequencing to provide a complete amphomycin BGC for the MIBiG database (Kautsar et al., 2020).
Streptomyces parvulus NRRL 5740 was reported to produce parvuline, a member of the amphomycin family. Its peptide is identical to that of amphomycin, and it has a 3-isodecenoyl fatty acid side chain rather than the 3-anteisodecenoyl side chain observed in amphomycin. Thus the BGCs of parvuline and amphomycin should be highly homologous. The draft genome of the recently isolated S. parvulus 2297 (Hu et al., 2018) contains many genes required for parvuline biosynthesis (Tables 2, 3, 5, 6, Supplementary Tables S3, S4, and S6-S9). Its ACP multiprobe code places its BGC within the friulimicin/laspartomycin/glycinocin group of related lipopeptides (Table 4).
As anticipated, the apparent parvuline biosynthetic proteins show the highest sequence similarities (78-90%, average 85%) to the apparent amphomycin biosynthetic proteins encoded in S. canus ATCC 12237 (Tables 2, 3, 5, 6, Supplementary Tables S3, S6, S7, and S9). Notably, the DptD homolog termination dimodule inserts Val-Pro, and the binding codes are identical to those of the predicted amphomycin BGC from S. canus ATCC 12237. The large NRPS genes involved in lipopeptide elongation are not assembled correctly, so it would be useful to obtain a finished genome to add a parvuline BGC to the MIBiG database (Kautsar et al., 2020). The molecular beacons used in these two examples can be used to triage common lipopeptide BGCs that encode molecules of little current interest for drug development, and to help focus on those with higher potential for clinical development.

Identification of Lipopeptide BGCs Related to Important Clinical Antibiotics
Daptomycin is an important antibiotic approved to treat difficult-to-treat Gram-positive infections, including MRSA (Baltz, 2009; Eisenstein et al., 2010). Daptomycin has 3mGlu at position 12 in the peptide. 3mGlu is biosynthesized by a mechanism that employs an α-ketoglutarate methyltransferase (Milne et al., 2006) encoded by dptI, a gene that has distantly related homologs in the A54145 (lptI) and CDA (glmT) BGCs. DptI, LptI, and GlmT are useful molecular beacons to identify lipopeptide BGCs containing the rare 3mGlu, and for sorting them into daptomycin, A54145, and CDA related clades (Baltz, 2018). DptI was used in such a search over a decade ago, and it led to the discovery of a DptI homolog (DptI-sv) in S. viridis embedded in a cryptic lipopeptide BGC closely related to that of daptomycin (Baltz, 2010b). More recently the taromycin BGC, closely related to the cryptic BGC in S. viridis, was cloned from the marine Saccharomonospora sp. CNQ490 and expressed in S. coelicolor (Yamanaka et al., 2014; Reynolds et al., 2018). The taromycin BGC encodes a DptI homolog, Tar13. DptI was used in a recent BLASTp search and identified another DptI homolog (DptI-ss), encoded by Streptomyces sedi JCM 16909 (Table 8). DptI-ss is more closely related to DptI, DptI-sv, and Tar13 than to any of the LptI or GlmT apparent orthologs, suggesting that it may be involved in the biosynthesis of a daptomycin-like lipopeptide. The draft genome of S. sedi (Li et al., 2009) also encodes a dimodular termination NRPS that shows 65% sequence identity to DptD, and has amino acid binding codes for 3mGlu-Kyn identical to those of DptD (Table 7). It has lipopeptide initiation FAAL and ACP homologs that are 56% and 50% identical to DptE and DptF, respectively (Tables 2 and 3), and an ACP multiprobe code similar to that of daptomycin, but differing at five positions (Table 4).
It encodes two ACAD-family fatty acid dehydrogenases distantly related to those of the taromycin BGC, suggesting that it initiates lipopeptide biosynthesis with a di-unsaturated fatty acid, perhaps of longer chain length than that of taromycin. S. sedi encodes a DptG homolog in the lipopeptide family, and the MbtH multiprobe analysis indicates that it differs from that of daptomycin at two positions, and taromycin at one position (Supplementary Tables S3 and S4). S. sedi encodes an ABC transporter pair that shows 65 and 70% sequence identities to DptM and DptN, respectively, but no DptP homolog (Supplementary Tables S6-S8). The S. sedi genome sequence includes several NRPS fragments that show >60% sequence identities to portions of the two large NRPS proteins involved in daptomycin biosynthesis, DptA and DptBC (not shown). S. sedi is a prime candidate to obtain a finished genome sequence to determine if it encodes a new lipopeptide antibiotic related to daptomycin and taromycin.
Fig. 3 Organization of lipopeptide BGC genes involved in initiation, elongation, and termination of a cryptic lipopeptide encoded by S. ambofaciens (Sambo-LP). The gene organization and NRPS subunit structures are distantly related to those of daptomycin, taromycin, and A54145, all of which encode tridecapeptides. The amino acid binding specificities differ substantially, however. The Ca2+-binding tetrapeptides of daptomycin, taromycin, A54145, and laspartomycin are shown in bold italics. The amino acid regions that could be converted to canonical Ca2+-binding sites by insertion of a single Asp or Gly module for Sambo-LP or malacidin, respectively, are shown in bold italics in parentheses.

Identification of Novel BGCs from Finished and Draft Genomes
homolog with <50% sequence identity to any of the known lipopeptide dimodular termination NRPSs (Table 6). Its DptD homolog has amino acid binding specificities for Thr-Hpg (Table 7). BLASTp analysis indicated that it encodes FAAL and free-standing ACP enzymes for initiation of lipopeptide assembly not closely related to any of the known cyclic lipopeptides (Tables 2 and 3). The ACP multiprobe code is consistent with a lipopeptide assignment, but differs from those of known lipopeptide ACPs (Table 4). antiSMASH 5.0 analysis indicates that a lipopeptide BGC abuts the PKS-I BGC encoding the 16-membered macrolide antibiotic spiramycin. These two BGCs cluster with two other SM-BGCs, and are located in the core region containing mostly primary metabolic genes (Aigle et al., 2014). The lipopeptide BGC contains three NRPS genes composed of 6, 5, and 2 modules (Fig. 3), and is predicted to encode a tridecapeptide. The d-Thr in position three could be involved in the formation of a depsipeptide bond to form a cyclic peptide. The novel lipopeptide BGC also encodes two ACAD-family fatty acid dehydrogenases, suggesting that lipopeptide assembly is initiated with a di-unsaturated fatty acid. It also encodes DptM and DptN ABC transporter homologs most closely related to those of daptomycin, taromycin, and A54145, and a DptG homolog in the lipopeptide MbtH family (Supplementary Tables S3-S5). It does not encode a DptP homolog. It is conceivable that the DptMN homolog ABC transporter interacts with heterologously expressed DptP in S. ambofaciens to express resistance to daptomycin (see above). This novel lipopeptide lacks a canonical Ca2+-binding tetrapeptide, but is one module insertion away in the Gly-Asp-Gly-Gly region of modules 8-11, as is malacidin between modules 5 and 6 (Fig. 3).
This cryptic lipopeptide BGC is a prime candidate for homologous or heterologous expression studies to assess biological activity, as it may provide a new scaffold for modification by combinatorial biosynthesis and medicinal chemistry.
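The "one module insertion away" observation can be checked mechanically. The Python sketch below takes the canonical site to be the Asp-Gly-Asp-Gly sequence given earlier for friulimicin, which is an assumption made for illustration (other family members may use variant tetrapeptides), and tests which single Asp or Gly insertions into the Gly-Asp-Gly-Gly region would create it. The function names are hypothetical.

```python
# Assumed canonical site: the Asp-Gly-Asp-Gly sequence named in the text for friulimicin.
CANONICAL = ("Asp", "Gly", "Asp", "Gly")


def has_motif(peptide, motif=CANONICAL):
    """True if the motif occurs as a contiguous run of residues."""
    n = len(motif)
    return any(
        tuple(peptide[i:i + n]) == tuple(motif)
        for i in range(len(peptide) - n + 1)
    )


def single_insertions_giving_motif(peptide, candidates=("Asp", "Gly"), motif=CANONICAL):
    """List (position, residue) single insertions that create the motif."""
    found = []
    for pos in range(len(peptide) + 1):
        for res in candidates:
            trial = list(peptide[:pos]) + [res] + list(peptide[pos:])
            if has_motif(trial, motif):
                found.append((pos, res))
    return found


# Region of modules 8-11 of the S. ambofaciens lipopeptide, as given in the text.
region = ["Gly", "Asp", "Gly", "Gly"]
print(has_motif(region))
print(single_insertions_giving_motif(region))
```

Under this assumption, only Asp insertions convert the region, which is consistent with the single Asp module insertion suggested for Sambo-LP in Fig. 3.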

Draft genomes
The draft genome sequence of Streptomyces zhaozhouensis (He et al., 2014) has a dptD homolog encoding a lipopeptide termination protein that shows 46-53% sequence identities to DptD homologs. Amino acid binding pocket analysis indicates that it incorporates Asp-Kyn. The Kyn binding pocket utilizes the same code as taromycin (Table 7). To my knowledge, this is the first instance of the use of Kyn in a termination dimodule differing from 3mGlu-Kyn of the daptomycin family of lipopeptides (daptomycin, taromycins, and the lipopeptide encoded by S. sedi reported here). S. zhaozhouensis also encodes DptE and DptF homologs not closely related to those of known lipopeptides (Tables 2 and 3), and the ACP multiprobe code is unique (Table 4). It does not encode an ACAD-family fatty acid dehydrogenase, suggesting that it initiates lipopeptide biosynthesis with a saturated fatty acid. It encodes no DptI homolog, consistent with its insertion of Asp-Kyn. It has an ABC transporter pair most closely related to those of CDA and telomycin (Supplementary Tables S6 and S7), giving additional credence to two lines of evolution of ABC transporters for lipopeptide BGCs. The predicted BGC also lacks a dptP gene. The combined information indicates that S. zhaozhouensis encodes a novel lipopeptide. As such, S. zhaozhouensis is a good candidate for finished genome sequencing and further analysis to assess biological activity and suitability for further SAR studies. Alternatively, the lipopeptide BGC could be cloned and sequenced by traditional methods.

Discussion
Biosynthesis of lipopeptides related to daptomycin is carried out by a mechanism reminiscent of protein biosynthesis; it can be described in terms of initiation, elongation, and termination. Initiation is carried out by a fatty acid to amino acid coupling device that is composed of a FAAL, a free-standing ACP, and a specialized CIII domain (FAAL:ACP:CIII). Elongation is carried out by NRPS multi-modular proteins that utilize mostly CAT and CATE modules to insert specific l- and d-amino acids specified by A domain amino acid binding pocket codes. Termination is carried out by dimodular CAT-CATTe NRPS elongation proteins that have terminal Te domains for cyclization and release of finished lipopeptides. This lipopeptide assembly device is well suited for natural combinatorial evolution and combinatorial biosynthesis in the laboratory (Baltz, 2014b). The initiation and termination mechanisms have provided molecular beacons (FAAL, ACP, and CAT-CATTe proteins) for BLASTp searches of finished and draft genomes to identify cryptic known, related, and novel lipopeptide BGCs. This has been exemplified by identifying cryptic genes for several known BGCs (e.g., A54145, taromycin, laspartomycin/glycinocins, and telomycin), predicted BGCs (parvuline and amphomycin), a related BGC (daptomycin family), and two novel BGCs. Several of these were from draft genomes, and the large elongation NRPSs were fragmented, as anticipated (Klassen & Currie, 2012; Baltz, 2017b, 2018, 2019; Goldstein et al., 2019). As such, this work has identified previously unsequenced BGCs for known and novel lipopeptides that are candidates for finished genome sequencing and deposition in MIBiG to facilitate future comparative analyses (Kautsar et al., 2020). The new daptomycin-related BGC encoded by S. sedi, and the novel BGCs encoded by S. ambofaciens and S. zhaozhouensis, are candidates for fermentation/expression studies to identify new lipopeptides for biological testing.
It is noteworthy that all of the lipopeptide BGCs in this study appear to use either of two phylogenetically related ABC transporters for export and molecular target-agnostic resistance. No other potential resistance mechanisms are encoded in any of the BGCs. This is advantageous for ongoing natural evolution and laboratory-based combinatorial biosynthesis of new lipopeptides with new or improved biological activities, including possible new target interactions and MOAs, as anticipated from the proven diversity in MOAs within the group (Baltz, 2009;Johnston et al., 2016;Hover et al., 2018).