Euijin Seo, Yun-Nam Choi, Ye Rim Shin, Donghyuk Kim, Jeong Wook Lee, Design of synthetic promoters for cyanobacteria with generative deep-learning model, Nucleic Acids Research, Volume 51, Issue 13, 21 July 2023, Pages 7071–7082, https://doi.org/10.1093/nar/gkad451
Abstract
Deep generative models, which can approximate complex data distributions from large datasets, are widely used in biological dataset analysis. In particular, they can identify and unravel hidden traits encoded within a complicated nucleotide sequence, allowing us to design genetic parts with accuracy. Here, we provide a deep-learning-based generic framework to design and evaluate synthetic promoters for cyanobacteria using generative models, which was in turn validated with a cell-free transcription assay. We developed a deep generative model and a predictive model using a variational autoencoder and a convolutional neural network, respectively. Using native promoter sequences of the model unicellular cyanobacterium Synechocystis sp. PCC 6803 as a training dataset, we generated 10 000 synthetic promoter sequences and predicted their strengths. By position weight matrix and k-mer analyses, we confirmed that our model captured a valid feature of cyanobacteria promoters from the dataset. Furthermore, critical subregion identification analysis consistently revealed the importance of the -10 box sequence motif in cyanobacteria promoters. Moreover, we validated via a cell-free transcription assay that the generated promoter sequences can efficiently drive transcription. This approach, combining in silico and in vitro studies, will provide a foundation for the rapid design and validation of synthetic promoters, especially for non-model organisms.

INTRODUCTION
Fine-tuning of gene expression is crucial for the engineering of living organisms. The acquisition of well-characterized biological parts can promote diverse and targeted control of gene expression. Promoters, in particular, are critical regulatory elements in gene expression at the transcriptional level. Numerous synthetic promoters have been generated and applied to achieve the desired transcriptional activity. Typically synthetic promoters are generated by replacing or combining regulatory elements originating from different genetic modules or by random mutation of the existing wild-type promoter sequences (1–8). These rational approaches have been widely used to generate synthetic promoters and are particularly effective for organisms with various genetic engineering tools.
Meanwhile, recent advances in deep learning technologies in biology-related fields have enabled the rapid characterization of promoter elements and the computational design of synthetic promoters (9–15). Convolutional neural network (CNN) models have been developed and employed for the recognition of promoter sequences (11–13) and the prediction of promoter strength (14,15) based on the nucleotide sequence, by learning features from known promoter sequences. Identifying key promoter features for regulatory activity can provide a design rule for synthetic promoters (14). Alternatively, a deep generative model can rapidly and directly generate synthetic promoter sequences, thus expanding the library size of promoters with varied strengths and characteristics. In a recent study, Wang et al. (16) proposed a de novo promoter generation method based on a deep generative network. Using experimentally identified promoter sequences from the transcriptome dataset of Escherichia coli, synthetic promoters were generated by a generative adversarial network (GAN) and then evaluated using a predictive CNN model, followed by in vivo validation. This AI-based design approach generated novel promoters for strong gene expression in E. coli. The GAN model has the advantage of generating high-quality synthetic data when it can be trained on a sufficiently large dataset (17,18).
Cyanobacteria are photoautotrophs that gain energy through photosynthesis. Recent developments in genetic engineering tools have raised the potential of cyanobacteria as a promising microbial host for biochemical production (19–21). However, in comparison to E. coli or yeast, it takes a relatively long time to construct an engineered strain of cyanobacteria (22,23), and their biological parts are not well-characterized, creating obstacles to cyanobacteria performance enhancement. Therefore, the rapid generation of numerous synthetic promoters using a deep learning method will improve their potential as next-generation microbial factories.
To this end, we propose a de novo promoter generation model for Synechocystis sp. PCC 6803 (Synechocystis), a model unicellular cyanobacterium. We use a variational autoencoder (VAE) as a deep generative model to create synthetic promoters. We then predict the expression level of the generated cyanobacteria synthetic promoters with a CNN model, whose hyperparameters are tuned using cross-validation to improve its performance. Next, we confirm that our deep generative and predictive models can successfully extract a valid feature of cyanobacteria promoters. The importance of the -10 box sequence motif is consistently revealed by position weight matrix, 6-mer frequency, and mutation analyses. We also demonstrate that the synthetic promoters generated by our approach are functional, using a cell-free transcription and CRISPR/Cas12a-based fluorescence assay (24). Of the synthetic promoters tested, 95% showed transcriptional activity, whereas randomly generated sequences did not. With the deep learning method and the cell-free transcription assay, we offer a generic framework to design, assess, and validate new synthetic promoters for cyanobacteria, i.e. organisms that are not equipped with a wide range of genetic engineering tools.
MATERIALS AND METHODS
Acquisition and processing of cyanobacteria promoter dataset
We used differential RNA sequencing (dRNA-seq) results of Synechocystis (25–28) as the training dataset for the variational autoencoder (VAE) model and the convolutional neural network (CNN) model. We used the dRNA-seq results of Synechocystis cultivated at 37°C in the dark to the exponential growth phase. Based on the transcription start sites (TSSs) identified in the dRNA-seq dataset, we took the 100 bp upstream of each TSS as a native promoter sequence and the number of normalized dRNA-seq reads as the promoter strength. Promoter sequence data were transformed into a 1 × 4 × 120-sized tensor through one-hot encoding (OHE) by converting each nucleotide (A, T, G, C) into the corresponding vector: adenine (A) as [1,0,0,0], thymine (T) as [0,1,0,0], guanine (G) as [0,0,1,0], and cytosine (C) as [0,0,0,1]. Promoter strength data were converted to a logarithmic scale (|$\log_2(x + 1)$|).
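The encoding and the strength transform described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `one_hot_encode` and `log_strength` are hypothetical, and the sketch returns a 4 × L matrix for a sequence of length L without reproducing the exact tensor layout used for the models.

```python
import math

# Channel order follows the text: A, T, G, C.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
}

def one_hot_encode(seq):
    """Return a 4 x len(seq) matrix whose rows are the A, T, G, C channels."""
    columns = [ONE_HOT[base] for base in seq.upper()]
    # Transpose so that each row corresponds to one nucleotide channel.
    return [list(row) for row in zip(*columns)]

def log_strength(normalized_reads):
    """Apply the log2(x + 1) transform to a normalized read count."""
    return math.log2(normalized_reads + 1)
```

For example, `one_hot_encode("ATGC")` yields the 4 × 4 identity matrix, and a promoter with three normalized reads has a transformed strength of 2.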
Brief introduction to VAE
The VAE is a popular deep generative model that learns the distribution of training data and generates new data sharing the intrinsic features of the training data (29). A VAE is composed of an encoder (|${q}_\varphi (z|x)$|), a latent space (|$z$|) and a decoder (|${p}_\theta (x|z)$|). The encoder maps the distribution of the data (|$x$|) into the latent space using a weight tensor (|$\varphi ( w )$|), representing the input data in a low-dimensional space. The encoded information in the latent space follows a Gaussian distribution (|${q}_\varphi ( z|x ) = N( {\mu }_{z|x},\ {\Sigma }_{z|x} )$|). After encoding, the decoder obtains a new data distribution (|$x^{\prime}$|) by applying its weight tensor (|$\theta ( w )$|) to the encoded information.
The goal of training the VAE is to find the optimal weight tensors (|$\varphi ( w )$| and |$\theta ( w )$|) that generate output data (|$x^{\prime}$|) resembling the input data (|$x$|). To maximize the likelihood, the evidence lower bound (ELBO; |$L(x,\ \varphi ,\theta)$|) function denoted below is used as the training objective. The training process can be described mathematically as:
|$L({x}_i,\ \varphi ,\theta) = {E}_z[ \log {p}_{\theta} ( {x}_i|z ) ] - KL( {q}_\varphi ( z|{x}_i )\parallel {p}_\theta ( {x}_i|z ) )$|
|${E}_z[ \log {p}_{\theta} ( {x}_i|z ) ]$| represents the expectation of the log-likelihood |${p}_{\theta} ({x}_i|z)$|, and the Kullback–Leibler (|$KL$|) divergence represents the difference between the probability distribution function of the samples (|${p}_{\theta} ({x}_i|z)$|) and the approximated probability distribution function of the training data (|${q}_\varphi ( z|{x}_i )$|). Training iterates to find the optimal weight tensors (|$\varphi ( w )$| and |$\theta ( w )$|) by maximizing |${E}_z[ \log {p}_{\theta} ( {x}_i|z ) ]$| and minimizing |$KL( {q}_\varphi ( z|{x}_i )\parallel {p}_\theta ( {x}_i|z ) )$|, and continues until the ELBO converges to its maximum. With the trained VAE model, the features of the training data can be mapped onto the latent space through the encoder. To generate new data that resemble the features of the original training data, a tensor of random values is supplied as input to the decoder, which creates new data by applying its weight tensor (|$\theta ( w )$|) to the random value tensor.
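As a rough numerical illustration of how the two terms of the objective are usually evaluated, the sketch below assumes a diagonal-Gaussian encoder and the closed-form KL term against a standard-normal prior that is customary in VAE implementations. That prior is an assumption, since the text does not state how the KL term was computed, and all function names are hypothetical.

```python
import math
import random

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumption: a standard-normal prior, as in most VAE implementations.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) for each latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def elbo(log_likelihood, mu, log_var):
    """ELBO = reconstruction log-likelihood minus the KL term."""
    return log_likelihood - gaussian_kl_to_standard_normal(mu, log_var)
```

When the encoder output already equals the prior (zero mean, unit variance), the KL term vanishes and the ELBO reduces to the reconstruction log-likelihood.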
Model training of VAE
We constructed our VAE model in Python using TensorFlow. The training data for our VAE model contain 3712 Synechocystis promoter sequences (25). To obtain the best hyperparameter set for the VAE model, we conducted a random search over a grid of hyperparameter values: the number of layers, the kernel length, the number of epochs, and the batch size. We used the Adam optimizer and chose ReLU as the activation function. A total of 1000 hyperparameter combinations were tested, and we selected the model with the maximum ELBO at 1000 epochs. After choosing the VAE model, we generated 10 000 synthetic promoters through the decoder of the VAE model (Supplementary Table S1). We also generated five random sequences as dummy sequences to serve as negative controls (Supplementary Table S2).
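A random search of this kind can be sketched as follows. The grid values in the usage note and the `score` callable are hypothetical placeholders: in the study the score would be the ELBO of a trained VAE, which is not reproduced here.

```python
import random

def random_search(grid, n_trials, score, seed=0):
    """Sample n_trials hyperparameter combinations and keep the best-scoring one.

    grid  : dict mapping a hyperparameter name to its candidate values
    score : callable(params) -> float, higher is better (e.g. a final ELBO)
    """
    rng = random.Random(seed)
    best_score, best_params = None, None
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in grid.items()}
        current = score(params)
        if best_score is None or current > best_score:
            best_score, best_params = current, params
    return best_score, best_params
```

A call might look like `random_search({"n_layers": [1, 2, 3], "kernel_length": [3, 5, 7]}, 1000, train_and_return_elbo)`, where `train_and_return_elbo` stands in for model training and evaluation.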
Evaluation of VAE model by position weight matrix (PWM) and 6-mer frequency analyses
We converted a csv format of native and generated promoter sequence data to a fasta format. The converted data was used to calculate the PWM. Using a matrix model, a numerical score is calculated for each nucleotide at each position (30). PWM results were graphically represented as sequence logos through Weblogo3 (31) (Supplementary Figure S1).
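For readers unfamiliar with PWMs, a minimal sketch of one common formulation is given below: per-position base counts are converted to log2-odds scores against a uniform background. The pseudocount and the background frequency of 0.25 are assumptions made for illustration, not parameters taken from the study.

```python
import math

BASES = "ACGT"

def position_weight_matrix(seqs, pseudocount=1.0, background=0.25):
    """Return a list (one entry per position) of {base: log2-odds score}."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    pwm = []
    for pos in range(length):
        # Start from the pseudocount so that unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for s in seqs:
            counts[s[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm
```

Positions where one base dominates, such as the first T of a -10 box, receive a positive score for that base and negative scores for the others.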
For the 6-mer frequency analysis, we identified every 6-mer occurring between -20 and -1 in the native promoter sequences and calculated the frequency of each 6-mer. We selected the three 6-mers most frequently observed in the region (TAAAAT, TAGAAT, AAAATA). In addition to these top three 6-mers, we selected TATAAT for further analysis. The distributions of the four 6-mers across each promoter sequence were investigated, and the frequency of each 6-mer was calculated.
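The counting step can be sketched as below, assuming that each promoter string ends at position -1 (the base immediately upstream of the TSS) and that only 6-mers lying entirely within the -20 to -1 window are counted. Both conventions and the function names are assumptions made for illustration.

```python
from collections import Counter

def kmers_in_window(promoter, start=-20, end=-1, k=6):
    """Yield (position, k-mer) pairs; position is the TSS-relative start of the k-mer."""
    n = len(promoter)
    for pos in range(start, end - k + 2):
        i = n + pos  # convert the TSS-relative position to a 0-based index
        yield pos, promoter[i:i + k]

def kmer_counts(promoters, **window):
    """Count every k-mer found in the window across all promoters."""
    counts = Counter()
    for promoter in promoters:
        counts.update(kmer for _, kmer in kmers_in_window(promoter, **window))
    return counts
```

Calling `kmer_counts(promoters).most_common(3)` would then list the three most frequent 6-mers, and the positions yielded by `kmers_in_window` give the per-position distributions.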
Brief introduction to CNN
As a popular artificial neural network model, CNN is widely used for the analysis of images (32,33) or sequence data in a two-dimensional form converted through one-hot encoding (16). In the CNN, the convolutional layer extracts a certain feature from input data using a set of weight tensors called kernels and forms the output called a ‘feature map’. An element (|$Y[ {i,j} ]$|) of ith row and jth column in the feature map obtained by the dot product of input data (|$X$|) and kernel (|$W$|) can be described as:
|$Y[ {i,j} ] = \sum_m \sum_n X[ {i + m,\ j + n} ]\ W[ {m,n} ]$|
Through convolution, information regarding an input data point and its neighbors is contained in the feature map. These values in the first feature map are then computed with the next kernel to obtain the following feature map. As calculation proceeds with multiple kernels, the convolutional layers can contain a larger area of data features. After the entire calculation process with kernels, the attained values on the feature map and the corresponding ground truth values are used to calculate a loss function of the CNN model. The resulting values from the loss function are then applied to adjust weight tensors in the kernels. This training process is iterated until values from the loss function of the CNN model converge to minimums, and optimal weight tensors are obtained. After the training process, the CNN model can be used to predict and analyze new image data (or sequence data in two-dimensional form) based on the features obtained from the training data.
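The feature-map computation can be illustrated with a small pure-Python sketch of a stride-1, unpadded sliding dot product (the cross-correlation form used by most deep learning frameworks). The stride and padding choices here are assumptions, and `conv2d_valid` is a hypothetical helper rather than part of the authors' code.

```python
def conv2d_valid(x, w):
    """Slide kernel w over input x (no padding, stride 1) and return the feature map.

    Each output element is the dot product of the kernel with the input patch
    whose top-left corner is at (i, j).
    """
    kernel_h, kernel_w = len(w), len(w[0])
    out_h = len(x) - kernel_h + 1
    out_w = len(x[0]) - kernel_w + 1
    return [
        [
            sum(x[i + m][j + n] * w[m][n]
                for m in range(kernel_h) for n in range(kernel_w))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]
```

With a one-hot encoded sequence as `x` (rows A, T, G, C) and a 4 × 2 kernel that encodes the dinucleotide TA, the output row peaks at every position where TA occurs.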
Training of CNN model
The CNN model was constructed in Python using TensorFlow. The training dataset for the CNN model contained all 3712 Synechocystis promoter sequences and their coverages in the dRNA-seq dataset from Kopf et al. (25). As in the VAE model training, we conducted a random search to find the best set of hyperparameters. We used the Adam optimizer and chose ReLU as the activation function. We set the mean-squared error (MSE) as the loss function of the model. To avoid overfitting, we evaluated 1000 CNN models from the random search through 5-fold cross-validation. We selected the CNN model with the minimum MSE value and used it for prediction; the result is shown in Supplementary Figure S2.
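The 5-fold cross-validation scheme can be sketched independently of any particular model. In the sketch below, `fit_and_score` is a hypothetical callable that trains a model on the training indices and returns the validation MSE. The shuffling seed and the way folds are formed are assumptions.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        train = [j for fold_id, fold in enumerate(folds) if fold_id != i for j in fold]
        yield train, validation

def cross_validated_mse(n_samples, fit_and_score, k=5, seed=0):
    """Average the validation MSE returned by fit_and_score over all folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Each candidate hyperparameter set from the random search would be scored with `cross_validated_mse`, and the set with the lowest average MSE retained.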
Critical subregion identification analysis
We divided each native promoter sequence into 3-mer units and randomized the sequences of two units simultaneously. The promoter strength of each mutated set was predicted through the CNN model that was previously trained. The Pearson correlation coefficients between the predicted strengths of each mutated promoter and the corresponding native promoter were calculated to evaluate the impacts of mutations. The calculation of the Pearson correlation coefficient (|$r$|) between the predicted strengths of the mutated promoter set (|$x^{\prime}$|) and the corresponding native promoter set (|$x$|) can be described mathematically as:
|$r = \frac{\sum_i ( {x}_i - \bar{x} )( x_i^{\prime} - \overline{x^{\prime}} )}{\sqrt{\sum_i {( {x}_i - \bar{x} )}^2} \sqrt{\sum_i {( x_i^{\prime} - \overline{x^{\prime}} )}^2}}$|
The upper bar represents the mean of the promoter set's predicted strength. The results were displayed as a heatmap graph (Supplementary Figure S3).
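The coefficient can be computed directly from its definition, as in the following sketch. The function name `pearson_r` is hypothetical, and the inputs are assumed to be two equal-length lists of predicted strengths with non-zero variance.

```python
import math

def pearson_r(native, mutated):
    """Pearson correlation between native and mutated predicted strengths."""
    n = len(native)
    mean_native = sum(native) / n
    mean_mutated = sum(mutated) / n
    covariance = sum((a - mean_native) * (b - mean_mutated)
                     for a, b in zip(native, mutated))
    ss_native = sum((a - mean_native) ** 2 for a in native)
    ss_mutated = sum((b - mean_mutated) ** 2 for b in mutated)
    return covariance / math.sqrt(ss_native * ss_mutated)
```

A value near 1 indicates that the mutation left the predicted strengths essentially unchanged in rank and scale, whereas a value near 0 indicates that the mutated subregion strongly perturbs the prediction.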
Data refinement and new CNN model training
We classified the promoter dataset based on the presence of abundant 6-mers between –13 and –6: NANNNT, NANANT, TANNNT, or TANANT. With each classified dataset containing the specific 6-mer sequences, the CNN model was trained in the same way as described above. Then, we newly predicted the promoter strengths of the 10 000 synthetic promoters (Supplementary Figure S4 and Table S1) and the five dummy sequences (Supplementary Table S2).
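The filtering step can be sketched with regular expressions, treating N as any of the four bases. The sketch assumes that each promoter string ends at position -1 and that the 6-mer must lie entirely within the -13 to -6 window. The helper names are hypothetical.

```python
import re

def motif_to_regex(pattern):
    """Translate a motif such as 'TANNNT' into a regex, with N matching any base."""
    return re.compile("".join("[ACGT]" if c == "N" else c for c in pattern))

def has_motif(promoter, pattern, start=-13, end=-6):
    """Return True if the motif occurs fully inside the [start, end] window."""
    n = len(promoter)
    window = promoter[n + start:n + end + 1]  # inclusive TSS-relative window
    return motif_to_regex(pattern).search(window) is not None

def refine(promoters, pattern):
    """Keep only the promoters that contain the motif in the window."""
    return [p for p in promoters if has_motif(p, pattern)]
```

Applying `refine` with each of the four patterns would produce the four collections used to retrain the CNN model.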
BLAST search on the generated sequences
A nucleotide BLAST search was conducted on the 10 000 synthetic promoter sequences and five dummy sequences against the whole Synechocystis sp. PCC 6803 genome with the default setting (https://www.ncbi.nlm.nih.gov 〉 geo 〉 query 〉 blast).
DNA manipulation and strain construction
Oligonucleotides and plasmids used in this study are listed in Supplementary Tables S3 and S4. All the DNA manipulating enzymes and reagents were purchased from New England Biolabs (NEB). Q5 DNA polymerase was used for polymerase chain reaction (PCR) according to the manufacturer's protocol. All oligonucleotides were synthesized by Cosmogenetech (Korea). For plasmid extraction and purification of PCR products, kits from Geneall were used.
To construct plasmids for crRNA expression, pTRC.crRNA (24) was used as a backbone plasmid and was linearized by PCR with the primer set Fvector and Rvector. For each insert DNA, two oligonucleotides (0.5 μM each) were hybridized, and 20 cycles of polymerase chain extension were conducted with annealing at 50°C for 15 s in a 20-μl reaction volume. Then, 1 μl of the product was used as a template for the follow-up PCR with the primer set Finsert and Rinsert. Next, the linearized plasmid and the insert DNA were assembled by Gibson assembly (34,35). After transformation, sequences were confirmed by DNA sequencing (Cosmogenetech). For plasmid construction and propagation, DH5α chemically competent Escherichia coli cells (Enzynomics) were used. Cells were grown in LB medium at 37°C, and carbenicillin was supplemented for selecting and maintaining transformants.
To construct plasmids for yellow fluorescent protein (YFP) expression in Synechocystis, pTRC.YFP (24) was used as a backbone plasmid and was linearized by PCR with the primer set Fyfpvector and Ryfpvector. The promoters selected for in vivo testing were amplified by PCR with the primer set Finsertyfp and Rinsertyfp. Next, the linearized plasmid and the insert DNA were assembled by Gibson assembly. DH10β E. coli cells harboring the helper plasmid (RP4) and the target plasmid, respectively, were incubated together with Synechocystis cells for conjugation. After a 5 h incubation, cells were spread onto a nitrocellulose membrane filter (MF-Millipore) placed on a BG-11 agar plate. The filter was periodically moved to a new BG-11 plate supplemented with an appropriate concentration of kanamycin so that cells were exposed to a gradually increasing concentration of kanamycin until recombinant colonies appeared. Colony PCR and sequence analysis (Cosmogenetech) were then performed to confirm the sequence.
Cell-free transcription and CRISPR/Cas12a-based assay for promoter activity measurement
Cell-free transcription was conducted as described previously (24) with modification. Briefly, Synechocystis cells were cultivated under mixotrophic conditions with continuous lighting (50 μmol photons/s/m2) and glucose (10 mM); their growth was monitored by measuring optical density at 730 nm (Biodrop). Synechocystis cells were harvested in the mid-exponential growth phase (OD730 ≈ 3) and lysed by bead-beating. After removal of cell debris by centrifugation, cell extracts were diluted and subjected to a run-off reaction for 5 h at 37°C. For cell-free transcription (CF-TX), the following components were prepared: 10 mM magnesium acetate, 30 mM potassium acetate, 20 mM ammonium chloride, 2% PEG-8000, 50 mM HEPES (pH 8), 1.5 mM ATP and GTP, 0.9 mM CTP and UTP, 0.2 mg/ml tRNA, 0.26 mM CoA, 0.33 mM NAD, 0.75 mM cAMP, 0.068 mM folinic acid, 1 mM spermidine, 30 mM 3-phosphoglyceric acid, 8 mM creatine phosphate, 0.4 mg/ml creatine kinase, 1 mM DTT, 0.8 units RNase inhibitor (NEB), 30% cell extract by volume, and nuclease-free water. Plasmids containing the crRNA expression cassette under each generated synthetic promoter were added separately. After 8 h of incubation at 37°C, the CF-TX products were stored at –20°C until the CRISPR/Cas12a-based fluorescence assay (FQ-assay). A negative control without a template plasmid was also run following the same procedure. Three separate CF-TX reactions were conducted for each plasmid.
The FQ-assay was performed as described previously (24). Briefly, the assay mixture was prepared as follows: 200 nM LbCas12a (NEB), 50 nM pUC19 (as activator DNA), 1000 nM fluorophore quencher DNA (FQ-DNA, Cosmogenetech), and 1x NEBuffer 2.1. The CF-TX products (30% by volume) were added to the assay mixture, and the reaction was started. After 3 hours of reaction at 37°C, the end-point fluorescence was measured using a microplate reader (Hidex Sense 425–301). For analysis, the background fluorescence value obtained from the negative control was subtracted from each fluorescence value. Two FQ-assays were conducted for each CF-TX reaction.
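The background correction can be expressed as a short calculation. The sketch below assumes that the negative-control readings are averaged into a single background value and that the corrected readings of one promoter (three CF-TX reactions, each assayed twice) are then averaged. This pooling scheme and the function names are assumptions, and the numbers in the usage note are made up.

```python
from statistics import mean

def background_subtracted(readings, negative_control_readings):
    """Subtract the mean negative-control fluorescence from each reading."""
    background = mean(negative_control_readings)
    return [value - background for value in readings]

def summarize_promoter(readings, negative_control_readings):
    """Return the mean background-corrected fluorescence for one promoter."""
    return mean(background_subtracted(readings, negative_control_readings))
```

For instance, `summarize_promoter([12.0, 14.0], [2.0, 4.0])` subtracts a background of 3.0 from each reading and returns 10.0.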
In vivo YFP fluorescence analysis
Synechocystis cells harboring the YFP expression plasmids were inoculated into 10 ml of BG-11 supplemented with 5 mM glucose and 50 μg/ml kanamycin at an initial OD730 of 0.1. Under continuous light (50 μmol photons/s/m2), cells were grown to the mid-exponential phase, and YFP fluorescence was measured with a microplate reader (Hidex Sense 425–301).
RESULTS
A scheme of deep learning-aided synthetic promoter generation
This study aimed to generate synthetic promoter sequences for cyanobacteria using a deep learning method. The overall process consists of three steps (Figure 1): (i) promoter generation using a variational autoencoder (VAE) model, (ii) prediction of promoter strength using a convolutional neural network (CNN) model and (iii) validation of transcriptional activities of synthetic promoters. In the promoter generation step, native promoter sequences obtained from the dRNA-seq of Synechocystis (25) were used as training data for the VAE model, which generates new synthetic promoter sequences. For the promoter strength prediction, each promoter sequence and the number of reads from the dRNA-seq were used to train the CNN model (25), which was then used to predict the strengths of synthetic promoters generated from the VAE. Finally, we used a previously developed cyanobacterial cell-free transcription assay (24) to validate our prediction.

Figure 1. A framework for cyanobacteria synthetic promoter design using deep learning methods. The overall synthetic promoter generation process consists of three steps: generation, prediction, and validation. In the generation step, a variational autoencoder (VAE) model is trained with native promoter sequences of cyanobacteria and generates numerous synthetic promoters. Next, promoter strength is predicted by a convolutional neural network (CNN) model. For validation, generated promoters are tested to check promoter activity via cell-free transcription (CF-TX) assay.
Generation of synthetic promoter sequences using variational autoencoder (VAE)
A deep generative model, which combines a generative model and a deep neural network, can generate new data by learning the distribution of a particular dataset and capturing its features (36,37). Because the deep generative model does not require a data-labeling process that specifies features manually, it is advantageous for training unclassified and intractable data. Compared to model organisms such as E. coli and yeast, sequence data of Synechocystis revealed relatively less conserved motifs in promoter regions (37,38). In this study, we used VAE, one of the deep generative models, to generate synthetic promoter sequences that resemble the native promoter sequences of Synechocystis (see the Materials and Methods section for the full description).
As training data for our VAE model, we applied native promoter sequences of Synechocystis derived from the dRNA-seq (25). The dRNA seq identifies the primary transcriptome, providing the genome-wide maps of TSSs. Thus, we can obtain promoter sequences and their reads from the dRNA-seq (25,39). Using the VAE model, we generated 10 000 synthetic promoter sequences.
Next, we evaluated whether the generated sequences are functional as promoters for gene expression using two approaches. We first investigated the distribution of sequence motifs in both native and generated promoter regions by calculating the position weight matrix (PWM). We found that the –10 box sequence motif was highly conserved in the native promoter region between –12 and –7 (Figure 2A), which is consistent with a previous report for cyanobacteria (37,38) and for other Gram-negative bacteria (40). More importantly, a similar –10 box sequence motif was observed in the generated promoters, confirming that our VAE model accurately captured traits of the cyanobacteria promoters and created valid new promoter sequences.

Figure 2. Evaluation of the variational autoencoder (VAE) model using position weight matrix (PWM) and 6-mer frequency analyses. (A) The sequence logos of native (top) and generated (bottom) promoters on promoter region between –20 and –1. (B) Distributions of selected 6-mers (TAAAAT, TAGAAT, AAAATA, and TATAAT) on the native and generated promoters. The x-axis represents relative positions to the transcription start site (TSS). The y-axes on the left and right sides represent each 6-mer frequency in native and generated promoters, respectively.
On top of the PWM result, we performed a 6-mer frequency analysis on the promoter region between –20 and –1. We calculated the frequency of every 6-mer in that region of native promoters and chose the top three 6-mer sequences. In addition to these three, we also selected TATAAT, the universal –10 box sequence motif, for 6-mer frequency analysis. We investigated how the four 6-mer sequences were distributed in the generated sequences and found that the three 6-mers (TATAAT, TAAAAT, and AAAATA) showed nearly identical distributions in the native and generated promoter sequences (Figure 2B). The other 6-mer (TAGAAT) showed a slightly different distribution, but did appear between –20 and –1. This result implies that the promoter sequences generated from our VAE model contain a valid feature as a promoter in cyanobacteria.
Prediction of promoter strength using a predictive CNN model
After confirming the validity of our VAE model for generating synthetic promoter sequences for cyanobacteria, we aimed to develop a predictive model to estimate the promoter strength of the generated promoters. To use nucleotide sequence data in deep learning models, conversion of the data format through OHE is essential. In the OHE, a one-dimensional nucleotide sequence is converted into a 2D data format to be used to extract further hierarchical features. CNN has proven its performance in extracting hierarchical features from sequential data such as image or sequence strings and has been widely used for genomic or other biological sequence analysis. DeePromoter (15) can distinguish promoter sequences from non-promoter sequences, DeepBind (41) can accurately predict sequence specificities of DNA- and RNA-binding proteins, and AlphaFold (42) can deduce protein structure from its amino acid sequence.
Similarly, we used CNN to develop a predictive model for promoter sequences. Gene expression data from the dRNA-seq study (25) were used to train the CNN model. As a training method, we used k-fold cross validation to avoid overfitting, which possibly occurs because of the limited input data size of the dRNA-seq. We assessed our predictive model based on the Pearson correlation between the predicted promoter strength and the gene expression level determined experimentally (25) and confirmed the prediction accuracy (P = 0.41) (Supplementary Figure S2).
Critical subregion identification
To further improve the prediction capability of our model, we sought to identify critical subregions of the promoters. Previously, we demonstrated that base-pair alterations in the promoter region strongly affect promoter strength, and that the larger the change in promoter strength caused by an alteration, the more critical the altered region is for promoter strength (24). Accordingly, for critical subregion identification, we assumed that if promoter strength changes dramatically upon alteration of specific subregions in the promoter sequence, these subregions are critical for promoter strength. Since 6-mer-based frequency analysis is a typical way of searching for promoter motifs, we considered 6-mers or shorter units as the unit subregion. Considering computing capacity and analysis time, we chose the 3-mer as the unit subregion. To inspect synthetic promoters more comprehensively, we altered two different 3-mers in the promoter region simultaneously, in contrast to the previously demonstrated one-dimensional single-base alteration (15). This allowed us to inspect the combinatorial impact of two subregions on promoter strength and to detect which pairs of altered 3-mer loci, including tandem 3-mers (6-mers), most strongly influenced promoter strength (Figure 3A and B).

Figure 3. Critical subregion identification. (A) Schematic representation for combining two subregions to be altered and analyzed. A native promoter nucleotide sequence is grouped into three consecutive nucleotides as a unit subregion. All possible combinations for two subregions are denoted from Set 1 to Set 528. The figure represents examples of subregion alteration from Set 1 to Set 528. (B) Schematic representation for critical subregion identification analysis. The CNN model predicts the strength of each promoter after the mutations. The colored rectangle represents the Pearson correlation coefficient between the predicted strengths of the native promoter set (|${x}_1$| to |${x}_{3712}$|) and promoter Set 1 (|$x_1^{Set\ 1}$| to |$x_{3712}^{Set\ 1}$|). (C) A heatmap graph represents Pearson correlation coefficient between the predicted promoter strengths before and after the mutation. The graph is for Case 1, and heatmap graphs for Cases 2 and 3 are provided in Supplementary Figure S3. The smaller value means a more dramatic change in the promoter strength caused by mutations on the subregion, revealing the critical subregions for promoter strength. The red boxes represent the critical subregions identified. The x- and y-axes denote relative positions to the transcription start site (TSS).
To do so, we used the CNN model to evaluate promoter strengths before and after subregion alteration with the following procedure. First, we designated a 3-mer as the unit subregion, from –99 to –1, obtaining 33 subregions (Figure 3A). Then, we created all possible combinations of two subregions, yielding 528 different sets (33C2). For each set, which designates two unique alteration loci, we evaluated the strengths of all 3712 promoters before and after alteration of the two subregions marked by red rectangles (Figure 3A). Then, we calculated the Pearson correlation coefficient between two groups: the strengths of the 3712 promoters with the original sequences and those with the altered nucleotide sequences (Figure 3A). After calculating the Pearson correlation coefficient (–1 ≤ ρ ≤ 1; 1 for the highest correlation) between the two groups, we displayed the value as indicated in Figure 3B. If a subregion is not critical for promoter strength, the promoter strengths before and after the alteration are similar, and the Pearson correlation coefficient is therefore close to 1 (green-colored blocks, Figure 3C). However, if a subregion is critical for promoter strength, the promoter strengths before and after the alteration differ considerably, and the Pearson correlation coefficient is therefore close to 0 (purple-colored blocks, Figure 3C). In other words, we reasoned that a smaller Pearson correlation coefficient indicates a larger variation in promoter strength caused by mutations in that subregion. We iterated this procedure from Set 1 to Set 528 and plotted the Pearson correlation coefficients in Figure 3C to show which subregions are critical for promoter strength.
In addition to the promoter region from –99 to –1 (Case 1), we repeated the same analysis for the promoter regions from –100 to –2 (Case 2) and –98 to –3 (Case 3), respectively, to inspect critical regions more thoroughly (Supplementary Figure S3). Taken together, the critical subregion identification showed that the –15 to –7 region for Case 1, the –13 to –8 region for Case 2, and the –14 to –6 region for Case 3, that is, collectively the –15 to –6 region, were important for promoter strength compared to other regions (Figure 3, Supplementary Figure S3). This result was consistent with our previous PWM results (Figure 2A).
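The enumeration and alteration steps for Case 1 can be sketched as follows. The sketch assumes a 100-bp promoter string in which index 0 is position –100, so the 33 units covering –99 to –1 start at index 1. The helper names are hypothetical, and the prediction step, which would call the trained CNN on each altered sequence, is omitted.

```python
import itertools
import random

def subregion_pairs(n_units=33):
    """All unordered pairs of unit subregions: C(33, 2) = 528 sets."""
    return list(itertools.combinations(range(n_units), 2))

def randomize_units(seq, units, rng, unit_len=3, offset=1):
    """Replace the bases of the given 3-mer units with random nucleotides.

    offset=1 skips position -100 so that unit 0 covers -99 to -97 (Case 1).
    """
    bases = list(seq)
    for unit in units:
        start = offset + unit * unit_len
        for i in range(start, start + unit_len):
            bases[i] = rng.choice("ACGT")
    return "".join(bases)
```

Looping over `subregion_pairs()`, altering every native promoter with `randomize_units`, predicting the strengths, and correlating them with the native predictions would fill one cell of the heatmap per pair.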
Data refinement for improved prediction accuracy
Next, we sought to refine the training dataset by excluding relatively less-featured data. Based on the consensus promoter sequence identified in the PWM analysis (Figure 2A), the most plausible 6-mer consensus was TANANT between –12 and –7. Thus, we sought to collect native promoter sequences with TANANT, but widened the range to between –13 and –6 to allow positional variations of the consensus sequence on promoters. We found 826 promoters in cyanobacteria containing TANANT between –13 and –6 (Figure 4A). However, when we trained the CNN model again with the new collection, the prediction accuracy did not improve over that of the original CNN model (Figure 4B). We then relaxed the constraint on the consensus sequence from 4 bp to 2 or 3 bp (Figure 4A) and made NANNNT, NANANT, and TANNNT collections, which contained 3339, 1254 and 1933 promoters, respectively. With the three refined datasets, we re-trained the CNN model and found that the prediction capability for promoter strength was significantly improved with the CNN model using the TANNNT collection (Figure 4B). This result implies that refining the training data can be a simple strategy to increase the prediction accuracy of the predictive CNN model.

Figure 4. Refinement of training data to improve prediction accuracy. (A) Criteria to exclude relatively less-featured data. Native promoters that contain NANNNT, NANANT, TANNNT or TANANT, respectively, in the region between –13 and –6, were selected for CNN-training data. (B) Comparison of the Pearson correlation coefficient from previous and new CNN models. When we used a refined dataset with TANNNT-containing promoters for the CNN model, we found that the prediction accuracy was improved compared to the previous model. NNNNNN represents the original training dataset prior to the data refining.
Validation of transcriptional activities of synthetic promoters using a cell-free transcription assay
We generated synthetic promoter sequences based on cyanobacteria native promoter sequences and assessed their functionality as promoters using deep learning methods. We then validated their transcriptional activities using a cell-free transcription (CF-TX)-based, efficient promoter strength assessment assay developed previously (24). Briefly, in the CF-TX assay, crRNA is transcribed under a promoter of interest in a cell-free system that contains the native molecular machinery of cyanobacteria. The crRNA produced by CF-TX forms a complex with the CRISPR/Cas12a enzyme on the target DNA that we added. The formation of the CRISPR complex on the target activates the trans-cleavage activity of Cas12a against non-specific ssDNA in the reaction. This collateral endonuclease activity, which is proportional to promoter transcription activity (24), can be monitored using a fluorophore- and quencher-labeled ssDNA probe (Supplementary Figure S5) (43,44). Therefore, we can efficiently quantify promoter strength by measuring the fluorescence of the CF-TX reaction (Figure 5A). To confirm whether the generated synthetic promoters drive cyanobacterial transcription, we selected the top 20 sequences and constructed crRNA expression plasmids with these promoter sequences; however, two plasmids carrying the S9 and S10 promoters proved difficult to generate. Instead, we added the S21 and S22 promoters, making a total of 20 synthetic promoters for experimental validation. We also prepared five additional constructs with random sequences, dummy 1–5 (Supplementary Table S2), as negative controls.

Figure 5. Validation of promoter activity using CRISPR/Cas12a-based cell-free transcription assay. (A) Schematic representation of CRISPR/Cas12a-based cell-free transcription assay. The illustration was adapted from our previous publication (24). crRNA is set up to be transcribed under a promoter of interest in a cell-free system that contains cyanobacteria's native molecular machinery. The resulting crRNA can be detected by fluorescence measurement because complex formation of Cas12a with the crRNA on a target DNA leads to collateral cleavage of fluorophore- and quencher-labeled ssDNA probes (FQ-DNAs), emitting fluorescence. Thus, synthetic promoter strength can be measured by quantifying fluorescence in a Cas12a-based FQ-assay. (B) Promoter activity measured using the CRISPR/Cas12a-based cell-free transcription assay. The crRNA expression cassette was constructed under each synthetic promoter sequence (S1–S8, S11–S22), dummy sequence (D1–D5), or native promoter (N1–N12, psbA2S, psaA, rbcL, rnpB). Three independent cell-free transcription reactions were conducted, each followed by two Cas12a-based FQ-assay reactions. Most synthetic promoters showed transcriptional activity compared to dummy sequences, and four synthetic promoters exhibited higher activity than one of the strong native promoters, psbA2S, confirming that the deep learning-based approach can successfully generate valid promoter sequences. (C) In vivo YFP expression under selected synthetic promoters and dummy sequences. Synthetic promoters drove higher gene expression in vivo than dummy sequences. The means and standard deviations from two independent cultures are presented.
All five dummy sequences showed negligible transcription activities compared to the background fluorescence, which was measured from the CF-TX performed without any construct DNA, indicating that the dummy sequences were not functional as promoters. Conversely, all the synthetic promoters except S18 showed higher fluorescence than the dummy sequences, indicating the efficient transcription of the reporter, crRNA, mediated by each synthetic promoter. Compared with D1, the dummy sequence that showed the highest fluorescence, 95% (19 of 20) of the synthetic promoters tested showed higher transcriptional activity, validating our deep learning approach to generate synthetic promoters (Figure 5B, Supplementary Table S5).
For comparison, native Synechocystis promoter sequences were similarly assessed. More specifically, we selected the top 12 native promoter sequences from the training data set and 4 additional Synechocystis promoters, namely psbA2S, psaA, rbcL and rnpB promoters, which are widely used for metabolic engineering of cyanobacteria (45–47). We found that synthetic promoters showed a varied range of strengths and, notably, 4 new promoters exhibited higher activity than the psbA2S promoter, a strong cyanobacteria promoter (Figure 5B). This result demonstrates that our deep learning approach is useful for broadening a pool of genetic parts.
Additional analysis on synthetic promoters and in vivo demonstration
We sought to validate whether the high promoter activity of the synthetic promoters was due to the core promoter sequence rather than the abundant AT-rich regions elsewhere in the synthetic promoters (Supplementary Figure S1 and Table S1). First, the S12 promoter sequence, which showed the highest promoter strength (Figure 5B), was divided into three 50-bp fragments (Supplementary Figure S6A), and we determined whether crRNAs were produced under a partial sequence containing the AT-rich regions. Neither the first 50-bp sequence nor the middle 50-bp sequence promoted any transcriptional activity. Second, we replaced the –10 element of the top three promoters (S12, S6 and S5) with the corresponding sequence of three dummy sequences (D1, D2 and D3). This restricted mutation resulted in a complete loss of their original activity as a promoter (Supplementary Figure S6B). These results indicate that no other valid promoter sequence exists within the synthetic promoter sequence.
Next, we examined how the synthetic promoter sequences differ from the native promoter. We conducted a nucleotide BLAST search on the 10 000 synthetic promoters (data not shown). Notably, for both synthetic promoter sequences and dummy sequences, no significant similarity was found in the nucleotide BLAST search against the whole Synechocystis sp. PCC 6803 genome. Hence, our VAE model can readily generate new and valid promoters that differ from the native ones (16).
To further demonstrate their functionality as promoters, we applied the generated promoters to express YFP and monitored the expression level in vivo (24). All synthetic promoters that were tested in vivo showed higher YFP expression than the dummy sequences, demonstrating their ability to facilitate gene expression in cyanobacteria (Figure 5C). Additionally, we investigated how the predicted promoter strength correlates with that measured experimentally. We calculated the Pearson correlation coefficient (r) between the rank orders of CNN-predicted values and the in vitro experiment results (Figure 5B and Supplementary Figure S7A). The result revealed a good correlation (r = 0.62), comparable to that obtained with the refined dataset (Figure 4B, r = 0.57). We also calculated the Pearson correlation coefficient between the CNN-predicted and in vivo ranks (Figure 5C) in the same manner, which yielded a strong correlation (r = 0.76; Supplementary Figure S7B). Previous studies (29,30) suggest that a wide range of transcription factors are involved in cyanobacteria promoter activity, enabling cells to respond efficiently to various environmental conditions, such as light and nutrition. Hence, the 100-bp synthetic promoters we generated could serve as a chassis for such operator sites for transcription factors, given the good agreement between predicted and measured strengths in vitro and in vivo.
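The rank-based comparison described above can be illustrated with a short Python sketch. The numbers below are made-up placeholders rather than values from this study, and the helper names are hypothetical. When all values are distinct, the Pearson coefficient computed on rank orders coincides with Spearman's rank correlation.

```python
import math


def ranks(values):
    """Rank 1 for the largest value; ties receive arbitrary distinct ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rank_correlation(predicted, measured):
    """Pearson coefficient between the rank orders of two measurements."""
    return pearson(ranks(predicted), ranks(measured))


if __name__ == "__main__":
    # Placeholder data: predicted strengths and fluorescence readouts.
    predicted = [0.91, 0.85, 0.80, 0.62, 0.40]
    measured = [5200.0, 6100.0, 3900.0, 4100.0, 1500.0]
    print(round(rank_correlation(predicted, measured), 2))
```

Working on ranks rather than raw values makes the comparison insensitive to the different scales of the CNN output and the fluorescence signal.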
DISCUSSION
Here, we designed synthetic promoter sequences for cyanobacteria using deep learning methods. Based on the native promoter sequences from cyanobacteria, our VAE model successfully captured a valid feature of cyanobacteria promoters, resulting in the generation of numerous synthetic promoters. Using our predictive CNN model, all the synthetic promoters’ strengths were predicted, and we were able to improve the prediction accuracy by identifying critical subregions in cyanobacteria promoters. The critical subregion identification analysis showed a significant variation in promoter strength when sequence alterations occurred in a specific region of the promoters (between –15 and –6), which revealed the significance of the –10 box sequence motif. The importance of this motif was also consistently observed in the PWM and 6-mer frequency analyses. Based on these results, we refined our training dataset, which was used to enhance the prediction accuracy of the CNN model. For validation, 20 synthetic promoters were selected and tested to drive transcription in the cyanobacterial cell-free system. We employed the CRISPR/Cas12a-based assay and efficiently detected the transcriptional activities driven by synthetic promoters. Nearly all (19/20) synthetic promoters exhibited higher fluorescence than the background signal, whereas five dummy sequences, which were randomly generated without any data-training process, showed a negligible difference from the background; that is, no transcriptional activity occurred (Figure 5B). Moreover, a strong correlation between the predicted strength and experimental results was observed, demonstrating the ease and utility of this approach for creating synthetic promoters. In addition, the nucleotide BLAST search revealed no similarity between synthetic sequences and native sequences, suggesting orthogonality to the existing genetic parts. 
Taken together, these results confirmed that the deep learning-based approach could be used to design and generate new and functional synthetic promoters for cyanobacteria more efficiently.
In this study, we used the VAE as a deep generative model for the synthetic promoter design of Synechocystis. The GAN model used in the previous study (16) has the advantage of generating high-quality synthetic data when trained with large data. However, the training process can be complex owing to the accompanying competition between the generator and the discriminator (48,49). Compared to E. coli and other model bacteria, cyanobacteria have relatively less conserved motifs in their promoter sequences (29). Therefore, it might be more difficult to distinguish promoter from non-promoter sequences, and it is challenging to evaluate the success of the training process. However, the VAE model involved a likelihood-related objective function for optimizing the distribution of generated data; thus, we could immediately assess whether the training was successful (18,29,48–50). As a deep generative model, VAE allowed successful training with small and less-featured data, such as promoter sequences in cyanobacteria, and resulted in valid synthetic promoter generation. Of note, this is the first study to use the VAE model to create synthetic promoter sequences.
Deep learning methods are particularly useful for designing biological elements (51–53). Numerous biological activities, including transcription, translation, and various enzymatic reactions, rely on specific binding between a biomolecule and its recognizing biomolecular machinery (e.g. ligand–receptor, antigen–antibody, DNA/RNA–protein). Their specificities are derived from sequences, structures, or combinations of both. Meanwhile, they necessarily share machinery (e.g. DNA/RNA polymerase, ribosome, metabolic enzymes) and thus require ‘a shared feature’ to be recognized by the machinery. Recent advancements in deep learning technology, which can extract hidden features from large raw datasets, have given researchers in biology new ways to handle complex data. With the aid of deep learning technology, new biological elements (11,15,16), enzymes (54,55) and drugs (56,57) have been discovered, and synthetic metabolic pathways have been optimized more efficiently (58,59).
A wide variety of biological parts, products and systems that are newly designed using a deep learning approach can take advantage of cell-free systems as a validation tool. Cell-free systems can provide rapid high-throughput test platforms, eliminating time-consuming and resource-demanding processes, such as transformation and cell growth. Strong correlations in various biological activities between cell-free and cellular systems have confirmed the suitability of the cell-free system as a prototyping tool (60–63). In particular, for non-model organisms or organisms that are difficult to engineer, such as cyanobacteria, utilizing the cell-free system for validation of deep learning-based designs is a promising alternative. In this study, by applying a cell-free transcription assay, we rapidly validated our predictions. Considering that transformation and complete segregation in cyanobacteria might take days to weeks, this research demonstrates the synergy of combining in silico and in vitro studies.
In conclusion, based on the deep learning methods and cell-free transcription assay, we designed, assessed, and validated the synthetic promoter sequences for cyanobacteria. We believe our approach can contribute greatly to advancing cyanobacteria research and applications.
DATA AVAILABILITY
The source codes are available at https://figshare.com/articles/software/CyanoDeeplearning/22331044.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
Author contributions: J.W.L. conceived the study. E.S. designed, built, and implemented all computational models in this study. Y.N.C. and Y.R.S. performed all wet experiments, including molecular cloning and cell-free transcription assay. E.S., Y.N.C. and J.W.L. analyzed the data. E.S. and Y.N.C. wrote the manuscript. D.K. and J.W.L. edited the manuscript.
FUNDING
Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Ministry of Science & ICT (MSIT) [2021M3A9I4030408, 2022M3A9I5020804, 2021M3A9I4024840]; C1 Gas Refinery Program through the NRF, funded by MSIT [2015M3D3A1A01064926].
Conflict of interest statement. None declared.
REFERENCES
Author notes
The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.