Prediction and large-scale analysis of primary operons in plastids reveals unique genetic features in the evolution of chloroplasts

Abstract While bacterial operons have been thoroughly studied, few analyses of chloroplast operons exist, limiting the ability to study fundamental elements of these structures and utilize them for synthetic biology. Here, we describe the creation of a plastome-specific operon database (link provided below) achieved by combining experimental tools and predictive modeling. Using a Reverse-Transcription-PCR based method and published data, we determined the transcription-state of 213 gene pairs from four plastomes of evolutionary distinct organisms. By analyzing sequence-based features computed for our dataset, we were able to highlight fundamental characteristics differentiating between operon pairs and non-operon pairs. These include an interesting tendency toward maintaining similar messenger RNA-folding profiles in operon gene pairs, a feature that failed to yield any informative separation in cyanobacteria, suggesting that it catches unique traits of operon gene expression, which have evolved post-endosymbiosis. Subsequently, we used this feature set to train a random-forest classifier for operon prediction. As our results demonstrate the ability of our predictor to obtain accurate (84%) and robust predictions on unlabeled datasets, we proceeded to building operon maps for 2018 sequenced plastids. Our database may now present new opportunities for promoting metabolic engineering and synthetic biology in chloroplasts.


INTRODUCTION
Plastids are cellular organelles mainly found in a diverse group of photosynthetic organisms (1). They originate from an endosymbiotic event (∼1.5 × 10 9 years ago) in which an ancient cyanobacterium was engulfed and retained by a eukaryotic cell, giving the latter the benefit of producing its own energy from sunlight (2). Since the debut of this co-evolutionary interaction, the majority of cyanobacterial genes were either lost or horizontally transferred to the host' nuclear genome, while the plastid genome (plastome) mainly retained house-keeping and photosynthesis-related genes (3,4). As a result, the plastid has become highly dependent on imported nucleus-encoded proteins to conduct basic operations, making it a non-autonomous organelle (4). Nevertheless, the plastid conserved many of its ancestral characteristics and genomic features, such as the circular genome structure, 70S bacteria-like ribosomes, plastidencoded bacterial-like RNA polymerase (PEP) and the organization of genes in bacterial-like operon transcription units (5)(6)(7).
Unlike bacteria, plastid operons have no available databases and were only studied by few. These studies mainly focused on higher-plant model organisms; in Hordeum vulgare (barley) the entire operon map was revealed by differential RNA sequencing (29), in Nicotiana tabacum (tobacco) part of the polycistronic transcripts were revealed using northern-blot (30), in Spinacia oleracea (spinach) the rpoBC and the psbB operons were discovered using northern-blot (31,32), whereas the adenosine triphosphate (ATP) synthase operon was suggested by comparing its gene content and order to its homologous ATP synthase gene-cluster in Escherichia coli (33). In algae, part of the Chlamydomonas reinhardtii operons were studied; several operons were revealed using northern blot (34), whereas two recent reports identified 16 and 22 polycistronic units by searching for consistency in overlapping RNA-sequencing reads in the intergenic regions between adjacent genes (35,36). However, since no large-scale analysis of chloroplast operons exists, the ability to identify them, distinguish their characteristics, and use these data for synthetic biology purposes remains limited.
Plastid gene expression differs from that of model bacteria (e.g. E. coli, Bacillus subtilis etc.) in several features; chloroplast transcripts are often subject to RNA editing and splicing (37), the role of transcription termination is significantly reduced (38,39), many non-coding RNAs are frequently transcribed (29,36) and the plastome is suggested to be fully transcribed (38,39). Moreover, the expression of genes often relies on specific RNA-binding proteins (e.g. the pentatricopeptide repeat family) that bind cis elements upstream from the START codon, thus obstructing the activity of exoribonucleases and stimulating translation by suppressing stem-loops that hamper ribosome binding (40)(41)(42). Additionally, polycistrons are often regulated by multiple promoters and are massively processed (43,44), resulting in the formation of different transcript isoforms that derive from a single primary transcription unit (41,45,46). Thus, the plastid operon structure has evolved considerably compared to classical bacterial operons. These differences have most likely affected the composition and characteristics of chloroplast operons and gave rise to unique features.
Subsequently, the ability to transform synthetic genes into plastids has had a major impact on the field of plant biotechnology as it offers significant benefits compared to nuclear transformation. Among these advantages are homologous recombination based site specific integration (which is not available in many plant and algae nuclei) (47,48), the absence of gene-silencing (49), relatively high expression of heterologous genes (50)(51)(52) and prolonged transformation stability in most crops due to maternal inheritance (47). A consequent advantage that is particularly relevant for this study is the option to utilize the plastid' natural ability to express polycistrons and design vectors with multiple genes under the control of a single promoter; thus minimizing plasmid sizes and allowing the introduction of several metabolic related transgenes in a single transformation (53)(54)(55)(56).
Since both basic scientific questions and synthetic biology aspirations are hindered by the lack of large-scale information on plastid operons, in this work we describe the creation of a plastome-specific operon map database, using a combination of experimental tools and predictive modeling. The full database can be found at: https://www. energylabtau.com/cppod.

Construction of labeled dataset
Empirical operon detection by RT-PCR. To retrieve data on plastid operons for Cyanidioschyzon merolae, Phaeodactylum tricornutum and C. reinhardtii, each plastome (NC 004799, NC 008588 and NC 005353, respectively) was organized as a list of adjacent gene pairs. From each organism 20-40 gene pairs were selected for reversetranscription PCR (RT-PCR) analysis. Specific primers were designed for each chosen gene pair; the forward annealed to the 5 gene, whereas the reverse primer annealed to the 3 gene ( Figure 1B-1). DNA was extracted from cultures in standard growth conditions (see Supplementary Methods) using the chelex protocol (57). Total RNA was extracted using RNeasy plant Mini Kit (QIAGEN 74903). The purified RNA from each sample was used for complementary DNA (cDNA) synthesis using Applied Bio-System High-Capacity cDNA Reverse-Transcription Kit. Subsequently, for each of the aforementioned templates (i.e. DNA, RNA and cDNA) each gene pair was amplified by polymerase chain reaction (PCR) using its specific primers. The DNA template served as a positive control for the primers' efficiency, the RNA template served as a negative control for denying the presence of plastid DNA and the cDNA template served as an indicator whether or not the gene pair is co-transcribed ( Figure 1B-2). Only if the DNA control was positive and the RNA control was negative, the cDNA control was observed and labeled accordingly; one for the existence of a band, zero otherwise.  Table S1). All other operon data of cyanobacteria (i.e. Nostoc azollae, Acaryochloris marina, Cyanothece sp. ATCC, Trichodesmium erythraeum, Gloeobacter violaceus PCC 7421 and Synechococcus elongatus PCC 6301) were retrieved from DOOR2 (15).

RNA folding energy profiles
To create a RNA folding energy profile for each gene, we used a 40 nucleotide long sliding window and computed its free energy using the Vienna package (RNAfold) (58).

Feature selection
A wrapped backward elimination feature selection was performed on the entire dataset. In this method our model' mean accuracy over ten bootstrap samples was computed at the first part of each iteration of a random-forest classifier. In the second part, the feature/s with the lowest importance score was/were removed. Ultimately, a feature set that yielded high accuracy with a small set of features was manually selected (Supplementary Figure S5).

Robustness test
The error test was carried out by randomly introducing 19 type I (false-positive mistake) or type II errors (falsenegative mistake) into the labels and re-running the prediction pipeline with the selected features. For each error rate the mean accuracy score over ten bootstrap trained samples was calculated. For comparison, the same analysis was carried out on a permutated dataset as well ( Figure 3C and D).

Enrichment index
Where X is the number of operon genes in the gene type i, N is the total number of operon genes, K is the total number (operon or not) of genes in the gene type i and M is the total number of genes in the dataset.

Random-forest classifier
Initially, the dataset was divided into two groups: gene pairs that were comprised only of coding-sequences (CDS group) and a mixture of transfer RNA (tRNA), ribosomal RNA (rRNA) and CDS gene pairs (mixed group). Then each group was randomly split into a train set (70% of the data) and a cross-validation set (30% of the data) of gene pairs. A scikit-learn random-forest classifier (59,60) ('n estimators' value was set to 1000 trees) was fitted based upon the train set and validated by predicting the cross-validation set. The process of training and cross-validating was repeated ten times, where in each round different cross-validation and train groups were selected. Each round of training and cross-validating yielded an accuracy score, representing the percentage of correct predictions on the cross-validation group (see 'Materials and Methods' section: Accuracy). The final accuracy score was the mean accuracy over ten bootstrap rounds of training and cross-validating.

Large-scale plastid operon predictions
A total of 2018 plastomes were downloaded from the NCBI Organelle Genome Resources (see list of names and NCBI IDs in https://www.energylabtau.com/cppod) and features were calculated for each. Next, ten bootstrap classifiers (each classifier was trained and tested on different train and cross-validation groups) were used in order to predict labels for all organisms, where each gene pair was determined as '1' or '0' according to the majority voting of the classifiers. Finally, adjacent overlapping operon-pairs were concatenated to form full operon maps.

Empiric P-value
The empiric P-value was computed as follows; fifty random samples were pooled from each group and their relevant traits were calculated. In each round the data were divided into two vectors (or groups) summarizing the outputs of all calculations performed on the extracted data, according to the hypothesis examined. If the average of the first vector was larger/smaller (depends on the objective function) than the average of the second vector, the current round received a '1', otherwise '0'. Finally, the empiric P-value was calculated as the number of '1' events divided by the number of total events.

Permutation test
The P-values given in Figure 2 were computed using a permutation test, with no prior assumptions regarding the distribution of the data. In this test, all values were pooled and distributed N times into two random vectors, maintaining the original sizes of the original vectors. The margin between the means of the random vectors were compared to the original margin, where the fraction of margins more extreme than the original margin is the P-value. By default, N = 10 4 , but if this sample size yielded a fraction of 0, N grew to 10 5 . Due to computation time, if this sample size yielded a fraction of 0 as well, the P-value given was: P < 10 −5 .

Constructing an empirical dataset of primary transcription units
To create a generalist dataset of plastid operons, we began from obtaining empirical operon data from four chloroplast genomes: H. vulgare (higher plant), C. reinhardtii (green alga), C. merolae (red alga) and P. tricornutum (Heterokont) ( Figure 1A) (61). For barley (H. vulgare), we used the published operon map that was revealed by RNA sequencing (29). For the other three species, we extracted total RNA from cultures in standard growth conditions and performed RT-PCR on a variety of gene pairs (see Supplementary Methods), where the forward primer annealed to the 5 gene and the reverse primer annealed to the 3 gene (Figure 1B-1). A DNA template was used to verify that the primers function properly, and the RNA template was used to rule out DNA contamination in the cDNA samples. Under these conditions, successful amplification of the cDNA reports that the two genes analyzed are present on a single RNA strand (i.e. Operon Pair, OP), whereas no amplification implies that the two are transcribed separately (i.e. Non-Operon Pair, NOP) ( Figure 1B-2). Overall, we obtained operon data for 213 gene pairs (137 OPs and 76 NOPs) (Supplementary Table S1). To verify the reliability of the RT-PCR method, we performed the same test on the known psaA operon derived from the H. vulgare plastome (psaA-psaB-rps14-trnF-trnR, (29)) (Supplementary Figure  S1A). The results clearly demonstrate that the RT-PCR is able to determine the actual transcription-state of a primary polycistron that is comprised of different types of genes (e.g. CDSs and tRNAs). Moreover, to prove that the RT-PCR method can robustly amplify cDNA transcripts with tight secondary structures (e.g. tRNAs), we successfully amplified the full cDNA transcript of six tRNA genes derived from the C. reinhardtii plastome (Supplementary Figure  S1B).
For each gene pair we computed roughly 1100 features ( Figure 1B-3 Figure S2) based on sequence analysis alone--thus, they could be computed for any sequenced plastome without requiring additional data (e.g. RNA-Seq). These features were designed to capture essential gene characteristics (e.g. coded protein hydrophobicity, RNA structure, codon usage bias, nucleotide composition) and to quantify their level of similarity within each couple of adjacent genes (Supplementary Figure S2).

and Supplementary
Examination of these features yielded an array of indices, which created a significant separation between OPs and NOPs in our dataset (Supplementary Figure S3). As ex-pected, intergenic distance and the gene pair conservation, which have been previously found to hold valuable information for operon prediction (15)(16)(17)(18), could be used for meaningful classification in our dataset as well (Figure 2A and B). However, we were able to discover several novel informative features as well which are relevant to chloroplast operons but haven't been reported before in the context of bacterial operons. One of these features shows that the GC content is significantly lower in intergenic spacers (IGSs) separating adjacent operon CDSs (i.e. transcribed spacers), compared to their non-operon counterparts ( Figure 2C). Interestingly, we also found that neighbor operon CDSs are inclined to have similar RNA folding energy profiles in their 5 UTRs ( Figure 2D and E). To validate the significance of these findings, we computed the P-value distribution of our entire feature set while shuffling its labels. We compared this distribution to that of the original data, and observed that the significance of the aforementioned features exceeds the limits of the randomized data (Supplementary Figure S4). More- over, all significant P-values were confirmed using the Benjamini & Hochberg False Discovery Rate procedure (62).
To test whether these operon characteristics are shared with cyanobacterial operons, we computed the same features for all coding gene pairs derived from seven distinct cyanobacteria species. The transcription state of Synechocystis sp. PCC6803 was based on a comparative analysis of RNA-seq (12), whereas the operon data for the other six cyanobacteria were predicted by DOOR2 (15). Intergenic distance and gene pair conservation were found significant here as well, emphasizing the generalist nature of these operon traits (Supplementary Figure S3). Additionally, we observed that OPs tend to have low CDS or IGS GC content, both in chloroplasts and in cyanobacteria (Supplementary Figure S3). However, the messenger RNA (mRNA) folding profile similarity index failed to create any informative separation between OPs and NOPs (Supplementary Figure S3), indicating that this feature captures some of the unique traits for operon gene expression that have evolved in chloroplasts.

Operon map inference
To create a model for chloroplast operon prediction, we used our labeled features to train a supervised randomforest classifier (59,60), where the objective was to predict whether or not a pair of adjacent genes are transcribed together. To increase the prediction accuracy and reduce the number of features used in the final model, we applied an iterative feature selection algorithm in which the least in-formative feature was removed at the end of each round (Supplementary Figure S5). Since some of the features were relevant to CDSs alone (e.g. codon usage bias), we ran this feature selection pipeline separately for CDS gene pairs and for all the rest (i.e. a mixture of CDS, tRNA and rRNA gene pairs). The final feature composition of our classifier can be seen in Supplementary Table S2.
To evaluate the performance of our model, we compared it to an array of random models in which the same data were used but the labels were permutated. Our overall accuracy, simply computed as the percentage of correct predictions for the cross-validation groups, was 84%--significantly higher (P < 10 −4 ) than the random prediction (58%) (Figure 3A), with true positive and true negative rates of 87 and 79%, respectively (for the classifiers' ROC curves and full metrics, see Supplementary Figure S6 and Supplementary  Table S3). Pursuing a different approach, we discarded the red alga C. merolae from the initial dataset, re-performed the iterative feature selection, and applied this prediction to C. merolae. Our model showed roughly the same metrics in this case ( Figure 3B and Supplementary Table S3), strengthening the relevance of its predictions on an organism that had not been a part of the learning process. To evaluate the uniqueness of our model to chloroplast operons, we used it to predict the operon map of Synechocystis sp. PCC6803 and compared it to its published operon map (12). The low accuracy score obtained (59%) highlights the orientation of our model toward predicting plastomic operons ( Figure 3B).
To test the robustness of our prediction against false labels (which may occur due to rare PCR artifacts, sequencing errors etc.), we performed an error test by randomly introducing type I or type II errors into our labels and re-running the prediction pipeline. We observed that rates of up to 19 errors only reduced the accuracy score to around 80% (Figure 3C and D), thus ensuring that our model is sufficiently robust given a reasonable error rate in the original dataset. Finally, we applied our classifier to a large number of available sequenced chloroplast genomes (2018 plastomes).

Plastid operon characteristics
By analyzing this newly formed database, we observed that indeed according to our predictor most chloroplast CDSs are transcribed as polycistrons (94.5 ± 0.05%), as suggested previously (29,47,(63)(64)(65). We noticed that this ratio is roughly similar between green plastids (93.7 ± 0.06%), red plastids (98.86 ± 0.06%) and glacuphytes (Cyanophora paradoxa, 99.32%) ( Figure 4A). To examine whether specific gene classes have different tendencies to be found in operons, we computed an enrichment index based on the hypergeometric distribution (methods: enrichment index). We observed that genes related to basic cell maintenance tend to be found in operons, while photosynthesis-related genes are found as monocistrons more frequently ( Figure  4B). Interestingly, the RNA genes received the lowest enrichment scores, with tRNA having the strongest inclination toward the monocistronic form, as was hypothesized previously (5,29). To examine whether operons tend to be comprised of functionally related genes, we computed the same enrichment index on randomly-chosen samples of our Prediction accuracies on the red alga Cyanidioschyzon merolae when discarded from the initial dataset, and on the cyanobacteria Synechocystis sp. PCC6803 labels derived from (12). The bars represent coding-sequence gene pairs ('CDS group'), a mixture of tRNA, rRNA or coding sequence gene pairs ('mixed group') and the weighted mean of the two groups ('overall'). Synechocystis sp. PCC6803 encompasses only CDS gene pairs. The bars show mean ± STD. (C) Type I (false-positive mistake) error test. (D) type II (false-negative mistake) error test. An overall of 19 errors were introduced into the labels and for each error rate the prediction pipeline was repeated. The same analysis was carried out on random labels. All accuracy scores were calculated based on the average accuracies of ten bootstrap trained samples. database, and compared them to similar samples taken from a database with permutated labels. The results clearly show that primary operons in plastids tend to harbor functionally related genes ( Figure 4C).

Model construction
Our choices in model selection were driven by our overall aspiration to construct a high-quality wide database of plastid primary operons. Thus, though other vascular plants--such as N. tabacum--have published operon data, we only used the barley operon map for two reasons: (i) it is highly reliable and encompasses all genes and (ii) we wished to avoid bias toward higher-plant plastid operons in our labeled data, as these are known to be highly conserved and redundant. In addition, we used the red alga C. merolae as an independent group to validate our model performances on a test set that had not been part of the initial dataset ( Figure 3B and Supplementary Table S3). Furthermore, we chose to merge our empirical datasets and create a generalist plastid classifier (as opposed to using each representa-tive for creating a group-specific classifier); using this wide perspective, our algorithm is able to grasp the main fundamental traits unique to operon genes, and avoid potential species-specific biases.
It is important to note that our empirical data were collected from cultures in normal growing conditions (see Supplementary Methods). Since it is known that operon maps might vary to some extent between different conditions (12,13), our database is likely to have the best predictive power for organisms in standard growing conditions. Subsequently, due to the nature of primary polycistronic processing in chloroplasts and the presence of multiple promoters, our database reflects the structure of primary operons (whether or not they undergo subsequent processing).
As described above, we decided to separate our predictor into a group where both adjacent genes are CDSs and another group that contains all the other combinations (i.e. CDS, tRNA and rRNA). We found the transcription state (NOPs or OPs) in this mixed group to be less predictable than that of CDSs ( Figure 3A and B; Supplementary Table S3); one explanation for this is simply the loss of information induced by comparing two different types of genes (e.g. tRNA-CDS). Another phenomenon playing a role in explaining this observation is most likely the fact that tRNA and rRNA genes have less computable features (at least based on the approach employed here), compared to CDSs; while mRNAs tend to have structured and distinguishable sequence information dictated by the START to STOP flow, the codon triplet organization, and the constraints induced by the final protein sequence, extracting beneficial information from tRNA and rRNA genes is less trivial. For these reasons, existing works and tools neglect the tRNA and rRNA genes and target the CDSs alone when studying operons (12,13,15,25,27). In this work we included all genes, and although the CDS group received higher accuracy scores, the prediction accuracy was high in the mixed group as well--significantly outcompeting the random models ( Figure 3A).
Since tRNAs are highly-structured and undergo 5 cleavage for their proper maturation (66), we tested the credibility of our RT-PCR method in these cases. We were able to show that it properly detects the transcription state of mixed gene pairs (e.g. tRNA-CDS, Supplementary Figure S1A) and is able to amplify complete tRNA transcripts (Supplementary Figure S1B).

Operon characteristics
Analysis of the features in our dataset highlighted interesting differences between chloroplast OPs and NOPs. Shorter intergenic spacer and lower transcript GC content in operon gene pairs (whether originating from the IGSs or from the CDSs themselves, Supplementary Figure S3) share the same logic; these are most likely means of sparing unnecessary resources. Since the IGS is fully transcribed, having a constraint on its length seems sensible in order to spare nucleotides, energy and transcription time. It also seems reasonable to have a constraint on the overall operon GC content in order to speed up transcription (67), as operon transcripts naturally tend to be long. Subsequently, minimizing the time-lag between the expression of adjacent operon Additionally, a hypergeometric enrichment/depletion P-value was computed for each group; P < 10 −300 for all groups. (C) The distribution of functional enrichment in operons for 10 000 random samples shown in contrast to the same analysis carried on randomized data (see legend). Z-score and P-value (a two-tailed Wilcoxon test) between predicted labels and shuffled labels are 2.2 and < 10 −300 , respectively. genes is also beneficial since such pairs tend to serve similar functions ( Figure 4C) and thus may often be required at roughly the same time. The presence of these features in both plastids and cyanobacteria (Supplementary Figure S3) highlights the generality of these concepts.
We were also able to identify a couple of operon features in plastids, which are not shared with cyanobacterial operons (Supplementary Figure S3). Interestingly, we found a high overall similarity in the mRNA folding profile at the 5 UTR of two adjacent operon genes ( Figure 2D and E). The complete absence of this signal in cyanobacteria (Supplementary Figure S3) suggests that it captures unique traits that have evolved post-endosymbiosis. While mRNA folding is known to regulate gene expression on several levels (6,(68)(69)(70)(71)(72)(73), it is most likely that this characteristic is mainly related to translation, which unlike in bacteria--has become a major rate-limiting factor in chloroplast gene expression (42). Secondary mRNA structures affect both translation initiation, translation elongation and mRNA stability; thus, the similarity in folding energy profiles could be a powerful means for post-transcriptionally regulating operon genes toward simultaneous expression.

Operons as a form of chloroplast translational regulation
After applying our classifier to a large number of sequenced plastomes, we organized the plastid genes into functional groups and examined the transcription state of the different groups. Our results show that the operon formation is widely preferred across all protein-coding groups, while RNA genes tend to be found in the monocistronic form ( Figure 4B). This negligible role that operons play in regulation of plastid RNA gene expression could be explained by the overall high levels of transcription in chloroplasts (35,38,52) and the rate limiting nature of chloroplast translation. According to this hypothesis, co-transcription of functionally related CDSs is expected to be more beneficial as they are subject to similar translation regulation ( Figure  2D and E). On the other hand, the expression of functionally related RNA genes as operons would not affect their expression synchronization, as they are not translated and are already highly expressed in the first place. Thus, assuming that operons were retained in plastids in order to coregulate functionally related genes ( Figure 4C), RNA genes (for which translation regulation is not relevant) could be expected to adopt the monocistronic form.

CpPOD--an online database for predicted chloroplast operons
In this work, we have created a model that enables the prediction of primary plastid operons based on an annotated plastome alone. To make this ability useful, we ran our predictor on 2018 sequenced plastomes and uploaded the results to an open access site (see 'Data Availability' section). Besides binary final decision values (0 or 1), one can also find the predicted operon score, which is a continuous value between 0 and 1, reflecting the likelihood that the pair is cotranscribed. This allows the user to select stricter or more permissive thresholds than the default 0.5 threshold used in this work, if needed.

CONCLUSION
In this work, we collected empirical operon data from plastomes of four model organisms, each representing a major plastid-containing group. Subsequently, we applied a supervised machine learning classification algorithm to build Nucleic Acids Research, 2019, Vol. 47, No. 7 3351 a generalist model for primary operon prediction that requires a sequenced plastome alone as input. By analyzing the features in our dataset, we were able to discover principle characteristics of operons, including their low overall GC content and similar mRNA folding patterns. Using a set of selected features, we were able to create a full database for predicted chloroplast operons that encompasses 2018 operon maps.