DNA free energy based promoter prediction and comparative analysis of Arabidopsis and rice g enomes

Cis -regulatory regions on DNA serve as binding sites for proteins such as transcription factors and RNA polymerase. The combinatorial interaction of these proteins plays a crucial role in transcription initiation, which is an important point of control in the regulation of gene expression. We present here an analysis of the performance of an in silico method for predicting cis -regulatory regions in the plant genomes Arabidopsis thaliana and Oryza sativa on the basis of free energy of melting of DNA. For protein coding genes, we achieve recall and precision of 96% and 42% for Arabidopsis and 97% and 31% for rice respectively. For ncRNA genes, the program gives recall and precision of 94% and 75% for Arabidopis and 95% and 90% for rice respectively. Moreover, 96% of the false positive predictions were located in non-coding regions of primary transcripts out of which 20% were found in the first intron alone, indicating possible regulatory roles. The predictions for orthologous genes from the two genomes showed a good correlation with respect to prediction scores and promoter organization. Comparison of our results with an existing program for promoter prediction in plant genomes indicates that our method shows improved prediction capability. the midpoint of a 100nt window is considered as a prediction if it satisfies the cutoffs. In EP3, the entire 400nt window is considered as a prediction if it satisfies the cutoffs.


Sequencing and annotation of a large number of eukaryotic genomes has made
available an enormous amount of information regarding genetic coding sequences. This data can be effectively utilized for studying and modifying expression of genes if the location and contribution of cis-regulatory regions, which control spatial and temporal regulation of gene expression, is available. However, the precise annotation of regulatory regions is difficult as compared to the identification of genes primarily because regulatory regions do not code for an identifiable product. In fact, regulatory regions are bound by proteins such as transcription factors which bring about transcription and its regulation.
Determining transcription factor binding sites (TFBSs) from chromatin immunoprecipitation methods has limitations and requires a lot of downstream data processing (Farnham, 2009). Moreover, the mere binding of a transcription factor at a particular site doesn't warrant its involvement in regulation of a gene. Development of computational approaches that enable accurate prediction of cis-regulatory sites could thus greatly aid in deciphering the regulatory mechanisms at the genome level.
The preponderance of non-coding DNA in the eukaryotic genome makes it difficult to identify promoter regions. Most efforts towards prediction of regulatory regions have traditionally focused on detection of consensus sequences for TATA box, INR elements, TFBSs, etc. Such sequence based prediction of short motifs might be inadequate because a large number of false hits are possible by chance. Moreover, there is increasing evidence to suggest that consensus sequences vary greatly and are even absent in many cases.
TATA box, which is considered as the signature sequence for promoters, is not found in a majority of core promoters in eukaryotes (Cooper et al., 2006;The ENCODE Project Consortium, 2007) and TATA binding protein can recognize the core promoter irrespective of the underlying sequence, with the help of additional factors (Pugh, 2000). A recent study on nucleosomal positioning in Schizosaccharomyces pombe shows that nucleosome depleted regions at promoters do not show the sequence characteristics (poly[A+T] (Florquin et al., 2005;Fujimori et al., 2005;Kanhere and Bansal, 2005a,b;Alexandrov et al., 2006;Lee et al.,2007;Abeel et al., 2008a;Cao et al., 2009;Parker et al., 2009;Rangannan and Bansal,2009;Tanaka et al., 2009). Although these properties are inherently sequence dependent, they give additional insight into long range interactions which might not be evident from sequence alone. Moreover, the structural features found at promoter regions are sometimes conserved across species (Fujimori et al., 2005;Kanhere and Bansal, 2005a;Abeel et al., 2008a). Thus, a prediction program that effectively captures the structural patterns at promoters could help in predicting regulatory regions across genomes irrespective of the availability of training data.
Majority of the currently available promoter prediction programs (PPPs) such as ARTS, Eponine and ProSOM focus on promoter prediction in the human genome, or related genomes (Down and Hubbard, 2002;Sonnenburg et al., 2006;Abeel et al., 2008bAbeel et al., , 2009 for which processed experimental data and detailed annotation such as from deepCAGE sequencing is already available. Since these programs require pre-training on the genome, they cannot be readily applied to other genomes, such as plants. CpG island predictors cannot be used for plants since a suitable prediction criteria is unavailable (Rombauts et al., 2003) and they are purported to be absent in plant genomes (Yamamoto et al., 2007b).
Sequence based PPPs for plants are either repositories of TFBSs and cis-regulatory elements reported in individual studies like PLACE (Higo et al., 1999), Osiris (Morris et al., 2008) and AGRIS (Davuluri et al., 2003) or in silico analysis of over-represented kmers at promoters (Molina and Grotewold, 2005;Yamamoto et al., 2007a;Lichtenberg et al., 2009). EP3 (Abeel et al., 2008a) is the only PPP available currently which predicts extended promoter regions in plant genomes. However, the promoter prediction property (base stacking) used in EP3 is selected based on analysis in the human genome only.
Also, some minimal training is apparently involved in the EP3 program as well since different thresholds are used for different organisms (Arabidopsis: 0.0583, rice: 0.1394). Many plant genomes have been recently sequenced (Ming et al., 2008;Rensing et al., 2008;Schmutz et al.,2010;The International Brachypodium Initiative, 2010) and a large number of genomes are in the sequencing pipeline such as The Multinational Brassica rapa sequencing project (For a full list, refer NCBI Genome Projects). A promoter prediction tool suited for plant genomes could help in annotation of putative cis-regulatory regions as well as in finding new genes for these newly sequenced genomes.
We present here a detailed analysis of the performance of the program PromPredict -a simple program that captures the free energy pattern at promoter regions from DNA sequence information without requiring any pre-training-for the model monocot and eudicot plant genomes Oryza sativa L. ssp. japonica cultivar Nipponbare (rice) (The 7 Rice Annotation Project, 2008) and Arabidopsis thaliana (Arabidopsis) (The Arabidopsis Genome Initiative, 2000). PromPredict was originally developed to predict putative promoters using the whole genome %GC of select bacterial genomes to define the baseline cutoffs for relative free energy of promoter regions (Rangannan and Bansal, 2009). It has now been generalized for genome prediction using 1000nt fragments with 20-80% GC (Rangannan and Bansal, 2010). It should be noted that the promoters are not predicted on the basis of motif composition or organization of cis-regulatory modules, but solely on the basis of relative free energy of adjoining sequences. We compare and contrast the genomic features and prediction characteristics in the two plant genomes to highlight the similarities and differences in their genome architecture. Such a comparison can shed light on the evolution of monocot and dicot lineages of flowering plants. The predictions are assigned to five different score classes to indicate their relative strength (as discussed under Prediction score). We also compare the performance of PromPredict with the EP3 program.

Results
A comparison of the annotated genomes (excluding mitochondrial and chloroplast chromosomes) of rice and Arabidopsis gave some interesting insights into the genome composition of the two plants. Arabidopsis has a small and compact genome with gene density an order of magnitude higher than that for the rice genome (Table IA). The length of the Arabidopsis genome is less than half the length of the rice genome but it has 40% of its genome being transcribed as compared to ≈22% in rice at the current state of annotation. However, the rice genome has longer primary transcripts and introns contribute to a majority of the primary transcript length (Table IB). Moreover, the rice genome has a higher average GC content and a greater GC variation, which is also reflected in the various regions of the gene. (Table IC, Fig. S1).

Average Free Energy (AFE) profile
Distortion of the DNA double helix such as separation of strands and bending of DNA is necessary for binding of RNA polymerase and transcription factors at the promoter site. The free energy of DNA melting is a dinucleotide sequence dependent secondary structure property that comprises of not only hydrogen bonding energy but also base stacking energy and hence is slightly different from mere AT or GC contents. Fig. 1A and B show the AFE profiles for Arabidopsis and rice genomes in the vicinity of the transcription start site (TSS). The details of calculating the AFE profile are mentioned in Methods. Both plants show similar free energy profiles with a significant difference 8 between upstream and downstream regions and a peak just upstream of TSS. Overall, the profiles show a less stable upstream region followed by a relatively stable downstream region. However, the difference in stability is much greater for rice (≈3.5 kcal/mol) as compared to Arabidopsis (≈1.5 kcal/mol) as seen in Fig. 1C. It should be noted here that AT-rich sequences tend to be less stable, even though the correlation is not exact and depends on their dinucleotide frequencies as seen in the vicinity of the TSS.
The Arabidopsis profile has a higher free energy (less stability) than rice owing to its AT rich genome. Similarly, the AFE profiles shift with variation in GC content of the sequences (Fig. S2). In conclusion, the promoter region is characterized by relative instability when compared to the downstream stable region for both the plant genomes.
This characteristic can be used to identify promoter regions as shown by Bansal (2007, 2009) for prokaryotes and by Abeel et al. (2008a) for eukaryotes.
The high stability trough in the profiles is found around 100 to 200nt downstream of TSS. This region is beyond the 5'UTR region of most genes in both Arabidopsis and rice (Table IB) and hence overlaps with the first coding sequence (CDS). Moreover, the 5'UTR and CDS regions in rice also have a higher average GC content than that of the primary transcript (Table IC). The GC richness of the region immediately downstream of the TSS is more pronounced in rice than in Arabidopsis (Fig. S1). These observations could account for the presence of GC-content gradients previously reported in monocots (Wong et al.,2002).

Performance of PromPredict on plant genomes
We tested the latest version of the program PromPredict (Rangannan and Bansal, 2010) on rice and Arabidopsis genomes in order to find cis-regulatory sites (See Web Resources). The program detects relative differences in free energy and applies cutoffs based on GC content of sequence. It compares the free energy of two adjacent sequences and predicts a cis-regulatory region at the upstream sequence if the two criteria -(a) free energy of the upstream sequence (E1 value); and (b) the difference in free energy between the two sequences (D value) -are greater than pre-determined cutoff values (See Methods, Supplemental Protocol S1, Fig. S3 and Table S1). The predictions are directional, depending on the orientation of the input sequence -forward or reverse strand.
The whole genome prediction performance of the program is presented in Table II, in terms of the recall and precision values for the various gene datasets (See Methods for precision and recall calculations). The region −500 to +100 bp with respect to the TSS (TP region) was considered for determining true positives in protein-coding genes as this covers the upstream region as well as most 5'UTR regions. The region −1000 to 0 bp was considered for non-coding RNA (ncRNA) genes and for protein-coding genes with only translation start site (TLS) information. The predictions obtained within the gene beyond the true positive region are considered as false positives (FPpred.). Overall, more than 92% genes (henceforth referred to as TPgenes) have a true positive prediction (TPpred.) within −500 to +100bp. Both the genomes have ≈1.5 TPpred. per TPgene on an average. As expected, the longer protein coding genes have more FPpred., thus leading to lower values of precision, as compared to that for ncRNA genes.
Although a gene might have more than one prediction in the TP region, the prediction nearest to the TSS will correspond to the core promoter. A frequency plot for distance of the nearest prediction from the TSS ( Fig. 2A) shows that majority of the TPpred. obtained are proximal to the TSS -70% of predictions for Arabidopsis and 63% predictions for rice are within −200 to +100 bp of the TSS. For ncRNA genes, the region −500 to 0 bp of the TSS contains 92% to 93% of the predictions for both Arabidopsis and rice genomes If the closest prediction for a gene is found near the TSS (−100 to +50 bp), it might correspond to the core promoter and hence have a stronger signal. The free energy difference with respect to the downstream sequence would be greater for these predictions as compared to those present distally. However, we found that this is not true. We calculated the AFE profiles for 1001nt sequences clustered according to the proximity of the closest prediction to the TSS (50nt bins from Fig. 2). The AFE plots in Fig. 3 have a broad low stability region corresponding to the 50nt bin where the closest prediction lies and another peak at −35 region. The first peak is expected because the algorithm recognizes this feature for prediction. The difference in AFE for upstream and downstream regions is almost constant for all plots irrespective of the distance of the instability peak from the TSS. The AFE for the peaks is less (by 1.5 to 2 kcal/mol) than the AFE observed at the same position in Fig. 1, but the peaks follow the general trend of the overall profile. We propose that such a profile might be a characteristic of cis-regulatory sites spread over a longer region upstream of TSS.
The second sharper peak found ubiquitously at −35 position might indicate the presence of a TATA box at this region. However, web-logos for this region did not show any strong consensus TATA sequence (data not shown). A comparison of tetramer frequencies in the −50 to −20 bp region and the −500 to +500bp region shows a relatively high occurrence of 'TATA' and 'AAAA' tetramers in the core promoter region for both www.plantphysiol.org on August 28, 2017 -Published by Downloaded from Copyright © 2011 American Society of Plant Biologists. All rights reserved.
Arabidopsis and rice (Fig. S5). Interestingly, while several AT-containing tetramers are preferentially located at upstream −35 region in Arabidopsis, some C-rich sequences also show over-representation in this region for rice promoters.
An analysis for the overlap of predictions with 92 TFBSs in rice as obtained from Osiris (Morris et al., 2008) was carried out. 56% TPpred. contained within them entire TFBS motifs while 98% TPpred. overlapped with atleast half of the TFBS sequence. 91% of the reported TFBSs overlapped atleast partially (half or more) with TPpred., out of which 58% were found to overlap completely. Although substantial number of TPpred. were found to contain AT rich TFBSs, a significant number of GC rich TFBSs were also found to occur within the predictions.

Prediction score
We categorized the predictions on the basis of the difference in free energy between a prediction and its downstream region, denoted as the D-value. The score classes are formed on the basis of the maximum D value (Dmax) of predictions and the GC content range of the surrounding 1001nt sequence (See Methods and Fig. S4 for details of categorization). Table III shows that most of the predictions in the higher score categories are TPpred., whereas the FPpred. show a preponderance towards the lower score categories. The D value cutoffs used for relative score categorization of predictions are similar to the cutoffs applied for promoter prediction. Both are dependent on GC content of the flanking 1001nt sequence and the frequency of obtaining a prediction in a particular GC range. If we raise the cutoff for promoter prediction to the category classification cutoffs, the precision can be improved. However, the number of TPgenes reduce as a consequence and recall decreases. A segregation of the predicted signals according to their score, thus allows user-defined stringency settings. We suggest that the three highest classes should be considered where multiple predictions are obtained.

Distribution of false positive predictions in primary transcript
The PromPredict program is able to predict cis-regulatory elements for more than 90% of the annotated genes (considering all score classes). However, on applying the lowest cutoffs, the precision of prediction is lower i.e. a substantial number of predictions are found in the primary transcript region (FP region). An analysis of the locations and relative scores of FPpred. (Table IV) showed that majority of FPpred. were obtained in the noncoding regions of the primary transcript i.e. introns and UTRs. We found that ≈20% of FPpred. were found in the first intron alone, which could be putative cis-regulatory regions. If we consider the predictions in first intron as TPpred., the precision increases to 0.47 for Arabidopsis and 0.40 for rice. Interestingly, the median length for the first intron (177nt for Arabidopsis and 1176nt for rice) is greater than that for all introns (100nt for Arabidopsis and 96nt for rice). Only few FPpred. were found in the CDS region (≈6%) although it constitutes a substantial length of the primary transcript (50.8% in Arabidopsis and 29.7% in rice). If only these predictions are considered as FPpred., the precision increases to 0.96 for both Arabidopsis and rice.
We also categorized FPpred. in score classes and according to their location in the primary transcript. The distribution is presented in Fig. 4 as a percentage of FPpred. for each score class. Although introns dominate in all score classes, there is an increasing trend of predictions towards higher score categories (68% level of significance for highest frequency). The same trend is observed in 5'UTRs although the number of FPpred. is low. On the other hand, in exons and 3'UTRs there is an increasing trend towards lower score categories (68% level of significance for highest frequency).

Analysis of false negatives
There are very few genes (8-9%) which do not have any predictions between −500 to +100 bp relative to the TSS, termed as false negative genes (FNgenes). A comparison of the AFE profiles for these genes with the profiles for all TSSs from a representative chromosome (Fig. 5) showed that the distinct difference between upstream and downstream regions is absent in FNgenes. The immediate upstream region has higher stability while the immediate downstream region has lesser stability in the FNgenes profile than that observed for the corresponding regions in the profile of all TSSs. The gene ontology (GO) categorization of Arabidopsis and rice genes (Protocol S2, Table S3 and Fig. S6) does not show preponderance of FNgenes in any particular GO category.
However slight differences are seen in the various categories, especially the presence of greater percentage of FNgenes with unknown functions, processes and cellular locations as well as the absence of FNgenes corresponding to vital processes such as DNA integration and chromatin assembly.

Correlation of prediction score with gene expression
Inter-species homology is routinely used for characterization of gene functions. Thus, it would be interesting to see if orthologous genes from rice and Arabidopsis show similarity in their promoter organization as well. An analysis of the prediction score correlation in all orthologous gene pairs from rice and Arabidopsis was carried out. 12780 Arabidopsis orthologs and 12615 rice orthologs were obtained (only protein coding genes with TSS information considered) using the g:Orth program of the g:profiler software (Reimand et al.,2007) that uses the Plant Ensembl database (Kersey et al., 2010). Of these, 11941(93.4%) and 11554(91.6%) genes were TPgenes in Arabidopsis and rice respectively, and the remaining were FNgenes. Since there were multiple Arabidopsis orthologs for certain rice genes and vice versa, 12359 pairs of orthologous genes were formed. The Dmax prediction scores for the orthologous gene pairs (See 'Prediction score categorization' in Methods) have been plotted in Fig. 7 and give a correlation coefficient of 0.23. However, if only the ortholog pairs for which the prediction scores from the two genomes are from the same score class or differ by one level are considered (81% of total pairs -shown as blue '+'), the correlation coefficient is 0.51.
A comparison of the relative positions and scores of predictions in the promoter regions of certain orthologous gene pairs (geneids are given in Table S4) showed that predictions of comparable strength and relative position are found in most of the cases (Fig. 8).
Arabidopsis genes, FAD2 (Kim et al., 2006) and PRF1 (Jeong et al., 2006), which have regulatory first introns were also studied along with their rice orthologs. For FAD2 (Fig.   8E), intronic predictions were observed in the first intron of both Arabidopsis and rice.
In addition, an ncRNA gene overlapping the first intron was found in the rice genome.
The first intron of PRF1 (Fig. 8F) in Arabidopsis is long and covers the same length as two short introns in the rice homolog. The prediction for rice was found in the second short intron, but at the same position from the TSS as the intronic prediction in Arabidopsis.
As mentioned earlier, the Dmax score of a prediction gives the relative difference in free energy between adjacent regions. The question then arises, can the score give an idea of the 'strength' of the predicted promoter? For example, it has been shown that CpG islands are generally found upstream of housekeeping genes whereas tissue specific genes have strong promoters usually containing a TATA box. It would be interesting to see if the relative differences in DNA free energy can capture the promoter strength. We categorized the TPgenes from certain families, metabolic pathways (Mueller et al., 2003;Jaiswal et al., 2006) and GO terms (The Gene Ontology Consortium, 2000) according to their score categories (Fig. 6). Most of the gene sets studied showed similar score distributions in the two genomes. For example, about 50% of genes involved in inflorescence had 'very high' and 'highest' scores in both genomes that might indicate their tissue-specific roles. Also, 60 to 80% of predictions for heat shock proteins fall in the top three score categories. However, constitutively expressed genes such as ubiquitin and tubulin did not show such similarities possibly owing to different expression rates for the protein isoforms.

Comparison of PromPredict performance with EP3
www.plantphysiol.org on August 28, 2017 -Published by Downloaded from Copyright © 2011 American Society of Plant Biologists. All rights reserved.
We compared the programs PromPredict and EP3 for whole genome prediction in Arabidopsis and rice (Table V). For both programs we considered the predictions within −500 to +500bp for determining true positives since that is the criteria used in EP3. The F-value (harmonic mean of the recall and precision) calculated shows that PromPredict gives better prediction performance than EP3. However, the EP3 program gives slightly higher F-value in the rice genome for protein-coding genes. Interestingly, PromPredict is able to predict cis-regulatory sites for ncRNA genes with much higher sensitivity and precision than EP3. It should be noted that all PromPredict predictions within −500 to +500 bp are considered irrespective of their score. If only the three higher score classes are considered, the F-value would improve as seen in Table III.
We believe that the recall and precision values do not give a complete picture of the prediction quality. In order to be useful in guiding experimental analysis and annotation, it is important to have predictions of appropriate length. One of the major drawbacks of EP3 is that it uses non-overlapping windows of fixed 400nt length for prediction. As a result, large chunks of the genome are predicted which might not be amenable to experimental validation. PromPredict, on the other hand, calculates free energy over overlapping windows of 100nt length and assigns it to the midpoint of the window.
Thus the prediction length varies (maximum 300 nt) depending on the local free energy in adjacent windows. The prediction coverage of the TP and FP regions indicates the percentage length of the region that is predicted to be "true". The overall percentage of genome length covered by TP and FP predictions is significantly lower for PromPredict as compared to EP3. Overall, PromPredict gives a better performance for promoter prediction in plants than EP3.

Discussion
The program PromPredict gives relatively good performance for the plant genomes Arabidopsis and rice, although its cutoffs have been derived from prokaryotic analysis.
The AFE profiles for plant genomes (Fig. 1) and prokaryotes (Rangannan and Bansal, 2009) are not identical but the difference in free energy between upstream and downstream sequences is seen in both profiles, which is the prediction criteria used by PromPredict.
Interestingly, the cutoffs derived by training on prokaryotes show good performance for eukaryotes (as shown here for plant genomes). However, slightly tweaking the cutoffs might give better predictions for each genome and the prediction score classes outlined in Table III can serve as alternative cutoffs for achieving the required performance.
We compared the performance parameters of PromPredict with EP3 which is the only other program which predicts extended promoters in plant genomes (Table V).
PromPredict gave better F-values than EP3 except for protein-coding genes in rice. However, precision and recall parameters take only the number of predictions into consideration and ignore their length. Longer predictions could (wrongly) give better values for these parameters, but would, in turn, increase the amount of experimental testing required. The EP3 algorithm gives longer and fixed length predictions that contribute significantly to the TP and FP regions as compared to PromPredict.
PromPredict gives a much better performance for ncRNA genes than EP3 for both plant genomes, even though EP3 is based on similar parameters. Most motif searching and trained algorithms such as ARTS, Eponine and ProSOM that look for consensus sequences or patterns are also expected to give a poor performance for ncRNA genes because the organization of PolII promoters differs from PolI (Russell and Zomerdijk, 2005) and PolIII (Geiduschek and Kassavetis, 2001) promoters.
The two plant genomes were compared with respect to genome characteristics as well as prediction characteristics. The precision value obtained for rice was lower than for Arabidopsis due to presence of higher FPpred. However, this might be a result of longer primary transcripts in the rice genome, which have a preponderance of intronic regions. Predictions obtained in the primary transcript especially in non-coding regions cannot be ignored as these might be alternative promoters or promoters for downstream genes. Yang (2009) has shown that broadly expressed genes in Arabidopsis and rice have longer noncoding regions, which might play a regulatory role. Carninci et al. (2006) have shown that alternative promoters present within primary transcripts are responsible for tissue-specific expression in humans. 48% of oligo-dT-primed CAGE libraries and 34% of random-primed CAGE libraries have at least one alternative promoter overlapping the sequence of known or predicted transcripts. Also, transcription start sites (TSSs) have been found in the 3'UTR of certain protein-coding genes which may code for transcripts that regulate downstream genes on the same or opposite strands. Moreover, regions located downstream of the TSS such as introns (Rose, 2008;Rose et al., 2008) and 5'UTRs (Lu et al., 2008) have been shown to be involved in regulation of transcription by acting as enhancers or through mechanisms such as Intron-mediated enhancement (IME). Noncoding regions might also be involved in replication, transcription of regulatory ncRNAs and transposition. Zhu et al. (2010) have shown that short conserved introns (50 to 150nt) in humans and mouse show preferential location (3'UTRs of universally expressed housekeeping genes) and non-random chromosomal distribution. They speculate that these introns might play regulatory roles in gene expression and nucleo-cytoplasmic transcript export.
Our analysis showed that about 95 to 96% predictions were found in the non-coding region of the primary transcript, a majority of which were found in the introns and may be valid TSSs or cis-regulatory sites (Table IV). Also, we analyzed 24 introns that have been experimentally shown to regulate expression in Arabidopsis and rice (Table VI), out of which 21 were detected by PromPredict to contain a promoter signal. The introns closest to the TSS are suggested to be most important for regulation of transcription and interestingly 20% of our FPpred. are found in the first intron alone. The remaining intronic predictions (50% of total FPpred) might be signals for other processes like splicing in RNAs.
In order to determine the core promoters, the predictions closest to the TSS were considered for both protein-coding and ncRNA genes ( Fig. 2A and B). The predictions in Arabidopsis are concentrated closer to the TSS than in rice. However, the free energy of distal predictions is comparable to that of proximal predictions (Fig. 3) indicating that these might be putative core promoters and not prediction artifacts. Thus it seems that free energy peaks might be present at different positions relative to the TSS for eukaryotic genes, in contrast to prokaryotic genes where the peak is only localized close to the

TSS (Rangannan and Bansal, 2009).
A genome-wide analysis of prediction scores in orthologous gene pairs gave a good correlation between the highest score TPpred. for each ortholog member (Fig. 7). A finer analysis of promoter organization in certain ortholog pairs further showed that the predictions in the vicinity of TSS not only have comparable scores but also similar locations with respect to TSS (Fig. 8). Hence, there seems to be a good relationship between the promoter predictions in orthologous genes and we propose that our predictions can also be used for studying promoter regions in these genes.
The AFE for FNgenes showed a different profile than that observed for all genes especially in rice (Fig. 5). The presence of alternative structural profiles for bendability in human promoters has been reported (Florquin et al., 2005;Zeng et al., 2009).
Presence of such alternative structural profiles might point towards different regulatory architectures that enable spatio-temporal expression specificity in eukaryotes. Hence, alternative free energy profiles (and other structural profiles) could be explored in plants to gain a better understanding of the promoter region and regulation.

Conclusion
We show here that the program PromPredict performs quite well in predicting cis-regulatory regions in plant genomes. This is indeed surprising since the program has been trained on prokaryotes. It seems that the relative free energy difference criterion used in this program seems to be a general property found in the vicinity of transcription start sites as shown in the human genome by Abeel et al. (2008a). Hence, PromPredict might also be expected to perform well for other plants and eukaryotes.
The program is based on simple prediction criteria that are easy to program and further www.plantphysiol.org on August 28, 2017 -Published by Downloaded from Copyright © 2011 American Society of Plant Biologists. All rights reserved. enhancement of the program with other features might give better results. Since, PromPredict predictions are biased towards unstable and hence AT-rich regions, complementing the program with other motifs like Y-patch or GA motif (Yamamoto et al., 2009) could be beneficial.
As our understanding of transcription develops, the actual complexity of the processes involved in gene regulation is revealed. Determination of putative regulatory sites where transcription factors could bind is but a small step in trying to understand the huge orchestra of regulatory mechanisms involved. It is difficult to make a one to one correlation between the cis-regulatory region and the corresponding regulated gene. In eukaryotic genomes, the sites involved in regulation of a gene may vary in different tissues adding to the complexity of the problem. This is further complicated by factors such as combinatorial regulation, nucleosome binding and epigenetic modifications. Yet, common themes and patterns of regulation can be observed as seen in this study. A combinatorial approach involving sequence and structural studies, both theoretical and experimental, would be most useful to further explore the mechanisms of transcription regulation.

Datasets
The Arabidopsis thaliana (Arabidopsis) genome and annotation data was extracted from

Average Free Energy (AFE) profile
For calculating the average free energy, dinucleotide parameters based on the model proposed by Allawi and Santalucia (1997) and Santalucia (1998) were used. The sequences of same length were aligned with the TSS at 0th position. An average profile is obtained by calculating the mean value of free energy at each position over all the sequences. The dinucleotide parameters averaged over a moving window of 15nt (frameshift of 1nt) were assigned to the midpoint of each window in order to reduce noise.

PromPredict program
The PromPredict program was first written to predict promoter regions in prokaryotes (Kanhere and Bansal, 2005b;Bansal, 2007, 2009). The program is built to predict cis-regulatory regions in a given input sequence on the basis of relative free energy of neighbouring regions in a 1001 nt long fragment and doesn't require any genome specific training. Hence, the program can be readily used for newly sequenced genomes for which gene and promoter information is scarce. Fig. S3A shows the AFE values in the -2000 to +2000 regions relative to TSS in Arabidopsis and rice. Fig. S3B shows the AFE values used as cutoffs to define the promoter regions in Prompredict and the AFE values in upstream (−500 to 0) and downstream (0 to +500) regions of Arabidopsis and rice. The values for Arabidopsis match well with the Prompredict cutoffs. The proximal downstream regions in rice are unusually GC rich, leading to lower AFE values, but in general the AFE values for non-promoter regions (+500 to +1000) farther away from TSS are similar to the cut-offs used in Prompredict.
PromPredict considers the (a) absolute free energy (E1) averaged over each overlapping window of 100nt sequence (frameshift of 1nt) and (b) the relative free energy difference (D) of E1 with the free energy (E2) averaged over a downstream 100nt sequence separated by 50nt in the 5'→3' direction. The free energy is calculated using the dinucleotide parameters based on the model proposed by Allawi and Santalucia (1997) and Santalucia (1998) as a sum over 15 nt. The parameters E1 and D are then compared with pre-defined cut-off values (Table S1). A sliding super-window of 1001nt (with a frameshift of 750nt) is used to determine the average GC content range used for cutoff values for each 100nt window within this super-window. Further details of the algorithm are given in Supplemental Protocol S1. We have used the latest version of the program described in Rangannan and Bansal (2010).

Performance evaluation and comparison
The performance of PromPredict on rice and Arabidopsis genomes was evaluated using the distance based cutoff as described by Bajic et al. (2004) and Abeel et al. (2009). It is obvious that the performance of a program would improve if the region considered for determining true positives is increased. However, the cis-regulatory regions are generally found within a particular distance upstream of the TSS/ TLS and in some regions within the primary transcript, such as the 5' UTR. The optimal length of this region largely depends on the organism under study. Previous analysis have considered a "true positive region" (TP region) of −150 to +50 bp for prokaryotes (Rangannan and Bansal, 2009), −500 to +500 bp for eukaryotes (Abeel et al., 2008a(Abeel et al., , 2009) with respect to TSS and −500 to 0 bp for prokaryotes with respect to TLS (Rangannan and Bansal, 2009) where 0th position corresponds to the TSS/TLS. For rice and Arabidopsis we found that ≈50% of the genes have a 5'UTR length of 100 nt or less (Table IB). Hence in our analysis, we considered the region −500 to +100 bp with respect to the TSS for determining true positives in proteincoding genes while the region −1000 to 0 bp of TSS/TLS was considered for ncRNA genes and for genes with only TLS information.
For comparison of our results with EP3, we considered the region −500 to +500 bp with respect to the TSS for protein-coding genes and −1000 to 0 bp of TSS for ncRNA genes.
The least stable position (LSP) of a prediction was considered as a single nucleotide metric for defining true and false predictions to avoid ambiguity due to overlapping with TP region. Therefore, any prediction with its LSP lying within the TP region was considered as a true positive prediction (TPpred.), while any prediction (LSP) lying within the transcribing region of a gene but not within the TP region (FP region) was considered a false positive prediction (FPpred.). The genes that have at least one TPpred. within the TP region of its TSS were considered as true positive genes (TPgenes). The genes which didn't have any predictions within the TP region were considered as false negative genes(FNgenes). The performance parameters were defined as follows:

* Precision * Recall F − value = (3) Precision + Recall
We also mention the average prediction length and the percentage prediction coverage of TP region and FP region when comparing programs. The latter statistic indicates the percentage length of the TP or FP region that a program predicts to be "true".

Prediction score categorization
The D value of a prediction reflects the difference in free energy of a prediction from its downstream region. Since an unstable region in a comparatively stable environment in DNA is a characteristic of promoter regions, we used the D value to determine the relative score of predictions. The maximum D value (Dmax) for all predictions in Arabidopsis and rice were pooled and the mean and standard deviation for this data was calculated.
Dmax cutoffs for categorizing predictions (Table S2) were calculated on the basis of the GC content of 1001nt fragment (super-window) containing the prediction. One of the following five score classes were assigned to each prediction depending on the score of the Dmax value of a prediction: (a) Highest (Dmax > Mean + 2SD), (b) Very high (Mean+SD <Dmax < Mean + 2SD), (c) High (Mean < Dmax < Mean + SD), (d) Medium (Mean−SD <Dmax < Mean), (e) Low (Cutoff < Dmax < Mean−SD). Also, the genes were categorized according to the score class of the prediction with the highest score present within −500 to +100bp of its TSS. All the predictions thus categorized according to their scores can be browsed online in the genome browser PlantcisProm constructed using Bioperl (Stein et al.,2002) alongwith annotated genes (See Web Resources).

Supplemental Material
Protocol S1: Details of the PromPredict algorithm.

Web resources
Plantcisprom genome browser (http://nucleix.mbu.iisc.ernet.in/plantcisprom) can be used to browse the whole genome promoter predictions alongwith gene annotation for Arabidopsis and rice genomes.
Zhu, J., He, F., Wang, D., Liu, K., Huang, D., Xiao, J., Wu, J., Hu, S., and Yu, J. (2010). A novel role for minimal introns: routing mRNAs to the cytosol. PloS One, 5:e10144.   Arabidopsis and (B) rice. It is seen that the predictions occurring in each frequency class correspond to peaks in AFE profiles at a particular distance from the TSS. The plots depict the AFE for sequences with the closest prediction present at a given distance (−500 to +100bp relative to TSS). The colour code used to depict the AFE profile, for sequences with predictions in each 50 nucleotide bin, is indicated in the box on the right.    scores corresponding to 11941 TP genes in Arabidopsis have been plotted against the scores for their 10275 TP gene orthologs in rice. Since there is more than one Arabidopsis gene ortholog for some rice genes, 12359 pairs of orthologous genes were formed. The 9976 orthologous gene pairs with scores in same class or differing by one level in the two genomes (+) give a Pearson correlation coefficient of 0.51 (best fit line -.-) while a value of 0.23 is obtained for all gene pairs (+ and •, best fit line -).    the region considered for determining true positives was −500 to +100 bp in the vicinity of TSS.

Figure legends
For ncRNA genes and protein coding genes with only TLS information, the region considered for determining true positives was −1000 to 0 bp of start site.     (Abeel et al., 2008a): In Arabidopsis, 20094 (protein-coding) and 1263 (RNA-coding) TSS were considered for analysis. In rice, 23057(protein-coding) and 1527 (RNA-coding) TSS were considered for analysis. For Arabidopsis, PromPredict gave 386264 predictions while EP3 gave 594559 predictions.
PromPredict predicted 1284547 signals and EP3 predicted 1611598 signals in the rice genome. The region considered for determining true positives is −500 to +500 bp of the TSS for protein coding genes and −1000 to 0 bp of the TSS for ncRNA genes.  intron in all the genes was involved in regulation, except for TWN2 where the first two introns were involved.

Gene
Prediction strength Reference