## Abstract

Motivation: Predicting cis-regulatory modules (CRMs) in higher eukaryotes is a challenging computational task. Commonly used methods to predict CRMs based on the signal of transcription factor binding sites (TFBS) are limited by prior information about transcription factor specificity. More general methods that bypass the reliance on TFBS models are needed for comprehensive CRM prediction.

Results: We have developed a method to predict CRMs called CisPlusFinder that identifies high density regions of perfect local ungapped sequences (PLUSs) based on multiple species conservation. By assuming that PLUSs contain core TFBS motifs that are locally overrepresented, the method attempts to capture the expected features of CRM structure and evolution. Applied to a benchmark dataset of CRMs involved in early Drosophila development, CisPlusFinder predicts more annotated CRMs than all other methods tested. Using the REDfly database, we find that some ‘false positive’ predictions in the benchmark dataset correspond to recently annotated CRMs. Our work demonstrates that CRM prediction methods that combine comparative genomic data with statistical properties of DNA may achieve reasonable performance when applied genome-wide in the absence of an a priori set of known TFBS motifs.

Contact:nora.pierstorff@uni-koeln.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

## 1 INTRODUCTION

Identifying cis-regulatory modules (CRMs) is a prominent issue in computational biology. In order to have a better understanding of the mechanistic aspects of gene regulation it is necessary to identify systematically the target sites for transcriptional regulatory proteins. CRMs are also important targets for evolutionary change since phenotypic differences between species may be attributable to differences in gene regulation rather than to differences in the gene itself (Wilson, 1975; Tautz, 2000; Khaitovich et al., 2006). In contrast to the gene-finding problem, the problem of finding CRMs is much more difficult—partly because of the lack of a commonly accepted definition of a CRM, owing to their inherent variability in structure. As a working definition we call a CRM of gene x a stretch of DNA of several hundred bases located within, proximal or distal to gene x, which contains potential target sites of transcription factors and whose presence or absence has a measurable effect on the expression of gene x.

Most programs to predict CRMs, such as Ahab (Rajewsky et al., 2002) or ClusterBuster (Frith et al., 2003) use information about known TFBSs to identify potential CRMs. The results of these methods can be improved by using information about the conservation of predicted TFBSs, such as done by the programs Stubb (Sinha et al., 2004) and ecis-analyst (Berman et al., 2004). These methods yield reasonably good results but cannot generally be applied to genes for which the upstream transcription factors have not yet been studied experimentally. Thus, they cannot easily be used for a genome-wide CRM annotation since only a minor fraction of all active transcription factors and their binding sites is known.

An alternative approach is to use information contained in the pattern of conservation between related sequences. For example the method of Grad et al. (2004) uses patterns of sequence conservation between two species to predict CRMs with similar properties to a training set. Another approach, generally known as phylogenetic shadowing (Bofelli et al., 2003), uses more than two species to predict functional regions. Methods using conservation signals alone can be limited in their performance because TFBS conservation necessary to maintain regulatory function may not be significantly higher than in non-binding sequence (Emberly et al., 2003) and TFBSs can be gained or lost even when CRM function is conserved (Ludwig et al., 2000). In addition, conservation signals depend on the local mutation rate and selective constraints which vary over the genome.

Here, we show that CRMs can be identified in Drosophila melanogaster by combining conservation signals with a second signal based on identifying locally overrepresented sequence motifs, similar to the method described by Papatsenko et al. (2002). Often, functional TFBSs are associated with a cluster of degenerate repeats of their core motifs. Their biological function may be the guidance of a transcription factor to its binding site (Kim et al., 1987; Khory et al., 1990; Coleman and Pugh, 1995). Our method, called CisPlusFinder, locates sequences that are both perfectly conserved in multiple genomes and contain an overrepresented core motif as signal of a TFBS, and subsequently clusters sequences with these properties to predict CRMs. We developed and applied our method in the genus Drosophila, a species group that is well suited for a multi-way comparative analysis since draft genome sequences are available for D.melanogaster and 11 other species (http://rana.lbl.gov/drosophila/). Our results indicate that combining conservation and compositional signals from a set of multiple closely related species, which sum to a sufficiently high substitution rate can lead to better CRM predictions than has been achieved thus far.

## 2 METHODS

### 2.1 Overview over CisPlusFinder

The motivating idea behind CisPlusFinder is to find regions in a target genome, which contain a high density of perfect local ungapped sequences (PLUSs) that are shared among a set of closely related species. Individual PLUSs are selected on the basis that: (1) the length of a PLUS is unlikely relative to random occurrence; (2) at least one core motif contained in a PLUS is locally overrepresented and (3) additional PLUSs fulfilling conditions (1) and (2) occur within the immediate neighborhood. Underlying these requirements is the idea that PLUSs contain core TFBS motifs, and that local clustering of PLUSs represents the comparative genomic signal of clustered TFBS that typify CRMs.

Figure 1 shows a flowchart describing CisPlusFinder. A detailed description of the method can be found in the Supplementary materials. The first section of Figure 1 shows the calculation of the nucleotide homology map, which is based on chained blastn-hits (Altschul et al., 1990). The second part of Figure 1 depicts the stepwise calculation of putative PLUSs, which have a significant length and contain within them a core motif that is overrepresented in a local window centered on the PLUS. In order to find the set of putative PLUSs suffix trees are built using Ukkonen's algorithm (Gusfield, 1997). The equations to calculate the significance of the length of a PLUS and the overrepresentation of possible core motifs are explained in the Supplementary materials. The last section of the flowchart depicts the selection of PLUSs, which possess at least the minimal regulatory potential, and the clustering of PLUSs into CRMs. We note that our definition of ‘regulatory potential’ is not related to that used by King et al. (2005).

Fig. 1

Flowchart describing the CisPlusFinder method.

Fig. 1

Flowchart describing the CisPlusFinder method.

### 2.2 Measuring prediction accuracy

To evaluate our CRM predictions, we use the same measurements as Chan and Kibler (2005): sensitivity (SN) and positive predicted value (PPV) on the CRM and nucleotide levels and specificity (SP) on the nucleotide level. They are

(1)
$SN=TP/(TP+FN)$

(2)
$PPV=TP/(TP+FP)$
and
(3)
$SP=TN/(TN+FP),$
where TP means true positive, TN means true negative, FP means false positive and FN means false negative nucleotides or CRMs. We believe that the most meaningful measurement is SN on the CRM level, which quantifies the proportion of annotated CRMs discovered by the computational method under evaluation.

These performance measurements are often used to evaluate and compare gene prediction programs and classifiers that partition an input set into two groups. Applying the same measurements to CRM prediction the following problems have to be considered. One problem is caused by the fact that the exact boundaries are not known for most CRMs. It is possible that annotated CRMs exceed the boundaries of the functional CRM causing many false negative nucleotides. It is also possible that additional functional regions of a CRM exist outside an annotated CRM giving rise to false positive nucleotides. More false positive nucleotides and predictions are caused by the fact that many CRMs are still unknown and not annotated yet. Thus false positive predictions are expected and high specificity may not necessarily reflect good accuracy of a program. Another problem is that the equation one prediction = one CRM is not true for most of the prediction programs. PPV on CRM level is calculated assuming this fact. PPV also depends strongly on the length of the analysed sequence. If the sequence is short enough the probability of a false positive prediction is limited by space.

Because of these considerations we developed a standardized version of PPV to train our method, denoted PPV′, which calculates the length-adjusted probability that a prediction hits an annotated CRM. This measurement on the nucleotide level is defined as

$PPV′=TPPTPP+FPN,$
where P means the number of nucleotides covered by annotated CRMs and N means the number of nucleotides not covered by annotated CRMs. In addition, in training the parameters of our method (see 3.1.2) we avoided penalizing predictions that overlap annotated CRMs but do not hit the exact boundaries of the annotations. Thus, for a CRM prediction that hit an annotated CRM, the number of false positive nucleotides outside of the boundary of the hit annotation were not used to calculate PPV′ or SP. Likewise, nucleotides in annotated CRMs outside the boundary of an overlapping prediction were not counted as false negatives in calculating SN.

To evaluate the performance of CisPlusFinder in comparison with other methods or on the test set we used the script by Chan and Kibler (2005) without modification. This script considers a CRM to be discovered if the overlap between it and a predicted CRM exceeds 50 bp and therefore excludes CRMs shorter than 50 bp from the analysis. Our analysis follows the same rules. We note that exons were not masked in this study, so as not to exclude prediction of CRMs that are known to overlap exons, and for direct comparison with results from Chan and Kibler (2005).

### 2.3 Datasets

To develop and test our algorithm we used three datasets containing CRMs from genes involved in early development in D.melanogaster (Table 1).

Table 1

Summary of CRM datasets used in this study

Papatsenko HexDiff REDfly (all) REDfly (non-redundant)
Number of target genes 39 16 203 165
Number of CRMs 95 52 610 386
Papatsenko HexDiff REDfly (all) REDfly (non-redundant)
Number of target genes 39 16 203 165
Number of CRMs 95 52 610 386

To develop the CisPlusFinder method and train its parameters we used the combined datasets from Papatsenko et al. (2002) and Papatsenko and Levine (2005).

To compare results from CisPlusFinder with other CRM detection methods, we used the dataset and published results of Chan and Kibler (2005). We note that HexDiff dataset shares 12 genes in common with the Papatsenko dataset, and therefore this evaluation dataset is not independent from our training dataset, as is the case for many of the PWM-based methods tested in Chan and Kibler (2005). We also note that the CRM annotation for the HexDiff and Papatsenkodatasets differ for these 12 genes, with 34 of 44 annotated HexDiff CRMs overlapping 35 of 41 annotated Papatsenko CRMs.

To test our method on a dataset of CRMs that is fully independent from the training data and that covers a wider range of biological processes, we used the

REDfly
database curated by Gallo et al. (2005). To construct the REDfly non-redundant dataset, we chose every CRM that is verified by an in vivo reporter construct and regulates a target gene not found in one of the other CRM datasets used to train and evaluate CisPlusFinder. For every target gene, the region containing the gene and all annotated CRMs are downloaded and extended 10 kb upstream of the first CRM and 10 kb downstream of the last CRM.

As an additional experiment, we tested whether the PLUSs that underlie CRM predictions by CisPlusFinder correspond to annotated TFBSs using the flyreg database curated by Bergman et al. (2005), many of which are positioned in CRMs from the HexDiff dataset.

## 3 RESULTS

### 3.1 Training

#### 3.1.1 Choice of informant species

To train the parameters of CisPlusFinder we attempted not to miss CRMs because they are absent from one or more informant sequences. Therefore, we selected manually the informant species from the set of Drosophila yakuba, Drosophila ananassae, Drosophila pseudoobscura and Drosophila virilis. Table 1 in the Supplementary materials indicates which informant species were used for which target gene.

#### 3.1.2 Parameters for Drosophila

For D.melanogaster and the informant species, we systematically varied the following five parameters, and measured the accuracy (PPV′) on the training dataset. For other species the parameters can easily be retrained. We found the following combination to be optimal:

1. length of overrepresented core of a PLUS: c = 5

2. threshold for the minimal O/E-ratio: t = 2

3. window size for measuring overrepresentation: w = 1000

4. window size for measuring regulatory potential: r = 500

5. minimal regulatory potential: m = 3

#### 3.1.3 Accuracy on the training set

With optimized parameters we found 94 of the known 95 CRMs in the Papatsenko dataset. Only the CRM regulating the gene zen is missing in our prediction, which lacks a third PLUS to reach the required regulatory potential. Sensitivity on CRM level of our method is 0.99 and PPV′ is 0.67. Specificity on nucleotide level is 0.60 and sensitivity on nucleotide level is 0.56.

### 3.2 Comparison with other methods on HexDiff dataset

To compare the performance of CisPlusFinder with other methods, we applied it to a dataset used by Chan and Kibler (2005). In this study the methods Ahab (Rajewsky et al., 2002), ClusterBuster (Frith et al., 2003), MSCAN (Johansson et al., 2003), MCAST (Bailey and Noble, 2003), LWF (Papatsenko et al., 2002) and HexDiff (Chan and Kibler, 2005) were applied to a set of CRMs active in early Drosophila development and performance measures of these methods were computed. The results are summarized in Table 2; for details on the parameters used for these method we refer the reader to the paper by Chan and Kibler (2005). We note that the thresholds chosen by Chan and Kibler (2005) were optimized to increase specificity according to the HexDiff annotation. Lowering the thresholds would increase sensitivity and decrease specificity for these methods. We added to these results the published predictions from an additional method developed by Grad et al. (2004), which uses only comparative data and a training set of known CRMs. Furthermore, we generated predictions for the HexDiff dataset using Stubb (Sinha et al., 2004) and ecis-analyst (Berman et al., 2004), both of which use information about conservation of known TFBSs. To evaluate these methods we used default parameter settings and input matrices from Chan and Kibler (2005).

Table 2

Comparison of different CRM prediction methods on the HexDiff dataset

SN(CRM) PPV(CRM) SP(nuc.) SN(nuc.) PPV(nuc.) Hit CRMs Missed CRMs
HexDiff 69 34 94 39 36 36 16
Ahab 44 58 98 22 55 23 29
ClusterBuster 60 26 95 34 37 31 21
MSCAN 65 19 91 27 21 34 18
MCAST 83 11 70 48 13 43
LWF 52 11 90 13 11 27 25

Stubb (pse-default) 67 24 92 29 24 35 17
Stubb (vir-PhastCons) 87 11 71 43 12 45
ecis-analyst 61 65 92 45 33 32 20

CisPlusFinder 96 11 54 60 11 50
PFR 23 36 83 10 12 40
SN(CRM) PPV(CRM) SP(nuc.) SN(nuc.) PPV(nuc.) Hit CRMs Missed CRMs
HexDiff 69 34 94 39 36 36 16
Ahab 44 58 98 22 55 23 29
ClusterBuster 60 26 95 34 37 31 21
MSCAN 65 19 91 27 21 34 18
MCAST 83 11 70 48 13 43
LWF 52 11 90 13 11 27 25

Stubb (pse-default) 67 24 92 29 24 35 17
Stubb (vir-PhastCons) 87 11 71 43 12 45
ecis-analyst 61 65 92 45 33 32 20

CisPlusFinder 96 11 54 60 11 50
PFR 23 36 83 10 12 40

Accuracy of different CRM prediction methods on a set of 52 CRMs active in early Drosophila development. (Results in the top section of the table are taken from hexdiff).

Table 2 shows that CisPlusFinder has a higher sensitivity on the CRM and nucleotide levels than any other method. The second highest sensitivity is achieved by Stubb, which misses five CRMs found by CisPlusFinder. Both CRMs missed by CisPlusFinder have high local substitution rates, and one CRM missed by CisPlusFinder (regulating the expression of oc) was also not found by the method of Grad et al. (2004).

Figure 2 shows the predictions of all programs listed in Table 2, the HexDiff annotation and the current REDfly-based annotation for two HexDiff regions. Some CRMs are found by all methods, such as the eve-stripe3 enhancer at position 5 487 313 in Figure 2a, suggesting that a combined approach to CRM prediction, as is increasingly used for gene prediction (Allen and Salzberg, 2005), may yield accurate CRM annotations. Some CRMs are exclusively found by CisPlusFinder, such as the prd PMFE CRM at position 12 078 033 in Figure 2b.

Fig. 2

Visualization of the predictions of the methods listed in Table 2. The top two tracks show the REDfly (2006) and HexDiff (2005) annotations for the genomic regions containing eve and prd using the UCSC Browser (Kent et al., 2002) (). Note that apparent many ‘false positives’ may in fact represent unannotated CRMs in the HexDiff regions, as shown by comparison with the REDfly CRM annotation. The bed files for all 16 HexDiff regions are available at the project page for visualisation using the UCSC browser.

Fig. 2

Visualization of the predictions of the methods listed in Table 2. The top two tracks show the REDfly (2006) and HexDiff (2005) annotations for the genomic regions containing eve and prd using the UCSC Browser (Kent et al., 2002) (). Note that apparent many ‘false positives’ may in fact represent unannotated CRMs in the HexDiff regions, as shown by comparison with the REDfly CRM annotation. The bed files for all 16 HexDiff regions are available at the project page for visualisation using the UCSC browser.

### 3.3 Analysis of false positive predictions

Table 2 also shows a relatively low specificity achieved by CisPlusFinder. However it is obvious from Figure 2 that there are predictions made by CisPlusFinder that do not overlap with the HexDiff-based annotation (black boxes) but do correspond to annotated CRMs in the REDfly database (grey boxes). To see if the low specificity of CisPlusFinder results from an incompletely annotated evaluation dataset, we extracted all CRMs from the REDfly database that regulate genes in the HexDiff dataset. The REDfly database contains 125 CRMs for the HexDiff regions in contrast to 52 CRMs currently annotated, and also has improved annotations for hairy and oc. A total of 120 of the 125 REDfly CRMs overlap with our predictions, and 5 of the annotated CRMs cannot be found. The number of false positive CisPlusFinder CRM predictions is reduced from 408 to 338. The number of nucleotides in false positive CisPlusFinder CRM predictions is reduced by 15.41% from 296 005 to 250 388.

The current REDfly-based annotation and bed files for upload to the UCSC genome browser for all 16 HexDiff regions containing all the information can be downloaded at the project page.

### 3.4 Correlation between PLUSs and known TFBS

To test whether PLUSs are caused by TFBSs, we downloaded all annotated TFBSs that are found in the HexDiff regions from the flyreg database (Bergman et al., 2005). This resulted in a dataset of 376 TFBSs regulating 10 target genes bound by 27 different transcription factors. For 67 of the TFBS, the binding factor is not known. The average length of the TFBS in our subset is 18 bp.

A total of 135 TFBS (36%) overlap with 115 PLUSs found by CisPlusFinder. The average length of a hit TFBS is 19 bp and the average length of a hitting PLUS is 14 bp. In the 602 508 bp analysed sequence 4243 PLUSs are found, covering 8.65% of the sequence. The 376 TFBSs cover 1.07% of the sequence. The hit TFBSs cover 0.39% of the sequence. The probability to have this overlap by chance is 3.12*10−4. These results indicate that while many PLUSs correspond to TFBSs, many TFBSs do not appear in PLUSs because of the degree of substitution and turnover in this species set (Emberly et al., 2003; Ludwig et al., 2000).

### 3.5 Application to simulated data

To investigate our predictions in unconstrained sequences we applied it to simulated data. We simulated the evolution of the 16 HexDiff sequences as ancestral sequences along the tree shown in the supplementary material using the method CisEvolver by Pollard et al. (2006) (), which can evolve sequences according to the HKY85-model (Hasegawa et al., 1985) without any selective constraint. We extracted the simulated sequences of D.melanogaster and the informant sequences D.yakuba, D.ananassae, D.pseudoobscura and D.virilis from the generated sequences and applied CisPlusFinder to it. CisPlusFinder did not predict any CRM in the simulated D.melanogaster sequences. Out of 642 508 bp of the simulated D.melanogaster sequence roughly 60% are covered by chained alignments. No significant PLUS can be found in these aligned sequences. We conclude that PLUSs occur preferentially in sequences, which are subject to functional constraint.

### 3.6 Application to the REDfly test set

#### 3.6.1 Accuracy on the REDfly test set

The results of CisPlusFinder applied to the REDfly CRMs are shown in Table 3. Our method predicts CRMs that overlap 289 of 386 annotated CRMs in the REDfly dataset. We compared our results with the output of Stubb using the same set of PWMs as above, which may limit Stubb to predict CRMs regulated by transcription factors regulating the CRMs in the Chan and Kibler (2005) regions. When D.virilis is used as a reference species Stubb predicts CRMs that overlap 236 annotated REDfly CRMs.

Table 3

Accuracy of CisPlusFinder and Stubb on the non-redundant REDfly dataset

SN(CRM) PPV(CRM) SP(nuc.) SN(nuc.) PPV(nuc.) Hit CRMs Missed CRMs
CisPlusFinder 75 13 64 47 19 289 97
Stubb (pse-default) 24 19 94 18 92 294
Stubb (vir-PhastCons) 62 12 75 27 16 238 148
SN(CRM) PPV(CRM) SP(nuc.) SN(nuc.) PPV(nuc.) Hit CRMs Missed CRMs
CisPlusFinder 75 13 64 47 19 289 97
Stubb (pse-default) 24 19 94 18 92 294
Stubb (vir-PhastCons) 62 12 75 27 16 238 148

#### 3.6.2 Further analysis of false negatives

A total of 97 CRMs are missed by CisPlusFinder in the REDfly regions. Assuming that the function of a CRM is conserved in all informant species, it can be missed by CisPlusFinder for two reasons. One reason is that the number of PLUSs in this region is lower than the minimal required regulatory potential. The other reason is that the local substitution rate of a region is very high, which means that the probability that a TFBS will cause a PLUS is reduced.

To understand the basis of false negative predictions, we define a missed CRM as those which do not overlap at all with any CisPlusFinder prediction. Thus nine CRMs that overlap with CisPlusFinder predictions for <50 bp and were treated as false negatives above, are treated as true positives in this analysis. One missed CRM is not present in the chained alignment and excluded from further consideration. For 13 missed CRMs, the informant sequences contain ambiguous or unassembled sequences, i.e. stretches of

N
s. The very strong requirement of perfect ungapped alignment makes CisPlusFinder very sensitive to unassembled sequences and
N
s in the sequence. For this reason six missed CRMs with a
N
-content above 60% are excluded from further consideration. After excluding these 16 exceptions, 81 false negative CRMs are left for further analysis.

To find which CRMs are missed because they do not contain enough PLUSs to give rise to the required regulatory potential, we chose the subset of missed CRMs that contain putative PLUSs but are not predicted by CisPlusFinder (named lowRP in the following). lowRP contains 31 missed CRMs from 26 different regions. The remaining 50 missed CRMs contain no PLUS and map to 30 different regions. We speculated that these missed CRMs in no PLUS regions might be caused by a high local substitution rate.

To test this hypothesis, we downloaded the PhastCons scores for multiple alignments of the D.melanogaster genome with eight other species Drosophila simulans, D.yakuba, D.ananassae, D.pseudoobscura, D.virilis, Drosophila mojavensis, Anopheles gambiae and Apis mellifera (Siepel et al., 2005). These alignment scores are calculated for every position of the genome of D.melanogaster. The average conservation scores for the different subsets of CRMs are given in Table 4, which shows that many missed CRMs are positioned in regions of the genome that are poorly conserved. CRMs with lowRP show an intermediate level of conservation, and positively predicted CRMs show the highest average level of conservation. We suggest that TFBS in the no PLUS regions have undergone a higher rate of evolutionary turnover because of a high local substitution rate, and thus CRMs are not identified since the number of PLUSs is too low. We note that in contrast to positively predicted CRMs both classes of missed CRMs have a lower conservation score than their surrounding regions.

Table 4

Relationship between local conservation score and CisPlusFinder CRM prediction accuracy

Found LowRP No PLUS
Number of CRMs 297 31 50
Number of regions 106 26 30
Average length of predicted CRM 2444 1194 533
Average conservation score of the CRM 0.53 (0.14) 0.40 (0.16) 0.25 (0.19)
Average conservation score of the region 0.48 (0.063) 0.46 (0.064) 0.38 (0.10)
Found LowRP No PLUS
Number of CRMs 297 31 50
Number of regions 106 26 30
Average length of predicted CRM 2444 1194 533
Average conservation score of the CRM 0.53 (0.14) 0.40 (0.16) 0.25 (0.19)
Average conservation score of the region 0.48 (0.063) 0.46 (0.064) 0.38 (0.10)

Average PhastCons score according to the multiple alignment of nine insects for predicted CRMs, CRMs containing putative PLUSs with low regulatory potential (lowRP) and CRMs where no PLUSs are found. Standard errors are given in brackets.

## 4 DISCUSSION

Transcriptional regulation is a highly complex process that differs for each individual gene and CRM. Therefore, it is difficult to find a necessary and sufficient set of rules that can be used for computational prediction of all CRMs in a genome.

The goal of this project was to develop a method that finds CRMs independent of the information about the transcription factors that regulate a target gene. To do this we have developed a CRM prediction method based solely on sequence conservation and composition. Conservation-based methods limit prediction to conserved CRMs, but this limitation is becoming less restrictive with the increasing number of completely sequenced genomes. The set of informant species can be chosen according to the conservation of regulation for each investigated gene.

Table 2 shows that CisPlusFinder is able to localize Drosophila CRMs better than the other methods applied thus far, which were also trained and developed with CRMs involved in early Drosophila development. The results of CisPlusFinder applied to the REDfly dataset (Table 3) show that CisPlusFinder is not limited to a special group of CRMs, and that our method can predict CRMs that are involved in biological processes different from those in the training data. This generality of CisPlusFinder results from the fact that it does not require any a priori knowledge about the binding transcription factors regulating a target gene.

However, we found that the specificity of our method is the lowest in the comparison to those that condition predictions on TFBS models. To test if CisPlusFinder predicts a high number of false positives or, more interestingly, a high number of undiscovered CRMs, we compared our prediction with an updated annotation of the HexDiff regions (section 3.3). These results clearly indicate that the low specificity of CisPlusFinder is in part because we are predicting unannotated CRMs in the HexDiff regions, underscoring both the importance of high quality CRM annotations as well as the difficulty in evaluating the specificity of CRM prediction methods.

The results reported in Table 4 also show that the performance of CisPlusFinder depends on the local substitution rate of the region investigated. In regions with high substitution rates, CRMs cannot be identified with the same accuracy as those in regions with low substitution rates. There may also be cases where the substitution rates are extremely low such that the probability to find a PLUS by chance is exceeded by the probability to find it because of shared ancestry. In these cases more species should be included in the analysis.

A final result of our study is that PLUSs are valuable signals for predicting CRMs. This finding is not incompatible with the occurrence of substitutions in TFBS and binding site turnover (Emberly et al., 2003; Ludwig et al., 2000). PLUSs likely correspond to and are caused by TFBSs, but not every TFBS causes a PLUS. Hence CisPlusFinder can find CRMs that may have undergone functional substitutions, when at least some of the TFBSs are conserved strongly enough to give rise to PLUSs.

## 5 CONCLUSION

In conclusion, we have developed a method to predict CRMs, CisPlusFinder, that can produce reasonable predictions across different genomic regions, even when the parameters and set of informant species are static. CisPlusFinder may provide an accurate tool for biologists to uncover the regulation of individual genes by allowing the user to tailor the set of informant species to the local substitution rate and set parameters appropriate to the complexity of regulation of the target gene. Automated methods for species selection and parameter optimization will be an important development in the future.

The authors would like to thank R. Nunes da Fonseca for valuable discussion, F. Catania for the name ‘PLUS’ and A. Siepel for sharing the phylogeny. The authors also thank three anonymous reviewers for many helpful comments on the manuscript. N.P. is supported by a Marie Curie visiting fellowship from the European Commission (HPMT-GH-01-00285-13).

Conflict of Interest: none declared.

## REFERENCES

Allen
J.
Salzberg
S.
JIGSAW: integration of multiple sources of evidence for gene prediction
Bioinformatics
,
2005
, vol.
21
(pg.
3596
-
3603
)
Altschul
S.
, et al.  .
Basic local alignment search tool
J. Mol. Biol.
,
1990
, vol.
215
(pg.
403
-
410
)
Bailey
T.
Noble
W.
Searching for statistically significant regulatory modules
Bioinformatics
,
2003
, vol.
19
(pg.
II16
-
II25
)
Bergman
C.
, et al.  .
Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, D.melanogaster
Bioinformatics
,
2005
, vol.
21
(pg.
1747
-
1749
)
Berman
B.
, et al.  .
Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome
,
2004
, vol.
2
(pg.
757
-
762
)
Bofelli
D.
, et al.  .
Phylogenetic shadowing of primate sequences to find functional regions of the human genome
Science
,
2003
, vol.
299
(pg.
1391
-
1394
)
Chan
B.
Kibler
D.
Using hexamers to predict cis-regulatory motifs in Drosophila
BMC Bioinformatics
,
2005
, vol.
6
pg.
262

Coleman
R.
Pugh
B.
Evidence for functional binding and stable sliding of the TATA binding protein on nonspecific DNA
J. Biol. Chem.
,
1995
, vol.
270
(pg.
13850
-
13859
)
Emberly
E.
, et al.  .
Conservation of regulatory elements between two species of Drosophila
BMC Bioinformatics
,
2003
, vol.
4
pg.
57

Frith
M.
, et al.  .
Cluster-Buster: finding dense clusters of motifs in DNA sequences
Nucleic Acids Res.
,
2003
, vol.
31
(pg.
3666
-
3668
)
Gallo
S.
, et al.  .
REDfly: a regulatory element database for Drosophila
Bioinformatics
,
2005
, vol.
22
(pg.
381
-
383
)
Y.
, et al.  .
Prediction of similarly acting cis-regulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D. pseudoobscura
Bioinformatics
,
2004
, vol.
20
(pg.
2738
-
2750
)
Gusfield
D.
Algorithms on Strings, Trees and Sequences
,
1997
Cambridge, UK
Cambridge University Press
Hasegawa
M.
, et al.  .
Dating of the human-ape splitting by a molecular clock of mitochondrial DNA
J. Mol. Evol.
,
1985
, vol.
22
(pg.
160
-
174
)
Johansson
O.
, et al.  .
Identification of functional clusters of transcription factor binding motifs in genomic sequences: the MSCAN algorithm
Nucleic Acids Res.
,
2003
, vol.
19
(pg.
169
-
176
)
Kent
W.
, et al.  .
The human genome browser at UCSC
Genome Res.
,
2002
, vol.
12
(pg.
996
-
1006
)
Khaitovich
P.
, et al.  .
Positive selction on gene expression in the human brain
Curr. Biol.
,
2006
, vol.
16
(pg.
R356
-
R358
)
Kim
J.
, et al.  .
Kinetic studies on Cro repressor-operator DNA interaction
J. Mol. Biol.
,
1987
, vol.
196
(pg.
149
-
158
)
King
D.
, et al.  .
Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences
Genome Res.
,
2005
, vol.
15
(pg.
1051
-
1060
)
Khory
A.
, et al.  .
Lac repressor-operator interaction: DNA length dependence
Biochim. Biophys. Acta
,
1990
, vol.
1087
(pg.
55
-
60
)
Ludwig
M.
, et al.  .
Evidence for stabilizing selection in a eukaryotic enhancer element
Nature
,
2000
, vol.
403
(pg.
564
-
567
)
Papatsenko
D.
Levine
M.
Quantitative analysis of binding motifs meditating diverse spatial readouts of the Dorsal gradient in the Drosophila embryo
,
2005
, vol.
102
(pg.
4966
-
4971
)
Papatsenko
D.
, et al.  .
Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers
Genome Res.
,
2002
, vol.
12
(pg.
470
-
481
)
Pollard
D.
Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments
BMC Bioinformatics
,
2006
, vol.
7
pg.
376

Rajewsky
N.
, et al.  .
Computational detection of genomic cis-regulatory modules, applied to body patterning in the early Drosophila embryo
BMC Bioinformatics
,
2002
, vol.
3
pg.
30

Siepel
A.
, et al.  .
Evolutionary conserved elements in vertebrate, insect, worm and yeast genomes
Genome Res.
,
2005
, vol.
15
(pg.
1034
-
1050
)
Sinha
S.
, et al.  .
Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila
BMC Bioinformatics
,
2004
, vol.
5
pg.
129

Tautz
D.
Evolution of transcriptional regulation
Curr. Opin. Genet. Dev.
,
2000
, vol.
10
(pg.
575
-
579
)
Wilson
A.
Evolutionary importance of gene regulation
,
1975
, vol.
7
(pg.
117
-
134
)

## Author notes

Associate Editor: Martin Bishop