ConBind: motif-aware cross-species alignment for the identification of functional transcription factor binding sites

Eukaryotic gene expression is regulated by transcription factors (TFs) binding to promoter as well as distal enhancers. TFs recognize short, but specific binding sites (TFBSs) that are located within the promoter and enhancer regions. Functionally relevant TFBSs are often highly conserved during evolution leaving a strong phylogenetic signal. While multiple sequence alignment (MSA) is a potent tool to detect the phylogenetic signal, the current MSA implementations are optimized to align the maximum number of identical nucleotides. This approach might result in the omission of conserved motifs that contain interchangeable nucleotides such as the ETS motif (IUPAC code: GGAW). Here, we introduce ConBind, a novel method to enhance alignment of short motifs, even if their mutual sequence similarity is only partial. ConBind improves the identification of conserved TFBSs by improving the alignment accuracy of TFBS families within orthologous DNA sequences. Functional validation of the Gfi1b + 13 enhancer reveals that ConBind identifies additional functionally important ETS binding sites that were missed by all other tested alignment tools. In addition to the analysis of known regulatory regions, our web tool is useful for the analysis of TFBSs on so far unknown DNA regions identified through ChIP-sequencing.

regulatory regions, can be forcefully pulled together in the alignment, artificially aligning unrelated TFBSs. Contrarily, the weaker the MMW and the MSW are, the more likely it will be for the algorithm to discard the information about motifs and optimize the alignments in the traditional way (i.e. maximising the alignment score). An optimal motif-aware alignment method should produce alignments with a minimal change in alignment score and, at the same time, be able to align all (and only) functional TFBSs. These two objectives, however, are clearly discordant. We can imagine the difference in alignment score as the Cost that we need to pay for a certain gain in Effectiveness, i.e. the number of functional TFBSs correctly aligned.
In order to find the best parameter values and assessing the quality of the produced alignment compared with the efficiency in identifying conserved TFBSs we trained our method using a set of regulatory regions for which the functional TFBSs where previously experimentally validated by the Göttgens Lab in murine cell lines (personal communication). This training set includes 14 regulatory regions (Erg+75,Erg+65,Scl+40,Cx3cr1 promoter,Gfi1b+16 Meis1+48,Gfi1b+17,Pim1+10,Scl+19,Lyl1+2, for a total of 114 experimentally validated TFBSs belonging to six motif families (i.e. ETS, GATA, EBOX, GFI1, MEIS, and RUNT). These 14 mouse regions were aligned to seven different organisms (i.e. Homo sapiens, Bos taurus, Canis lupus familiaris, Loxodonta africana, Monodelphis domestica, Sarcophilus harrisii, and Ornithorhynchus anatinus) using ConBind. We performed an exhaustive parameter sweep running ConBind with different MMW and MSW pairs, such that 1 ≤ ≤ 50 and 0 ≤ ≤ (notice that assigning a heavier weight to a mismatch would not be a sensible option), for a total of 1325 runs. The sum-of-pairs score as defined by Thompson et al. (1999) was used to assess the overall quality of the produced alignments. For each ConBind run (with a specific MMW and MSW pair) two values were computed for each region R. The cost where , is the alignment of the region R produced by ConBind using the weight pair MMW,MSW. The columns corresponding to TFBSs were excluded in the Cost calculation. Notice that a perfect alignment yields a sum-of-pairs score of 1. For a perfect alignment A we expect � , � to tend to 0. The second value we computed is the effectiveness where is the total number of experimentally validated TFBS on the mouse region R. is the i th experimentally validated TFBS on the mouse sequence, H the number of sequences used in the alignment and ( ) is the number of organism (including mouse) in which is aligned in the same position. Notice that for every , ( )/ should approach 1, since all have been experimentally validated and should be conserved in H sequences. Therefore, we expect Effectiveness � , � to tend to 1 for an alignment A that shows highly conserved . Supplementary Figure 1 shows the mean Cost and Effectiveness (over all 14 regions) computed for every MMW, MSW combination. Notably, every weight combination shows an improvement in terms of efficiency over ClustalW2 alignments of the same regions. To reduce overtraining we computed cost and effectiveness using different subsets of the 14 regions. Specifically, we used every possible subset using 100%, 90% and 80% of the 14 regions, for a total of 379 subsets. For each subset we chose the MMW-MSW pair with the best cost-effectiveness ratio. For roughly 64% of the subsets the best weight pair had a MMW=3 and MSW=1. Supplementary Figure 2 shows the best weight pairs for the entire cross-validation study.

Supplementary Figure 2
Only three weight pairs (over the total of 1325 tested) attained the best cost-efficiency ratio in at least one of the cross-validation subsets. Particularly, the pair with MMW=3 and MSW=1 is the one with the best cost-efficiency ratio for roughly 64% of the cross-validation subsets.
Using the weight pair selected by cross-validation (i.e. MMW=3 and MSW=1) we compared the alignment of experimentally validated TFBSs between ConBind and ClustalW2, as shown in Supplementary Figure 3.

PRALINE AND PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT
The progressive multiple alignment step was performed with a flexible sequence alignment program, a reimplementation of the available PRALINE MSA toolbox (Heringa, 1999). This tool was developed in-house and thus has support for required features such as the use of custom symbol alphabets and weight matrices during the alignment process. In order to reduce the number of parameters, we used default settings and implemented a minimal tree-guided progressive alignment strategy. A reasonable default was chosen for the linkage method during the hierarchical clustering (UPGMA).

Supplementary Figure 3
For each experimentally validated TFBS we counted the number of species in which the TFBS was aligned in the same position. ConBind (CB) can detect a higher conservation signal compared to ClustalW2 (CW), indicating that ConBind aligns more TFBSs.