Identification of cis-regulatory modules in promoters of human genes exploiting mutual positioning of transcription factors

In higher organisms, gene regulation is controlled by the interplay of non-random combinations of multiple transcription factors (TFs). Although numerous attempts have been made to identify these combinations, important details, such as mutual positioning of the factors that have an important role in the TF interplay, are still missing. The goal of the present work is in silico mapping of some of such associating factors based on their mutual positioning, using computational screening. We have selected the process of myogenesis as a study case, and we focused on TF combinations involving master myogenic TF Myogenic differentiation (MyoD) with other factors situated at specific distances from it. The results of our work show that some muscle-specific factors occur together with MyoD within the range of ±100 bp in a large number of promoters. We confirm co-occurrence of the MyoD with muscle-specific factors as described in earlier studies. However, we have also found novel relationships of MyoD with other factors not specific for muscle. Additionally, we have observed that MyoD tends to associate with different factors in proximal and distal promoter areas. The major outcome of our study is establishing the genome-wide connection between biological interactions of TFs and close co-occurrence of their binding sites.


INTRODUCTION
Gene regulation in higher organisms is affected by multiple specific proteins called transcription factors (TFs). The human genome exhibits a spectacular example of sophisticated transcriptional regulation. TFs bind specifically to short DNA sequence motifs [TF binding sites or (TFBSs)] often clustered together.
The spatial combination of multiple such binding sites or elements is non-random in nature and forms Cis-regulatory modules (CRMs) (1)(2)(3). The interplay between the TFs that compose the CRMs plays an important role in gene regulation in eukaryotes (4). This is underscored by the fact that $25 000 human genes are controlled by <2000 sequence specific DNA-binding TFs (5,6). Eukaryotic gene expression is controlled by a number of different TFs bound to DNA as CRM combinations. The study by (7) shows that regulatory regions contain multiple functional binding sites. The CRMs retain their ability to regulate genes in vitro and lose the ability if the binding is disrupted by either eliminating a certain TF or its binding site (7). Similarly (8)(9)(10) showed that the association between TFs is a key to generating muscle-specific expression.
The binding sites (or motifs) for particular TF are the building blocks/components of the CRM. The binding sites for a given TF are similar, although most often not identical in a DNA sequence. As a result, the binding site motifs are often highly degenerate, which brings in some challenges to build a model for these signals (20). Thus, the computational detection of these cis-regulatory DNA segments within a genome of interest is a major challenge. Furthermore, the relatively short length of binding motifs represented by the PWMs multiplies the challenge because the small amount of information they contain may result in a large number of false-positive predictions in genome-wide searches. This scenario can be compensated by combining the PWMs with some other features such as proximity to TSS (21), chromatin structure (21,22) and proximity to other PWM hits (1,2).
Numerous attempts were made to identify CRMs. However, many of the popular methods need prior knowledge of the TFs involved in the clusters. For example, Wasserman and Fickett developed a model to predict/ identify the muscle-specific regulatory modules. They considered the known factors associated with skeletal musclespecific expression, such as Mef-2, Myf, Sp-1, SRF and Tef (23). Some other methods like DiRE and CREME (24,25) identify the CRMs from a list of co-regulated genes. These methods require prior knowledge of coregulated genes (relatively small number) from expression data for a given set of genes. The method starts with the preparation of a database of conserved TFBSs for all the TFs from TRANSFAC across the promoter region of human genes and identifying their combinations in a given set of promoters. The method also requires an alternative set of control sequences to evaluate the background distribution of TFBSs and identify the CRMs by statistically evaluating the significant modules.
Nonetheless, these methods do not address one important aspect of CRMs, which is the mutual positioning of the factors composing them, such as a preference for certain distance from each other. As discussed by (26), the relative positioning of the factors is important for understanding the nature of their interactions. In the present article, we propose a new approach where we do not consider a priori the set/cluster of factors known to be involved in myogenesis. Instead, we consider all the available factors with respect to statistically significant positional preferences in their mutual positioning with Myogenic differentiation (MyoD). The available methods are helpful in finding regulatory modules from the specific set of genes for a specific biological process. In contrast, our approach is not confined to any individual biological process.
In our approach we take the TF-binding motifs derived experimentally from TRANSFAC database and computationally determined the binding sites on the sequences from the Chromatine immunoprecipitation (ChIP) experiment for specific TF. We also determine the binding site for other TFs in these sequences, and we derive the mutual positioning among the associated factors. Our approach finds the significant association between factors that may reflect their interaction in biological processes. We have also investigated the relationships among the associated factors depending of their distance from the transcription start site and also examined the differences between the functional and similar non-functional binding sites.
Thus, in this work, in addition to identifying the clusters, we analyze the mutual positioning of the factors, i.e. the preferential spacing between them. In this study, we considered all human TFs for which PWMs are available in TRANSFAC database. This enabled us to find the association of muscle specific factors with other non-muscle specific factors. This association may signify the involvement of MyoD with biological processes other than myogenesis and involvement of additional factors in myogenesis.
In this study, we have compared the association of MyoD with other factors in functional binding sites (27) and non-functional MyoD-binding motifs (hits/matches derived from the MyoD unbound sequences) derived as non-overlapping sequences from MyoD bound ChIP-Seq sequences (27). We assume that if certain TFs are significantly over-represented in a close range around binding sites of another factor, such mutual positional preference of the given TFs is related to their common biological function. Combining the computationally searched binding sites with the information concerning association between the factors can also help in determining true binding site of a factor. This way, biologically functional TFBSs can be discriminated from a vast amount of similar yet non-functional motifs: the functional TFBS are more likely to be organized in the CRMs than similar but nonfunctional motifs.
The results of our work show preferential coupling of the muscle-specific TFBS together with MyoD in the promoter sequences, as well as some novel relationships of MyoD with other non-muscle specific TFs.

MATERIALS AND METHODS
TRANSFAC provides information in the form of base frequency tables for 1226 different TFs. Of these 1226, 721 TFs are found in human. We have adopted base frequency tables for the human TFs from the TRANSFAC database. These tables are often used to find out the binding sites in genomes (28). In our work, we used PWM instead of the frequency tables to map the TFBS in the promoter sequences from human. PWM represents the log-odd probabilities of finding each base at each position in a signal. The whole protocol is outlined in the Figure 1. We implemented the method proposed by Staden (29) to build the PWMs. The background frequencies (30) were calculated from the Database of Transcription Start Sites (DBTSS) (http://dbtss.hgc.jp/) (31) as described in (32,33). The weight for each position of the matrix is derived using the formula described in (32,33), which is a modification of Bucher's formula (34). Individual weights of the nucleotide corresponding to the matching sequence were summed to calculate the matching score for a sequence (33).
TRANSFAC has three matrices for MyoD (M00001, M00184 and M00929). We have selected the matrix M00184 for our study because this matrix is built from only MyoD sites, and the information content is more than that of M00001. The matrix M00929 is built from E12, E2A, E47, ITF-1; MRF4, Myf-6, MyoD, Myogenin and Tcfe2a-binding sites. The matrix M00929 represents E-box protein rather than MyoD binding site specifically. Furthermore the number of promoters having combination of MyoD BS with E-box BSs for example E12, E2A and myogenin is relatively lower than for the M00184 matrix. For example, E2A+M00184 = 5221 and E2A+M00929 = 2951; E12+M00929 = 2449 and E12+ M00184 = 3468; myogenin+M00184 = 3599 and myog-enin+M00929 = 2951. Considering these facts, we carried out our further analysis with M00184.

Assigning the threshold
The PWMs calculated with the aforementioned method do not provide us with the threshold score to select the hits from the mapped data. Moreover, we cannot assign a single standard cutoff to all the TFs. Hence, it is essential to determine and assign different specific threshold score to each of the factors. To resolve this problem, we have determined how many sites are likely to arise by chance for any given score for any given TF. To do so, we have created a random promoter data set by shuffling the human promoter sequences using uShuffle while preserving the relative proportion of each nucleotide (35). The DBTSS database was used for the shuffling; therefore, the shuffled sequence database contained sequences of the same number and length. All the 721 factor's PWMs were mapped with a varying range of threshold to the shuffled sequence data set; thus, these hits tell us how many hits may be obtained by chance for each threshold.
We have determined a specific threshold for each PWM by estimating the number of false positives predicted by the PWM in randomized sequences. To determine a threshold that would result in an acceptable number of the false-positive predictions, we calculated the number of hits for each threshold for each TF based on the shuffled sequences, and we term this as 'Randomized Occurrence Frequency' (OF r ).
We assume that the sites recognized as positive from the randomized sequences are the false positives. We calculate OF as the average number of positive predictions per base pair in the random shuffled data set: where fP is the number of sites predicted in the shuffled sequences by the given PWM, N is the total number of sequences in the shuffled sequence database, and L is the length of the sequence subtracting the length of the PWM. We will use the notation OF r to designate occurrence frequency calculated from the shuffled sequence data set. Therefore, the higher the occurrence frequencies from the shuffled sequences are, the lower is the specificity. Now, we calculate OF r with the aforementioned formula for each threshold, and to avoid the selection of the false-positive occurrences, we take the OF r of 0.0001. With this threshold, we would detect minimum level of false positives from the promoter sequences.
We iteratively calculate the OF r for each cutoff. We start to calculate OF r for each TF with a high cutoff and check the OF r after each change in cutoff during the iteration. If the OF r reaches 0.0001, we stop further decrement of the cutoff for the TF and if OF r <0.0001, we decrease the cutoff by 0.1 and again calculate OF r . Therefore, we assigned the threshold for each factor for selected OF r of 0.0001, which means average of 1 hit in every 10 000 shuffled sequences at each position. The value of OF r is empirically selected to restrict the level of false positive predictions by the search procedure.

Finding the distribution of TFs around the TF of interest
As aforementioned, we have selected the process of 'myogenesis' as a study case and have selected to analyze MyoD, as it plays a vital role in the process. We calculated the distribution of factors around the MyoD within the range of ±100 bp. This is because we want to screen the factors that co-occur close to the MyoD inside the given interval. We term these factors as co-occurring with MyoD, i.e. the factors found to occur in combination with MyoD within ±100 bp interval in the proximal promoter region. To determine the distribution of MyoD with itself, we used matrix M00184 and M00929 both for MyoD and calculated the distribution with respect to each other. However, the sites selected in this Figure 1. The algorithm to detect the positional association of motifs in close vicinity. The frequency tables are converted into PWM, and the cutoff is determined as described in the text. Each PWM with the corresponding threshold for OF r = 0.0001 is mapped in both the promoter sequences and shuffled sequences. After comparing the number of occurrences in both the data sets, TFs having significantly higher occurrence in the promoter sequences are selected with the criteria z-score >3. Further, to find out the positional associations of the TFs with respect to MyoD, each observed occurrence distribution is compared with the background distribution, and positions having z-score >10 are selected as preferred positions.
step do not ensure that they are truly interacting and have biological significance. The distribution of the TFs found around myogenic TF MyoD may be arbitrary.
In addition, we have mapped these PWMs in the MyoD bound ChIP experiment sequences (27). Here, we have incorporated one more constraint, i.e. we have selected the matches only around the center of the MyoD bound sequences (27). The reason of adding the constraint is that in the ChIP-seq bound sequences, MyoD is likely to be bound close to the center of the sequences. Even though while computationally determining the binding sites, we may encounter many hits in the whole sequences, and many of them would be non-functional binding sites. Thus, to avoid the false-binding sites, we have considered only the matches that lie in the central region (up to À20 bp upstream and +20 bp downstream from the center position) on the MyoD bound sequences (27).

Statistical significance of the co-occurring TFs
To determine that the occurrence of the factors in combination with MyoD is not random, we calculate the statistical significance for each combination of the cooccurring factors. With the threshold obtained for OF r of 0.0001, we computationally mapped all the 721 PWMs into the shuffled database and find the distribution of other factors around the factor of our interest.
We compared occurrence for each factor around MyoD from the observed distribution with its occurrence in the shuffled sequence database and calculated the z-score with the formula as described in (33). We selected the distribution of the factor for further analyses if that has z-score >3.
Same statistical criteria have been implemented to determine the positional preference of the studied factors around MyoD-binding sites in DBTSS. The z-score !10 is selected as a cutoff to designate any position as a preferred location of the binding site of factors with respect to MyoD. As in our approach, we do not consider positional bins, the observed and the expected counts are small. Therefore, in addition to the z-score, we have performed the 'exact binomial test' to determine the preferred position. The R package is used to calculate the exact binomial P-value (36).

RESULTS
In our study, we used the information from the TRANSFAC database focusing on de novo discovery of the CRMs, based on the mutual positioning of TF-binding motifs. We confined our study to the pairwise combinations of the 721 human TFs with a key TF controlling the process of myogenesis, MyoD (37). We mapped the binding sites of MyoD and other factors from TRANSFAC in the human promoter sequences from the DBTSSs (http://dbtss.hgc.jp/) (31) as well as in experimentally identified MyoD-binding sites (27). From these computationally mapped binding sites, we evaluated the distribution of all the factors with respect to their mutual occurrence and distance between them.

Factors co-occurring with MyoD in human promoters
We have considered all 721 TRANSFAC frequency tables related to human TFs. We calculated PWMs and their respective false-discovery thresholds. These new PWMs were used to search for the similar motifs from 32 042 human promoter sequences (from À1000 to 201 bp around the TSS) and also in the MyoD bound ChIP experiment sequences (27).
We identified the factors having close positional association with MyoD in the range of ±100 bp following earlier studies (see later in the text).
To find the differences from the background distribution, we mapped the same PWMs in the shuffled sequences. Then, we determined how many times each of the factors is found together with MyoD in both the data sets. Wasserman and Fickett (23) found that in case of cis-regulatory elements, most of respective binding sites are positioned within 100 bp from each other. As in our approach, we are looking on the binding sites for only a pair of factors at a time, we considered the distance between them to be at most 100 bp, and therefore, we analyzed only factors found within this interval from MyoD. A straightforward mechanistic model proposed by Teif et al. (38) and based on the experimental study of Drosophila embryonic development by Fakhouri et al. (39) explained a possible reason of the preferred distance between BS for a repressor/activator transcription regulation in synthetic enhancers. They proposed a quantitative description of the nucleosome-dependent regulation of the gene expression at short genomic distances. They have showed the preferred distance between the adjacent functional TFBS to be 50-60 bp, which is mediated by nucleosome and TF interactions. Our interval 100 bp covers TF-TF interactions from (37) and also searches for TF-TF interactions in the adjacent area. We also calculated number of TF-TF interactions for various interval lengths to check dependence of the discovered here effects of the interval length.
We calculated the z-score (difference in TF occurrence frequencies in promoter sequences and randomized sequences, in units of standard deviations for the former) for each TF in combination with MyoD and selected those factors that have z-score values above 3. We obtained a large number of TFs with higher z-scores, and the full distribution of the z-scores follows a normal distribution (Supplementary Figure S1).
TF-binding information in TRANSFAC is redundant: some factors are represented by more than one matrix, and certain matrices for different factors are extremely similar. Some of this reflects a biological reality that different proteins can bind related sequences, and some of it is technical and stems from the fact that separate studies have reported independently the binding site preferences of a given factor. Numerous studies have been conducted to address this by comparing and clustering PWMs according to their similarity (40)(41)(42). A similar study specifically performed for muscle-specific factors showed that MyoD can be grouped with E47, E12, E2A, myogenin, ABA-responsive element binding factor 6 (AREB6) and Lmo2 based on their matrix similarity (42). According to the previous studies, E-proteins have been shown to dimerize with MyoD to bind to DNA together (27,37). Certainly, in our study, we too found these factors positioned overlapping with MyoD in a significant number of promoters. We therefore should check whether the 'cooccurrence' of these TFs with MyoD is real or simply a consequence of redundancy of their binding sites. TFs for which the PWM is similar to that of MyoD (e.g. those for E-proteins and other bHLH factors, see Supplementary   Table S1) fall at the same place as MyoD itself. These TFs were removed from the subsequent analyses.
The factors found to have significant co-occurrence with MyoD but are not identical to the MyoD motif pattern are listed in the Tables 1-4. We sorted these factors according to the number of promoters in which they occur together with MyoD (Tables 1-4). Tables 1-3 and  Supplementary Table S1 include the factors co-occurring with MyoD in >500 promoters. Factors found cooccurring with MyoD in <500 promoters are reported in The table summarizes the top significantly co-occurring factors with MyoD in >500 promoters and are not identical to the MyoD motif pattern. The positions in the last right column are identified to be significant after comparing with the background; only those positions that have z-score above 10 and P-value below 0.005 were selected. The table summarizes the top significantly co-occurring factors with MyoD in >500 promoters and are not identical to the MyoD motif pattern. The positions in the last right column are identified to be significant after comparing with the background; only those positions that have z-score above 10 and P-value below 0.005 were selected.   Table 4. The number of factors co-occurring with MyoD in >500 promoters and having z-score >3 was found to be 48 of 721, and association of the majority of these factors with MyoD was confirmed to be essential for the process of myogenesis (see later in the text).
We have investigated the dependence of the number of MyoD-TF pairs on the length of the region. The result is demonstrated in Figure 2. The figure shows the difference in the number of hits when the distance range is changed to 200 bp (black bar), 500 bp (dark gray bar) and 1000 bp The table summarizes the top significantly co-occurring factors with MyoD in >500 promoters and are not identical to the MyoD motif pattern. The positions in the fifth column are identified to be significant after comparing with the background; only those positions that have z-score above 10 and P-value below 0.005 were selected.  (light gray bar) keeping MyoD-binding site at the center for the factors selected from the Tables 1-3 and  Supplementary Table S1. From the figure, we can see that the differences in the number of occurrences vary for the particular factors in different distance ranges. After investigating the cause of such variations, we found that the length of the motifs seems to be a primary factor. For example, when the range is increased from 200 to 1000 bp, the number of hits/occurrences for Kid3 and ZNF333 also increased. This is expected, as the co-occurrence of Kid3 and ZNF333 with MyoD is relatively high because of the short motif length. However, for factor E2A, the difference of number of occurrences is small, which means that it is less likely that we would get high number of occurrences if the range is increased. The motif of the factor E2A is similar to that of the MyoD. This kind of distribution is similar with other E-box proteins as well. The lesser number of hits of these factors after increasing the distance/range implies that MyoD like E-box motifs are locally concentrated around the MyoD. A lesser increase in the occurrence of these E-box factors binding sites (similar to that of MyoD BS) can also be a consequence of the fact that the MyoD binds at the E-box binding sites as described by Tapscott (27). For the factor Meis, the difference in the occurrence number is less with the increased range, which is expected, as the factor is functionally associated with MyoD and hence may be also closely situated on the DNA sequence. However, the distribution/occurrence of the other factors known to be involved in the process of myogenesis, like AML1 and MAZ rises significantly with the increase in range of distance from MyoD-binding sites, which is unexpected. For factors such as PKNOX and PREP1, the number of occurrences does not increase with the increase of the range. This indicates that PKNOX and PREP1 BSs prefers to co-localize with MyoD BS these factors were not reported to be associated with the myogenesis or to function with MyoD. However, this observation indicates that the localized distribution of the BSs similar to that of MyoD BS may explain the fact that MyoD or MyoD like BS are not distributed in wider genomic regions, rather they are concentrated at certain regions of the genome.
Among the top 48 factors listed in the Tables 1-3 and  Supplementary Table S1, 23 are reported in previous studies to have some activity in muscle development or some interactions with MyoD or are associated together in the promoter area of genes. For example, MyoD activates the mouse MafB promoters (51). E-protein HEB is one of the primary E-proteins to regulate skeletal muscle differentiation as per the findings of (52). Recently, the sequential association between MyoD, myogenin, Myf5 and HEB has been established by (53). TEF-1 from the family of TEAD TFs (54) was found to regulate tissuespecific gene expression in muscle and placenta (55,56). Pitx2 is an upstream activator of extraocular myogenesis and survival (57).
Apart from the factors previously determined to function with MyoD, we observed the association of MyoD with other factors not specific to muscle or involved with myogenesis co-occurring with MyoD in a significant number of promoters, e.g. PREP1, NFAT1, Ikaros, Lyf-1, Sterol regulatory element binding proteins (SREBP), AREB6 and Pax4. Though not well established, some indication of association of some of these factors with MyoD in some biological process can be found in previous literature. For instance, (58) indicated the co-occurrence of MyoD and Ikaros in the proximal 1.5 kb region of genes encoding melanin-concentrating hormone receptor. A previous study by (59) has showed that activation and over-expression of PPARg promote adipogenic conversion of myoblasts. The functional interaction between MyoD and T3R in regulation of avian myoblast differentiation is shown in (60). Krox-like binding sites along with MyoD-like binding sites are present in myoblast-specific domain of muscle-specific enhancer (61). Deletion and site-directed mutation experiments demonstrated that at least 2 Krox-like sequences are required for enhancer activity in myoblasts (62). Other works (27,(63)(64)(65) also showed that MyoD binding overlaps with various other TFs, although the overlap is not systematic: the binding sites of some TFs such as E2F, SRF or NRSF tend not to co-occur with those of MyoD (Supplementary Table S2).
Though NFAT belongs to the family of nuclear factors of activated T-cells, we found this factor to have the significant preference to locate at 39 bp upstream from MyoD. NFAT signaling is required for primary myogenesis by transcriptional cooperation with MyoD (66). Involvement of MyoD in glucose metabolism has been reviewed by (67). They indicated other factors involved in this mechanism along with MyoD such as MEF2A, SREBP, C/EBP and NF-1 in insulin-mediated GLUT4 gene expression, which belongs to the glucose transport family that is expressed in the muscle adipose tissue and heart. In our study too, we found that these factors have a significant specific spacing with regards to the MyoD-binding site in a large number of promoters. The substantial occurrence of the binding sites of these factors within a close proximity around MyoD may have biological significance.
Provided that the MyoD BSs are GC-rich, it is more likely that we would obtain more MyoD BSs in GC-rich regions than elsewhere. It is also possible that we would discover there the enrichment of other GC-rich TFBS cooccurring with MyoD BSs. This is also reflected in the factors in Tables 1-4. To determine whether the association of the factors is contributed by the GC-content biases, we have partitioned the DBTSS promoters into CpG island containing (CpG+) and non-CpG island containing (CpGÀ) promoters using program Promoter Classifier (68). In these partitioned promoter sequences, we have analyzed the co-occurrence of the TFs with MyoD. The result is presented in the Supplementary  Table S3. From the Table it is clear that the proportion of the co-occurring factors in the whole DBTSS database and the partitioned database are similar. For example, proportion of co-occurrence of myogenin with MyoD is found to be 0.12565 (promoters found to have MyoDbinding sites in the DBTSS database divided by total number of promoters in the database) in CpG+promoters and 0.11407 in CpGÀ and 0.11231 in complete DBTSS. Though we can see the slightly higher proportion in CpG+ as compared with the CpGÀ promoters, which may be contributed by the GC-rich promoter effect, overall similar proportions show that promoter's GC content is not significantly affecting the co-occurrence of these factors. However, the GC content might be responsible for the distribution pattern of the factors having relatively short motif length ( Figure 3D), where the factors are almost evenly distributed around MyoD and do not have any preferred location with respect to the position of MyoD.

Mutual positioning of factors with respect to MyoD
In addition to the screening of the factors co-occurring with MyoD at some average distance, we investigated the individual distribution of each factor around MyoD. In this analysis, we again aligned promoters with respect to MyoD and calculated the actual occurrence distribution of each factor separately for all factors listed in the Tables 1-3 and Supplementary Table S1. In all promoters, we calculated the number of occurrences of these TFs within ±100 bp of MyoD-binding site and used our false-discovery estimation method to calculate a z-score for each TF to determine whether the distributions are significantly different.
The factors found to be overlapping with MyoD in the previous section have similar observed and background distributions but are remarkably over-represented in the DBTSS promoter sequences ( Figure 3A). The figure also shows the complete distribution of MyoD with itself. However, AREB6 and E47 despite having similar peak in the overlapping area also have another peak upstream and downstream ( Figure 3C). These factors (AREB6 and E47) have a number of occurrences similar to the background in the overlapping area; yet, they rather have outnumbered the background occurrence in further upstream and downstream areas. The peak in the overlapping area may be caused by the similar motif pattern between MyoD and AREB6 and E47; however, the peaks in the upstream and downstream positions might indicate the preferred positioning of these factors with respect to the MyoD.

Positional preferences of the factors with respect to MyoD
From this analysis, we have determined the significant positional preference of occurrence of each factor with respect to MyoD. Depending on the calculated z-score, we selected those positions that have a z-score above 10 and P-value <0.005. These positions are highly significant and are represented in the column 5 of Tables 1-4.
We observed that some factors show remarkable differences in the distribution and show distinct positional preference. The positional preference can be seen where a factor at a specific distance from MyoD is found in a large number of promoters. Among the 48 factors selected in the previous section with P < 0.005, 43 show preferences (listed in Tables 1-3 and Supplementary Table  S1). In all, 17 of 43 show the overlapping positional preferences. For the majority of these factors overlapping co-occurrence is confirmed in previous studies except PREP1, PKNOX2, AP-4, LBP-1 and Lmo2. However, some of the factors did not exhibit such significant positional preference, and these factors are found to be evenly distributed around MyoD-binding motifs. The preferred locations for factors like Meis, NFAT1, E-box proteins are found to be precise, whereas no preferred locations for factors like AML1a, MEF-2 are found in close proximity of MyoD. These factors are scattered around MyoD-binding sites.
Other factors found to co-occur significantly in promoters with MyoD but lacking any preferential positions with z-score 10 are listed in Supplementary Table S4. Among them, some are found to have biological association with MyoD. CCCTC-binding factor (CTCF) is found cooccurring with MyoD, and the association is recently described by (69). They found that CTCF enhanced myogenic differentiation by directly interacting with myogenic regulatory factors like MyoD and myogenin.

Patterns of distribution of particular factors around MyoD
From the individual distribution of each factor at each position in the range of ±100 bp around MyoD, we can observe some differences in the distribution patterns. Occurrence of some of the factors is found to be higher in surrounding regions besides the overlapping region. This is expected, as they have no similarity in their motif pattern but are present in abundance upstream and downstream of the MyoD. For example, AML1a, which shows some preferential distance from MyoD at ±90 bp, is highly represented in upstream and downstream areas ( Figure 3D). This result might indicate that MyoD has a higher affinity for sequences where AML1a sites are abundant and possibly where AML1 (Runx1) is bound. This would not be fortuitous, as this protein has been shown to bind directly to MyoD in myoblasts (43). Other factors exhibiting this kind of distribution are ELF1, TTF-1, MAZ, MEF-2, Lyf-1, p300, MZF1, ZID, Pax-6, KROX and Nanog. Thus, in addition to the preference for flanking E-box sequences as suggested by (27), MyoD may also prefer binding to the sequence/location enriched with these binding sites.
From these observations, four distinct groups of TFs depending on their positional distribution with respect to MyoD can be seen. Group 1: factors found to highly overlap with the MyoD. Binding motifs of factors in this group closely resemble MyoD; therefore, they have a single peak overlapping with MyoD, and their discovery may be trivial. Representatives of this group are often E-proteins or other classes of bHLH factors: E2A, Myogenin, E12, TAL1, Ebox, Lmo2, NeuroD, LBP-1, Tal-1alpha, E47, AP-4, HEB, PKNOX, PREP1 and Meis2 ( Figure 3A). Group 2: factors with single occurrence peak apart from MyoD like E2F ( Figure 3B). Group 3: factors with several distinct peaks upstream and downstream of MyoD ( Figure 3C), for example: Ikaros, Lyf-1, PPARG, T3R, Pitx2, ARP-1, AREB6, SREBP and Pax-4. Some of the factors in this group are zinc-finger proteins and largely take part in organ development, morphogenesis and also in metabolism. Group 4: factors in this group are broadly distributed upstream and downstream of MyoD with significant representation of zinc-finger proteins in this group. Factors in this group are ELF1, ZNF333, NFAT1, Churchill, AML1, MAFB, MAZ, ETS2 and CKROX. These factors are largely involved in transcription regulation and immune system. Many of these factors, though not all of them, are also found to have GC-rich motifs. There is no particular or distinct preferred position for these factors to be found upstream or downstream ( Figure 3D). The abundance of occurrence of GC-rich motifs in the DBTSS human promoters is expected as 72% of the human genome promoters are GC-rich (70). Even as observed in Figure 2, the occurrence of these factors in varied window length (200, 500 and 1000 bp) increases, which implies their general abundance in the promoters. However, the occurrence of the factors of Groups 1 and 3 does not increase for the larger interval length (Figure 2).

Expression of associated factors in muscle tissue
From our analysis, we have found that some of the nonmuscle specific factors co-occur with MyoD in significant number of promoters (Table 3). Now the obvious question would be if these factors are at all expressed in the muscle cell environment. To determine this, we have checked the expression profile of these factors in the previously published expression microarray data (71) from a time course of C2C12 mouse myoblast differentiation. The result is summarized in the Table 3. The last column in the table is marked as 'Yes' if we detect any expression in the C2C12 cells and 'No' otherwise. From these data, we could see that many ($60%) of the novel factors found in our study is detectably expressed in the muscle cell environment. This may imply that these binding sites that are in close proximity to MyoD have some important biological meaning yet to be identified. They also may function with MyoD in the process of myogenesis. The other factors with no detectable expression in C2C12 cells have no significant biological importance in the muscle cell environment.

Association of factors with MyoD in ChIP-seq experiments
Recently, Cao et al. (27) used ChIP-sequencing to identify genome wide binding sites of MyoD in mouse muscle cells. MyoD targets in undifferentiated myoblast and in differentiated myotubes were reported. To take advantage of these binding sites and to validate the findings of our study, we have mapped all the PWMs specified for human from TRANSFAC used in our analysis, in these MyoDbound sequences, with the same constraints like the cutoff and the survey region ±100 bp around the site of MyoD binding.

Similarity in preferences for factors association in myoblast and myotubes
The distribution of MyoD with factors other than E-proteins is similar in both the surveyed data sets (promoters from DBTSS and the ChIP-Seq bound sequences). Factors primarily mentioned by (27) like AP1, Meis and Sp1 are also found with MyoD in a large number of promoters in both these data sets. Table 5 lists the factors associated with MyoD in large number of promoters in both myotubes and myoblasts. However, the number of hits in the myotubes is significantly higher than that of myoblast sequences in most of the cases except few factors like AP-1 and FRA1 (Table  5). This suggests that more genes might be activated by MyoD in association with these factors (Table 5) during or after differentiation. This can be seen in the fifth column of the table.
As observed in the previous analysis, here too, we found the preferred association of MyoD with E-boxes. However, we also detected preferences for some of the factors other than E-box proteins that were not reported to function together with MyoD in the process of myogenesis in the previous studies. These factors are shown in the Table 6. We have also detected the preferred mutual position of these factors with respect to MyoD in both myoblasts and myotubes. The columns six and seven represent the preferred significant positions in myoblasts and myotubes, respectively. These factors are also found in large number of promoters associated with MyoD in these sequences. Some of the factors found in our previous analysis are also detected here, e.g. PKNOX, TGIF, MAFB, TBX5.

Differences in preferences for factors association in proximal promoters and enhancers
We used the ChIP-Seq data (27), along with the mouse genome annotation, to identify MyoD-binding sites (from the myoblasts and myotubes data sets combined) that lie within proximal promoters (from 1 kb upstream to 0.2 kb downstream of the TSS), and within distal promoters (from 10 kb upstream to 1 kb upstream). To complement these sets of sequences, we also retrieved proximal promoters and enhancers elsewhere in the genome that are not bound by MyoD. Thus, we have collected four sets of DNA sequences for this analysis. Analysing them, we observed some remarkable differences between the proximal and distal binding regions in terms of preferred factors.
Binding sites of many factors not involved in myogenesis or previously established to be associated with MyoD (like CDP, Evi, Oct, RORalpha, SRY, E2F) are not detected in a considerable numbers in the close proximity of MyoD in the ChIP-Seq bound sequences (27). Instead, these factors' binding sites are enriched in sequences not bound by MyoD. This may signify that the binding of these factors in close proximity to MyoD is restricted by the innate properties of the MyoD-bound sequences. Another, non-mutually exclusive possible interpretation is that these factors in MyoD-unbound sequences either block or restrict the binding of MyoD around their binding sites.
In our analysis, we found factors like Meis1, E47, AP-4 to be associated to the predicted MyoD-binding sites in both bound and unbound sequences. In the context of myogenesis, this combination is functional; therefore, we can expect the association in the MyoD bound sequences. However, the occurrence of the association of these factors with MyoD in the unbound sequences is unexpected. This might occur because of the false discovery of the binding sites with computational methods.
From this investigation, it is clear that the binding of MyoD to its binding sites is associated with the presence or absence of binding sites for other factors. The relationship of these factors' binding sites can be taken into account to distinguish or discriminate the functional MyoD-binding sites from the non-functional ones. For example, considering the limited binding sites of the aforementioned factors like Oct, CDP and so forth around MyoD-binding sites can enhance the discrimination of the functional MyoD-binding sites from that of nonfunctional ones. Further, we have analyzed the association The factors having high number of occurrence in myoblasts and myotubes are selected in this table. The z-score is calculated for the occurrence of these selected factors with MyoD in myoblasts versus myotubes. of MyoD with other factors with respect to the TSS, i.e. in proximal and distal bound sequences. As we have both the bound and not bound sequences, we could calculate the enrichment score for each of the factors if they are differentially bound in either of these regions. The enrichment score for each factor is calculated by dividing the number of promoters having the factor cooccurring with MyoD in the bound sequences by the number of promoters having the same factor co-occurring with MyoD in unbound sequences. Likewise, from these bound and not bound sequences, we measured the significance in terms of z-score. We regard the number of promoters having a factor co-occurring with MyoD in bound sequences as the observed occurrence and the number of promoters having the same factor co-occurring with other E-box binding factors in unbound sequences as background occurrence.
Thus, we obtained enrichment score and z-score for each factor. Based on these scores, we rank each of the factors in ascending order. Thus, we have two lists of factors, sorted in ascending order with respect to the scores. This analysis is done for both the proximal sequences and for distal sequences separately.
We selected the top ranked factors from both the proximal and distal sequences; those that have congruent ranks with both scoring methods (rank differences of 10 or less) are reported in Tables 7 and 8. From this analysis, we have seen a difference in the preference of factors around MyoD-binding sites in proximal and distal bound regions. In proximal bound sequences, the factors enriched are AP-2, CREB, USF, ETF and E2F (Table 7). However, in the enhancer, the top ranking factors are found to be E-box proteins, such as AP-4, LBP-1 and HEN1 (Table 8). The other factors enriched  The table shows the top ranked factors from the proximal promoters and those that have congruent ranks in both scoring schemes (enrichment score and the z-score of co-occurrence of TFBS with MyoD). The factors in the table are sorted in ascending order of the scores. The factors are selected based on top rank (above 100) and the differences between the ranks of the scoring schemes by 10 or less.
in the bound enhancers are NRSF, SZF1-1, SP1 and so forth ( Table 8). The observations from this analysis have the potential to help make better predictions of functional MyoD binding sites.

DISCUSSION
As mentioned formerly, the mapping of the location of TFBSs is error prone, but the inclusion or modeling of additional type of information has the potential to mitigate this weakness. Additional information can be the accompanying factors with the MyoD-binding sites. For instance, the affinity or aversion of MyoD to bind to certain sequences depends on the surrounding factors. As indicated previously by (27), the co-occurrence of E-box proteins can be used to improve the identification of functional MyoD-binding sites from the genomic sequences. Likewise, in this study, we have identified factors that can also be included into the model that will further strengthen the efficiency to discriminate the functional binding sites from that of the nonfunctional ones. We found out the distribution of the factors with respect to their mutual occurrence and the distance between them. From this in silico information, we calculated the typical pair-wise distance between each two TFs and extracted the combinations of TFs within certain interval with respect to one particular TF along the promoters of the human genome. This analysis is further extended to determine whether there is any preferential positioning between the selected combinations.
In this present work, we obtained the relationship of binding sites of different factors with respect to MyoD in terms of the co-occurrence and mutual positioning. The distribution of the aforementioned factor's binding sites around MyoD is non-trivial and as indicated in previous studies, some of them are well established for their cooperation with MyoD during myogenesis. Most factors having important role in myogenesis and recruited together with MyoD are found to be present within the range of ±100 bp.
Recently Guo et al. (72) have developed a method named GEM to determine the pair-wise spatial binding constrain of TFs in vivo. In their study, they have also focused on the aspect of the appropriate spacing between the TFBS. However, their method is different from our approach. In Guo et al.'s approach, they first computationally discover the motifs for TFs whose binding motifs are not available in public databases. In our approach, we have taken the TF-binding motifs derived experimentally and determined the functional binding sites on the sequences from the ChIP experiment for specific TF to determine the mutual positioning among the associated factors.
In addition to these, we found some other TFBSs cooccurring distinctly in a vicinity of MyoD. The presence of the other non-myogenic or not muscle-specific factors around MyoD may explain the involvement of regulation of a large number of genes by MyoD. Positional preference of MyoD with certain factors with different tissue distributions may be related to the regulation of many genes by MyoD in a different tissue-specific manner. For example, in our analysis, we detected Egr having a preferential position with the MyoD in both the data sets (human promoters from DBTSS and ChIP-Seq MyoD bound sequences). The expression of certain members of the Egr gene family is also detected in the C2C12 cells (71), like Egr1-3. This suggests a role of Egr proteins in muscle development. Furthermore, from the expression profiling The table shows the top ranked factors from the enhancers and those that have congruent ranks in both scoring schemes (enrichment score and the zscore of co-occurrence of TFBS with MyoD). The factors in the table are sorted in ascending order of the scores. The factors are selected based on top rank (above 100) and the differences between the ranks of the scoring schemes by 10 or less.
data, we could see that Egr1, Egr2 and Egr3 gene expression is upregulated during myoblast differentiation, suggesting a possible role of these proteins in cooperating with MyoD. In the partitioned ChIP-Seq data (27), the binding motif for this factor is found to be enriched in the proximal promoters. Similarly, the factors playing important roles in nonmuscle tissues like orphan steroid receptor ARP-1 and Ikaros TF (found to play critical functions in the control of lymphohematopoesis and immune regulation) are found to be mutually positioned with MyoD and also found to be expressed in C2C12 cells (71). These interactions may suggest that some TFs not related to muscle development also cooperate with MyoD or, more likely, with factors with MyoD-like binding patterns.
In our comparative co-occurring TF in myoblast and myotubes, we observed that certain TF prefers to bind at close proximity to MyoD in myoblast but also at distance in myotubes. For example, SMAD3 in Table 6, in myoblast, the mutual position is at ±5 bp, but in myotubes, the preferred mutual position is stretched up to 60 bp upstream and 86 bp downstream along with the close mutual preferred positions. We have no explanation to this phenomenon. This indicates that there are many targets of MyoD that are myoblast or myotubes specific. It could be that the circumstances are such in the condition-specific targets that the optimal binding distance for the partners of MyoD is different.
Apart from this, there is another interesting observation that can be found from Table 6. For instance, the BSs for NERF1a in myoblast are mostly in downstream of the MyoD BS, whereas the BSs of NERF1a in myotubes are mostly in upstream. However, the distance from MyoD is almost conserved in myoblast and myotubes. Only the direction of the preferred position is reversed. This effect can also be seen with NF-E2. The preferred BSs NF-E2 in myoblast is mostly in upstream, whereas in myotubes, the preferred BSs are mostly distributed in downstream.
As observed from our analysis, the binding sites of the various factors are distributed symmetrically around MyoD. As we included both the strands in our study, the question may be raised whether the DNA strand has any effect on the distribution of the TFBSs with respect to the MyoD-binding sites. To examine this, we mapped the TFs from the Table 1-5 in the single strand of the promoters. Even in this analysis, we could see the positional preference of BSs with respect to the MyoD BS in addition to the overlapping preferences. However, the peak of the distribution is reduced in the single strand analysis, which is expected. But some of the factors, which have positional preferences in positions both upstream and downstream of MyoD in both strand analysis, are found to have preferences only in upstream or downstream direction when they are mapped in a single strand. This means that one of the preferred positions is contributed by the other strand. This may also imply that these binding sites have strand specificity in addition to the positional preferences.
In our analysis, we found the occurrence of NeuroD (primarily regulates neuronal differentiation) in the overlapping position with MyoD like that of the E-protein. MyoD dimerizes with E proteins; however, MyoD does not dimerize with NeuroD, and thus the occurrence of NeuroD in the overlapping position with MyoD is merely because of the motif pattern CANNTG. The PWM for NeuroD in our analysis from TRANSFAC has the consensus motif CAGCTG, which is similar to the MyoD motif from TRANSFAC, CACCTG.
As mentioned earlier, the detection of functional binding sites is error prone, which can lead to MyoD being mistakenly identified as E-box proteins or vice versa. However, incorporating or considering the additional information like the presence and absence of certain factors in close proximity of the factor of interest of study can increase the efficiency of the binding site identification algorithm. For example, if the MyoDbinding site is detected close to factors like E-box proteins or AML1 and not in close proximity to the factors like CDP or GR, the detected binding site can be regarded as functional if such criteria are implemented.
In this study, we could find many of the factors associated with MyoD; however, this kind of study is still limited by the paucity of binding site information for many TFs. The site-specific information for many factors is not yet available in the library of TFs like TRANSFAC.
For instance, from previous studies, it is known that Pax7-mediated activation of MyoD specifies the population of muscle stem cells that enter the differentiation program (73,74). In our analysis, the occurrence of Pax6 and Pax4 instead of Pax7/Pax3 motif could be falsepositive discovery because of their similar motif pattern as defined in the TRANSFAC database. Therefore, we cannot deny the discovery of the false-binding sites when we consider the TFBSs database, which is not mature enough. Nevertheless, based on the existing site-specific information and the statistical analysis of over-representation of some factors with MyoD, we could establish the association of some factors, which were not previously studied, and these factors can be used to discriminate between the functional and the non-functional MyoDbinding sites. This study also emphasizes the preference of a factor binding sites with respect to the TSS, which can be used to build a model to help in identifying real binding sites in the genome.