Romulus: robust multi-state identification of transcription factor binding sites from DNase-seq data

Motivation: Computational prediction of transcription factor (TF) binding sites in the genome remains a challenging task. Here, we present Romulus, a novel computational method for identifying individual TF binding sites from genome sequence information and cell-type–specific experimental data, such as DNase-seq. It combines the strengths of previous approaches, and improves robustness by reducing the number of free parameters in the model by an order of magnitude. Results: We show that Romulus significantly outperforms existing methods across three sources of DNase-seq data, by assessing the performance of these tools against ChIP-seq profiles. The difference was particularly significant when applied to binding site prediction for low-information-content motifs. Our method is capable of inferring multiple binding modes for a single TF, which differ in their DNase I cut profile. Finally, using the model learned by Romulus and ChIP-seq data, we introduce Binding in Closed Chromatin (BCC) as a quantitative measure of TF pioneer factor activity. Uniquely, our measure quantifies a defining feature of pioneer factors, namely their ability to bind closed chromatin. Availability and Implementation: Romulus is freely available as an R package at http://github.com/ajank/Romulus. Contact: ajank@mimuw.edu.pl Supplementary information: Supplementary data are available at Bioinformatics online.


Supplementary Figures
Area under ROC curve  Figure 4. Romulus outperforms the other tools in terms of Spearman correlation coefficient between binding predictions and ChIP-seq peak height. Prediction performance of CEN-TIPEDE, Wellington and Romulus in A549, HepG2 and K562 cells using three sources of DNase-seq data was assessed by applying each tool to calculate the binding probability (CENTIPEDE and Romulus: posterior probability, Wellington: 1−(p-value)). The probabilites less than 0.5 were clamped down to 0, and the Spearman correlation coefficients between these probabilites and ChIP-seq peak height were calculated. Apart from the correlation values for individual TFs, the averages are indicated for each cell line.    Supplementary Figure 9. (A) Logarithm of complete likelihood of Romulus parameters obtained using the default initialization procedure (x axis) compared to the highest complete likelihood obtained in 10 random initializations (y axis). (B) As above, but compared to the difference in complete likelihood between the default initialization and best random initialization. In total, 117 data points are shown in each panel, each point representing a combination of TF, cell type and DNase I data source.  Supplementary Table 2. ChIP-seq datasets used as the reference classification of the candidate binding sites. These datasets were generated by the ENCODE Analysis Working Group (AWG) using a uniform processing pipeline. The narrowPeak filenames follow the pattern "wgEn-codeAwgTfbs. . . UniPk.narrowPeak.gz", where only the changing ". . . " part is given above.

Supplementary Methods 1 Prior probabilities of TF binding
To model the prior probabilities, we apply a logistic model against the unbound "pivot" case of Z i = 0: where k = 0 indicates no binding, k = 1 refers to binding as monomer, and k = 2, . . . , K + 1 refer to the respective cooperative binding modes. This way, we have K + 1 outcomes separately regressed against the pivot outcome Z i = 0. For clarity of presentation, we impose an additional constraint such that γ In other words, β (k) j = 0 for the partner motifs not involved in k-th binding mode. We can now explicitly formulate P (Z i = 0) by summing up Equation 1 for k = 1, . . . , K + 1: Applying the above to Equation 1, we obtain an explicit formulation for all the probabilities P (Z i = k) where k > 0:

Chromatin state component
Our primary interest is i.e. the probability of the motif instance i to be bound in any binding mode.
Taking the complement and following the Bayes' theorem, we get We make a simplifying assumption that all the chromatin state data included in the model are independent, given its binding state Z i . Hence, the conditional probability P (X i | Z i = k) is a product of the corresponding conditional probabilities: For brevity, we discuss the formulas for the forward strand DNase I component only; they are analogous for the reverse strand and for other types of data. The negative binomial component in binding mode k quantifies the total number of DNase I cuts on the forward strand and is naturally parametrized by the success probability p +(k) ∈ (0, 1) and the real-valued number of failures r +(k) > 0. The multinomial component quantifies the probability of a particular spatial distribution of the total number of DNase I cuts on a given strand. For each binding mode k and positional data type (e.g. DNase I cuts on forward strand), we divide the positions j into one or more bins. Let us denote by DNaseBin +(k) j the bin number for position j in binding mode k. For clarity, let us assume that the bins are numbered by positive integers. In this study, we take 20 bp long bins outside the motif site, and single-basepair bins within the motif site. Moreover, for the unbound mode (k = 0) we put all the positions in a single bin: for k = 0 and any j j/20 for k > 0 and j = 1, . . . , 200 190 − j for k > 0 and j = 201, . . . , 200 + L.
Note that binding modes may differ in the way the positions are split into bins.
For a given binding mode k, we associate a free parameter λ with each bin b = 1, . . . , B +(k) . However, for the multinomial distribution we must provide a vector of probabilities covering every single position in the vicinity of the motif instance. Hence, we calculate the actual multinomial coefficientsλ By definition, the multinomial coefficientsλ +(0) j for the unbound state are equal, i.e. there is no positional preference for DNase I cuts in the null model.
The joint probability of the DNase I positional data is obtained by the superposition of the negative binomial and multinomial components: Now we can explicitly formulate the probabilities: where Γ is the standard gamma function, i.e. a continuous extension of the factorial function.

Expectation-Maximization approach
To estimate the model parameters we apply the Expectation-Maximization approach. We use a common technique: instead of maximizing the likelihood function with unknown latent state, we maximize the complete likelihood function which is more tractable.
The complete likelihood function, as stated above, is defined only for Z i = 0, . . . , K + 1. However, we may rewrite it using indicator functions Z Let us denote by Z Taking the expected value of L C (Θ) with respect to all Z (k) i , we obtain a real-domain function of Θ: The formulas will easier to manipulate after taking the logarithm: . (21) Our goal is to maximize the (log-transformed) expected value of the complete likelihood function L C . Note that the value of the first component, L A , depends on (p +(k) ) k , (p −(k) ) k , (r +(k) ) k , (r −(k) ) k , (λ ) b,k , while the value of the second component, L B , depends only on the parameters in Θ not listed previously, namely on (β (k) j ) j,k . Therefore, we can maximize L A and L B separately.
We found no closed-form solution for β (k) j that maximizes L B (Θ), hence we apply the Broyden-Fletcher-Goldfarb-Shanno (BFGS) numerical optimization procedure here. This method uses the function values and gradients to build up a representation of the surface to be maximized. Substituting Equation 5 to the definition of L B , we get: Note that the last factor, , is equal to 1 and may thus be omitted. Differentiating L B with respect to β (k) j , we get: Now let us focus on the other component of log L C (Θ) , i.e. L A . For clarity, let us assume that DNase-seq data is the only kind of positional data provided. The derivations follow analogously for any other independent positional datasets. Substituting Equations 10 to the definition of L B in Equation 21, we get: . (24) The two components, L + A and L − A , depend on distinct sets of parameters in the same manner. Hence, we can maximize them separately. Without loss of generality, we will discuss the optimization procedure for L + A . From Equations 13 to 15, we have: Eliminating the additive inverse terms and noting that K+1 k=0 Z (k) i = 1, we get: Note that only the first three summands depend on (r +(k) ) k , only the third and fourth depends on (p +(k) ) k , and only the fifth depends on the parameters (λ Hence, we may find the values of (λ ) b,k that maximize L A independently of the other parameters. We need to maximize subject to the constraint jλ +(k) j = 1 for each k. Since k-th element in the sum above depends only on λ +(k) j j and consequently only on (λ +(k) j ) j , we can maximize each element of the sum independently. We use a common technique, and for a given k maximize the expression where L is called the Lagrange multiplier. Let us recall that the multinomial coefficientsλ Differentiating Formula 28 with respect to λ and setting the derivative equal to zero, we get: Note that the above is a decreasing function of λ , hence we capture a local maximum here. Hence, and summing this equation over all b = 1, . . . , B +(k) , we get We should now note that Now Equation 32 becomes and substituting the above into Equation 31, we obtain the desired solution: To increase the robustness of the model, we employ a shrinkage estimator of the parameters (λ the initial values of the prior likelihoods. Our choice of the procedure was motivated by the fact that the total number of DNase-seq cuts in the vicinity is a very simple and reasonably accurate predictor for the motif instance to be bound. We expect that other procedures may perform comparably well, and to test this hypothesis we tried initializing the algorithm randomly, by putting P (Z i = 1)/P (Z i = 0) = 100 for a random 10% of motif instances. The random initialization performed surprisingly well in terms of AUC-PR and AUC-ROC when compared to the default one (Supplementary Figure 9). After applying the random initialization 5 times for each combination of TF, cell type and DNase I data source, we concluded that the random procedure missed the maximum found by the default procedure in 2.7% of the cases. Moreover, in no case the random procedure outperformed the default one, confirming the robustness of the latter. We iterate the Expectation-Maximization procedure, in each iteration getting a revised vector of parameters Θ t , until the posterior probabilities do not change by more than 0.001, i.e. max i,k In most of the cases described here, the algorithm converged in less that 30 iterations.

Correlation between DNase I accessibility and motif score
The correlation was calculated for each TF motif, DNase-seq data source and cell type. We considered 500 bp long genomic windows, starting each 50 bp. For each window, we calculated the total number of DNase I reads mapped within the window, as a measure of DNase I accessibility. We also took the highest motif score for a genomic sequence within the window as the motif score for the whole window.
To allow for a balanced comparison between DNase I accessibility and motif score, we took all the windows overlapping the ChIP-seq peaks for the given TF, and additionally an equal number of randomly chosen windows not overlapping such a peak. Within these windows, we calculated the Spearman correlation coefficient between DNase I accessibility and motif score (Supplementary Figure 10). We found no clear trend between these correlation coefficients and Binding in Closed Chromatin (BCC) values (Supplementary Figure 10). This was also the case when we considered the correlation calculated within all the genomic windows, or only within the windows overlapping the ChIP-seq peaks.