-
PDF
- Split View
-
Views
-
Cite
Cite
Yanchun Bao, Veronica Vinciotti, Ernst Wit, Peter A. C. 't Hoen, Joint modeling of ChIP-seq data via a Markov random field model, Biostatistics, Volume 15, Issue 2, April 2014, Pages 296–310, https://doi.org/10.1093/biostatistics/kxt047
Close -
Share
Abstract
Chromatin ImmunoPrecipitation-sequencing (ChIP-seq) experiments have now become routine in biology for the detection of protein-binding sites. In this paper, we present a Markov random field model for the joint analysis of multiple ChIP-seq experiments. The proposed model naturally accounts for spatial dependencies in the data, by assuming first-order Markov dependence and, for the large proportion of zero counts, by using zero-inflated mixture distributions. In contrast to all other available implementations, the model allows for the joint modeling of multiple experiments, by incorporating key aspects of the experimental design. In particular, the model uses the information about replicates and about the different antibodies used in the experiments. An extensive simulation study shows a lower false non-discovery rate for the proposed method, compared with existing methods, at the same false discovery rate. Finally, we present an analysis on real data for the detection of histone modifications of two chromatin modifiers from eight ChIP-seq experiments, including technical replicates with different IP efficiencies.
1. Introduction
Chromatin ImmunoPrecipitation-sequencing (ChIP-seq) is a well-known biological technique to detect protein–DNA interactions, DNA methylation, and histone modifications in vivo. ChIP-seq combines ChIP with massively parallel DNA sequencing to identify all DNA binding sites of a transcription factor or genomic regions with certain histone modification marks. The final data produced by the experiment provide the number of DNA fragments in the sample aligned to each location of the genome. From this, the aim of the statistical analysis is to distinguish the truly enriched regions along the genome from the background noise. Whereas conventional transcription factors, which bind directly to the DNA, show sharp peaks at the regions of enrichment, chromatin modifiers tend to have much broader regions of enrichment, and do not follow a peak-like pattern. The latter cannot be captured by standard peak-calling algorithms and require more sophisticated statistical models. This is the focus of the present paper.
As regions of the genome are either bound by the protein in question or not, it is quite natural to analyze such data using a mixture model. Here, the observed counts are assumed to come either from a signal or from a background distribution. A number of methods have adopted this approach, with some differences in the distribution chosen for the mixture. Spyrou and others (2009), in their BayesPeak (BP) R package, adopt a negative binomial (NB) mixture model, with an NB distribution used both for signal and background. Kuan and others (2011), in their MOSAiCS package, adopt a more flexible NB mixture model, where an offset is included in the signal distribution and this distribution itself is taken as a mixture of NBs. Spyrou and others (2009) show evidence that an NB mixture model outperforms a Poisson mixture model, such as the one used by iSeq (Mo, 2012). Qin and others (2010), in their HPeak implementation, suggest to use a zero-inflated Poisson (ZIP) model for the background and a generalized Poisson distribution for the signal. Zero-inflated distributions have been used successfully also for modeling other types of sequencing data (e.g. Dhavala and others, 2010; Van De Wiel and others, 2013). In this paper, we consider the more flexible framework by allowing a ZIP or NB distribution for the background and a Poisson or NB distribution for the signal component.
Another feature of ChIP-seq data on histone modifications is the spatial dependency of counts for neighboring windows along the genome. This is mainly the result of a common pre-processing step, whereby the genome is divided into bins of some ad hoc length. It is quite common to consider fixed-width windows, although dynamic approaches have also been considered (Mo, 2012). The sum of counts within each window is subsequently considered for the analysis. As a result of this, true regions of the genome that are bound by the protein in question could be easily found to cross two or more pre-processed bins. This issue has been addressed in the literature with the use of hidden Markov models (HMMs) (Mo, 2012; Qin and others, 2010; Spyrou and others, 2009).
With few exceptions, the methods developed so far are limited to the analysis of single experiments, with the optional addition of a control experiment. When technical replicates or biological replicates are available, the standard procedure is to perform the peak calling on each individual dataset and then combine the results by retaining the common regions. This process has inherent statistical problems, as pointed out by Bardet and others (2012) and Bao and others (2013). Despite the recognition of the need for biological replicates for ChIP-seq analyses (Tuteja and others, 2009) and despite the fact that several normalization methods have been proposed for multiple ChIP-seq experiments (Bardet and others, 2012; Shao and others, 2012), very few methods have been developed that combine technical and biological replicates at the modeling stage, so that the variability in the data can be properly accounted for. Zeng and others (2013) extend MOSAiCS (Kuan and others, 2011) by developing a mixture model for multiple ChIP-seq datasets: individual models are used to analyze counts for each experiment and a final model is considered to govern the relationship of enrichment among different samples. Bao and others (2013) build mixture models for multiple experiments, where replicates are modeled jointly by an assumption of a shared latent binding profile. They show how such a joint modeling approach leads to a more accurate detection of enriched and differentially bound regions and how it allows one to account for the different IP efficiencies of individual experiments. The latter has probably been the main reason why joint modeling approaches of ChIP-seq data have rarely been considered so far.
In this paper, we combine all the aspects described above into a single model, by proposing a 1D Markov random field (MRF) model for the analysis of multiple ChIP-seq data. Our model can be viewed as an HMM where the initial distribution is a stationary distribution. As such, we follow the existing literature on the use of HMMs for ChIP-seq data in order to account for the spatial dependencies in the data (Mo, 2012; Spyrou and others, 2009). In contrast to the existing HMM-based methods, we propose a joint statistical model for ChIP-seq data, under general experimental designs. In particular, we discuss the case of technical and biological replicates as well as the case of different antibodies and/or IP efficiencies associated to each experiment. The remainder of this paper is organized as follows. In Section 2, we describe the MRF model and its Bayesian Markov Chain Monte Carlo (MCMC) implementation. In Section 3, we perform a simulation study to compare our method with two existing HMM-based methods, BP and iSeq, as well as with the joint mixture model of Bao and others (2013) and that implemented in jMOSAiCS (Zeng and others, 2013). A real data analysis on eight experiments for the detection of histone modifications of two proteins, CREB-binding protein (CBP) and p300, is given in Section 4, where we also compare our results with two widely used methods for ChIP-seq analyses, MACS (Zhang and others, 2008) and CisGenome (Ji and others, 2008). In Section 5, we conclude with a brief discussion.
2. Methods
2.1 A joint latent mixture model and its limitations
The data generated by ChIP-sequencing experiments report the number of aligned DNA fragments in the sample for each position along the genome. Due to noise and the size of the genome, it is common to summarize the raw counts by dividing the genome into consecutive windows, or bins. Since the majority of the genome is expected not to be enriched, we would generally expect that some bins are enriched regions, with a lot of tags, and most other bins are not enriched, containing only a few tags. This scenario is well suited to a mixture model framework.
Since we divide the genome arbitrarily in fixed-size windows, it is possible that a region in a certain chromatin state crosses two or more bins. As a consequence of this, it is reasonable to expect spatial dependencies in the data. Figure 1 (left) plots the bin counts |$Y_m$| for 200 bp fixed windows in a region of the genome, for one ChIP-seq experiment. On the right, the plot shows the posterior probability of enrichment, using the latent (non-Markovian) mixture model described above. This plot clearly shows regions of consecutive enriched bins.
Plots of bin counts |$Y_m$| versus bin |$m$| (left) for bins of size 200 bp on a region of chromosome 1 for the p300T302 experiment, and |$P(X_m=1|Y_m, \hat {\theta })$| versus bin |$m$| (right) under mixture of NB model.
Plots of bin counts |$Y_m$| versus bin |$m$| (left) for bins of size 200 bp on a region of chromosome 1 for the p300T302 experiment, and |$P(X_m=1|Y_m, \hat {\theta })$| versus bin |$m$| (right) under mixture of NB model.
In this paper, we propose a natural extension of the mixture model in (2.1) by allowing first-order Markov dependencies. This is described in the next section.
2.2 A 1D MRF model
2.3 Parameter estimation
2.4 Assuming the same number of binding sites across conditions
The method above can be used in the presence of technical and biological replicates. Whereas technical and biological replicates share the same binding profile X, different proteins will generally have a different binding profile. Under certain situations, e.g. when comparing the binding profiles across two conditions or between highly similar transcription factors, we can assume that the total number of binding sites is the same. Bao and others (2013) show how this assumption can be included in a mixture modeling framework. In this paper, we show how the same assumption can be included also in the proposed MRF mixture model.
In particular, if |$\textbf {X}_1$| and |$\textbf {X}_2$| are the binding profiles of conditions 1 and 2, respectively (e.g. protein 1 and protein 2), we can include the a priori biological knowledge in the joint model that the two conditions have the same number of binding sites, i.e. |$P(X_{m1} = 1)=P(X_{m2} = 1)$| for any region |$m$|. This constraint is satisfied under an assumption of equal transition probabilities for the two conditions. However, this is quite a strong assumption and it is difficult to know this beforehand, unless we are in the presence of technical replicates. If we note that the stationary distribution |$P(X=1)={\tilde {q}_0}/({\tilde {q}_0+1-\tilde {q}_{1}})={1}/({1+(1-\tilde {q}_1)/\tilde {q}_0})$|, we can see that if |$({1-\tilde {q}_{11}})/{\tilde {q}_{01}}=({1-\tilde {q}_{12}})/{\tilde {q}_{02}}$|, then |$P(X_1=1)=P(X_2=1)$|. Here, |$\tilde {q}_{01}$|, |$\tilde {q}_{11}$| and |$\tilde {q}_{02}$|, |$\tilde {q}_{12}$| denote the transition probabilities corresponding to the binding profiles |$\textbf {X}_1$| and |$\textbf {X}_2$| of the two conditions, respectively. This shows that a weaker condition on the transition probabilities is necessary to achieve equal probabilities of enrichment.
2.5 Identification of enriched regions and differentially bound regions
In this section, we show how the statistical model described above is used to detect the regions in the genome that are bound by a protein of interest. Let |$ \textbf {X}^{\textbf {(1)}}_{\textbf {c}}, \ldots , \boldsymbol{X}^{\textbf {(N)}}_{\textbf {c}}$| be |$N$| Gibbs draws of the joint states |$\boldsymbol{X}_{\textbf {c}}$| under condition |$c$|, where |$\boldsymbol{X}^{\textbf {(k)}}_{\textbf {c}}=(X^{(k)}_{1c}, \ldots , X^{(k)}_{Mc})$|. Under the proposed random field model, a natural estimate of the posterior probability that the |$m$|th bin is enriched is given by |$\hat {P}(X_{mc}=1|\textbf {Y})=\sum _{k=1}^{N} I(X^{(k)}_{mc}=1)$| (Scott, 2002). To decide whether a bin is enriched or not, we set a threshold on these probabilities based on the FDR. If |$D$| is a set of declared enriched regions corresponding to a particular cut-off on the posterior probabilities, then the estimated FDR for this cut-off is given by |$\widehat {{\mathrm {FDR}}} = \sum _{m \in D} \hat {P}(X_{mc}=0|\textbf {Y})/|D|$|.
3. Simulation study
In this section, we perform an extensive simulation study where we compare our MRF model with four competitive methods: iSeq (Mo, 2012), BP (Spyrou and others, 2009), and the mixture model approaches of Bao and others (2013) (here denoted as Mixture) and Zeng and others (2013) (jMOSAiCS). For a number of different scenarios, we generate the data for |$M=10000$| regions and we repeat the simulation for 100 times. For all methods, we identify the enriched regions by controlling the FDR at 0.05. We then compute the false non-discovery rate (FNDR), that is, the fraction of all the non-discovered regions that were actually enriched. Finally, we report the |$p$|-values of a |$t$|-test for the null hypothesis that the FNDR of our method is greater than or equal to the FNDR of each of the other methods.
In the first four scenarios, we compare our method with the other HMM-based methods, namely iSeq (Mo, 2012) and BP (Spyrou and others, 2009). In the first scenario, we simulate data from a mixture model with a ZINB background distribution and an NB signal distribution. We set the parameters of these distributions using the values estimated by an MRF model on two of our real ChIP-seq datasets. We choose the experiments on the basis of their ChIP efficiency. In particular, we consider the case of a not very efficient experiment (CBPT0) and the one of a more efficient experiment (p300T302). In terms of the mixture distribution, the more efficient experiment corresponds to background and signal distributions that are better separated. Since neither iSeq nor BP can deal with multiple experiments, we perform these comparisons on single experiments. The results are given in Table 1 in scenario 1. BP is in general inferior to both iSeq and MRF. Between iSeq and MRF, there is no significant difference for the less efficient experiment, whereas MRF is superior to iSeq for the more efficient experiment. In general, we find that the use of zero-inflated distributions is particularly suited to the case of efficient experiments, where there is a combination of a large number of zeroes and a relatively large number of high counts. A mixture of Poisson distributions, which is implemented in iSeq, cannot capture this situation very well. We further extend this simulation to scenarios where some assumptions are shared between MRF and iSeq. First, we generate data from a mixture of Poisson distributions and |$\tilde {q}_1+\tilde {q}_0=1$|. These are the two main assumptions imposed by the Ising model implemented in iSeq. The results are given in Table 1 in scenario 2. In this case, as expected, there is no difference between iSeq and MRF, whereas BP is still inferior to both. Secondly, we consider the case of a Poisson mixture, as in iSeq, but we relax the assumption of |$\tilde {q}_1+\tilde {q}_0=1$| (Table 1, scenario 3). In both cases, the MRF method is superior to iSeq, although the difference is not so large. Finally, in the fourth scenario (Table 1, scenario 4), we generated data which satisfy the constraint |$\tilde {q}_1+\tilde {q}_0=1$|, but which does not follow a Poisson mixture distribution. In particular, we use a ZINB-NB mixture distribution. This is the case where the MRF method performs much better than either of the two other methods. In general, the first four scenarios in Table 1 show how iSeq and MRF perform equally well when the data are generated from a Poisson mixture distribution and |$\tilde {q}_1+\tilde {q}_0=1$|, but MRF is superior to both iSeq and BP when either of these two conditions is not satisfied.
Simulated count data are generated for |$M=10\,000$| regions under different scenarios|$,$| with parameter values given in brackets for signal |$(S)$| and background |$(B)$| distributions
| . | Less efficient experiment . | More efficient experiment . | ||
|---|---|---|---|---|
| . | FNDR . | |$p$|-value . | FNDR . | |$p$|-value . |
| . | Scenario 1: ZINB-NB mixture with |$\tilde {q}_1+\tilde {q}_0\neq 1$| (as MRF) . | |||
| . | S: NB(1.38, 2.07); B: ZINB(0.66, 0.33, 2.01) |$(\tilde {q}_0,\tilde {q}_1)$|=(0.002, 0.940) . | S: NB(6.95, 0.89); B: ZINB(0.53, 0.36, 0.88) |$(\tilde {q}_0,\tilde {q}_1)$|=(0.003, 0.866) . | ||
| MRF | 0.0090 | — | 0.0020 | —- |
| iSeq | 0.0086 | 0.7778 | 0.0052 | |$<$|2.2e|$-$|16 |
| BP | 0.0292 | |$<$|2.2e|$-$|16 | 0.0088 | |$<$|2.2e|$-$|16 |
| Scenario 2: Poisson–Poisson mixture with |$\tilde {q}_1+\tilde {q}_0 = 1$| (as iSeq) | ||||
| . | Less efficient experiment . | More efficient experiment . | ||
|---|---|---|---|---|
| . | FNDR . | |$p$|-value . | FNDR . | |$p$|-value . |
| . | Scenario 1: ZINB-NB mixture with |$\tilde {q}_1+\tilde {q}_0\neq 1$| (as MRF) . | |||
| . | S: NB(1.38, 2.07); B: ZINB(0.66, 0.33, 2.01) |$(\tilde {q}_0,\tilde {q}_1)$|=(0.002, 0.940) . | S: NB(6.95, 0.89); B: ZINB(0.53, 0.36, 0.88) |$(\tilde {q}_0,\tilde {q}_1)$|=(0.003, 0.866) . | ||
| MRF | 0.0090 | — | 0.0020 | —- |
| iSeq | 0.0086 | 0.7778 | 0.0052 | |$<$|2.2e|$-$|16 |
| BP | 0.0292 | |$<$|2.2e|$-$|16 | 0.0088 | |$<$|2.2e|$-$|16 |
| Scenario 2: Poisson–Poisson mixture with |$\tilde {q}_1+\tilde {q}_0 = 1$| (as iSeq) | ||||
| . | S: Poisson(1.5); B: Poisson(0.5) |$\tilde {q}_1=1-\tilde {q}_0=0.98$| . | S: Poisson(9.0); B: Poisson(0.5) |$\tilde {q}_1=1-\tilde {q}_0=0.98$| . | ||
|---|---|---|---|---|
| MRF | 0.0606 | — | 3.79e–06 | — |
| iSeq | 0.0586 | 0.7661 | 1.04e–05 | 0.1073 |
| BP | 0.4547 | |$<$|2.2e–16 | 0.2707 | |$<$|2.2e–16 |
| Scenario 3: Poisson–Poisson mixture with |$\tilde {q}_1+\tilde {q}_0\neq 1$| | ||||
| . | S: Poisson(1.5); B: Poisson(0.5) |$\tilde {q}_1=1-\tilde {q}_0=0.98$| . | S: Poisson(9.0); B: Poisson(0.5) |$\tilde {q}_1=1-\tilde {q}_0=0.98$| . | ||
|---|---|---|---|---|
| MRF | 0.0606 | — | 3.79e–06 | — |
| iSeq | 0.0586 | 0.7661 | 1.04e–05 | 0.1073 |
| BP | 0.4547 | |$<$|2.2e–16 | 0.2707 | |$<$|2.2e–16 |
| Scenario 3: Poisson–Poisson mixture with |$\tilde {q}_1+\tilde {q}_0\neq 1$| | ||||
| . | S: Poisson(3.0); B: Poisson(0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02,0.5)$| . | S: Poisson(6.0); B: Poisson(0.2) |$(\tilde {q}_0,\tilde {q}_1)=(0.02,0.5)$| . | ||
|---|---|---|---|---|
| MRF | 0.0225 | – | 0.0011 | – |
| iSeq | 0.0287 | <2.2e–16 | 0.0016 | 1.737e–12 |
| BP | 0.0299 | <2.2e–16 | 0.0200 | <2.2e–16 |
| Scenario 4: ZINB-NB mixture with |$\tilde {q}_1+\tilde {q}_0=1$| | ||||
| . | S: Poisson(3.0); B: Poisson(0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02,0.5)$| . | S: Poisson(6.0); B: Poisson(0.2) |$(\tilde {q}_0,\tilde {q}_1)=(0.02,0.5)$| . | ||
|---|---|---|---|---|
| MRF | 0.0225 | – | 0.0011 | – |
| iSeq | 0.0287 | <2.2e–16 | 0.0016 | 1.737e–12 |
| BP | 0.0299 | <2.2e–16 | 0.0200 | <2.2e–16 |
| Scenario 4: ZINB-NB mixture with |$\tilde {q}_1+\tilde {q}_0=1$| | ||||
| . | S: NB(3.0, 1.0); B: ZINB(0.5, 0.5, 0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02, 0.98)$| . | S: NB(6.0, 1.0); B: ZINB(0.5, 0.5, 0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02, 0.98)$| . | ||
|---|---|---|---|---|
| MRF | 0.0168 | – | 0.0039 | – |
| iSeq | 0.2903 | <2.2e–16 | 0.1874 | <2.2e–16 |
| BP | 0.4100 | <2.2e–16 | 0.4310 | <2.2e–16 |
| Scenario 5: multiple experiments and Markov property | ||||
| . | S: NB(3.0, 1.0); B: ZINB(0.5, 0.5, 0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02, 0.98)$| . | S: NB(6.0, 1.0); B: ZINB(0.5, 0.5, 0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02, 0.98)$| . | ||
|---|---|---|---|---|
| MRF | 0.0168 | – | 0.0039 | – |
| iSeq | 0.2903 | <2.2e–16 | 0.1874 | <2.2e–16 |
| BP | 0.4100 | <2.2e–16 | 0.4310 | <2.2e–16 |
| Scenario 5: multiple experiments and Markov property | ||||
| . | Rep 1 S: NB(|$2.74, 1.55$|); B: ZINB(|$0.63, 0.43, 2.32$|) Rep 2 S: NB(|$5.99, 0.96$|); B: ZINB(|$0.48, 0.48, 1.25$|) |$(\tilde {q}_0,\tilde {q}_1)=(0.003,0.839)$| . | Rep 1 S: NB(|$3.80, 1.14$|); B: ZINB(|$0.66, 0.40, 3.01$|) Rep 2 S: NB(|$7.40, 0.96$|); B: ZINB(|$0.49, 0.40, 1.06$|) |$(\tilde {q}_0,\tilde {q}_1)=(0.003,0.830)$| . | ||
|---|---|---|---|---|
| MRF | 0.0011 | — | 0.0008 | — |
| Mixture | 0.0072 | <2.2e–16 | 0.0057 | <2.2e–16 |
| jMOSAiCS | 0.0104 | <2.2e–16 | 0.0081 | <2.2e–16 |
| Scenario 6: multiple experiments and no Markov property | ||||
| . | Rep 1 S: NB(|$2.74, 1.55$|); B: ZINB(|$0.63, 0.43, 2.32$|) Rep 2 S: NB(|$5.99, 0.96$|); B: ZINB(|$0.48, 0.48, 1.25$|) |$(\tilde {q}_0,\tilde {q}_1)=(0.003,0.839)$| . | Rep 1 S: NB(|$3.80, 1.14$|); B: ZINB(|$0.66, 0.40, 3.01$|) Rep 2 S: NB(|$7.40, 0.96$|); B: ZINB(|$0.49, 0.40, 1.06$|) |$(\tilde {q}_0,\tilde {q}_1)=(0.003,0.830)$| . | ||
|---|---|---|---|---|
| MRF | 0.0011 | — | 0.0008 | — |
| Mixture | 0.0072 | <2.2e–16 | 0.0057 | <2.2e–16 |
| jMOSAiCS | 0.0104 | <2.2e–16 | 0.0081 | <2.2e–16 |
| Scenario 6: multiple experiments and no Markov property | ||||
| . | Rep 1 S: NB(|$2.74, 1.55$|); B: ZINB(|$0.63, 0.43, 2.32$|) Rep 2 S: NB(|$5.99, 0.96$|); B: ZINB(|$0.48, 0.48, 1.25$|) |$p(X=1)=0.017$| . | Rep 1 S: NB(|$3.80, 1.14$|); B: ZINB(|$0.66, 0.40, 3.01$|) Rep 2 S: NB(|$7.40, 0.96$|); B: ZINB(|$0.49, 0.40, 1.06$|) |$p(X=1)=0.020$| . | ||
|---|---|---|---|---|
| MRF | 0.0073 | — | 0.0058 | — |
| Mixture | 0.0073 | 0.5001 | 0.0058 | 0.6738 |
| jMOSAiCS | 0.0099 | <2.2e–16 | 0.0080 | <2.2e–16 |
| . | Rep 1 S: NB(|$2.74, 1.55$|); B: ZINB(|$0.63, 0.43, 2.32$|) Rep 2 S: NB(|$5.99, 0.96$|); B: ZINB(|$0.48, 0.48, 1.25$|) |$p(X=1)=0.017$| . | Rep 1 S: NB(|$3.80, 1.14$|); B: ZINB(|$0.66, 0.40, 3.01$|) Rep 2 S: NB(|$7.40, 0.96$|); B: ZINB(|$0.49, 0.40, 1.06$|) |$p(X=1)=0.020$| . | ||
|---|---|---|---|---|
| MRF | 0.0073 | — | 0.0058 | — |
| Mixture | 0.0073 | 0.5001 | 0.0058 | 0.6738 |
| jMOSAiCS | 0.0099 | <2.2e–16 | 0.0080 | <2.2e–16 |
The table reports the average FNDR over 100 iterations, at a controlled FDR of 5%, for MRF, iSeq, BP, Mixture, and jMOSAiCS. The |$p$|-values show whether the MRF model has a significantly lower FNDR than each of the other methods.
Simulated count data are generated for |$M=10\,000$| regions under different scenarios|$,$| with parameter values given in brackets for signal |$(S)$| and background |$(B)$| distributions
| . | Less efficient experiment . | More efficient experiment . | ||
|---|---|---|---|---|
| . | FNDR . | |$p$|-value . | FNDR . | |$p$|-value . |
| . | Scenario 1: ZINB-NB mixture with |$\tilde {q}_1+\tilde {q}_0\neq 1$| (as MRF) . | |||
| . | S: NB(1.38, 2.07); B: ZINB(0.66, 0.33, 2.01) |$(\tilde {q}_0,\tilde {q}_1)$|=(0.002, 0.940) . | S: NB(6.95, 0.89); B: ZINB(0.53, 0.36, 0.88) |$(\tilde {q}_0,\tilde {q}_1)$|=(0.003, 0.866) . | ||
| MRF | 0.0090 | — | 0.0020 | —- |
| iSeq | 0.0086 | 0.7778 | 0.0052 | |$<$|2.2e|$-$|16 |
| BP | 0.0292 | |$<$|2.2e|$-$|16 | 0.0088 | |$<$|2.2e|$-$|16 |
| Scenario 2: Poisson–Poisson mixture with |$\tilde {q}_1+\tilde {q}_0 = 1$| (as iSeq) | ||||
| . | Less efficient experiment . | More efficient experiment . | ||
|---|---|---|---|---|
| . | FNDR . | |$p$|-value . | FNDR . | |$p$|-value . |
| . | Scenario 1: ZINB-NB mixture with |$\tilde {q}_1+\tilde {q}_0\neq 1$| (as MRF) . | |||
| . | S: NB(1.38, 2.07); B: ZINB(0.66, 0.33, 2.01) |$(\tilde {q}_0,\tilde {q}_1)$|=(0.002, 0.940) . | S: NB(6.95, 0.89); B: ZINB(0.53, 0.36, 0.88) |$(\tilde {q}_0,\tilde {q}_1)$|=(0.003, 0.866) . | ||
| MRF | 0.0090 | — | 0.0020 | —- |
| iSeq | 0.0086 | 0.7778 | 0.0052 | |$<$|2.2e|$-$|16 |
| BP | 0.0292 | |$<$|2.2e|$-$|16 | 0.0088 | |$<$|2.2e|$-$|16 |
| Scenario 2: Poisson–Poisson mixture with |$\tilde {q}_1+\tilde {q}_0 = 1$| (as iSeq) | ||||
| . | S: Poisson(1.5); B: Poisson(0.5) |$\tilde {q}_1=1-\tilde {q}_0=0.98$| . | S: Poisson(9.0); B: Poisson(0.5) |$\tilde {q}_1=1-\tilde {q}_0=0.98$| . | ||
|---|---|---|---|---|
| MRF | 0.0606 | — | 3.79e–06 | — |
| iSeq | 0.0586 | 0.7661 | 1.04e–05 | 0.1073 |
| BP | 0.4547 | |$<$|2.2e–16 | 0.2707 | |$<$|2.2e–16 |
| Scenario 3: Poisson–Poisson mixture with |$\tilde {q}_1+\tilde {q}_0\neq 1$| | ||||
| . | S: Poisson(1.5); B: Poisson(0.5) |$\tilde {q}_1=1-\tilde {q}_0=0.98$| . | S: Poisson(9.0); B: Poisson(0.5) |$\tilde {q}_1=1-\tilde {q}_0=0.98$| . | ||
|---|---|---|---|---|
| MRF | 0.0606 | — | 3.79e–06 | — |
| iSeq | 0.0586 | 0.7661 | 1.04e–05 | 0.1073 |
| BP | 0.4547 | |$<$|2.2e–16 | 0.2707 | |$<$|2.2e–16 |
| Scenario 3: Poisson–Poisson mixture with |$\tilde {q}_1+\tilde {q}_0\neq 1$| | ||||
| . | S: Poisson(3.0); B: Poisson(0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02,0.5)$| . | S: Poisson(6.0); B: Poisson(0.2) |$(\tilde {q}_0,\tilde {q}_1)=(0.02,0.5)$| . | ||
|---|---|---|---|---|
| MRF | 0.0225 | – | 0.0011 | – |
| iSeq | 0.0287 | <2.2e–16 | 0.0016 | 1.737e–12 |
| BP | 0.0299 | <2.2e–16 | 0.0200 | <2.2e–16 |
| Scenario 4: ZINB-NB mixture with |$\tilde {q}_1+\tilde {q}_0=1$| | ||||
| . | S: Poisson(3.0); B: Poisson(0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02,0.5)$| . | S: Poisson(6.0); B: Poisson(0.2) |$(\tilde {q}_0,\tilde {q}_1)=(0.02,0.5)$| . | ||
|---|---|---|---|---|
| MRF | 0.0225 | – | 0.0011 | – |
| iSeq | 0.0287 | <2.2e–16 | 0.0016 | 1.737e–12 |
| BP | 0.0299 | <2.2e–16 | 0.0200 | <2.2e–16 |
| Scenario 4: ZINB-NB mixture with |$\tilde {q}_1+\tilde {q}_0=1$| | ||||
| . | S: NB(3.0, 1.0); B: ZINB(0.5, 0.5, 0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02, 0.98)$| . | S: NB(6.0, 1.0); B: ZINB(0.5, 0.5, 0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02, 0.98)$| . | ||
|---|---|---|---|---|
| MRF | 0.0168 | – | 0.0039 | – |
| iSeq | 0.2903 | <2.2e–16 | 0.1874 | <2.2e–16 |
| BP | 0.4100 | <2.2e–16 | 0.4310 | <2.2e–16 |
| Scenario 5: multiple experiments and Markov property | ||||
| . | S: NB(3.0, 1.0); B: ZINB(0.5, 0.5, 0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02, 0.98)$| . | S: NB(6.0, 1.0); B: ZINB(0.5, 0.5, 0.5) |$(\tilde {q}_0,\tilde {q}_1)=(0.02, 0.98)$| . | ||
|---|---|---|---|---|
| MRF | 0.0168 | – | 0.0039 | – |
| iSeq | 0.2903 | <2.2e–16 | 0.1874 | <2.2e–16 |
| BP | 0.4100 | <2.2e–16 | 0.4310 | <2.2e–16 |
| Scenario 5: multiple experiments and Markov property | ||||
| . | Rep 1 S: NB(|$2.74, 1.55$|); B: ZINB(|$0.63, 0.43, 2.32$|) Rep 2 S: NB(|$5.99, 0.96$|); B: ZINB(|$0.48, 0.48, 1.25$|) |$(\tilde {q}_0,\tilde {q}_1)=(0.003,0.839)$| . | Rep 1 S: NB(|$3.80, 1.14$|); B: ZINB(|$0.66, 0.40, 3.01$|) Rep 2 S: NB(|$7.40, 0.96$|); B: ZINB(|$0.49, 0.40, 1.06$|) |$(\tilde {q}_0,\tilde {q}_1)=(0.003,0.830)$| . | ||
|---|---|---|---|---|
| MRF | 0.0011 | — | 0.0008 | — |
| Mixture | 0.0072 | <2.2e–16 | 0.0057 | <2.2e–16 |
| jMOSAiCS | 0.0104 | <2.2e–16 | 0.0081 | <2.2e–16 |
| Scenario 6: multiple experiments and no Markov property | ||||
| . | Rep 1 S: NB(|$2.74, 1.55$|); B: ZINB(|$0.63, 0.43, 2.32$|) Rep 2 S: NB(|$5.99, 0.96$|); B: ZINB(|$0.48, 0.48, 1.25$|) |$(\tilde {q}_0,\tilde {q}_1)=(0.003,0.839)$| . | Rep 1 S: NB(|$3.80, 1.14$|); B: ZINB(|$0.66, 0.40, 3.01$|) Rep 2 S: NB(|$7.40, 0.96$|); B: ZINB(|$0.49, 0.40, 1.06$|) |$(\tilde {q}_0,\tilde {q}_1)=(0.003,0.830)$| . | ||
|---|---|---|---|---|
| MRF | 0.0011 | — | 0.0008 | — |
| Mixture | 0.0072 | <2.2e–16 | 0.0057 | <2.2e–16 |
| jMOSAiCS | 0.0104 | <2.2e–16 | 0.0081 | <2.2e–16 |
| Scenario 6: multiple experiments and no Markov property | ||||
| . | Rep 1 S: NB(|$2.74, 1.55$|); B: ZINB(|$0.63, 0.43, 2.32$|) Rep 2 S: NB(|$5.99, 0.96$|); B: ZINB(|$0.48, 0.48, 1.25$|) |$p(X=1)=0.017$| . | Rep 1 S: NB(|$3.80, 1.14$|); B: ZINB(|$0.66, 0.40, 3.01$|) Rep 2 S: NB(|$7.40, 0.96$|); B: ZINB(|$0.49, 0.40, 1.06$|) |$p(X=1)=0.020$| . | ||
|---|---|---|---|---|
| MRF | 0.0073 | — | 0.0058 | — |
| Mixture | 0.0073 | 0.5001 | 0.0058 | 0.6738 |
| jMOSAiCS | 0.0099 | <2.2e–16 | 0.0080 | <2.2e–16 |
| . | Rep 1 S: NB(|$2.74, 1.55$|); B: ZINB(|$0.63, 0.43, 2.32$|) Rep 2 S: NB(|$5.99, 0.96$|); B: ZINB(|$0.48, 0.48, 1.25$|) |$p(X=1)=0.017$| . | Rep 1 S: NB(|$3.80, 1.14$|); B: ZINB(|$0.66, 0.40, 3.01$|) Rep 2 S: NB(|$7.40, 0.96$|); B: ZINB(|$0.49, 0.40, 1.06$|) |$p(X=1)=0.020$| . | ||
|---|---|---|---|---|
| MRF | 0.0073 | — | 0.0058 | — |
| Mixture | 0.0073 | 0.5001 | 0.0058 | 0.6738 |
| jMOSAiCS | 0.0099 | <2.2e–16 | 0.0080 | <2.2e–16 |
The table reports the average FNDR over 100 iterations, at a controlled FDR of 5%, for MRF, iSeq, BP, Mixture, and jMOSAiCS. The |$p$|-values show whether the MRF model has a significantly lower FNDR than each of the other methods.
In the last two scenarios, we compare the MRF model with our previously developed mixture model for multiple experiments (Bao and others, 2013) and the recently developed jMOSAiCS (Zeng and others, 2013), which also does not account for spatial dependencies. As the jMOSAiCS implementation requires a control sample, we use a sample of 10 000 bins from the IgGrabbit control sample provided by Wang and others (2009). For a fairer comparison, we extend the model of Bao and others (2013) to include zero-inflated distributions for the background. In Table 1 in scenario 5, the data are generated from an MRF model, using the parameter values estimated from two real datasets. As expected, the MRF model performs better than the mixture models in this case, as it accounts for the Markov dependencies. In the final scenario (Table 1, scenario 6), we generated data without Markov properties, that is, the latent state |$X$| follows a Bernoulli distribution. In this case, the MRF and our mixture model give the same results, but the MRF model is still superior to jMOSAiCS.
From all scenarios considered, one can conclude that the proposed MRF model performs as well as the other methods under similar conditions, but it outperforms the other models under more general mixture distributions and modeling assumptions.
4. Real data analysis
In this section, we use the new model on real ChIP-seq data on two proteins, p300 and CBP. P300 and CBP are two histone acetyltransferaces: they are transcriptional co-activators whose regulatory mechanisms are not fully understood, but are thought to be crucial for a number of biological functions. As chromatin modifiers, they do not bind directly to the DNA and thus they generally show broad binding profiles. We analyze ChIP-seq data from six experiments, three for CBP and three for p300 (Ramos and others, 2010). For each protein, one experiment is conducted at time point 0 (T0) and two technical replicates are performed after 30 min (T301 and T302). We also use CBP and p300 ChIP-seq data from an earlier independent study (Wang and others, 2009), where CBP and p300 binding was evaluated in resting cells, using a different cell line and different antibodies. The data are further described in Bao and others (2013), which includes a discussion of the effect of the different IP efficiencies on the data. This is the case also for the technical replicates, with one replicate having a higher IP efficiency than the other. We divide the whole genome into 200 base pair windows and summarize the raw counts for each window by the number of tags whose first position is in the window. The window length was chosen as it matches the fragment size used in the ChIP-seq experiment. Furthermore, we exclude from the analysis genomic regions that have been found to exhibit anomalous or unstructured read counts from the analysis (Hoffman and others, 2012).
First of all, we have compared the fit of an NB mixture model, where an NB distribution is chosen both for the background and signal, against a ZINB-NB mixture, where a ZINB is considered for the background and an NB distribution for the signal. In general, we find that the BIC values are lower for the ZINB-NB mixture than for the NB mixture, suggesting a better fit for the zero-inflated mixture (see supplementary material available at Biostatistics online). In the following, we will therefore use zero-inflated distributions. This is different from iSeq (Mo, 2012) and BP (Spyrou and others, 2009), where Poisson and NB mixtures are used, respectively.
Within the eight datasets that we analyzed, CBPT0, p300T0, WangCBP, and Wangp300 are single experiments, i.e. with no replicates. We therefore compare the proposed MRF model with iSeq and BP on these four experiments. We also include in the comparison two widely used methods for analyzing ChIP-seq data, namely MACS (Zhang and others, 2008) and CisGenome (Ji and others, 2008), which do not account for spatial dependencies. For simplicity, we provide the results only for chromosome 21. Figure 2 shows the Venn diagrams of the detected regions for two representative experiments (CBPT0 and WangCBP). As MACS does not provide FDR control, we use the suggested default |$p$|-value cut-off of |$10^{-5}$|. For the other methods, we use a 5% FDR cut-off, although we found the empirical FDR values returned by CisGenome to be rather unreliable. The results show that MRF detects more regions than any of the other four methods. Furthermore, MRF tends to agree more with iSeq and BP, than with the other two methods. This is probably due to the fact that these three methods all account for spatial dependencies. Finally, Figure 2 shows how the overlap between MRF and iSeq and the overlap between MRF and BP are both larger than the overlap between iSeq and BP. The results for the other datasets are provided in supplementary material available at Biostatistics online.
Venn diagrams of the number of enriched regions detected by MRF, iSeq, BP, CisGenome, and MACS, at the 5% FDR, for the CBPT0 and WangCBP experiments.
Venn diagrams of the number of enriched regions detected by MRF, iSeq, BP, CisGenome, and MACS, at the 5% FDR, for the CBPT0 and WangCBP experiments.
CBP and p300 both have largely overlapping roles in transcriptional activation. We use ChromHMM (Ernst and Kellis, 2010) to explore whether the regions identified by MRF are likely functional in transcription activation and whether different chromatin features are enriched in the regions identified by the different methods. Figure 3 shows the results of ChromHMM using a four-state HMM on the enrichment profile given by MRF, BayePeak, and iSeq, each at a 5% FDR, for two representative experiments (CBPT0 and p300T0). The left plots give the emission probabilities for the different analyses, that is, the conditional probability of the observed enrichment for each of the four possible states. These plots show how, for all analyses, three of the four states explain most of the enrichment pattern in the identified lists. The right plots give the relative fold enrichment for several annotations. These plots show how these three states are represented by a similar enrichment of features for the three methods, mainly CpGisland and RefSeq Transcription Start Sites (RefSeqTSS), suggesting that the additional regions detected by MRF are likely to be genuine binding events.
Validation of the enriched bins detected by BP, iSeq, and MRF for CBPT0, p300T0. We use a four-state ChromHMM. The left plots show heatmaps of the probabilities (in %) that the detected bins are enriched given each identified chromatin state. The right plots show the relative percentage of the genome represented by each chromatin state (column 1) and the relative fold enrichment for several types of annotation (columns 2–8).
Validation of the enriched bins detected by BP, iSeq, and MRF for CBPT0, p300T0. We use a four-state ChromHMM. The left plots show heatmaps of the probabilities (in %) that the detected bins are enriched given each identified chromatin state. The right plots show the relative percentage of the genome represented by each chromatin state (column 1) and the relative fold enrichment for several types of annotation (columns 2–8).
For replicated experiments (CBPT301, CBPT302, p300T301, p300T302), we compare the proposed MRF model for multiple experiments, with the previously developed mixture model (Bao and others, 2013) and the recently developed jMOSAiCS (Zeng and others, 2013). Table 2 reports the number of enriched regions at a 5% FDR on chromosome 21. For the MRF and Mixture methods, we also include the number of regions differentially bound between p300 and CBP. The results show that, by including the assumption of Markov properties, the number of enriched regions detected is larger than when the Markov property is not considered. This is to be expected since regions with a relatively small number of counts but with neighboring enriched regions may not be detected by the mixture model but would be detected by the MRF model. As before, the ChromHMM validation shows a similar enrichment pattern in the identified lists, with predominantly TSS and CpGIsland features (see supplementary material available at Biostatistics online). Overall, these results lead us to the conclusion that, by taking into account Markov properties while combining replicates, many more regions are found at the same FDR, and that these regions are of the same nature as those found by latent mixture models.
Number of enriched and differentially bound regions identified by MRF, a ZINB-NB mixture model (Mixture), and jMOSAiCS, using technical replicates of CBP and p300 at time 30
| . | Enriched regions . | Differentially bound regions . | ||
|---|---|---|---|---|
| Method . | CBPT30 . | p300T30 . | Only CBP . | Only p300 . |
| MRF | 2977 | 3970 | 69 | 347 |
| Mixture | 981 | 1848 | 29 | 395 |
| jMOSAiCS | 1231 | 1970 | — | — |
| Overlap | 899 | 1672 | — | — |
| . | Enriched regions . | Differentially bound regions . | ||
|---|---|---|---|---|
| Method . | CBPT30 . | p300T30 . | Only CBP . | Only p300 . |
| MRF | 2977 | 3970 | 69 | 347 |
| Mixture | 981 | 1848 | 29 | 395 |
| jMOSAiCS | 1231 | 1970 | — | — |
| Overlap | 899 | 1672 | — | — |
The last row reports the number of regions identified by all three methods.
Number of enriched and differentially bound regions identified by MRF, a ZINB-NB mixture model (Mixture), and jMOSAiCS, using technical replicates of CBP and p300 at time 30
| . | Enriched regions . | Differentially bound regions . | ||
|---|---|---|---|---|
| Method . | CBPT30 . | p300T30 . | Only CBP . | Only p300 . |
| MRF | 2977 | 3970 | 69 | 347 |
| Mixture | 981 | 1848 | 29 | 395 |
| jMOSAiCS | 1231 | 1970 | — | — |
| Overlap | 899 | 1672 | — | — |
| . | Enriched regions . | Differentially bound regions . | ||
|---|---|---|---|---|
| Method . | CBPT30 . | p300T30 . | Only CBP . | Only p300 . |
| MRF | 2977 | 3970 | 69 | 347 |
| Mixture | 981 | 1848 | 29 | 395 |
| jMOSAiCS | 1231 | 1970 | — | — |
| Overlap | 899 | 1672 | — | — |
The last row reports the number of regions identified by all three methods.
Similar analyses can be performed under the assumption that p300 and CBP have the same number of binding sites, using the method discussed in Section 2.4. The results of these analyses are provided in supplementary material available at Biostatistics online.
5. Conclusion
In this paper, we have proposed a 1D MRF model for the analysis of ChIP-seq data. Our model can be viewed as an HMM where the initial distribution is the stationary distribution. As such, we follow the literature on existing HMM-based models, such as BP (Spyrou and others, 2009) and iSeq (Mo, 2012). Similarly to these models, we capture the spatial dependencies of local bins by an assumption of first-order Markov dependence. Differently from these methods, we propose a joint model for multiple ChIP-seq experiments under general experimental designs, such as replicated experiments, experiments using different antibodies, and two-sample analyses. Furthermore, similarly to a previously developed mixture model (Bao and others, 2013), we show how a priori knowledge of the same number of binding sites for different proteins can also be added to the model, in order to better account for the different ChIP efficiencies of individual experiments. Finally, we advocate the use of zero-inflated background distributions, as these better account for the large number of zeroes in the data.
In an extensive simulation study, we have shown that the proposed MRF model performs at least as well as competitive existing methods under similar conditions, but that it outperforms the other models under more general mixture distributions and modeling assumptions. Finally, we present an analysis on real data for the detection of histone modifications of two transcriptional activators from eight ChIP-sequencing experiments, including technical replicates with different IP efficiencies. Future work will extend the current methodology to account for sequencing biases and for the possibility of more than two mixture components.
6. Software
The method is available from CRAN in the R package enRich. The current parallel implementation of the ZINB-NB model takes about 46 mins on one chromosome with 10 000 MCMC iterations, using a 64-bit machine with two 2.39 GHz processors. At the moment, the running time is higher than that of the competitive methods, mainly due to the use of the Metropolis-within-Gibbs sampling procedure.
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Funding
This work was supported by the BBSRC [BB/H017275/1 to Y.B.]; the European Commission 7th Framework Program GEUVADIS [project nr. 261123 to P.t H.]; and the Centre for Medical Systems Biology within the framework of the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research.
Acknowledgements
Conflict of Interest: None declared.
In the original version, the middle initials of the last author were omitted. These have now been added. The authors apologize for the error.



