Abstract

Chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq) is a powerful technique that is being used in a wide range of biological studies including genome-wide measurements of protein–DNA interactions, DNA methylation, and histone modifications. The vast amount of data and biases introduced by sequencing and/or genome mapping pose new challenges and call for effective methods and fast computer programs for statistical analysis. To systematically model ChIP-seq data, we build a dynamic signal profile for each chromosome and then model the profile using a fully Bayesian hidden Ising model. The proposed model naturally takes into account spatial dependency and global and local distributions of sequence tags. It can be used for one-sample and two-sample analyses. Through model diagnosis, the proposed method can detect falsely enriched regions caused by sequencing and/or mapping errors, which is usually not offered by the existing hypothesis–testing-based methods. The proposed method is illustrated using 3 transcription factor (TF) ChIP-seq data sets and 2 mixed ChIP-seq data sets and compared with 4 popular and/or well-documented methods: MACS, CisGenome, BayesPeak, and SISSRs. The results indicate that the proposed method achieves equivalent or higher sensitivity and spatial resolution in detecting TF binding sites with false discovery rate at a much lower level.

INTRODUCTION

Identifying the sites of interactions between proteins and genomic DNA is an important step in clarifying many critical cellular processes such as regulation of gene expression, DNA replication, recombination, and repair. Transcription factors (TFs) are proteins that regulate gene expression through recognizing and binding to specific DNA sequences. It is of particular interest for biologists to identify TF binding sites because it is helpful to discover TF target genes and regulatory networks. Chromatin immunoprecipitation followed by genomic tiling array analysis (ChIP-chip) has been used widely for this purpose. Due to the remarkable progress in next generation sequencing technologies (e.g. Illumina Genome Analyzer, ABI SOLID, Roche 454, and Helicos HeliScope), investigators are now able to simultaneously sequence tens to hundreds of millions of short DNA fragments (tags) in a cost-effective manner. Chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq) overcomes several limitations of the array-based ChIP-chip technique such as low tiling resolution, dye bias, and nonspecific and probe-dependent hybridization behaviors (Kharchenko and others, 2008). Therefore, ChIP-seq is replacing ChIP-chip as the preferred approach for genome-wide mapping of protein–DNA interactions and epigenetic marks (Rozowsky and others, 2009).

In a typical ChIP-seq experiment, for instance, an experiment to identify a TF binding sites, the TF is expressed in living cells and bound to genomic DNA (reviewed in Park , 2009). The cells are then treated with a reagent, usually formaldehyde, to create cross-links between the TF and DNA. The DNA is sheared by sonication into small fragments, typically with lengths of ∼200–1000 base pairs (bp). The DNA fragments cross-linked to the TF are coimmunoprecipitated using an antibody that specifically binds to the TF. After reversing the cross-links, the DNA fragments are released. The immunoprecipitated DNA fragments are hence enriched. In the control experiment, a mock immunoprecipitation with either no antibody or irrelevant antibody is employed to prepare the reference DNA fragments. Finally, the DNA fragments with lengths of ∼150–350 bp are selected for massively parallel sequencing using next generation sequencing platforms. The 5 ends of the selected DNA fragments on either strand are sequenced, which produces clusters of forward and reverse strands on the 2 sides of the bound regions. A ChIP-seq experiment (e.g. using Illumina Genome Analyzer) typically produces millions of short (∼25–50 bp) DNA tags that can be aligned to the reference genome of the organism under investigation. Usually, only the tags that uniquely map to the reference genome are used for the downstream analyses.

It is a challenge to analyze ChIP-seq data, due to the large amount of sequence tags, the intrinsic complexities of the signals, and various biases introduced in the experiment. To date, a variety of statistical methods have been proposed for ChIP-seq data analysis (e.g. Ji and others 2008, Jothi and others 2008, Kharchenko and others 2008, Zhang and others 2008, Rozowsky and others 2009, Spyrou and others 2009). Typically, the first step for ChIP-seq data analysis is to build a signal profile along each chromosome (reviewed in Pepke and others 2009). To do this, most methods use window-based counting approaches. Specifically, these methods divide the genome under study into overlapping or nonoverlapping genomic regions and then count the tags mapped in the regions in a strand-specific or a nondirectional fashion. Statistical inference is usually based on the tag counts and/or the direction information. For example, CisGenome (Ji and others, 2008) scans the genome using a sliding window with fixed width and identifies regions with tag counts exceeding a user-defined threshold. SISSRs (Jothi and others, 2008) counts the tags in overlapping windows in a strand-specific manner and identifies the transition point where the sign of net tag counts (number of positive tags minus number of negative tags) changes. SPP (Kharchenko and others, 2008) also counts the tags in a strand-specific manner and calculates the strand cross-correlations, which are used to call enriched regions. MACS (Zhang and others, 2008) slides a window with fixed width across the genome after shifting the tags toward the tags' 3 ends to account for the average length of the DNA fragments. XSET (Robertson and others, 2007) extends the tags directionally into extended tags with an estimated length and then identifies regions with highly overlapping tags. In contrast to the window-based counting methods, QuEST (Valouev and others, 2009) and F-Seq (Boyle and others, 2008) build kernel density estimation profiles along each chromosome.

After building the signal profiles, the next step of analysis is to identify enriched regions. For the window-based counting methods, the simplest approach to call enriched regions is to set a threshold and define regions as enriched if their tag counts exceed the threshold value (e.g. Johnson and others 2007, Fejes and others 2008). Usually, false discovery rates (FDRs) or individual p values are calculated for the regions called enriched. In two-sample analysis, where control data are available, the binomial distribution is often used to estimate the p values for the enriched regions (Ji and others, 2008), (, rozowsky09); the FDR is then calculated from the p values. Some methods, instead of calculating p values, estimate an empirical FDR, which is defined as the ratio of the number of control peaks to the number of chromatin immunoprecipitation (ChIP) peaks (Valouev and others, 2009), (Zhang and others, 2008). In one-sample analysis, where only the ChIP sample is sequenced, the background (null) distribution used to calculate the p values is usually derived from a Poisson or negative binomial distribution (Robertson and others, 2007), (, ji08), (, zhang08), (, jothi08). For example, to determine the model parameters, CisGenome (Ji and others, 2008) fits a negative binomial distribution to the number of windows with 2 or fewer tags. These hypothesis–testing-based approaches leave with a difficult problem for the adjustment of multiplicity because a vast number of tests are performed and the test statistics are usually not independent. Therefore, it is difficult to choose an appropriate cutoff to discriminate true enriched regions from background noise.

Similar to ChIP-chip data, an important feature of ChIP-seq data is the spatial dependency. Markov models and Markov random field models are perhaps the most popular models for data with spatial dependency, and they have been widely used in ChIP-chip data analysis (Ji and Wong, 2005), (, gottardo08), (Mo and Liang, 2010). However, probably due to the computational challenge of processing the vast amount of data, only a few Markov model-based or Bayesian methods have been used for ChIP-seq analysis (Spyrou and others, 2009), (Qin and others, 2010), (Zhang and others, 2011).

In this article, we propose a fully Bayesian hidden Ising model for ChIP-seq data analysis. The proposed method differs from the existing hypothesis–testing-based and Bayesian methods in several respects. First, we build a dynamic signal profile for each chromosome and model the profile as a system using a fully Bayesian hierarchical framework. Second, our method naturally takes into account spatial dependence and local and global distributions of sequence tags. Third, our method makes few assumptions about the size of enriched regions, does not need to perform tag extension or shift, and is able to detect sharp and broad enriched regions. As a result, our method achieves improved operating characteristics in terms of sensitivity and FDR. In addition, our method is computationally much more efficient than BayesPeak, an alternative Bayesian method (Spyrou and others, 2009).

The remainder of this paper is organized as follows. In Section 2, we describe our new Bayesian hidden Ising model and its Markov chain Monte Carlo (MCMC) implementation. In Sections 3 and 4, we compare our method with 4 popular and/or well-documented methods, MACS, CisGenome, BayesPeak, and SISSRs using 3 real data sets and 2 mixed data sets. In Section 5, we conclude with a brief discussion.

A FULLY BAYESIAN HIDDEN ISING MODEL

Building dynamic signal profiles for ChIP-seq data

Typical ChIP-seq data contain the genomic positions and directions of millions of sequence tags. In order to model ChIP-seq data and precisely infer the true TF binding site, we build a dynamic signal profile for each chromosome using a 2-step approach. First, we scan each chromosome and aggregate the tags into nonoverlapping dynamic bins that have various sizes (e.g. 80,40,20,10 bp). We let the bin sizes vary according to the local density of the tags. For example, suppose we have {t1,t2,…,tm} tags along a chromosome with positions {p1,p2,…,pm}. For each tag, we use its middle genomic position as its position on the chromosome. Let {w1,w2,…,wn} be the predefined bin sizes in a decreasing order and T be the threshold for triggering bin size alteration. The signal profile for the chromosome is made of consecutive bins {b1,b2,…,bq}. The first bin b1 starts with p1 and ends with pi, where pi is the largest position falling in the interval [p1, p1 + w1]. Let ci be the number of tags falling in bin bi and si = wd,d∈{1,2,…,n} be the size of bi. The algorithm for choosing si,i > 1 is: if ci − 1 < T, let si = wd − 1 if d > 1, else si = w1; if ci − 1T, let si = wd + 1 if d < n, else si = wn. In other words, if bi − 1 has tag count ci − 1 less than the threshold T, we increase si to the next higher level (e.g. from 20 to 40 bp, 40 to 80 bp, etc.) if si − 1 is not equal to w1 (the largest bin size), otherwise let si be w1; if bi − 1 has tag count ci − 1 equal to or greater than the threshold T, we decrease si to the next lower level (e.g. from 80 to 40 bp, 40 to 20 bp, etc.) if si − 1 is not equal to wn (the smallest size), otherwise let si be wn. The second bin will start with pi + 1 and end with pk, where pk is the largest position falling in the interval [pi + 1,pi + 1 + s2]. All the remaining bins are built in this way. To get a better understanding of this strategy, imagine that at the beginning, a bin with w1 bp is used to scan a chromosome. When it adds in T tags or more, suggesting that the following region is an enriched region, the next bin size will decrease to w2, otherwise the next bin size will still be w1. If it is indeed an enriched region, the bin size will decrease step by step till the minimum size wn. When the bin leaves the enriched region, the bin size will increase step by step till the maximum size w1.

After building the bins, we measure the genomic distances between neighboring bins. If the distance between any 2 neighboring bins exceeds the the average length of the sequenced DNA fragments, a bin with zero tag count is inserted into the 2 bins to make the bins physically closer. Unlike the popular methods that evenly divide genome into consecutive regions, we build a dynamic signal profile in order to make the distributions of tag counts in the enriched regions as similar as possible and achieve a higher resolution for inferring true TF binding sites.

Figure 1 shows a segment of the chromosome 19 signal profile of the human neuron-restrictive silencer factor (NRSF) data (Johnson and others, 2007). The top panel of Figure 1 shows the profiles for the forward and reverse chains, respectively. The bottom panel shows the profile where the tag counts of both chains are added together. For one-sample analysis, we model the bin-based total counts of both chains as shown on the bottom panel of Figure 1. For two-sample analysis, we model the adjusted bin-based total counts, which are the ChIP total counts minus the control total counts truncated at zero for every bin. Considering the characteristics of our model, here, the control data are used only for background correction in two-sample analysis. This background correction approach has been widely used in the image processing of microarray data. Although this approach is simple, it is very effective with our model. We will demonstrate this in our simulation studies.

Fig. 1.

A chromosome-based dynamic signal profile. The segment contains a typical enriched region (indicated by “+”) and nonenriched regions (indicated by “ − ”). The actual binding site in the enriched region is usually around the center—the middle position of the forward and reverse peaks. This bimodal shape characteristic is used by the iSeq software that implements the proposed method to infer the true binding site (indicated by the dashed line).

Fig. 1.

A chromosome-based dynamic signal profile. The segment contains a typical enriched region (indicated by “+”) and nonenriched regions (indicated by “ − ”). The actual binding site in the enriched region is usually around the center—the middle position of the forward and reverse peaks. This bimodal shape characteristic is used by the iSeq software that implements the proposed method to infer the true binding site (indicated by the dashed line).

We can see in Figure 1 that a typical enriched region (indicated by “+”) is made of consecutive bins with large tag counts, while nonenriched regions (indicated by “ − ”) typically have consecutive bins with small tag counts. If we use an indicator for a bin's state (enriched or nonenriched), we observe that a chromosome is partitioned into a series of segments. Each segment is made of a cluster of bins with the same state. In other words, the bins tend to be locally self-clustered, and their states depend on whether they fall in enriched or nonenriched regions. In typical ChIP-seq experiments for identifying TF binding sites, most genomic regions are not enriched. Therefore, a majority of the bins are in the nonenriched state.

The locally self-clustering behavior of ChIP-seq data is similar to the behavior of electrons in ferromagnetic materials, where electrons interact locally leading to spinning in the same direction at certain temperatures. The Ising model, initially designed to model the simplified spin behavior of electrons, has been widely used in many areas including image analysis, social networks, and computational biology. The 1D Ising model considers a sequence of points on a line (see review in Kindermann and Snell 1980). At each point, there is an atom of a magnetic material that can only be in 1 of 2 states, 1 and − 1 at any given moment. If we use this coding, then the distribution of the states can be modeled by the 1D Ising model.

The model

Let y = (y1,…,yn) be a realization of tag counts Y = (Y1,…,Yn) for n genomic bins along a chromosome. Here, the definition of Yi is flexible. For one-sample analysis, yi is the number of tags falling in bin i. For two-sample analysis, yi is the adjusted tag count for bin i as described in the previous section. We let each bin associate with a binary latent variable Xi∈{ − 1,1}, where Xi = − 1 or Xi = 1 denotes that the bin belongs to a nonenriched or enriched region, respectively. We assume, conditioning on Xi = xi, 

(2.1)
graphic

That is, Yi is assumed to follow a Poisson distribution with mean λ0 or λ1 when bin i belongs to a nonenriched or enriched region, respectively. Furthermore, we assume, conditioning on X = (X1,…,Xn) and Y1,…,Yn are independent. This results in the likelihood function 

(2.2)
graphic
where X = (x1,…,xn) is a realization of X and I(xi = c) = 1 if xi = c and 0 otherwise.

The priors

The key to the proposed method is to model the hidden state vector X. Here, we model X using the 1D Ising model, which has been used by Mo and Liang (2010) for ChIP-chip data analysis. 

(2.3)
graphic
where κ is the interaction parameter and Z(κ) is the normalizing constant of the distribution, which has the closed form (Baxter, 1982) 
(2.4)
graphic
Parameter κ controls the interaction strength between neighboring bins. The model (2.3) encourages neighboring bins to be in the same state when κ > 0 and to be in the opposite state when κ < 0. When κ = 0, it means there is no interaction between neighboring bins and the distribution of the states tends to be random. In view of the locally self-clustering behavior of the bins, we set κ > 0 in order to reflect the intrinsic structure of ChIP-seq data. The value of κ affects the size of the bin clusters. In general, a large positive value of κ gives rise to large clusters of bins, while a small positive value produces small clusters of bins. Note that we can easily extend the model to a higher-order Ising model if we allow the interaction beyond the nearest neighbors. However, in a higher-order Ising model, Z(κ) becomes intractable. It would be very challenging to consider a fully Bayesian approach because sampling from the posterior distribution of κ is extremely difficult, especially for the vast amount of data that would need to be processed.

Let Γ(a,b) denote a Gamma distribution with mean a/b and variance a/b2 and U(a,b) denote a uniform distribution from a to b. The parameters λ0, λ1, and κ are assumed to have the following prior distributions: 

graphic
Note that by introducing the Gamma priors for the Poisson parameters, the tag counts in the enriched and nonenriched regions are modeled by hierarchical Poisson–Gamma distributions. The hierarchical Poisson–Gamma distribution can be expressed as a negative binomial distribution, which allows for greater variance. The prior imposed on κ is to ensure the posterior to be proper. As a matter of practice, it is equivalent to U(0, ).

The full conditionals

Let n0 = ∑i = 1nI(xi = − 1) and n1 = ∑i = 1nI(xi = 1) be the total numbers of nonenriched and enriched bins, respectively. Let s0 = ∑i = 1nI(xi = − 1)yi and s1 = ∑i = 1nI(xi = 1)yi be the sums of the tag counts for nonenriched and enriched bins, respectively. In addition, let y∣· denote the full conditional distribution of y given everything else in the model. After some algebra, we get the following full conditional distributions: 

(2.5)
graphic
 
(2.6)
graphic
 
(2.7)
graphic
 
(2.8)
graphic
where i = 1,…,n and x0 = xn + 1 = 0. From (2.7), we can see whether a bin is enriched depends on the global distribution of tags (λ0 and λ1), the local distribution of tags (yi), the states of its neighboring bins (xi − 1 and xi + 1), and the interaction strength (κ).

Metropolis–Hastings within Gibbs sampling algorithm

Given the full conditional distributions (2.5–2.7), it is straightforward to simulate from the posterior distributions using a cyclic Gibbs sampler. The posterior distribution of κ can be simulated using the Metropolis–Hastings algorithm with a Gaussian random walk proposal. The acceptance probability of the update is 

(2.9)
graphic
where κ is the current value and κ is the proposed value. The posterior probabilities of the hidden variable Xi, where i = 1,2,…,n, are used for inference of binding sites.

CASE STUDY: ANALYSIS OF TF CHIP-SEQ DATA

The CCCTC-binding factor, NRSF, and signal transducer and activator of transcriptionprotein 1 ChIP-seq data

Barski and others (2007), Johnson and others (2007), and Rozowsky and others (2009) performed genome-wide identification of the binding sites of TFs CCCTC-binding factor (CTCF) in human CD4+ T cells, NRSF in Jurkat T cells, and signal transducer and activator of transcription protein 1 (STAT1) in interferon-γ-stimulated HeLa S3 cells, respectively. NRSF is a transcriptional repressor that negatively regulates gene expression through binding to a 21-bp DNA element called neuron restrictive silencer element. CTCF is a multizinc finger protein that binds to a conserved sequence about 19 bp. The conserved binding sequence for STAT1 is about 9 bp.

The Illumina/Solexa sequencing technology was used for the 3 experiments, which generated 25- to 36-nt sequence tags. The NRSF sequence tags were mapped to the human genome May, 2004 (hg17) assembly, and the CTCF and STAT1 sequence tags were mapped to the human genome March, 2006 (hg18) assembly. Only the uniquely mapped tags (up to 2 mismatches) were used for the analysis. The CTCF data were made of ∼2.95 million (M) tags that were generated from the treatment/ChIP samples. The NRSF data consisted of ∼5.36 M ChIP and ∼6.16 M control tags. The STAT1 data consisted of ∼23.44 M ChIP and ∼26.73 M control tags.

Analysis of the CTCF, NRSF, and STAT1 data

In order to thoroughly evaluate the proposed method, we applied it to 3 recently published data sets. For all the 3 data sets, we used the bin sizes {80,40,20,10bp} to construct the signal profiles as described in Section 2.1. The CTCF and NRSF samples were not deeply sequenced but the STAT1 samples were. Considering this, we let the thresholds for triggering size change be 10 tags for the CTCF and NRSF data and 20 tags for the STAT1 data, respectively. In the TF binding data, the genomic regions are typically dominated by nonenriched regions that are made of a large number of consecutive bins in the nonenriched state. Considering this prior information, we let the initial value of κ be 3.0 (a relatively large value that could produce large bin clusters in the Ising system), and let the top 5% of the bins have the initial state of 1 and the rest have the initial state of − 1. For the NRSF and CTCF profiles, the bins with 3 or more tags were among the top 5%, and for the STAT1 profiles, the bins with 4 or more tags were among the top 5%. We modeled the profiles chromosome by chromosome and used the prior distributions λ0∼Γ(1,1) and λ1∼Γ(5,1) for all the profiles. Note that the prior distributions, in practice, had little effect on the posterior distributions of the model parameters because the values of the prior parameters were negligible, compared to the numbers of enriched and nonenriched bins and the tag counts (see 2.5 and 2.6). In our simulations, we ran 3 chains and let the Metropolis within Gibbs sampler run for 20 000 iterations (the first 10 000 were used as burn-in for each profile) for each chain. Because all the chains essentially converged to the same parameter values (Supplementary Figure 5 of the supplementary material available at Biostatistics online), we used the results produced by one of the chains to call enriched regions. Figure 2 shows the trace plots of the model parameters and histogram of the posterior probabilities of enriched state for the chromosome 1 profile of the STAT1 data. We can see that the chains converged to the stationary distributions in dozens of iterations. We also performed convergence diagnostics using the method of Heidelberger and Welch (1983). All the chains passed the stationarity test (p value > 0.05). The posterior probabilities produced by our method were dichotomized (either 0 or 1), indicating a clear classification of our method. Enriched bins are the bins whose enriched states have high posterior probabilities (e.g. > 0.5). In addition, the FDR can be calculated for the enriched bins using the direct posterior probability approach as described in Newton and others (2004). We merged adjacent enriched bins into the same enriched region if the distance between the adjacent bins was less than the average length of the sequenced DNA fragments, which was about 200 bp for the 3 data sets.

Fig. 2.

Trace plots of the model parameters and histogram of the posterior probabilities of enriched state for the STAT1 chromosome 1 profile.

Fig. 2.

Trace plots of the model parameters and histogram of the posterior probabilities of enriched state for the STAT1 chromosome 1 profile.

Figure 3 shows the distributions of the posterior means of the model parameters for the profiles. The posterior means of λ0 for the CTCF and NRSF profiles were centered around 0.65, indicating a similar background distribution of the tags in the 2 data sets. In contrast, STAT1 had relatively higher posterior means for λ0, reflecting a higher sequencing depth. The median values of λ1 were about 14.0, 22.0, and 13.0 for the CTCF, NRSF, and STAT1 profiles, respectively. Note that the λ1 values for the chromosome Y profiles of the NRSF and STAT1 data were about 5.0, which were obvious outliers as shown in the middle panel of Figure 3, which suggests there is no enriched region in the Y chromosome of the NRSF and STAT1 data. In fact, all the posterior probabilities for the bins on the Y chromosome of the NRSF data were 0 or very close to 0. The median values of κ were around 2.2,3.4, and 2.4 for the CTCF, NRSF, and STAT1 profiles, respectively (Figure 3, the right panel). Note that the parameter κ was about 6.0 for the NRSF chromosome Y profile, indicating the formation of large bin clusters on the Y chromosome. In fact, the distribution of tag counts on the NRSF chromosome Y appeared to be random. There were about 1780 unique sequence tags mapped to the Y chromosome. The maximum adjusted tag count was 9 and there were only 7 bins with tag counts greater than 5. All the posterior probabilities calculated by our model were 0 or very close to 0, which is the typical self-clustering effect of Ising model. More specifically, the Ising model encourages the bins to be in the nonenriched state if the distribution of tag counts appears to be random.

Fig. 3.

Box plots of the posterior means of the model parameters for the CTCF, NRSF, and STAT1 profiles. The λ1 values for the chromosome Y profiles of the NRSF and STAT1 data were outliers (the values were about 5.0), suggesting that there is no enriched region on the Y chromosome of the 2 data sets. Note the parameter κ was about 6.0 (the outlier on the right panel) for the NRSF chromosome Y profile. All the posterior probabilities for the NRSF chromosome Y profile were 0 or close to 0.

Fig. 3.

Box plots of the posterior means of the model parameters for the CTCF, NRSF, and STAT1 profiles. The λ1 values for the chromosome Y profiles of the NRSF and STAT1 data were outliers (the values were about 5.0), suggesting that there is no enriched region on the Y chromosome of the 2 data sets. Note the parameter κ was about 6.0 (the outlier on the right panel) for the NRSF chromosome Y profile. All the posterior probabilities for the NRSF chromosome Y profile were 0 or close to 0.

Comparison

We compared our methods with 4 alternative methods, MACS, CisGenome, BayesPeak, and SISSRs. We chose MACS because it is one of the most popular methods, CisGenome and SISSRs because they have good spatial resolution due to reporting relatively narrow enriched regions (see review in (Laajala and others, 2009), (, wilbanks10)), and BayesPeak because it uses a Bayesian hidden Markov model that is similar to our approach. For all the 4 alternatives, we used the default or recommended parameter settings for the data analysis. For the comparison, we focused on the number, quality, and spatial resolution of the called regions.

We selected the bins with FDR < 0.05 to form enriched regions for our method. MACS did not report FDR for one-sample analysis, so we selected the regions with p value < 10 − 5 as it suggested. For two-sample MACS analysis, we chose the regions with FDR < 0.05 as we did for one- and two-sample CisGenome analyses. For BayesPeak, we selected the regions with posterior probability > 0.5 and discarded the regions that were overfitted by its suggestion. For SISSRs, we used its default parameter settings to call regions enriched. Figure 4 (the top panel) shows the numbers of enriched regions detected by all the methods for the 3 data sets, where “iSeq” refers to our method. The numbers of enriched regions detected by the 5 methods varied greatly. CisGenome produced a relatively low numbers of detections, which are consistent with the reports by Laajala and others (2009) and Wilbanks and Facciotti (2010). SISSRs identified a large number of enriched regions for the not deeply sequenced CTCF and NRSF data, but it reported the lowest number of detections (∼8000) for the deeply sequenced STAT1 data. However, if we only used the STAT1 ChIP data for the analysis (one-sample analysis), SISSRs reported ∼80 000 regions, which included 92 regions from the Y Chromosome. BayesPeak reported comparable numbers of detections for the CTCF and NRSF data, respectively, but it detected only ∼17 000 regions for the STAT1 data because of overfitting. The majority of the genomic regions were overfitted for the STAT1 data. If we included the overfitted regions, BayesPeak reported ∼127 000 regions, including 455 regions from the Y chromosome. The female HeLa S3 cells that lack the male Y chromosome were used for the STAT1 experiment, but ∼16 000 tags were mapped to that chromosome for the ChIP and control samples, respectively. Therefore, any enriched regions found on the Y chromosome are false positives, which has been noticed by Wu and others (2010). These observations suggest that SISSRs and BayesPeak didn't perform consistently for the data with different sequencing depths. In contrast, iSeq and MACS performed more consistently. The numbers of regions detected by MACS were among the top 2 for all the 3 data sets. ISeq detected a middle number of regions for the CTCF and NRSF data and the largest number of regions for the STAT1 data, which probably reflects the sequencing depth. As shown in Figure 3, the λ1 estimate of the STAT1 chromosome Y profile was an obvious outlier, providing evidence to exclude the regions found on that chromosome. Regarding controlling false positives on the Y chromosome, MACS also performed quite well; it only claimed one detection.

Fig. 4.

Number of identified enriched regions (the top panel), motif occurrence rate (the middle panel), and spatial resolution(the bottom panel). The enriched regions were ranked and then divided into a series of bins. Each of the bins contained 400 nonrepetitive regions. The numbers on the x-axes of the figures in the middle panel are the bins' IDs. The larger the ID, the less significant the enriched regions. For each motif, if a region contained a sequence with position p value < 10 − 4, the region was claimed to have the motif. For each data set, the regions containing the motif were used to calculate the spatial resolution—the distance between the motif center and the predicted exact binding site.

Fig. 4.

Number of identified enriched regions (the top panel), motif occurrence rate (the middle panel), and spatial resolution(the bottom panel). The enriched regions were ranked and then divided into a series of bins. Each of the bins contained 400 nonrepetitive regions. The numbers on the x-axes of the figures in the middle panel are the bins' IDs. The larger the ID, the less significant the enriched regions. For each motif, if a region contained a sequence with position p value < 10 − 4, the region was claimed to have the motif. For each data set, the regions containing the motif were used to calculate the spatial resolution—the distance between the motif center and the predicted exact binding site.

In order to evaluate the quality of the detected regions, we obtained the sequences that contain the canonical motif for each TF from the JASPAR database (http://jaspar.genereg.net). We used the software MEME to generate the position-specific scoring matrices, which were then used as the inputs of the software MAST for searching the canonical motifs (Bailey and others, 2006), (Bailey and Gribskov, 1998). To calculate the motif occurrence rate and assess the positional accuracy of the predicted exact binding sites (PEBSs), we searched 200 bp upstream and downstream of the PEBSs. MACS and iSeq provided the PEBS for each region. For BayesPeak and SISSRs, the center of the region was used because the PEBSs were not available. CisGenome gave the genomic coordinates of the forward and reverse peaks, thus, we used the middle position of the coordinates as the center. We ranked the regions and then divided them into a series of bins. Each of the bins contained 400 nonrepetitive regions. MACS, SISSRs, and iSeq reported tag counts for detected regions, so we ranked them using the tag counts. BayesPeak only reported posterior probabilities for detected regions, thus, we ranked them according to their posterior probabilities. CisGenome automatically ranked the regions, so we just used the default-ranked regions. Figure 4 (the middle panel) shows the occurrence rates for the ranked regions. Overall, the motif occurrence rates were similar for the top-ranked regions for the 3 data sets, except for BayesPeak for the STAT1 data, where the motif occurrence curve was much lower than the others. In addition, the motif occurrence rate of CisGenome dropped more dramatically after the 30th bin, compared to the others. Regarding the positional accuracy, all the distances between the PEBSs and the motif centers were centered at zero. In general, all the methods achieved higher positional accuracy for the CTCF and NRSF data than for the STAT1 data, except for BayesPeak, which had a much wider box than the others for the NRSF data (Figure 4, the bottom panel). Note that iSeq detected many more regions, thus, its box was slightly wider than the alternatives (Figure 4, the bottom panel). For more details about the comparison, please see Section E of the supplementary material available at Biostatistics online.

SIMULATION STUDY

As shown in Figure 4, the numbers of enriched regions detected by all the methods varied greatly. However, the motif occurrence rates were generally similar for the top regions. If the research goal is to get a short list of top regions, most methods would work. However, if the investigator wants to achieve sensitivity as high as possible and at the same time control FDR at an acceptable level, it becomes very challenging. Although most of the current methods provide statistical significance for detected regions, it is not easy to determine a suitable threshold to call regions enriched because the adjustment for multiplicity is very complicated.

To comprehensively evaluate the methods under comparison, we performed further analyses using simulated data sets. Arguably, there is no simulated data better than the data derived from real data; in fact, the real data provide ideal resources for our simulation studies. To generate simulated data, we first identified the enriched regions that contained with high confidence the CTCF motif (p value < 10 − 6). We found 9395 such regions consisting of 372 662 tags. We mixed these tags with the STAT1 control tags to make 2 ChIP data sets that contained about 4.41 and 8.33 M tags, respectively. Both ChIP data sets contained the same CTCF tags. The second data set had more noise because it had more control tags. For the corresponding two-sample analyses, we used the remaining STAT1 control tags to make 2 control data sets, which had about 4.51 and 8.47 M tags, respectively. We used the same parameter settings as those used for the real data analysis to analyze the mixed data sets. The criteria used to call enriched regions were the same as those used for the real data analysis except for the FDR cutoff, which were controlled at 0.1 because we wanted to see how these methods performed at a less stringent level. For each method, we evaluated its performance for one- and two-sample analyses. The detected regions were ordered as described previously for the analysis of sensitivity and FDR. Figure 5(A) and (C) shows the results for the one-sample analyses of the 2 ChIP data sets. MACS and iSeq achieved similar sensitivities, which was higher than those achieved by the other methods. However, iSeq had much lower FDRs than MACS. The FDRs for the regions reported by MACS and SISSRs reached ∼0.35 and ∼0.55 for the first and second ChIP data, respectively. BayesPeak performed poorly for the second ChIP data because a major proportion of the genomic regions were overfitted. Ignoring the overfitting problem, it called 41 893 regions enriched (more than 75% were false positives), indicating it was not able to discriminate the true binding regions from background noise for this data set. CisGenome achieved the lowest FDRs, but the sensitivities were only ∼0.7 for the 2 ChIP data sets. Figure 5(B) and (D) shows the results for the two-sample analyses of the simulated data. Again, MACS and iSeq achieved the highest sensitivities but iSeq had lower FDRs than MACS. CisGenome had the lowest sensitivities and FDRs. BayesPeak and SISSRs achieved a little bit higher sensitivities than CisGenome but with the cost of higher (or much higher) FDRs. In general, the top regions reported by all methods had relatively higher sensitivities and lower FDRs than the remaining regions.

Fig. 5.

Sensitivity and FDR for the detected enriched regions. (A) One-sample analysis of the mixed ChIP data (∼4.41 M tags); (B) Two-sample analysis of the mixed ChIP and control data (∼4.41 and ∼4.51 M tags, respectively). (C) One-sample analysis of the mixed ChIP data (∼8.33 M tags); (D): Two-sample analysis of the mixed ChIP and control data (∼8.33 and ∼8.47 M tags, respectively).

Fig. 5.

Sensitivity and FDR for the detected enriched regions. (A) One-sample analysis of the mixed ChIP data (∼4.41 M tags); (B) Two-sample analysis of the mixed ChIP and control data (∼4.41 and ∼4.51 M tags, respectively). (C) One-sample analysis of the mixed ChIP data (∼8.33 M tags); (D): Two-sample analysis of the mixed ChIP and control data (∼8.33 and ∼8.47 M tags, respectively).

DISCUSSION

In this article, we have presented a simple but flexible and powerful hidden Ising model for ChIP-seq data analysis. Through systematically modeling the dynamic signal profiles, the spatial dependence and the global and local distributions of the sequence tags were integrated to identify enriched regions. We compared our method with 4 representative methods using 3 real and 2 mixed data sets. The results showed that overall our method achieved equivalent or higher sensitivity and spatial resolution in detecting TF binding sites with the FDR at a much lower level. Therefore, our method can gain advantage in the analysis that aims to obtain a comprehensive list of enriched regions with acceptable false positives. If the research goal is to obtain a short list of enriched regions (e.g. top enriched regions), most of current algorithms including ours can be used. Another characteristic of our method is that the posterior probabilities of the bins being enriched are dichotomized (either 0 or 1). This overcomes the difficulty of choosing the appropriate cutoff encountered by the existing hypothesis–testing-based methods.

We propose a dynamic-bin-size algorithm for building the signal profiles of ChIP-seq data, which is a new effort in order to achieve high sensitivity and spatial resolution in inferring TF binding sites. The dynamic-bin-size algorithm tends to use large bins for regions with low tag density and small bins for regions with high tag density. The user can control bin size alteration by adjusting the threshold value. If the user lets the threshold be a very large value (e.g. one million), the largest predefined bin size will be constantly used to build the signal profiles. We have compared the effects of the dynamic profiles and various static profiles (built by using fixed bin sizes) on the real and simulated data. In general, the dynamic profile algorithm produced better operating characteristics (Supplementary Figures 24 of the supplementary material available at Biostatistics online).

Because ChIP-seq data are “count” data, it is natural to use Poisson or negative binomial distribution to model them. To the best of our knowledge, most parametric models including the 4 alternative methods use these 2 distributions to model the tag counts. In general, negative binomial distribution fits the data better because it allows for greater variance. Our approach uses the hierarchical Poisson–Gamma distributions to model the tag counts, which can be expressed as negative binomial distributions. We were aware that, in practice, ChIP-seq data may not exactly follow negative binomial distribution. Since the major function of the proposed distributions is used to distinguish enriched bins from nonenriched bins, we found that our method was quite robust when the data did not exactly follow the proposed distributions, which has been demonstrated by the analyses of the 3 real and 2 simulated data sets (For more details, see Supplementary Section A of the supplementary material available at Biostatistics online). We also noticed that the total tag counts in ChIP and control libraries are usually unbalanced. For two-sample analysis, if the ChIP and control tag counts are not extremely unbalanced, our method does not require to perform so-called “normalization” that makes the counts balanced as the other methods typically do. The approach we use to build the signal profile significantly reduces the effect of the unbalanced tag counts in identification of enriched regions. We have analyzed several simulated ChIP and control data with unbalanced total tag counts. We found that the unbalance had little effect on the sensitivity and minor effect on the FDR even when the data were extremely unbalanced (Supplementary Table 1 of the supplementary material available at Biostatistics online). Our method achieving these effects is because the control data are mainly used for background subtraction, which controls the FDR. Usually, the more the control tags, the lower the FDR. Because typical ChIP-seq data are not extremely unbalanced (e.g. the data used for this paper), we expect that the unbalance of tag counts has little effect on the results.

In summary, we have proposed a method that combines a dynamic signal profile algorithm with a Bayesian hidden Ising model to infer TF binding sites from ChIP-seq data. It is challenging to analyze ChIP-seq data using a fully Bayesian approach, considering the large amount of data. The simplicity of our method makes it useful in practice. Through parallel computation, our method can process a deeply sequenced ChIP-seq data set in a few hours. For example, we ran 8 jobs simultaneously on a 64-bit Linux machine with eight 2.33 GHz processors, and our method finished analyzing the deeply sequenced STAT1 data in 2 h when running 12 000 MCMC iterations.

SOFTWARE

The proposed method has been implemented in the C language and wrapped into a R package called iSeq, which is freely available at http://www.bioconductor.org/. Besides the 1D standard Ising model, iSeq also implements a higher-order Ising model that allows the interaction beyond the nearest neighboring bins, which is useful for noisy ChIP-seq data.

SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

The author thanks Faming Liang and Mingqi Wu for helpful discussion of the simulation study, Adam Olshen and Venkatraman Seshan for their critical review of this paper, and the reviewers and associate editor for their constructive comments that have led to significant improvement of this paper. Conflict of Interest: None declared.

References

Bailey
TL
Gribskov
M
Combining evidence using p-values: application to sequence homology searches
Bioinformatics
 , 
1998
, vol. 
14
 (pg. 
48
-
54
)
Bailey
TL
Williams
N
Misleh
C
Li
WW
MEME: discovering and analyzing DNA and protein sequence motifs
Nucleic Acids Research
 , 
2006
, vol. 
34
 (pg. 
W369
-
W373
)
Barski
A
Cuddapah
S
Cui
K
Roh
TY
Schones
DE
Wang
Z
Wei
G
Chepelev
I
Zhao
K
High-resolution profiling of histone methylations in the human genome
Cell
 , 
2007
, vol. 
129
 (pg. 
823
-
837
)
Baxter
RJ
Exactly Solved Models in Statistical Mechanics
 , 
1982
San Diego, CA
Academic Press
Boyle
AP
Guinney
J
Crawford
GE
Furey
TS
F-Seq: a feature density estimator for high-throughput sequence tags
Bioinformatics
 , 
2008
, vol. 
24
 (pg. 
2537
-
2538
)
Fejes
AP
Robertson
G
Bilenky
M
Varhol
R
Bainbridge
M
Jones
SJ
FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology
Bioinformatics
 , 
2008
, vol. 
24
 (pg. 
1729
-
1730
)
Gottardo
R
Li
W
Johnson
WE
Liu
XS
A flexible and powerful Bayesian hierarchical model for ChIP-Chip experiments
Biometrics
 , 
2008
, vol. 
64
 (pg. 
468
-
478
)
Heidelberger
P
Welch
PD
Simulation run length control in the presence of an initial transient
Operations Research
 , 
1983
, vol. 
31
 (pg. 
1109
-
1144
)
Ji
H
Jiang
H
Ma
W
Johnson
DS
Myers
RM
Wong
WH
An integrated software system for analyzing ChIP-chip and ChIP-seq data
Nature Biotechnology
 , 
2008
, vol. 
26
 (pg. 
1293
-
1300
)
Ji
H
Wong
W
Tilemap: create chromosomal map of tiling array hybridizations
Bioinformatics
 , 
2005
, vol. 
18
 (pg. 
3629
-
3636
)
Johnson
DS
Mortazavi
A
Myers
RM
Wold
B
Genome-wide mapping of in vivo protein-DNA interactions
Science
 , 
2007
, vol. 
316
 (pg. 
1497
-
1502
)
Jothi
R
Cuddapah
S
Barski
A
Cui
K
Zhao
K
Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data
Nucleic Acids Research
 , 
2008
, vol. 
36
 (pg. 
5221
-
5231
)
Kharchenko
PV
Tolstorukov
MY
Park
PJ
Design and analysis of ChIP-seq experiments for DNA-binding proteins
Nature Biotechnology
 , 
2008
, vol. 
26
 (pg. 
351
-
359
)
Kindermann
R
Snell
JL
1980
 
Markov Random Fields and Their Applications, Contemporary Mathematics, Volume 1, Providence, RI, American Mathematical Society
Laajala
TD
Raghav
S
Tuomela
S
Lahesmaa
R
Aittokallio
T
Elo
LL
A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments
BMC Genomics
 , 
2009
, vol. 
10
 pg. 
618
 
Mo
Q
Liang
F
A hidden Ising model for ChIP-chip data analysis
Bioinformatics
 , 
2010
, vol. 
26
 (pg. 
777
-
783
)
Newton
M
Noueiry
A
Sarkar
D
Ahlquist
P
Detecting differential gene expression with a semiparametric hierarchical mixture method
Biostatistics
 , 
2004
, vol. 
5
 (pg. 
155
-
176
)
Park
P
ChIP-seq: advantages and challenges of a maturing technology
Nature Reviews Genetics
 , 
2009
, vol. 
10
 (pg. 
669
-
680
)
Pepke
S
Wold
B
Mortazavi
A
Computation for ChIP-seq and RNA-seq studies
Nature Methods
 , 
2009
, vol. 
6
 
11 Suppl
(pg. 
S22
-
S32
)
Qin
ZS
Yu
J
Shen
J
Maher
CA
Hu
M
Kalyana-Sundaram
S
Yu
J
Chinnaiyan
AM
HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data
BMC Bioinformatics
 , 
2010
, vol. 
11
 pg. 
369
 
Robertson
G
Hirst
M
Bainbridge
M
Bilenky
M
Zhao
Y
Zeng
T
Euskirchen
G
Bernier
B
Varhol
R
Delaney
A
others
Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing
Nature Methods
 , 
2007
, vol. 
4
 (pg. 
651
-
657
)
Rozowsky
J
Euskirchen
G
Auerbach
RK
Zhang
ZD
Gibson
T
Bjornson
R
Carriero
N
Snyder
M
Gerstein
MB
PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls
Nature Biotechnology
 , 
2009
, vol. 
27
 (pg. 
66
-
75
)
Spyrou
C
Stark
R
Lynch
AG
Tavaré
S
BayesPeak: Bayesian analysis of ChIP-seq data
BMC Bioinformatics
 , 
2009
, vol. 
21
 pg. 
10
  
299
Valouev
A
Johnson
DS
Sundquist
A
Medina
C
Anton
E
Batzoglou
S
Myers
RM
Sidow
A
Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data
Nature Methods
 , 
2009
, vol. 
5
 (pg. 
829
-
834
)
Wilbanks
EG
Facciotti
MT
Evaluation of algorithm performance in ChIP-seq peak detection
PLOS One
 , 
2010
, vol. 
5
 pg. 
e11471
 
Wu
S
Wang
J
Zhao
W
Pounds
S
Cheng
C
ChIP-PaM: an algorithm to identify protein-DNA interaction using ChIP-seq data
Theoretical Biology and Medical Modelling
 , 
2010
, vol. 
7
 pg. 
18
 
Zhang
Y
Liu
T
Meyer
CA
Eeckhoute
J
Johnson
DS
Bernstein
BE
Nussbaum
C
Myers
RM
Brown
M
Li
W
others
Model-based analysis of ChIP-seq (MACS)
Genome Biology
 , 
2008
, vol. 
9
 pg. 
R137
  
17–R137.9
Zhang
X
Robertson
G
Krzywinski
M
Ning
K
Droit
A
Jones
S
Gottardo
R
PICS: probabilistic inference for ChIP-seq
Biometrics
 , 
2011
, vol. 
67
 (pg. 
151
-
163
)