Zerone: a ChIP-seq discretizer for multiple replicates with built-in quality control

Motivation: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the standard method to investigate chromatin protein composition. As the number of community-available ChIP-seq profiles increases, it becomes more common to use data from different sources, which makes joint analysis challenging. Issues such as lack of reproducibility, heterogeneous quality and conflicts between replicates become evident when comparing datasets, especially when they are produced by different laboratories. Results: Here, we present Zerone, a ChIP-seq discretizer with built-in quality control. Zerone is powered by a Hidden Markov Model with zero-inflated negative multinomial emissions, which allows it to merge several replicates into a single discretized profile. To identify low quality or irreproducible data, we trained a Support Vector Machine and integrated it as part of the discretization process. The result is a classifier reaching 95% accuracy in detecting low quality profiles. We also introduce a graphical representation to compare discretization quality and we show that Zerone achieves outstanding accuracy. Finally, on current hardware, Zerone discretizes a ChIP-seq experiment on mammalian genomes in about 5 min using less than 700 MB of memory. Availability and Implementation: Zerone is available as a command line tool and as an R package. The C source code and R scripts can be downloaded from https://github.com/nanakiksc/zerone. The information to reproduce the benchmark and the figures is stored in a public Docker image that can be downloaded from https://hub.docker.com/r/nanakiksc/zerone/. Contact: guillaume.filion@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.


The HMM formalism applied to ChIP-seq
The output of ChIP-seq experiments consists of genomic profiles that can be thought of as ordered windows of equal size. Each window is associated to a certain number of read counts coming from different replicate experiments or negative controls.
HMMs are particularly adapted to this framework. The read counts associated to each window can be thought of as the observable emissions, whereas the unobservable states can be thought of the possible processes ongoing on those windows. Typically, the states may correspond to the events "the protein of interest is bound on the window" and "the protein of interest is not bound on the windown". There may be more states, and they do not have to correspond to a protein binding event (most notably, they can correspond to the present of some histone modifications).
The rest of this section pertains to general HMMs and will not make any hypothesis regarding the nature of the states and the emissions, however it can be useful for the intuition to think about states are chromatin states, and emissions as read counts.

The Forward-Backward algorithm
For a sequence of emissions y 0 , . . . , y n , the likelihood of the state sequence i 0 , . . . , i n is proportional to ν(i 0 )g i 0 (y 0 ) n k=1 Q(i k−1 , i k )g i k (y k ).
By summing over all possible combinations of states, we obtain the normalizing constant L n such that L n = i 0 ∈S,...,in∈S ν(i 0 )g i 0 (y 0 ) n k=1 Q(i k−1 , i k )g i k (y k ). (1) We denote φ k|n (i) the probability that the system is in state i at time k given the emissions y 0 , . . . , y n . If we call S n (k, i) the set of n-tuples (i 0 , . . . , i n ) such that i k = i, the value of φ k|n (i) comes as φ k|n (i) = 1 L n (i 0 ,...,in)∈Sn(k,i) ν(i 0 )g i 0 (y 0 ) n l=1 Q(i l−1 , i l )g i l (y l ).
We now introduce α k (i) the probability that the system is in state i at time k given the emissions y 0 , . . . , y k , and the β k|n (·) the numerical function such that φ k|n (i) = α k (i)β k|n (i).
To preserve the equality φ k|n (i) = α k (i)β k|n (i) for every k, we set by definition β n|n (i) = 1. From the equations above, we draw the following recursive equations: Equations (2) and (3) are the basis of the Forward-Backward algorithm to compute φ k|n (i). The terms α k (i) can be recursively computed from k = 0 to k = n with equation (2), and the terms β k|n (i) can be computed from k = n − 1 to k = 0 with equation (3). The terms φ k|n (i) are then found as the product α k (i)β k|n (i).
We now turn to the term φ k−1,k|n (i, j), which is by definition the probability that the system is in state i at time k − 1 and in state j at time k given y 0 , . . . , y n . If we call S n (k, i, j) the set of n-tuples (i 0 , . . . , i n ) such that i k−1 = i and i k = j, we get When the α k (i) and the β k|n (i) have been computed by the Forward-Backward algorithm, we also have access to the φ k−1,k|n (i, j) by using formula (4).

The Baum-Welch algorithm
The Baum-Welch algorithm is the special case of the EM algorithm applied to HMMs. Let us consider the general case of the triplet (X, Z, θ) where the variable X is observed, Z is not observed, and θ is the set of parameters of the distribution of (X, Z). The full likelihood L 0 (X, Z, θ) cannot be computed because the value of Z is unknown.
To find the value of θ that maximizes the full likelihood, we introduce an iterative procedure where the values of the parameter are updated upon each iteration. The current value of θ is noted θ (t) , and we compute the expected complete log-likelihood Q(θ|θ (t) ) assuming the current value of θ (note the difference between the intermediate quantity of the EM Q and the transition matrix Q).
This computation is called the E-step. The notations mean that the expectation is taken over the variable Z, assuming that it is conditional on the observed values of X and that the parameters of the distribution are given by θ (t) . The E-step is followed by the M-step, in which θ (t+1) is set to the value of θ that maximizes Q(θ|θ (t) ).
In the case of HMMs, the variable that is not observed is the sequence of states. The set of parameters θ (t) represents the transition probabilities (the matrix Q) and the parameters of the m distributions of the emissions.
The log-likelihood of the state sequence (i 0 , . . . , i n ) is The addition of θ to the terms above emphasizes that they depend on the value of the parameters. To compute Q(θ|θ (t) ), we need to take the expectation of the above over the state sequence conditionally on y 0 , . . . , y n and assuming that the parameters are given by θ (t) .
In practice, the first term of (5) will often not depend on θ so it will not contribute to the evaluation. The third term can be rewritten as This term depends on the emission probabilities, and nothing can be said about it in general terms because they differ between different models. But the second term depends only on the transition probabilities, which are present in every HMM, and it can be solved in general. First we notice that Remember that by definition E θ (t) 1 {(i k−1 ,i k )=(i,j)} y 0 , . . . , y n is φ k−1,k (i, j), so that we can rewrite the second term of (5) as The values of φ k−1,k (i, j) are computed during the E-step by the Forward-Backward algorithm. The terms Q(i, j) are part of θ and are thus updated during the M-step. By using Lagrange multipliers, we can show that the update values are .
To complete the Baum-Welch algorithm, we need to compute the last term of (5), which requires making a model for the emissions.

Zero-Inflated Negative Multinomial
The Baum-Welch algorithm provides a general framework to estimate the transition probabilities and the emission parameters. However, the detail of the estimation depends on the emission model. Here we give some general results about the negative multinomial and the zero-inflated negative multinomial distributions that will be useful to setup a model for emissions in ChIP-seq experiments.

The Gamma-Poisson process
In what follows, y is a non negative integer (an element of N). Let Y be a discrete random variable distributed as a Poisson distribution with parameter λ. The probability that Y is equal to y is by definition Let us now assume that λ is itself a random variable, such that the above equality is actually P (Y = y|λ). If λ is independent of Y and has a Gamma distribution with parameters α and β, the joint distribution of Y and λ is the product of the two distributions, that is The marginal distribution of Y , i.e. P (Y = y), is found by integrating the equality above over λ.
Equation (7) is the expression of the negative binomial distribution, with one of the many possible parametrizations. We will refer to this distribution as a negative binomial with parameters (α, 1/(1 + β)).
The Gamma-Poisson process can describe many phenomena because of the flexibility of the Gamma distribution. The α parameter of the Gamma distribution is often referred to as the "shape" parameter. For negative values of alpha, the distribution has a singularity at 0, whereas for positive values, the distribution has a single "bump". The β parameter is often referred to as the "scale" parameter because it represents a stretching of the curve along the x-axis that can fit different variances. As a result, the Gamma distribution is a good choice for almost every continuous distribution with positive values and a single mode. Combined to the Poisson distribution, it allows to fit almost every discrete distribution with positive values.

The negative multinomial distribution
We now introduce the case of r Poisson variables that are conditionally independent given λ. Intuitively, this means that λ is drawn at random first, which sets the parameter of the r Poisson distributions; the Poisson variables are then drawn from their respective distribution independently of each other. In other words, the conditional distribution can be written as follows Multiplying by the density of λ and integrating as above, the marginal distribution of the vector (Y 1 , . . . , Y r ) comes as This distribution is called the negative multinomial, which the negative binomial is a special case of (for r = 1). We will refer to it is as a negative multinomial with parameters (α, p 1 , . . . , p r ). It can be interpreted as the observations of a Gamma-Poisson process, where a common λ is drawn from a Gamma distribution, and r variables are drawn from independent Poisson distributions with parameters γ i λ (1 ≤ i ≤ r). The variables Y 1 , . . . , Y r are independent contionally on λ, but in section 2.3 we prove that they are never unconditionally independent.
The negative multinomial distribution has an alternative interpretation that sometimes proves useful. Suppose an urn contains black balls and balls of r different colors in respective proportions p 0 , p 1 , . . . , p r , such that p 0 + p 1 + . . . + p r = 1. If we draw balls with replacement from the urn until a black ball is drawn for the k-th time, the probability that the counts for the balls of each color are (y 1 , . . . , y r ) is This is formula (9), where α has been replaced by k. The negative multinomial distribution is thus a generalization of the drawing process described above. From the ball and urn interpretation, we get that the observed counts (y 1 , . . . , y r ) are twice smaller for a twice larger value of p 0 or for a twice smaller value of α.

Marginal distributions
To compute the marginal distributions of (Y 1 , . . . , Y r ), we could sum over (9), but it is simpler to sum over (8) and then multiply by the density of λ and integrate. The sum of (8) over all indices distinct from s ≤ r is a Poisson term of the form of (6) therefore, integrating over λ yields an expression similar to (7), namely Not surprisingly, we obtain a negative binomial distribution. More interestingly though, the parameters of this distribution are linked to the previous parameters by the equality p * s /p * 0 = p s /p 0 . This equality comes in handy to recompute the parameters of the negative multinomial distribution when variables are added or removed.
In the light of the analogy with balls in an urn, this result is not surprising. The marginal distribution corresponds to the same process where some colors are removed. The proportion of the remaining colors change, but their relative ratios do not.
To illustrate the use of this equality, we show with r = 2 that the marginal variables of a negative multinomial distribution are never independent. This also holds for r > 2, which can be proved by induction from the observation that mutual independence entails pairwise independence.
Assume that (Y 1 , Y 2 ) has a negative multinomial distribution and that it is not degenerate (all the parameters are strictly positive). Let us fix y 2 = 0. The terms P (y 1 = k, y 2 = 0) are proportional to Γ(α + k)p k 1 /k! and the terms P (y 1 = k)P (y 2 = 0) are proportional to Γ(α + k)p * k 1 /k! where p * 1 = p 1 /(p 1 + p 0 ) < p 1 so equality cannot hold for every k ≥ 0. In conclusion, the joint distribution cannot be equal to the product of the marginal distributions.
Note that the proof above assumes p 0 > 0, which is a consequence of β < ∞. So as long as λ is distributed according to a proper Gamma distribution, which is a defining feature of the negative multinomial distribution, the variables cannot be independent.
From the marginal distributions we can compute the conditional distribution of (Y 1 , . . . , Y s ) given (Y s+1 , . . . , Y r ) (and similary the distribution of any set of variables given the complentary set). Using the same approach as above, the marginal distribution is a negative multinomial that can be found to be The conditional distribution is computed as the ratio of the full distribution and the marginal distribution.

Inference about α and p 0
We now turn our attention to the distribution of the sum of observations drawn from negative multinomial distribution. More precisely, if (y 1 , . . . , y r ) is a random observation from a negative multinomial distribution with parameters (α, p 0 , p 1 , . . . , p r ), we are interested in the distribution of y 1 +. . .+y r . Conditionally on a given value of λ, Y 1 , . . . , Y r are independent and Poisson distributed. In regard of equation (8), this means that the conditional value of the sum is Poisson with parameter λ(γ 1 + . . . + γ r ). The distribution of the sum is thus negative binomial with parameters (α, p 0 ), where the value of p 0 is as defined in (9). This observation will be basis of the demonstration that the vector of marginal sums is a sufficient statistics for α and p 0 , which means that all the inference about these two parameters can be performed with the marginal sums. Let us consider a random sample of size n drawn from a negative multinomial distribution with parameters (α, p 0 , p 1 , . . . , p r ) and compute its distribution conditionally on the marginal sums. The observations consist of n vectors of dimension r, denoted (y k,1 , . . . , y k,r ), where 1 ≤ k ≤ n, and we denote the associated marginal sum y k . The likelihood of the sample is n k=1 Γ(α + y k,1 + . . . + y k,r ) Γ(α)y k,1 ! . . . y k,r ! p α 0 p The likelihood of the marginal sums is n k=1 Γ(α + y k,1 + . . . + y k,r ) Γ(α)(y k,1 + . . . + y k,r )! p α 0 (p 1 + . . . + p r ) y k,1 +...+y k,r .
The conditional distribution of the observations is found by dividing these two values, which yields n k=1 (y k,1 + . . . + y k,r )! y k,1 ! . . . y k,r ! p * y k,1 1 . . . p * y k,r r , where p * s = p s /(p 1 + . . . + p r ). This expression does not depend on either α nor p 0 , which proves that the n marginal sums represent a sufficient statistic for α and p 0 .

Negative multinomial parameter estimation
The multinomial negative distribution can be fitted with the maximum likelihood method. Suppose that an IID random sample of size n is available and denote the individual observations (y k,1 , . . . , y k,r ), 1 ≤ k ≤ n. The likelihood of the observations is L = n k=1 Γ(α + y k,1 + . . . + y k,r ) Γ(α)y k,1 ! . . . y k,r ! p α 0 p y k,1 1 . . . p y k,r r .
According to section 2.4, we can estimate α and p 0 from the marginal sums of the observed sample, which we denote y 1 , . . . , y n . The likelihood of the marginal sums is In practice it is easier to maximize the log-likelihood = log(L), which is as follows = n k=1 log Γ(α + y k ) + log Γ(α) − log(y k !) + α log(p 0 ) + y k log(1 − p 0 ). (10) The maximum is found by differentiating (10).
The equation above can be rearranged to express p 0 as a function of α whereȳ is the mean of the marginal sums (i.e.ȳ = n k=1 y k /n). Differentiating with respect to α, we now obtain We substitute (11) in the equation above and obtain an expression that depends on α only.
The equation above is solved by the Newton-Raphson method. For this, we use the update formula α (t+1) = α (t) − f (α (t) )/f (α (t) ), which requires to differentiate f and to choose an initial value α (0) .
The Newton-Raphson method is fast and converges after a few cycles. The initial value is usually chosen as α (0) = 1. Once α is known, the value of p 0 is set using equation (11).
To find the values of p 1 , . . . , p r , we differentiate log(L) with respect to p s (1 ≤ s ≤ r) and use Lagrange multipliers. It then appears that whereȳ s is the mean of the s-th variable,ȳ s = n k=1 y k,s /n.
We first differentiate with respect to π in order to obtain a substitution expression that will yield equations independent of π.
We now differentiate with respect to p 0 , . . . , p r , and as in section 2.5, we need to use Lagrange multipliers.
where we have defined the statistic y * i = k 1 ,...,kr =z 0 z(k 1 , . . . , k r )k i . From these equations and p 0 + . . . + p r = 0 we obtain Observe en passant that the term y * 1 + . . . + y * r is simply equal to the sum of all the observations. Substituting (15) and (17) in the expression of ∂ /dp 0 , we obtain We now differentiate with respect to α and substitute (15) to obtain an equation that depends on α and p 0 .
We now need to find p 0 and α such that f (p 0 , α) = g(p 0 , α) = 0. This is done with the Newton-Raphson method, which, in the case of several equations requires the Hessian matrix H of the system. By definition The update formula is analogous to the case of a single equation, namely .
The terms of the Hessian matrix are computed directly by differentiating f and g with respect to p 0 and α.
As a gradient method, the Newton-Raphson only guarantees convergence to a local maximum, the choice of the initial values is therefore important.
To maximize the chances of finding the global maximum, the procedure described above is applied to a range of initial conditions. To determine these conditions, the number of all-null observations z 0 is decreased artificially, and the resulting dataset is fitted by the procedure descrbed in 2.5 which gives a pair (p 0 , α). The pairs corresponding to distinct values of z 0 are collected and used as initial conditions for the algorithm described in this section.
In this method, there are only two free parameters, making it easier to find suitable intial conditions compared to the EM algorithm, in which there are three.

An emission model for ChIP-seq
The readout of ChIP-seq and similar experiments is a sequence of reads mapped to genomic windows of identical size. The zero-inflated negative multinomial distribution is a good choice 1 to describe the number of reads per window for the following reasons: 1. It is a discrete random variables with values in N.

It is multidimensional and can thus accomodate multiple replicates.
3. The marginal distributions are correlated.
4. Unmappable regions of the genome inflate the windows with no read.

Section 2 shows that it can be interpreted as a Poisson distribution
where the parameter λ varies as a Gamma variable. With this interpretation, each genomic window has a different expected read number. Conditionally on that number, the read count for a given window is a Poisson variable.

Estimating π and α
In section 2.1, we have argued that the negative multinomial distribution can be seen as a Gamma-Poisson process, where α is the shape parameter of the underlying Gamma distribution. The interpretation in the context of ChIP-seq experiments is that genomic windows have different expected read counts because of experimental and computational artifacts. Similarly, the parameter π is the probability that a genomic window is not mappable at all and will always have read count 0. These variations are a property of the genome and the methods used for the experiment, and not of a particular replicate. In terms of the formalism presented in section 1.1, those values are independent of the of the HMM, and they can be estimated independently. The estimation of π and α is based on the negative controls. These experiments provide the baseline variation of read count in the given genome with the given experimental setup. Section 2.7 provides a method to estimate π and α, together with the parameters p 0 , p 1 , . . . , p c , where c is the number of available control experiments.
If the values of π and α can be considered constant, this is not the case of p 0 , p 1 , . . . , p c . Indeed, there are r profiles to be modelled by a zero-inflated multinomial distribution, and the constrain p 0 +p 1 +. . .+p r = 1 imposes that the values estimated on the negative controls alone cannot remain the same when all the profiles are considered. For this reason, we refer to estimates based on the controls only as p * 0 , p * 1 , . . . , p * c . In section 2.3, we have shown that p s /p 0 = p * s /p * 0 (1 ≤ s ≤ c), so even if the values have to be updated, their respective ratios have to be maintained. Note that those constrains also determine the value of the ratio p s /p u for every pair (s, u) where s ≤ c and u ≤ c. So instead of storing the values p * 0 , p * 1 , . . . , p * c , at the end of the procedure presented in section 2.7, we store the ratios R 1 = p * 1 /p * 0 , . . . , R c = p * c /p * 0 .

Estimating state-dependent parameters
The remaining parameters are state-dependent, which means that their value depends on the state of the HMM. For this reason, p c+1 , . . . , p r have to be further indexed by the state and they are therefore referred to as p c+1,i , . . . , p r,i , where 1 ≤ i ≤ m. This is done through the Baum-Welch algorithm described in section 1.3. The last term of equation (5) strongly depends on the emission model, which is why the solution had to be deferred until here. To complete the cycle of the Baum-Welch algorithm, we need to maximize the expression n k=0 E θ (t) log g i k (y k , θ) y 0 , . . . , y n , where g i k (y k , θ) is the probability of the emission if the state of the HMM at step k is i k , and θ is the set of parameters. In this expression, y k is actually an r-dimensional vector (y k,1 , . . . , y k,r ), where y k,s is the value of the s-th variable (i.e. the read count in window k for experiment s). This can now be replaced by the formula of the zero-inflated negative multinomial model introduced above. More specifically, if (y 1 , . . . , y r ) = 0 r , the log-likelihood of a single emission if the HMM is in state i is g i (y 1 , . . . , y r |θ) = log(π) + log Γ(α + y 1 + . . . + y r ) − log Γ(α)+ α log(p 0,i ) + y 1 log(p 1,i ) + . . . + y r log(p r,i ), and if (y 1 , . . . , y r ) = 0 r it is g i (y 1 , . . . , y r |θ) = log(πp α 0,i + 1 − π). In each state i, remember that the parameters p 1 , . . . , p c satisfy the constrain p s,i /p 0,i = R s , 1 ≤ s ≤ c. In the first expression above, the constant log factorial terms have been removed because they do not depend on the state, and they are therefore neutral for the estimation process. As explained above, π and α are also state-independent and they can be considered constant. The first term above can thus be simplified to α log(p 0,i ) + y 1 log(p 1,i ) + . . . + y r log(p r,i ).
If we denote Z 0 the set of indices k such that y k,1 = . . . = y k,r = 0, the third term of (5) can finally be computed as We differentiate (19) with respect to the parameters p 0,i , . . . , p r,i , which are bound by the constrains p 0,i + . . . + p r,i = 1 and p s,i = R i p 0,i (1 ≤ s ≤ c). Starting with p 0,i , we obtain where A = k / ∈Z 0 φ k|n (i) and B = k∈Z 0 φ k|n (i). The differentiation of (19) with respect to the other variables yields the following equalities For convenience, we define the statistics y * i,s = k / ∈Z 0 φ k|n (i)y k,s and the constant C = 1 + R 1 + . . . + R c . Using (20) and (21), we obtain the following expression Starting from p 0,i + p 1,i + . . . + p r,i = 1 we obtain where Cµ is as shown in equation (22). In summary, maximizing (19) boils down to solving the following equation where D = y * i,1 + . . . + y * i,c and E = y * i,c+1 + . . . + y * i,r . However tedious to derive, this expression is a function of p 0,i only and it can thus be solved numberically by the Newton-Raphson method. We need to compute f and use the update formula p Once the value of p 0,i has been computed, the p s,i can be computed via p s,i = R i p 0,i for 1 ≤ s ≤ c and p s,i = y * i,s /µ for c + 1 ≤ s ≤ r, where µ is available from (22). The marks the end of the Baum-Welch algorithm and allows to perform another cycle.

Negative binomial mixture model
In a negative binomial mixture model, every observation is drawn from a finite set of negative binomial distributions. Here we will only consider the case of two distributions. More specifically we will consider that the observations are drawn from a negative binomial with parameters (α, p) with probability θ and from a negative binomial with parameters (α, q) with probability 1 − θ. The distribution is thus Mixture distributions are commonly fitted by the EM algorithm. We suppose that an unobserved variable Z takes value 1 with probability θ and value 0 with probability 1 − θ. Z indicates which of the two distributions the observation is drawn from. The full likelihood is where δ 0 (z) = 1 if and only if z = 0, and δ 1 (z) = 1 if and only if z = 1. This leads to the observation that The E-step of the algorithm is to write the expected log-likelihood of the distribution with respect to the conditional distribution of Z. If we write θ k = P (Z k = 1|Y = y k ) and drop the constant term, this quantity is = −n log Γ(α) + n k=1 log Γ(α + y k ) + θ k (log(θ) + α log(p) + y k log(1 − p))+ The M-step is to maximize (25) with respect to α, p and θ, which is done by differentiation. Introducingȳ 1 = n k=1 θ k y k / n k=1 θ k andȳ 0 = n k=1 (1 − θ k )y k / n k=1 (1 − θ k ), it is easily verified that at the optimum By substituting those values in ∂ /∂α, we obtain an expression f (α) that depends on α only f (α) = n ((log(α) − ψ(α))+ n k=1 ψ(α+y k )+θ k log(α+ȳ 1 )+(1−θ k ) log(α+ȳ 0 ) We need to find the solution of f (α) = 0, which is done by the Newton-Raphson method. For this, we use the update formula α i+1 = α i −f (α i )/f (α i ), where f (α) = n (1/α − ψ (α)) + n k=1 ψ (α + y k ) + θ k α +ȳ 1 + 1 − θ k α +ȳ 0 .

5.
If the values of α, θ, p and q are stable stop the algorithm, otherwise start another cycle.

Zero-Inflated Negative Multinomial
The Zero-Inflated Negative Binomial (ZINB) is a mixture model with a negative binomial and a constant equal to 0. This will inflate the term P (Y = 0), whence the name of the distribution. By definition, the numbers are drawn from a negative binomial distribution with parameters (α, p) with probability θ and from the constant equal to 0 with probability 1 − θ. The distribtion is thus P (Y = y) = Γ(α + y) Γ(α)y! θp α (1 − p) y + (1 − θ)δ 0 (y).
Differentiating with respect to θ and p, we obtain θ = n + n 0 w n + n 0 p = α α + y * , where y * = n k=1 y k /(n + n 0 w). Differentiating with respect to α and using the second equality above, we obtain the following equation, which is solved numerically with the method of Newton-Raphson.