BPRMeth: a flexible Bioconductor package for modelling methylation profiles

Abstract

Motivation: High-throughput measurements of DNA methylation are increasingly becoming a mainstay of biomedical investigations. While the methylation status of individual cytosines can sometimes be informative, several recent papers have shown that the functional role of DNA methylation is better captured by a quantitative analysis of the spatial variation of methylation across a genomic region.

Results: Here, we present BPRMeth, a Bioconductor package that quantifies methylation profiles by generalized linear model regression. The original implementation has been enhanced in two important ways: we introduced a fast, variational inference approach that enables the quantification of Bayesian posterior confidence measures on the model, and we adapted the method to use several observation models, making it suitable for a diverse range of platforms including single-cell analyses and methylation arrays.

Availability and implementation: http://bioconductor.org/packages/BPRMeth

Supplementary information: Supplementary data are available at Bioinformatics online.


Maximum Likelihood Estimation (MLE)
For all three observation models considered in the paper (i.e. Binomial, Bernoulli and Beta regression models) there is no closed form solution for the maximum likelihood estimate (MLE) due to the presence of the probit transformation; hence, we perform numerical optimization using the Conjugate Gradients method (Hestenes and Stiefel, 1952). The Conjugate Gradients method is a first order numerical optimization algorithm, so we need to derive both the likelihood and the gradient for each observation model. Similarly to the derivation for the Binomial model (Kapourani and Sanguinetti, 2016), we can obtain the MLE for the Bernoulli and Beta likelihood models.

Bernoulli observation model
To infer methylation profiles for single cell methylation data (Clark et al., 2018; Smallwood et al., 2014), we use a Bernoulli observation model. Briefly, assume that for a genomic region D we have I independent CpGs with binary methylation states y_i ∈ {0, 1}; then the log-likelihood of the Bernoulli probit regression model is given by

$$ \ln p(\mathbf{y}\,|\,\mathbf{X}, \mathbf{w}) = \sum_{i=1}^{I} \Big\{ y_i \ln \Phi(\mathbf{w}^{\top}\mathbf{x}_i) + (1 - y_i) \ln\big(1 - \Phi(\mathbf{w}^{\top}\mathbf{x}_i)\big) \Big\}, \quad (1) $$

where x_i ≡ h(x_i) denotes the basis function evaluations at CpG site x_i, w ∈ R^D represents the regression coefficients, and Φ is the inverse probit function (Gaussian cumulative distribution function) needed in order to map the function output to the (0, 1) interval. The gradient of Eq. (1) with respect to the model parameters w is given by

$$ \nabla_{\mathbf{w}} \ln p(\mathbf{y}\,|\,\mathbf{X}, \mathbf{w}) = \sum_{i=1}^{I} \mathbf{x}_i\, \phi(\mathbf{w}^{\top}\mathbf{x}_i) \left[ \frac{y_i}{\Phi(\mathbf{w}^{\top}\mathbf{x}_i)} - \frac{1 - y_i}{1 - \Phi(\mathbf{w}^{\top}\mathbf{x}_i)} \right], \quad (2) $$

where φ(·) is the probability density function of the standard normal distribution N(0, 1).
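To make the estimation procedure concrete, the following minimal R sketch evaluates Eqs. (1) and (2) and passes them to R's optim with method = "CG" (Conjugate Gradients). It is an illustration of the approach, not the package's internal code; the design matrix H (rows h(x_i)) and the binary vector y are assumed inputs.

# Minimal sketch: MLE for the Bernoulli probit regression model.
# H is the I x D matrix of basis evaluations h(x_i); y the binary states.
bernoulli_loglik <- function(w, H, y) {
  p <- pnorm(H %*% w)                      # Phi(w^T x_i)
  p <- pmin(pmax(p, 1e-10), 1 - 1e-10)     # guard against log(0)
  sum(y * log(p) + (1 - y) * log(1 - p))   # Eq. (1)
}
bernoulli_grad <- function(w, H, y) {
  eta <- H %*% w
  p   <- pmin(pmax(pnorm(eta), 1e-10), 1 - 1e-10)
  as.vector(t(H) %*% (dnorm(eta) * (y / p - (1 - y) / (1 - p))))  # Eq. (2)
}
# optim minimises by default, so we negate both functions
fit <- optim(par = rep(0, ncol(H)),
             fn  = function(w) -bernoulli_loglik(w, H, y),
             gr  = function(w) -bernoulli_grad(w, H, y),
             method = "CG")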

Beta observation model
To infer methylation profiles for array methylation data, and in general any continuous observations lying in the (0, 1) interval, we use a Beta observation model, which has been successfully applied to DNA methylation arrays in several earlier publications, see e.g. Siegmund (2011). We use a different parameterisation of the beta regression model (Ferrari and Cribari-Neto, 2004) in terms of a mean µ and a precision parameter γ. The Beta density then takes the form

$$ p(y;\, \mu, \gamma) = \frac{\Gamma(\gamma)}{\Gamma(\mu\gamma)\,\Gamma\big((1-\mu)\gamma\big)}\; y^{\mu\gamma - 1} (1 - y)^{(1-\mu)\gamma - 1}, \quad (3) $$

with 0 < µ < 1, γ > 0, and Γ(·) the Gamma function. Assume that for a genomic region D we have I independent CpGs y_i ∈ (0, 1); then, setting µ_i = Φ(w^T x_i), the log-likelihood of the Beta probit regression model becomes

$$ \ln p(\mathbf{y}\,|\,\mathbf{X}, \mathbf{w}) = \sum_{i=1}^{I} \Big\{ \ln\Gamma(\gamma) - \ln\Gamma(\mu_i\gamma) - \ln\Gamma\big((1-\mu_i)\gamma\big) + (\mu_i\gamma - 1)\ln y_i + \big((1-\mu_i)\gamma - 1\big)\ln(1 - y_i) \Big\}. \quad (4) $$

The gradient of Eq. (4) with respect to the model parameters w is given by

$$ \nabla_{\mathbf{w}} \ln p(\mathbf{y}\,|\,\mathbf{X}, \mathbf{w}) = \gamma \sum_{i=1}^{I} \mathbf{x}_i\, \phi(\mathbf{w}^{\top}\mathbf{x}_i) \Big[ \ln\frac{y_i}{1 - y_i} - \psi(\mu_i\gamma) + \psi\big((1-\mu_i)\gamma\big) \Big], \quad (5) $$

where ψ(·) is the digamma function. Note that in this formulation the precision parameter γ is the same across observations. Also, the current implementation does not allow inference of the precision parameter γ, which is assumed to be fixed. We leave the extension of jointly optimising over the mean µ and precision γ as future work.
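Analogously, a minimal R sketch of Eqs. (4) and (5), with the precision γ held fixed as described above; again H and y are assumed inputs rather than the package's internal implementation:

# Minimal sketch: Beta probit regression log-likelihood and gradient.
beta_loglik <- function(w, H, y, gamma) {
  mu <- pmin(pmax(pnorm(H %*% w), 1e-10), 1 - 1e-10)  # mu_i = Phi(w^T x_i)
  sum(lgamma(gamma) - lgamma(mu * gamma) - lgamma((1 - mu) * gamma) +
      (mu * gamma - 1) * log(y) +
      ((1 - mu) * gamma - 1) * log(1 - y))            # Eq. (4)
}
beta_grad <- function(w, H, y, gamma) {
  eta <- H %*% w
  mu  <- pmin(pmax(pnorm(eta), 1e-10), 1 - 1e-10)
  # digamma() is the psi function appearing in Eq. (5)
  as.vector(gamma * t(H) %*% (dnorm(eta) *
    (log(y / (1 - y)) - digamma(mu * gamma) + digamma((1 - mu) * gamma))))
}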

Variational Bayes
In mean-field variational inference (Blei et al., 2017) the intractable posterior distribution of the latent variables p(θ|X) is approximated by a factorised distribution

$$ q(\boldsymbol{\theta}) = \prod_i q_i(\theta_i), $$

where θ denotes the latent variables and X the observed variables. We then search over the space of approximating distributions to find the distribution with the minimum Kullback-Leibler (KL) divergence from the actual posterior,

$$ \mathrm{KL}(q \,\|\, p) = -\int q(\boldsymbol{\theta}) \ln\frac{p(\boldsymbol{\theta}\,|\,\mathbf{X})}{q(\boldsymbol{\theta})}\; d\boldsymbol{\theta}. \quad (6) $$

The KL divergence can then be minimised by performing a free form minimisation over the q_i(θ_i), leading to the following update equation

$$ \ln q_i^{*}(\theta_i) = \big\langle \ln p(\mathbf{X}, \boldsymbol{\theta}) \big\rangle_{q_{j \neq i}} + \text{const}, \quad (7) $$

where ⟨·⟩_{q_{j≠i}} denotes an expectation with respect to the distributions q_j(θ_j) for all j ≠ i.
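To make Eq. (7) concrete, consider a textbook example (Bishop, 2006), not specific to BPRMeth: approximating a bivariate Gaussian posterior p(θ|X) = N(θ|µ, Λ^{-1}) by a factorised q(θ_1) q(θ_2). Keeping only the terms of ln p that involve θ_1 and taking the expectation over q(θ_2) gives

$$ \ln q_1^{*}(\theta_1) = \big\langle \ln p(\boldsymbol{\theta}\,|\,\mathbf{X}) \big\rangle_{q_2} + \text{const} = -\tfrac{1}{2}\Lambda_{11}\theta_1^{2} + \theta_1\big(\Lambda_{11}\mu_1 - \Lambda_{12}(\langle\theta_2\rangle - \mu_2)\big) + \text{const}, $$

$$ \Rightarrow\quad q_1^{*}(\theta_1) = \mathcal{N}\big(\theta_1 \,\big|\, \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(\langle\theta_2\rangle - \mu_2),\; \Lambda_{11}^{-1}\big). $$

Completing the square shows that each optimal factor is itself Gaussian, and the coupled updates for q_1 and q_2 are iterated to convergence; the same mechanics drive the updates for the probit regression model below.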

Bernoulli observation model
To account for the inherent noise in the measurements and the limited CpG coverage, we also reformulate the model in a Bayesian framework. The Bayesian probit regression model for a single CpG site then becomes

$$ p(y_i \,|\, \mathbf{x}_i, \mathbf{w}) = \Phi(\mathbf{w}^{\top}\mathbf{x}_i)^{y_i}\, \big(1 - \Phi(\mathbf{w}^{\top}\mathbf{x}_i)\big)^{1 - y_i}. $$

Performing inference for this model in the Bayesian framework is complicated by the fact that no conjugate prior p(w|H) exists for the parameters of the probit regression model. The model can be made amenable to Bayesian estimation thanks to a data augmentation strategy originally proposed by Albert and Chib (1993). This strategy consists of introducing an additional auxiliary latent variable z_i, which has a Gaussian distribution conditioned on the input, z_i ~ N(z_i | w^T x_i, 1). The augmented model has the hierarchical structure shown in Fig. 1, where y_i is now deterministic conditional on the sign of the latent variable z_i. We then introduce a conjugate Gaussian prior over the parameters, w ~ N(w | 0, τ^{-1} I), where the hyper-parameter τ controlling the precision of the Gaussian prior is assumed to follow a Gamma distribution, τ ~ Gamma(τ | α_0, β_0). This reduces the necessary conditional distributions to a tractable form, as either Gaussian, Gamma or one-dimensional truncated Gaussian distributions. Thus, the joint distribution over all the variables is given by

$$ p(\mathbf{y}, \mathbf{z}, \mathbf{w}, \tau \,|\, \mathbf{X}) = p(\mathbf{y}\,|\,\mathbf{z})\, p(\mathbf{z}\,|\,\mathbf{w}, \mathbf{X})\, p(\mathbf{w}\,|\,\tau)\, p(\tau). $$

To apply the variational inference machinery, we take the variational posterior distribution to factorise over the latent variables,

$$ q(\mathbf{z}, \mathbf{w}, \tau) = q(\mathbf{z})\, q(\mathbf{w})\, q(\tau). $$

Applying Eq. (7) to our model, we obtain the following solutions for the optimised factors of the variational posterior:

$$ q^{*}(\mathbf{z}) = \prod_{i=1}^{I} \begin{cases} \mathcal{TN}_{+}\big(z_i \,|\, \mathbf{m}^{\top}\mathbf{x}_i,\, 1\big) & \text{if } y_i = 1, \\ \mathcal{TN}_{-}\big(z_i \,|\, \mathbf{m}^{\top}\mathbf{x}_i,\, 1\big) & \text{if } y_i = 0, \end{cases} $$

$$ q^{*}(\mathbf{w}) = \mathcal{N}(\mathbf{w} \,|\, \mathbf{m}, \mathbf{S}), \qquad \mathbf{S} = \big(\langle\tau\rangle\,\mathbf{I} + \mathbf{X}^{\top}\mathbf{X}\big)^{-1}, $$

$$ q^{*}(\tau) = \mathrm{Gamma}(\tau \,|\, \alpha, \beta), \qquad \alpha = \alpha_0 + \tfrac{D}{2}, \qquad \beta = \beta_0 + \tfrac{1}{2}\big(\mathbf{m}^{\top}\mathbf{m} + \mathrm{tr}(\mathbf{S})\big), $$

where m = S X^T ⟨z⟩_{q(z)}, TN+(·) denotes the normal distribution truncated on the left tail at zero to contain only positive values, and TN−(·) denotes the normal distribution truncated on the right tail at zero to contain only negative values. The variational lower bound (i.e. evidence lower bound) which we are optimising is given by

$$ \mathcal{L}(q) = \int q(\mathbf{z}, \mathbf{w}, \tau)\, \ln \frac{p(\mathbf{y}, \mathbf{z}, \mathbf{w}, \tau \,|\, \mathbf{X})}{q(\mathbf{z}, \mathbf{w}, \tau)}\; d\mathbf{z}\, d\mathbf{w}\, d\tau. $$

The predictive distribution over y*, given a new input x*, is evaluated using the variational posterior for the parameters,

$$ p(y^{*} \,|\, \mathbf{x}^{*}, \mathbf{X}, \mathbf{y}) = \int p(y^{*}, z^{*}, \mathbf{w}, \tau \,|\, \mathbf{x}^{*}, \mathbf{X}, \mathbf{y})\; dz^{*}\, d\mathbf{w}\, d\tau. $$
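A compact R sketch of the resulting coordinate ascent loop, assuming fixed hyper-parameters α_0, β_0; it follows the standard variational treatment of the augmented probit model rather than the exact BPRMeth source, and H and y are hypothetical inputs as before:

# Sketch of the VB updates for the augmented Bayesian probit model.
vb_probit <- function(H, y, a0 = 1e-2, b0 = 1e-4, iters = 100) {
  D <- ncol(H)
  m <- rep(0, D)
  a <- a0 + D / 2                       # shape of q(tau) stays fixed
  b <- b0
  for (it in seq_len(iters)) {
    mu <- as.vector(H %*% m)
    pz <- pmin(pmax(pnorm(mu), 1e-10), 1 - 1e-10)
    # expectations of z_i under the truncated Gaussians q*(z_i)
    Ez <- ifelse(y == 1, mu + dnorm(mu) / pz,
                         mu - dnorm(mu) / (1 - pz))
    S  <- solve((a / b) * diag(D) + crossprod(H))   # q*(w) covariance
    m  <- as.vector(S %*% crossprod(H, Ez))         # m = S X^T <z>
    b  <- b0 + 0.5 * (sum(m^2) + sum(diag(S)))      # rate of q*(tau)
  }
  list(m = m, S = S, alpha = a, beta = b)
}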

Binomial observation model
For the Binomial observation model, we can recast it as a Bernoulli observation model with additional observations. Assume that we have I independent observations y_i = (m_i, T_i) that follow a Binomial distribution,

$$ m_i \sim \mathrm{Binomial}(m_i \,|\, T_i, \gamma_i), $$

where T_i is the total number of trials for the i-th observation, m_i denotes the number of successes, and again the success probability γ_i is related to the vector of covariates x_i ∈ R^D. We can think of each m_i as the total number of successes of T_i independent Bernoulli experiments with outcomes y*_{i1}, ..., y*_{iT_i}, where each y*_{it} follows a Bernoulli distribution. It is simple to reconstruct the binary outcomes y*_{it} from the Binomial observations (m_i, T_i) using the following formula,

$$ y^{*}_{it} = \begin{cases} 1 & \text{if } t \le m_i, \\ 0 & \text{if } m_i < t \le T_i. \end{cases} $$

Using this approach of extending our observation matrix to binary outcomes, we can use the variational implementation of the Bernoulli regression observation model derived above to perform inference for Binomial data.
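For instance, a short R helper that performs this expansion (illustrative only; the success counts m, trial totals and the rows of H are assumed inputs):

binomial_to_bernoulli <- function(m, total, H) {
  # expand each (m_i, T_i) pair into m_i ones followed by T_i - m_i zeros
  y_star <- unlist(mapply(function(mi, Ti) c(rep(1, mi), rep(0, Ti - mi)),
                          m, total, SIMPLIFY = FALSE))
  # replicate the covariate row of observation i a total of T_i times
  H_star <- H[rep(seq_len(nrow(H)), total), , drop = FALSE]
  list(y = y_star, H = H_star)
}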

Beta observation model
In the case of the Beta observation model, the data augmentation strategy cannot be applied; hence we cannot introduce a conjugate prior over the coefficients that would lead to tractable conditional distributions. Due to these difficulties, we have left the variational Bayes implementation of the Beta observation model as an interesting topic for future work.

Clustering methylation profiles
Imagine that each of our observations is generated by a different basis function regression model, and our goal is to cluster together regression models with similar patterns, e.g. cluster genomic regions with similar methylation patterns, or, for a given genomic region, cluster cells based on their methylation profiles. A concrete example is shown in the next section.

Bernoulli observation model
Assume that we have N observations (n = 1, ..., N), and each observation is generated from a corresponding latent variable c_n, a 1-of-K binary vector with elements c_nk for k = 1, ..., K. The conditional distribution of C, given the mixing proportions π, is given by

$$ p(\mathbf{C} \,|\, \boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{c_{nk}}. $$

The latent variables C will generate our latent observations Z = {z_n}, with z_n ∈ R^{I_n}, which in turn will generate our binary observations Y = {y_n}, with y_n ∈ {0, 1}^{I_n}, depending on the sign of Z as explained above. The conditional distribution of the data (Z, Y), given the latent variables C and the component parameters w, is

$$ p(\mathbf{Z}, \mathbf{Y} \,|\, \mathbf{C}, \mathbf{w}, \mathbf{X}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \big[ p(\mathbf{y}_n \,|\, \mathbf{z}_n)\; p(\mathbf{z}_n \,|\, \mathbf{w}_k, \mathbf{X}_n) \big]^{c_{nk}}. $$

To complete the model we introduce priors over the parameters. We choose a Dirichlet distribution over the mixing proportions, π ~ Dir(π|δ_0), an independent Gaussian prior over the coefficients for each cluster, w_k ~ N(w_k|0, τ_k^{-1} I), and finally a Gamma prior for the precision parameters, τ_k ~ Gamma(τ_k|α_0, β_0). Having defined our model, we can now write the joint distribution over the observed and latent variables,

$$ p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \mathbf{w}, \boldsymbol{\pi}, \boldsymbol{\tau} \,|\, \mathbf{X}) = p(\mathbf{Y}\,|\,\mathbf{Z})\; p(\mathbf{Z}\,|\,\mathbf{C}, \mathbf{w}, \mathbf{X})\; p(\mathbf{C}\,|\,\boldsymbol{\pi})\; p(\boldsymbol{\pi})\; p(\mathbf{w}\,|\,\boldsymbol{\tau})\; p(\boldsymbol{\tau}). $$

For our probabilistic model, the approximating distribution factorises over the latent variables as follows,

$$ q(\mathbf{Z}, \mathbf{C}, \mathbf{w}, \boldsymbol{\pi}, \boldsymbol{\tau}) = q(\mathbf{Z})\; q(\mathbf{C})\; q(\mathbf{w})\; q(\boldsymbol{\pi})\; q(\boldsymbol{\tau}). $$
Applying Eq. (7) to our model, we obtain the following solutions for the optimised factors of the variational posterior (detailed mathematical derivations can be found at http://rpubs.com/cakapourani/vb-mixture-bpr). For the cluster assignments we obtain

$$ q^{*}(\mathbf{C}) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{c_{nk}}, \qquad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}, $$

where

$$ \ln \rho_{nk} = \langle \ln \pi_k \rangle_{q(\pi_k)} - \tfrac{1}{2} \big\langle (\mathbf{z}_n - \mathbf{X}_n \mathbf{w}_k)^{\top} (\mathbf{z}_n - \mathbf{X}_n \mathbf{w}_k) \big\rangle_{q(\mathbf{z}_n),\, q(\mathbf{w}_k)} + \text{const}. $$

The remaining factors take the same functional forms as in the single-profile model: q*(z_n) is a product of one-dimensional truncated Gaussians, q*(π) is a Dirichlet distribution, q*(τ_k) = Gamma(τ_k | α_k, β_k), and q*(w_k) = N(w_k | m_k, S_k) with

$$ \mathbf{S}_k = \Big( \langle \tau_k \rangle\, \mathbf{I} + \sum_{n=1}^{N} r_{nk}\, \mathbf{X}_n^{\top} \mathbf{X}_n \Big)^{-1}, \qquad \mathbf{m}_k = \mathbf{S}_k \sum_{n=1}^{N} r_{nk}\, \mathbf{X}_n^{\top} \langle \mathbf{z}_n \rangle_{q(\mathbf{z}_n)}. $$

The variational lower bound (i.e. evidence lower bound) which we are optimising is then given by

$$ \mathcal{L}(q) = \sum_{\mathbf{C}} \int q(\mathbf{Z}, \mathbf{C}, \mathbf{w}, \boldsymbol{\pi}, \boldsymbol{\tau}) \ln \frac{p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \mathbf{w}, \boldsymbol{\pi}, \boldsymbol{\tau} \,|\, \mathbf{X})}{q(\mathbf{Z}, \mathbf{C}, \mathbf{w}, \boldsymbol{\pi}, \boldsymbol{\tau})}\; d\mathbf{Z}\, d\mathbf{w}\, d\boldsymbol{\pi}\, d\boldsymbol{\tau} $$

$$ = \langle \ln p(\mathbf{Y}\,|\,\mathbf{Z}) \rangle + \langle \ln p(\mathbf{Z}\,|\,\mathbf{C}, \mathbf{X}, \mathbf{w}) \rangle + \langle \ln p(\mathbf{C}\,|\,\boldsymbol{\pi}) \rangle + \langle \ln p(\boldsymbol{\pi}) \rangle + \langle \ln p(\mathbf{w}\,|\,\boldsymbol{\tau}) \rangle + \langle \ln p(\boldsymbol{\tau}) \rangle - \langle \ln q(\mathbf{Z}) \rangle - \langle \ln q(\mathbf{C}) \rangle - \langle \ln q(\mathbf{w}) \rangle - \langle \ln q(\boldsymbol{\pi}) \rangle - \langle \ln q(\boldsymbol{\tau}) \rangle, $$

where each expectation is taken with respect to the corresponding variational factors. The predictive density of a new observation y*, which will be associated with a latent variable c*, latent observation z* and covariates X*, is given by

$$ p(\mathbf{y}^{*} \,|\, \mathbf{X}^{*}, \mathbf{Y}) = \sum_{\mathbf{c}^{*}} \int p(\mathbf{y}^{*} \,|\, \mathbf{z}^{*})\; p(\mathbf{z}^{*} \,|\, \mathbf{c}^{*}, \mathbf{X}^{*}, \mathbf{w})\; p(\mathbf{c}^{*} \,|\, \boldsymbol{\pi})\; q(\mathbf{w})\; q(\boldsymbol{\pi})\; d\mathbf{z}^{*}\, d\mathbf{w}\, d\boldsymbol{\pi}. $$
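In practice the responsibilities r_nk are best computed from ln ρ_nk on the log scale for numerical stability. A small illustrative R helper (log_rho being an assumed N × K matrix of the unnormalised log responsibilities, not a BPRMeth function):

responsibilities <- function(log_rho) {
  log_rho <- log_rho - apply(log_rho, 1, max)  # log-sum-exp trick: subtract row maxima
  r <- exp(log_rho)
  r / rowSums(r)                               # normalise so each row sums to one
}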

Binomial observation model
Similarly to Section 2.1.2, we can transform the Binomial observations to Bernoulli observations and use the variational implementation for the Bernoulli regression model as derived above.

Model selection
One of the most appealing aspects of Bayesian inference, and more specifically of variational approximations within mixture models, is the possibility of directly performing model selection, i.e. determining the number of clusters, within the optimisation procedure. It has been repeatedly observed (Corduneanu and Bishop, 2001) that, when fitting a mixture model with a large number of components variationally, the variational procedure will prune away components with no support in the data, hence effectively determining an appropriate number of clusters in an automatic fashion. We can gain some intuition as to why this happens in the following way. We can rewrite the KL divergence as

$$ \mathrm{KL}(q \,\|\, p) = \ln p(\mathbf{X}) - \big\langle \ln p(\mathbf{X}\,|\,\boldsymbol{\theta}) \big\rangle_{q(\boldsymbol{\theta})} + \mathrm{KL}\big(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta})\big), $$

where ln p(X) can be ignored since it is constant with respect to q(θ). To minimise this objective function, the variational approximation will try both to increase the expected log likelihood of the data, ⟨ln p(X|θ)⟩, and to minimise the KL divergence between q(θ) and the prior distribution p(θ). Hence, using variational Bayes we have an automatic trade-off between fitting the data and model complexity (Bishop, 2006), giving the possibility to automatically determine the number of clusters without resorting to cross-validation techniques.
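As an illustration of what this pruning looks like operationally, the following sketch drops components whose expected mixing proportions have collapsed to near zero; the threshold is an arbitrary illustrative choice and this is not BPRMeth API:

prune_components <- function(r, threshold = 1e-2) {
  pi_hat <- colSums(r) / nrow(r)     # expected mixing proportions from responsibilities
  keep   <- pi_hat > threshold       # components still supported by the data
  list(r = r[, keep, drop = FALSE], pi = pi_hat[keep])
}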
To provide a concrete example of model selection in the variational Bayes paradigm, we generated N = 300 methylation genomic regions from K = 3 clusters. Then, we set the initial number of clusters to K = 6 and let the variational optimisation prune away inactive clusters. Figure 2 shows the state of the variational Bayes algorithm at different iterations; after only 15 iterations it had automatically recovered the correct number of clusters. Figure 3 shows how the evidence lower bound increases at each iteration during optimisation, with small jumps when the model discards mixture components.