Samuel Livingstone, Giacomo Zanella, The Barker Proposal: Combining Robustness and Efficiency in Gradient-Based MCMC, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 84, Issue 2, April 2022, Pages 496–523, https://doi.org/10.1111/rssb.12482
Abstract
There is a tension between robustness and efficiency when designing Markov chain Monte Carlo (MCMC) sampling algorithms. Here we focus on robustness with respect to tuning parameters, showing that more sophisticated algorithms tend to be more sensitive to the choice of step-size parameter and less robust to heterogeneity of the distribution of interest. We characterise this phenomenon by studying the behaviour of spectral gaps as an increasingly poor step-size is chosen for the algorithm. Motivated by these considerations, we propose a novel and simple gradient-based MCMC algorithm, inspired by the classical Barker accept-reject rule, with improved robustness properties. Extensive theoretical results, dealing with robustness to tuning, geometric ergodicity and scaling with dimension, suggest that the novel scheme combines the robustness of simple schemes with the efficiency of gradient-based ones. We show numerically that this type of robustness is particularly beneficial in the context of adaptive MCMC, giving examples where our proposed scheme significantly outperforms state-of-the-art alternatives.
1 Introduction
The need to compute high-dimensional integrals is ubiquitous in modern statistical inference and beyond (e.g. Brooks et al., 2011; Krauth, 2006; Stuart, 2010). Markov chain Monte Carlo (MCMC) is a popular solution, in which the central idea is to construct a Markov chain with a certain limiting distribution and use ergodic averages to approximate expectations of interest. In the celebrated Metropolis–Hastings algorithm, the Markov chain transition is constructed using a combination of a ‘candidate’ kernel, to suggest a possible move at each iteration, together with an accept-reject mechanism (Hastings, 1970; Metropolis et al., 1953). Many different flavours of Metropolis–Hastings exist, with the most common difference being in the construction of the candidate kernel. In the Random walk Metropolis (RWM), proposed moves are generated using a symmetric distribution centred at the current point. Two more sophisticated methods are the Metropolis-adjusted Langevin algorithm (Roberts & Tweedie, 1996) and Hamiltonian/hybrid Monte Carlo (Duane et al., 1987; Neal, 2011). Both use gradient information about the distribution of interest (the target) to inform proposals. Gradient-based methods are widely considered to be state-of-the-art in MCMC, and much current work has been dedicated to their study and implementation (e.g. Beskos et al., 2013; Dalalyan, 2017; Durmus & Moulines, 2017).
Several measures of performance have been developed to help choose a suitable candidate kernel for a given task. One of these is high-dimensional scaling arguments, which compare how the efficiency of the method decays with d, the dimension of the state space. For the random walk algorithm this decay is of the order Θ(d) (Roberts et al., 1997), while for the Langevin algorithm the same figure is Θ(d^{1/3}) (Roberts & Rosenthal, 1998) and for Hamiltonian Monte Carlo (HMC) it is Θ(d^{1/4}) (Beskos et al., 2013). Another measure is to find general conditions under which a kernel will produce a geometrically ergodic Markov chain. For the random walk algorithm, this essentially occurs when the tails of the posterior decay at a faster than exponential rate and are suitably regular (more precise conditions are given in Jarner & Hansen, 2000). The same is broadly true of the Langevin and Hamiltonian schemes (Durmus et al., 2017a; Livingstone et al., 2019; Roberts & Tweedie, 1996), but here there is an additional restriction that the tails should not decay too quickly. This limitation is caused by the way in which gradients are used to construct the candidate kernel, which can result in the algorithm generating unreasonable proposals that are nearly always rejected in certain regions (Livingstone et al., 2019; Roberts & Tweedie, 1996).
There is clearly some tension between the different results presented above. According to the scaling arguments, gradient information is preferable. The ergodicity results, however, imply that gradient-based schemes are typically less robust than others, in the sense that there is a smaller class of limiting distributions for which the output will be a geometrically ergodic Markov chain. It is natural to wonder whether it is possible to incorporate gradient information in such a way that this measure of robustness is not compromised. Simple approaches to modifying the Langevin algorithm for this purpose have been suggested (based on the idea of truncating gradients, for example Atchade, 2006; Roberts & Tweedie, 1996), but these typically compromise the favourable scaling of the original method. In addition to this, it is often remarked that gradient-based methods can be difficult to tune. Algorithm performance is often highly sensitive to the choice of scale within the proposal (Neal, 2003, Figure 15), and if this is chosen to be too large in certain directions then performance can degrade rapidly. Because of this, practitioners must spend a long time adjusting the tuning parameters to ensure that the algorithm is running well, or develop sophisticated adaptation schemes for this purpose (e.g. Hoffman & Gelman, 2014), which can nonetheless still require a large number of iterations to find good tuning parameters (see Sections 5 and 6). We will refer to this issue as robustness to tuning.
In this article, we present a new gradient-based MCMC scheme, the Barker proposal, which combines favourable high-dimensional scaling properties with favourable ergodicity and robustness to tuning properties. To motivate the new scheme, in Section 2, we present a direct argument showing how the spectral gaps for the random walk, Langevin and Hamiltonian algorithms behave as the tuning parameters are chosen to be increasingly unsuitable for the problem at hand. In particular, we show that the spectral gaps for commonly used gradient-based algorithms decay to zero exponentially fast in the degree of mismatch between the scales of the proposal and target distributions, while for the random walk Metropolis the decay is polynomial. In Section 3, we derive the Barker proposal scheme beginning from a family of π-invariant continuous-time jump processes, and discuss its connections to the concept of ‘locally balanced’ proposals, introduced in (Zanella, 2020) for discrete state spaces. The name Barker comes from the particular choice of ‘balancing function’ used to uncover the scheme, which is inspired by the classical Barker accept-reject rule (Barker, 1965). In Section 4, we conduct a detailed analysis of the ergodicity, scaling and robustness properties of this new method, establishing that it shares the favourable robustness to tuning of the random walk algorithm, can be geometrically ergodic in the presence of very light tails, and enjoys the scaling with dimension of the Langevin scheme. The theory is then supported by an extensive simulation study in Sections 5 and 6, including comparisons with state-of-the-art alternative sampling schemes, which highlights that this kind of robustness is particularly advantageous in the context of adaptive MCMC. The code to reproduce the experiments is available from the online repository at the link https://github.com/gzanella/barker. Proofs and further numerical simulations are provided in the supplement.
1.1 Basic set-up and notation
Throughout we work on the Borel space (ℝ^d, ℬ(ℝ^d)), with d ≥ 1 indicating the dimension. For λ > 0, we write λ ↑ ∞ and λ ↓ 0 to emphasize the direction of convergence when this is important. For two functions f, g : (0, ∞) → (0, ∞), we use the Bachmann–Landau notation f(t) = O(g(t)) if lim sup_{t→∞} f(t)/g(t) < ∞, and f(t) = Θ(g(t)) if both f(t) = O(g(t)) and g(t) = O(f(t)).
2 Robustness to Tuning
In this section, we seek to quantify the robustness of the random walk, Langevin and Hamiltonian schemes with respect to the mismatch between the scales of π(·) and Q in a given direction. Unlike other analyses in the MCMC literature (e.g. Beskos et al., 2018; Roberts & Rosenthal, 2001), we are interested in studying how MCMC algorithms perform when they are not optimally tuned, in order to understand how crucially performance depends on such design choices (e.g. the choice of proposal step-size or pre-conditioning matrix). The rationale for performing such an analysis is that achieving optimal or even close to optimal tuning can be extremely challenging in practice, especially when π(·) exhibits substantial heterogeneity. This is typically done using past samples in the chain to compute online estimates of the average acceptance rate and the covariance of π (or simply its diagonal terms for computational convenience), and then using those estimates to tune the proposal step-sizes in different directions (Andrieu & Thoms, 2008). If the degree of heterogeneity is large, it can take a long time for certain directions to be well-explored, and hence for the estimated covariance to be representative and the tuning parameters to converge.
In such settings, algorithms that are more robust to tuning are not only easier to use when such tuning is done manually by the user, but can also greatly facilitate the process of learning the tuning parameters adaptively within the algorithm. We show in Sections 5 and 6 that if an algorithm is robust to tuning then this adaptation process can be orders of magnitude faster than in the alternative case, drastically reducing the overall computational cost for challenging targets. The intuition for this is that more robust algorithms will start performing well (i.e. sampling efficiently) earlier in the adaptation process (when tuning parameters are not yet optimally tuned), which in turn will speed up the exploration of the target and the learning of the tuning parameters.
2.1 Analytical framework
Our framework considers a family of targets π^{(λ)}, λ > 0, obtained from a fixed distribution π by rescaling its first coordinate by λ, while the tuning parameters of the algorithm are held fixed. In this context, the λ ↓ 0 regime is representative of distributions in which one component (in this case the first) has a very small scale compared to all others. Conversely, the λ ↑ ∞ regime reflects the case in which one component has a much larger scale than its counterparts. Studying robustness to tuning in the context of heterogeneity is particularly relevant, as highlighted above, as this is exactly the context in which tuning is more challenging. The λ ↓ 0 regime is particularly interesting and has been recently considered in Beskos et al. (2018), where the authors study the behaviour of the RWM for ‘ridged’ densities for different values of k using a diffusion limit approach. The focus in that work, however, was on finding optimal tuning parameters for the algorithm as a function of λ, whereas the present paper is concerned with the regime in which the tuning parameters are fixed (as discussed above).
The above framework could be equivalently formulated by keeping the target distribution π fixed and instead rescaling the first component of the candidate kernel by a factor 1/λ. This is indeed the formulation we mostly use in the proofs of our theoretical results. A proof of the mathematical equivalence between the two formulations can be found in the supplement; a schematic of both formulations is sketched below.
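For concreteness, the following display sketches the two equivalent formulations. It is our reconstruction from the surrounding discussion rather than a verbatim transcription of the paper's definitions.

```latex
% Reconstruction from the surrounding discussion (not verbatim from the
% paper): the family of targets in which the first coordinate has scale
% \lambda while the sampler's tuning parameters stay fixed.
\pi^{(\lambda)}(x) \;=\; \lambda^{-1}\,
   \pi\!\left(x_1/\lambda,\, x_2,\, \dots,\, x_d\right),
   \qquad \lambda > 0 .
% The change of variables x_1 \mapsto \lambda x_1 shows this is equivalent
% to targeting \pi itself with the first component of the candidate kernel
% rescaled by 1/\lambda.
```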
2.2 Measure of performance
The expression inside the infimum is called a Dirichlet form, and can be thought of as the ‘expected squared jump distance’ for the function f provided the chain is stationary. For f with unit stationary variance, this can in turn be re-written as 1 − Corr(f(X₀), f(X₁)). Maximising the spectral gap of a reversible Markov chain can therefore be understood as minimising the worst-case first-order autocorrelation among all possible square-integrable test functions.
The spectral gap allows one to bound the variances of ergodic averages (see Proposition 1 of Rosenthal, 2003). Also, a direct connection between the spectral gap and mixing properties of the chain can be made if the operator Pf(x) := ∫f(y)P(x, dy) is positive on L²(π). This will always be the case if the chain is made lazy, which is the approach taken in Woodard et al. (2009), and the same adjustment can be made here if desired.
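For reference, the standard definitions consistent with the discussion above are as follows, for a π-reversible kernel P:

```latex
% Standard definitions consistent with the discussion above:
\mathcal{E}(f) \;=\; \tfrac{1}{2}\iint \bigl(f(y)-f(x)\bigr)^{2}\,
    \pi(\mathrm{d}x)\,P(x,\mathrm{d}y),
\qquad
\mathrm{Gap}(P) \;=\; \inf_{\substack{f \in L^{2}(\pi) \\ \mathrm{Var}_{\pi}(f) > 0}}
    \frac{\mathcal{E}(f)}{\mathrm{Var}_{\pi}(f)} .
% At stationarity \mathcal{E}(f)/\mathrm{Var}_{\pi}(f)
% = 1 - \mathrm{Corr}\bigl(f(X_0), f(X_1)\bigr), which gives the
% autocorrelation interpretation in the text.
```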
2.3 The small λ regime
In this section, we assess the robustness to tuning of the random walk, Langevin and Hamiltonian schemes as λ ↓ 0. This corresponds to the case in which the proposal scale is chosen to be too large in the first component of the target. The results in this section will support the idea that classical gradient-based schemes pay a very high price for any direction in which this tuning parameter is chosen to be too large, as already noted in the literature (e.g. Neal, 2003, p. 738), while the RWM is less severely affected by such issues.
2.3.1 Random walk Metropolis
We impose the following mild regularity conditions on the density μ(ξ). These are satisfied for most popular choices of μ, as shown in the subsequent proposition.
Condition 1 There exists such that for any and we have , where δ = y − x and (5). In addition, , where is the marginal distribution of under ξ ∼ μ.
Proposition 1 Denoting the usual p-norm as ‖·‖_p, Condition 1 holds in each of the cases below:
- (i) μ(ξ) ∝ exp(−‖ξ‖₂²/2) (Gaussian)
- (ii) μ(ξ) ∝ exp(−‖ξ‖₁) (Laplace)
- (iii) μ(ξ) ∝ (1 + ‖ξ‖₂²/ν)^{−(ν+d)/2} for ν ∈ {1, 2, …} (Student's t)
We conclude the section with a characterization of the rate of convergence to zero of the spectral gap for the RWM as λ ↓ 0.
Theorem 1
Note that Theorem 1 requires very few assumptions on the target π other than . Note also that the lower bound is of the form ; see the proof of Theorem 1 for details. No dependence on the dimension of the problem, other than that intrinsic to , is therefore introduced.
2.3.2 The Langevin algorithm
We present results for the Langevin algorithm in two settings. Initially, we consider more restrictive conditions under which our upper bound on the spectral gap depends on the tail behaviour of π in a particularly explicit manner, and then give a broader result.
Condition 2 Assume the following:
- (i) π has a density of the form π(x) = π₁(x₁)π₂(x₂, …, x_d), for some densities π₁ and π₂ on ℝ and ℝ^{d−1}, respectively.
- (ii) For some q ∈ [0, 1), Equation (7) holds.
Theorem 2 If Condition 2 holds, then there is a γ > 0 such that
When compared with the random walk algorithm, Theorem 2 shows that the Langevin scheme is much less robust to heterogeneity. Indeed, the spectral gap decays exponentially fast as λ ↓ 0, meaning that even small errors in the choice of step-size can have a large impact on algorithm efficiency, and so practitioners must invest considerable effort tuning the algorithm for good performance, as shown through simulations in Sections 5 and 6. Theorem 2 also illustrates that the Langevin algorithm is more sensitive to λ when the tails of π(·) are lighter. This is intuitive, as in this setting gradient terms can become very large in certain regions of the state space.
Remark 1 Theorem 2 (and Theorem 4 below) could be extended to the case q ≥ 1 in Equation (7); however, in these cases, samplers typically fail to be geometrically ergodic when λ is small (Livingstone et al., 2019; Roberts & Tweedie, 1996), meaning the spectral gap is typically 0 and the theorem becomes trivial.
Remark 2 Condition 2(ii) could be replaced with the simpler requirement that , with the corresponding bound .
A different set of conditions, which hold much more generally, and corresponding upper bound are presented below.
Condition 3 Assume the following:
- (i) There is a γ > 0 such that Equation (8) holds.
- (ii) Given X ∼ π, there is a β > 0 such that Equation (9) holds.
Theorem 3 If Condition 3 holds, then for some α > 0, which can be taken as α = min{β/2, β/γ, 2/3}.
We expect Condition 3 to be satisfied in many commonly encountered scenarios, with the exception of particularly heavy-tailed models. In the exponential family class π(x) ∝ exp(−α‖x‖^β), for example, Condition 3 holds for any α and β > 0 (see proof in the supplement).
2.3.3 Hamiltonian Monte Carlo
A more detailed description is given in Neal (2011). We write for the corresponding Metropolis–Hastings kernel with proposal mechanism as above and target . Here we present a heterogeneity result under Condition 2 of the previous subsection.
Theorem 4 If Condition 2 holds, then there is a γ > 0 such that
It is no surprise that Theorem 4 is comparable to Theorem 2, since setting L = 1 equates the Langevin and Hamiltonian methods.
2.4 The large λ regime
In this section, we briefly discuss the λ ↑ ∞ regime, where σ is chosen to be too small for the first component of the target, arguing that all samplers under consideration behave similarly in this regime and pay a similar price for overly small tuning parameters in a given direction. The intuition for this is that as λ ↑ ∞ the gradient-based proposal mechanisms discussed here all tend towards that of the random walk sampler in the first coordinate: in one-dimensional models the gradient term entering the proposal is Θ(1/λ), meaning that as λ ↑ ∞ the amount of gradient information in the proposal is reduced provided π is suitably regular (this heuristic is spelled out at the end of this subsection). The following result makes this intuition precise. To avoid repetition, we state the result for both the Langevin and the Barker proposal that we will introduce in the next section.
Proposition 2
The same intuition applies to the Hamiltonian case provided L is fixed, since each gradient term in the proposal is also Θ(1/λ). While there are many well-known measures of distance between two distributions, we argue that total variation is an appropriate choice here, since it has an explicit focus on how much the two kernels overlap and is invariant under bijective transformations of the state space (including re-scaling coordinates).
While the above statements provide useful heuristic arguments, in order to obtain more rigorous results one should prove that the spectral gaps decay to 0 at the same rate as λ ↑ ∞, which we leave to future work. We note, however, that the conjecture that the algorithms behave similarly for large values of λ is supported by the simulations of Section 5.1.
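To spell out the heuristic used above, consider d = 1 and the rescaled target of Section 2.1. The display below is a sketch of the argument rather than a statement from the paper.

```latex
% The heuristic behind the large-\lambda regime, written out for d = 1
% with the rescaled target \pi^{(\lambda)}(x) = \lambda^{-1}\pi(x/\lambda):
\nabla \log \pi^{(\lambda)}(x)
   \;=\; \lambda^{-1}\,(\log \pi)'(x/\lambda)
   \;=\; \Theta(1/\lambda) \qquad \text{as } \lambda \uparrow \infty,
% provided (\log \pi)' stays bounded, so the gradient contribution to each
% proposal vanishes and the schemes collapse to a random walk in that
% coordinate.
```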
2.5 Extensions
The lower bound of Theorem 1 extends naturally to the k > 1 setting, becoming instead , and so the rate of decay remains polynomial in λ for any k. Analogously, we expect the corresponding upper bound for gradient-based schemes to remain exponential and become , although the details of this are left for future work. We explore examples of this nature through simulations in Section 5 and find empirically that the single-component case is informative also of more general cases. Further extensions in which a different λ is chosen in each of the k directions can also be considered, with each decaying at a different rate. We conjecture that in this setting the λ that decays most rapidly will dictate the behaviour of the spectral gaps, though such an analysis is beyond the scope of the present work. One could also consider using a mixture of the MALA/HMC and random walk kernels in an attempt to achieve both robustness to tuning and favourable scaling properties. While this may seem promising in theory, in practice we believe it would be difficult to achieve both robustness to tuning and favourable high-dimensional performance from such an approach. In the next section, we consider a scheme for which the two goals can be achieved simultaneously.
3 Combining Robustness and Efficiency
The results of Section 2 show that the two gradient-based samplers considered there are much less robust to heterogeneity than the random walk algorithm. In this section, we introduce a novel, simple-to-implement gradient-based scheme that shares the superior scaling properties of the Langevin and Hamiltonian schemes, but also retains the robustness of the random walk sampler, both in terms of geometric ergodicity and robustness to tuning.
3.1 Locally balanced Metropolis–Hastings
It is straightforward to show that is a self-adjoint operator on the Hilbert space L²(π) using Equation (12), implying that the process is π-reversible and can therefore serve as a starting point for designing MCMC algorithms.
Remark 3 One can also think of Equation (12) as a requirement ensuring that the proposals in Equation (15) are exact (i.e. π-reversible) to first order. In particular, it is shown in the supplement that a proposal defined as in Equation (15) is π-reversible for log-linear π if and only if Equation (12) holds.
Thus, MALA can be viewed as a particular instance of this class. Other choices of g are, however, also possible, and give rise to different gradient-based algorithms. In the next section we explore what a sensible choice of g might look like.
3.2 The Barker proposal on ℝ
The work of Peskun (1973) and Tierney (1998) showed that this choice of α is inferior to the more familiar Metropolis–Hastings rule α(t) = min(1, t) in terms of asymptotic variance. The same conclusion cannot, however, be drawn when considering the choice of balancing function g.
In fact, the choice g(t) = t/(1+t) was shown to minimize asymptotic variances of Markov chain estimators in some discrete settings in Zanella (2020). In addition, as shown below, this particular choice of g leads to a fully tractable candidate kernel that can be easily sampled from. For this reason, we focus on this choice of g here, and leave the question of optimality in general for future work.
Proposition 3 If g(t) = t/(1 + t), then the normalising constant Z(x) in Equation (15) is 1/2.
We refer to this kernel as the Barker proposal. A simple strategy to sample from it is given in Algorithm 1.
Proposition 4 Algorithm 1 produces a sample from the Barker proposal on ℝ.
Algorithm 1 shows that the magnitude |y − x| = |z| of the proposed move does not depend on the gradient ∇log π(x); it is instead dictated only by the choice of symmetric kernel μ_σ. The direction of the proposed move is, however, informed by both the magnitude and direction of the gradient. Examining the form of p(x, z), it becomes clear that if the signs of z and ∇log π(x) are in agreement, then p(x, z) > 1/2, and indeed p(x, z) ↑ 1 as z∇log π(x) ↑ ∞. Hence, if the indications from ∇log π(x) are that π(x + z) ≫ π(x), then it is highly likely that b(x, z) will be set to 1 and y = x + z will be the proposed move. Conversely, if z∇log π(x) < 0, then there is a larger than 50% chance that the proposal will instead be y = x − z. As ∇log π(x) ↑ ∞ the Barker proposal converges to μ_σ truncated on the right, and similarly to μ_σ truncated on the left as ∇log π(x) ↓ −∞. See Figure 1 for an illustration.
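The mechanism just described is short enough to state in a few lines of code. The sketch below is ours: it assumes a Gaussian base kernel μ_σ (as used in the simulations of Sections 5 and 6) and uses the logistic form of p(x, z) discussed after Figure 1; the function name is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def barker_proposal_1d(x, grad_log_pi_x, sigma):
    # Draw the magnitude of the move from the symmetric base kernel;
    # the gradient plays no role at this stage.
    z = sigma * rng.standard_normal()
    # Logistic probability p(x, z) of keeping the drawn direction.
    p = 1.0 / (1.0 + np.exp(-z * grad_log_pi_x))
    # Flip the sign of z with probability 1 - p, then move.
    b = 1.0 if rng.uniform() < p else -1.0
    return x + b * z
```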
Figure 1. Left: density of the Barker proposal in one dimension. Current location is x = 0 and the four lines with increasing red intensity correspond to ∇log π(x) equal to 1, 3, 10 and 50. Right: density of the Barker proposal in two dimensions. Solid lines display the proposal density contours, heat colours refer to the target density, and the current location is x = (−2, 0)
The multiplicative term in Equation (16), which incorporates the gradient information, injects skewness into the base kernel (as can be clearly seen in the left-hand plot of Figure 1). Indeed, the resulting distribution is an example of a skew-symmetric distribution (Azzalini, 2013, Eq. (1.3)). Skew-symmetric distributions are a tractable family of (skewed) probability density functions obtained by multiplying a symmetric base density function with the cumulative distribution function (cdf) of a symmetric random variable. We refer to Azzalini (2013, Ch. 1) for more details, including a more general version of Propositions 3 and 4. In the context of skewed distributions, the Gaussian cdf is often used, leading to the skew-normal distribution introduced in Azzalini (1985). In our context, however, the Barker proposal (which leads to the logistic cdf p(x, z) in Algorithm 1) is the only skew-symmetric distribution that can be obtained from Equation (15) using a balancing function g satisfying g(t) = tg(1/t). See the supplement for more detail.
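Written out explicitly, the one-dimensional Barker density takes the skew-symmetric form below. This is our transcription, obtained by combining Equation (15) with the normalising constant Z(x) = 1/2 of Proposition 3.

```latex
% The one-dimensional Barker proposal in skew-symmetric form: twice a
% symmetric base density times a symmetric cdf evaluated at an odd
% function of the increment z (our transcription, using Z(x) = 1/2
% from Proposition 3).
Q^{\mathrm{B}}_{\sigma}(x, x + \mathrm{d}z) \;=\;
   2\,\mu_{\sigma}(z)\,
   \frac{1}{1 + e^{-z\,\nabla \log \pi(x)}}\,\mathrm{d}z ,
% i.e. base density \mu_{\sigma} and the logistic cdf evaluated at
% z\,\nabla\log\pi(x).
```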
3.3 The Barker proposal on ℝ^d
The full Metropolis–Hastings scheme using the Barker proposal mechanism for a target distribution is given in Algorithm 2 (see the supplement for more details and variations of the algorithm, such as a pre-conditioned version). Note that the computational cost of each iteration of the algorithm is essentially equivalent to that of MALA and will be typically dominated by the cost of computing the gradient and density of the target.
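As an illustration, a full Metropolis–Hastings step with the d-dimensional Barker proposal can be sketched as follows. This is our reconstruction, not the paper's Algorithm 2 verbatim: it assumes a Gaussian base kernel and identity pre-conditioning, and derives the acceptance ratio by cancelling the symmetric base densities in the forward and reverse proposals. Here log_pi and grad_log_pi are user-supplied callables and x is a NumPy array.

```python
import numpy as np

rng = np.random.default_rng(1)

def barker_mh_step(x, log_pi, grad_log_pi, sigma):
    g_x = grad_log_pi(x)
    z = sigma * rng.standard_normal(x.shape)
    p = 1.0 / (1.0 + np.exp(-z * g_x))           # coordinate-wise p(x, z_i)
    b = np.where(rng.uniform(size=x.shape) < p, 1.0, -1.0)
    y = x + b * z                                 # Barker proposal
    w = y - x
    g_y = grad_log_pi(y)
    # Acceptance ratio: the symmetric base densities cancel, leaving the
    # logistic factors of the forward and reverse proposal densities.
    log_acc = (log_pi(y) - log_pi(x)
               + np.sum(np.logaddexp(0.0, -w * g_x))   # log(1 + e^{-w_i g_i(x)})
               - np.sum(np.logaddexp(0.0,  w * g_y)))  # log(1 + e^{ w_i g_i(y)})
    return y if np.log(rng.uniform()) < log_acc else x
```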
This second approach does not allow gradient information to feed into the proposal as effectively as in the first case. Specifically, only the global inner product is considered, and the decision to alter the sign of every component of z is taken based solely on this value. In other words, once z has been sampled, gradient information is only used to make a single binary decision, choosing between the two possible proposals x + z and x − z, while in the first strategy gradient information is used to choose between 2^d possible proposals (one for each configuration of coordinate-wise sign flips). Indeed, the following proposition shows that the second strategy cannot improve over the RWM by more than a factor of two.
Proposition 5 Let denote the modified Barker proposal on ℝ^d using Equation (19). Then .
One can also make a stronger statement than the above proposition, namely that if this strategy is employed, only a constant-factor improvement over the RWM can be achieved in terms of asymptotic variance, for any function of interest. Given Proposition 5, we choose the first strategy described above to produce Barker proposals on ℝ^d, with the multi-dimensional candidate kernel given in Equation (17). In the following sections, we will show both theoretically and empirically that this choice does indeed have favourable robustness and efficiency properties.
4 Robustness, Scaling and Ergodicity Results for the Barker Proposal
In this section, we establish results concerning robustness to tuning, scaling with dimension and geometric ergodicity for the Barker proposal scheme. As we will see, the method not only enjoys the superior efficiency of gradient-based algorithms in terms of scaling with dimension, but also shares the favourable robustness properties of the RWM when considering both robustness to tuning and geometric ergodicity.
4.1 Robustness to tuning
We now examine the robustness to tuning of the Barker proposal using the framework introduced in Section 2. We write and to denote the candidate and Metropolis–Hastings kernels for the Barker proposal targeting the distribution defined therein, and for the case λ = 1. The following result characterizes the behaviour of the spectral gap of as λ ↓ 0.
Theorem 5
Comparing Theorem 5 with Theorems 1–4 from Section 2.3, we see that the Barker proposal inherits the robustness to tuning of random walk schemes and is significantly more robust than the Langevin and Hamiltonian algorithms. In the next section, we establish general conditions under which the Barker proposal produces a geometrically ergodic Markov chain.
4.2 Geometric ergodicity
In this section, we study the class of target distributions for which the Barker proposal produces a geometrically ergodic Markov chain. We show that geometric ergodicity can be obtained even when the gradient term in the proposal grows faster than linearly, which is typically not the case for MALA and HMC.
For the results of this section, we make the simplifying assumption that π is spherically symmetric outside a ball of radius R < ∞.
Condition 4 There exists R < ∞ and a differentiable function f : (0, ∞) → (0, ∞) with and non-increasing for r > R such that log π(x) = f(‖x‖) for ‖x‖ > R.
Theorem 6 Let g : (0, ∞) → (0, ∞) be a bounded and non-decreasing function, with for every s > 0, and for some δ > 0. If the target density π satisfies Condition 4, then the Metropolis–Hastings chain with proposal is π-a.e. geometrically ergodic.
We note that tail regularity assumptions such as Condition 4 are common in this type of analysis (e.g. Durmus et al., 2017a; Jarner & Hansen, 2000). As an intuitive example, the condition is satisfied in the exponential family π(x) ∝ exp(−α‖x‖^β) for all β > 1. By contrast, for MALA and HMC it is known that for β > 2 the sampler fails to be geometrically ergodic (Livingstone et al., 2019; Roberts & Tweedie, 1996). We expect the Barker proposal to be geometrically ergodic also in the case β = 1, although we do not prove it in this work. It is worth noting that the balancing function corresponding to MALA, g(t) = √t, is unbounded above, which is a central reason for the lack of stability compared to bounded choices such as g(t) = t/(1 + t) employed in the Barker scheme.
4.3 Scaling with dimensionality
In this section, we provide preliminary results suggesting that the Barker proposal enjoys scaling behaviour analogous to that of MALA in high-dimensional settings, meaning that under appropriate assumptions it requires the number of iterations per effective sample to grow as Θ(d^{1/3}) with the number of dimensions d as d → ∞. Similarly to Section 4.2, we prove results for general proposals as in Equation (21) with balancing functions g satisfying g(t) = tg(1/t). The Barker proposal is a special case of the latter family.
We perform an asymptotic analysis for d → ∞ using the framework introduced in Roberts et al. (1997). The main idea is to study the rate at which the proposal step-size σ needs to decrease as d → ∞ to obtain well-behaved limiting behaviour for the MCMC algorithm under consideration (such as a Θ(1) acceptance rate and convergence to a non-trivial diffusion process after appropriate time re-scaling). Based on the rate of decrease of σ, one can infer how the number of MCMC iterations required for each effective sample increases as d → ∞. For example, in the case of the RWM, σ must be scaled as Θ(d^{−1/2}) as d → ∞ to have a well-behaved limit (Roberts et al., 1997), which leads to the RWM requiring Θ(d) iterations for each effective sample. By contrast, for MALA it is sufficient to take σ = Θ(d^{−1/6}) as d → ∞, which leads to only Θ(d^{1/3}) iterations for each effective sample (Roberts & Rosenthal, 1998). While these analyses are typically performed under simplifying assumptions, such as having a target distribution with i.i.d. components, the results have been extended in many ways (e.g. removing the product-form assumption, see Mattingly et al., 2012), obtaining analogous conclusions. See also Beskos et al. (2013) for optimal scaling analysis of HMC and Roberts and Rosenthal (2016) for rigorous connections between optimal scaling results and computational complexity statements.
Proposition 6 Let g : (0, ∞) → (0, ∞) satisfy g(t) = tg(1/t) for all t. If g is three times continuously differentiable and for all s > 0 and j ∈ {0, 1, 2, 3}, where denotes the jth derivative of g, then Equation (23) holds for any and in .
Proposition 6 suggests that Metropolis–Hastings algorithms with proposals such that g(t) = tg(1/t) have scaling behaviour analogous to MALA, meaning that σ = Θ(d^{−1/6}) is sufficient to ensure a non-trivial limit and thus Θ(d^{1/3}) iterations are required for each effective sample. To make these arguments rigorous, one should prove weak convergence results for d → ∞, as in Roberts and Rosenthal (1998). Proving such a result for a general g would require a significant amount of technical work, going beyond the scope of this section. In this paper we instead support the conjectured Θ(d^{1/3}) scaling by means of simulations (see Section 5.2). While Proposition 6 only shows , it is possible to show that with some extra assumptions on ϕ to exclude exceptional cases (see the supplement for more detail).
5 Simulations with Fixed Tuning Parameters
Throughout Sections 5 and 6, we choose the symmetric density μ_σ within the random walk and Barker proposals to be Gaussian for simplicity. Note, however, that any symmetric density could in principle be used. It would be interesting to explore the impact of different choices of μ_σ on the performance of the Barker algorithm, and we leave such a comparison to future work.
5.1 Illustrations of robustness to tuning
We first provide an illustration of the robustness to tuning of the random walk, Langevin and Barker algorithms in three simple one-dimensional settings. In each case we approximate the expected squared jump distance (ESJD) using Monte Carlo samples and standard Rao–Blackwellisation techniques, across a range of proposal step-sizes between 0.01 and 100. As is clearly shown in Figure 2, all algorithms perform similarly when the step-size is smaller than optimal, as suggested in Section 2.4. As the step-size increases beyond this optimum, however, behaviours begin to differ. In particular, the ESJD for MALA rapidly decays to zero, whereas in the random walk and Barker cases the reduction is much less pronounced. In fact, the rate of decay is similar for the two schemes, which is to be expected given the results of Sections 4.1 and 2.3. See the supplement for a similar illustration on a 20-dimensional example.
Figure 2. Expected squared jump distance against proposal step-size for random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker on different one-dimensional targets
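A minimal version of this experiment can be coded in a few lines. The sketch below estimates the ESJD for the RWM over a grid of step-sizes on a standard Gaussian target; it omits the Rao–Blackwellisation used in the paper, and the Barker or MALA steps can be substituted for rwm_step.

```python
import numpy as np

rng = np.random.default_rng(2)

def rwm_step(x, log_pi, sigma):
    # One random walk Metropolis step with a Gaussian proposal.
    y = x + sigma * rng.standard_normal()
    return y if np.log(rng.uniform()) < log_pi(y) - log_pi(x) else x

def esjd(step, log_pi, sigma, n_iter=50_000, x0=0.0):
    # Plain Monte Carlo estimate of E[(X_{t+1} - X_t)^2] at stationarity.
    x, total = x0, 0.0
    for _ in range(n_iter):
        y = step(x, log_pi, sigma)
        total += (y - x) ** 2
        x = y
    return total / n_iter

# Sweep step-sizes on a standard Gaussian target, mimicking Figure 2.
log_pi = lambda x: -0.5 * x * x
curve = [esjd(rwm_step, log_pi, s) for s in np.logspace(-2, 2, 9)]
```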
5.2 Comparison of efficiency on isotropic targets
Next we compare the expected squared jump distance of the random walk, Langevin and Barker schemes when sampling from isotropic distributions of increasing dimension, with optimised proposal scale (chosen to maximise expected squared jumping distance). This set-up is favourable to MALA, which is the least robust scheme among the three, as the target distribution is homogeneous and the proposal step-size optimally chosen. We consider target distributions with independent and identically distributed (i.i.d.) components, corresponding to the scenario studied theoretically in Section 4.3. We set the distribution of each coordinate to be either a standard normal distribution or a hyperbolic distribution, corresponding to and , respectively. Figure 3 shows how the ESJD per coordinate decays as dimension increases for the three algorithms. For MALA and Barker, the ESJD appears to decrease at the same rate as d increases, which is in accordance with the preliminary results in Section 4.3. In the Gaussian case, MALA outperforms Barker roughly by a factor of 2 regardless of dimension (more precisely, the ESJD ratio lies between 1.7 and 2.5 for all values of d in Figure 3), while in the hyperbolic case the same factor is around 1.2, again independently of dimension (ESJD ratio between 1.1 and 1.25 for all values of d in Figure 3). The rate of decay for the RWM is faster, as predicted by the theory.
Figure 3. Expected squared jump distance against dimensionality for random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker schemes with optimally tuned step-size. The target distribution has i.i.d. coordinates following either a Gaussian distribution (left plot) or a hyperbolic one (right plot)
6 Simulations with Adaptive Markov Chain Monte Carlo
In this section, we illustrate how robustness to tuning affects the performance of adaptive MCMC methods.
6.1 Adaptation strategy and algorithmic set-up
Here denotes the current point in the Markov chain, is the proposed move, denotes some ideal acceptance rate for the algorithm and the parameter is known as the learning rate. We set the target acceptance rate to 0.23 for RWM, 0.57 for MALA and 0.40 for Barker. We tried changing this value for Barker in the range [0.2, 0.6] without observing major differences. In our simulations, we constrain the pre-conditioning matrix to be diagonal (i.e., all off-diagonal terms in Equation (26) are set to 0). This is often done in practice to avoid having to learn a dense pre-conditioning matrix, which is computationally costly and would require a large number of MCMC samples. See the supplement for full details on the pre-conditioned Barker schemes obtained with both diagonal and dense pre-conditioning matrices, including pseudo-code of the resulting algorithms.
We set the learning rate to γ_t = t^{−κ} with κ ∈ (0.5, 1), as suggested for example in Shaby and Wells (2010). Small values of κ correspond to more aggressive adaptation; Shaby and Wells (2010) suggest using κ = 0.8. In the simulations of Section 6.2, we use κ = 0.6, as this turned out to be a good balance between fast adaptation and stability for MALA (κ = 0.8 resulted in too slow adaptation, while values of κ lower than 0.6 led to instability). The adaptation of RWM and Barker was not very sensitive to the value of κ. Unless specified otherwise, all algorithms are randomly initialized with each coordinate sampled independently from a normal distribution with standard deviation 10. Following results from optimal scaling theory (Roberts & Rosenthal, 2001), we set the starting value of the global scale as for RWM and for MALA; for Barker we initialize it to the same value as MALA.
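The following sketch summarizes one adaptation update in the spirit of Equations (24)-(26). It is our reconstruction, not the paper's exact pseudo-code (that is given in the supplement), with the Barker target acceptance rate 0.40 as a default.

```python
import numpy as np

def adapt_step(log_sigma, mean, var_diag, x, alpha, t,
               kappa=0.6, alpha_target=0.40):
    # Robbins-Monro update of the global log-scale driven by the
    # acceptance probability alpha of the current MH step, plus running
    # estimates of the target mean and coordinate-wise variances used
    # for the diagonal pre-conditioner.
    gamma = t ** (-kappa)                        # learning rate gamma_t = t^(-kappa)
    log_sigma += gamma * (alpha - alpha_target)  # global scale
    mean = mean + gamma * (x - mean)             # running mean
    var_diag = var_diag + gamma * ((x - mean) ** 2 - var_diag)  # diagonal cov.
    return log_sigma, mean, var_diag
```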
6.2 Performance on target distributions with heterogeneous scales
In this section, we compare the adaptive algorithms described above when sampling from target distributions with significant heterogeneity of scales across their components. We consider 100-dimensional target distributions with different types of heterogeneity, tail behaviour and degree of skewness according to the following four scenarios:
- 1
(One coordinate with small scale; Gaussian target) In the first scenario, we consider a Gaussian target with zero mean and diagonal covariance matrix. We set the standard deviation of the first coordinate to 0.01 and that of the other coordinates to 1. This scenario mirrors the theoretical framework of Sections 2 and 4.1 in which a single coordinate is the source of heterogeneity.
- 2
(Coordinates with random scales; Gaussian target) Here we modify scenario 1 by generating the standard deviations of each coordinate randomly, sampling them independently from a log-normal distribution. More precisely, we sample independently for i = 1, …, 100, where σ_i is the standard deviation of the ith component.
- 3
(Coordinates with random scales; Hyperbolic target) In the third scenario, we change the tail behaviour of the target distribution, replacing the Gaussian with a hyperbolic distribution (a smoothed version of the Laplace distribution, used to ensure differentiability of log π). In particular, we set , with ε = 0.1 and c being a normalizing constant. The scale parameters are generated randomly as in scenario 2.
- 4
(Coordinates with random scales; Skew-normal target) Finally, we consider a non-symmetric target distribution, which represents a more challenging and realistic situation. We assume that the ith coordinate follows a skew-normal distribution with scale σ_i and skewness parameter α, meaning that , with c being a normalizing constant. We set α = 4 and generate the σ_i's randomly as in scenario 2.
First we provide an illustration of the behaviour of the three algorithms by plotting trace plots of tuning parameters and MCMC trajectories; see Figure 4 for the results in scenario 1. The adaptation of tuning parameters for the Barker scheme stabilises within a few hundred iterations, after which the algorithm's performance appears to be stable and efficient. On the contrary, both RWM and MALA struggle to learn the heterogeneous scales, and the adaptation process has either only just stabilized or not yet stabilized after iterations. Looking at the behaviour of MALA in Figure 4 we see that, in order for the algorithm to achieve a non-zero acceptance rate, the global scale parameter must first be reduced considerably to accommodate the smallest scale of π(·). At this point the algorithm can slowly begin to learn the components of the pre-conditioning matrix, but this learning occurs very slowly because the comparatively small global scale results in poor mixing across all dimensions other than the first. Analogous plots for scenarios 2, 3 and 4 are given in the supplement and display comparable behaviour.

Random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker schemes with adaptive tuning as in Equations (24)-(26) and learning rate set to with κ = 0.6. The target distribution is a 100-dimensional Gaussian in which the first component has standard deviation 0.01 and all others have unit scale. First row: adaptation of the global scale ; second row: adaptation of the local scales ; third row: trace plot of first coordinate; fourth row: trace plots of coordinates from 2 to 100 (superposed)
We then compare the algorithms in a more quantitative way, by looking at the average mean squared error (MSE) of MCMC estimators of the first moment of each coordinate, which is a standard metric in MCMC. For any test function, define the corresponding MSE as the squared error of the MCMC estimator of its stationary expectation after t iterations of the algorithm, discarding a burn-in period which we set to ⌊t/2⌋, where ⌊·⌋ denotes the floor function. Below, we report the average MSE for the collection of test functions given by the rescaled coordinates x_i/σ_i for i = 1, …, d after t MCMC iterations (rescaling by σ_i is done to give equal importance to each coordinate).
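As a concrete rendering of this metric, the sketch below computes the average MSE from a batch of independent runs, with the burn-in and rescaling described above; array names are ours.

```python
import numpy as np

def average_mse(chains, scales, true_means):
    # chains: array (n_runs, t, d) of samples; scales: the sigma_i;
    # true_means: exact first moments of each coordinate.
    n_runs, t, d = chains.shape
    burn = t // 2                              # burn-in of floor(t/2) iterations
    est = chains[:, burn:, :].mean(axis=1)     # per-run estimates of E[x_i]
    err = (est - true_means) / scales          # rescale by sigma_i
    return np.mean(err ** 2)                   # average over runs and coordinates
```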
Figure 5 displays the evolution of the tuning parameters and the MSE defined above over the iterations of each algorithm, where both are estimated by averaging over 100 independent runs of each algorithm. The results are in accordance with the illustration in Figure 4, and suggest that the Barker scheme is robust to different types of targets and heterogeneity and results in very fast adaptation, while both MALA and RWM require significantly more iterations to find good tuning parameters. The tuning parameters of MALA appear to exhibit more unstable behaviour than RWM in the first few thousand iterations (larger values of d_t), while after that they converge more quickly, which again is in accordance with the behaviour observed in Figure 5 and with the theoretical considerations of Sections 2 and 4.1. To further quantify the tuning period, we define the time to reach a stable level of tuning as the first iteration after which the tuning-convergence measure d_t remains below ε, for some ε > 0. We take ε = 1 and report the resulting values in Table 1, denoted simply as T. The results show that in these examples Barker always has the smallest adaptation time, with a speed-up compared to RWM of at least 34x in all four scenarios, and a speed-up compared to MALA ranging between 3x (scenario 3) and 30x (scenario 2). The adaptation times tend to increase from scenario 1 to scenario 4, suggesting that the target distribution becomes progressively more challenging. The hardest case for Barker is the hyperbolic target, although even there the tuning stabilized in roughly 3000 iterations, while the hardest case for MALA is the skew-normal, in which tuning stabilized in roughly 30,000 iterations.
Figure 5. Comparison of random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker on the four target distributions (scenarios 1 to 4) described in Section 6.2, averaging over ten repetitions of each algorithm. First row: convergence of tuning parameters, measured by d_t defined in Equation (27). Second row: mean squared error of Markov chain Monte Carlo estimators of first moments averaged over all coordinates
Table 1. Adaptation times (T) and mean squared errors (MSE) from 10k, 20k and 40k iterations of the random walk Metropolis (RWM), Metropolis-adjusted Langevin algorithm (MALA) and Barker algorithms under each of the four heterogeneous scenarios described in Section 6.2

| Scenario | Method | T | MSE (10k) | MSE (20k) | MSE (40k) |
|---|---|---|---|---|---|
| 1 | RWM | 18,757 | 0.200 | 0.036 | 0.013 |
|   | MALA | 10,785 | 0.348 | 0.016 | 0.002 |
|   | Barker | 524 | 0.007 | 0.005 | 0.003 |
| 2 | RWM | 19,163 | 0.228 | 0.045 | 0.013 |
|   | MALA | 17,298 | 0.644 | 0.147 | 0.004 |
|   | Barker | 542 | 0.007 | 0.005 | 0.003 |
| 3 | RWM | >40k | 0.409 | 0.080 | 0.016 |
|   | MALA | 10,630 | 0.248 | 0.019 | 0.006 |
|   | Barker | 3294 | 0.012 | 0.009 | 0.007 |
| 4 | RWM | >40k | 0.315 | 0.092 | 0.016 |
|   | MALA | 34,340 | 0.813 | 0.488 | 0.112 |
|   | Barker | 1427 | 0.008 | 0.006 | 0.004 |
The differences in adaptation times have a direct implication for the resulting MSE of MCMC estimators, which is intuitive because the Markov chain will typically start sampling efficiently from π only once good tuning parameters are found. As we see from the second row of Figure 5 and the second part of Table 1, the MSE of Barker is already quite low (between 0.007 and 0.012) after 10,000 iterations in all scenarios, while RWM and MALA need significantly more iterations to achieve the same MSE. After finding good tuning parameters and having sampled enough, MALA is slightly more efficient than Barker for the Gaussian target in scenario 1 and equally efficient for the hyperbolic target of scenario 3, which is consistent with the simulations of Section 5.2 under optimal tuning.
6.3 Comparison on a Poisson random effects model
The model in Equation (28) is an example of a generalized linear model that induces a posterior distribution with light tails and potentially large gradients of log π, which creates a challenge for gradient-based algorithms. In particular, the task of sampling from the posterior becomes harder when either the observations contain large values or they are heterogeneous across values of i ∈ {1, …, I}. The former case results in a more peaked posterior distribution with larger gradients, while the latter induces heterogeneity across the posterior distributions of the parameters .
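Since Equation (28) is not reproduced in this extract, the sketch below gives a plausible stand-in consistent with the description: Poisson counts in I groups whose log-rates combine a global mean μ with group-level random effects, and Gaussian priors. All names (sigma_eta, the prior on μ, the shapes) are our assumptions, for illustration only.

```python
import numpy as np

def log_posterior(theta, y, sigma_eta=1.0):
    # theta = (mu, eta_1, ..., eta_I); y is an (I, J) array of counts.
    mu, eta = theta[0], theta[1:]
    log_rate = mu + eta                        # log Poisson rates, one per group
    # Poisson log-likelihood (dropping the constant log(y!) terms); the
    # exp() term is what produces light tails and large gradients.
    loglik = np.sum(y * log_rate[:, None]) - y.shape[1] * np.sum(np.exp(log_rate))
    # Gaussian priors on the random effects and (for concreteness) on mu.
    logprior = -0.5 * np.sum(eta ** 2) / sigma_eta ** 2 - 0.5 * mu ** 2
    return loglik + logprior
```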
In our simulations, we consider three scenarios, corresponding to increasingly challenging target distributions:
- 1
In the first scenario, we take and generate the data y from the model in Equation (28), assuming the data-generating value of μ to be and sampling the data-generating values of the random effects from their prior distribution.
- 2
In the second scenario, we increase the value of to 3, which induces more heterogeneity across the parameters .
- 3
In the third scenario, we keep and increase the values of to 10, thus inducing larger gradients.
In each scenario, we run the algorithms directly on the joint parameter space. Similarly to Section 6.2, we first provide an illustration of the behaviour of the tuning parameters and MCMC trace plots for RWM, MALA and Barker in Figure 6. Here all algorithms are run for iterations, with the target defined by the first scenario. We use the adaptation strategy of Section 6.2 for tuning, following Equations (24)-(26) with κ = 0.6 and the pre-conditioning matrix constrained to be diagonal, and initialize the chains from a random configuration sampled from the prior distribution of the model. In this example, the random walk converges to stationarity in roughly 10,000 iterations while the Barker scheme takes a few hundred. By contrast, MALA struggles to converge and exhibits unstable behaviour even after iterations. Note that the first iterations of MALA, in which the parameter μ appears to be constant, do not correspond to rejections but rather to moves with very small increments in the μ component.

Behaviour of random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker on the posterior distribution from the Poisson hierarchical model in Equation (28). Data are generated as in the first scenario of Section 6.3. First row: adaptation of the global scale ; second row: adaptation of the local scales ; third row: trace plot of the parameter μ; fourth row: trace plots of the parameter
We then provide a more systematic comparison between the algorithms under consideration in Table 2. In addition to RWM, MALA and Barker, we also consider a state-of-the-art implementation of adaptive HMC, namely the Stan (Stan Development Team, 2020) implementation of the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014) as well as of standard HMC (referred to as ‘static HMC’ in the Stan package). The NUTS algorithm is a variant of standard HMC in which the number of leapfrog iterations, that is, the parameter L in Equation (10), is allowed to depend on the current state (using a ‘No-U-Turn’ criterion). The resulting number of leapfrog steps (and thus log-posterior gradient evaluations) per iteration is not fixed in advance but rather tuned adaptively depending on the hardness of the problem. This is also the case for the static HMC algorithm implementation in Stan, as in that case the total integration time in Equation (10) is fixed and the step-size and mass matrix are adapted. For both algorithms, we use the default Stan version that learns a diagonal covariance/mass matrix during the adaptation process. This is analogous to constraining the preconditioning matrix for RWM, MALA and Barker to be diagonal, as we are doing here.
Table 2. Comparison of sampling schemes on the posterior distribution arising from the Poisson hierarchical model in Equation (28)

| Scenario | Method | Iterations (n) | Leapfrog steps/n | Gradient calls (g) | ESS (min, median) | ESS/g × 100 |
|---|---|---|---|---|---|---|
| 1 | RWM |  | – | – | (49, 66) | – |
|   | MALA |  | – |  | (648, 727) | 1.30 ± 2.73 |
|   | Barker |  | – |  | (1445, 1587) | 2.89 ± 0.07 |
|   | HMC |  | 89.5 |  | (285, 1954) | 0.25 ± 0.78 |
|   | NUTS |  | 8.5 |  | (1175, 1822) | 6.95 ± 1.68 |
| 2 | RWM |  | – | – | (0.4, 10.6) | – |
|   | MALA |  | – |  | (0.0, 8.0) | < 0.01 |
|   | Barker |  | – |  | (1365, 1563) | 2.73 ± 0.13 |
|   | HMC |  | 797 |  | (25, 1949) | < 0.01 |
|   | NUTS |  | 57.7 |  | (942, 1826) | 1.19 ± 1.14 |
| 3 | RWM |  | – | – | (0.0, 5.3) | – |
|   | MALA |  | – |  | (0.0, 0.2) | < 0.01 |
|   | Barker |  | – |  | (1301, 1594) | 2.60 ± 0.92 |
|   | HMC |  | 8103 |  | (3.3, 899) | < 0.01 |
|   | NUTS |  | 179 |  | (137, 348) | 0.012 ± 0.14 |

Blocks of rows from 1 to 3 refer to the three data-generating scenarios described in Section 6.3. All numbers are averaged across ten repetitions of each algorithm. For each algorithm we report: number of iterations; number of leapfrog steps per iteration and total number of gradient evaluations (when applicable); estimated effective sample size (ESS, minimum and median across parameters); and minimum ESS per hundred gradient evaluations (with standard deviation across the ten repetitions).
Table 2 reports the results of the simulations for the five algorithms in each of the three scenarios. For each algorithm, we report the number of log-posterior gradient evaluations and the minimum and median effective sample size (ESS) across the 51 unknown parameters. The ESS values are computed with the effectiveSize function from the coda R package (Plummer et al., 2006), discarding the first half of the samples as burn-in. The RWM, MALA and Barker schemes are run for a fixed number of iterations, and the HMC and NUTS schemes for the default number of iterations in the Stan package, which in this example corresponds to a scenario-dependent total number of gradient evaluations. All numbers in Table 2 are averaged over ten independent replications of each algorithm. We use the minimum ESS per gradient evaluation as an efficiency metric, of which we report the mean and standard deviation across the ten replicates (multiplied by 100 to ease readability).
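For concreteness, the ESS figures could be reproduced along the following lines; this is a minimal R sketch, assuming `chain` is a hypothetical iterations-by-parameters matrix of MCMC output from one run.

```r
library(coda)

# Discard the first half of the samples as burn-in, then compute one ESS
# value per parameter and summarise as the (min, median) pairs of Table 2.
n <- nrow(chain)
kept <- chain[(floor(n / 2) + 1):n, , drop = FALSE]
ess <- effectiveSize(as.mcmc(kept))
c(min = min(ess), median = median(ess))
```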
According to Table 2, NUTS is the most efficient scheme in scenario 1, while Barker is the most efficient in scenarios 2 and 3. This accords with the intuition that Barker is the more robust scheme, since the target distribution becomes more challenging as we move from scenario 1 to scenario 3. MALA struggles to converge to stationarity in scenarios 2 and 3 (with an estimated ESS around zero), while it performs better in scenario 1, although with high variability across runs (shown by the large standard deviation in the last column). The RWM displays low ESS values in all three scenarios, although with a less dramatic deterioration from scenario 1 to scenario 3. Interestingly, the performance of Barker is remarkably stable across scenarios (with a minimum ESS of around 1400), across the parameters for which ESS is computed (in all cases the minimum and median ESS are close to each other) and across repetitions (shown by the relatively small standard deviation in the last column). NUTS is also remarkably effective, considering that it was not designed with a major emphasis on robustness, although its performance does degrade when moving from scenario 1 to scenario 3. As with MALA, static HMC struggles to converge in scenarios 2 and 3 and is not very efficient in scenario 1. Note that NUTS, and in particular static HMC, compensate for the increasing difficulty of the target by increasing the number of leapfrog steps per iteration. For example, the drop in efficiency of NUTS between scenarios 1 and 2 is mostly due to the increase in the average number of leapfrog iterations from 8.5 to 57.7 rather than to a decrease in ESS. Somewhat surprisingly, the number of leapfrog steps per iteration increases significantly more for static HMC than for NUTS, which could be due either to genuine algorithmic differences or to variations in the details of the adaptation strategy implemented in Stan. Overall, Barker and NUTS are the two most efficient algorithms in these simulations, with a relative efficiency that depends on the scenario under consideration: NUTS is roughly 2.4 times more efficient in scenario 1, Barker 2.3 times more efficient in scenario 2 and Barker 40 times more efficient in scenario 3.
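The relative-efficiency figures quoted above are simple ratios of the ESS/g column of Table 2; an illustrative snippet using the scenario 1 and 2 values:

```r
# Minimum ESS per hundred gradient evaluations, taken from Table 2.
ess_per_g <- rbind(scenario1 = c(barker = 2.89, nuts = 6.95),
                   scenario2 = c(barker = 2.73, nuts = 1.19))

ess_per_g["scenario1", "nuts"] / ess_per_g["scenario1", "barker"]  # ~2.4
ess_per_g["scenario2", "barker"] / ess_per_g["scenario2", "nuts"]  # ~2.3
```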
6.4 Additional simulations reported in the supplement
In the supplement, we report additional simulations for some of the above experiments. As a sensitivity check, we also performed simulations using the tamed Metropolis-adjusted Langevin algorithm (Brosse et al., 2018) and the truncated Metropolis-adjusted Langevin algorithm (Atchadé, 2006; Roberts & Tweedie, 1996), two more robust modifications of MALA in which large gradients are controlled by monitoring the size of ‖∇ log π(x)‖. These schemes do offer some added stability relative to MALA in terms of controlling large gradients, but they remain very sensitive both to heterogeneity of the target distribution and to the choice of truncation level, and they do not exhibit the robustness observed for the Barker scheme. See the supplement for implementation details, results and further discussion.
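For reference, the gradient modifications underlying these two variants take the following common forms; this is a sketch only, and the exact variants used in the cited works and in our supplement may differ in details such as the choice of taming function.

```r
# Tamed gradient: shrink smoothly as the gradient norm grows, so that the
# modified drift stays bounded (the taming constant `delta` is a tuning choice).
tamed_grad <- function(g, delta) g / (1 + delta * sqrt(sum(g^2)))

# Truncated gradient: cap the gradient norm at a fixed truncation level R.
truncated_grad <- function(g, R) g * min(1, R / sqrt(sum(g^2)))
```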
7 Discussion
We have introduced a new gradient-based MCMC method, the Barker proposal, and have demonstrated both analytically and numerically that it shares the favourable scaling properties of other gradient-based approaches, along with an increased level of robustness, both in terms of geometric ergodicity and robustness to tuning (as defined in the present paper). The most striking benefit of the method appears to be in the context of adaptive MCMC. Evidence suggests that combining the efficiency of a gradient-based proposal mechanism with a method that exhibits robustness to tuning gives a combination of stability and speed that is very desirable in this setting, and can lead to efficient sampling that requires minimal practitioner input.
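To illustrate the simplicity referred to above, one iteration of an unpreconditioned, coordinate-wise Barker scheme can be sketched in a few lines of R; `log_post` and `grad_log_post` are assumed to be user-supplied functions returning the log-posterior and its gradient.

```r
# One iteration of the Barker proposal with step-size sigma.
barker_step <- function(x, sigma, log_post, grad_log_post) {
  d <- length(x)
  z <- sigma * rnorm(d)                  # symmetric increment
  gx <- grad_log_post(x)
  # Flip the sign of each increment with a Barker-style probability
  b <- ifelse(runif(d) < 1 / (1 + exp(-z * gx)), 1, -1)
  y <- x + b * z                         # proposed point
  # Metropolis-Hastings correction for the asymmetric proposal density
  gy <- grad_log_post(y)
  log_alpha <- log_post(y) - log_post(x) +
    sum(log1p(exp(-(y - x) * gx)) - log1p(exp((y - x) * gy)))
  if (log(runif(1)) < log_alpha) y else x
}
```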
The theoretical results in this paper could be extended by studying in greater depth the large λ regime (Section 2.4) and the high-dimensional scaling of the Barker proposal (Section 4.3). Of course, many other algorithms could be considered under the robustness-to-tuning framework, and it is a worthwhile direction for future work to explore which features of a scheme result in robustness to tuning or the lack of it. Extensions of the Barker proposal that incorporate momentum and enjoy the slower decay of efficiency with dimension exhibited by HMC may be possible, as may the development of other methods within the first-order locally balanced proposal framework introduced in Section 3, or indeed of schemes that are exact at higher orders.
Supporting Information
Additional supporting information may be found online in the Supporting Information section.
Funding information
Engineering and Physical Sciences Research Council, EP/K014463/1; European Research Council, 306406 (N-BNP)
Acknowledgements
The authors thank Marco Frangi for preliminary work in his Master's thesis related to the ergodicity arguments in the paper. SL acknowledges support from the Engineering and Physical Sciences Research Council through grant number EP/K014463/1 (i-like). GZ acknowledges support from the European Research Council starting grant 306406 (N-BNP), and from the Italian Ministry of Education, Universities and Research PRIN Project 2015SNS29B. Open Access Funding provided by Università Bocconi within the CRUI-CARE Agreement. [Correction added on 11 June 2022, after first online publication: CRUI funding statement has been added.]
References