Abstract

There is a tension between robustness and efficiency when designing Markov chain Monte Carlo (MCMC) sampling algorithms. Here we focus on robustness with respect to tuning parameters, showing that more sophisticated algorithms tend to be more sensitive to the choice of step-size parameter and less robust to heterogeneity of the distribution of interest. We characterise this phenomenon by studying the behaviour of spectral gaps as an increasingly poor step-size is chosen for the algorithm. Motivated by these considerations, we propose a novel and simple gradient-based MCMC algorithm, inspired by the classical Barker accept-reject rule, with improved robustness properties. Extensive theoretical results, dealing with robustness to tuning, geometric ergodicity and scaling with dimension, suggest that the novel scheme combines the robustness of simple schemes with the efficiency of gradient-based ones. We show numerically that this type of robustness is particularly beneficial in the context of adaptive MCMC, giving examples where our proposed scheme significantly outperforms state-of-the-art alternatives.

1 Introduction

The need to compute high-dimensional integrals is ubiquitous in modern statistical inference and beyond (e.g. Brooks et al., 2011; Krauth, 2006; Stuart, 2010). Markov chain Monte Carlo (MCMC) is a popular solution, in which the central idea is to construct a Markov chain with a certain limiting distribution and use ergodic averages to approximate expectations of interest. In the celebrated Metropolis–Hastings algorithm, the Markov chain transition is constructed using a combination of a ‘candidate’ kernel, to suggest a possible move at each iteration, together with an accept-reject mechanism (Hastings, 1970; Metropolis et al., 1953). Many different flavours of Metropolis–Hastings exist, with the most common difference being in the construction of the candidate kernel. In the Random walk Metropolis (RWM), proposed moves are generated using a symmetric distribution centred at the current point. Two more sophisticated methods are the Metropolis-adjusted Langevin algorithm (Roberts & Tweedie, 1996) and Hamiltonian/hybrid Monte Carlo (Duane et al., 1987; Neal, 2011). Both use gradient information about the distribution of interest (the target) to inform proposals. Gradient-based methods are widely considered to be state-of-the-art in MCMC, and much current work has been dedicated to their study and implementation (e.g. Beskos et al., 2013; Dalalyan, 2017; Durmus & Moulines, 2017).

Several measures of performance have been developed to help choose a suitable candidate kernel for a given task. One of these is high-dimensional scaling arguments, which compare how the efficiency of the method decays with d, the dimension of the state space. For the random walk algorithm this decay is of the order $d^{-1}$ (Roberts et al., 1997), while for the Langevin algorithm the same figure is $d^{-1/3}$ (Roberts & Rosenthal, 1998) and for Hamiltonian Monte Carlo (HMC) it is $d^{-1/4}$ (Beskos et al., 2013). Another measure is to find general conditions under which a kernel will produce a geometrically ergodic Markov chain. For the random walk algorithm, this essentially occurs when the tails of the posterior decay at a faster than exponential rate and are suitably regular (more precise conditions are given in Jarner & Hansen, 2000). The same is broadly true of the Langevin and Hamiltonian schemes (Durmus et al., 2017a; Livingstone et al., 2019; Roberts & Tweedie, 1996), but here there is an additional restriction that the tails should not decay too quickly. This limitation is caused by the way in which gradients are used to construct the candidate kernel, which can result in the algorithm generating unreasonable proposals that are nearly always rejected in certain regions (Livingstone et al., 2019; Roberts & Tweedie, 1996).

There is clearly some tension between the different results presented above. According to the scaling arguments, gradient information is preferable. The ergodicity results, however, imply that gradient-based schemes are typically less robust than others, in the sense that there is a smaller class of limiting distributions for which the output will be a geometrically ergodic Markov chain. It is natural to wonder whether it is possible to incorporate gradient information in such a way that this measure of robustness is not compromised. Simple approaches to modifying the Langevin algorithm for this purpose have been suggested (based on the idea of truncating gradients, for example Atchade, 2006; Roberts & Tweedie, 1996), but these typically compromise the favourable scaling of the original method. In addition to this, it is often remarked that gradient-based methods can be difficult to tune. Algorithm performance is often highly sensitive to the choice of scale within the proposal (Neal, 2003, Figure 15), and if this is chosen to be too large in certain directions then performance can degrade rapidly. Because of this, practitioners must spend a long time adjusting the tuning parameters to ensure that the algorithm is running well, or develop sophisticated adaptation schemes for this purpose (e.g. Hoffman & Gelman, 2014), which can nonetheless still require a large number of iterations to find good tuning parameters (see Sections 5 and 6). We will refer to this issue as robustness to tuning.

In this article, we present a new gradient-based MCMC scheme, the Barker proposal, which combines favourable high-dimensional scaling properties with favourable ergodicity and robustness to tuning properties. To motivate the new scheme, in Section 2, we present a direct argument showing how the spectral gaps for the random walk, Langevin and Hamiltonian algorithms behave as the tuning parameters are chosen to be increasingly unsuitable for the problem at hand. In particular, we show that the spectral gaps for commonly used gradient-based algorithms decay to zero exponentially fast in the degree of mismatch between the scales of the proposal and target distributions, while for the random walk Metropolis the decay is polynomial. In Section 3, we derive the Barker proposal scheme beginning from a family of π-invariant continuous-time jump processes, and discuss its connections to the concept of ‘locally balanced’ proposals, introduced in Zanella (2020) for discrete state spaces. The name Barker comes from the particular choice of ‘balancing function’ used to uncover the scheme, which is inspired by the classical Barker accept-reject rule (Barker, 1965). In Section 4, we conduct a detailed analysis of the ergodicity, scaling and robustness properties of this new method, establishing that it shares the favourable robustness to tuning of the random walk algorithm, can be geometrically ergodic in the presence of very light tails, and enjoys the $d^{-1/3}$ scaling with dimension of the Langevin scheme. The theory is then supported by an extensive simulation study in Sections 5 and 6, including comparisons with state-of-the-art alternative sampling schemes, which highlights that this kind of robustness is particularly advantageous in the context of adaptive MCMC. The code to reproduce the experiments is available from the online repository at the link https://github.com/gzanella/barker. Proofs and further numerical simulations are provided in the supplement.

1.1 Basic set-up and notation

Throughout we work on the Borel space $(\mathbb{R}^d, \mathcal{B})$, with $d \geq 1$ indicating the dimension. For $\lambda \in \mathbb{R}$, we write $\lambda \uparrow \infty$ and $\lambda \downarrow 0$ to emphasize the direction of convergence when this is important. For two functions $f, g : \mathbb{R} \to \mathbb{R}$, we use the Bachmann–Landau notation $f(t) = O(g(t))$ if $\limsup_{t \to \infty} f(t)/g(t) < \infty$, and $f(t) = \Theta(g(t))$ if both $\liminf_{t \to \infty} f(t)/g(t) > 0$ and $f(t) = O(g(t))$.

The Markov chains we consider will be of the Metropolis–Hastings type, meaning that the $\pi$-invariant kernel $P$ is constructed as $P(x, dy) := \alpha(x, y) Q(x, dy) + r(x) \delta_x(dy)$, where $Q : \mathbb{R}^d \times \mathcal{B} \to [0, 1]$ is a candidate kernel with density $q$,
$$\alpha(x, y) := \min\left\{ 1, \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \right\} \qquad (1)$$
is the acceptance rate for a proposal $y$ given the current point $x$ (provided that the expression is well defined; see Tierney, 1998 for details), and $r(x) := 1 - \int \alpha(x, y) Q(x, dy)$ is the average probability of rejection given that the current point is $x$.
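To fix ideas, the transition just described can be sketched in a few lines of code. The following is a minimal illustration (not the paper's implementation), assuming a symmetric Gaussian candidate kernel so that the ratio in Equation (1) reduces to π(y)/π(x); `log_pi` is a user-supplied log-density.

```python
import numpy as np

def mh_step(x, log_pi, sigma, rng):
    """One Metropolis-Hastings transition with a symmetric Gaussian candidate."""
    y = x + sigma * rng.standard_normal(x.shape)  # draw y from Q(x, .)
    log_alpha = log_pi(y) - log_pi(x)             # log of the ratio in Equation (1)
    if np.log(rng.uniform()) < log_alpha:         # accept with probability alpha(x, y)
        return y
    return x                                      # reject: chain stays at x

# Example usage: 1,000 random walk Metropolis steps on a standard Gaussian target.
rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(1000):
    x = mh_step(x, lambda u: -0.5 * np.sum(u**2), sigma=1.0, rng=rng)
```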

2 Robustness to Tuning

In this section, we seek to quantify the robustness of the random walk, Langevin and Hamiltonian schemes with respect to the mismatch between the scales of π(·) and Q in a given direction. Unlike other analyses in the MCMC literature (e.g. Beskos et al., 2018; Roberts & Rosenthal, 2001), we are interested in studying how MCMC algorithms perform when they are not optimally tuned, in order to understand how crucially performance depends on such design choices (e.g. the choice of proposal step-size or pre-conditioning matrix). The rationale for performing such an analysis is that achieving optimal or even close to optimal tuning can be extremely challenging in practice, especially when π(·) exhibits substantial heterogeneity. Such tuning is typically done using past samples in the chain to compute online estimates of the average acceptance rate and the covariance of π (or simply its diagonal terms for computational convenience), and then using those estimates to tune the proposal step-sizes in different directions (Andrieu & Thoms, 2008). If the degree of heterogeneity is large, it can take a long time for certain directions to be well-explored, and hence for the estimated covariance to be representative and the tuning parameters to converge.

In such settings, algorithms that are more robust to tuning are not only easier to use when such tuning is done manually by the user, but can also greatly facilitate the process of learning the tuning parameters adaptively within the algorithm. We show in Sections 5 and 6 that if an algorithm is robust to tuning then this adaptation process can be orders of magnitude faster than in the alternative case, drastically reducing the overall computational cost for challenging targets. The intuition for this is that more robust algorithms will start performing well (i.e. sampling efficiently) earlier in the adaptation process (when tuning parameters are not yet optimally tuned), which in turn will speed up the exploration of the target and the learning of the tuning parameters.

2.1 Analytical framework

The most general scenario we consider is a family of target densities $\pi^{(\lambda,k)}$ indexed by $\lambda > 0$ and $k \in \{1, \ldots, d\}$ defined as
$$\pi^{(\lambda,k)}(x) := \lambda^{-k}\, \pi\big(x_1/\lambda, \ldots, x_k/\lambda, x_{k+1}, \ldots, x_d\big), \qquad (2)$$
where $\pi$ is a density defined on $\mathbb{R}^d$ for which $\pi(x) > 0$ for all $x \in \mathbb{R}^d$ and $\log \pi \in C^1(\mathbb{R}^d)$. The set-up allows modification of the scale of the first $k$ components of $\pi^{(\lambda,k)}$ through the parameter $\lambda$. Our main results are presented for the case $k = 1$, and we write $\pi^{(\lambda)} := \pi^{(\lambda,1)}$ for simplicity, before discussing extensions to the $k > 1$ setting in Section 2.5. We consider targeting $\pi^{(\lambda)}$ using a Metropolis–Hastings algorithm with fixed tuning parameters, and study performance as $\lambda$ varies. Intuitively, we can think of $\lambda$ as a parameter quantifying the level of heterogeneity in the problem. As a concrete example, consider a RWM algorithm in which, given the current state $x^{(t)}$, the candidate move is $y = x^{(t)} + \sigma \xi$, with $\sigma > 0$ a fixed tuning parameter and $\xi \sim N(0, I_d)$, where $I_d$ is the $d \times d$ identity matrix. It is instructive to take $\sigma$ to be the optimal choice of global scale for $\pi$, meaning that when $\lambda$ is far from one, $\sigma$ is no longer a suitable choice for the first coordinate of $\pi^{(\lambda)}$.

In the context of the above, the λ ↓ 0 regime is representative of distributions in which one component (in this case the first) has a very small scale compared to all others. Conversely, the λ ↑ ∞ regime reflects the case in which one component has a much larger scale than its counterparts. Studying robustness to tuning in the context of heterogeneity is particularly relevant, as highlighted above, as this is exactly the context in which tuning is more challenging. The λ ↓ 0 regime is particularly interesting and has been recently considered in Beskos et al. (2018), where the authors study the behaviour of the RWM for ‘ridged’ densities for different values of k using a diffusion limit approach. The focus in that work, however, was on finding optimal tuning parameters for the algorithm as a function of λ, whereas the present paper is concerned with the regime in which the tuning parameters are fixed (as discussed above).

The above framework could be equivalently formulated by keeping the target distribution π fixed and instead rescaling the first component of the candidate kernel by a factor 1/λ. This is indeed the formulation we mostly use in the proofs of our theoretical results. A proof of the mathematical equivalence between the two formulations can be found in the supplement.
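Concretely, the k = 1 rescaling in Equation (2) can be implemented as a thin wrapper around any target; the sketch below (ours, for illustration) assumes `log_pi` and `grad_log_pi` are supplied by the user.

```python
import numpy as np

def make_rescaled_target(log_pi, grad_log_pi, lam):
    """Return log-density and gradient of pi^(lambda) in Equation (2) with k = 1."""
    def log_pi_lam(x):
        z = x.copy()
        z[0] = z[0] / lam                 # rescale the first coordinate
        return log_pi(z) - np.log(lam)    # Jacobian factor lambda^(-1)

    def grad_log_pi_lam(x):
        z = x.copy()
        z[0] = z[0] / lam
        g = grad_log_pi(z)
        g[0] = g[0] / lam                 # chain rule in the first coordinate
        return g

    return log_pi_lam, grad_log_pi_lam
```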

2.2 Measure of performance

Our measure of performance for the various algorithms will be the spectral gap of the resulting Markov chains. Consider the space of functions
$$L^2_{0,1}(\pi) := \big\{ f : \mathbb{R}^d \to \mathbb{R} \;\big|\; \mathbb{E}_\pi f = 0,\; \mathrm{Var}_\pi f = 1 \big\}.$$
Note that any function $g$ with $\mathbb{E}_\pi g^2 < \infty$ can be associated with an $f \in L^2_{0,1}(\pi)$ through the map $f = (g - \mathbb{E}_\pi g)/\sqrt{\mathrm{Var}_\pi g}$, and that if $X^{(t)} \sim \pi(\cdot)$ and $X^{(t+1)} \mid X^{(t)} \sim P(X^{(t)}, \cdot)$ then $\mathrm{Corr}\{g(X^{(t)}), g(X^{(t+1)})\} = \mathrm{Corr}\{f(X^{(t)}), f(X^{(t+1)})\}$. The (right) spectral gap of a $\pi$-reversible Markov chain with transition kernel $P$ is
$$\mathrm{Gap}(P) := \inf_{f \in L^2_{0,1}(\pi)} \frac{1}{2} \int\!\!\int \big( f(y) - f(x) \big)^2\, \pi(dx)\, P(x, dy). \qquad (3)$$

The expression inside the infimum is called a Dirichlet form, and can be thought of as the ‘expected squared jump distance’ for the function f provided the chain is stationary. This can in turn be re-written as $1 - \mathrm{Corr}\{f(X^{(t)}), f(X^{(t+1)})\}$. Maximising the spectral gap of a reversible Markov chain can therefore be understood as minimising the worst-case first-order autocorrelation among all possible square-integrable test functions.

The spectral gap can be used to bound the variances of ergodic averages (see Proposition 1 of Rosenthal, 2003). A direct connection between the spectral gap and the mixing properties of the chain can also be made if the operator $Pf(x) := \int f(y) P(x, dy)$ is positive on $L^2(\pi)$. This will always be the case if the chain is made lazy, which is the approach taken in Woodard et al. (2009), and the same adjustment can be made here if desired.
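For intuition, on a finite state space the infimum in Equation (3) can be computed exactly from the eigenvalues of the transition matrix. The toy sketch below (ours, not part of the paper's methodology) illustrates this for a small reversible chain.

```python
import numpy as np

def spectral_gap(P, pi):
    """Right spectral gap of a pi-reversible transition matrix P."""
    d = np.sqrt(pi)
    S = P * d[:, None] / d[None, :]        # symmetric under reversibility
    eigenvalues = np.sort(np.linalg.eigvalsh(S))
    return 1.0 - eigenvalues[-2]           # 1 minus the second-largest eigenvalue

# Reversible three-state chain with invariant distribution (0.25, 0.5, 0.25).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])
print(spectral_gap(P, pi))  # 0.5 for this chain
```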

2.3 The small λ regime

In this section, we assess the robustness to tuning of the random walk, Langevin and Hamiltonian schemes as λ ↓ 0. This corresponds to the case in which the proposal scale is chosen to be too large in the first component of π(λ). The results in this section will support the idea that classical gradient-based schemes pay a very high price for any direction in which this tuning parameter is chosen to be too large, as already noted in the literature (e.g. Neal, 2003, p. 738), while the RWM is less severely affected by such issues.

2.3.1 Random walk Metropolis

In the RWM, given a current point $x \in \mathbb{R}^d$, a proposal $y$ is calculated using the equation
$$y = x + \sigma \xi, \qquad (4)$$
with $\sigma > 0$ and $\xi \sim \mu(\cdot)$ for some centred symmetric distribution $\mu$. The resulting candidate kernel $Q^R$ is given by $Q^R(x, dy) = q^R(x, y)\, dy$ with $q^R(x, y) = \sigma^{-d} \mu\big((y - x)/\sigma\big)$, where $\mu(\xi)$ for $\xi \in \mathbb{R}^d$ denotes the density of $\mu$. Following Section 2.1, we consider Metropolis–Hastings algorithms with proposal $Q^R$ and target distribution $\pi^{(\lambda)}$ defined in Equation (2), and denote the resulting transition kernels as $P^R_\lambda$.

We impose the following mild regularity conditions on the density μ(ξ). These are satisfied for most popular choices of μ, as shown in the subsequent proposition.

Condition 1
There exists $\lambda_0 > 0$ such that for any $x, y \in \mathbb{R}^d$ and $\lambda < \lambda_0$ we have $\mu(\delta_\lambda) \geq \mu(\delta)$, where $\delta = y - x$ and
$$\delta_\lambda := \big( \lambda(y_1 - x_1),\, y_2 - x_2,\, \ldots,\, y_d - x_d \big). \qquad (5)$$

In addition, $\sup_{\xi_1 \in \mathbb{R}} \mu_1(\xi_1) < \infty$, where $\mu_1(\xi_1) = \int_{\mathbb{R}^{d-1}} \mu(\xi_1, \xi_2, \ldots, \xi_d)\, d\xi_2 \cdots d\xi_d$ is the marginal density of $\xi_1$ under $\xi \sim \mu$.

Proposition 1

Denoting the usual $p$-norm as $\|x\|_p = \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$, Condition 1 holds in each of the cases below:

  • (i)

    $q^R(x, y) = (2\pi\sigma^2)^{-d/2} \exp\big( -\|x - y\|_2^2 / (2\sigma^2) \big)$ (Gaussian)

  • (ii)

    $q^R(x, y) = 2^{-d} \exp\big( -\|x - y\|_1 \big)$ (Laplace)

  • (iii)

    $q^R(x, y) \propto \big( 1 + \|y - x\|_2^2/\nu \big)^{-(\nu + d)/2}$ for ν ∈ {1, 2, …} (Student's t)

We conclude the section with a characterization of the rate of convergence to zero of the spectral gap for the RWM as λ ↓ 0.

Theorem 1
Assume Condition 1 and $\mathrm{Gap}(P^R_1) > 0$. Then, as λ ↓ 0,
$$\mathrm{Gap}(P^R_\lambda) = \Theta(\lambda).$$

Note that Theorem 1 requires very few assumptions on the target π other than $\mathrm{Gap}(P^R_1) > 0$. Note also that the lower bound is of the form $\mathrm{Gap}(P^R_\lambda) \geq \lambda\, \mathrm{Gap}(P^R_1)$; see the proof of Theorem 1 for details. No dependence on the dimension of the problem other than that intrinsic to $\mathrm{Gap}(P^R_1)$ is therefore introduced.

2.3.2 The Langevin algorithm

In the Langevin algorithm (or more specifically the Metropolis-adjusted Langevin algorithm, MALA), given the current point $x \in \mathbb{R}^d$, a proposal $y$ is generated by setting
$$y = x + \frac{\sigma^2}{2} \nabla \log \pi^{(\lambda)}(x) + \sigma \xi, \qquad (6)$$
for some $\sigma > 0$ and $\xi \sim N(0, I_d)$. In this case the proposal is no longer symmetric and so the full Hastings ratio (1) must be used. The proposal mechanism is based on the overdamped Langevin stochastic differential equation $dX_t = \nabla \log \pi^{(\lambda)}(X_t)\, dt + \sqrt{2}\, dB_t$. We write $Q^M_\lambda$ for the corresponding candidate distribution and $P^M_\lambda$ for the Metropolis–Hastings kernel with proposal $Q^M_\lambda$ and target $\pi^{(\lambda)}$.
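A minimal sketch of one such transition, combining Equation (6) with the Hastings correction (1), is given below (ours, for illustration; `grad_log_pi` is assumed to be supplied by the user).

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, sigma, rng):
    """One MALA transition: Equation (6) plus the Hastings correction (1)."""
    def mean(z):
        return z + 0.5 * sigma**2 * grad_log_pi(z)

    y = mean(x) + sigma * rng.standard_normal(x.shape)
    # log q(x, y) and log q(y, x) for the Gaussian candidate (constants cancel)
    log_q_xy = -np.sum((y - mean(x))**2) / (2.0 * sigma**2)
    log_q_yx = -np.sum((x - mean(y))**2) / (2.0 * sigma**2)
    log_alpha = log_pi(y) - log_pi(x) + log_q_yx - log_q_xy
    return y if np.log(rng.uniform()) < log_alpha else x
```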

We present results for the Langevin algorithm in two settings. Initially, we consider more restrictive conditions under which our upper bound on the spectral gap depends on the tail behaviour of π in a particularly explicit manner, and then give a broader result.

Condition 2

Assume the following:

  • (i)

    π has a density of the form $\pi(x) = \pi_1(x_1)\, \pi_{2:d}(x_2, \ldots, x_d)$, for some densities $\pi_1$ and $\pi_{2:d}$ on $\mathbb{R}$ and $\mathbb{R}^{d-1}$, respectively.

  • (ii)
    For some q ∈ [0, 1), it holds that
    $$\liminf_{|x_1| \to \infty} \frac{|\nabla \log \pi_1(x_1)|}{|x_1|^q} > 0. \qquad (7)$$
Theorem 2
If Condition 2 holds, then there is a γ > 0 such that
$$\mathrm{Gap}(P^M_\lambda) = O\big( e^{-\gamma \lambda^{-(1+q)} - q \log(\lambda)} \big) \quad \text{as } \lambda \downarrow 0.$$

When compared with the random walk algorithm, Theorem 2 shows that the Langevin scheme is much less robust to heterogeneity. Indeed, the spectral gap decays exponentially fast in $\lambda^{-(1+q)}$, meaning that even small errors in the choice of step-size can have a large impact on algorithm efficiency, and so practitioners must invest considerable effort tuning the algorithm for good performance, as shown through simulations in Sections 5 and 6. Theorem 2 also illustrates that the Langevin algorithm is more sensitive to λ when the tails of π(·) are lighter. This is intuitive, as in this setting gradient terms can become very large in certain regions of the state space.

Remark 1

Theorem 2 (and Theorem 4 below) could be extended to the case q ≥ 1 in Equation (7); however, in these cases, samplers typically fail to be geometrically ergodic when λ is small (Livingstone et al., 2019; Roberts & Tweedie, 1996) meaning the spectral gap is typically 0 and the theorem becomes trivial.

Remark 2

Condition 2(ii) could be replaced with the simpler requirement that $|\nabla \log \pi_1(x_1)|$ is bounded, with the corresponding bound $\mathrm{Gap}(P^M_\lambda) = O(e^{-1/\lambda})$.

A different set of conditions, which holds much more generally, together with a corresponding upper bound, is presented below.

Condition 3

Assume the following:

  • (i)
    There is a γ > 0 such that
    $$\limsup_{\|x\| \to \infty} \frac{\|\nabla \log \pi(x)\|}{\|x\|^{\gamma}} < \infty. \qquad (8)$$
  • (ii)
    Given $X \sim \pi$, there is a β > 0 such that
    $$\mathbb{E}\big[ \|\nabla \log \pi(X)\|^{\beta} \big] < \infty. \qquad (9)$$
Theorem 3
If Condition 3 holds, then
$$\mathrm{Gap}(P^M_\lambda) = O(\lambda^{1+\alpha}) \quad \text{as } \lambda \downarrow 0,$$
for some α > 0, which can be taken as α = min{β/2, β/γ, 2/3}.

We expect Condition 3 to be satisfied in many commonly encountered scenarios, with the exception of particularly heavy-tailed models. In the exponential family class $\pi(x) \propto \exp\{-\alpha \|x\|_2^{\beta}\}$, for example, Condition 3 holds for any α and β > 0 (see proof in the supplement).

2.3.3 Hamiltonian Monte Carlo

In Hamiltonian Monte Carlo, we write the current point $x \in \mathbb{R}^d$ as $x(0)$, and construct the proposal $y := x(L)$ for some prescribed integer $L$ using the update
$$x(j+1) := x(j) + \frac{\sigma^2}{2} \nabla \log \pi^{(\lambda)}(x(j)) + \sigma\, \xi(j), \qquad (10)$$
where each $x(j)$ is defined recursively in the same manner, and $\xi(0) \sim N(0, I_d)$. The transition is based on numerically solving Hamilton's equations for the Hamiltonian system $H(x, \xi) = -\log \pi^{(\lambda)}(x) + \xi^T \xi/2$ for $L\sigma$ units of time. The decision of whether or not the proposal is accepted is taken using the acceptance probability $\min\big( 1,\, \pi^{(\lambda)}(y)/\pi^{(\lambda)}(x) \times e^{-\xi(L)^T \xi(L)/2 + \xi(0)^T \xi(0)/2} \big)$, where
$$\xi(j+1) := \xi(j) + \frac{\sigma}{2} \Big( \nabla \log \pi^{(\lambda)}(x(j)) + \nabla \log \pi^{(\lambda)}(x(j+1)) \Big).$$
A more detailed description is given in Neal (2011). We write $P^H_\lambda$ for the corresponding Metropolis–Hastings kernel with proposal mechanism as above and target $\pi^{(\lambda)}$. A minimal code sketch of this proposal mechanism is given below; we then present a heterogeneity result under Condition 2 of the previous subsection.
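The sketch (ours, for illustration; `grad_log_pi` is assumed given) implements the recursion (10) in its standard leapfrog form, together with the accept-reject decision above.

```python
import numpy as np

def hmc_step(x, log_pi, grad_log_pi, sigma, L, rng):
    """One HMC transition: L leapfrog steps of size sigma, then accept-reject."""
    xi0 = rng.standard_normal(x.shape)           # xi(0), the initial momentum
    y, xi = x.copy(), xi0.copy()
    for _ in range(L):
        xi = xi + 0.5 * sigma * grad_log_pi(y)   # half momentum update
        y = y + sigma * xi                       # position update, as in (10)
        xi = xi + 0.5 * sigma * grad_log_pi(y)   # half momentum update
    # Acceptance probability min(1, pi(y)/pi(x) * exp(-|xi(L)|^2/2 + |xi(0)|^2/2))
    log_alpha = log_pi(y) - log_pi(x) - 0.5 * (xi @ xi - xi0 @ xi0)
    return y if np.log(rng.uniform()) < log_alpha else x
```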

Theorem 4
If Condition 2 holds, then there is a γ > 0 such that
$$\mathrm{Gap}(P^H_\lambda) = O\big( e^{-\gamma \lambda^{-(1+q)} - q \log(\lambda)} \big) \quad \text{as } \lambda \downarrow 0.$$

It is no surprise that Theorem 4 is comparable to Theorem 2, since setting L = 1 equates the Langevin and Hamiltonian methods.

2.4 The large λ regime

In this section, we briefly discuss the λ ↑ ∞ regime, where σ is chosen to be too small for the first component of $\pi^{(\lambda)}$, arguing that all samplers under consideration behave similarly in this regime and pay a similar price for overly small tuning parameters in a given direction. The intuition for this is that as λ ↑ ∞ the gradient-based proposal mechanisms discussed here all tend towards that of the random walk sampler in the first coordinate. For example, if we consider one-dimensional models, for any $x \in \mathbb{R}$ we can write $\nabla \log \pi^{(\lambda)}(x) = \lambda^{-1} \nabla \log \pi(x/\lambda)$, meaning that as λ ↑ ∞ the amount of gradient information in the proposal is reduced, provided π is suitably regular. The following result makes this intuition precise. To avoid repetition, we state here the result for both the Langevin and the Barker proposal that we will introduce in the next section.

Proposition 2
Fix $x \in \mathbb{R}$ and let the density π be such that ∇ log π is bounded in some neighbourhood of zero. Then the Langevin and Barker candidate kernels $Q^M_\lambda$ and $Q^B_\lambda$, defined in Equations (6) and (16) respectively, both satisfy
$$\lim_{\lambda \uparrow \infty} \big\| Q_\lambda(x, \cdot) - Q^R(x, \cdot) \big\|_{TV} = 0,$$
where $Q^R$ is the (Gaussian) random walk candidate kernel.

The same intuition applies to the Hamiltonian case provided L is fixed, since each gradient term in the proposal is also Θ(1/λ). While there are many well-known measures of distance between two distributions, we argue that total variation is an appropriate choice here, since it has an explicit focus on how much the two kernels overlap and is invariant under bijective transformations of the state space (including re-scaling coordinates).

While the above statements provide useful heuristic arguments, in order to obtain more rigorous results one should prove that the spectral gaps decay to 0 at the same rate as λ ↑ ∞, which we leave to future work. We note, however, that the conjecture that the algorithms behave similarly for large values of λ is supported by the simulations of Section 5.1.

2.5 Extensions

The lower bound of Theorem 1 extends naturally to the k > 1 setting, becoming instead $\Theta(\lambda^k)$, and so the rate of decay remains polynomial in λ for any k. Analogously, we expect the corresponding upper bound for gradient-based schemes to remain exponential and become $O\big( e^{-k(\gamma \lambda^{-(1+q)} + q \log(\lambda))} \big)$, although the details of this are left for future work. We explore examples of this nature through simulations in Section 5 and find empirically that the single-component case is also informative about more general cases. Further extensions in which a different $\lambda_i$ is chosen in each of the k directions can also be considered, with each $\lambda_i \downarrow 0$ at a different rate. We conjecture that in this setting the $\lambda_i$ that decays most rapidly will dictate the behaviour of spectral gaps, though such an analysis is beyond the scope of the present work. One could consider using a mixture of the MALA/HMC and random walk kernels in an attempt to achieve both robustness to tuning and favourable scaling properties. While this may seem promising in theory, in practice we believe that it would be difficult to achieve both robustness to tuning and favourable high-dimensional performance from such an approach. In the next section, we consider a scheme for which the two goals can be achieved simultaneously.

3 Combining Robustness and Efficiency

The results of Section 2 show that the two gradient-based samplers considered there are much less robust to heterogeneity than the random walk algorithm. In this section, we introduce a novel and simple to implement gradient-based scheme that shares the superior scaling properties of the Langevin and Hamiltonian schemes, but also retains the robustness of the random walk sampler, both in terms of geometric ergodicity and robustness to tuning.

3.1 Locally balanced Metropolis–Hastings

Consider a continuous-time Markov jump process on $\mathbb{R}^d$ with associated generator
$$\mathcal{L}f(x) = \int \big( f(y) - f(x) \big)\, g\!\left( \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \right) Q(x, dy), \qquad (11)$$
for some suitable function $f : \mathbb{R}^d \to \mathbb{R}$, where π(x) is a probability density, Q(x, dy) := q(x, y)dy is a transition kernel and the balancing function g : (0, ∞) → (0, ∞) satisfies
$$g(t) = t\, g(1/t), \qquad t > 0. \qquad (12)$$
A discrete state-space version of this process with symmetric Q was introduced in Power and Goldman (2019). The dynamics of the process are such that if the current state is $X_t = x$, the next jump will be determined by a Poisson process with intensity
$$Z(x) := \int g\!\left( \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \right) Q(x, dy), \qquad (13)$$
and the next state is drawn from the kernel
$$\frac{1}{Z(x)}\, g\!\left( \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \right) Q(x, dy).$$

It is straightforward to show that $\mathcal{L}$ is a self-adjoint operator on the Hilbert space $L^2(\pi)$ using Equation (12), implying that the process is π-reversible and can therefore serve as a starting point for designing MCMC algorithms.

In the ‘locally balanced’ framework for discrete state-space Metropolis–Hastings introduced in Zanella (2020), candidate kernels are of the form
$$Q^{(g)}(x, dy) = \frac{g\big( \pi(y)/\pi(x) \big)\, \mu_\sigma(y - x)}{Z(x)}\, dy, \qquad (14)$$
meaning the embedded Markov chain of Equation (11) with the choice $Q(x, dy) := \mu_\sigma(y - x)\, dy$, where $\mu_\sigma(y - x) := \sigma^{-d} \mu\big((y - x)/\sigma\big)$ for some symmetric density μ. It is well-known that the invariant density of the embedded chain does not coincide with that of the process when jumps are not of constant intensity, in this case becoming proportional to Z(x)π(x), as shown in Zanella (2020). As a result a Metropolis–Hastings step is employed to correct for the discrepancy. In Power and Goldman (2019) it is suggested that as an alternative the jump process can be simulated exactly.

The challenge with employing either of these strategies on a continuous state space is that the integral (13) will typically be intractable. To overcome this issue we take two steps, and for simplicity, we first describe these on $\mathbb{R}$ (there are two options on $\mathbb{R}^d$ for d > 1, which are discussed in Section 3.3). The first step is to consider a first-order Taylor series expansion of log π within g (again with a symmetric choice of Q), leading to the family of processes with generator
$$\mathcal{L}^{(1)}f(x) = \int \big( f(y) - f(x) \big)\, g\big( e^{\nabla \log \pi(x)(y - x)} \big)\, \mu_\sigma(y - x)\, dy.$$
We refer to candidate kernels in Metropolis–Hastings algorithms that are constructed using the embedded Markov chain of this new process as first-order locally balanced proposals, taking the form
$$Q^{(g)}(x, dy) = \frac{g\big( e^{\nabla \log \pi(x)(y - x)} \big)\, \mu_\sigma(y - x)}{Z(x)}\, dy, \qquad (15)$$
where $Z(x) := \int g\big( e^{\nabla \log \pi(x)(y - x)} \big)\, \mu_\sigma(y - x)\, dy$.
Remark 3

One can also think of Equation (12) as a requirement ensuring that the proposals in Equation (15) are exact (i.e. π-reversible) at first order. In particular, in the supplement, it is shown that a proposal $Q^{(g)}$ defined as in Equation (15) is π-reversible for π log-linear if and only if Equation (12) holds.

The second step is to note that, if particular choices of g are made, then Z(x) becomes tractable. In fact, if the balancing function $g(t) = \sqrt{t}$ and a Gaussian kernel $\mu_\sigma$ are chosen, then the result is the Langevin proposal
$$y \sim N\Big( x + \frac{\sigma^2}{2} \nabla \log \pi(x),\; \sigma^2 \Big).$$
Thus, MALA can be viewed as a particular instance of this class. Other choices of g are, however, also possible, and give rise to different gradient-based algorithms. In the next section we explore what a sensible choice of g might look like.
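Before doing so, it is worth recording why the choice $g(t) = \sqrt{t}$ above makes $Z(x)$ tractable; a short completing-the-square calculation under the Gaussian choice of $\mu_\sigma$ gives
\begin{align*}
g\big( e^{\nabla \log \pi(x)(y - x)} \big)\, \mu_\sigma(y - x)
&\propto \exp\left( \frac{\nabla \log \pi(x)(y - x)}{2} - \frac{(y - x)^2}{2\sigma^2} \right) \\
&\propto \exp\left( -\frac{1}{2\sigma^2} \Big( y - x - \frac{\sigma^2}{2} \nabla \log \pi(x) \Big)^2 \right),
\end{align*}
which is the $N\big( x + \frac{\sigma^2}{2} \nabla \log \pi(x), \sigma^2 \big)$ density up to a factor not depending on y, so that $Z(x)$ is available in closed form.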

3.2 The Barker proposal on R

The requirement for the balancing function g to satisfy g(t) = tg(1/t) is in fact also imposed on the acceptance rate of a Metropolis–Hastings algorithm to produce a π-reversible Markov chain. Indeed, setting t := π(y)q(y, x)/(π(x)q(x, y)) and assuming α(x, y) := α(t), the detailed balance equations can be written α(t) = tα(1/t). Possible choices of g can therefore be found by considering suggestions for α in the literature. One choice proposed in Barker (1965) is
$$\alpha(t) = \frac{t}{1 + t}.$$

The work of Peskun (1973) and Tierney (1998) showed that this choice of α is inferior to the more familiar Metropolis–Hastings rule α(t) = min(1, t) in terms of asymptotic variance. The same conclusion cannot, however, be drawn when considering the choice of balancing function g.

In fact, the choice g(t) = t/(1+t) was shown to minimize asymptotic variances of Markov chain estimators in some discrete settings in Zanella (2020). In addition, as shown below, this particular choice of g leads to a fully tractable candidate kernel that can be easily sampled from. For this reason, we focus on this choice of g here, and leave the question of optimality in general for future work.

Proposition 3

If g(t) = t/(1 + t), then the normalising constant Z(x) in Equation (15) is 1/2.

The resulting proposal distribution is
$$Q^B(x, dy) = 2\, \frac{\mu_\sigma(y - x)}{1 + e^{-\nabla \log \pi(x)(y - x)}}\, dy. \qquad (16)$$

We refer to $Q^B$ as the Barker proposal. A simple sampling strategy to generate $y \sim Q^B(x, \cdot)$ is given in Algorithm 1.

Proposition 4

Algorithm 1 produces a sample from $Q^B$ on $\mathbb{R}$.

Algorithm 1 shows that the magnitude |y − x| = |z| of the proposed move does not depend on the gradient ∇ log π(x); it is instead dictated only by the choice of symmetric kernel $\mu_\sigma$. The direction of the proposed move is, however, informed by both the magnitude and direction of the gradient. Examining the form of $p(x, z) = \big( 1 + e^{-z \nabla \log \pi(x)} \big)^{-1}$, the probability in Algorithm 1 that b(x, z) is set to 1, it becomes clear that if the signs of z and ∇ log π(x) are in agreement then p(x, z) > 1/2, and indeed as z∇ log π(x) ↑ ∞ then $e^{-z \nabla \log \pi(x)} \to 0$ and so p(x, z) ↑ 1. Hence, if the indications from ∇ log π(x) are that π(x + z) ≫ π(x), then it is highly likely that b(x, z) will be set to 1 and y = x + z will be the proposed move. Conversely, if z∇ log π(x) < 0, then there is a larger than 50% chance that the proposal will instead be y = x − z. As ∇ log π(x) ↑ ∞ the Barker proposal converges to $\mu_\sigma$ truncated on the right, and similarly to $\mu_\sigma$ truncated on the left as ∇ log π(x) ↓ −∞. See Figure 1 for an illustration.
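A minimal sketch of this sampling strategy (our illustration of the steps described above, with a Gaussian choice for $\mu_\sigma$):

```python
import numpy as np

def barker_proposal_1d(x, grad_log_pi, sigma, rng):
    """Draw y from the Barker proposal Q^B(x, .) in Equation (16)."""
    z = sigma * rng.standard_normal()              # |y - x| = |z|, drawn from mu_sigma
    p = 1.0 / (1.0 + np.exp(-z * grad_log_pi(x)))  # logistic probability p(x, z)
    b = 1.0 if rng.uniform() < p else -1.0         # b(x, z): keep or flip the sign of z
    return x + b * z
```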

FIGURE 1
Left: density of the Barker proposal in one dimension. Current location is x = 0 and the four lines with increasing red intensity correspond to ∇ log π(x) equal to 1, 3, 10 and 50. Right: density of the Barker proposal in two dimensions. Solid lines display the proposal density contours, heat colours refer to the target density, and the current location is x = (−2, 0)

The multiplicative term $\big( 1 + e^{-\nabla \log \pi(x)(y - x)} \big)^{-1}$ in Equation (16), which incorporates the gradient information, injects skewness into the base kernel $\mu_\sigma$ (as can be clearly seen in the left-hand plot of Figure 1). Indeed, the resulting distribution $Q^B$ is an example of a skew-symmetric distribution (Azzalini, 2013, Eq. (1.3)). Skew-symmetric distributions are a tractable family of (skewed) probability density functions that are obtained by multiplying a symmetric base density function with the cumulative distribution function (cdf) of a symmetric random variable. We refer to Azzalini (2013, Ch. 1) for more details, including a more general version of Propositions 3 and 4. In the context of skewed distributions, the Gaussian cdf is often used, leading to the skew-normal distribution introduced in Azzalini (1985). In our context, however, the Barker proposal (which leads to the logistic cdf p(x, z) in Algorithm 1) is the only skew-symmetric distribution that can be obtained from Equation (15) using a balancing function g satisfying g(t) = tg(1/t). See the supplement for more detail.

3.3 The Barker proposal on Rd

There are two natural ways to extend the Barker proposal to $\mathbb{R}^d$, for d > 1. The first is to treat each coordinate separately, and generate the proposal $y = (y_1, \ldots, y_d)$ by applying Algorithm 1 independently to each coordinate. This corresponds to generating a $z_i$ and $b_i(x, z_i)$ for each i ∈ {1, …, d}, and choosing the sign of each $b_i$ using
$$p(x, z_i) = \big( 1 + e^{-z_i\, \partial_i \log \pi(x)} \big)^{-1},$$
where $\partial_i \log \pi(x)$ denotes the partial derivative of log π(x) with respect to $x_i$. Writing $Q^B_i(x, dy_i)$ to denote the resulting Barker proposal candidate kernel for the ith coordinate, the full candidate kernel $Q^B$ can then be written
$$Q^B(x, dy) = \prod_{i=1}^d Q^B_i(x, dy_i). \qquad (17)$$

The full Metropolis–Hastings scheme using the Barker proposal mechanism for a target distribution is given in Algorithm 2 (see the supplement for more details and variations of the algorithm, such as a pre-conditioned version). Note that the computational cost of each iteration of the algorithm is essentially equivalent to that of MALA and will be typically dominated by the cost of computing the gradient and density of the target.
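The following is a minimal sketch of one full step of this kind (our illustration, not the paper's code); the symmetric $\mu_\sigma$ factors cancel in the acceptance ratio, leaving only the logistic terms.

```python
import numpy as np

def barker_step(x, log_pi, grad_log_pi, sigma, rng):
    """One Metropolis-Hastings step with the coordinate-wise Barker proposal (17)."""
    g_x = grad_log_pi(x)
    z = sigma * rng.standard_normal(x.shape)
    p = 1.0 / (1.0 + np.exp(-z * g_x))                 # P(b_i = 1) per coordinate
    b = np.where(rng.uniform(size=x.shape) < p, 1.0, -1.0)
    y = x + b * z                                      # proposed move
    # log q(y, x) - log q(x, y); mu_sigma factors cancel, logaddexp for stability
    w = y - x
    log_q_ratio = np.sum(np.logaddexp(0.0, -w * g_x)
                         - np.logaddexp(0.0, w * grad_log_pi(y)))
    log_alpha = log_pi(y) - log_pi(x) + log_q_ratio
    return y if np.log(rng.uniform()) < log_alpha else x
```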

The second approach to deriving a multivariate Barker proposal consists of sampling $z \in \mathbb{R}^d$ from a d-dimensional symmetric distribution, and then choosing whether or not to flip the sign of every coordinate at the same time, using a single global $\check{b}(x, z) \in \{-1, 1\}$, to produce the global proposal $y = x + \check{b}(x, z) \times z$. In this case, the probability that $\check{b}(x, z) = 1$ will be
$$p(x, z) = \big( 1 + e^{-z^T \nabla \log \pi(x)} \big)^{-1}. \qquad (19)$$

This second approach does not allow gradient information to feed into the proposal as effectively as in the first case. Specifically, only the global inner product $z^T \nabla \log \pi(x)$ is considered, and the decision to alter the sign of every component of z is taken based solely on this value. In other words, once $z \sim \mu_\sigma$ has been sampled, gradient information is only used to make a single binary decision of choosing between the two possible proposals x + z and x − z, while in the first strategy gradient information is used to choose between $2^d$ possible proposals $\{ x + b \cdot z : b \in \{-1, 1\}^d \}$ (where $b \cdot z := (b_1 z_1, \ldots, b_d z_d)$). Indeed, the following proposition shows that the second strategy cannot improve over the RWM by more than a factor of two.

Proposition 5

Let $\check{P}^B$ denote the Metropolis–Hastings kernel of the modified Barker proposal on $\mathbb{R}^d$ using Equation (19). Then $\mathrm{Gap}(P^R) \geq \mathrm{Gap}(\check{P}^B)/2$.

One can also make a stronger statement than the above proposition, namely that if this strategy is employed, only a constant factor improvement over the RWM can be achieved in terms of asymptotic variance, for any $L^2(\pi)$ function of interest. Given Proposition 5 we choose to use the first strategy described to produce Barker proposals on $\mathbb{R}^d$, and the multi-dimensional candidate kernel given in Equation (17). In the following sections, we will show both theoretically and empirically that this choice does indeed have favourable robustness and efficiency properties.

4 Robustness, Scaling and Ergodicity Results for the Barker Proposal

In this section, we establish results concerning robustness to tuning, scaling with dimension and geometric ergodicity for the Barker proposal scheme. As we will see, the method not only enjoys the superior efficiency of gradient-based algorithms in terms of scaling with dimension, but also shares the favourable robustness properties of the RWM when considering both robustness to tuning and geometric ergodicity.

4.1 Robustness to tuning

We now examine the robustness to tuning of the Barker proposal using the framework introduced in Section 2. We write $Q^B_\lambda$ and $P^B_\lambda$ to denote the candidate and Metropolis–Hastings kernels for the Barker proposal targeting the distribution $\pi^{(\lambda)}$ defined therein, and $P^B$ for the case λ = 1. The following result characterizes the behaviour of the spectral gap of $P^B_\lambda$ as λ ↓ 0.

Theorem 5
Assume Condition 1 and $\mathrm{Gap}(P^B) > 0$. Then, as λ ↓ 0,
$$\mathrm{Gap}(P^B_\lambda) = \Theta(\lambda).$$

Comparing Theorem 5 with Theorems 1–4 from Section 2.3, we see that the Barker proposal inherits the robustness to tuning of random walk schemes and is significantly more robust than the Langevin and Hamiltonian algorithms. In the next section, we establish general conditions under which $\mathrm{Gap}(P^B) > 0$.

4.2 Geometric ergodicity

In this section, we study the class of target distributions for which the Barker proposal produces a geometrically ergodic Markov chain. We show that geometric ergodicity can be obtained even when the gradient term in the proposal grows faster than linearly, which is typically not the case for MALA and HMC.

Recall that a Markov chain is called geometrically ergodic if
$$\big\| P^n(x, \cdot) - \pi(\cdot) \big\|_{TV} \leq C\, V(x)\, \rho^n, \qquad (20)$$
for some C < ∞, Lyapunov function $V : \mathbb{R}^d \to [1, \infty)$, and ρ < 1, where $\|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|$ for probability measures μ and ν. When such a condition can be established for a reversible Markov chain, then a central limit theorem exists for any square-integrable function (Roberts & Rosenthal, 2004).

We prove geometric ergodicity results for generic proposals as in Equation (15), assuming g to be bounded and monotone, and $\mu_\sigma$ to have lighter than exponential tails. Following the discussion in Section 3.3, we consider proposals that are independent across components, leading to
$$Q^{(g)}(x, dy) = \prod_{i=1}^d \frac{g\big( e^{\partial_i \log \pi(x)(y_i - x_i)} \big)\, \mu_\sigma(y_i - x_i)}{Z_i(x)}\, dy_i, \qquad (21)$$
where $Z_i(x) := \int_{\mathbb{R}} g\big( e^{\partial_i \log \pi(x)(y_i - x_i)} \big)\, \mu_\sigma(y_i - x_i)\, dy_i$. With a slight abuse of notation, we use $\mu_\sigma$ to represent both one- and d-dimensional densities. The Barker proposal in Equation (17) is the special case obtained by taking g(t) = t/(1 + t).

For the results of this section, we make the simplifying assumption that π is spherically symmetric outside a ball of radius R < ∞.

Condition 4

There exist R < ∞ and a differentiable function $f : (0, \infty) \to \mathbb{R}$ with $\lim_{r \to \infty} f'(r) = -\infty$ and $f'(r)$ non-increasing for r > R, such that log π(x) = f(‖x‖) for ‖x‖ > R.

Theorem 6

Let g : (0, ∞) → (0, ∞) be a bounded and non-decreasing function, $\int_{\mathbb{R}} \exp(sw)\, \mu_\sigma(w)\, dw < \infty$ for every s > 0, and $\inf_{w \in (-\delta, \delta)} \mu_\sigma(w) > 0$ for some δ > 0. If the target density π satisfies Condition 4, then the Metropolis–Hastings chain with proposal $Q^{(g)}$ is π-a.e. geometrically ergodic.

We note that tail regularity assumptions such as Condition 4 are common in this type of analysis (e.g. Durmus et al., 2017a; Jarner & Hansen, 2000). As an intuitive example, the condition is satisfied in the exponential family $\pi(x) \propto \exp(-\alpha \|x\|^{\beta})$ for all β > 1. As a contrast, for MALA and HMC it is known that for β > 2 the sampler fails to be geometrically ergodic (Livingstone et al., 2019; Roberts & Tweedie, 1996). We expect the Barker proposal to be geometrically ergodic also for the case β = 1, although we do not prove it in this work. It is worth noting that the MALA choice $g(t) = \sqrt{t}$ is unbounded above, which is a central reason for the lack of stability compared to bounded choices such as g(t) = t/(1 + t) employed in the Barker scheme.

4.3 Scaling with dimensionality

In this section, we provide preliminary results suggesting that the Barker proposal enjoys scaling behaviour analogous to that of MALA in high-dimensional settings, meaning that under appropriate assumptions it requires the number of iterations per effective sample to grow as $\Theta(d^{1/3})$ with the number of dimensions d as d → ∞. Similarly to Section 4.2, we prove results for general proposals $Q^{(g)}$ as in Equation (21) with balancing functions g satisfying g(t) = tg(1/t). The Barker proposal is a special case of the latter family.

We perform an asymptotic analysis for d → ∞ using the framework introduced in Roberts et al. (1997). The main idea is to study the rate at which the proposal step size σ needs to decrease as d → ∞ to obtain well-behaved limiting behaviour for the MCMC algorithm under consideration (such as a Θ(1) acceptance rate and convergence to a non-trivial diffusion process after appropriate time re-scaling). Based on the rate of decrease of σ, one can infer how the number of MCMC iterations required for each effective sample increases as d → ∞. For example, in the case of the RWM, $\sigma^2$ must be scaled as $\Theta(d^{-1})$ as d → ∞ to have a well-behaved limit (Roberts et al., 1997), which leads to RWM requiring Θ(d) iterations for each effective sample. By contrast, for MALA it is sufficient to take $\sigma^2 = \Theta(d^{-1/3})$ as d → ∞, which leads to only $\Theta(d^{1/3})$ iterations for each effective sample (Roberts & Rosenthal, 1998). While these analyses are typically performed under simplifying assumptions, such as having a target distribution with i.i.d. components, the results have been extended in many ways (e.g. removing the product-form assumption, see Mattingly et al., 2012) obtaining analogous conclusions. See also Beskos et al. (2013) for optimal scaling analysis of HMC and Roberts and Rosenthal (2016) for rigorous connections between optimal scaling results and computational complexity statements.

In this section, we focus on the scaling behaviour of Metropolis–Hastings algorithms with proposal $Q^{(g)}$ as in Equation (21), when targeting distributions of the form $\pi(x) = \prod_{i=1}^d f(x_i)$, where f is a one-dimensional smooth density function. Given the structure of $Q^{(g)}$ and π(·), the acceptance rate takes the form $\alpha(x, y) = \min\big\{ 1, \prod_{i=1}^d \alpha_i(x_i, y_i) \big\}$, where
$$\alpha_i(x_i, y_i) = \frac{e^{\phi(y_i)}\, g\big( e^{\phi'(y_i)(x_i - y_i)} \big)\, Z_i(x)}{e^{\phi(x_i)}\, g\big( e^{\phi'(x_i)(y_i - x_i)} \big)\, Z_i(y)}, \qquad (22)$$
and ϕ = log f. In such a context, the scaling properties of the MCMC algorithms under consideration are typically governed by the behaviour of $\log(\alpha_i(x_i, y_i))$ as $y_i$ gets close to $x_i$, or more precisely by the degree of the leading term in the Taylor series expansion of $\log(\alpha_i(x_i, x_i + \sigma u_i))$ in powers of σ as σ → 0 for fixed $x_i$ and $u_i$. For example, in the case of the RWM one has $\log(\alpha_i(x_i, x_i + \sigma u_i)) = \Theta(\sigma)$ as σ → 0, which in fact implies the proposal variance $\sigma^2$ must decrease at a rate $\Theta(d^{-1})$ to obtain a non-trivial limit. By contrast, when the MALA proposal is used, one has $\log(\alpha_i(x_i, x_i + \sigma u_i)) = \Theta(\sigma^3)$ as σ → 0, which in turn leads to $\sigma^2 = \Theta(d^{-1/3})$. See Sections 2.1–2.2 of Durmus et al. (2017b) for a more detailed and rigorous discussion on the connection between the Taylor series expansion of $\log(\alpha_i(x_i, y_i))$ and MCMC scaling results. The following proposition shows that the condition g(t) = tg(1/t), when combined with some smoothness assumptions, is sufficient to ensure that the proposals $Q^{(g)}$ lead to $\log(\alpha_i(x_i, x_i + \sigma u_i)) = O(\sigma^3)$ as σ → 0.
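As a point of reference, the RWM claim above can be verified directly; with a symmetric proposal the $Z_i$ and $\mu_\sigma$ factors cancel in Equation (22), and a one-line Taylor expansion (assuming ϕ continuously differentiable) gives
$$\log \alpha_i(x_i, x_i + \sigma u_i) = \phi(x_i + \sigma u_i) - \phi(x_i) = \sigma u_i\, \phi'(x_i) + O(\sigma^2) = \Theta(\sigma),$$
whenever $u_i\, \phi'(x_i) \neq 0$.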
Proposition 6
Let g : (0, ∞) → (0, ∞) with g(t) = tg(1/t) for all t. If g is three times continuously differentiable and $\int_{\mathbb{R}} g^{(j)}(e^{sw})\, \mu(w)\, dw < \infty$ for all s > 0 and j ∈ {0, 1, 2, 3}, where $g^{(j)} : (0, \infty) \to (0, \infty)$ is the jth derivative of g, then
$$\log\big( \alpha_i(x_i, x_i + \sigma u_i) \big) = O(\sigma^3) \quad \text{as } \sigma \to 0, \qquad (23)$$
for any $x_i$ and $u_i$ in $\mathbb{R}$.

Proposition 6 suggests that Metropolis–Hastings algorithms with proposals $Q^{(g)}$ such that g(t) = tg(1/t) have scaling behaviour analogous to MALA, meaning that $\sigma^2 = \Theta(d^{-1/3})$ is sufficient to ensure a non-trivial limit and thus $\Theta(d^{1/3})$ iterations are required for each effective sample. To make these arguments rigorous, one should prove weak convergence results for d → ∞, as in Roberts and Rosenthal (1998). Proving such a result for a general g would require a significant amount of technical work, thus going beyond the scope of this section. In this paper we rather support the conjecture of $\Theta(d^{1/3})$ scaling for $Q^{(g)}$ by means of simulations (see Section 5.2). While Proposition 6 only shows $\log(\alpha_i(x_i, x_i + \sigma u_i)) = O(\sigma^3)$, it is possible to show that $\log(\alpha_i(x_i, x_i + \sigma u_i)) = \Theta(\sigma^3)$ with some extra assumptions on ϕ to exclude exceptional cases (see the supplement for more detail).

5 Simulations with Fixed Tuning Parameters

Throughout Sections 5 and 6, we choose the symmetric density $\mu_\sigma$ within the random walk and Barker proposals to be $N(0, \sigma^2 I_d)$ for simplicity. Note, however, that any symmetric density $\mu_\sigma$ could in principle be used. It would be interesting to explore the impact of different choices of $\mu_\sigma$ on the performance of the Barker algorithm, and we leave such a comparison to future work.

5.1 Illustrations of robustness to tuning

We first provide an illustration of the robustness to tuning of the random walk, Langevin and Barker algorithms in three simple one-dimensional settings. In each case we approximate the expected squared jump distance (ESJD) using $10^4$ Monte Carlo samples and standard Rao–Blackwellisation techniques, across a range of different proposal step-sizes between 0.01 and 100. As is clearly shown in Figure 2, all algorithms perform similarly when the step-size is smaller than optimal, as suggested in Section 2.4. As the step-size increases beyond this optimum, however, behaviours begin to differ. In particular, the ESJD for MALA rapidly decays to zero, whereas in the random walk and Barker cases the reduction is much less pronounced. In fact, the rate of decay is similar for the two schemes, which is to be expected following the results of Sections 4.1 and 2.3. See the supplement for a similar illustration on a 20-dimensional example.

FIGURE 2
Expected squared jump distance against proposal step-size for random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker on different one-dimensional targets

5.2 Comparison of efficiency on isotropic targets

Next we compare the expected squared jump distance of the random walk, Langevin and Barker schemes when sampling from isotropic distributions of increasing dimension, with optimised proposal scale (chosen to maximise expected squared jumping distance). This set-up is favourable to MALA, which is the least robust scheme among the three, as the target distribution is homogeneous and the proposal step-size optimally chosen. We consider target distributions with independent and identically distributed (i.i.d.) components, corresponding to the scenario studied theoretically in Section 4.3. We set the distribution of each coordinate to be either a standard normal distribution or a hyperbolic distribution, corresponding to $\log \pi(x) = -\sum_{i=1}^d x_i^2/2 + \text{const}$ and $\log \pi(x) = -\sum_{i=1}^d (0.1 + x_i^2)^{1/2} + \text{const}$, respectively. Figure 3 shows how the ESJD per coordinate decays as dimension increases for the three algorithms. For MALA and Barker, the ESJD appears to decrease at the same rate as d increases, which is in accordance with the preliminary results in Section 4.3. In the Gaussian case, MALA outperforms Barker roughly by a factor of 2 regardless of dimension (more precisely, the ESJD ratio lies between 1.7 and 2.5 for all values of d in Figure 3), while in the hyperbolic case the same factor is around 1.2, again independently of dimension (ESJD ratio between 1.1 and 1.25 for all values of d in Figure 3). The rate of decay for the RWM is faster, as predicted by the theory.

FIGURE 3
Expected squared jump distance against dimensionality for random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker schemes with optimally-tuned step size. The target distribution has i.i.d. coordinates following either a Gaussian distribution (left plot) or a hyperbolic one (right plot)

6 Simulations with Adaptive Markov Chain Monte Carlo

In this section, we illustrate how robustness to tuning affects the performance of adaptive MCMC methods.

6.1 Adaptation strategy and algorithmic set-up

We use Algorithm 4 in Section 5 of Andrieu and Thoms (2008) to adapt the tuning parameters within each scheme. Specifically, in each case, a Markov chain is initialised using a chosen global proposal scale $\sigma_0$ and an identity pre-conditioning matrix $\Sigma_0 = I_d$, and at each iteration the global scale and pre-conditioning matrix are updated using the equations
$$\log \sigma_{t+1} = \log \sigma_t + \gamma_{t+1} \big( \alpha(X^{(t)}, Y^{(t)}) - \bar{\alpha}^* \big), \qquad (24)$$
$$\mu_{t+1} = \mu_t + \gamma_{t+1} \big( X^{(t+1)} - \mu_t \big), \qquad (25)$$
$$\Sigma_{t+1} = \Sigma_t + \gamma_{t+1} \big( (X^{(t+1)} - \mu_t)(X^{(t+1)} - \mu_t)^T - \Sigma_t \big). \qquad (26)$$

Here $X^{(t)}$ denotes the current point in the Markov chain, $Y^{(t)}$ is the proposed move, $\mu_0 = 0$, $\bar{\alpha}^*$ denotes some ideal acceptance rate for the algorithm and the parameter $\gamma_t$ is known as the learning rate. We set $\bar{\alpha}^*$ to be 0.23 for RWM, 0.57 for MALA and 0.40 for Barker. We tried changing the value of $\bar{\alpha}^*$ for Barker in the range [0.2, 0.6] without observing major differences. In our simulations, we constrain $\Sigma_t$ to be diagonal (i.e., all off-diagonal terms in Equation (26) are set to 0). This is often done in practice to avoid having to learn a dense pre-conditioning matrix, which would have a high computational cost and require a large number of MCMC samples. See the supplement for full details on the pre-conditioned Barker schemes obtained with both diagonal and dense matrix $\Sigma_t$, including pseudo-code of the resulting algorithms.

We set the learning rate to $\gamma_t := t^{-\kappa}$ with κ ∈ (0.5, 1), as for example suggested in Shaby and Wells (2010). Small values of κ correspond to more aggressive adaptation, and for example Shaby and Wells (2010) suggest using κ = 0.8. In the simulations of Section 6.2, we use κ = 0.6 as this turned out to be a good balance between fast adaptation and stability for MALA (κ = 0.8 resulted in too slow adaptation, while values of κ lower than 0.6 led to instability). The adaptation of RWM and Barker was not very sensitive to the value of κ. Unless specified otherwise, all algorithms are randomly initialized with each coordinate sampled independently from a normal distribution with standard deviation 10. Following results from optimal scaling theory (Roberts & Rosenthal, 2001), we set the starting value for the global scale as $\sigma_0^2 = 2.4^2/d$ for RWM and $\sigma_0^2 = 2.4^2/d^{1/3}$ for MALA. For Barker we initialize $\sigma_0$ to the same values as MALA.
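In code, one adaptation step of Equations (24)-(26) with a diagonal $\Sigma_t$ can be sketched as follows (illustrative only; `accept_prob` stands for the acceptance probability $\alpha(X^{(t)}, Y^{(t)})$ of the current move, whichever of the three transition kernels produced it):

```python
import numpy as np

def adapt(t, x_new, accept_prob, log_sigma, mu, diag_Sigma,
          target_accept, kappa=0.6):
    """One update of Equations (24)-(26), keeping only the diagonal of Sigma."""
    gamma = (t + 1.0) ** (-kappa)                        # learning rate gamma_{t+1}
    log_sigma = log_sigma + gamma * (accept_prob - target_accept)   # (24)
    delta = x_new - mu
    mu = mu + gamma * delta                              # (25)
    diag_Sigma = diag_Sigma + gamma * (delta**2 - diag_Sigma)       # (26), diagonal
    return log_sigma, mu, diag_Sigma
```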

6.2 Performance on target distributions with heterogeneous scales

In this section, we compare the adaptive algorithms described above when sampling from target distributions with significant heterogeneity of scales across their components. We consider 100-dimensional target distributions with different types of heterogeneity, tail behaviour and degree of skewness according to the following four scenarios:

  • 1

    (One coordinate with small scale; Gaussian target) In the first scenario, we consider a Gaussian target with zero mean and diagonal covariance matrix. We set the standard deviation of the first coordinate to 0.01 and that of the other coordinates to 1. This scenario mirrors the theoretical framework of Sections 2 and 4.1 in which a single coordinate is the source of heterogeneity.

  • 2

(Coordinates with random scales; Gaussian target) Here we modify scenario 1 by generating the standard deviations of each coordinate randomly, sampling them independently from a log-normal distribution. More precisely, we sample $\log(\eta_i) \sim N(0, 1)$ independently for i = 1, …, 100, where $\eta_i$ is the standard deviation of the ith component.

  • 3

(Coordinates with random scales; Hyperbolic target) In the third scenario, we change the tail behaviour of the target distribution, replacing the Gaussian with a hyperbolic distribution (a smoothed version of the Laplace distribution, ensuring $\log \pi \in C^1(\mathbb{R}^d)$). In particular, we set $\log \pi(x) = -\sum_{i=1}^d \big( \epsilon + (x_i/\eta_i)^2 \big)^{1/2} + c$, with ε = 0.1 and c being a normalizing constant. The scale parameters $(\eta_i)_i$ are generated randomly as in scenario 2.

  • 4

(Coordinates with random scales; Skew-normal target) Finally, we consider a non-symmetric target distribution, which represents a more challenging and realistic situation. We assume that the ith coordinate follows a skew-normal distribution with scale $\eta_i$ and skewness parameter α, meaning that $\log \pi(x) = -\frac{1}{2} \sum_{i=1}^d (x_i/\eta_i)^2 + \sum_{i=1}^d \log \Phi(\alpha x_i/\eta_i) + c$, with c being a normalizing constant. We set α = 4 and generate the $\eta_i$'s randomly as in scenario 2.

First we provide an illustration of the behaviour of the three algorithms by plotting trace plots of the tuning parameters and of the MCMC trajectories; see Figure 4 for the results in scenario 1. The adaptation of tuning parameters for the Barker scheme stabilises within a few hundred iterations, after which the algorithm performance appears to be stable and efficient. By contrast, both RWM and MALA struggle to learn the heterogeneous scales, and the adaptation process has either just stabilized or not yet stabilized after $10^4$ iterations. Looking at the behaviour of MALA in Figure 4 we see that, in order for the algorithm to achieve a non-zero acceptance rate, the global scale parameter $\sigma_t$ must first be reduced considerably to accommodate the smallest scale of π(·). At this point the algorithm can slowly begin to learn the components of the pre-conditioning matrix $\Sigma_t$, but this learning occurs very slowly because the comparatively small value of $\sigma_t$ results in poor mixing in all dimensions other than the first. Analogous plots for scenarios 2, 3 and 4 are given in the supplement and display comparable behaviour.

FIGURE 4
Random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker schemes with adaptive tuning as in Equations (24)-(26) and learning rate set to $\gamma_t = t^{-\kappa}$ with κ = 0.6. The target distribution is a 100-dimensional Gaussian in which the first component has standard deviation 0.01 and all others have unit scale. First row: adaptation of the global scale $\sigma_t$; second row: adaptation of the local scales $\mathrm{diag}(\Sigma_t) = (\Sigma_{t,ii})_{i=1}^{100}$; third row: trace plot of the first coordinate; fourth row: trace plots of coordinates 2 to 100 (superposed)

We then compare algorithms in a more quantitative way, by looking at the average mean squared error (MSE) of MCMC estimators of the first moment of each coordinate, which is a standard metric in MCMC. For any $h : \mathbb{R}^d \to \mathbb{R}$, define the corresponding MSE as $\mathbb{E}\big[ (\hat{h}^{(t)} - \mathbb{E}_\pi[h])^2 \big]$, where $\hat{h}^{(t)} = (t - t_{\mathrm{burn}})^{-1} \sum_{i=t_{\mathrm{burn}}+1}^{t} h(X^{(i)})$ is the MCMC estimator of $\mathbb{E}_\pi[h]$ after t iterations of the algorithm. Here $t_{\mathrm{burn}}$ is a burn-in period, which we set to $t_{\mathrm{burn}} = \lfloor t/2 \rfloor$, where ⌊·⌋ denotes the floor function. Below, we report the average MSE for the collection of test functions given by $h(x) = x_i/\eta_i$ for i = 1, …, d after t MCMC iterations (rescaling by $\eta_i$ is done to give equal importance to each coordinate).

In addition, we also monitor the rate at which the pre-conditioning matrix $\Sigma_t$ converges to the covariance of π, denoted as $\Sigma^*$, in order to measure how quickly the adaptation mechanism learns suitable local tuning parameters. We consider the $l^2$-distance between the diagonal elements of $\Sigma_t$ and $\Sigma^*$ on the log scale. This leads to the following measure of convergence of the tuning parameters after t MCMC iterations:
$$d_t := \mathbb{E}\left[ \left( \sum_{i=1}^d \big( \log \Sigma_{t,ii} - \log \Sigma^*_{ii} \big)^2 \right)^{1/2} \right], \qquad (27)$$
where the expectation is with respect to the Markov chain $(X^{(t)})_{t \geq 1}$. We use the log scale as it is arguably more appropriate than the natural one when comparing step-size parameters, and we focus on diagonal terms as both $\Sigma_t$ and $\Sigma^*$ are diagonal here. By monitoring the convergence of $d_t$ to 0 we can compare the speed at which good tuning parameters are found during the adaptation process for different schemes.

Figure 5 displays the evolution of $d_t$ and the MSE defined above over $4 \times 10^4$ iterations of each algorithm, where $d_t$ and the MSE are estimated by averaging over 100 independent runs of each algorithm. The results are in accordance with the illustration in Figure 4, and suggest that the Barker scheme is robust to different types of targets and heterogeneity and results in very fast adaptation, while both MALA and RWM require significantly more iterations to find good tuning parameters. The tuning parameters of MALA appear to exhibit more unstable behaviour than RWM in the first few thousand iterations (larger $d_t$), while after that they converge more quickly, which again is in accordance with the behaviour observed in Figure 5 and with the theoretical considerations of Sections 2 and 4.1. To further quantify the tuning period, we define the time to reach a stable level of tuning as $\tau_{\mathrm{adapt}}(\epsilon) = \inf\{ t \geq 1 : d_t \leq \epsilon \}$ for some ε > 0. We take ε = 1 and report the resulting values in Table 1, denoting $\tau_{\mathrm{adapt}}(1)$ simply as $\tau_{\mathrm{adapt}}$. The results show that in these examples Barker always has the smallest adaptation time, with a speed-up compared to RWM of at least 34x in all four scenarios, and a speed-up compared to MALA ranging between 3x (scenario 3) and 30x (scenario 2). The adaptation times $\tau_{\mathrm{adapt}}$ tend to increase from scenarios 1 to 4, suggesting that the target distribution becomes more challenging as we move from scenarios 1 to 4. The hardest case for Barker seems to be the hyperbolic target, although even there the tuning stabilized in roughly 3000 iterations, while the hardest case for MALA is the skew-normal, in which tuning stabilized in roughly 30,000 iterations.

FIGURE 5
Comparison of random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker on the four target distributions (scenarios 1 to 4) described in Section 6.2, averaging over ten repetitions of each algorithm. First row: convergence of tuning parameters, measured by $d_t$ defined in Equation (27). Second row: mean squared error of Markov chain Monte Carlo estimators of first moments averaged over all coordinates

TABLE 1 Adaptation times ($\tau_{\mathrm{adapt}}$) and mean squared errors (MSE) from 10k, 20k and 40k iterations of the random walk Metropolis (RWM), Metropolis-adjusted Langevin algorithm (MALA) and Barker algorithms under each of the four heterogeneous scenarios described in Section 6.2

Scenario  Method   τ_adapt   MSE (10k)   MSE (20k)   MSE (40k)
1         RWM       18,757     0.200       0.036       0.013
          MALA      10,785     0.348       0.016       0.002
          Barker       524     0.007       0.005       0.003
2         RWM       19,163     0.228       0.045       0.013
          MALA      17,298     0.644       0.147       0.004
          Barker       542     0.007       0.005       0.003
3         RWM       >40k       0.409       0.080       0.016
          MALA      10,630     0.248       0.019       0.006
          Barker     3,294     0.012       0.009       0.007
4         RWM       >40k       0.315       0.092       0.016
          MALA      34,340     0.813       0.488       0.112
          Barker     1,427     0.008       0.006       0.004

The differences in adaptation times have a direct impact on the resulting MSE of MCMC estimators, which is intuitive because the Markov chain will typically start sampling efficiently from $\pi$ only once good tuning parameters are found. As we see from the second row of Figure 5 and from Table 1, the MSE of Barker is already quite low (between 0.007 and 0.012) after 10⁴ iterations in all scenarios, while RWM and MALA need significantly more iterations to achieve the same MSE. Once good tuning parameters are found and enough samples have been collected, MALA is slightly more efficient than Barker for the Gaussian target in scenario 1 and equally efficient for the hyperbolic target of scenario 3, which is consistent with the simulations of Section 5.2 under optimal tuning.

6.3 Comparison on a Poisson random effects model

In this section, we consider a Poisson hierarchical model of the form
$$y_{ij}\mid\eta_i\overset{\mathrm{ind}}{\sim}\mathrm{Poisson}(e^{\eta_i}),\qquad \eta_i\mid\mu\overset{\mathrm{iid}}{\sim}N(\mu,\sigma_\eta^2),\qquad \mu\sim N(0,\sigma_\mu^2),\qquad(28)$$
for $i=1,\dots,I$ and $j=1,\dots,n_i$, and test the algorithms on the task of sampling from the resulting posterior distribution $p(\mu,\eta_1,\dots,\eta_I\mid y)$, where $y=(y_{ij})_{ij}$ denotes the observed data. In our simulations, we set $I=50$ and $n_i=5$ for all $i$, leading to 51 unknown parameters and 250 observations.

The model in Equation (28) is an example of a generalized linear model that induces a posterior distribution with light tails and potentially large gradients of $\log\pi$, which creates a challenge for gradient-based algorithms. In particular, sampling from the posterior becomes harder when the observations $(y_{ij})_{ij}$ either contain large values or are heterogeneous across $i\in\{1,\dots,I\}$. The former results in a more peaked posterior distribution with larger gradients, while the latter induces heterogeneity across the posterior distributions of the parameters $\eta_i$.
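To make the source of the difficulty concrete, here is a hedged R sketch of the log-posterior and its gradient for model (28), flattening the parameters as $x=(\mu,\eta_1,\dots,\eta_I)$. The scale `sigma_mu` of the Gaussian prior on $\mu$ is our own generic notation; its default below is an assumption, not a value from the paper.

```r
# Log-posterior (up to a constant) for the Poisson hierarchical model (28).
# y: I x n matrix of counts; sigma_eta: prior scale of eta_i;
# sigma_mu: scale of the Gaussian prior on mu (assumed value).
log_post <- function(x, y, sigma_eta, sigma_mu = 10) {
  mu <- x[1]; eta <- x[-1]
  s  <- rowSums(y); n <- ncol(y)
  sum(s * eta - n * exp(eta)) -               # Poisson log-likelihood
    sum((eta - mu)^2) / (2 * sigma_eta^2) -   # Gaussian prior on eta_i
    mu^2 / (2 * sigma_mu^2)                   # Gaussian prior on mu
}

grad_log_post <- function(x, y, sigma_eta, sigma_mu = 10) {
  mu <- x[1]; eta <- x[-1]
  s  <- rowSums(y); n <- ncol(y)
  # exp(eta) blows up for moderately large eta: the 'large gradients' issue
  g_eta <- s - n * exp(eta) - (eta - mu) / sigma_eta^2
  g_mu  <- sum(eta - mu) / sigma_eta^2 - mu / sigma_mu^2
  c(g_mu, g_eta)
}
```

Note how the likelihood term $-n_i e^{\eta_i}$ in the gradient grows exponentially in $\eta_i$, which is precisely what destabilises schemes whose proposals scale linearly in the gradient.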

In our simulations, we consider three scenarios, corresponding to increasingly challenging target distributions:

1. In the first scenario, we take $\sigma_\eta=1$ and generate the data $y$ from the model in Equation (28), assuming the data-generating value of $\mu$ to be $\mu^*=5$ and sampling the data-generating values of $\eta_1,\dots,\eta_I$ from their prior distribution.

2. In the second scenario, we increase the value of $\sigma_\eta$ to 3, which induces more heterogeneity across the parameters $\eta_1,\dots,\eta_I$.

3. In the third scenario, we keep $\sigma_\eta=3$ and increase the value of $\mu^*$ to 10, thus inducing larger gradients.

In each scenario, we run the algorithms directly on the joint parameter space $(\mu,\eta_1,\dots,\eta_I)$. Similarly to Section 6.2, we first provide an illustration of the behaviour of the tuning parameters and MCMC trace plots for RWM, MALA and Barker in Figure 6. Here all algorithms are run for 5×10⁴ iterations, with the target defined in the first scenario. We use the adaptation strategy of Section 6.2 for tuning, following Equations (24)-(26) with $\kappa=0.6$ and $\Sigma_t$ constrained to be diagonal, and initialize the chains from a random configuration sampled from the prior distribution of the model. In this example, the random walk converges to stationarity in roughly 10,000 iterations, while the Barker scheme takes a few hundred. By contrast, MALA struggles to converge and exhibits unstable behaviour even after 5×10⁴ iterations. Note that the first 3×10⁴ iterations of MALA, in which the parameter $\mu$ appears to be constant, do not correspond to rejections but rather to moves with very small increments in the $\mu$ component.
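Since all of the experiments above use the Barker scheme with a diagonal pre-conditioner, a minimal sketch of one such iteration may help readers reproducing the comparison. This is our own schematic rendering of the proposal and acceptance rule (consistent with the construction described earlier in the paper), not the authors' reference code; `lp` and `glp` denote log-posterior and gradient functions, such as those sketched above.

```r
# One iteration of the pre-conditioned Barker scheme (schematic).
# x: current state; sigma: global scale sigma_t;
# scales: square roots of the diagonal of Sigma_t.
barker_step <- function(x, lp, glp, sigma, scales) {
  g  <- glp(x) * scales                       # gradient in rescaled space
  z  <- sigma * rnorm(length(x))              # symmetric base increment
  p  <- 1 / (1 + exp(-z * g))                 # per-coordinate flip probability
  b  <- 2 * (runif(length(x)) < p) - 1        # b_i = +1 w.p. p_i, else -1
  y  <- x + scales * (b * z)                  # proposed point
  gy <- glp(y) * scales
  # Log acceptance ratio: target ratio times ratio of skewing terms.
  # (A production version would use a numerically stable softplus.)
  w     <- b * z
  log_a <- lp(y) - lp(x) +
    sum(log1p(exp(-w * g)) - log1p(exp(w * gy)))
  if (log(runif(1)) < log_a) y else x
}
```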

FIGURE 6 Behaviour of random walk Metropolis, Metropolis-adjusted Langevin algorithm and Barker on the posterior distribution from the Poisson hierarchical model in Equation (28). Data are generated as in the first scenario of Section 6.3. First row: adaptation of the global scale $\sigma_t$; second row: adaptation of the local scales $\mathrm{diag}(\Sigma_t)=(\Sigma_{t,ii})_i$; third row: trace plot of the parameter $\mu$; fourth row: trace plot of the parameter $\eta_1$.

We then provide a more systematic comparison between the algorithms under consideration in Table 2. In addition to RWM, MALA and Barker, we also consider a state-of-the-art implementation of adaptive HMC, namely the Stan (Stan Development Team, 2020) implementations of the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014) and of standard HMC (referred to as 'static HMC' in the Stan package). The NUTS algorithm is a variant of standard HMC in which the number of leapfrog iterations, that is, the parameter $L$ in Equation (10), is allowed to depend on the current state (through a 'No-U-Turn' criterion). The resulting number of leapfrog steps (and thus log-posterior gradient evaluations) per iteration is not fixed in advance but rather tuned adaptively depending on the difficulty of the problem. The same holds for the static HMC implementation in Stan, where the total integration time in Equation (10) is fixed while the step-size and mass matrix are adapted. For both algorithms, we use the default Stan configuration, which learns a diagonal covariance/mass matrix during the adaptation process. This is analogous to constraining the preconditioning matrix $\Sigma_t$ for RWM, MALA and Barker to be diagonal, as we do here.
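As an illustration of the Stan side of the comparison, the following hedged RStan sketch fits model (28) under scenario 1 with both samplers and a diagonal metric. The Stan program is our own transcription of the model; in particular, the prior scale on $\mu$ is an assumption, and the data-generation step simply follows the scenario 1 description above.

```r
library(rstan)

# Simulate scenario 1 data: sigma_eta = 1, mu* = 5, eta_i from the prior.
set.seed(1)
I <- 50; n <- 5; sigma_eta <- 1; mu_star <- 5
eta_star <- rnorm(I, mu_star, sigma_eta)
y <- matrix(rpois(I * n, exp(rep(eta_star, each = n))), I, n, byrow = TRUE)
stan_data <- list(I = I, n = n, y = y, sigma_eta = sigma_eta)

stan_code <- "
data {
  int<lower=1> I;
  int<lower=1> n;
  int<lower=0> y[I, n];
  real<lower=0> sigma_eta;
}
parameters {
  real mu;
  vector[I] eta;
}
model {
  mu ~ normal(0, 10);            // hyperprior scale is an assumption
  eta ~ normal(mu, sigma_eta);
  for (i in 1:I)
    y[i] ~ poisson_log(eta[i]);
}
"

# NUTS with a diagonal metric (the Stan defaults), 2000 iterations.
fit_nuts <- stan(model_code = stan_code, data = stan_data,
                 iter = 2000, chains = 1,
                 control = list(metric = "diag_e"))

# 'Static' HMC: fixed total integration time, adapted step-size and metric.
fit_hmc <- stan(model_code = stan_code, data = stan_data,
                iter = 2000, chains = 1, algorithm = "HMC",
                control = list(metric = "diag_e", int_time = 1))
```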

TABLE 2 Comparison of sampling schemes on the posterior distribution arising from the Poisson hierarchical model in Equation (28)

Scenario  Method   Iterations (n)   Leapfrog steps/n   Gradient calls (g)   ESS (min, median)   ESS/g × 100
1         RWM      5×10⁴            —                  —                    (49, 66)            —
          MALA     5×10⁴            —                  5×10⁴                (648, 727)          1.30 ± 2.73
          Barker   5×10⁴            —                  5×10⁴                (1445, 1587)        2.89 ± 0.07
          HMC      2×10³            89.5               1.8×10⁵              (285, 1954)         0.25 ± 0.78
          NUTS     2×10³            8.5                1.7×10⁴              (1175, 1822)        6.95 ± 1.68
2         RWM      5×10⁴            —                  —                    (0.4, 10.6)         —
          MALA     5×10⁴            —                  5×10⁴                (0.0, 8.0)          < 0.01
          Barker   5×10⁴            —                  5×10⁴                (1365, 1563)        2.73 ± 0.13
          HMC      2×10³            797                1.6×10⁶              (25, 1949)          < 0.01
          NUTS     2×10³            57.7               1.2×10⁵              (942, 1826)         1.19 ± 1.14
3         RWM      5×10⁴            —                  —                    (0.0, 5.3)          —
          MALA     5×10⁴            —                  5×10⁴                (0.0, 0.2)          < 0.01
          Barker   5×10⁴            —                  5×10⁴                (1301, 1594)        2.60 ± 0.92
          HMC      2×10³            8103               1.6×10⁷              (3.3, 899)          < 0.01
          NUTS     2×10³            179                3.5×10⁵              (137, 348)          0.012 ± 0.14

Blocks of rows from 1 to 3 refer to the three data-generating scenarios described in Section 6.3. All numbers are averaged across ten repetitions of each algorithm. For each algorithm we report: number of iterations; number of leapfrog steps per iteration and total number of gradient evaluations (when applicable); estimated effective sample size (ESS) (minimum and median across parameters); and minimum ESS per hundred gradient evaluations (with standard deviation across the ten repetitions).

Table 2 reports the results of the simulations for the five algorithms in each of the three scenarios. For each algorithm, we report the number of log-posterior gradient evaluations and the minimum and median effective sample size (ESS) across the 51 unknown parameters. The ESS values are computed with the effectiveSize function from the coda R package (Plummer et al., 2006), discarding the first half of the samples as burn-in. The RWM, MALA and Barker schemes are run for 5×10⁴ iterations, and the HMC and NUTS schemes for 2×10³ iterations. The latter is the default value in the Stan package and corresponds, in this example, to between 1.7×10⁴ and 1.6×10⁷ gradient evaluations. All numbers in Table 2 are averaged over ten independent replications of each algorithm. We use the minimum ESS per gradient evaluation as an efficiency metric, reporting its mean and standard deviation across the ten replicates (multiplied by 100 for readability).
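The ESS-based summaries in Table 2 can be reproduced along the following lines; this is a sketch in which `samples` and `n_grad` are our own placeholder names for one run's output and its gradient-evaluation count.

```r
library(coda)

# samples: t x d matrix of MCMC output for one run;
# n_grad:  total number of log-posterior gradient evaluations for that run.
ess_summary <- function(samples, n_grad) {
  keep <- (nrow(samples) %/% 2 + 1):nrow(samples)  # discard first half
  ess  <- effectiveSize(as.mcmc(samples[keep, ]))  # ESS per parameter
  c(ess_min    = min(ess),
    ess_median = median(ess),
    eff        = 100 * min(ess) / n_grad)          # min ESS per 100 gradients
}
```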

According to Table 2, NUTS is the most efficient scheme in scenario 1, while Barker is the most efficient in scenarios 2 and 3. This is in accordance with the intuition that Barker is the more robust scheme, as the target distribution becomes more challenging moving from scenarios 1 to 3. MALA struggles to converge to stationarity in scenarios 2 and 3 (with an estimated ESS around zero), while it performs better in scenario 1, albeit with high variability across runs (shown by the large standard deviation in the last column). The RWM displays low ESS values in all three scenarios, although with a less dramatic deterioration from scenarios 1 to 3. Interestingly, the performance of Barker is remarkably stable across scenarios (with a minimum ESS of around 1400), across the parameters for which the ESS is computed (in all cases the minimum and median ESS are close to each other) and across repetitions (shown by the relatively small standard deviation in the last column). NUTS is also remarkably effective, especially considering that it was not designed with a major emphasis on robustness, but its performance does degrade from scenarios 1 to 3. As in the MALA case, static HMC struggles to converge in scenarios 2 and 3 and is not very efficient in scenario 1. Note that NUTS, and in particular HMC, compensate for the increasing difficulty of the target by increasing the number of leapfrog steps per iteration; for example, the drop in efficiency of NUTS between scenarios 1 and 2 is mostly due to the increase in the average number of leapfrog iterations from 8.5 to 57.7 rather than to a decrease in ESS. Somewhat surprisingly, static HMC increases the number of leapfrog steps per iteration significantly more than NUTS does, which could be due either to genuine algorithmic differences or to variations in the details of the adaptation strategy implemented in Stan. Overall, Barker and NUTS are the two most efficient algorithms in these simulations, with a relative efficiency that depends on the scenario: NUTS is roughly 2.4 times more efficient in scenario 1, while Barker is 2.3 times more efficient in scenario 2 and 40 times more efficient in scenario 3.

6.4 Additional simulations reported in the supplement

In the supplement, we report additional simulations for some of the above experiments. As a sensitivity check, we also performed simulations using the tamed Metropolis-adjusted Langevin algorithm (Brosse et al., 2018) and the truncated Metropolis-adjusted Langevin algorithm (Atchade, 2006; Roberts & Tweedie, 1996), two more robust modifications of MALA in which large gradients are controlled by monitoring the size of $\|\nabla\log\pi(x)\|$. These schemes do offer some added stability over MALA in terms of controlling large gradients, but they remain very sensitive to heterogeneity of the target distribution and to the choice of truncation level, and do not exhibit the robustness observed for the Barker scheme. See the supplement for implementation details, results and further discussion.
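For reference, the gradient-control mechanism in these MALA variants amounts to capping the drift. Below is a minimal sketch of a truncated-drift proposal in the spirit of Atchade (2006); the truncation level `delta` is our notation, and the full algorithm of course also requires a Metropolis-Hastings accept-reject step using the Gaussian proposal density with this truncated drift in both directions.

```r
# Truncated MALA proposal: the drift uses the gradient rescaled so that
# its norm never exceeds the truncation level delta (cf. Atchade, 2006).
tmala_propose <- function(x, glp, sigma, delta) {
  g       <- glp(x)
  g_trunc <- g * min(1, delta / sqrt(sum(g^2)))   # cap ||grad log pi||
  x + (sigma^2 / 2) * g_trunc + sigma * rnorm(length(x))
}
```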

7 Discussion

We have introduced a new gradient-based MCMC method, the Barker proposal, and have demonstrated both analytically and numerically that it shares the favourable scaling properties of other gradient-based approaches, along with an increased level of robustness, both in terms of geometric ergodicity and robustness to tuning (as defined in the present paper). The most striking benefit of the method appears to be in the context of adaptive MCMC. Evidence suggests that combining the efficiency of a gradient-based proposal mechanism with a method that exhibits robustness to tuning gives a combination of stability and speed that is very desirable in this setting, and can lead to efficient sampling that requires minimal practitioner input.

The theoretical results in this paper could be extended by studying in greater depth the large-$\lambda$ regime (Section 2.4) and the high-dimensional scaling of the Barker proposal (Section 4.3). Of course, many other algorithms could be considered under the robustness-to-tuning framework, and it would be worthwhile future work to explore which features of a scheme result in robustness to tuning or a lack of it. Extensions of the Barker proposal that incorporate momentum and exhibit the $d^{-1/4}$ decay in efficiency with dimension enjoyed by HMC may be possible, as may the development of other methods within the first-order locally balanced proposal framework introduced in Section 3, or indeed schemes that are exact at higher orders.

Supporting Information

Additional supporting information may be found online in the Supporting Information section.

Funding information

Engineering and Physical Sciences Research Council, EP/K014463/1; European Research Council, 306406 (N-BNP)

Acknowledgements

The authors thank Marco Frangi for preliminary work in his Master's thesis related to the ergodicity arguments in the paper. SL acknowledges support from the Engineering and Physical Sciences Research Council through grant number EP/K014463/1 (i-like). GZ acknowledges support from the European Research Council starting grant 306406 (N-BNP) and from the Italian Ministry of Education, Universities and Research PRIN Project 2015SNS29B. Open Access Funding provided by Università Bocconi within the CRUI-CARE Agreement. [Correction added on 11 June 2022, after first online publication: CRUI funding statement has been added.]

References

1. Andrieu, C. & Thoms, J. (2008) A tutorial on adaptive MCMC. Statistics and Computing, 18, 343–373.
2. Atchade, Y.F. (2006) An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodology and Computing in Applied Probability, 8, 235–254.
3. Azzalini, A. (1985) A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171–178.
4. Azzalini, A. (2013) The skew-normal and related families. Institute of Mathematical Statistics Monographs. Cambridge: Cambridge University Press.
5. Barker, A.A. (1965) Monte Carlo calculations of the radial distribution functions for a proton-electron plasma. Australian Journal of Physics, 18, 119–134.
6. Beskos, A., Pillai, N., Roberts, G., Sanz-Serna, J.-M. & Stuart, A. (2013) Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19, 1501–1534.
7. Beskos, A., Roberts, G., Thiery, A. & Pillai, N. (2018) Asymptotic analysis of the random walk Metropolis algorithm on ridged densities. The Annals of Applied Probability, 28, 2966–3001.
8. Brooks, S., Gelman, A., Jones, G. & Meng, X.-L. (2011) Handbook of Markov chain Monte Carlo. Boca Raton: CRC Press.
9. Brosse, N., Durmus, A., Moulines, É. & Sabanis, S. (2018) The tamed unadjusted Langevin algorithm. Stochastic Processes and their Applications, 129(10), 3638–3663.
10. Dalalyan, A.S. (2017) Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 651–676.
11. Duane, S., Kennedy, A.D., Pendleton, B.J. & Roweth, D. (1987) Hybrid Monte Carlo. Physics Letters B, 195, 216–222.
12. Durmus, A. & Moulines, E. (2017) Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27, 1551–1587.
13. Durmus, A., Moulines, E. & Saksman, E. (2017a) On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166.
14. Durmus, A., Roberts, G.O., Vilmart, G. & Zygalakis, K.C. (2017b) Fast Langevin based algorithm for MCMC in high dimensions. The Annals of Applied Probability, 27, 2195–2237.
15. Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109.
16. Hoffman, M.D. & Gelman, A. (2014) The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623.
17. Jarner, S.F. & Hansen, E. (2000) Geometric ergodicity of Metropolis algorithms. Stochastic Processes and their Applications, 85, 341–361.
18. Krauth, W. (2006) Statistical mechanics: algorithms and computations, vol. 13. Oxford: OUP.
19. Livingstone, S., Betancourt, M., Byrne, S. & Girolami, M. (2019) On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25, 3109–3138.
20. Mattingly, J.C., Pillai, N.S. & Stuart, A.M. (2012) Diffusion limits of the random walk Metropolis algorithm in high dimensions. The Annals of Applied Probability, 22, 881–930.
21. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. & Teller, E. (1953) Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087–1092.
22. Neal, R.M. (2003) Slice sampling. The Annals of Statistics, 31, 705–767.
23. Neal, R.M. (2011) MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2.
24. Peskun, P.H. (1973) Optimum Monte-Carlo sampling using Markov chains. Biometrika, 60, 607–612.
25. Plummer, M., Best, N., Cowles, K. & Vines, K. (2006) Coda: convergence diagnosis and output analysis for MCMC. R News, 6, 7–11. Available from: https://journal.r-project.org/archive/.
26. Power, S. & Goldman, J.V. (2019) Accelerated sampling on discrete spaces with non-reversible Markov processes. arXiv preprint arXiv:1912.04681.
27. Roberts, G.O. & Rosenthal, J.S. (1998) Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60, 255–268.
28. Roberts, G.O. & Rosenthal, J.S. (2001) Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16, 351–367.
29. Roberts, G.O. & Rosenthal, J.S. (2004) General state space Markov chains and MCMC algorithms. Probability Surveys, 1, 20–71.
30. Roberts, G.O. & Rosenthal, J.S. (2016) Complexity bounds for Markov chain Monte Carlo algorithms via diffusion limits. Journal of Applied Probability, 53, 410–420.
31. Roberts, G.O. & Tweedie, R.L. (1996) Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2, 341–363.
32. Roberts, G.O., Gelman, A. & Gilks, W.R. (1997) Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7, 110–120.
33. Rosenthal, J.S. (2003) Asymptotic variance and convergence rates of nearly-periodic Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 98, 169–177.
34. Shaby, B. & Wells, M.T. (2010) Exploring an adaptive Metropolis algorithm. Technical report.
35. Stan Development Team (2020) RStan: the R interface to Stan. R package version 2.19.3. Available from: http://mc-stan.org/.
36. Stuart, A.M. (2010) Inverse problems: a Bayesian perspective. Acta Numerica, 19, 451–559.
37. Tierney, L. (1998) A note on Metropolis-Hastings kernels for general state spaces. The Annals of Applied Probability, 8, 1–9.
38. Woodard, D., Schmidler, S. & Huber, M. (2009) Sufficient conditions for torpid mixing of parallel and simulated tempering. Electronic Journal of Probability, 14, 780–804.
39. Zanella, G. (2020) Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, 115, 852–865.
This is an open access article under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits use, distribution and reproduction in any medium, provided the original work is properly cited.