A Discrete Bouncy Particle Sampler

Markov Chain Monte Carlo (MCMC) algorithms are statistical methods designed to sample from a given probability density $\pi$. Most MCMC methods rely on discrete-time Metropolis-Hastings Markov chains that are reversible with respect to the probability density $\pi$. Nevertheless, it is now understood that the use of non-reversible Markov chains can be beneficial in many contexts. In particular, the recently-proposed Bouncy Particle Sampler (BPS) leverages a continuous-time and non-reversible Markov process to compute expectations with respect to $\pi$. Although the BPS empirically shows state-of-the-art performances when used to explore certain probability densities, in many situations it is not straightforward to use; indeed, implementing the BPS typically requires one to be able to compute local upper bounds on the target density. This, for example, rules out the use of the BPS for the wide class of problems when only evaluations of the log-density and its gradient are available. In this article, we propose the Discrete Bouncy Particle Sampler (DBPS), a general algorithm based upon a guided random walk and the Delayed-Rejection approach. In particular, we show that the BPS can be understood as a scaling limit of a special case of the DBPS. In contrast to the BPS, implementing the DBPS only requires point-wise evaluation of the target density and its gradient. Importantly, we also propose extensions of the basic DBPS for situations when exact gradient of the target densities are not available.


Introduction
In several contexts, non-reversible Markov Chaine Monte Carlo (MCMC) samplers are known to enjoy desirable mixing properties; the Hamiltonian Monte Carlo (HMC) algorithm Duane et al. (1987) is perhaps one of the most successful and widely-used examples. Indeed, several theoretical results have been obtained regarding the advantages of non-reversible samplers. For example, Diaconis et al. (2000) obtains rates of convergence for a non-reversible version of the random walk algorithm; subsequently and inspired by Diaconis et al. (2000), Chen et al. (1999) describes the best acceleration achievable through the idea of "lifting". On a different note, the articles Hwang et al. (2015); Lelièvre et al. (2013); Rey-Bellet and Spiliopoulos (2015); Duncan et al. (2016) investigate and quantify the advantages offered by leveraging (a discretisation of) a non-reversible diffusion process for computing Monte-Carlo averages; in many settings, it can be proved that the standard reversible Langevin dynamics is the worst in terms of asymptotic variances among a large class of diffusion processes that are ergodic with respect to a given target distribution. More recently, different ideas were proposed to design non-reversible MCMC sampler. The Zig-Zag sampler, first obtained as a scaling limit of a lifted Metropolis-Hastings Markov chain Bierkens and Roberts (2017), is a continuous-time non-reversible Markov process that can be used for computing ergodic averages. The subsequent article Bierkens et al. (2016) describes how the Zig-Zag sampler can be efficiently used for exploring Bayesian posterior distributions in the Big-Data regime. Inspired from the physics literature Peters and de With (2012), the Bouncy Particle Sampler (BPS) Bouchard-Côté et al. (2017) is another continuous-time non-reversible Monte-Carlo sampler that demonstrates state-of-the-art performance when used to explore certain Bayesian posterior distributions. Implementing the BPS, though, requires more than simple point-wise evaluation of the log-target density and its gradient -indeed, one typically needs to be able to obtain local upper bounds on derivatives of the log-target density in order to implement the BPS. Unfortunately, those bounds are unavailable or difficult to compute in many applied situations, which means that the BPS cannot be directly used in these settings. The aim of this paper is to design a discrete-time MCMC sampler, inspired by the BPS, that can be implemented when only point-wise evaluations of the target-density and its gradient are available.
Our algorithm, the Discrete Bouncy Particle Sampler (DBPS), is based upon the guided random walk of Gustafson (1998). In its simplest (indeed, reducible) form the guided random walk preserves an extended posterior of (1/2) × π(x) × [δ −u 0 (u) + δ u 0 (u)], where u can be thought of as a velocity. For some δ > 0 and from a current position of (x, u) a move to (x + δu, −u) is proposed, and accepted with a probability of 1 ∧ [π(x + δu)/π(x)], so that the move is reversible with respect to the extended posterior. Whatever the result, a flip move from (x, u) to (x, −u) always follows. The net result of the two reversible moves is a non-reversible Markov chain which heads in a specific direction until a rejection occurs, at which point it reverses direction. In the DBPS, however, when the proposal (x + δu, −u) is rejected, the gradient at the rejected point is calculated and a reflection of (x, u) in the tangent hyperplane perpendicular to the gradient is proposed, as in Figure 1. The acceptance probability for this proposal is set so as to preserve detailed balance with respect to an extended posterior that includes π is its x-marginal, as in the delayed-rejection algorithm of Tierney and Mira (1999). As in Gustafson (1998) an iteration of the basic DBPS ends with a velocity flip. An alternative formulation of a discrete BPS, based on the reflective slice sampler in Neal (2003) is described and extended in the independent work of Vanetti et al. (2017).
As with the BPS, the basic DBPS can be reducible, and so we offer two enhancements, either of which overcomes the problem: a random perturbation of the bounce direction at each delayedrejection step, which is a more focussed version of the BPS idea in Wu and Robert (2017), and a perturbation of the velocity vector at the end of each iteration; simulations suggest a preference for the latter. We demonstrate the advantage of pre-conditioning as in Pakman et al. (2017), and show that a surrogate may be substituted for the true gradient when the truth is computationally expensive to obtain. When numerical differentiation must be used to calculate the gradient in a d-dimensional space, the cost is O(d); our final contribution is a construction that allows the user to choose to only calculate any n ∈ {1, 2, . . . , d} orthogonal components of the gradient vector.
We initially demonstrate the performance and tuning of the algorithm on a toy target. This target is light-tailed, and our simulations suggest that, unlike the BPS, the DBPS is geometrically ergodic in such situations. We then apply the algorithm to simulated data from a Markov modulated Poisson process. q q q q q q Figure 1: An illustration of the DBPS on a two-dimensional target. The black arrows indicate the direction, u, and at the solid black points a proposal (x , u ) = (x + δu, −u) has been accepted, and then the velocity flip has been applied. At the red point a proposal of the same form has been rejected; the solid black line is the contour of the target through this point and the dashed line is the tangent hyperplane. The open blue circle and the blue arrow indicate the proposed new point and new velocity for the delayed-rejection step. A velocity flip will be applied whether or not this proposal is accepted.

The basic DBPS
Consider a probability distribution π on the standard Euclidean space E ≡ (R d , ·, · ) with associated norm · . In this section we describe the basic DBPS in terms of a general, non-vanishing vector field V : E → E\{0}. We show that it preserves the intended target distribution, and that when V(x) = ∇ log π(x), with an appropriate modification if ∇ log π(x) = 0, its limit is the BPS.

The algorithm
Let ρ(u) denote the uniform distribution on the unit d-dimensional sphere S ⊂ E. For any non-zero vector v ∈ E\{0}, its normalised version is v := v/ v ; the reflection operator u → R(u, v) maps a vector in the direction of the black arrows in Figure 1 to a vector of the same magnitude but in the direction of the blue arrow, The operator R is an involution: for any vector u ∈ E and non-zero vector v ∈ E \ {0} we have that R(R(u, v), v) = u. The DBPS algorithm operates on the extended state space E × S and explores the extended target distribution The definition of ρ will be generalised in Section 3.2. Henceforth, we will refer to the variable x as the position of a particle and the variable u as its direction. As outlined in Section 1, the basic DBPS algorithm alternates between two (reversible) Markovian transitions: a two-stage delayed-rejection kernel and a velocity flip. For a time-step parameter δ > 0, from a current state of (x, u) ∈ E × S it proceeds as follows: 1. Delayed Rejection: accept the proposal: (x, u) ← (x , u ) and go to Step 2.
Since ρ is symmetric, π is invariant to Step 2. For proving that Step 1 is an instance of a Delayed Rejection that preserves π, it suffices to show that the deterministic proposal (x, u) → (x , u ) and (x, u) → (x , u ) are time-reversible and preserve the Lebesgue measure. Time reversibility follows from straightforward algebraic manipulations. Finally we break down the transformation Since each step changes either position or velocity but not both, the full transformation preserves volume with respect to Lebesgue measure in E × S.

Continuous-time limit
When V = ∇ log π and the contours of the density π are exactly spherical, by symmetry, π(x ) ≡ π(x) and the DBPS is rejection free. More generally, we prove in this section that, as δ → 0 and for the choice of vector field V = ∇ log π, the trajectory of a DBPS accelerated by a factor 1/δ converge towards that of the BPS. For simplicity of exposition, we only consider the case where the log-density log π is globally Lipshitz. The proposal (x, u) → (x+δ u, −u) is rejected with log-probability max (0, −[log π(x + δ u) − log π(x)]); since log π(x + δ u) − log π(x) = δ ∇ log π(x), u + O(δ 2 ), it readily follows that in the limit δ → 0 and if time is accelerated by a factor 1/δ, the delayed-rejection steps occur as Poisson events with intensity λ(x, u) := max (0, − ∇ log π(x), u ).

Reducibility and poor mixing
A simple thought experiment, such as considering a target with spherically-symmetric contours, any initial u ∈ S and an initial position at the origin shows that, as with the BPS (e.g. Bouchard-Côté et al., 2017), the DBPS can be reducible. In practice, on more general targets, we have found that the velocity can mix slowly across S, and this can impact the mixing of x, as illustrated in Figure 2 which shows trace plots for three components of x and for log π(x). To understand the behaviour, consider the orientation of the velocity, the set O(u) := {u, −u}. If O(u) were to be fixed, as in the above thought experiment, the resulting x chain would be reducible. In high dimensions, for a random unit vector u and the vector V, u, V is typically O(1/ √ d), and hence small; thus O(u), which is not altered at all when a delayed-rejection step is rejected, is only altered slightly even when a delayed-rejection step is accepted and a reflection occurs.
Over the 25 components in the above example simulation, all effective sample sizes (calculated using the R package coda) were in excess of 20000, twice the number of samples, which provides the misleading impression of exceptionally good mixing. We have found that a more reliable diagnostic for the type of problem exhibited in Figure 2 is the effective sample size (ESS) of the function log π(x), which for the above simulation is 242. In our simulations in Section 5, therefore, we also report this quantity.

Modifications of the basic DBPS
In this section we describe two modifications to the algorithm which improve the mixing of u. We also describe the modifications required to pre-condition the algorithm to account for an a priori known shape. Finally, we describe a modification which permits the use of only a subset of the components of the vector field V.

Perturbing the velocity, u
As discussed and demonstrated in Section 2.3 the basic DBPS can lead to a reducible Markov chain, and even when it does not, the mixing across velocities, u ∈ S, can be slow. Increasing δ increases the number of delayed-rejection steps and, potentially, leads to better mixing of the orientation, but increasing δ too far leads to a high fraction of rejections at delayed-rejection steps, reducing the mixing of the orientation. Moreover, the primary purpose of the tuning parameter δ is to allow a trade off between the speed at which the posterior is crossed and the probability that a delayed-rejection step should be accepted, rather than to adjust the mixing of the orientation. We, therefore, present two possible solutions to this problem, each of which introduces another tuning parameter.

Perturbed bounces
The involution u → u = R(u, V) is not the only map that preserves the magnitude of a vector and the component in the direction of V; we now present an alternative, a perturbation of u . Our construction requires a quantity ε ∈ [0, 1], the degree of perturbation, the component of u orthogonal to V: u ⊥ := u − u , V V, and a vector ζ with ζ = 1, ζ, u ⊥ = 0 and ζ, V = 0. We then set To simulate ζ we simulate Z ∼ N (0, I d ) and then set: The central panel shows the two available points for a particular (small) value of ε, and the final panel shows the point that is chosen (according to ζ) and the associated vector, u * . The move (u, ζ, ε) → (u ♦ , ε, −ζ) is an involution, and since −ζ and ζ are equally likely, the proposal is reversible.

Per-iteration direction perturbation
A Brownian motion on the unit sphere of (R d , ·, · ) can be described as the solution of the Stochastic Differential Equation (SDE) where P ⊥ (u) ∈ R d,d is the matrix representing the orthogonal projection on the plane orthogonal to u ∈ R d and κ is a positive constant. A possible discretization of such a Brownian motion between time t = nδ and t + δ = (n + 1)δ is u n+1 = (u n + κ 1/2 δ 1/2 ζ)/(1 + κδ) 1/2 , where ζ is a random unit vector orthogonal to u n ; the random variable ζ has the same law as (Z − Z, u u)/ Z − Z, u u for a standard Gaussian random variable Z in R d . Clearly, if u 0 is on the unit sphere then so is u n for any n ≥ 1. A slight extension of Donsker's theorem further shows that a time-rescaled (i.e. accelerated by a factor 1/δ) and suitably interpolated version of the process {u n } n≥0 converges, as δ → 0, to a Brownian motion on the unit sphere of R d . We therefore propose a third step in the DBPS algorithm, which also preserves π: 3. Simulate ζ uniformly at random from the surface of the unit d − 1-dimensional hypersphere perpendicular to u and set: The parameter κ dictates the speed of mixing of the limiting Brownian motion on the unit sphere; thus, when δ is small (so the total time between delayed-rejection steps is approximately that of the limiting BPS), a given value of κ is associated with a given amount of direction refreshment between delayed-rejection steps (potential bounces), whatever the exact value of δ. Typically, we find κ needs to be chosen so that the mixing of O(u) is dominated by this additional refreshment step. In terms of their effects on the behaviour of the system, therefore, δ and κ are approximately orthogonal. Equivalently, since u 0 , u 1 = (1 + κδ) −1/2 , but with a step of δ/2, u 0 , u 2 ≈ (1 + κδ/2) −1 , δ and κ are close to orthogonal when κ δ 1; this is illustrated in the simulation study in Section 5.1.

Preconditioning
A general target may have very different length scales in different directions and, just as with Metropolis-Hastings algorithms such as the random walk Metropolis or the MALA (e.g. Roberts and Rosenthal, 2001), the efficiency can be improved, often by several orders of magnitude, by preconditioning. Consider an invertible linear transformation, M , and let x * = M x and u * = M u. The transformation induces a density π * (x * , u * ) ∝ π(x(x * ), u(u * )), and running the DBPS in the transformed space according to the density π * , is equivalent to running a more general version of the DBPS in the original space, still according to π but with u lying on the surface of an ellipsoid.
Define Γ = M M , v Γ = Γv and the involution Furthermore, since u * is now the unit vector, the velocity should be initialised by sampling a unit vector u * uniformly at random and then setting u = M −1 u * , and the per-iteration velocity perturbation of Section 3.1.2 becomes: 3. * Simulate ζ * uniformly at random from the surface of the unit d − 1-dimensional hypersphere perpendicular to u * = M u and set: Similarly the bounce perturbation of Section 3.1.1, if it is being used, should now be applied in the transformed space. Typically, we would aim to match Γ −1 to the variance of the target, or inverse Hessian at the mode, so that the transformed target has a variance or inverse Hessian approximating the identity matrix and the standard DBPS might be expected to perform much better in the transformed space.

Using a subset of the components of the gradient
Automatic differentiation (at least in the ForwardDiff and ReverseDiff Julia packages and in the AutogradMaclaurin et al. (2015) Python package ) does not work on complex functions or on functions which call code in other languages. Examples are matrix exponentials and some numerical ODE solvers. We must, therefore, resort to numerical differentiation to propose a reflection in the gradient. However numerical differentiation of the original function requires d (or 2d for centred differences) extra evaluations of the likelihood function, where d = dim(X ). Fortunately, we can choose to use fewer evaluations and still gain some of the benefits of the DBPS.
In this section we are directly concerned with g(x) = ∇ log π(x) and so refer to this, rather than the more general V(x). Recall that numerical differentiation along a direction given by the unit vector ζ evaluates ζ, g , where g = ∇ log π(x).
For any unit vector ζ the following transformation (u, ζ) → (v , ζ) is (i) reversible: (v , ζ) → (u, ζ), (ii) preserves magnitude: v = u = 1, and (iii) preserves the component parallel with the gradient: v , g = u, g . Set The form is obtained by finding the unique solution to the equation au + bv = ζ subject to constraints (ii) and (iii). Reversibility follows because, if supplied with v and ζ, solving this equation would give the same a and b. To derive the solution, set r = ζ, g / u, g and use (6), so that (ii) and a u, g + b v , g = ζ, g become b 2 = 1 + a 2 − 2a ζ, u and a + b = r. This gives that r 2 − 2ra = 1 − 2a ζ, u so that a = (r 2 − 1)/(2r − 2 ζ, u ); substituting for r gives the form provided. A simple implementation would sample a unit vector ζ uniformly at random at each iteration and then replace step 1(c) with: (c ζ ) Propose u = R(u, g; ζ) according to (6), and x := x + u − u .
The forms for a and b can equally be written replacing g with g, and, typically, u, g , u, ζ and ζ, g are all O(1/ √ d), so the move is mostly a random refresh of direction. However, suppose that ζ = − g, so that ζ, g = −1 and a = b = −1/[2 u, g ]. This gives that v = −u+2 u, g g = R(u, g). By contrast, if ζ = g then v = −R(u, g), the opposite of that desired. This suggests, at the very least, maximising the component of ζ in the direction −g; i.e. ζ ← sign( ζ, −g ) ζ; however, if one is willing to evaluate several components of the gradient, in the orthogonal directions ζ 1 , . . . , ζ ncpt for some n cpt ≤ d then we can choose the unit linear combination that has the maximal component in the direction −g, , so that ζ, g = − ζ 1 , g + . . . + ζ n , g 2 1/2 . With n cpt = d this will give ζ = − g. Since we must calculate u, g in (6), to save on computation in practice we set ζ 1 = u and always choose n cpt ≥ 2. Each component of the gradient that is evaluated takes an additional one or two evaluations of the log-posterior. However every single standard step x ← x + u also requires an evaluation of the log-posterior. So, if we take n cpt proportional to the typical number of steps between one delayedrejection step and the next then it will only increase the total CPU time by a constant factor. With multiple cores, one may also evaluate the components of the derivatives in parallel. Finally, the simulations of Section 5 suggest that optimal efficiency is achieved with n cpt << d.
4 Algorithm tuning 4.1 Choice of δ One possible heuristic for maximising true mixing as a function of δ is to maximise the fraction of successful delayed-rejection steps, or bounces, f b , over some fixed number, n, of iterations. We provide a heuristic as to why this is sensible, and then investigate the consequences.
Define a segment to be the set of points from one delayed-rejection step to the next, as depicted in Figure 4. If the previous delayed-rejection step was successful then the current segment will be 'new', otherwise it will be an approximate retracing of the previous segment. It is approximate because κ > 0 and because the next delayed-rejection step is unlikely to occur exactly at the start of the previous segment.
Provided δ is small, given the position, x 0 , and velocity at the start of a segment there is an approximate coupling between the limiting process and the actual, discrete process. Specifically, for the discrete process, let x 0 , x 1 , . . . be the (potential) positions after 0, 1, . . . iterations, were every iteration to be accepted without requiring a delayed-rejection step (so for the basic DBPS, x t = x 0 + tδu 0 ). Then, given a realisation η of a Unif(0, 1) random variable, the first value rejected in Stage 1(b) of the DBPS is x τ , where τ can be reformulated as where the approximation becomes more accurate as δ ↓ 0. Since − log η is a realisation of an Exp(1) random variable, as δ ↓ 0 the stopping condition approaches the bouncing condition of the BPS. More generally, for any given segment the actual next delayed-rejection step will occur close to the next reflection point of the BPS. So we consider there to be an (approximately) fixed amount of new information in any given path, whatever the small value of δ.
After a reverse (i.e. a rejected delayed-rejection step), because κ > 0 and because, even if κ were to be 0, the next delayed-rejection step would be unlikely to occur at exactly the start of the previous segment, the main penalty to efficiency due to the reversal is that the current segment is very highly correlated with the previous segment and so the current segment provides little extra information; the next segment is unlikely to be similar to the segment prior to the previous segment and so the penalty due this potential problem is relatively small.
Combining the above ideas we have: (i) each successful delayed-rejection step provides a new segment with an amount of new information that depends on the reflection point, the velocity vector and u, but has a very weak dependence on δ; (ii) the segment following a failed delayed-rejection step adds very little information but does not (much) penalise the information provided by any subsequent segments. It is therefore reasonable to try to maximise the number of new segments; that is, the number of reflections.
To be clear, then, since n is fixed, we expect the mixing as a function of δ to be maximal when f b , the fraction of the n iterations that involve a successful delayed-rejection step (i.e. a bounce), is maximised. As κ increases the penalty for a failed delayed-rejection step is less severe and so a slightly larger δ should be optimal.
In terms of ESS/sec, the computational efficiency, suppose that delayed-rejection steps, which involve evaluating the gradient, are a factor ω more expensive that non-delayed-rejection steps. We, therefore, wish to maximise where f r is the total fraction of the n iterations that involve a failed delayed-rejection step (i.e. leading to an approximate retracing of the previous segment). When

Choice of κ
A natural measure of the impact of κ is the typical change in the direction of u as the particle crosses the target, that is, between consecutive delayed-rejection steps, as illustrated in Figure 4. Let u start be the velocity just after the latest delayed-rejection step and let u end be the velocity just before the next such step. We require a measure of the mean angle between the start and end vectors over all segments of the MCMC run: c RMS := mean( u start , u end 2 ). Whilst E[ u start , u end ] = 0, for a uniformly random d-dimensional unit vector u * , E[ u start , u * 2 ] = 1/d. Thus, if κ were too large, the AR(1) process (3) would be approximately at equilibrium, u end would be an approximately uniform draw on the unit hypersphere so c RMS ≈ 1/ √ d. As we will see in Sections 5, efficiency is typically maximised at a value of c RMS that is closer to 1 than to 1/ √ d; informally, the particle maintains a sense of purpose as it crosses the target.

Simulation studies
To exemplify our tuning guidelines, to examine convergence from the tails in light-tailed targets and to confirm the improvement that preconditioning can bring we first detail a thorough exploration of a toy example, before repeating and briefly describing our findings on a second toy example. We then demonstrate the use of a surrogate and of evaluating a reduced number of gradient components via inference for the parameters of a Markov modulated Poisson process. We use the notation ESS min for the minimum effective sample size over all components and ESS lp for the effective sample size of log π(x). Effective sample sizes were estimated using the effectiveSize function in the coda package in R, and all MCMC code was written in Julia Bezanson et al. (2017).

Toy example
We consider a target of the form: For most of our experiments we set d = 25 and λ i = i, (i = 1, . . . , d) to provide a relatively challenging target; the exceptions are our investigation of convergence from the tails where for q q q q q q q q q q q q q q q q q q q 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 1.0 ε fb and fr q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q fb fr q q q q q q q q q q q q q q q q q q q 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 1.0 ε crms q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q −3 −2 −1 0 1 0.0 0.2 0.4 0.6 0.8 1.0 log10 κ fb and fr q q q q q q q q q q q q q q q q q q q q q q q q q q q fb fr q q q q q q q q q q q q q simplicity we set λ i = 1, and when we consider d = 100, where we try to preserve the degree of anisotropy. Except in our investigation of convergence, where the number of iterations was variable, each experiment was run for 10 6 iterations and thinned (for storage) by a factor of 10; runs were replicated three times, each with a different starting value and a different unit vector for the starting velocity.

Behaviour as δ, ε and κ vary
We now investigate the behaviour of the DBPS as one of δ, the jump size, ε, the bounce perturbation, or κ, the per-iteration velocity pertubration, is varied whilst the others are kept fixed. Setting either ε or κ to a sufficiently large value is shown to mitigate against the mixing problem discussed in Section 2.3, and so we do not consider making both non-zero.
We first fix log 10 δ = 0.5, which is close to optimal for the target (7) with d = 25 and λ i = i, whatever the value of δ or ε. Figure 5 shows f b , f r , c RMS , ESS min and ESS lp as a function of ε with κ = 0 (left) and of κ with ε = 0 (right).
In each case, the bottom right figure shows ESS lp increasing substantially as either ε or κ increases from 0. For large ε a plateau in ESS lp is reached, whereas there is an optimal κ value. By contrast, both bottom left figures show that while ESS min increases initially as ε or κ increases from 0, there is an optimal value around ε = 0.4 or log 10 κ = −1.5. Whilst the choice of ε is absolute, it is helpful to note that the optimal κ occurs at c RMS ≈ 0.8, implying that, for optimality, the AR(1) process for u should keep a strong sense of direction over a crossing of the posterior. Since δ is not changing, both f b and f r remain fixed.
Finally, whilst there is little difference in the optimal ESS min , whether ε > 0 or κ > 0, the optimal ESS lp is much larger when κ > 0. We shall see an even more compelling reason to choose κ > 0 rather than ε > 0 in Section 5.1.2. Figure 6 shows f b , f r , c RMS , ESS min and ESS lp as a function of the scaling, δ, with (ε, κ) set to q q q q q q q q q q q q q −1.5 −0.5 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 1.0 log10 δ fb and fr q q q q q q q q q q q q q q q q q q q q q q q q q q q fb fr q q q q q q q q q q q q q −1.5 −0.5 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 1.0 log10 δ crms q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q fb fr q q q q q q q q q q q q q −1.5 −0.5 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 1.0 log10 δ crms q q q q q q q q q q q q q q q q q q q q q q q q q q q q approximately optimal values obtained from Figure 5: (0.4, 0) (left) and (0, 10 −1.5 ) (right). As δ increases, the probability of rejecting the delayed-rejection bounce proposal increases monotonically. By contrast, the fraction of bounces increases initially as there are more crossings of the posterior, before decreasing as the probability of rejecting the delayed-rejection bounce proposal becomes substantial. Moreover, as suggested by the heuristic in Section 4.1, the maxima in ESS min and in ESS lp occur at approximately the same value of δ as the maximum in f b . Finally, as intended, except for very large δ, the diagnostic, c RMS is unaffected by changes in δ.

Convergence from the tails
To overcome an equivalent reducibility problem to that discussed in Section 2.3 the BPS typically resimulates a fresh velocity at random intervals according to a Poisson process with a fixed rate λ ref resh . In Deligiannidis et al. (2017) it has recently been demonstrated that when the target has tails that are lighter than Gaussian, the BPS is not geometrically ergodic unless the refresh rate increases with the distance from the centre of the distribution. We now provide simulations on a target of the form (7) with λ i = 1 for all i which suggest that this is not be the case for our DBPS.
From a long run of the DBPS on the target, we obtain the posterior median of log π(X), m π ; we also obtained a large number of (effectively) independent samples from π. Each run of the DBPS from an initial point in the tail picked a new x 0 value from this sample and multiplied every element of the x 0 vector by a scalar 'multiplier', ϕ; each run also picked a vector for u 0 sampled uniformly on S. The number of iterations until convergence was then calculated as n cvg := inf{n ∈ N : log π(x n ) > m π }.
The left and central panels of Figure 7 show the traceplots from 10 6 iterations for δ = 0.5 and with log π(x) with (ε, κ, ϕ) = (0.5, 0, 8) (left) and (ε, κ, ϕ) = (0, 0.1, 10000) (centre), and demonstrate a steady drift towards the centre, suggesting that both processes are geometrically ergodic. It is, however, clear that the drift when ε > 0 is much slower than the drift when κ > 0. To measure this, for each multiplier, ϕ, three replicate runs from different starting values, were performed. The experiment used δ = 0.5, which is approximately optimal for the body of this target. Two sets of simulations were performed, one with κ = 0.1 and ε = 0, and one with κ = 0 and ε = 0.4, both reasonable values for exploring the body of the target. Figure 7 plots log 10 n cvg against log 10 ϕ. It strongly suggests that multiplying the distance from the origin by ϕ multiplies the convergence time by approximately ϕ 0.75 when (κ, ε) = (0.1, 0) and ϕ 3.6 when (κ, ε) = (0, 0.4). This suggests a strong preference for using κ > 0. Further diagnostics suggest the following cause for the relatively poor performance when κ = 0: since the gradient and curvature are much larger in the tails the number of rejected delayed-rejection steps is much higher, but when such a rejection occurs and κ = 0 the particle retraces its previous route exactly, whereas when κ > 0 it does not.

Evaluating a subset of the components of ∇ log π
Henceforth, given the evidence in Sections 5.1.1 and 5.1.2, we have fixed ε = 0, and focus on κ > 0. The top panes of the left panel of Figure 8 shows ESS min , ESS lp as a function of the number of evaluated gradient components n cpt , with (ε, log 10 κ, log 10 δ) = (0, −1.5, 0.5); f b , f r and c RMS were approximately constant at 0.24, 0.11 and 0.88. The bottom two panes show ESS min and ESS lp per 10 6 evaluations of log π, assuming that each component of the derivative requires one additional evaluation, as would be the case if forwards differencing were used. The right panel shows similar simulations to the left panel but with d = 100 and (ε, log 10 κ, log 10 δ) = (0, − 1.7, 0.5). To preserve the level of anisotropy whilst increasing dimension we set λ i = 1 + i/4 when d = 100. Again f b , f r and c RMS were approximately constant, this time at 0.26, 0.05 and 0.91 As expected, efficiency increases with n cpt reaching the efficiency of the standard algorithm when n cpt = d. Perhaps, more surprising is the diminishing return for each increase in n cpt . The plots indicate that ESS lp has almost reached its maximum with n cpt = √ d, whereas ESS min requires a larger value, although still much smaller than d. Since with numerical differentiation or with automatic forward differentation the computational cost is approximately linear in n cpt these results suggest that evaluating only a subset of components may be worthwhile in such instances. The bottom panes suggest that the optimal n cpt is approximately √ d and that the gain from using only √ d components is larger when d itself is larger.

A different toy target
Some of the above experiments were repeated with a target of the form with very similar results. In particular, (i) both ESS min and ESS lp were optimised close to the value of δ that maximised the fraction of bounces, f b ; (ii) setting κ > 0 proved to be more efficient than setting ε > 0; (iii) there was an optimal choice of κ, which occured at a value of c RMS ≈ 0.6; (iv) the effects on c RMS of changing δ were minimal until δ became large; (v) as n cpt increased both when d = 25 and d = 100, the gains in ESS min and ESS lp diminished, and ESS min /10 6 evaluations was optimised with n cpt just less than √ d in both cases, whilst small additional gains in ESS lp were obtainable by reducing n cpt to 2 and 3, respectively when d = 25 and d = 100.

The Markov modulated Poisson process and a surrogate gradient
Consider a k-state, continuous-time Markov chain Z t started from state 1, and a Poisson process N t whose rate λ t is a fixed function of Z t . The doubly-stochastic process is parameterised by the rate matrix for the Markov chain, Q, and a vector of rates for the Poisson process, λ, where λ i , (i = 1, . . . , k) is the rate of N t when Z t = i.  Table 1: Fraction of succesful and unsuccessful bounce proposals, f b and f r , and efficiencies (effective samples per 10 3 CPU seconds) for each parameter and for log π(λ, Q) from 10 5 iterations of each algorithm.
The event times of N t are observed over a time window [0, t end ], but the behaviour of Z t is unknown, and we wish to perform inference on (Q, λ). Setting Λ = diag(λ), the likelihood for the number of events n and the event times t 1 , . . . , t n is (e.g. Fearnhead and Sherlock, 2006): where 1 is the k-vector of ones and e is (1, 0, . . . , 0) . We simulated a data-set using a cyclic four-state Markov chain for a 250-second time window with Q parameters of: Q 12 = 2.0, Q 23 = 1.0, Q 34 = 0.5, Q 41 = 0.5, and all other off-diagonal rates set to zero. The rate parameters were λ 1 = 15.0, λ 2 = 5.0, λ 3 = 1.0 and λ 4 = 10.0. We then conducted inference on the natural logarithm of each parameter that was not systematically zero, placing independent N (0, 2 2 ) priors on each of these.
Since autodifferentiation cannot be used on general matrix exponentials, if we wish to use an accurate representation of the true gradient then we must use numerical differentiation. For computational speed we tried to use a first-order divided difference, however precision problems in the evaluation of the log-likelihood meant that it was not possible to use a sufficiently small difference to provide a sufficiently accurate estimate of the gradient, and so centred differences were used. We applied the DBPS for 10 5 iterations using approximately optimal values of κ and δ, and repeated this but evaluating only n cpt = 3 randomly-orientated components (but including u) of the eight-dimensional gradient vector on each delayed-rejection step. We then created a surrogate gradient function based on a Gaussian approximation by finding the posterior mode using the Nelder-Mead algorithm, and then evaluating the Hessian. We ran the DBPS, again for 10 5 iterations, but this time using the surrogate gradient for reflections, rather than the true gradient. Table 1 supplies f b , f r , and the computational efficiency for each parameter as well as for log π; c rms values are not included as all were 0.73 to two decimal places; all ESSs were in excess of 1000. The CPU times taken to find the mode and Hessian of the target were only, respectively 21.9 and 3.7 seconds, compared with 3482.3 seconds to run the algorithm for 10 5 iterations using the surrogate gradient. The table shows a moderate efficiency gain in the mixing of log π from evaluating n cpt = 3, and perhaps a marginal gain for the individual parameters; we suspect this is because the dimension is lower than in Section 5.1.4. However, in this instance, the gain from obtaining and using a computationally-cheap surrogate for ∇ log π is large. Indeed the results show that it would have been preferable to use the surrogate even had it been feasible to use forward differences for the DBPS.
At first glance it might be surprising that the fraction of successful bounces when n cpt = 3 is larger than for the full DBPS; however, a second glance at the left panel of Figure 3 reveals the reason. In three dimensions the proposed bounce point with n cpt = 3, would reside on the blue dashed circle, but would be closer to the current point (black) than is the DBPS proposal (open blue point). Thus the discrepancy between target values at the current and proposed points would be smaller.