Blocking strategies and stability of particle Gibbs samplers

Summary. Sampling from the posterior probability distribution of the latent states of a hidden Markov model is nontrivial even in the context of Markov chain Monte Carlo. To address this, Andrieu et al. (2010) proposed a way of using a particle filter to construct a Markov kernel that leaves the posterior distribution invariant. Recent theoretical results have established the uniform ergodicity of this Markov kernel and shown that the mixing rate does not deteriorate provided the number of particles grows at least linearly with the number of latent states. However, this gives rise to a cost per application of the kernel that is quadratic in the number of latent states, which can be prohibitive for long observation sequences. Using blocking strategies, we devise samplers that have a stable mixing rate for a cost per iteration that is linear in the number of latent states and which are easily parallelizable.


Introduction
Let {(X_t, Y_t) ∈ X × Y, t ∈ N+} be a Markov chain evolving according to a (general state space) hidden Markov model (HMM). That is, the state process {X_t, t ∈ N+} is a Markov chain with state space X, evolving according to a Markov transition kernel M and with initial distribution µ (i.e., µ is the distribution of the state at time t = 1). The sequence {X_t, t ∈ N+} is not directly observed and inference needs to be carried out based on the observations {Y_t, t ∈ N+} only. Conditionally on {X_t, t ∈ N+}, the observations {Y_t, t ∈ N+} are independent. The conditional density of Y_t given X_t = x_t, with respect to some dominating measure, is denoted by g(x_t, y_t).
We will work under the assumption that a fixed sequence of observations y_{1:n} := (y_1, . . ., y_n) is available, where n is some final time point. The key object of interest is then the joint smoothing distribution (JSD) φ(dx_{1:n}), which is the probability distribution of X_{1:n} := (X_1, . . ., X_n) conditioned on Y_{1:n} = y_{1:n}. Markov chain Monte Carlo can be used to simulate from the JSD, for example the Gibbs scheme that uses Metropolis-Hastings samplers to update the state variables {X_t}_{t=1}^n one at a time, with all the other variables kept fixed (Carter and Kohn, 1994; Frühwirth-Schnatter, 1994). However, the typically strong dependencies between consecutive states in the state sequence can cause this method to mix very slowly. As a result, this solution is often deemed inefficient.
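The one-at-a-time scheme above is easy to state concretely for a finite-state HMM, where the single-site conditional p(x_t | x_{t−1}, x_{t+1}, y_t) ∝ m(x_{t−1}, x_t) m(x_t, x_{t+1}) g(x_t, y_t) is available in closed form. The following sketch is our own illustration (all variable names are ours, not from the paper) of one left-to-right sweep:

```python
import numpy as np

def single_site_gibbs_sweep(x, y, mu, M, G, rng):
    """One left-to-right sweep of the one-at-a-time Gibbs sampler for a
    discrete-state HMM.  x: current state sequence (ints), y: observations
    (ints), mu: initial distribution, M: transition matrix, G[s, o]:
    emission probability of observation o in state s."""
    n = len(x)
    x = x.copy()
    for t in range(n):
        # Full conditional of X_t given its neighbours and y_t:
        # p(x_t | x_{t-1}, x_{t+1}, y_t) prop. to m(x_{t-1},x_t) m(x_t,x_{t+1}) g(x_t,y_t)
        w = (mu if t == 0 else M[x[t - 1]]) * G[:, y[t]]
        if t < n - 1:
            w = w * M[:, x[t + 1]]
        x[t] = rng.choice(len(w), p=w / w.sum())
    return x
```

When M is close to deterministic, each conditional concentrates near the neighbouring states, so successive sweeps barely move the sequence; this is the slow mixing referred to above.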
However, recent developments in sequential Monte Carlo (SMC) methods have had a significant impact on the practice of MCMC. The SMC methodology, which combines sequential importance sampling and resampling, has long been established as a key technique for approximating the JSD in general HMMs; see e.g. Doucet et al. (2000); Del Moral (2004); Doucet and Johansen (2011) for introductions, applications, and theoretical results. In a seminal paper by Andrieu et al. (2010), this key strength of SMC was exploited to construct effective (SMC based) high-dimensional MCMC kernels, resulting in so-called particle MCMC (PMCMC) methods.
We will consider specifically an instance of PMCMC referred to as particle Gibbs (PG) (Andrieu et al., 2010; see also Chopin and Singh, 2015). The PG algorithm of Andrieu et al. (2010) defines (via an SMC construction) a Markov kernel which leaves the full JSD invariant. From a practitioner's point of view, PG can be regarded as a technique to mimic the behaviour of simulating all the state variables X_{1:n} as one block from the JSD for models where exact simulation (from the JSD) is not possible.
There are some recent theoretical results for the PG (Markov) kernel P to back up its good observed performance. Chopin and Singh (2015) show it is uniformly ergodic but do not shed light on the dependence of the rate on the time horizon n and particle number N. The PG kernel has been shown to be minorised by the JSD, i.e. P ≥ const. × φ, under weak conditions (Lindsten et al., 2015; Andrieu et al., 2015) (thus also implying uniform ergodicity). Under stronger (but common) forgetting conditions on the HMM, an explicit lower bound for the minorising constant, (1 − 1/(c(N − 1) + 1))^n, has also been established in these works; here, 1 ≥ c > 0 is a model-dependent constant, N is the number of particles (of the SMC sampler) used to construct the PG kernel and n is the number of observations. (See Kuhlenschmidt (2014) for an alternative proof of this rate, and a central limit theorem, using more routine SMC analysis techniques.) While the method is indeed uniformly geometrically ergodic for any N ≥ 2, the explicit rate reveals that the convergence deteriorates as n increases if N is kept fixed. This effect, which is also clearly visible in practice (Lindsten and Schön, 2013; Chopin and Singh, 2015), is related to the well-known path-degeneracy issue of SMC samplers, see e.g. Doucet and Johansen (2011). For the rate not to deteriorate, N must increase linearly with n, giving rise to an algorithm that costs n^2 per iteration (or application) of the PG kernel, which may be impractical when n is large. (Note that a PG kernel implemented with N particles has a cost per iteration of Nn.) We seek to answer an obvious but important question: is there a different implementation of the PG kernel with a cost per iteration that grows linearly with n, but which is still stable in the sense that its convergence rate does not deteriorate with n?
Secondly, and equally important, is the issue of parallelizing the PG kernel to save on wall-clock, or execution, time. We provide an affirmative answer to the question posed above, and we use blocking to achieve both aims.

Summary of main results and related work
In the literature PG is often presented (and invoked) as an exact approximation of a fully blocked sampler and it is contrasted with a single-state Gibbs sampler. However, there is an intermediate route between these two extremes, namely to use PG as a component of a (partially) blocked Gibbs sampler. (This possibility was indeed pointed out as a potential extension in the seminal PMCMC paper (Andrieu et al., 2010, p. 294).) As we will demonstrate in this paper, using PG within a blocked Gibbs sampler can result in very efficient samplers. We now give a simplified interpretation of our results.
In light of the convergence properties of the PG kernel discussed above, the convergence rate of the blocked PG kernel will now depend on the block size L and not on the total number of observations n. Thus the particle approximation of the ideal blocked Gibbs sampler does not deteriorate with the total number of observations n, provided the block sizes themselves do not increase with n. The main insight which we exploit next is that blocking can also be used to control the convergence properties of the ideal Gibbs sampler for an HMM. Specifically, we show that under certain forgetting properties of the HMM it is possible to select a blocking scheme, using overlapping blocks, which results in a uniform rate of convergence for the ideal Gibbs sampler. Importantly, we show that the rate of convergence after k complete sweeps is of order λ_Ideal^k (see Theorems 4 and 6), where P_Ideal is the Markov kernel defined by one complete sweep and the rate λ_Ideal improves with increasing overlap between blocks but is independent of n. Furthermore, we translate this result to the (more practical) case in which exact sampling from the block-conditionals is not possible and PG kernels are used to simulate from each of these conditionals instead. In this case the rate is inflated by a perturbation term (see Theorem 8), where 0 ≤ ǫ_N ≤ 1 quantifies the effect of departing from the ideal block sampler when using the PG kernels instead. Specifically, ǫ_N ↓ 0 as the particle number N increases, i.e. as the PG kernel better approximates the ideal block sampler, but ǫ_N ↑ 1 as the size of the blocks increases. We analyse different blocking schemes (Theorem 8) and discuss in particular how the blocking can be selected to allow straightforward parallelization of the sampler.
Our analysis is based on Wasserstein estimates, see e.g. the recent works by Wang and Wu (2014); Rebeschini and van Handel (2014). Wang and Wu (2014) study the convergence properties of a Gibbs sampler in high dimensions under a Dobrushin condition. Our work is in the same vein, but we are in particular interested in verifying a related condition for the case of a blocked Gibbs sampler for an HMM. Furthermore, we study the convergence of the non-ideal blocked particle Gibbs sampler, for which the results by Wang and Wu (2014) do not apply. Rebeschini and van Handel (2014) generalise the Dobrushin comparison theorem and consider applications in high-dimensional filtering.
It should be noted that refined PG algorithms, incorporating explicit updates of the particle ancestry either as part of the forward SMC recursion (Lindsten et al., 2014) or in a separate backward recursion (Whiteley, 2010; Whiteley et al., 2010), have been developed. These modified PG samplers have empirically been shown to work well with a small number of particles and to be largely robust to n. Nevertheless, to date, no theoretical guarantee for the stability of these algorithms has been given. For specific blocking schemes, blocked PG is straightforwardly parallelizable. It is worth noting that we can replace our blocked PG kernels with the corresponding PG kernels that incorporate updates of the particle ancestry as discussed above.
The rest of this paper is organised as follows. Section 2 presents the HMM, the definition of the JSD, the blocking schemes and the particle implementation of the ideal blocked Gibbs sampler. Section 3 presents the main results on the uniform ergodicity and rates of convergence of the various samplers. Section 4 presents the proof of the stability of the blocked PG sampler. The proofs of supporting technical results are given in the appendices.

Hidden Markov models
Let X be the state space of the time-homogeneous HMM. We will work under the assumption that a fixed sequence of observations y_{1:n} := (y_1, . . ., y_n) is available, where n is some final time point. The key object of interest is then the JSD, that is, the probability distribution of X_{1:n} conditioned on Y_{1:n} = y_{1:n}. This will be the target distribution for the various Monte Carlo sampling algorithms that we shall consider throughout this work. With p(y_{1:n}) denoting the density of the observations Y_{1:n} (with respect to some dominating measure) we can write the density of the JSD as

φ(x_{1:n}) = µ(x_1) g(x_1, y_1) { ∏_{t=2}^{n} m(x_{t−1}, x_t) g(x_t, y_t) } / p(y_{1:n}),   (1)

where, by abuse of notation, we use the same symbol φ both for the JSD and for its density. Note that m(x_{t−1}, x_t) is the density of M(x_{t−1}, dx_t) with respect to some σ-finite dominating measure on X. For notational simplicity, we do not make explicit the dependence of φ on the fixed observation sequence in the notation in (1). We will analyse the stability of our samplers under the following set of strong, but standard, mixing assumptions (Del Moral, 2004; Lindsten et al., 2015; Andrieu et al., 2015):

(S-1) There exists a probability measure ν on X such that m(x_{t−1}, x_t) is the density of M(x_{t−1}, dx_t) with respect to ν. Furthermore, there exist positive constants σ− and σ+ and an integer h such that the h-step transition density m_h satisfies σ− ≤ m_h(x, x′) ≤ σ+ for all x, x′ ∈ X.

(S-2) There exists a constant δ ≥ 1 such that for all y ∈ Y, sup_x g(x, y) ≤ δ^{1/h} inf_x g(x, y).

Block sampling
Before giving the algorithmic statements for the blocked Gibbs samplers that are analysed in this article, we introduce some notation that will be frequently used in the sequel. Let I := {1, . . ., n} be the "index set" of the latent variables X_{1:n}. Let X^n be the n-fold Cartesian product of the set X and let the tuple x = (x_i : i ∈ I) ∈ X^n. For J ⊂ I, we then write x_J := (x_i : i ∈ J) ∈ X^{|J|} (the restricted tuple). We also write x_{−i} as a shorthand for x_{I\{i}}. The complement of J in I is denoted by J^c := I \ J. Given y = (y_i : i ∈ J) ∈ X^{|J|} and z = (z_i : i ∈ J^c) ∈ X^{|J^c|}, we define x = {y, z} = {z, y} ∈ X^n to be the tuple such that x_J = y and x_{J^c} = z. Let φ^J_x denote (a version of) the regular conditional distribution of the variables X_J conditionally on X_{J^c} = x_{J^c} under φ in (1). The Markov property of the HMM implies that φ^J_x depends on x only through the boundary points x_{∂J}, where ∂J denotes the set of indices which constitutes the boundary of the set J, i.e., ∂J := {i ∈ J^c : |i − j| = 1 for some j ∈ J}. The Gibbs sampler generates samples from the JSD φ by iteratively sampling from its conditional distributions. Let J := {J_1, . . ., J_m} be a cover of I. A blocked, deterministic scan Gibbs sampler proceeds by sampling from the block-conditionals φ^J_x, J ∈ J, in turn in some prespecified order. More precisely, assuming that we apply the blocks in the order J_1, J_2, etc., and if the initial configuration of the sampler is given by x′ ∈ X^n, then the Gibbs sampler output is the Markov chain {X[k], k ∈ N} with X[0] = x′ and, given X[k − 1], the next state X[k] is obtained by sampling X_J from φ^J_x for J = J_1, J_2, . . ., J_m in turn, each conditional being evaluated at the most recently updated configuration. (2)
More generally, we can simulate X′_J from some kernel Q_J(x, •) with the conditional distribution φ^J_x as its invariant measure. That is, for every x ∈ X^n,

∫ φ^J_x(dz_J) Q_J({z_J, x_{J^c}}, •) = φ^J_x(•).   (3)

Since φ^J_x depends on x only through the boundary points x_{∂J}, it is natural to assume that Q_J(x, •) depends on x only through x_{J∪∂J}. For notational simplicity, we define J^+ := J ∪ ∂J and write Q_J(x_{J^+}, •) when wishing to emphasize the dependence of the kernel on the components in J^+ of the current configuration x. When Q_J(x_{J^+}, dx′_J) = φ^J_x(dx′_J) for every J ∈ J, we refer to the sampler as an ideal Gibbs sampler. It follows that the Markov kernel corresponding to updating the block J is given by

P_J(x, dx′) = φ^J_x(dx′_J) δ_{x_{J^c}}(dx′_{J^c}) for the ideal Gibbs kernel, (4)
P_J(x, dx′) = Q_J(x_{J^+}, dx′_J) δ_{x_{J^c}}(dx′_{J^c}) for the non-ideal Gibbs kernel. (5)

Formally, the subsets in J can be an arbitrary cover of I. However, there are certain blocking schemes that are likely to be of most practical interest and these schemes therefore deserve some extra attention. To exemplify this, we consider the following restrictions on the blocks in J.

(B-1) Each block is an interval of I and the blocks are ordered by their minimum elements; moreover, no block is a subset of another, i.e., min(J_1) < · · · < min(J_m) and max(J_1) < · · · < max(J_m).
(B-2) Consecutive blocks may overlap, but non-consecutive blocks do not overlap and are separated.That is, for 1 ≤ j < k ≤ m with k − j ≥ 2, max(J j ) < min(J k ) − 1.
In addition to ordering the blocks according to their minimum element, (B-1) avoids the case where one block is a strict subset of some other block.An illustration of (B-2), which requires that J k−1 and J k+1 do not cover J k , is shown in Figure 1.
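As an illustration (our own, with hypothetical helper names), the following sketch constructs overlapping interval blocks with common length L and overlap p and computes the boundary ∂J of a block; when L > 2p the resulting cover satisfies both (B-1) and (B-2):

```python
def make_blocks(n, L, p):
    """Overlapping interval blocks of common length L and overlap p covering
    I = {1, ..., n}; satisfies (B-1) and, when L > 2p, also (B-2)."""
    assert (n - p) % (L - p) == 0, "requires n = (L - p) m + p"
    starts = range(1, n - L + 2, L - p)
    return [list(range(s, s + L)) for s in starts]

def boundary(J, n):
    """Indices outside J at distance one from J (the set denoted dJ)."""
    lo, hi = min(J), max(J)
    return [i for i in (lo - 1, hi + 1) if 1 <= i <= n]
```

For example, make_blocks(11, 5, 2) yields the blocks 1:5, 4:8 and 7:11, and boundary([4, 5, 6, 7, 8], 11) returns [3, 9], matching the definition of ∂J.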
The Gibbs sampling scheme that (perhaps) first comes to mind is a systematic sweep from left to right. Under (B-1), this implies that the kernel corresponding to one complete sweep of the Gibbs sampler is given by: L-R Assume (B-1). The Left-to-Right Gibbs kernel is given by P := P_{J_1} · · · P_{J_m}.
Assuming both (B-1) and (B-2), another blocking scheme that is of practical importance is the one that updates all the odd-numbered blocks first, then all the even-numbered blocks (or the other way around). We refer to this scheme as the Parallel (PAR) Gibbs sampler: PAR Assume (B-1)-(B-2). The Parallel Gibbs kernel is given by P := P_odd P_even, where P_odd := P_{J_1} P_{J_3} · · · and P_even := P_{J_2} P_{J_4} · · ·, the products ending at P_{J_m} or P_{J_{m−1}} for odd/even m, respectively.
The reason why we call this sampler parallel is that, under (B-1)-(B-2), it is possible to update all the odd blocks in parallel, followed by a parallel update of all the even blocks. This is because two consecutive odd (or even) blocks are separated by at least one element in I. Hence, if we have a total of C/2 processing units, we can simulate from both P_odd and P_even (and thus from P) in constant time for any number of blocks m ≤ C. Figure 1 shows an example of a blocking configuration which satisfies (B-1)-(B-2). This is the typical scenario that we have in mind for the PAR sampler.
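The separation property that justifies the parallel update can be checked mechanically. The sketch below (our own illustration) splits an ordered cover into the odd- and even-numbered groups used by the PAR sweep and tests that two interval blocks neither share elements nor are adjacent, which is what allows their updates to run concurrently:

```python
def odd_even_groups(blocks):
    """Split blocks (ordered as in (B-1)) into the odd- and even-numbered
    groups used by the PAR sweep P = P_odd P_even."""
    return blocks[0::2], blocks[1::2]

def separated(J1, J2):
    """True if the two interval blocks share no element and are not adjacent,
    so their Gibbs updates touch disjoint states and can run in parallel."""
    if max(J1) < min(J2):
        return min(J2) - max(J1) > 1
    if max(J2) < min(J1):
        return min(J1) - max(J2) > 1
    return False
```

Under (B-2), any two blocks within the same (odd or even) group are separated in this sense, while consecutive blocks across groups may overlap.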
Reversibility in MCMC is an important consideration. In Section 2.3 we use PG to define the kernels Q_J(x_{J^+}, dx′_J) for each J, and the PG kernel is known to be reversible (Chopin and Singh, 2015). As such, it is straightforward to define reversible block samplers using the following simple fact.
Lemma 1. Let J = {J_1, . . ., J_m} be an arbitrary cover of I and, for each J ∈ J, let P_J, defined in (5), be the (possibly non-ideal) Gibbs kernel updating block J only. Assume that Q_J is reversible with respect to φ^J_x for all J ∈ J. Then the kernel (1/2)(P_{J_1} · · · P_{J_m} + P_{J_m} · · · P_{J_1}) is reversible with respect to φ. For example, for the PAR scheme, the kernel (1/2)(P_odd P_even + P_even P_odd) is reversible. Lemma 1 can be used to define other reversible samplers.
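Lemma 1 can be checked numerically on a toy two-site discrete target, for which the exact single-site Gibbs kernels are small transition matrices. The sketch below is our own example with illustrative weights: it verifies that the symmetrised sweep satisfies detailed balance with respect to the target while the plain composed sweep does not.

```python
import numpy as np
from itertools import product

states = list(product([0, 1], repeat=2))        # configurations of (X_1, X_2)
# An (unnormalised) toy target with dependence between the two sites.
weight = {(a, b): [[4.0, 1.0], [2.0, 6.0]][a][b] for a, b in states}
pi = np.array([weight[s] for s in states])
pi /= pi.sum()

def gibbs_kernel(site):
    """Exact single-site Gibbs kernel updating `site`, as a 4x4 matrix."""
    K = np.zeros((4, 4))
    for i, s in enumerate(states):
        for v in (0, 1):
            t = list(s); t[site] = v; t = tuple(t)
            K[i, states.index(t)] += weight[t]
        K[i] /= K[i].sum()
    return K

P1, P2 = gibbs_kernel(0), gibbs_kernel(1)
K = 0.5 * (P1 @ P2 + P2 @ P1)                   # symmetrised sweep (Lemma 1)
D = np.diag(pi)
assert np.allclose(D @ K, (D @ K).T)            # detailed balance w.r.t. pi
```

Each single-site Gibbs kernel is self-adjoint with respect to pi, so the half-sum of the forward and reversed sweeps is as well; the one-directional sweep P1 @ P2 alone fails the same symmetry check on this example.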

Particle Gibbs
Our primary motivation for studying the convergence properties of blocked Gibbs samplers for general HMMs is the development of PMCMC. These methods provide systematic and efficient Markov kernels which can be used to simulate from the block-conditionals φ^J_x for "reasonably large" blocks, making the blocking strategy practically interesting. In particular, we will make use of the PG sampler (Andrieu et al., 2010) to define Q_J and hence the non-ideal Gibbs kernel in (5). The PG sampler of Andrieu et al. (2010) is a Markov kernel that leaves the full JSD invariant. In this section we briefly review the PG sampler and discuss how the standard PG algorithm needs to be modified (see Algorithm 1) when targeting one of the block-conditionals φ^J_x instead of the full joint smoothing distribution φ. (Although not detailed as in Algorithm 1, blocking was mentioned by Andrieu et al. (2010, page 294) as a possible extension of their original PG sampler. However, there was no discussion or mention of the greatly improved stability of the sampler, which is our main interest here.) The PG sampler can be used to construct a uniformly ergodic Markov kernel on X^{|J|} which leaves φ^J_x invariant. This construction is based on SMC. We denote the PG Markov kernel for block J by Q^J_N, where N denotes a precision parameter; specifically, the number of particles used in the underlying SMC sampler. We work under assumption (B-1), i.e. that J is an interval. Furthermore, for the sake of illustration, assume that J is an internal block, i.e., J = {s, . . ., u} for s > 1 and u < n. Obvious modifications to the algorithmic statement are needed for s = 1 and/or u = n.
The conditional probability density function of X_{s:u}, given X_{s−1} = x_{s−1}, X_{u+1} = x_{u+1} and Y_{s:u} = y_{s:u}, is proportional to m(x_{s−1}, x_s) g(x_s, y_s) · · · m(x_{u−1}, x_u) g(x_u, y_u) m(x_u, x_{u+1}). To aid the description of the PG kernel, we define for t ∈ J the intermediate probability density

π^{s:t}_{x_{s−1}}(x_{s:t}) ∝ ∏_{r=s}^{t} m(x_{r−1}, x_r) g(x_r, y_r),

which is the conditional probability density function of X_{s:t} given X_{s−1} = x_{s−1} and Y_{s:t} = y_{s:t}. It follows that the block-conditional density is given by

φ^J_x(x_{s:u}) ∝ π^{s:u}_{x_{s−1}}(x_{s:u}) m(x_u, x_{u+1}).   (8)

The sampling procedure used in the PG sampler is reminiscent of a standard SMC sampler, see e.g. Doucet et al. (2000); Del Moral (2004); Doucet and Johansen (2011); Cappé et al. (2005) and the references therein. Here we review a basic method, though it should be noted that the PG sampler can be generalised to more advanced methods, see Andrieu et al. (2010); Chopin and Singh (2015). As in a standard SMC sampler, the PG algorithm approximates the sequence of "target distributions" {π^{s:t}_{x_{s−1}}, t = s, . . ., u} sequentially by collections of weighted particles. The key difference between PG and standard SMC is that, in the former, one particle is set deterministically according to the current configuration of the sampler. Let this fixed particle/state trajectory (also called the reference trajectory) be denoted by x⋆ ∈ X^n. Then, to simulate from the PG kernel Q^J_N(x⋆_{J^+}, dx′_J), we proceed as follows. Initially, we simulate particles X^i_s ∼ r_s(x⋆_{s−1}, •), i = 1, . . ., N − 1, independently from some proposal density r_s(x_{s−1}, x_s). The proposal density may depend on the fixed observation sequence, but we omit the explicit dependence in the notation for brevity. Note that all the particles {X^i_s}_{i=1}^N share a common ancestor at time s − 1, namely the fixed boundary point x⋆_{s−1}. Note also that we simulate only N − 1 particles in this way. The N-th particle is set deterministically according to the current configuration: X^N_s = x⋆_s. To account for the discrepancy between the proposal density and the target density π^s_{x⋆_{s−1}}, importance weights are computed in the usual way: ω^i_s = W_s(x⋆_{s−1}, X^i_s), where the weight function is given by

W_t(x_{t−1}, x_t) := m(x_{t−1}, x_t) g(x_t, y_t) / r_t(x_{t−1}, x_t).   (11)

The weighted particles {(X^i_s, ω^i_s)}_{i=1}^N provide an approximation of the target distribution π^s_{x⋆_{s−1}} in the sense that Σ_{i=1}^N (ω^i_s / Σ_{l=1}^N ω^l_s) f(X^i_s) is an estimator of ∫ f(x) π^s_{x⋆_{s−1}}(x) dx for any measurable function f : X → R. We proceed inductively. Assume that we have at hand a weighted sample {(X^i_{s:t−1}, ω^i_{t−1})}_{i=1}^N approximating the target distribution π^{s:t−1}_{x⋆_{s−1}} at time t − 1. This weighted sample is then propagated sequentially forward in time. This is done by sampling, for each particle i ∈ {1, . . ., N − 1}, an ancestor index A^i_t with (conditional) probability P(A^i_t = j) = ω^j_{t−1} / Σ_{l=1}^N ω^l_{t−1}. Given the ancestor, a new particle position is sampled as X^i_t ∼ r_t(X^{A^i_t}_{t−1}, •), where r_t(x_{t−1}, •) is a proposal density on X. Again, note that we only generate N − 1 particles in this way. The N-th particle and its ancestor index are set deterministically as X^N_t = x⋆_t and A^N_t = N. The particle trajectories (i.e., the ancestral paths of the particles X^i_t, i ∈ {1, . . ., N}) are constructed sequentially by associating the current particle X^i_t with the particle trajectory of its ancestor: X^i_{s:t} := (X^{A^i_t}_{s:t−1}, X^i_t). It follows by construction that the N-th particle trajectory coincides with the current configuration: X^N_{s:t} = x⋆_{s:t} for any t ∈ J.
Finally, the particles are assigned importance weights. For t < u, we compute ω^i_t = W_t(X^{A^i_t}_{t−1}, X^i_t), where the weight function is given by (11). At the final iteration of block J, we need to take into account the fact that the target distribution is φ^J_{x⋆} and not π^{s:u}_{x⋆_{s−1}}; the two being related according to (8). We can view the fixed boundary state x⋆_{u+1} as an "extra observation" and, consequently, the weights are computed according to ω^i_u = W_u(X^{A^i_u}_{u−1}, X^i_u) m(X^i_u, x⋆_{u+1}). Similarly to the fact that the proposal densities are allowed to depend on the fixed observations y_{s:u}, it is possible to let the final proposal density r_u depend on the fixed boundary point x⋆_{u+1}, but again we do not make this explicit in the notation for simplicity.
After a complete pass of the above procedure, the current configuration of the sampler is updated by replacing the states x⋆_J (recall that J = s : u) by a draw from the particle approximation of φ^J_{x⋆}. That is, X′_J is sampled from among the particle trajectories at time u, with probabilities given by their importance weights, i.e., P(X′_J = X^i_{s:u}) = ω^i_u / Σ_{l=1}^N ω^l_u, and x⋆ ← {x⋆_{J^c}, X′_J} is taken as the new configuration of the sampler. The block PG sampling procedure is summarized in Algorithm 1. Note that the procedure associates each x⋆_{J^+} with a probability distribution on X^{|J|}; this is the PG kernel Q^J_N. As shown by Andrieu et al. (2010), the conditioning on a reference trajectory implies that the PG kernel leaves the conditional distribution φ^J_x invariant in the sense of (3). Quite remarkably, this invariance property holds for any N ≥ 1 (though N ≥ 2 is required for the kernel to be ergodic; see Andrieu et al. (2010); Lindsten et al. (2015); Andrieu et al. (2015)).
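For concreteness, the following is a minimal sketch of the block PG kernel for a finite-state HMM with a bootstrap proposal r = m, written 0-indexed with the reference trajectory pinned as the last particle. It mirrors the procedure described above but is our own illustration, not the paper's implementation; all names are ours.

```python
import numpy as np

def block_pg_kernel(x_star, y, M, G, s, u, N, rng):
    """One draw from a particle Gibbs kernel targeting the block-conditional
    for an internal block J = {s, ..., u} (0-indexed), using a bootstrap
    proposal r = m.  x_star: current full configuration (ints), y:
    observations, M: transition matrix, G[state, obs]: emission probs.
    Particle N-1 is pinned to the reference trajectory."""
    L = u - s + 1
    X = np.empty((L, N), dtype=int)      # particle positions, rows = time
    A = np.empty((L, N), dtype=int)      # ancestor indices
    # Initialisation at time s: propose from M(x*_{s-1}, .), pin the reference.
    X[0, :-1] = rng.choice(M.shape[0], size=N - 1, p=M[x_star[s - 1]])
    X[0, -1] = x_star[s]
    w = G[X[0], y[s]]                    # bootstrap weights: g(x_s, y_s)
    for t in range(1, L):
        # Resample ancestors for particles 0..N-2, pin the reference's.
        A[t, :-1] = rng.choice(N, size=N - 1, p=w / w.sum())
        A[t, -1] = N - 1
        X[t, :-1] = [rng.choice(M.shape[0], p=M[X[t - 1, a]]) for a in A[t, :-1]]
        X[t, -1] = x_star[s + t]
        w = G[X[t], y[s + t]]
    # Final reweighting: the boundary state x*_{u+1} acts as an extra observation.
    w = w * M[X[-1], x_star[u + 1]]
    # Trace back the ancestral path of a particle drawn with prob. prop. to w.
    k = rng.choice(N, p=w / w.sum())
    path = np.empty(L, dtype=int)
    for t in range(L - 1, -1, -1):
        path[t] = X[t, k]
        if t > 0:
            k = A[t, k]
    out = x_star.copy()
    out[s:u + 1] = path
    return out
```

States outside the block are returned unchanged, and because the reference trajectory survives every resampling step the kernel leaves the block-conditional invariant for any N, per the discussion above.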
Empirically, it has been found that the mixing of the PG kernel can be improved significantly by updating the ancestor indices A^N_t, for t ∈ {1, . . ., T}, either as part of a separate backward recursion (Whiteley et al., 2010), or in the forward recursion (Lindsten et al., 2014) itself. Although we do not elaborate on the use of these modified PG algorithms in this work, the stability results of the blocked PG sampler presented in the subsequent sections also hold when the PG kernel is replaced by one of these modified algorithms (which might result in better empirical performance).
Algorithm 1 PG sampler for φ^J_x for J = {s, . . ., u} (non-boundary block)
Input: Observations y_J, fixed boundary states x⋆_{∂J}, and reference states x⋆_J.
Output: Draw from the PG Markov kernel Q^J_N(x⋆_{J^+}, •).
1: Draw X^i_s ∼ r_s(x⋆_{s−1}, •) for i = 1, . . ., N − 1 and set X^N_s = x⋆_s.
2: Compute the weights ω^i_s = W_s(x⋆_{s−1}, X^i_s) for i = 1, . . ., N.
3: for t = s + 1 to u do
4: Draw ancestor indices A^i_t with P(A^i_t = j) ∝ ω^j_{t−1} for i = 1, . . ., N − 1.
5: Set A^N_t = N.
6: Draw X^i_t ∼ r_t(X^{A^i_t}_{t−1}, •) for i = 1, . . ., N − 1 and set X^N_t = x⋆_t.
7: Set X^i_{s:t} = (X^{A^i_t}_{s:t−1}, X^i_t) for i = 1, . . ., N.
8: if t < u then
9: Compute the weights ω^i_t = W_t(X^{A^i_t}_{t−1}, X^i_t) for i = 1, . . ., N.
10: else
11: Compute the weights ω^i_u = W_u(X^{A^i_u}_{u−1}, X^i_u) m(X^i_u, x⋆_{u+1}) for i = 1, . . ., N.
12: Draw X′_J from among the trajectories {X^i_{s:u}}_{i=1}^N with probabilities proportional to {ω^i_u}_{i=1}^N.

Main Results: Convergence of the Block Samplers

In this section we state the main convergence results for the blocked Gibbs samplers detailed in Section 2.2. After introducing our notation and stating some known preliminary results in Section 3.1, we start in Section 3.2 by deriving Wasserstein estimates for the (ideal) blocked Gibbs kernels. We then investigate in Sections 3.3 and 3.4 how these estimates result in contraction rates for the ideal and particle Gibbs block samplers, respectively.

Preliminaries, notation, and definitions
For a function f : X^n → R, the oscillation of f with respect to the i-th coordinate is denoted by δ_i(f) := sup{|f(x) − f(z)| : x, z ∈ X^n, x_{−i} = z_{−i}}, and for any x, z ∈ X^n we have the telescoping bound

|f(x) − f(z)| ≤ Σ_{i∈I} δ_i(f) I[x_i ≠ z_i].   (14)

For a matrix A, recall that ‖A‖_∞ = max_i Σ_j |a_{i,j}| = max_i [A1]_i, where the last equality holds only if all elements are non-negative. The norm is sub-multiplicative, i.e. ‖AB‖_∞ ≤ ‖A‖_∞ ‖B‖_∞. Let µ and ν be two probability measures on X^n. With Ψ being a probability measure on X^n × X^n, we say that Ψ is a coupling of µ and ν if Ψ(•, dx) = µ(•) and Ψ(dx, •) = ν(•).
We review some well-known techniques for the analysis of Markov chains (see, e.g., Föllmer (1982)). Let P be a Markov kernel on X^n. The matrix W is a Wasserstein matrix for P if, for any function f of finite oscillation,

δ_j(P f) ≤ Σ_{i∈I} W_{i,j} δ_i(f) for all j ∈ I.   (15)

If P and Q are two Markov kernels with Wasserstein matrices V and W, respectively, then W V is a Wasserstein matrix for the composite kernel P Q.
The convergence rate of an MCMC procedure can be characterised in terms of a corresponding Wasserstein matrix for the Markov transition kernel, through the following (well-known) result. For any probability distributions µ and ν and any function f with finite oscillation, we have

|µ P^k f − ν P^k f| ≤ Σ_{i∈I} Σ_{j∈I} δ_i(f) [W^k]_{i,j} Ψ(X_j ≠ Z_j)   (16)

for any k ≥ 1 and for any coupling Ψ of µ and ν. (Note that Ψ(X_j ≠ Z_j) is the probability of not coupling element j under the coupling Ψ, where (X, Z) ∼ Ψ.) This result can be verified by using (14) and iterating the inequality (15). It follows from (16) that ‖W‖_∞ < 1 implies a geometric rate of contraction of the kernel P.
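The role of ‖W‖_∞ < 1 can be illustrated numerically. The toy matrix below (our own example, not from the paper) has infinity-norm below one, and sub-multiplicativity of the norm then forces the iterated bound to decay geometrically:

```python
import numpy as np

def inf_norm(A):
    """Matrix infinity-norm: maximum absolute row sum."""
    return np.abs(A).sum(axis=1).max()

# A toy nonnegative Wasserstein-type matrix with inf-norm < 1: an error in
# coordinate j influences coordinate i with weight W[i, j].
W = np.array([[0.0, 0.3, 0.0],
              [0.2, 0.0, 0.2],
              [0.0, 0.3, 0.0]])
assert inf_norm(W) < 1
# Sub-multiplicativity gives geometric decay of the iterated bound (16):
norms = [inf_norm(np.linalg.matrix_power(W, k)) for k in range(1, 5)]
assert all(norms[k + 1] <= inf_norm(W) * norms[k] + 1e-12 for k in range(3))
```

Since ‖W^k‖_∞ ≤ ‖W‖_∞^k, the right-hand side of the bound shrinks by at least the factor ‖W‖_∞ per application of the kernel.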

Wasserstein estimates for the ideal block sampler
Our convergence results, both for the ideal and for the PG samplers, rely on constructions of Wasserstein matrices for the ideal Gibbs kernels. Indeed, as we shall see in Section 3.4, the PG block sampler can be viewed as an ǫ-perturbation of the ideal block sampler, and this is exploited in the convergence analysis. Let P_J denote the ideal Gibbs kernel that updates block J only, as defined in (4), and let W^J be a Wasserstein matrix for P_J. The following lemma reveals the structure of W^J.
Lemma 2. A Wasserstein matrix for the ideal Gibbs kernel updating block J can be chosen to satisfy

W^J_{i,j} = I[i = j] for i ∈ J^c;  W^J_{i,j} = 0 for i ∈ J, j ∈ I \ ∂J;  W^J_{i,j} = × for i ∈ J, j ∈ ∂J,   (17)

where × denotes elements which are in general in the interval [0, 1].
The interpretation of components that are either 0 or 1 is noteworthy as these cannot be improved further. (Recall, the smaller the row sum of the Wasserstein matrix, the better.) For example, the lemma states that when the states x and z differ in a single component, say x_j ≠ z_j, and if j ∈ J, then this error is not propagated when computing the difference P_J f(x) − P_J f(z). In this sense the given Wasserstein matrices are the 'element-wise minimal' ones.
Proof (Lemma 2). From (4) we have P_J f(x) = ∫ φ^J_x(dx′_J) f({x′_J, x_{J^c}}). A candidate Wasserstein matrix W^J can be found via a coupling argument. For any pair (x, z) such that x_{−j} = z_{−j} and x_j ≠ z_j, let Ψ^J_{j,x,z} be a coupling of φ^J_x and φ^J_z. Consider now

|P_J f(x) − P_J f(z)| = |∫ Ψ^J_{j,x,z}(dx′_J, dz′_J) {f({x′_J, x_{J^c}}) − f({z′_J, z_{J^c}})}|
≤ Σ_{i∈J} δ_i(f) Ψ^J_{j,x,z}(X′_i ≠ Z′_i) + Σ_{i∈J^c} δ_i(f) I[x_i ≠ z_i]
≤ Σ_{i∈J} δ_i(f) Ψ^J_{j,x,z}(X′_i ≠ Z′_i) + δ_j(f) I[j ∈ J^c],

where the second line follows from (14) and the last line follows since x_{−j} = z_{−j}, i.e., at most one term from the second sum in the penultimate line will be non-zero. Thus, for i ∈ J we can set

W^J_{i,j} := sup{Ψ^J_{j,x,z}(X′_i ≠ Z′_i) : x_{−j} = z_{−j}},   (18)

and for i ∉ J, W^J_{i,j} := I[i = j ∈ J^c]. Furthermore, since φ^J_x depends on x only through the boundary points x_{∂J}, it follows that for j ∉ ∂J the coupling Ψ^J_{j,x,z} can be made perfect: Ψ^J_{j,x,z}(X′_i = Z′_i) = 1 for any i ∈ J. Therefore, it is evident from (18) that W^J_{i,j} = 0 for i ∈ J and j ∈ I \ ∂J.
Specifically, if J = s : u is an interval (cf. (B-1)), W^J is block-structured: the rows and columns with indices i ∈ J and j ∈ J form a square of zeros, the columns of non-zero entries × correspond to W^J_{i,j} with i ∈ J and j ∈ ∂J, and the remaining entries are the diagonal ones, equal to one, for i = j ∈ J^c; obvious modifications apply for the "boundary blocks" (s = 1 and/or u = n). It remains to compute the non-zero, off-diagonal elements of W^J. To this end, we use the strong, but standard, mixing conditions (S-1) and (S-2), given in Section 2.1. This allows us to express the "unknown" elements of W^J in terms of the mixing coefficients of the model.

Lemma 3. Assume (B-1) and let J = s : u. Assume (S-1) and (S-2) and define the constant α ∈ [0, 1) in terms of δ, σ−, σ+, and h of (S-1) and (S-2). Then W^J, defined as in (17) and with W^J_{i,j} = α^{⌈|i−j|/h⌉} for i ∈ J and j ∈ ∂J, is a Wasserstein matrix for the ideal Gibbs block-transition kernel P_J.
Proof. See Appendix B.

Contraction of the ideal block sampler
Let ∂ := ∪_{J∈J} ∂J denote the set of all boundary points. We start by stating a general geometric convergence result which holds for an arbitrary cover J of I. It relies on the following contraction condition.

(A-1) There exists λ < 1 such that Σ_{j∈∂J} W^J_{i,j} ≤ λ for every J ∈ J and every i ∈ J ∩ ∂.
Theorem 4. Assume (A-1) and let W^J satisfy (17) for all J ∈ J. Let J = {J_1, . . ., J_m} be an arbitrary cover of I and let W := W^{J_m} · · · W^{J_1} be the Wasserstein matrix for P = P_{J_1} · · · P_{J_m}, i.e., for one complete sweep of the ideal Gibbs sampler. Then ‖W^k‖_∞ ≤ c λ^k for every k ≥ 1, where c is a constant independent of n and k.

Proof. See Appendix A.
The following result now characterises the convergence of the law of the sampled output of the ideal block sampler.
Corollary 5. Let µ and ν be two probability distributions on X^n and let Ψ be an arbitrary coupling of µ and ν. Under the same conditions as in Theorem 4 we have, for any k ≥ 1 and any f of finite oscillation, |µ P^k f − ν P^k f| ≤ c λ^k Σ_{i∈I} δ_i(f), with c as in Theorem 4. Corollary 5 verifies two important facts. First, if we are interested only in the convergence of certain marginals of the sampled process {X[k], k ∈ N} in (2) to the corresponding marginals of the JSD, then the convergence rate is independent of the dimension n of the JSD. Secondly, for convergence in total variation norm of the law of X[k] to the full JSD, we attain a bound on the error which is O(n λ^k). That is, the error grows slowly (linearly) with the dimension n of the JSD.
Under the conditions of Lemma 3 we can clearly see the benefit of blocking for verifying condition (A-1). As an illustration, let h = 1 in (S-1). The condition for contraction requires that for any i ∈ J ∩ ∂ (assuming J is an internal block),

Σ_{j∈∂J} W^J_{i,j} = α^{i−s+1} + α^{u+1−i} < 1,   (21)

where α ∈ [0, 1) is defined in Lemma 3. First of all, we note that it is possible to ensure Σ_{j∈∂J} W^J_{i,j} < 1 for any i ∈ J by increasing the block size L := u − s + 1. Indeed, the maximum of (21) for i ∈ J is attained for i = s or i = u, for which Σ_{j∈∂J} W^J_{i,j} = α + α^L. Secondly, however, we note that Lemma 3 also reveals the benefit of using overlapping blocks. Indeed, since we only need to control (21) for i ∈ J ∩ ∂, we can select the blocking scheme so that the set of boundary points ∂ excludes indices i close to the boundary of block J. Consequently, by using overlapping blocks we can control both terms in (21), and thus the overall convergence rate of the algorithm, by increasing L.
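The trade-off just described for h = 1 can be seen in a two-line numerical check (our own illustration, using the worst-case row sum α + α^L stated above):

```python
# For h = 1 and an internal block of length L, the worst-case boundary row
# sum of the Wasserstein matrix is alpha + alpha**L (attained at i = s or u).
alpha = 0.6
row_sum = lambda L: alpha + alpha ** L
assert row_sum(2) > row_sum(5) > row_sum(20)      # increasing L helps
assert all(row_sum(L) < 1 for L in range(2, 21))  # contraction for alpha = 0.6
```

The sum decreases towards α as L grows, so for any α < 1 a sufficiently large block size yields contraction; the residual α term is what overlapping blocks are then used to control.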
Theorem 4 assumes no specific structure for J other than it being a cover. As such, it cannot provide a sharper estimate of the contraction since it caters for all blocking structures. In order to refine the contraction estimate we impose the blocking structure formalised by (B-1) and (B-2), as illustrated in Figure 1, and study the interplay between block size, overlap, and convergence. The theorem below improves the estimate of the decay of errors per complete sweep from λ in Theorem 4 to λ^2 for the blocking structure of Figure 1.

Theorem 6. Assume (B-1), (B-2), and (A-1). Then, for any k ≥ 1:

• For the ideal PAR sampler, ‖W^k‖_∞ decays geometrically at rate λ^2 per complete sweep and ‖W‖_∞ ≤ 2, where λ is defined as in (A-1).

• For the ideal L-R sampler, ‖W^k‖_∞ decays geometrically at rate β per complete sweep, where β is a constant determined by the matrices W^J, J ∈ J.

Remark 7. There is parity between the two rates of Theorem 6 since it can be shown that β ≈ λ^2. For example, let each block have the same length L and let the overlap between all adjacent blocks be p; then β/λ^2 → 1 as L increases with p fixed.

Contraction of the Particle Gibbs block sampler
We now turn our attention to the particle Gibbs block sampler. In this section we state a main result (Theorem 8) that parallels Theorem 6 for the PG kernel. For the sake of interpretability, we specialize the result to the case of a common block size L and common overlap p between successive blocks (see also Remark 7). A version of Theorem 8 without this assumption, and without strong mixing, is presented in Section 4; see Theorems 9 and 10. (Theorem 8 is a corollary of these theorems.)

(B-3) For all J ∈ J, |J| = L, and for all consecutive blocks J_j and J_{j+1}, |J_j ∩ J_{j+1}| = p.

Note that (B-3) implies that n is assumed to satisfy n = (L − p)m + p.
Theorem 8. Assume (B-1)-(B-3) and (S-1)-(S-2). Let P denote the Markov kernel corresponding to one complete sweep of either the PAR sampler or the L-R sampler, and assume that each block is updated by simulating from the PG kernel (Algorithm 1) using a bootstrap proposal: r(x, x′) = m(x, x′). Let µ and ν be two probability distributions on X^n and let Ψ be an arbitrary coupling of µ and ν. Then a geometric bound holds for any f of finite oscillation and any k ≥ 1, where:
• for the Particle Gibbs PAR sampler, the rate is an ǫ-perturbation of λ²;
• for the Particle Gibbs L-R sampler, the rate is an ǫ-perturbation of β, provided that λ < 1.
In the above, ǫ depends on the block size L and the number of particles N through a constant c (specified in Proposition 12 below) which is itself independent of n, N, L and p.
Proof. The proof is given in Section 4.
Theorem 8 also applies to the ideal sampler (set ǫ = 0). In terms of sufficiency for contraction, the requirement that the non-ǫ terms of this theorem be less than 1 is stronger than (A-1); this should not be surprising, since the analysis is catered for the non-ideal PG kernel and is thus inherently more conservative, even in the case of common block lengths L and overlap p. The non-ǫ terms of Theorem 8 are shrunk by increasing the overlap of the blocks and then increasing the block size with the overlap fixed. Alternatively, if p is a constant fraction of L, then λ tends to zero as L increases. We remark that the non-ǫ rates of Theorem 8 are not as sharp as those given in Theorem 9 and Theorem 10 below. The upper bounds for the non-ǫ terms were chosen for the sake of simplicity and interpretability, since they are functions of the system forgetting (α and h), the common block length L, and the overlap p only.
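A quick numerical companion to this discussion (hypothetical, not taken from the paper's experiments): if, in line with the h = 1 discussion following Lemma 3, the worst-case boundary sum behaves like λ = α^{p+1} + α^{L−p}, then taking p to be a constant fraction of L indeed drives λ to zero as L increases:

```python
# Hypothetical rate illustration: lam(L, p) = alpha**(p + 1) + alpha**(L - p),
# a form suggested by the h = 1 boundary-weight discussion after Lemma 3.

def lam(alpha, L, p):
    return alpha ** (p + 1) + alpha ** (L - p)

alpha = 0.9
# Keep the overlap p a constant fraction (one half) of the block length L.
rates = [lam(alpha, L, L // 2) for L in (8, 16, 32, 64)]
assert all(a > b for a, b in zip(rates, rates[1:]))   # strictly decreasing in L
assert lam(alpha, 64, 32) < 0.1                       # already small at L = 64
```

Both exponents, p + 1 and L − p, grow linearly in L under this scaling, which is why the rate vanishes even though α is fixed.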

Proof of the Contraction of the Particle Gibbs Block Sampler
This section is dedicated to the proof of Theorem 8. Furthermore, it provides a more general version of Theorem 8 that avoids the common block length and overlap structure of assumption (B-3). The general results given below are stated in terms of Wasserstein estimates for the non-ideal blocked Gibbs sampler, under the assumption (A-2) that its Wasserstein matrix is an ǫ-perturbation of a Wasserstein matrix for the ideal block sampler. Specifically, let W̃^J be a Wasserstein matrix for the non-ideal block-transition kernel defined in (5). By an argument analogous to Lemma 2 it follows that W̃^J has a similar structure to the ideal-sampler matrix W^J, but with (possibly) non-zero entries also for rows i ∈ J and columns j ∈ J, respectively, which motivates the following assumed structure.
(A-2) There exists a (common) constant ǫ ∈ [0, 1) and, for each J ∈ J , a matrix W^J satisfying (17) such that, for any J ∈ J , W̃^J_{i,j} := W^J_{i,j} + ǫ I[i ∈ J, j ∈ J+] is a Wasserstein matrix for the non-ideal block-transition kernel updating block J.
In Proposition 12 below we show that (A-2) indeed holds for the PG kernel, with W^J being a Wasserstein matrix for block J of the ideal sampler (as in Lemma 3), and where the perturbation ǫ grows with the block size |J| and decreases with the number of particles N, but is independent of the length n of the total data record and of the specific HMM observations pertaining to each block. This is key to the stability of the PG block sampler as n → ∞ for fixed N.
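The perturbation mechanism behind (A-2) (made precise in Lemma 17 in the appendix) can be illustrated on a toy finite state space. The sketch below is purely illustrative and assumes nothing beyond the minorisation Q ≥ (1 − ǫ)P: the maximal-coupling (total-variation) disagreement between two rows of Q exceeds that of the corresponding rows of P by at most ǫ:

```python
import numpy as np

# Toy finite-state illustration of the eps-perturbation argument: if
# Q(x,.) >= (1 - eps) P(x,.), a coupling of Q(x,.) and Q(y,.) can be built
# as (1 - eps) * (coupling of the P rows) + eps * (coupling of residuals),
# inflating the disagreement probability by at most eps.

rng = np.random.default_rng(0)

def maximal_coupling_disagreement(p, q):
    """P(X != Y) under the maximal coupling of two pmfs = total variation."""
    return 0.5 * np.abs(p - q).sum()

d, eps = 6, 0.2
P = rng.dirichlet(np.ones(d), size=d)   # toy "ideal" kernel (rows are pmfs)
R = rng.dirichlet(np.ones(d), size=d)   # arbitrary residual kernel
Q = (1 - eps) * P + eps * R             # Q >= (1 - eps) P by construction

for x in range(d):
    for y in range(d):
        dis_P = maximal_coupling_disagreement(P[x], P[y])
        dis_Q = maximal_coupling_disagreement(Q[x], Q[y])
        assert dis_Q <= dis_P + eps + 1e-12   # perturbed Wasserstein entry
```

The final inequality is exactly the entry-wise statement W̃_{i,j} ≤ W_{i,j} + ǫ, here checked in total variation on a single coordinate.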
Theorem 9. Assume (B-1), (B-2), (A-2) and PAR, and assume that the number of blocks m = |J | is odd. Let J_{−k} = ∪_{J ∈ J \ {J_k}} J and L = max_{J∈J} |J|. Then W̃, the Wasserstein matrix of one complete sweep (defined analogously to Theorem 4), satisfies a norm bound whose ǫ-dependent part grows quadratically with ǫL.

In order to prove Theorem 8 we can now make use of Lemma 3 (assuming (S-1)-(S-2) and (B-1)-(B-3)) to identify the constants of Theorems 9 and 10. However, a technical detail is to handle the dependence on the maximum block size L = max_{J∈J} |J| and the maximum overlap L₁ = max_{k∈2:m} |J_{k−1} ∩ J_k| (of Theorem 10). Indeed, a direct application of Theorems 9 and 10 would suggest that the norm of W̃ grows, respectively, quadratically or linearly with ǫL. To avoid this issue we make use of the following trick: when applying Theorems 9 and 10 we do not consider the original HMM formulation, but an equivalent model that lumps consecutive states together (for the sake of the analysis), thus effectively reducing the size of the blocks. Under (B-1)-(B-3), each block can be split into three distinct sections: the left overlap, the middle of the block, and the right overlap (see Figure 1). The exceptions are the end blocks, which are split into two sections. By viewing each of these sections as a single lumped state, we reduce the block size to 3 and the maximum overlap to 1. For this scheme, Ξ₁ and Ξ₂ are the lumped states of the two sections of block 1; Ξ₂, Ξ₃ and Ξ₄ are the lumped states of block 2, and so on.

Remark 11. The Ξ-system groups the random variables X₁, . . ., X_n of the X-system as Ξ₁, . . ., Ξ_{2m−1}, noting that (B-3) implies that n = (L − p)m + p. The index set for the Ξ-system is I_Ξ = {1, . . ., 2m − 1} and the cover J_Ξ of I_Ξ has m sets. To find a Wasserstein matrix for the Ξ-system we note that any conditional density of the states Ξ_i, i ∈ J_{Ξ,k}, conditionally on the boundaries of the block J_{Ξ,k} and the observations pertaining to that block, is coupled analogously to the X-system; see the proof of Lemma 3 in the appendix. Analogously to Lemma 3, a Wasserstein matrix for block J_{Ξ,k} of the Ξ-system is thus given by a (2m − 1) × (2m − 1) matrix in which the "square of zeros" corresponds to rows/columns 2k − 2, 2k − 1, and 2k (with obvious modifications for k = 1 or k = 2m − 1). The proof is omitted for brevity, but follows analogously to the proof of Lemma 3. As a final ingredient for the proof of Theorem 8 we need to verify condition (A-2) for the PG kernel.
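The lumping of Remark 11 is easy to sanity-check in code. The sketch below (hypothetical helper names, 0-based indices) builds the blocks and the 2m − 1 lumped sections for a common block length L and overlap p, and checks that an internal block k is the union of lumped sections 2k − 2, 2k − 1 and 2k (in the 1-based numbering of the text):

```python
# Sketch of the lumping trick: blocks of common length L with overlap p
# over n = (L - p)*m + p states are grouped into 2m - 1 lumped sections,
# reducing the effective block size to 3 and the overlap to 1.

def blocks(n, L, p):
    """Intervals J_k of length L whose consecutive overlaps have size p."""
    step = L - p
    return [list(range(k * step, k * step + L)) for k in range((n - p) // step)]

def lumped_sections(n, L, p):
    """Sections Xi_1, ..., Xi_{2m-1}: block interiors alternating with the
    overlaps shared by adjacent blocks (end blocks split into two sections)."""
    step = L - p
    m = (n - p) // step
    cuts = [0]
    for k in range(1, m):
        cuts += [k * step, k * step + p]   # overlap k occupies [k*step, k*step + p)
    cuts.append(n)
    return [list(range(a, b)) for a, b in zip(cuts, cuts[1:])]

n, L, p = 10, 4, 1
B, S = blocks(n, L, p), lumped_sections(n, L, p)
m = len(B)
assert len(S) == 2 * m - 1
assert sorted(x for sec in S for x in sec) == list(range(n))
# Internal block 2 (1-based) is the union of lumped sections 2, 3, 4 (1-based).
assert B[1] == S[1] + S[2] + S[3]
```

Each internal block thus touches exactly three lumped states, matching the "square of zeros" at rows/columns 2k − 2, 2k − 1, 2k in the lumped Wasserstein matrix.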
A.1 Proof of Theorem 4

From Lemma 14 it is clear that W e_j = 0 for any j ∉ ∂ (for which b_j = ∞). Hence, with M being the binary mask matrix with elements M_{i,j} = I[i = j ∈ ∂], it holds that Wr = W(Mr) for any vector r.
Define recursively L₀ := ∅ and r₀ := 1, and L_k, r_k as in (30); we thus have r_m = W1. Hence, the result follows if (31) holds for k = 0, . . ., m (note that nothing is said about the components outside L_k). It remains to prove (31). For k = 0 the hypothesis is true by construction. We proceed inductively: assume that the hypothesis is true for k − 1, and make use of the structure of the Wasserstein matrix from Lemma 2. We need to consider different cases. In the first case we use (32) and then the induction hypothesis; in the second case we again use (32) and the induction hypothesis, together with the fact that, for j ∈ ∂, [r_{k−1}]_j ≤ 1, and assumption (A-1) for the final inequality. This completes the proof.

A.2 Proof of Theorem 6
Case L-R: The proof is similar to that of Theorem 4, but to exploit the structure of the L-R sampler we define the mask matrix M as M_{i,j} = I[i = j ∈ ∂₊], where ∂₊ = ∪_{J∈J} ∂₊J is the set of right boundary points, only, of all blocks. If j = ∂₋J_k is the left boundary of some block k, say, then j ∈ J_{k−1} by the definition of the L-R sampler. Hence, from Lemma 14 it follows that Wr = W(Mr) for any vector r. Define L_k and r_k as in (30). Note that (31) and (32) hold for any cover J, and in particular for the L-R sampler. Thus, for i = ∂₊J_{k−1}, the inequality follows from the fact that, for the L-R sampler, ∂₋J_k ∈ L_{k−1} and ∂₊J_k ∈ I \ L_k, and by using (31). Hence, [r_k]_{∂₊J_{k−1}} ≤ β. Since, for any k ∈ 1 : m, ∂₊J_{k−1} lies in at most one block (namely J_k), we conclude that [r_m]_i ≤ β for any i ∈ ∂₊. Thus ‖MW‖_∞ ≤ β.
Case PAR: For the PAR sampler we redefine the mask M to be M_{i,j} = I[i = j ∈ ∂_odd], where ∂_odd is the set of all boundary points of odd blocks, i.e., ∪_{k odd} ∂J_k. Since any boundary point of an even block lies in the interior of some odd block, it follows from Lemma 14 that Wr = W(Mr) for any vector r. To complete the proof we thus need to show that ‖MW‖_∞ ≤ λ². Let i ∈ ∂_odd and let J_k be the block such that i ∈ J_k. Note that k is even and J_k is the only block containing i.
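The index bookkeeping behind the two masks can be verified directly. The following sketch (hypothetical helpers, 0-based indices, common block length L and overlap p) checks the two structural facts invoked in the proofs: for L-R, left boundaries lie in the previous block; for PAR, boundary points of even blocks lie inside odd blocks:

```python
# Sketch of the boundary bookkeeping used by the mask matrices in the
# proof of Theorem 6 (our own index conventions, 0-based).

def blocks(n, L, p):
    step = L - p
    return [list(range(k * step, k * step + L)) for k in range((n - p) // step)]

def boundary(J):
    """Exterior boundary points {s - 1, u + 1} of an interval block J."""
    return [J[0] - 1, J[-1] + 1]

n, L, p = 17, 5, 2            # m = 5 blocks, m odd as in Theorem 9
B = blocks(n, L, p)
assert len(B) % 2 == 1

# L-R fact: the left boundary of every block (except the first) lies in the
# previous block, so the mask needs to keep right boundaries only.
assert all(B[k][0] - 1 in B[k - 1] for k in range(1, len(B)))

# PAR fact: every boundary point of an even block (1-based: blocks 2, 4, ...)
# lies inside some odd block, so the mask can keep odd-block boundaries only.
odd = [x for J in B[0::2] for x in J]
even_bnd = [j for J in B[1::2] for j in boundary(J) if 0 <= j < n]
assert all(j in odd for j in even_bnd)
```

These are exactly the facts that justify Wr = W(Mr) for the two choices of mask M.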

Thus, [W1]_i satisfies the required bound for all i which are boundaries of even blocks (a fact we will prove next). To conclude: let i be a boundary of an even block and let J_k be the only odd block containing i.

A.3 Proof of Theorem 9
We prove the following more general version (Theorem 9 corresponds to the case r = 1).
Lemma 15. Let the vector r satisfy the stated condition, where c = 2a + L(a′ ∨ b) and the remaining constants are as defined in Theorem 9.
Proof. Note that W = W_even W_odd. Let r′ = W_odd r and r′′ = W_even r′. Recall that, for any vector s, W^{J_k} s differs from s only in the components indexed by J_k. Thus, by (B-1) and (B-2), W_odd r can be studied by considering each term W^{J_k} r, k odd, separately. For k odd and i ∈ J_k, the second line of the resulting bound follows by the assumption on r and the fact that the boundaries of an odd-numbered block lie in the adjacent even blocks.

A.4 Proof of Theorem 10
The proof is inductive. For any vector r, observe that W^J r differs from r only in the components indexed by J. We bound separately the terms for i ∈ J₁^c and i ∈ J₁ ∩ J₂, since W^{J₁}_{i,∂₊J₁} can approach 1 as i approaches the boundary ∂₊J₁. The first term of the sum can be simplified directly; for the second term of (34), the coefficient of ǫ is by itself bounded by c. For i ∈ J_{k−1} ∩ J_k, the coefficient of ǫ is likewise bounded.

B Additional Proofs
B.1 Proof of Lemma 3

Lemma 16. For each i ∈ {1, . . ., t}, let P_i(x_{i−1}, dx_i) be a Markov transition kernel on X.
For some integer h > 0, assume that the composite transition kernels Q_i(x_{(i−1)h}, dx_{ih}) = (P_{(i−1)h+1} · · · P_{ih})(x_{(i−1)h}, dx_{ih}) satisfy the following minorisation condition: there exist probability measures ν_i(dx) on X and a common constant α ∈ [0, 1] such that Q_i ≥ αν_i. Then, for the probability measures (on the product space X^t) ∏_{i=1}^t P_i(x_{i−1}, dx_i) and ∏_{i=1}^t P_i(y_{i−1}, dy_i), and for any x₀ and y₀, there exists a coupling Ψ such that if (X_{1:t}, Y_{1:t}) ∼ Ψ then Pr(X_i ≠ Y_i) ≤ (1 − α)^{⌊i/h⌋}.

Proof. When h = 1 the result follows from standard arguments (Lindvall, 2002) and is repeated here for the sake of completeness. The coupling Ψ attempts to couple (X₁, Y₁), followed by (X₂, Y₂), etc. Specifically, if X_{i−1} = Y_{i−1}, then draw X_i from P_i(X_{i−1}, dx_i) and set Y_i = X_i. If X_{i−1} ≠ Y_{i−1}, draw (X_i, Y_i) from the measure

α ν_i(dx_i) δ_{x_i}(dy_i) + (1 − α)^{−1} (P_i(X_{i−1}, dx_i) − αν_i(dx_i)) (P_i(Y_{i−1}, dy_i) − αν_i(dy_i)),

where δ_x is the Dirac measure at x. It now follows that Pr(X_i ≠ Y_i) ≤ (1 − α)^i. For h > 1 we couple the skeleton process (X_h, Y_h), (X_{2h}, Y_{2h}), . . . using a similar scheme as for h = 1. (Note that the h-skeleton X-system has transition kernels Q_i that satisfy a minorisation condition, which is used in the same manner as in the proof for h = 1.) Let i be the first instance such that X_{ih} = Y_{ih}, i.e., X_{jh} ≠ Y_{jh} for j < i. Then simulate the future X-process, X_k for k = ih + 1, ih + 2, . . ., from P_k and set Y_k = X_k. The non-skeleton terms X_k and Y_k for k < ih are simulated independently from their respective conditional laws. Thus, for i ≥ h, Pr(X_i ≠ Y_i) ≤ Pr(X_{h⌊i/h⌋} ≠ Y_{h⌊i/h⌋}), which is bounded above by (1 − α)^{⌊i/h⌋}.

Proof. (Lemma 3) For notational brevity we write dx instead of ν(dx), where ν is the dominating probability measure defined in (S-1).
Recall that under (B-1), J is an interval, and we can write J = {s, . . ., u}. To compute W^J_{i,j} for i ∈ J we use a coupling as in (18), where Ψ^J_{j,x,z} is a coupling of φ^J_x and φ^J_z for x_{−j} = z_{−j}. We know from Lemma 2 that we only need to consider the cases j = s − 1 and j = u + 1. Consider first j = s − 1 (assuming s > 1).
Write the density of φ^J_x as p(x_s, . . ., x_u | x_{s−1}, x_{u+1}, y_{s:u}) = ∏_{i=s}^{u} p(x_i | x_{i−1}, x_{u+1}, y_{i:u}), which is a product of inhomogeneous Markov transition kernels. (A similar expression follows if written backwards, i.e., ∏_{i=s}^{u} p(x_i | x_{i+1}, x_{s−1}, y_{s:i}).) We show that the composite kernel formed by any h successive kernels of the given inhomogeneous product (for h defined in (S-1) and (S-2)), i.e., ∫ ∏_{i=t+1}^{t+h} p(x_i | x_{i−1}, x_{u+1}, y_{i:u}) dx_{t+1:t+h−1}, satisfies a minorisation condition with respect to some probability measure and the time-uniform constant δ_h^{1−h} σ₋/σ₊. Thus the coupling Ψ^J_{j,x,z} can be defined as in Lemma 16 to complete the proof. Let t ≤ u − h be some time index and consider

p(x_{t+h} | x_t, x_{u+1}, y_{t+1:u}) = p(x_{u+1}, y_{t+h:u} | x_{t+h}) p(x_{t+h} | x_t, y_{t+1:t+h−1}) / ∫ p(x_{u+1}, y_{t+h:u} | x_{t+h}) p(x_{t+h} | x_t, y_{t+1:t+h−1}) dx_{t+h}.   (36)

Lemma 17. Let P and Q be two Markov kernels on X^n and assume that there exists a constant ǫ ∈ [0, 1] such that Q(x, dy) ≥ (1 − ǫ)P(x, dy) for all x ∈ X^n. For x, y ∈ X^n such that x_{−j} = y_{−j}, let Ψ_{j,x,y} be a coupling of P(x, ·) and P(y, ·), and let W be the matrix (a Wasserstein matrix for P)

W_{i,j} = sup_{x, y ∈ X^n : x_{−j} = y_{−j}} Ψ_{j,x,y}(X′_i ≠ Y′_i) = sup_{x, y ∈ X^n : x_{−j} = y_{−j}} ∫ Ψ_{j,x,y}(dx′, dy′) I[x′_i ≠ y′_i].

Then the matrix W̃ with W̃_{i,j} = W_{i,j} + ǫ is a Wasserstein matrix for Q.

Proof. For ǫ = 0 the result is immediate. Hence, consider ǫ > 0. For any x ∈ X^n, define r_x = (1/ǫ)(Q(x, ·) − (1 − ǫ)P(x, ·)) and let R_{x,y} be a coupling of r_x and r_y. It follows that, for any x, y ∈ X^n with x_{−j} = y_{−j}, (1 − ǫ)Ψ_{j,x,y} + ǫR_{x,y} is a coupling of Q(x, ·) and Q(y, ·). We have,