From Denoising Diffusions to Denoising Markov Models

Denoising diffusions are state-of-the-art generative models exhibiting remarkable empirical performance. They work by diffusing the data distribution into a Gaussian distribution and then learning to reverse this noising process to obtain synthetic datapoints. Denoising diffusions rely on approximations of the logarithmic derivatives of the noised data densities, obtained using score matching. Such models can also be used to perform approximate posterior simulation when one can only sample from the prior and likelihood. We propose a unifying framework generalising this approach to a wide class of spaces, leading to an original extension of score matching. We illustrate the resulting models on various applications.


Introduction
Given a set of samples from an unknown distribution p_data(x), generative modelling is the task of producing further synthetic samples coming from approximately the same distribution. Over the past decade, a variety of techniques have been developed to tackle this problem, including autoregressive models (Oord et al., 2016), generative adversarial networks (Goodfellow et al., 2014), variational autoencoders (Kingma and Welling, 2014) and normalising flows (Rezende and Mohamed, 2015). These methods have had significant success in generating perceptually realistic samples from complex data distributions, such as text and image data (Brown et al., 2020; Dhariwal and Nichol, 2021). A major motivation for the development of generative models is that they can be easily extended for Bayesian inference. In a typical setting, we make an observation ξ* of an underlying datapoint x, for example a category label or a partial observation of x, and want to sample from the posterior distribution p_data(x|ξ*). We achieve this by learning a conditional generative model for x given any observation ξ, based on samples from p_data(x, ξ). This approach is particularly useful in high-dimensional scenarios where traditional sampling methods, such as Markov chain Monte Carlo (MCMC) or approximate Bayesian computation (ABC), are typically infeasible.
Recently, denoising diffusion models (Sohl-Dickstein et al., 2015;Ho et al., 2020;Song et al., 2021) have emerged as effective generative models for high-dimensional data. They work by incrementally adding noise to the data to transform the data distribution into an easy-to-sample reference distribution, and then learning to invert the noising process, which is achieved using score matching (Hyvärinen, 2005). Their use for inference has recently seen an explosion of applications, including text-to-speech generation (Popov et al., 2021), image inpainting and super-resolution (Song et al., 2021;Saharia et al., 2022) and protein structure modelling (Trippe et al., 2022).
Most of the current methodology, theory and applications of denoising diffusion models are for diffusion processes on R^d. However, many distributions of interest are defined on different spaces. Recently, De Bortoli et al. (2022) and Huang et al. (2022) have extended continuous-time methods and the analogy with score matching from R^d to general Riemannian manifolds in order to model data with strong geometric priors. Several diffusion methods have also been developed for discrete data, such as text, music or graph structures (Austin et al., 2021; Hoogeboom et al., 2021; Campbell et al., 2022; Anonymous, 2022). Here, though, the relationships to score matching, as well as between these various methods and the real diffusion case, are less clear. All these recent extensions have been somewhat ad hoc, with training objectives needing to be re-derived for each new application.
The main contribution of this paper is to provide a unifying framework for such models, which we call denoising Markov models, or DMMs. We demonstrate how to construct and train a DMM for data in any state space satisfying mild regularity conditions. This yields a principled procedure for using these models for unconditional generation and inference on a wider class of spaces than previously considered. Additionally, this general framework leads to a principled extension of score matching to general spaces. Finally, we demonstrate the application of our framework on examples in discrete space, continuous space and on Riemannian manifolds.

Background
A denoising diffusion model is a generative model consisting of two stochastic processes. The fixed noising process takes a datapoint x_0 drawn from a data distribution q_0 := p_data on state space X and maps it stochastically to some x_T ∈ X. The learned generative process takes x_T ∈ X drawn according to some initial distribution p_0 on X and maps it back stochastically to some x_0 ∈ X. Throughout, we denote the marginals of the noising and generative processes by q_t(x) and p_t(x) respectively for t ∈ [0, T].
The basic idea is to pick a noising process so that (q_t)_{t≥0} converges to some easy-to-sample-from distribution q_ref, which we then take to be p_0. We learn a generative process which approximates the time-reversal of the noising process. Then, we can generate approximate samples from q_0 by sampling x_T ∼ p_0 and running the dynamics of the reverse process to produce a sample x_0 ∼ p_T, which should be close to q_0.
2.1. Continuous-time denoising diffusion models on R^d

The framework for continuous-time denoising diffusion models on R^d was first set out by Song et al. (2021). The noising process (Y_t)_{t∈[0,T]} evolves according to the SDE

dY_t = b(Y_t, t) dt + dB_t,    (1)

for some chosen drift function b : R^d × [0, T] → R^d and standard Brownian motion B.
With this set-up, the time-reversed process X_t = Y_{T−t} can be simulated by initialising X_0 = x_T ∼ q_T and running the SDE

dX_t = [−b(X_t, T−t) + ∇_x log q_{T−t}(X_t)] dt + dB̃_t,    (2)

where q_t(x_t) denotes the marginals of the forward process and B̃ is another standard Brownian motion (Anderson, 1982). We typically choose our forward process to be an Ornstein–Uhlenbeck process, i.e. b(x, t) = −x/2, for which q_T ≈ q_ref := N(0, I_d), the standard Gaussian distribution on R^d, for large T.
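As a concrete check of this reversal, the sketch below (our illustration, assuming NumPy; the data distribution N(2, 0.5²), horizon and step size are arbitrary choices) uses a one-dimensional Gaussian q_0, for which the Ornstein–Uhlenbeck marginals and hence the exact score are available in closed form, and simulates the reverse SDE with Euler–Maruyama:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data distribution q_0 = N(m, s^2); the OU noising dY = -Y/2 dt + dB has
# closed-form marginals q_t = N(m * exp(-t/2), s^2 * exp(-t) + 1 - exp(-t)),
# so the exact score grad_x log q_t is available for this toy example.
m, s, T, dt = 2.0, 0.5, 10.0, 0.01

def score(x, t):
    mean = m * np.exp(-t / 2)
    var = s**2 * np.exp(-t) + 1.0 - np.exp(-t)
    return -(x - mean) / var

# Reverse SDE (Anderson, 1982): dX = [X/2 + score(X, T - t)] dt + dB,
# simulated with Euler-Maruyama from X_0 ~ N(0, 1), which approximates q_T.
n = 20000
x = rng.standard_normal(n)
steps = int(T / dt)
for k in range(steps):
    t = T - k * dt
    drift = x / 2 + score(x, t)
    x = x + drift * dt + np.sqrt(dt) * rng.standard_normal(n)

print(x.mean(), x.std())  # should be close to m = 2 and s = 0.5
```

Because the score is exact here, the only errors are from the Euler discretisation, the finite particle count and the small mismatch between N(0, 1) and q_T.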
To simulate the reverse process, we must approximate ∇_x log q_t(x). We do this by fixing a parametric family of functions s_θ(x, t), and then choosing the parameters θ to minimise the objective

I_DSM(θ) = ∫_0^T E_{q_{0,t}(x_0, x_t)} [ ||s_θ(x_t, t) − ∇_{x_t} log q_{t|0}(x_t|x_0)||² ] dt,    (3)

where q_{0,t}(x_0, x_t) and q_{t|0}(x_t|x_0) denote the joint and conditional distributions of the forward process respectively. The conditional is available in closed form for the Ornstein–Uhlenbeck process. This objective is sensible since I_DSM is minimised when s_θ(x, t) = ∇_x log q_t(x) (Song et al., 2021). If our score estimate were exact and we had p_0 = q_T, then we would have p_t = q_{T−t} for all t ∈ [0, T]. In practice, we typically parameterise s_θ(x, t) by a neural network and use stochastic gradient descent to minimise I_DSM(θ). Once we have a score estimate s_θ(x, t), we compute approximate samples from the reverse process by running the approximate reverse SDE

dX_t = [−b(X_t, T−t) + s_θ(X_t, T−t)] dt + dB̃_t,    (4)

starting in X_0 ∼ p_0 and setting x_0 = X_T; in practice, we use suitable numerical integrators to simulate this process. Alternatively, the objective I_DSM can be derived from a lower bound on the model log-likelihood (also known as an evidence lower bound, or ELBO), either using Girsanov's theorem and the chain rule for Kullback–Leibler divergences (Song et al., 2021), or by combining the Fokker–Planck equation and Feynman–Kac formula with Girsanov's theorem (Huang et al., 2021).
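To see why minimising I_DSM recovers the marginal score, consider the Gaussian toy case at a single fixed time, where the DSM minimiser is linear in x_t and can be found by least squares. A minimal sketch (our illustration; the choices of m, s and t are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Denoising score matching at a fixed time t for the OU noising process:
# q_{t|0}(x_t | x_0) = N(x_0 * exp(-t/2), 1 - exp(-t)), so the regression
# target grad_{x_t} log q_{t|0} is available in closed form.  The DSM
# minimiser over all functions is the marginal score grad log q_t, which is
# linear in x_t when q_0 is Gaussian, so a linear family s(x) = a*x + b
# recovers it exactly.
m, s, t, n = 2.0, 0.5, 1.0, 200_000
x0 = m + s * rng.standard_normal(n)
xt = x0 * np.exp(-t / 2) + np.sqrt(1 - np.exp(-t)) * rng.standard_normal(n)
target = -(xt - x0 * np.exp(-t / 2)) / (1 - np.exp(-t))

# Least squares fit of s(x) = a*x + b to the DSM target.
a, b = np.polyfit(xt, target, 1)

# Exact marginal score is -(x - m * exp(-t/2)) / var_t.
var_t = s**2 * np.exp(-t) + 1 - np.exp(-t)
print(a, -1 / var_t)                   # slopes should agree
print(b, m * np.exp(-t / 2) / var_t)   # intercepts should agree
```

The fitted coefficients match the marginal score of q_t, not the conditional target itself, illustrating that the conditional expectation of ∇_{x_t} log q_{t|0} given x_t is ∇_{x_t} log q_t.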

Diffusion models for inference
Denoising diffusions can also be used to sample approximately from a posterior p data (x|ξ * ) when we only have access to samples from the joint distribution p data (x, ξ); see e.g. (Song et al., 2021).
We first draw a sample (x_0, ξ_0) ∼ p_data, set Y_0 = x_0 and let (Y_t)_{t∈[0,T]} evolve according to Equation (1). If we condition on ξ_0, then the process Y has marginals q_t(x_t|ξ_0) = ∫ q_{t|0}(x_t|x_0) p_data(x_0|ξ_0) dx_0, where q_{t|0}(x_t|x_0) is the transition kernel of the forward diffusion in Equation (1). So, the time-reversed process X_t = Y_{T−t} conditioned on ξ_0 can be simulated by initialising X_0 ∼ q_T(·|ξ_0) and running the SDE

dX_t = [−b(X_t, T−t) + ∇_x log q_{T−t}(X_t|ξ_0)] dt + dB̃_t.    (5)

If we have q_T(·|ξ) ≈ q_ref for all ξ and an approximation s_θ(x, ξ, t) to ∇_x log q_t(x|ξ), we can obtain approximate samples from q_0(·|ξ*) = p_data(·|ξ*) for any given ξ* by initialising X_0 ∼ p_0 := q_ref, simulating the reverse dynamics in Equation (5) with ∇_x log q_{T−t}(X_t|ξ_0) replaced by s_θ(X_t, ξ*, T−t), and setting x_0 = X_T. To learn s_θ(x, ξ, t), we minimise the conditional analogue of Equation (3),

∫_0^T E_{p_data(x_0, ξ_0) q_{t|0}(x_t|x_0)} [ ||s_θ(x_t, ξ_0, t) − ∇_{x_t} log q_{t|0}(x_t|x_0)||² ] dt.

Score matching
The objective I_DSM defined in Equation (3) can also be interpreted as a score matching objective. Score matching was introduced by Hyvärinen (2005) as a method for fitting unnormalised probability distributions defined on R^d. It approximates a true distribution q_0(x) with a family of distributions p(x; θ) = q(x; θ)/Z(θ) by minimising

E_{q_0(x)} [ ||∇_x log p(x; θ) − ∇_x log q_0(x)||² ],

known as an explicit score matching loss. This objective is intractable since it depends on ∇_x log q_0(x), but there are methods for rewriting it in an equivalent tractable form, including implicit and denoising score matching (Hyvärinen, 2005; Vincent, 2011). Equation (3), which corresponds to denoising score matching, can also be written in explicit, implicit or sliced score matching form (Huang et al., 2021).
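The equivalence between the explicit and implicit forms can be verified by Monte Carlo in one dimension. The sketch below (our illustration) uses q_0 = N(0, 1) and a linear family s_θ(x) = θx, for which Hyvärinen's identity predicts a θ-independent gap between the two objectives equal to E[(d/dx log q_0(x))²] = 1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hyvarinen's identity on R: for any smooth s,
#   E_q[(s(x) - d/dx log q(x))^2] = E_q[s(x)^2 + 2 s'(x)] + const,
# where the constant E_q[(d/dx log q)^2] does not depend on s.  We check
# this by Monte Carlo for q = N(0, 1) and the linear family s(x) = theta*x.
x = rng.standard_normal(500_000)

def esm(theta):  # explicit form: compares against the true score -x
    return np.mean((theta * x - (-x)) ** 2)

def ism(theta):  # implicit form: needs only s and its derivative
    return np.mean((theta * x) ** 2 + 2 * theta)

gaps = [esm(th) - ism(th) for th in (-2.0, 0.0, 0.5, 3.0)]
print(gaps)  # all close to E[x^2] = 1, independent of theta
```

Since the gap is constant in θ, both objectives have the same minimiser, which is why the implicit form can be used when ∇ log q_0 is unknown.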

A general framework for denoising Markov models
In this section, we set out a general framework for DMMs. First, we explain how to construct a DMM on an arbitrary state space with a forward noising process Y and a backward generative process X. Second, we derive an expression for the model likelihood in terms of an expectation over an auxiliary process Z, defined in terms of X and running forward in time. Third, we derive an ELBO by using Girsanov's theorem to relate the expectation over Z to an expectation over Y. Finally, we show how this ELBO can be used to obtain a tractable training objective. Throughout, we follow the method of Huang et al. (2021) closely. For simplicity, we first present the framework for unconditional generation and then explain how to adapt it for inference.

Notation and set-up
Our data is assumed to be distributed according to p_data on a state space X. We assume only that X comes with some reference measure ν, with respect to which all probability densities will be defined, and satisfies some mild regularity conditions given in Appendix B.1. This includes R^d, discrete state spaces and Riemannian manifolds. Our DMM consists of a noising process (Y_t)_{t∈[0,T]} and a generative process (X_t)_{t∈[0,T]}, which are Markov processes. We consider Y fixed and learn X to approximate the reverse of Y. Initially, we must fix a class of processes to which X and Y belong and within which we will optimise X. The particular class and parameterisation we choose will necessarily depend on X, but a typical choice for X = R^d would be a diffusion (see Example 1), while a typical choice when X is a finite discrete space may be a continuous-time Markov chain (CTMC) (see Example 2).
As X and Y are not necessarily time-homogeneous, it is helpful to define the extended processes X and Y by, for example, setting X_t = X_T for t ≥ T and letting X = (X_t, t)_{t≥0}. Then X, Y are time-homogeneous Markov processes on the extended space S := X × [0, ∞).
In general, it is most convenient to define X and Y via their generators, which we denote by K and L respectively. Informally, the generator of a Markov process W with state space S is an operator A which acts on a subset D(A) of the space of functions f : S → R and satisfies Af = lim_{s→0} (P_s f − f)/s, where (P_s)_{s≥0} is the transition semigroup associated to W, defined by P_s f(x) = E[f(W_s) | W_0 = x]. For a more formal definition, see Appendix A.1. We denote the time marginals of the processes X and Y by p_t(x) and q_t(x) respectively, and make some mild smoothness assumptions on p, listed in Appendix B.2. In general, we also assume that K and L satisfy some mild regularity conditions, listed in Appendix B.3. One consequence of our assumptions is that the operator K decomposes as K = ∂_t + K̃, where K̃ operates only on the spatial variables of a function f. We can therefore view K̃ as an operator on functions on X, rather than on functions on S, and we denote by K̃* the adjoint of K̃ acting on functions on X (see Appendix A.2).
Example 2 (Discrete Space CTMC). If X and Y are CTMCs, then K = ∂_t + A and L = ∂_t + B, where A and B are the time-dependent generator matrices of X and Y. In this case, K̃* = A^T, the transpose of A.
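For a finite state space as in Example 2, the defining limit of the generator is easy to check numerically against the semigroup P_s = expm(sA). A small sketch (our illustration; the truncated Taylor series for the matrix exponential keeps the snippet self-contained):

```python
import numpy as np

# For a CTMC, the generator is a matrix A with non-negative off-diagonal
# rates and rows summing to zero, and the transition semigroup is
# P_s = expm(s * A).  We check A f = lim_{s->0} (P_s f - f)/s numerically
# on a random 4-state chain.
rng = np.random.default_rng(3)
n = 4
A = rng.uniform(0, 1, (n, n))
np.fill_diagonal(A, 0.0)
np.fill_diagonal(A, -A.sum(axis=1))   # rows sum to zero

def expm(M, terms=25):
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

s = 1e-4
f = rng.standard_normal(n)
finite_diff = (expm(s * A) @ f - f) / s
print(np.max(np.abs(finite_diff - A @ f)))  # O(s) discretisation error
```
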

An expression for the model likelihood
We now derive an expression for the model likelihood p_T(x). First, under our assumptions, a generalised form of the Fokker–Planck equation, stated precisely in Appendix C, implies that ∂_t p_t(x) = K̃*p_t(x) for ν-almost every x ∈ X. Typically, the adjoint operator K̃* resembles the generator of another process in the same class as X and Y. We formalise this idea by making the following assumption.
Then we can write the equation ∂_t p = K̃*p in the form Mv + cv = 0 for some function c : S → R, where M is the generator of another auxiliary Feller process Z = (Z_t, t)_{t≥0} on S.
Example 4 (Discrete Space CTMC). In the CTMC case, if c_x = Σ_{y∈X} A_yx and D_xy = A_yx − c_x 1_{x=y}, then M = ∂_t + D is the generator of a CTMC and Assumption 1 is satisfied. Here c has a natural interpretation as a "discrete divergence".
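Example 4 is easy to verify numerically: starting from a random generator matrix A, the matrix D built from the discrete divergence c is again a valid CTMC generator. A minimal sketch (our illustration):

```python
import numpy as np

# Example 4 in code: given the generator matrix A of the generative CTMC,
# set c_x = sum_y A_{yx} (a column sum, the "discrete divergence") and
# D_{xy} = A_{yx} - c_x 1_{x=y}.  Then D has non-negative off-diagonal
# entries and zero row sums, i.e. it is again a valid CTMC generator, so
# M = d_t + D generates the auxiliary process Z.
rng = np.random.default_rng(4)
n = 5
A = rng.uniform(0, 1, (n, n))
np.fill_diagonal(A, 0.0)
np.fill_diagonal(A, -A.sum(axis=1))

c = A.sum(axis=0)              # c_x = sum over y of A_{yx}
D = A.T - np.diag(c)           # D_{xy} = A_{yx} - c_x 1_{x=y}

off = D - np.diag(np.diag(D))
print(np.max(np.abs(D.sum(axis=1))))   # rows of D sum to zero
print(off.min())                       # off-diagonal entries >= 0
```
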
In general, we make two smoothness assumptions on c and v, given in Appendix B.4. Given the Fokker–Planck equation and Assumption 1, we apply a generalised form of the Feynman–Kac theorem (see Appendix C) to Z and v to obtain an expression, Equation (6), for the model likelihood, generalising that of Huang et al. (2021). This expresses the likelihood as an expectation over the auxiliary process Z. We next make it tractable by converting it into an expectation over Y.

Deriving a tractable lower bound on the model log-likelihood
We would like to train our model by finding a reverse process X which maximises the likelihood in Equation (6). Unfortunately, this expression is intractable, but we can find a tractable lower bound E_∞ for log p_T(x), which can then be used as a surrogate objective. Taking logarithms in Equation (6) and applying Jensen's inequality yields a bound, Equation (7), involving the Radon–Nikodym derivative dP/dQ, where P and Q are the path measures of the processes Z and Y respectively.
To write E_∞ in a tractable form we need to evaluate log(dP/dQ), which we do using a generalisation of Girsanov's theorem. To apply this result, we require that the generators of the auxiliary process and the noising process are related in the following way.
Since M is defined in terms of K, we think of Assumption 2 as forcing a particular parameterisation of the generative process in terms of α. In general, not every generative process in the same class as L will have such a parameterisation. However, the true time-reversal of L can always be parameterised in this way with α(x, t) = 1/p_t(x)†, so this parameterisation is sufficient to capture the optimal generative process. In addition, the objective in Theorem 1 below can often be interpreted and used for a much broader set of generative processes than those which satisfy Assumption 2.
Under Assumption 2, along with a further technical assumption given in Appendix B.5, we may apply a generalised form of Girsanov's theorem (see Appendix C) and Dynkin's formula (see Appendix A.1) to evaluate log(dP/dQ). In addition, combining Assumption 2 with f = v and Assumption 1 gives c = α^{−1}Lα − (vα)^{−1}L(vα). This allows us to rewrite the ELBO from Equation (7) in terms of the noising process. The final step required to obtain a tractable expression for E_∞ is to remove the function v from this expression, for which we use the following lemma (see Appendix D).
Theorem 1. For a DMM as in Section 3.1, the log-likelihood is lower bounded by the tractable expression E_∞.

Finding suitable training objectives
Based on Theorem 1, we fit our generative model by maximising the expectation of E_∞ with respect to p_data. This is equivalent to minimising an objective which we denote I_ISM(β), given in Equation (9). Since q_t and L̃ are determined by the noising process, which is known and assumed easy to sample from, I_ISM(β) and its gradient with respect to β can be estimated in an unbiased fashion. Learning β is equivalent to learning α, which parameterises M via Assumption 2, and thus K through Assumption 1. So, minimising I_ISM(β) over β is equivalent to learning the generative process in some parameterisation.

†Although this α is not bounded, for the application of Girsanov's theorem we in fact only require the weaker condition that a certain stochastic process defined in terms of α is a martingale, and this condition should hold for α(x, t) = 1/p_t(x) for sufficiently smooth generative processes.
Inspired by the relationship with score matching (see Section 4 below), we call Equation (9) the implicit score matching objective for this DMM. We can also derive an equivalent denoising score matching objective, I_DSM(β), given in Equation (10) (see Appendix E for the proof). These objectives generalise the following previously studied instances of diffusion models. For all derivations and remarks on the choice of parameterisation, see Appendix F.
Example 5 (Real Diffusion). In the setting of Example 1, Assumption 2 recovers the score parameterisation of Section 2.1, and Equation (10) reduces to the denoising score matching objective of Equation (3).

Example 6 (Discrete Space CTMC). In the setting of Example 2, Assumption 2 reduces to A_yx = (α(y, t)/α(x, t)) B_xy for all x ≠ y. We may rewrite I_ISM in terms of A to recover the objective of Campbell et al. (2022).

Example 7 (Riemannian Manifolds). If X is a Riemannian manifold and we take K = ∂_t + µ·∇ + ½∆ and L = ∂_t + b·∇ + ½∆, where ∆ is the Laplace–Beltrami operator associated to X, and perform the reparameterisation s_θ(x_t, t) = −∇ log α(x_t, t), then we recover the framework for training diffusion models on Riemannian manifolds given in De Bortoli et al. (2022) and Huang et al. (2022).
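As a sanity check on the parameterisation in Example 6, consider a forward CTMC started at a reversible stationary distribution π. Then q_t = π for all t, the optimal choice is α(x, t) = 1/π(x), and detailed balance implies that the parameterised reverse generator coincides with the forward one: the time-reversal of a stationary reversible chain is the chain itself. A minimal numerical sketch (our illustration):

```python
import numpy as np

# Reverse rates from Example 6: A_{yx} = (alpha(y)/alpha(x)) B_{xy}, with
# the true time-reversal taking alpha(x, t) = 1/q_t(x).  If the forward
# chain B is reversible with stationary distribution pi and is started at
# pi, then q_t = pi for all t and detailed balance gives A = B.
rng = np.random.default_rng(5)
n = 4
pi = rng.uniform(0.5, 1.5, n)
pi /= pi.sum()
S = rng.uniform(0, 1, (n, n))
S = (S + S.T) / 2                 # symmetric
B = S * pi[None, :]               # B_{xy} = S_{xy} * pi_y: detailed balance
np.fill_diagonal(B, 0.0)
np.fill_diagonal(B, -B.sum(axis=1))

alpha = 1.0 / pi
A = np.zeros_like(B)
for x in range(n):
    for y in range(n):
        if x != y:
            A[y, x] = (alpha[y] / alpha[x]) * B[x, y]
np.fill_diagonal(A, -A.sum(axis=1))

print(np.max(np.abs(A - B)))  # zero up to floating point
```
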

Inference
To use DMMs for inference, we follow a similar procedure to Section 2.2. To noise a sample (x_0, ξ_0) ∼ p_data, we set Y_0 = x_0 and let Y evolve according to L. To generate x_0 conditioned on an observation ξ*, we use a generative process X^{ξ*} conditioned on ξ*. We parameterise X^{ξ*} in terms of a function α(x_t, ξ*, t) which now takes ξ* as an input.
We aim to learn X ξ * to approximate the time-reversal of Y conditioned on ξ * . The following extension of Theorem 1 (proved in Appendix D) gives us a way to do this.
Theorem 2. With the above set-up, minimising the corresponding conditional denoising score matching objective is equivalent to maximising a lower bound on the expected model log-likelihood.
Since q_{t|0}(x_t|x_0) is known, we may fit our model by calculating an empirical estimate of this objective based on samples (x_0, ξ_0) drawn from p_data and minimising over α. Then, we generate samples from p_data(x_0|ξ*) by initialising X^{ξ*}_0 ∼ p_0, simulating the reverse process with generator K parameterised by α = α(·, ξ*, ·), and setting x_0 = X^{ξ*}_T.

Score matching on general state-spaces
When X and Y are real diffusions, the objective I_DSM(β) in Equation (10) becomes the score matching objective in Equation (3). Similarly, the objective I_ISM(β) from Equation (9) reduces to the implicit score matching objective introduced by Hyvärinen (2005). This suggests we can view Equations (9) and (10) as generalisations of score matching objectives to arbitrary state spaces. Given a state space X on which we have a Markov process generator L and an unknown distribution q_0(x) we wish to approximate, the corresponding generalised implicit score matching method learns an approximation β(x) to q_0(x) by minimising an objective J_ISM(β). We can show that J_ISM is equivalent to a generalised explicit score matching objective J_ESM. In addition, we define the corresponding generalised denoising score matching method, which learns an approximation β_τ(x_τ) to the noised distribution q_τ(x_τ), formed by sampling x_0 ∼ q_0(·) and x_τ ∼ q_{τ|0}(·|x_0), where q_{τ|0} is the transition probability associated to L run for time τ; it does this by minimising an objective J_DSM. It can be shown that J_DSM is equivalent to both J_ISM and J_ESM when used to learn the smoothed distribution q_τ(x_τ) (see Appendix E).
To illustrate the similarities between these definitions and traditional score matching on R^d, we define the score matching operator Φ(f) = f^{−1}Lf − L log f. Note that the time component of Φ cancels, so we can view it as an operator on X. The generalised explicit score matching objective can then be written in terms of Φ. In the general case, we interpret Φ(f) as roughly measuring the magnitude of a logarithmic gradient of f.

Proposition 1. Let Y be a Feller process with semigroup operators (Q_t)_{t≥0}, generator L and associated score matching operator Φ. Then: (a) Φ is non-negative; and (b) Φ quantifies the instantaneous rate of decrease of KL(π_1 Q_t || π_2 Q_t), where KL(π_1 Q_t || π_2 Q_t) denotes the Kullback–Leibler divergence between π_1 Q_t and π_2 Q_t.
Proposition 1(a) shows that Φ is always non-negative, so J ESM is minimised if and only if β(x) ∝ q 0 (x). Thus minimising any of our generalised score matching objectives really does correspond to learning an approximation to q 0 . Proposition 1(b) was proved for score matching on R d by Lyu (2009). It suggests we can interpret score matching as finding an approximation β which minimises the decrease in KL divergence between q 0 and β caused by adding an infinitesimal amount of noise to both according to L.
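On a finite state space, the non-negativity of Φ asserted in Proposition 1(a) can be seen directly: writing r = f(y)/f(x), each off-diagonal term of Φ(f)(x) is L_xy(r − 1 − log r) ≥ 0, since r − 1 − log r ≥ 0 for all r > 0. A quick numerical check (our illustration):

```python
import numpy as np

# The score matching operator Phi(f) = f^{-1} L f - L log f for a CTMC
# generator L.  Using the zero row sums of L, Phi(f)(x) is a sum over
# y != x of L_{xy} * (r - 1 - log r) with r = f(y)/f(x), each term of
# which is non-negative, so Phi(f) >= 0 pointwise.
rng = np.random.default_rng(6)
n = 6
L = rng.uniform(0, 1, (n, n))
np.fill_diagonal(L, 0.0)
np.fill_diagonal(L, -L.sum(axis=1))

for _ in range(100):
    f = rng.uniform(0.1, 10.0, n)        # arbitrary positive function
    phi = (L @ f) / f - L @ np.log(f)
    assert phi.min() >= -1e-9            # non-negative up to rounding

print("Phi(f) >= 0 for all sampled f")
```
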
Our generalised score matching methods give a principled way to extend score matching to fit unnormalised probability distributions on arbitrary spaces. Other extensions of score matching have been explored, notably ratio matching (Hyvärinen, 2007) and marginalisation with generalised score matching (Lyu, 2009). However, these methods lack the generality of our framework and do not respect the intuition coming from R d that Proposition 1(b) should hold.

Relationship to discrete time models
Denoising diffusion models were originally introduced in discrete time by Sohl-Dickstein et al. (2015). In this setting, the noising and generative processes are Markov chains x_{0:T} = (x_{t_k})_{k=0}^N observed at a sequence of times 0 = t_0 < t_1 < · · · < t_N = T, with fixed forwards transition kernel q̃(x_{t_k}|x_{t_{k−1}}) and learned backwards kernel p̃_θ(x_{t_{k−1}}|x_{t_k})‡. To fit discrete-time diffusion models, Sohl-Dickstein et al. (2015) minimise the Kullback–Leibler divergence between the joint distributions of the two chains,

KL( q̃(x_{t_0:t_N}) || p̃_θ(x_{t_0:t_N}) ),    (11)

with respect to θ. Given any DMM with generators K, L and marginals p_t, q_t as in Section 3, we define its natural discretisation to be the discrete-time model with q̃(x_{t_k}|x_{t_{k−1}}) = q_{t_k|t_{k−1}}(x_{t_k}|x_{t_{k−1}}) and p̃_θ(x_{t_{k−1}}|x_{t_k}) = p_{T−t_{k−1}|T−t_k}(x_{t_{k−1}}|x_{t_k}). Then, the Kullback–Leibler divergence (11) for the natural discretisation can be viewed as a first-order approximation to I_ISM for the continuous-time model.
Lemma 2. Suppose X, Y are fixed generative and noising processes with marginals p, q as in Section 3, and suppose that they are related as in Assumptions 1 and 2 for some sufficiently regular function α. Then, for any 0 < s < t < T with γ = t − s, the Kullback–Leibler divergence between the forward and reversed transition kernels over [s, t] agrees with the corresponding contribution to I_ISM, to first order in γ. Applying this lemma on each interval [t_k, t_{k+1}], we get the following theorem.
Theorem 3. For any DMM, the objective (11) for its natural discretisation is equivalent to the natural discretisation of I ISM to first order in γ = max k=0,...,N −1 |t k+1 − t k |.
This theorem generalises to arbitrary state spaces a result of Ho et al. (2020), which demonstrated the equivalence of minimising (11) and the score matching objective for real state spaces. For the proofs of Lemma 2 and Theorem 3, see Appendix H.

‡We use generic time indices t_0, . . . , t_N rather than the more standard 0, . . . , N for compatibility with the notation in continuous time.

Fig. 2. Posterior kernel density estimates of samples generated using our DMM, SA-ABC and W-SMC for the g-and-k distribution example, with x_true = (3, 1, 2, 0.5) and N = 250.
Lemma 2 also implies a general equivalence between one-step denoising autoencoders and score matching. Vincent (2011) discussed this equivalence for autoencoders using Gaussian noise in R d , but our methods allow us to extend this correspondence to arbitrary state spaces and noising processes. For more details, see Appendix I.

Experiments
We now present experiments demonstrating DMMs on several tasks and data spaces, for unconditional generation and conditional simulation. All details are in Appendix J.
6.1. Inference on R^d using diffusion processes

First, we use diffusion processes in R^d to perform approximate Bayesian inference for real-valued parameters. We consider p_data(ξ|x) = ∏_{i=1}^N p_data(ξ_i|x), where p_data(ξ_i|x) is the g-and-k distribution with parameters x = (A, B, g, k) and d = 4, and we let p_data(x) be uniform on [0, 10]^4. The g-and-k distribution is a four-parameter distribution in which A, B, g and k control the location, scale, skewness and kurtosis respectively.
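The g-and-k distribution is easiest to handle through its quantile function, which also gives a direct sampler for the observations ξ. A sketch (our illustration, using the conventional constant c = 0.8):

```python
import numpy as np

# Sampling from the g-and-k distribution via its quantile function: with
# z = Phi^{-1}(u) a standard normal draw and the conventional c = 0.8,
#   x = A + B * (1 + c*(1 - exp(-g*z))/(1 + exp(-g*z))) * (1 + z^2)^k * z.
# A, B, g, k control location, scale, skewness and kurtosis respectively.
def sample_gk(A, B, g, k, n, rng):
    z = rng.standard_normal(n)
    c = 0.8
    skew = (1 - np.exp(-g * z)) / (1 + np.exp(-g * z))
    return A + B * (1 + c * skew) * (1 + z**2) ** k * z

rng = np.random.default_rng(7)
xi = sample_gk(3.0, 1.0, 2.0, 0.5, 250, rng)  # x_true from the experiment
print(xi[:5])
```

Since x = A exactly when z = 0 and the multiplying factor is positive, the median of the distribution is A, which gives a quick sanity check on the sampler.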
We fix our noising process to be an Ornstein-Uhlenbeck process, and parameterise our reverse process as in Example 5, with s θ (x, ξ, t) being given by a fully connected neural network. To train the model, we sample (x 0 , ξ 0 ) ∼ p data and minimise the denoising score matching objective from Section 3.5 via stochastic gradient descent on θ.
To test our model, we first consider the case where there is a true set of underlying parameters x_true = (3, 1, 2, 0.5). We generate an observation ξ_0 ∼ p_data(ξ_0|x_true) with N = 250, sample from the approximate posterior using our DMM and plot the results in Fig. 2. We compare our method with the semi-automatic ABC (SA-ABC) and Wasserstein SMC (W-SMC) methodologies. We see in Fig. 2 that our inference model achieves more accurate posterior estimation for all parameters except the kurtosis parameter k.
Next, we demonstrate that our model can perform inference for a range of observation values ξ* simultaneously. We generate a series of 512 parameter values x_0 drawn from p_data(x_0) and draw an observation ξ_0 from p_data(ξ_0|x_0) with N = 10000 for each x_0. Then, we generate a sample x̂_0 from our approximation to the posterior p_data(x_0|ξ_0) for each ξ_0. We plot each component of the pairs (x_0, x̂_0) in Fig. 3. We see our model is able to infer the original parameters across a range of parameter values. Importantly, our model can perform amortised inference on the whole parameter space, in contrast to sequential ABC methodologies, which require further sampling from p_data(ξ|x) in local parameter regions for every fixed set of observations ξ*.
6.2. Image inpainting and super-resolution using discrete-space CTMCs

Second, we demonstrate that our framework is applicable to large-scale Bayesian inverse problems, such as super-resolution and inpainting for images. For these problems, the prior p_data(x) is the distribution of images. Most ABC techniques, such as SA-ABC and W-SMC, are not applicable as they require an analytical expression for this prior, whereas DMMs do not rely on such an expression.
We consider performing image inpainting for MNIST digit images, where each image x_0 has 28 × 28 pixels with values in {0, . . . , 255}, and the observed incomplete image ξ_0 has the middle 14 × 14 pixels missing. Since our state space X = {0, . . . , 255}^{28×28} is discrete, we use the set-up of Example 2 and let the generator of our noising process factorise over pixel dimensions. We use the denoising parameterisation of the reverse process (see Appendix F.2) and train by minimising the form of the objective in Example 6.
To test our model, we plot the reconstructed image samples for a number of digits in Fig. 4. We observe that the samples we obtain are consistent with conditioning and appear to be realistic, but also display diversity in the shape of the strokes.
In addition, we train a conditional discrete-space DMM to perform super-resolution on ImageNet images to demonstrate that this method provides perceptually high quality samples even in very high-dimensional scenarios. For details, see Appendix J.3.

Modelling distributions on SO(3) using manifold diffusions
Finally, we demonstrate that DMMs can approximate distributions on manifolds using two tasks on SO(3). Since SO(3) is a Lie group and hence a Riemannian manifold, we use the framework from Example 7. As our noising process, we use Brownian motion with generator L = ∂_t + ½∆. We can explicitly calculate the transition kernels q_{t|0}(x_t|x_0) for this process, allowing us to use the denoising score matching objective. We parameterise this objective in terms of a neural network approximation s_θ(x, t) of the score. This is in contrast to De Bortoli et al. (2022), in which the explicit transition kernels are not used for sampling the forward process or in the loss function, both of which require further approximations.

Fig. 4. Samples from the MNIST inpainting task. The first column in each set plots the ground truth images, and the second column has the centre 14 × 14 pixels missing.

Fig. 5. Samples from the ground truth and our DMM approximation to the mixture of wrapped normal distributions. Each sample is denoted by a point, whose position represents the axis of rotation and whose colour represents the angle of rotation. Stars denote the true cluster means.
First, we check that our DMM can learn a simple mixture of wrapped normal distributions, where each component is a wrapped normal distribution on SO(3) with mean µ_m and variance σ_m² (De Bortoli et al., 2022). We plot samples from the resulting DMM in Fig. 5. We see that our model provides a good fit to p_data(x), covering all modes. In Appendix J.5, we provide additional results and show that we can also sample from the class-conditional density p_data(x|m).
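For intuition, samples from an isotropic wrapped normal on SO(3) can be produced by pushing a Gaussian tangent vector through the exponential map, and composing many small such steps gives an Euler-style simulation of the Brownian noising process. A minimal sketch (our illustration; the step count and variance are arbitrary, and we use Rodrigues' formula for the exponential map):

```python
import numpy as np

# Map a tangent vector w in the Lie algebra so(3) to a rotation matrix via
# the exponential map (Rodrigues' formula).  Drawing w ~ N(0, sigma^2 I_3)
# and exponentiating samples an isotropic wrapped normal centred at the
# identity; composing many small such rotations simulates Brownian motion
# on SO(3).
def expmap(w):
    theta = np.linalg.norm(w)
    K = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    if theta < 1e-12:
        return np.eye(3)
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1 - np.cos(theta)) / theta**2 * (K @ K))

rng = np.random.default_rng(8)
R = np.eye(3)
for _ in range(100):                  # Brownian motion in 100 small steps
    R = R @ expmap(0.05 * rng.standard_normal(3))

print(np.allclose(R.T @ R, np.eye(3)), np.linalg.det(R))
```

Each increment stays exactly on SO(3), so the composed sample remains a valid rotation (orthogonal with determinant one) regardless of the number of steps.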
Second, we consider a more realistic pose estimation task on the SYMSOL dataset, which requires predicting the 3D orientation of various symmetric 3D solids based on 2D views (Murphy et al., 2021). Due to the rotational symmetries, a key challenge is to predict all possible poses when only one possibility is presented in training. We use a conditional DMM where ξ is the 2D image view. Fig. 6 shows two sets of samples from our model conditioned on 2D images of two different solids. We see that our model learns to sample from the ground truth accurately and infer the full set of rotational symmetries for different views ξ. For further experimental details and plots, see Appendix J.6.

Discussion
Over the past two years, denoising diffusion models have become very popular in machine learning as they display remarkable empirical performance in a large variety of domains. We have provided here a general framework which allows us to extend denoising diffusion models to general state spaces. We have shown how the resulting DMMs can be trained with principled objectives and used for inference, generalising score matching ideas along the way. Finally, we have demonstrated them on a range of problems. From a methodological point of view, the proposed framework is general enough to accommodate, for example, general noising processes, infinite-dimensional processes which could be used to model functional data, or mixed continuous/discrete processes.
However, we still lack a proper theoretical understanding of these models. Under realistic assumptions on the data distribution, such as the manifold hypothesis, De Bortoli (2022) and Chen et al. (2022) show that diffusion models on R^d can in theory learn essentially any distribution given a good enough score approximation and infinite data. However, finite-sample guarantees are currently absent. Moreover, p_data is typically an empirical measure, as we only have access to a finite set of datapoints, so q_t is a mixture of Gaussians for an Ornstein–Uhlenbeck noising diffusion and its score ∇ log q_t is thus available. If we were simulating samples using the exact time-reversal of this diffusion, we would simply recover the empirical distribution. It is because we are approximating the time-reversal, and in particular using an approximation of the scores, that we are able to obtain novel samples. It is not yet clear why the approximation of the score using neural networks appears to provide perceptually realistic samples in many applications.
The effectiveness of such methods for inference, even in scenarios where standard MCMC or ABC techniques are not applicable (Sharrock et al., 2022), may also be considered surprising. One perspective on the training process is that it involves the model constructing its own summary statistics that allow it to perform inference effectively on the training observations. It is not yet well understood why the summary statistics the model learns appear empirically effective, or what sorts of summary statistics our training procedure biases the model towards.
Overall, this contribution shows how a range of existing models relate to each other and may help in applying DMMs in practice to a large variety of problems. However, our understanding of such models is still incomplete and deserves further attention.

A. Background on Feller processes
We recall some basic definitions and properties associated with Feller processes which we use for the derivations in Section 3. Our principal source is Dong (2003).
A.1. Definition of a Feller process
Let S be a locally compact, separable metric space and let C_0(S) denote the set of continuous functions f : S → R such that for any ε > 0 there exists a compact K ⊆ S such that |f(x)| < ε for all x ∉ K. Also, let ||f|| denote the supremum norm on C_0(S).
Definition 1 (Feller process). A time-homogeneous Markov process (X_t)_{t≥0} with state space S and associated transition semigroup (P_t)_{t≥0} is a Feller process if:
• P_t f ∈ C_0(S) for all f ∈ C_0(S) and t ≥ 0;
• ||P_t f − f|| → 0 as t → 0 for all f ∈ C_0(S).
Definition 2 (Generator of a Feller process). Suppose X is a Feller process on S as above and f is a function in C_0(S). If the limit Af := lim_{t→0} (P_t f − f)/t exists in C_0(S) with respect to the supremum norm, we say that f is in the domain of the generator of X. We call the operator A defined in this way the generator of X and denote its domain by D(A).
In the main text, we are concerned with Feller processes X, Y defined on the extended space S = X × [0, ∞) which are constructed by taking a time-inhomogeneous Markov process X on X and defining X = (X t , t) t≥0 . In this setting, we have the following variant of Dynkin's formula.
Lemma 3 (Dynkin's formula). If X = (X_t, t)_{t≥0} is a Feller process on S with generator A and f ∈ D(A), then f(X_t, t) − f(X_0, 0) − ∫_0^t Af(X_s, s) ds is a martingale with respect to the natural filtration of X.

A.2. Adjoint of a generator
Given a state space S and a reference measure ν on S, we can define an inner product on C_0(S) by letting ⟨f, h⟩ = ∫_S f(x) h(x) ν(dx) for all f, h ∈ C_0(S) such that the integral exists. This induces a Hilbert space structure on C_0(S) and allows us to make the following definition, from Yosida (1965).
Definition 3 (Adjoint of an operator). Given an operator A with domain D(A) contained in C_0(S), we define the adjoint operator A* acting on a function f ∈ C_0(S) by requiring ⟨A*f, h⟩ = ⟨f, Ah⟩ for all h ∈ D(A). The domain D(A*) of A* is the set of all functions f such that there exists some function A*f for which the above holds.

B. Assumptions for Section 3
B.1. Assumptions on the state space X Assumption 3. Our state space X is a locally compact, separable metric space. In addition, there exists a reference measure ν on X with respect to which all relevant probability distributions are absolutely continuous.

B.2. Assumptions on the marginals p and q
Assumption 4. We have p_t ∈ D(K̂*) for each t ∈ [0, T], where K̂* is the adjoint of K̂, the spatial part of the operator K. In addition, p is differentiable with respect to t and ∂_t p is bounded.

B.3. Assumptions on the generators K and L
Assumption 5. X and Y are Feller processes with associated transition semigroups (P t ) t≥0 , (Q t ) t≥0 and generators K, L respectively.
Assumption 6. K decomposes as K = ∂_t + K̂, where K̂f is defined only in terms of the spatial arguments of f, so we may view K̂ as an operator on (a subset of) C_0(X).
Assumption 7. There exists a subset D_0 ⊆ D(K̂) ∩ L²(X, ν) which is dense in L²(X, ν), satisfies K̂h ∈ D_0 for all h ∈ D_0, and such that every function in D_0 is bounded and has compact support.

B.4. Assumptions on M and c
Assumption 8. The function c : S → R is bounded, and the function v : S → R is bounded, lies in D(M) and satisfies ∫_0^T E[|Mv(Z_s, s)|²] ds < ∞.

C. Stochastic process theory
We provide full statements of the general stochastic process results used in Section 3. For completeness, we also provide proofs of the given results adapted to our setting.
Theorem 4 (Fokker-Planck). Let (X_t)_{t∈[0,T]} be a Markov process with generator K and marginals p_t satisfying the assumptions in Appendix B. Then p satisfies the forward Kolmogorov equation ∂_t p_t = K̂* p_t for ν-almost every x.
Proof. For any h ∈ D_0, by Assumptions 4 and 7 we may write ⟨∂_t p_t − K̂* p_t, h⟩ = ∂_t ⟨p_t, h⟩ − ⟨p_t, K̂h⟩. Applying Dynkin's formula to f(x, t) = h(x), taking expectations and using Fubini's theorem, we see that ⟨p_t, h⟩ − ⟨p_0, h⟩ = ∫_0^t ⟨p_s, K̂h⟩ ds. Differentiating with respect to t, we deduce that ⟨∂_t p_t − K̂* p_t, h⟩ = 0. Since this holds for all h ∈ D_0 and D_0 is dense in L²(X, ν), we conclude that ∂_t p − K̂* p = 0 holds ν-a.e. as required.
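In the finite-state setting of Example 2, where K̂* is simply the transpose of the generator matrix, Theorem 4 can be checked numerically. A minimal sketch (our illustration), using a two-state chain whose marginals are known in closed form:

```python
import numpy as np

# 2-state CTMC with rates a: 0->1 and b: 1->0; the generator matrix A has
# zero row sums, and the forward Kolmogorov equation reads d/dt p = A^T p.
a, b = 1.5, 0.7
A = np.array([[-a, a],
              [b, -b]])

def solve_fokker_planck(p0, T, n_steps=20000):
    """Euler integration of the forward Kolmogorov (Fokker-Planck) ODE."""
    p = np.array(p0, dtype=float)
    dt = T / n_steps
    for _ in range(n_steps):
        p = p + dt * (A.T @ p)
    return p

p0 = np.array([1.0, 0.0])
T = 2.0
p_num = solve_fokker_planck(p0, T)

# Closed-form marginal for the two-state chain
pi0 = b / (a + b)
p_exact0 = pi0 + (p0[0] - pi0) * np.exp(-(a + b) * T)
```

Since the rows of A sum to zero, the Euler iteration conserves total probability exactly, mirroring the role of the dense test functions h in the proof.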
Theorem 5 (Feynman-Kac). Let Z = (Z_t, t)_{t≥0} be a Feller process on S with generator M. Suppose that we are given functions v, c : S → R and h : X → R such that v and c are bounded, v ∈ D(M), Mv + cv = 0 and v(x, T) = h(x) for all x ∈ X. Then v(x, τ) = E[exp(∫_τ^T c(Z_s, s) ds) h(Z_T) | Z_τ = x].
Proof. This result is well known in the case of real diffusion processes (Karatzas and Shreve, 1991). In the general case, the proof relies on the theory of semimartingales (see for example Métivier (1982)). Fix τ ∈ [0, T] and for all t ∈ [τ, T] define U_t = exp(∫_τ^t c(Z_s, s) ds), V_t = v(Z_t, t) and S_t = U_t V_t. Each of these processes is clearly a semimartingale, and so we may define dS_t, dU_t and dV_t accordingly (Métivier, 1982). The following lemma will allow us to express dS_t in terms of dU_t and dV_t.
Lemma 4 (Integration by parts for semimartingales). If U and V are semimartingales and at least one is continuous, then d(U_t V_t) = U_t dV_t + V_t dU_t + d[U, V]_t.
Proof. This is Theorem 2.7.4(ii) of Pulido (2011), or follows from applying Theorem 27.1 of Métivier (1982) to the function ϕ(U, V) = UV.
Since v ∈ D(M), by Dynkin's formula we have that V is a semimartingale and we may decompose dV_t = Mv(Z_t, t) dt + dM^v_t, where M^v_t is a martingale. Also, since c(x, t) is bounded, U is a continuous, adapted, previsible process of finite variation and satisfies dU_t = c(Z_t, t) U_t dt. In addition, note that d[U, V]_t = 0 since U is continuous and of finite variation. Therefore, by Lemma 4, we can calculate dS_t = U_t dV_t + V_t dU_t = U_t (Mv + cv)(Z_t, t) dt + U_t dM^v_t = U_t dM^v_t, where we have used that Mv + cv = 0 in the last line. Therefore, S can be expressed as a stochastic integral with respect to the martingale M^v.
The conditions we have imposed on c and v imply that U is bounded and M^v is square-integrable. It follows, for example from Theorem 24.4.5 in Métivier (1982), that S is a local martingale and hence, since it is also bounded, a true martingale. We then have that v(x, τ) = S_τ = E[S_T | Z_τ = x] = E[exp(∫_τ^T c(Z_s, s) ds) h(Z_T) | Z_τ = x], as required.
Theorem 6 (Girsanov). Let Y = (Y_t, t)_{t≥0} and Z = (Z_t, t)_{t≥0} be Feller processes on S with generators L, M and path measures Q, P respectively, such that Y_0 and Z_0 have the same law. Suppose also that there exists a bounded, measurable function α : S → (0, ∞) in D(L) such that α^{-1}Lα is bounded, and such that Mf = α^{-1}(L(fα) − f Lα) for all functions f such that f ∈ D(M) and fα ∈ D(L). Then we have dP/dQ(ω) = (α(ω_T, T)/α(ω_0, 0)) exp(−∫_0^T (α^{-1}Lα)(ω_s, s) ds).
Proof. This essentially follows from the work of Palmowski and Rolski (2002). Using their terminology, their Proposition 3.2 implies α is a good function, so the RHS of Equation (13) is a martingale and we may define a measure P̂ by dP̂/dQ = (α(ω_T, T)/α(ω_0, 0)) exp(−∫_0^T (α^{-1}Lα)(ω_s, s) ds). Under the measure P̂, the canonical process (ω_t)_{t∈[0,T]} is still Markov. By the proof of their Theorem 4.2, we see that f(ω_t) − ∫_0^t Mf(ω_s, s) ds is a martingale under P̂ for all sufficiently smooth functions f, implying that M is the generator of (ω_t)_{t∈[0,T]} under P̂. It follows that (ω_t)_{t∈[0,T]} has the same law under P̂ as Z does under P, which is sufficient to prove the result since Y and Z are Feller.

D. Proof from Section 3
We give the proofs of Lemma 1 and Theorem 2 from Section 3.
Next, we can write Finally, note that L log α +L log β = ∂ t log α. Combining this with the final line above, we get the desired result.
Theorem 2. With the above set-up, minimising the objective is equivalent to maximising a lower bound on the expected model log-likelihood.
Proof. Applying Theorem 1 to the generative process X^{ξ*} conditioned on observation ξ*, then replacing ξ* by ξ_0, letting (x_0, ξ_0) ∼ p_data and taking expectations, we obtain a bound on the expected log-likelihood. For any given ξ, the resulting inner expectation can be rewritten by the argument of Appendix E (see below). Substituting ξ_0 for ξ and taking expectations over ξ_0 ∼ p_data, noting that q_{t|0}(x_t|x_0, ξ_0) = q_{t|0}(x_t|x_0), we find that the first term on the RHS is independent of the dynamics of the reverse process. Hence minimising the objective is equivalent to maximising a lower bound on E_{p_data(x_0, ξ_0)}[log p_T(x_0|ξ_0)], which is the expected model log-likelihood.

E. Equivalence of generalised score matching objectives
First, we show that I_ISM and I_DSM are equivalent training objectives.
where the constants depend only on the dynamics of the forward process and so are fixed during training. Integrating from t = 0 to t = T, we conclude that I_ISM and I_DSM are equivalent.
There is also an explicit score matching form of the general DMM training objective. To see that this is equivalent to I_ISM and I_DSM, observe E_{q_t(x_t)}[L(q_·(·)α(·, ·))(x_t, t)/(q_t(x_t)α(x_t, t)) − L log(q_·(·)α(·, ·))(x_t, t)] and integrate from t = 0 to t = T.

F. Application to particular spaces
In this section, we show how our general framework can be applied in some particular cases of interest, namely to real diffusion processes, continuous-time Markov Chains on finite discrete state spaces, and diffusions on Riemannian manifolds.
A recurring theme we see in each example is that the default parameterisation given by our framework in terms of α (or equivalently β) is sub-optimal, either because we expect it to lead to numerical instabilities when optimising the training objective, or because it only captures a restricted subset of the class of reverse processes we are interested in. However, in each case it turns out to be possible to reparameterise the generative process in a way which captures a wider class of processes and lets us interpret, and hence optimise, the training objective over this wider class. In addition, this reparameterisation typically leads to a form of the objective that we expect to be more numerically stable in practice.

F.1. Real vector spaces
We show how our framework recovers the setup of Song et al. (2021), described in Section 2.1, in the case where K and L are the real diffusion processes given in Example 1. For convenience, we recall that X and Y satisfy the SDEs dX_t = µ(X_t, t) dt + dB_t and dY_t = b(Y_t, t) dt + dB_t respectively, and the corresponding generators are K = ∂_t + µ · ∇ + ½∆ and L = ∂_t + b · ∇ + ½∆. First, we check the assumptions made in Appendix B. If we let our reference measure ν be the Lebesgue measure, then Assumption 3 holds. Assumption 5 is satisfied whenever b and µ are Lipschitz functions (Schilling and Partzsch, 2012, Corollaries 19.27 and 19.31), and Assumption 6 follows given the form of K above. For Assumption 7 we take D_0 = C_c^∞(R^d), the set of infinitely differentiable functions with compact support, and note that this is dense in L²(X, ν). Finally, we assume that the reverse process and p_0 are sufficiently regular that Assumptions 4, 8 and 9 hold.
Using integration by parts, we can calculate the adjoint of K̂. We have ∫ (K̂f) h dx = ∫ f (−∇ · (µh) + ½∆h) dx, assuming f and h are sufficiently regular that all boundary terms are zero. Therefore, K̂*h = −∇ · (µh) + ½∆h = −µ · ∇h − (∇ · µ)h + ½∆h. We see that Assumption 1 holds if we let c = −(∇ · µ) and M = ∂_t − µ · ∇ + ½∆, noting that this is the generator of another diffusion process Z satisfying the SDE dZ_t = −µ(Z_t, t) dt + dB_t. Given this form of L and M, Assumption 2 then becomes ∇ log α = −µ − b for some bounded measurable function α. This puts a restriction on the class of reverse processes K we may use; the condition that the drift µ must be expressible as −b − ∇ log α for some α is not automatically satisfied. However, the true time-reversal of the forward process will satisfy this property. In addition, we will show that we may reparameterise the training objective so that it can be interpreted for a broader class of reverse processes.
Assuming for the moment that Assumption 2 does hold, we can evaluate the relevant terms explicitly, and the denoising score matching objective becomes an integral over t of expectations under q_0(x_0) q_{t|0}(x_t|x_0). Looking at Equations (15) and (16) suggests that it is more natural to parameterise the reverse process in terms of s_θ(x, t) = −∇ log α(x, t) instead of α(x, t). Making this substitution, the objective becomes I_DSM(θ) = ∫_0^T E_{q_0(x_0) q_{t|0}(x_t|x_0)}[½ ‖s_θ(x_t, t) − ∇ log q_{t|0}(x_t|x_0)‖²] dt, recovering the objective of Song et al. (2021). Parameterising in terms of s_θ(x, t) rather than α(x, t) is preferable for a couple of reasons. First, s_θ(x, t) is targeting the score ∇ log q_t(x), while α(x, t) is targeting 1/q_t(x), and we expect the former to typically be an easier target. Second, while Equation (16) only makes sense when the forward and backward processes are related via Assumption 2, the objective in Equation (3) is valid for any forward and backward diffusion processes as in Equation (14). Hence reparameterising allows us to capture a wider class of reverse processes in our optimisation.
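The reparameterised objective admits a simple Monte Carlo implementation. The sketch below is a minimal illustration (not the paper's code), assuming an Ornstein-Uhlenbeck forward process so that the conditional score ∇ log q_{t|0} is Gaussian and explicit; `score_fn` stands in for the score network s_θ:

```python
import numpy as np

rng = np.random.default_rng(0)

def dsm_loss(score_fn, x0_batch, t):
    """Monte Carlo estimate of the denoising score matching integrand
    E || s_theta(x_t, t) - grad log q_{t|0}(x_t | x_0) ||^2 for the OU
    noising process dY = -1/2 Y dt + dB (a sketch; in practice s_theta
    is a neural network)."""
    mean = np.exp(-t / 2) * x0_batch
    var = 1.0 - np.exp(-t)
    xt = mean + np.sqrt(var) * rng.standard_normal(x0_batch.shape)
    target = -(xt - mean) / var        # conditional score grad log q_{t|0}
    diff = score_fn(xt, t) - target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

# Sanity check: if p_data is a point mass at x0, the marginal score equals
# the conditional score, and the loss vanishes at the exact score.
x0 = np.array([[0.7, -1.2]])
exact = lambda x, t: -(x - np.exp(-t / 2) * x0) / (1.0 - np.exp(-t))
loss = dsm_loss(exact, np.repeat(x0, 256, axis=0), t=0.5)
```

With a richer data distribution, the minimiser of this loss is the marginal score ∇ log q_t rather than the conditional one, which is what makes the objective tractable.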

F.2. Discrete state spaces
Next, we show how to apply our framework when X and Y are continuous-time Markov chains on a finite discrete state space as in Example 2. With a particular choice of parameterisation, we end up recovering the set-up of Campbell et al. (2022).
Recall that we start with K = ∂_t + A and L = ∂_t + B, where A and B are the time-dependent generator matrices of X and Y respectively. From this it follows immediately that K̂* = Aᵀ. We will use the counting measure as our reference measure ν.
On a finite discrete space, all functions are bounded and have compact support, and D(K) = D(L) = C 0 (S) is the set of all functions on X . Assumptions 3, 5, 6 and 7 follow immediately. In addition, we assume that the reverse process and p 0 are sufficiently regular that Assumptions 4, 8 and 9 always hold.
In order for Assumption 1 to hold, we need to find M and c such that M + c = ∂_t + K̂* (viewed as operators). Since M should be the generator of another CTMC, we write M = ∂_t + D for some generator matrix D. We then require D + c = Aᵀ, where c is viewed as a diagonal matrix and D must have zero row sums. This holds if and only if we take D_xy = A_yx for x ≠ y, D_xx = −∑_{y≠x} A_yx and c(x, t) = ∑_y A_yx. With this choice of M, Assumption 2 becomes ∑_y D_xy f(y) = α(x, t)^{-1} ∑_y B_xy α(y, t)(f(y) − f(x)) for all x ∈ X. If we pick two distinct x, y and set f(z) = 1_{z=y} in the above, we deduce α(x, t)D_xy = α(y, t)B_xy for all x ≠ y.
Hence for Assumption 2 to hold, we require A_yx = α(y, t)B_xy/α(x, t) for all x ≠ y. An elementary check also shows that this condition is sufficient for Assumption 2 to hold for a given choice of α.
With this parameterisation, the implicit score matching objective can be written out explicitly in terms of β. Unfortunately, fitting β directly using this objective is likely to perform poorly, for a couple of reasons. Firstly, the optimal value of β(x, t) is q_t(x), and so learning β(x, t) should be roughly as hard as targeting the marginals of the forward process directly. Secondly, the presence of β in the denominators can lead to numerical instabilities in regions where the forward process has low density.
Fortunately, at least two methods are available for avoiding these problems. The first is to find an equivalent formulation of the objective in terms of the generator of the reverse process, and then learn this generator using a denoising parameterisation. For x ≠ y, the relevant terms agree up to a constant which depends only on the dynamics of the forward process, which are fixed. We can therefore rewrite the objective, recovering that of Campbell et al. (2022). In addition, we can parameterise the reverse generator A as in Equation (18), where p_θ^{(t)}(x_0|x_t) is a learned estimate of the original datapoint x_0 given the noised observation x_t, and θ denotes the learnable parameters. This parameterisation should be more stable, as it avoids potentially exploding denominators, and we expect predicting the original datapoint given the noised datapoint to be an easier task than learning the marginals q_t(x). See Campbell et al. (2022) for more details on this denoising parameterisation.
The second method is to reparameterise our objective in terms of the ratios s_θ(x, y, t) = β(y, t)/β(x, t). Doing this, we obtain the training objective in Equation (19), and the generative process is then parameterised in terms of s_θ(x, y, t) via Equation (20). Importantly, this objective matches the generalised objective from Section 3 when the noising and generative processes are related by Assumption 2, and is still minimised when s_θ(x, y, t) = q_t(y)/q_t(x). This parameterisation is potentially beneficial for a couple of reasons. Firstly, by removing β(x, t) from the denominators, we expect the objective to be more numerically stable. Secondly, this parameterisation captures a wider class of potential reverse processes, since A is now given in terms of B via Equation (20), which is less restrictive than Equation (17).
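One can check numerically that plugging the exact ratios s(x, y, t) = q_t(y)/q_t(x) into the reverse generator inverts the noising process. A small sketch (our illustration, with an arbitrary 3-state generator matrix B): the forward marginals q_t are computed on a grid, the reverse rates are built from the ratios, and integrating the reverse Kolmogorov equation from q_T recovers q_0.

```python
import numpy as np

# 3-state noising CTMC with generator matrix B (zero row sums)
B = np.array([[-1.0, 0.6, 0.4],
              [0.3, -0.8, 0.5],
              [0.2, 0.9, -1.1]])
T, n = 1.0, 4000
dt = T / n
q0 = np.array([0.9, 0.05, 0.05])

# forward marginals q_t on a grid via the forward Kolmogorov equation
q = [q0]
for _ in range(n):
    q.append(q[-1] + dt * (B.T @ q[-1]))
q = np.array(q)

def reverse_rates(qt):
    """Reverse generator built from the exact ratios s(x, y, t) = q_t(y)/q_t(x):
    A_hat(x, y) = (q_t(y)/q_t(x)) * B(y, x) for x != y."""
    Ahat = (qt[None, :] / qt[:, None]) * B.T
    np.fill_diagonal(Ahat, 0.0)
    np.fill_diagonal(Ahat, -Ahat.sum(axis=1))
    return Ahat

# run the reverse-process marginals from q_T back to time 0
p = q[-1].copy()
for k in range(n):
    qt = q[n - k]            # q at current reverse time t = T - s
    p = p + dt * (reverse_rates(qt).T @ p)
```

A learned s_θ replaces the exact ratios in practice; the β(x, t) denominators never appear, which is the stability advantage described above.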
As discussed further in Section 4, the integrand in Equation (19) may be viewed as a score matching objective for discrete state space. It shares certain similarities with ratio matching techniques (Hyvärinen, 2007), in particular targeting the ratios β(y, t)/β(x, t). However, as far as we are aware this particular objective is not directly equivalent to any previously studied score matching objective in discrete state space (Hyvärinen, 2007;Lyu, 2009;Sohl-Dickstein et al., 2011).

F.3. Riemannian manifolds
Consider the case where X is a Riemannian manifold with metric tensor g and ν is the volume measure induced by g. A diffusion in X may be defined through its generator, so we let the noising and generative processes have generators L = ∂_t + b · ∇ + ½∆ and K = ∂_t + µ · ∇ + ½∆ respectively, where ∇ is the Riemannian gradient and ∆ is the Laplace-Beltrami operator, defined in local coordinates by ∆f = |g|^{−1/2} ∂_i(|g|^{1/2} g^{ij} ∂_j f), and |g| denotes the determinant of the metric tensor.
To calculate the adjoint operator of K̂, we recall that the canonical volume element on X induced by g is given by dω = |g|^{1/2} dx¹ ∧ · · · ∧ dxⁿ and the divergence of a vector field a : X → TX on a Riemannian manifold is given by ∇ · a = |g|^{−1/2} ∂_i(|g|^{1/2} aⁱ). Then, using the generalised Stokes' theorem, we have ∫_X (K̂f) h dω = ∫_X f (−∇ · (µh) + ½∆h) dω, where we assume f and h are sufficiently smooth that we may disregard boundary terms.
In addition, we have the integration by parts identity ∫_X (∇ · a) h dω = −∫_X ⟨a, ∇h⟩ dω. We conclude that the adjoint operator is given by K̂*h = −∇ · (µh) + ½∆h. Then, as in the real diffusion case, we see that Assumption 1 holds if we let c = −(∇ · µ), noting that M is also the generator of a diffusion process Z on X. We also find that Assumption 2 reduces to the condition ∇ log α = −µ − b, as before.
Assuming this holds, we can evaluate the objective, where ‖·‖_{g(x)} denotes the norm on the tangent space T_x X induced by g. Finally, as in the real diffusion case, we make the reparameterisation s_θ(x, t) = −∇ log α(x, t) in order to sidestep Assumption 2 and provide an easier training target. The resulting denoising score matching objective is I_DSM(θ) = ∫_0^T E_{q_0(x_0) q_{t|0}(x_t|x_0)}[½ ‖s_θ(x_t, t) − ∇ log q_{t|0}(x_t|x_0)‖²_{g(x_t)}] dt. Notably, we find that all the relevant formulae in the manifold case are essentially the same as in the real diffusion case, except for the inclusion of the metric tensor.

G. Proof of properties of the score matching operator
We give the proof of the properties of the score matching operator from Proposition 1.
Proposition 1. Let Y be a Feller process with semigroup operators (Q t ) t≥0 , generator L and associated score matching operator Φ. Then: (a) Φ(f ) ≥ 0 for all f in the domain of Φ, with equality if and only if f is constant; (b) for any probability measures π 1 , π 2 on X , where KL(π 1 Q t ||π 2 Q t ) denotes the Kullback-Leibler divergence between π 1 Q t , π 2 Q t .
where Γ(f, g) = K(f g) − f Kg − gKf denotes the carré du champ operator associated to K. We deduce that where in the third line we have used the Fokker-Planck equation. Since f was arbitrary, it follows that Finally if we substitute t → T − t in this final equation, we get which gives the desired result when combined with the definition of M and c from Assumption 1.
Lemma 6. Suppose α : S → (0, ∞) is a function such that Assumption 2 holds. If we define ζ = vα, then for any function f decaying sufficiently rapidly, ζ satisfies Proof. For any sufficiently rapidly decaying f satisfying f ζ ∈ D(L) and vf ∈ D(M), using Lemma 5 we have Now we can give the proofs of Lemma 2 and Theorem 3.
Lemma 2. Suppose X, Y are fixed generative and noising processes with marginals p, q as in Section 3, and suppose that they are related as in Assumptions 1 and 2 for some sufficiently regular function α. Then for any 0 < s < t < T with γ = t − s,
Proof. Let P_{x_s} and Q_{x_s} denote the path measures of W and Y respectively on the interval [s, t] when we condition on the initial value x_s. Assuming α is sufficiently regular that ζ is bounded away from zero and infinity and ζ^{-1}Lζ is bounded and continuous in the time variable, Girsanov's theorem and Lemma 6 give an expression for dP_{x_s}/dQ_{x_s}. Taking logarithms and writing γ = t − s, we can expand log(dP_{x_s}/dQ_{x_s})(ω) to first order in γ for any fixed path ω. Since the first-order terms depend only on the value of the path at its endpoints (ω_s, ω_t), taking expectations and using the definition of the generator as a stochastic derivative yields the result, where in the final step we have used Lemma 1.
Theorem 3. For any DMM, the objective (11) for its natural discretisation is equivalent to the natural discretisation of I ISM to first order in γ = max k=0,...,N −1 |t k+1 − t k |.
Proof. Given time steps 0 = t_0 < t_1 < · · · < t_N = T, define γ_k = t_{k+1} − t_k for k = 0, . . . , N − 1 and set γ = max_{k=0,...,N−1} γ_k. The natural discretisation of the objective I_ISM(β) replaces the time integral by a sum over the grid points. Applying Lemma 2 with s = t_{k−1} and t = t_k for k = 1, . . . , N, and putting the resulting terms together, we see that objective (11) is equivalent to the natural discretisation of I_ISM to first order in γ.

I. General equivalence between denoising autoencoders and score matching
A denoising autoencoder takes a datapoint x_0 drawn from a data distribution q_0, noises it according to some density q_τ(x_τ|x_0) and then tries to reconstruct x_0 given the noised observation x_τ (Vincent et al., 2008). Traditionally, q_τ(x_τ|x_0) is taken to be Gaussian with mean x_0 and some standard deviation σ, and we make a point estimate f_θ(x_τ) for x_0 given x_τ. The parameters θ are learned by minimising the MSE objective E[‖f_θ(x_τ) − x_0‖²]. For a general denoising autoencoder on state space X, we allow a probabilistic reconstruction p^{(θ)}_{0|τ}(x_0|x_τ) of x_0 depending on a set of parameters θ, rather than a point estimate. We fit θ by minimising the objective J_DAE(θ) = E_{q_{0,τ}(x_0, x_τ)}[−log p^{(θ)}_{0|τ}(x_0|x_τ)]. Note that this reduces to the MSE objective in the case where X = R^d and p^{(θ)}_{0|τ}(x_0|x_τ) is Gaussian with mean f_θ(x_τ) and fixed isotropic covariance. Suppose now that we have a generalised denoising autoencoder where the noising distribution q_{0,τ}(x_0, x_τ) is given by the endpoints of a Markov process on X with generator L and the denoising distribution p^{(θ)}_{0|τ}(x_0|x_τ) is given by the endpoints of a Markov process on X with generator K. Suppose further that we parameterise the denoising process K via some function β(x, t) according to Assumptions 1 and 2 as in Section 3. Then Lemma 2 implies that J_DAE is equivalent to first order to the discretised objective from Section 3, or alternatively to the corresponding generalised denoising score matching objective as in Section 4. This generalises the result of Vincent (2011), which demonstrated an equivalence between denoising autoencoders and denoising score matching in the case of Gaussian noise on R^d. Indeed, we recover their result by considering the case where q_{τ|0}(x_τ|x_0) and p^{(θ)}_{0|τ}(x_0|x_τ) are Gaussian, noting that these distributions are naturally induced as the distributions of the endpoints of diffusion processes.
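The reduction of the general objective to the MSE can be made explicit: for a Gaussian reconstruction with fixed isotropic covariance, the negative log-likelihood is an affine function of the squared error. A minimal sketch (illustrative; `neg_log_likelihood` and the shapes are our choices):

```python
import numpy as np

def neg_log_likelihood(x0, f_xt, sigma):
    """- log p_{0|tau}(x0 | x_tau) for an isotropic Gaussian reconstruction
    with mean f_theta(x_tau) and fixed standard deviation sigma."""
    d = x0.shape[-1]
    sq = np.sum((x0 - f_xt) ** 2, axis=-1)
    return 0.5 * sq / sigma ** 2 + 0.5 * d * np.log(2 * np.pi * sigma ** 2)

# For fixed sigma, minimising the general objective is the same as minimising
# the MSE: the two differ only by a constant and a positive scale factor.
rng = np.random.default_rng(1)
x0 = rng.standard_normal((128, 4))
f = rng.standard_normal((128, 4))
sigma = 0.3
nll = neg_log_likelihood(x0, f, sigma).mean()
mse = np.mean(np.sum((x0 - f) ** 2, axis=-1))
```

Since the additive constant and the factor 1/(2σ²) do not depend on θ, both objectives have the same minimisers.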
Our work extends this equivalence between denoising autoencoders and generalised score matching as described in Section 4 to arbitrary state spaces and noising/denoising distributions, provided that the noising and denoising distributions can be viewed as the marginals at the endpoints of Markov processes with known generators.

J. Experimental details
We give the details of our experimental set-up and results from Section 6. Code for all of our experiments can be found at github.com/yuyang-shi/generalized-diffusion.

J.1. Inference on R d using diffusion processes
The g-and-k distribution with parameters (A, B, g, k) is defined via its quantile function x(q) = A + B(1 + 0.8 tanh(g z(q)/2))(1 + z(q)²)^k z(q), where z(q) denotes the qth quantile of the standard Gaussian distribution, and we require B > 0 and k > −0.5. The parameters A, B, g, k control the location, scale, skewness and kurtosis of the distribution respectively (Prangle, 2020). The prior on the parameters is uniform on [0, 10]⁴. For the diffusion model, we centre and rescale each parameter linearly to [−1, 1] in our implementation, and transform back to [0, 10] for reporting. As our noising process, we use the Ornstein-Uhlenbeck process dY_t = −½ Y_t dt + dB_t. This has generator L = ∂_t − ½ x · ∇ + ½∆ and transition densities q_{t|0}(x_t|x_0) which are Gaussian and available analytically. We can sample from the forward process at time t by sampling x_0 ∼ q_0(x_0) and then x_t ∼ q_{t|0}(x_t|x_0). In practice, we apply a time-rescaling to the noising process following Song et al. (2021), in order to apply less noise at small times and move more quickly to the reference distribution at large times, by considering dY_t = −½ β(t) Y_t dt + √β(t) dB_t. The β schedule is set to be linear and monotonically increasing, i.e. β(t) = β_min + t(β_max − β_min).
We set β_min = 0.001, and β_max is selected using a grid search over {2, 4, 6, 8, 10}. The reverse process is parameterised in terms of a conditional score network s_θ(x_t, ξ, t) using multilayer perceptrons (MLPs). We first encode x and ξ into 128-dimensional encodings using two separate MLPs with 3 layers and 512 hidden units in each layer. We then concatenate the two encodings as well as the time t and pass them through another MLP with 3 layers and 512 hidden units in each layer. For N = 250, we take as ξ the full set of order statistics as inputs to our network, i.e. we sort the observation ξ and take all n = 250 values. For N = 10000, we take n = 100 evenly-spaced order statistics from our observation as inputs, following Fearnhead and Prangle (2012).
Since we have access to the analytic transition densities, we train using the denoising score matching objective I_DSM(θ). We use a total of 10⁶ training samples (x_0, ξ_0) ∼ p_data during training. We optimise the network using the Adam optimiser with batch size 512 and learning rate 0.0001 with a cosine annealing schedule for 2.5M iterations. For sampling, we use the Euler-Maruyama method with 1000 steps to simulate from the reverse SDE.
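For reference, sampling x_t ∼ q_{t|0} under the time-rescaled process reduces to evaluating a Gaussian mean and variance determined by ∫_0^t β(s) ds. A minimal sketch, assuming the linear β schedule described above (the β_max value here is one of the grid-search candidates, chosen for illustration):

```python
import numpy as np

beta_min, beta_max = 0.001, 4.0   # beta_max is selected by grid search in practice

def int_beta(t):
    # integral of the linear schedule beta(s) = beta_min + s (beta_max - beta_min)
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def sample_qt(x0, t, rng):
    """Sample x_t ~ q_{t|0}(. | x0) for the time-rescaled OU process
    dY = -1/2 beta(t) Y dt + sqrt(beta(t)) dB, whose transition density is
    Gaussian with mean x0 exp(-1/2 int beta) and variance 1 - exp(-int beta)."""
    ib = int_beta(t)
    mean = np.exp(-0.5 * ib) * x0
    std = np.sqrt(1.0 - np.exp(-ib))
    return mean + std * rng.standard_normal(np.shape(x0))

rng = np.random.default_rng(0)
xt = sample_qt(np.zeros(4), t=1.0, rng=rng)
```

The same mean and variance also give the conditional score −(x_t − mean)/var used as the regression target in I_DSM.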
The ground truth posterior density is estimated with MCMC samples generated using the R package gk (Prangle, 2020). We compare our method with the semi-automatic ABC (SA-ABC) and Wasserstein SMC (W-SMC) methodologies using the R packages abctools (Nunes and Prangle, 2015) and winference (Bernton et al., 2019). Both methods are also set to use 10⁶ data samples to generate 5000 posterior samples. Compared to these methodologies, our DMM requires fitting a neural network and is therefore more computationally expensive at training time. However, it produces more accurate posterior estimates for fixed ξ_0 and performs amortised inference across a range of parameter values using the same budget of 10⁶ data samples, so it is comparatively more data-efficient and does not require further sampling from the likelihood model p_data(ξ|x) at test time.
As well as the plots in the main text, we also provide a pair plot comparing the approximate posterior from our diffusion model to the ground truth distribution in Fig. 7. We see that our model provides results very close to the ground truth for the parameters A, B and g and can model the dependency between parameters, but gives a wider estimate in its reproduction of the posterior over k.

J.2. MNIST digit image inpainting using discrete-space CTMCs
Our implementation in discrete space closely follows that of Campbell et al. (2022), and we refer to their paper for further details. We denote our states as x_0 = (x_0^1, . . . , x_0^D) and for our noising process we use a CTMC with generator matrix B := B^{1:D}(x^{1:D}, y^{1:D}) which factorises over the dimensions, so B^{1:D}(x^{1:D}, y^{1:D}) = ∑_{i=1}^D B̃(x^i, y^i) 1_{x^{1:D∖i} = y^{1:D∖i}} for some rate matrix B̃ acting on a single dimension. Thus each pixel evolves independently as a CTMC on {0, . . . , 255} with rate matrix B̃. We use the Gaussian rate matrix of Campbell et al. (2022) for B̃, which respects the ordinal structure of our state space and has a discretised Gaussian as its invariant distribution. The transition probabilities for this forward process can be calculated analytically and efficiently by diagonalising the rate matrix and using matrix exponentials. This allows us to sample directly from the forward process at time t.
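The per-dimension transition probabilities take the form P(t) = exp(tB̃). The sketch below (our illustration) uses a simple nearest-neighbour rate matrix as a stand-in for the Gaussian rate matrix of Campbell et al. (2022), and a generic scaling-and-squaring matrix exponential rather than the diagonalisation used in practice:

```python
import numpy as np

S = 8  # single-dimension state space {0, ..., S-1}

# Simple nearest-neighbour rate matrix respecting the ordinal structure
# (a stand-in for the Gaussian rate matrix of Campbell et al. (2022))
Btil = np.zeros((S, S))
for i in range(S):
    if i > 0:
        Btil[i, i - 1] = 1.0
    if i < S - 1:
        Btil[i, i + 1] = 1.0
    Btil[i, i] = -Btil[i].sum()   # zero row sums

def expm(M, squarings=20, terms=12):
    """exp(M) via scaling and squaring with a truncated Taylor series."""
    A = M / 2.0 ** squarings
    E = np.eye(len(M))
    term = np.eye(len(M))
    for k in range(1, terms):
        term = term @ A / k
        E = E + term
    for _ in range(squarings):
        E = E @ E
    return E

t = 0.7
P = expm(t * Btil)   # per-dimension transition probabilities q_{t|0}(. | x0)
```

Because the generator factorises over dimensions, sampling x_t given x_0 amounts to drawing each pixel independently from the corresponding row of P.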
Since we have access to the forward transition probabilities, we use the denoising parameterisation of the reverse process in terms of p_θ^{(t)}(x_0|x_t) given in Equation (18), which we expect to lead to more stable training. We parameterise p_θ^{(t)}(x_0|x_t, ξ) using a convolutional U-Net (Ho et al., 2020), taking as inputs both x_t and ξ (concatenated in the channel dimension), as well as a sinusoidal embedding of the time t. The total number of neural network parameters is approximately 6.1M. The output of the network is the mean and log-scale of a logistic distribution for each pixel. The logistic distribution is then discretised into the bins {0, . . . , 255}, and p_θ^{(t)}(x_0|x_t, ξ) is defined as the product of the discretised logistic distributions across dimensions.
We use the MNIST dataset (LeCun et al., 2010), which consists of images of handwritten digits. To train our model, we minimise the objective given in Example 6. For optimisation, we use the Adam optimiser with batch size 128 and learning rate 0.0002 for 1M iterations. In order to simulate the reverse process efficiently, we use a tau-leaping approximation with 1000 steps (for more details, see Campbell et al. (2022)).
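A tau-leaping update freezes the jump rates over an interval of length τ and draws Poisson jump counts along each transition, applying their net effect at once. A simplified single-step sketch (our illustration; the implementation of Campbell et al. (2022) differs in detail):

```python
import numpy as np

def tau_leap_step(x, R, tau, rng):
    """One tau-leaping step for D independent coordinates of a CTMC on
    {0, ..., S-1} with rate matrix R: jump counts along each transition are
    Poisson with rates frozen at the current state, and their net effect on
    the ordinal state is applied at once (then clipped to the state space)."""
    S = R.shape[0]
    new = np.empty_like(x)
    for i, xi in enumerate(x):
        rates = np.clip(R[xi], 0.0, None)   # off-diagonal jump rates from xi
        rates[xi] = 0.0
        counts = rng.poisson(tau * rates)   # number of jumps to each state y
        delta = int(np.sum(counts * (np.arange(S) - xi)))
        new[i] = np.clip(xi + delta, 0, S - 1)
    return new

rng = np.random.default_rng(0)
R = np.array([[-1.0, 1.0, 0.0],
              [0.5, -1.0, 0.5],
              [0.0, 1.0, -1.0]])
x = np.array([0, 1, 2])
x_next = tau_leap_step(x, R, tau=0.1, rng=rng)
```

Unlike exact Gillespie simulation, the cost per step does not grow with the number of jumps, which is what makes 1000-step reverse simulation affordable in high dimension.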

J.3. Large-scale image super-resolution using discrete-space CTMCs
We perform an additional experiment using discrete-space DMMs for a large-scale image inverse problem on the ImageNet dataset (Russakovsky et al., 2015). We train a DMM using CTMC noising and generative processes to perform 4-fold image super-resolution.
Each input image has 64 × 64 pixels and three RGB colour channels, and we aim to output images at the higher resolution of 256 × 256 pixels which are consistent with the input images. Our state space is X = {0, . . . , 255}^{3×256×256}.
The noising process, reverse process parameterisation, and neural network design are the same as in Section J.2, but we use a larger neural network for this task. As the starting point of our network optimisation, we utilise the pretrained network weights for continuous diffusions by Dhariwal and Nichol (2021), but we retrain the network for our discrete-space DMM using the objective in Example 6. The total number of neural network parameters is approximately 311.8M. We train the network using the Adam optimiser with batch size 4 and learning rate 2 × 10 −5 for an additional 200000 iterations. For sampling, we use tau-leaping with 1000 steps.
We plot the simulated super-resolution samples in Fig. 8 for a number of low-resolution images generated from the ImageNet validation dataset. As shown in the images, the discrete diffusion model outputs a variety of super-resolution samples that are realistic to the eye and coherent with the low-resolution inputs, demonstrating that DMMs can continue to provide high-quality posterior samples even in very high-dimensional settings where the prior p_data(x) is unavailable and standard ABC or MCMC techniques are not applicable.

J.4. Modelling distributions on SO(3) using manifold diffusions
Recall that our noising process on SO(3) is Brownian motion with generator L = ∂_t + ½∆. Since SO(3) is compact, this converges to the uniform measure for large times; see e.g. De Bortoli et al. (2022). For this process, the transition probabilities can be written explicitly as q_{t|0}(x_t|x_0) = ∑_{ℓ≥0} (2ℓ + 1) e^{−ℓ(ℓ+1)t/2} sin((ℓ + ½)α)/sin(α/2), where α = arccos(½(Tr(x_0ᵀ x_t) − 1)) is the angle between x_t and x_0, and x_t, x_0 ∈ SO(3) are in matrix form. For completeness, we provide the derivation of this result below in Section J.4.1.
Given this expression, to sample from $q_{t|0}(x_t \mid x_0)$, we follow Leach et al. (2022) and first sample the rotation axis $v$ uniformly from the sphere $S^2 \subset \mathbb{R}^3$. Then, we sample the rotation angle $\alpha \in [0, \pi]$ using inverse transform sampling from the density
$$\frac{1 - \cos\alpha}{\pi} \sum_{\ell=0}^{\infty} (2\ell + 1)\, e^{-\ell(\ell+1)t/2}\, \frac{\sin\big((\ell + \tfrac{1}{2})\alpha\big)}{\sin(\alpha/2)},$$
where the normalising factor $(1 - \cos\alpha)/\pi$ is the measure on rotation angles induced by the uniform measure on SO(3). For larger $t$, we find that the above series converges quickly, and evaluating summation terms up to $\ell = 5$ gives an accurate approximation. For $t < 1$, the series converges slowly, so we use the approximation of Leach et al. (2022) instead. From the angle $\alpha$ and the axis $v = (x, y, z)$, we define the skew-symmetric matrix $V$ associated with $v$,
$$V = \begin{pmatrix} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{pmatrix},$$
and calculate the corresponding rotation matrix using Rodrigues' formula $R = I + \sin(\alpha)V + (1 - \cos(\alpha))V^2$.
Finally, we set x t = Rx 0 . In this way, we can directly sample from the noising process at time t.
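The axis-angle construction above can be sketched as follows; here the angle $\alpha$ is fixed for illustration, whereas in practice it is drawn by inverse transform sampling from the angle density, and the helper names are ours.

```python
import numpy as np

def sample_axis(rng):
    """Uniform rotation axis on S^2 via a normalised Gaussian vector."""
    v = rng.standard_normal(3)
    return v / np.linalg.norm(v)

def rodrigues(axis, alpha):
    """Rotation matrix from a unit axis and angle via Rodrigues' formula:
    R = I + sin(alpha) V + (1 - cos(alpha)) V^2."""
    x, y, z = axis
    V = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(alpha) * V + (1.0 - np.cos(alpha)) * (V @ V)

rng = np.random.default_rng(0)
# alpha = 0.7 is a placeholder; inverse transform sampling is used in practice.
R = rodrigues(sample_axis(rng), alpha=0.7)
```

Setting $x_t = R x_0$ then gives a sample from the noising process at time $t$.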
The reverse process is generated by $\mathcal{K} = \partial_t + s_\theta(x, t) \cdot \nabla + \frac{1}{2}\Delta$ by Example 7, and the score network is parameterised as $s_\theta(x, t) = \sum_{i=1}^{3} s_\theta^i(x, t)\, E_i(x)$, using a basis $\{E_i\}_{i=1}^{3}$ of the tangent bundle.
We use the denoising score matching objective $\mathcal{I}_{\mathrm{DSM}}(\theta)$ to learn $\theta$ (see Section F.3). To compute the score $\nabla \log q_{t|0}(x_t \mid x_0)$, we use automatic differentiation on Equation (22), where $x_t, x_0 \in \mathbb{R}^{3 \times 3}$ are represented in matrix form, followed by projection onto the tangent space at $x_t$. For small times, we find this can be numerically unstable, so we instead use Varadhan's approximation $\lim_{t \to 0} t\, \nabla \log q_{t|0}(x_t \mid x_0) = \exp_{x_t}^{-1}(x_0)$ for the heat kernel $q_{t|0}(x_t \mid x_0)$ at small times. Once we have learned the score network, we generate approximate samples from the reverse process using the geodesic random walk method of De Bortoli et al. (2022), which corresponds to performing an Euler-Maruyama discretisation, taking Gaussian steps in the tangent space and then projecting back onto the manifold using the exponential map.
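A single geodesic random walk step can be sketched as below. The left-trivialised tangent parameterisation and the constant (zero) score input are simplifying assumptions for illustration only; in practice the score comes from the learned network $s_\theta$.

```python
import numpy as np

def skew(v):
    """3-vector to skew-symmetric matrix."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def expm_so3(V):
    """Matrix exponential of a 3x3 skew-symmetric matrix via Rodrigues' formula."""
    v = np.array([V[2, 1], V[0, 2], V[1, 0]])
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.eye(3)
    A = V / theta
    return np.eye(3) + np.sin(theta) * A + (1.0 - np.cos(theta)) * (A @ A)

def geodesic_rw_step(x, score, gamma, rng):
    """One Euler-Maruyama step on SO(3): drift plus Gaussian noise in the
    tangent space, mapped back to the manifold with the exponential map.
    `score` is a 3-vector of tangent coordinates (e.g. a score-network output)."""
    noise = rng.standard_normal(3)
    tangent = gamma * score + np.sqrt(gamma) * noise
    return x @ expm_so3(skew(tangent))

rng = np.random.default_rng(1)
x = np.eye(3)
for _ in range(100):
    x = geodesic_rw_step(x, score=np.zeros(3), gamma=0.01, rng=rng)
```

Because every step multiplies by an exact rotation matrix, the iterate stays on SO(3) up to floating-point error, with no extra projection needed.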
Multiplying out, collecting like terms and inspecting the coefficients of $\mathrm{d}x^2$, $\mathrm{d}x\,\mathrm{d}y$ etc., we see that
$$g_{ij} = \frac{4}{w^2} \begin{pmatrix} w^2 + x^2 & xy & xz \\ xy & w^2 + y^2 & yz \\ xz & yz & w^2 + z^2 \end{pmatrix}$$
and we can calculate $|g| = 1/w^2$. Inverting the metric, we obtain $g^{ij}$ as given in Equation (3).

J.5. Modelling mixtures of wrapped normal distributions on SO(3)

We consider modelling a mixture of wrapped normal distributions on SO(3). The wrapped normal distribution $\mathcal{N}_W(x \mid \mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$ is defined here as the distribution obtained by sampling $w \sim \mathcal{N}(w \mid 0, \sigma^2)$, where $w \in \mathbb{R}^{3 \times 3}$, from the standard normal distribution with variance $\sigma^2$, projecting $w$ onto the tangent space via $v = \frac{w - w^\top}{2}$, and then applying the exponential map $x = \exp_\mu(v)$ at $\mu$. While we could apply standard parametric learning methods which learn $\{\mu_m, \sigma_m\}$ directly, our approach does not rely on the specific form of the data distribution $p_{\mathrm{data}}$, which allows us to model different distributions flexibly. We consider modelling a mixture of wrapped normal distributions with $M = 16$ mixture components.
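Sampling from the wrapped normal as defined here can be sketched as follows (helper names are ours); the matrix exponential of a skew-symmetric matrix is computed with Rodrigues' formula.

```python
import numpy as np

def exp_at(mu, v):
    """Exponential map on SO(3) at mu applied to a skew-symmetric tangent
    matrix v, using Rodrigues' formula for the matrix exponential."""
    w = np.array([v[2, 1], v[0, 2], v[1, 0]])
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return mu
    A = v / theta
    return mu @ (np.eye(3) + np.sin(theta) * A + (1.0 - np.cos(theta)) * (A @ A))

def sample_wrapped_normal(mu, sigma, rng):
    """Sample from N_W(. | mu, sigma^2): a Gaussian matrix in R^{3x3},
    projected to the skew-symmetric (tangent) part v = (w - w^T)/2,
    then pushed through the exponential map at mu."""
    w = sigma * rng.standard_normal((3, 3))
    v = (w - w.T) / 2.0
    return exp_at(mu, v)

rng = np.random.default_rng(2)
x = sample_wrapped_normal(np.eye(3), sigma=0.3, rng=rng)
```

A mixture is then sampled by first drawing a component index $m$ and then sampling from $\mathcal{N}_W(\cdot \mid \mu_m, \sigma_m^2)$.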
We apply a time-rescaling for the noising process, which is given by $\mathcal{L} = \partial_t + \frac{1}{2}\beta(t)\Delta$ with the linear $\beta$ schedule given in Equation (21). Then, the reverse process is generated by $\mathcal{K} = \partial_t + \beta(t)\, s_\theta(x, t) \cdot \nabla + \frac{1}{2}\beta(t)\Delta$. We use an MLP with 5 layers and 512 hidden units in each layer to output a vector of dimension 3 parameterising $\{s_\theta^i(x, t)\}_{i=1}^{3}$. We train the network using the Adam optimiser with batch size 512 and learning rate $2 \times 10^{-4}$ with a cosine annealing schedule for 100,000 iterations.
We learn both the unconditional distribution $p_{\mathrm{data}}(x)$ and the conditional distribution $p_{\mathrm{data}}(x|m)$ when conditioning on the mixture component $m$. In the conditional case, we learn a conditional score model $s_\theta(x, m, t)$ under the same settings. Fig. 9 shows the results from our conditional model for $p_{\mathrm{data}}(x|m)$: we compare the unwrapped distributions in the tangent space between the ground-truth normal distribution and the modelled distribution for mixture component $m = 1$, and plot a representative sample from our conditional model. We see that our model targets the correct mixture component accurately. Our visualisations of distributions on SO(3) are adapted from Murphy et al. (2021).

J.6. Pose estimation on the SYMSOL dataset

Finally, we give details for the pose estimation task on the SYMSOL dataset. We use a similar network design for the conditional score $s_\theta(x_t, \xi, t)$ to Murphy et al. (2021), composed of a vision recognition model for processing the input images $\xi$, and an MLP for outputting the score. For the vision recognition model, we use a pretrained ResNet-50 backbone without the final fully-connected classification layer, which outputs a 2048-dimensional embedding. We then compute sinusoidal positional embeddings of $x_t$ and $t$, use linear layers to transform all embeddings into 256 dimensions, and take the summed embedding. This also allows efficient computation of embeddings for a single $\xi$ and multiple values of $(x_t, t)$, as the computationally expensive forward pass through the vision recognition model only needs to be performed once. We therefore simulate a small number of $(x_t, t)$ pairs for each pair $(x_0, \xi)$ at each step for more efficient training. We finally pass the summed embedding into an MLP with 3 layers and 256 hidden units in each layer.
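The compute-once-and-sum embedding trick can be sketched in a few lines. The sinusoidal embedding below is a standard construction (not necessarily the exact variant used here), the random vector stands in for the 256-dimensional projection of the ResNet-50 output, and we embed only $t$ for brevity.

```python
import numpy as np

def sinusoidal_embedding(t, dim=256):
    """Standard sinusoidal positional embedding of scalar input(s):
    sin and cos of the input scaled by geometrically spaced frequencies."""
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

rng = np.random.default_rng(3)
# Stand-in for the (linearly projected) image embedding, computed once per xi.
image_emb = rng.standard_normal(256)
# Embeddings for several time values, reusing the single image embedding.
t_embs = sinusoidal_embedding(np.linspace(0.0, 1.0, 8))   # shape (8, 256)
summed = image_emb[None, :] + t_embs                      # one image pass, many (x_t, t)
```

The broadcasted sum is what makes training with several $(x_t, t)$ pairs per observation cheap: the expensive backbone runs once while the cheap positional embeddings vary.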
Compared to the Implicit-PDF methodology by Murphy et al. (2021), which maintains a grid on SO(3) and approximates the density pointwise, our DMM model directly learns a sampling method and does not require maintaining a grid. Therefore, our method is more general and not specific to SO(3). For our implementation, we modify their network structure to take in the time t, and output the score parameterisation of dimension 3 as opposed to the unnormalised log density of dimension 1. We optimise the network using the Adam optimiser with batch size 128 and learning rate 0.0001 with a cosine annealing schedule for 100000 iterations.
We include further visualisations of the generated samples when conditioned on 2D views of different shapes in Fig. 10. As shown in the plots, the samples generated using the DMM are all close to the ground truth and cover all modes arising from each shape's rotational symmetries.