A Kernel Stein Test for Comparing Latent Variable Models

We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.

1999), factor analysis (see, e.g., Basilevsky, 1994), mixture models (see, e.g., Gilks et al., 1995), topic models for text (Blei et al., 2003), and hidden Markov models (HMMs) (Rabiner, 1989). The hidden structure in these generative models serves multiple purposes: it allows interpretability and understanding of model features (e.g., the topic proportions in a latent Dirichlet allocation (LDA) model of text), and it facilitates modeling by leveraging simple low-dimensional dynamics of phenomena observed in high dimensions (e.g., HMMs with a low dimensional hidden state). Statistical modelers ultimately use such models to reason about the data; thus, in order to guarantee the validity of the inference, tools for comparing models and evaluating model fit are required.
This article addresses the problem of evaluating and comparing generative probabilistic models, in cases where the models have a latent variable structure, and the marginals over the observed data are intractable. In this scenario, one strategy for evaluating a generative model is to draw samples from it and to compare these samples to the modeled data using a two-sample test: for instance, Lloyd and Ghahramani (2015) use a test based on the maximum mean discrepancy (MMD) . This approach has two disadvantages, however: it is not computationally efficient due to the sampling step, and it does not take advantage of the information that the model supplies, for instance the dependence relations among the variables.
Recently, an alternative model evaluation strategy based on Stein's method (Stein, 1972;Chen, 1975;Stein, 1986;Barbour, 1988;Götze, 1991) has been proposed, which directly employs a closed-form expression for the unnormalized model. Stein's method is a technique from probability theory developed to prove central limit theorems with explicit rates of convergence (see, e.g., Ross, 2011). The core of Stein's method is that it characterizes a distribution with a Stein operator, which, when applied to a function, causes the expectation of the function to be zero under the distribution. For our purposes, we will use the result that a model-specific Stein operator may be defined, to construct a measure of the model's discrepancy. Notably, Stein operators may be obtained without computing the normalizing constant.
Stein operators have been used to design integral probability metrics (IPMs) (Müller, 1997) to test the goodness of fit of models. IPMs specify a witness function which has a large difference in expectation under the sample and model, thereby revealing the difference between the two. When a Stein operator is applied to the IPM function class, the expectation under the model is zero, leaving only the expectation under the sample. A Stein-modified W 2,∞ Sobolev ball was used as the witness function class in (Gorham and Mackey, 2015;Gorham et al., 2019). Subsequent work in (Chwialkowski et al., 2016;Liu et al., 2016;Gorham and Mackey, 2017) used as the witness function class a Stein-transformed reproducing kernel Hilbert ball, as introduced by Oates et al. (2017): the resulting goodness-of-fit statistic is known as the kernel Stein discrepancy (KSD). Conditions for using the KSD in convergence detection were obtained by Gorham and Mackey (2017). While the foregoing work applies in continuous domains, the approach may also be used for models on a finite domain, where Stein operators (Ranganath et al., 2016;Yang et al., 2018;Bresler and Nagaraj, 2019;Reinert and Ross, 2019;Hodgkinson et al., 2020;Shi et al., 2022) and associated goodness-offit tests (Yang et al., 2018) have been established. Note that it is also possible to use Stein operators to construct feature dictionaries for comparing models, rather than using an IPM: examples include a test based on Stein features constructed in the sample space so as to maximize test power (Jitkrittum et al., 2017(Jitkrittum et al., , 2018) and a test based on Stein-transformed random features (Huggins and Mackey, 2018). While the aforementioned tests address simple hypotheses, composite tests that use Stein characterizations have been proposed for specific parametric families including gamma (Henze et al., 2012;Betsch and Ebner, 2019b) and normal distributions (Betsch and Ebner, 2019c;Henze and Visagie, 2019), and general univariate parametric families (Betsch and Ebner, 2019a) (note that these tests are not based on IPMs).
While testing goodness of fit alone may be desirable for models of simple phenomena, it will often be the case that in complex domains, no model will fit the data perfectly. In this setting, it is more constructive to ask which model fits better, either within a class of models or in comparing different model classes. A likelihood-ratio test would be an ideal choice for this task, since it is uniformly most powerful (Lehmann and Romano, 2005). If the models contain latent variables, however, a likelihood-ratio test requires evaluating marginal densities of the models, which are typically intractable. A number of Monte Carlo techniques have been developed to estimate marginal densities or log-density ratios (see, e.g., Friel and Wyse, 2012, for a review). Constructing a test with such techniques is challenging, however; e.g., estimating each marginal density induces a bias in the likelihood ratio, which is difficult to characterize when designing a calibrated threshold (see Section 3.5 for a detailed discussion). Addressing the intractability of the likelihood, Bounliphone et al. (2016) proposed a purely sample-based relative goodness of fit test, which compares maximum mean discrepancies between the samples from two rival models with a reference real-world sample. A second relative test was proposed by Jitkrittum et al. (2018), generalizing Jitkrittum et al. (2017) and learning the Stein features for which each model outperforms the other; this test requires marginal densities up to normalizing constants, and does not apply to latent variable models.
In the present work, we introduce a novel relative goodness-of-fit test for latent variable models (LVMs), which compares models by computing approximate kernel Stein discrepancies. Our contribution is to provide a frequentist test of relative goodness of fit, with an approximate U-statistic of the kernel Stein discrepancy difference as our test statistic. The statistic is expressed by a posterior expectation of the latent given an observation, and is amenable to standard Markov Chain Monte Carlo techniques: in particular, it does not suffer from the challenges in characterizing bias observed in likelihood-ratio estimates for LVMs. Note that our approach differs from Bayesian model selection (Jeffreys, 1961;Schwarz, 1978;Kass and Raftery, 1995;Watanabe, 2013), which reports posterior odds (or Bayes factors) and does not concern controlling frequentist risks such as type-I error rates. To the best of our knowledge, our test represents the first general-purpose, frequentist, relative test for latent variable models.
We recall the Stein operator and kernel Stein discrepancy in Section 2, and the notion of relative tests in Section 3. Our main theoretical contributions, also in Section 3, are two-fold: first, we derive an appropriate test threshold to account for the randomness in the test statistic caused by sampling the latent variables. Second, we provide guarantees that the resulting test has the correct Type-I level (i.e., that the rate of false positives is properly controlled) and that the test is consistent under the alternative: the number of false negatives drops to zero as we observe more data. Finally, in Section 4, we demonstrate our relative test of goodness-of-fit on a variety of latent variable models.
Our main point of comparison is the relative MMD test (Bounliphone et al., 2016), where we sample from each model.
We demonstrate that the relative Stein test outperforms the relative MMD test in the particular case where the low dimensional structure of the latent variables can be exploited.

| THE KERNEL STEIN DISCREPANCY AND LATENT VARIABLE MODELS
In this section, we recall the definition of the Stein operator as used in goodness-of-fit testing, as well as the kernel Stein discrepancy, a measure of goodness-of-fit based on this operator. We will then introduce latent variable models, which will bring us to the setting of relative goodness of fit with competing models in Section 3.
Before proceeding, we call attention to our setting: in this article, we treat both continuous-and discrete-valued observations, as formally defined at the outset of Section 2.1. It is our intention to study these two data modalities as they admit the same treatment. The subsequent definitions and analysis of our test are independent of whether a continuous or discrete Stein operator is used, apart from experiments concerning discrete-valued observations. Thus, the detail about discrete models in Section 2.1 may be initially skipped if desired.

| Stein operators and kernel Stein discrepancies
Let X be the space in which the data takes values; for D ≥ 1, the space X is either the Euclidean space D or a finite lattice {0, . . . , L − 1} D for some L > 1. Depending on X, we shall assume that the densities below are all defined with respect to the Lebesgue measure or the counting measure; i.e., the term density includes probability mass functions (pmfs).

Continuous-valued observations.
Suppose that we are given data {x i } n i =1 i.i.d.
∼ R from an unknown distribution R , and we wish to test the goodness of fit of a model P . We first consider the case where the probability distributions P , R are defined on D and have respective probability densities p, r , where all density functions considered in this paper are assumed strictly positive and continuously differentiable. We treat the case of densities defined on bounded domains in the supplement, Section B.1. For differentiable density functions, we define the score function, where the gradient operator is := ∂ ∂x 1 , . . . , ∂ ∂x D . The score is independent of the normalizing constant for p, making it computable even when p is known only up to normalization. Using this score, we define the Langevin Stein operator on a space F of differentiable functions from D to D (Gorham and Mackey, 2015;Oates et al., 2017), A kernel discrepancy may be defined based on the Stein operator (Chwialkowski et al., 2016;Liu et al., 2016;Gorham and Mackey, 2017), which allows us to measure the departure of a distribution R from a model P . We define reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950;Steinwart and Christmann, 2008, Definition 4.18) with a positive definite kernel k ( ·, ·) : X × X → (we use the same kernel for each dimension). The inner product on F is f , g F := D d =1 f d , g d F k , and F k denotes an RKHS of real-valued functions with kernel k . The (Langevin) kernel Stein discrepancy (KSD) between P and R is defined as Under appropriate conditions on the kernel and measure P , the expectation ¾ y ∼P A P f (y ) = 0 for any f ∈ F. To ensure this property, we will require that k ∈ C (1,1) , the set of continuous functions on X × X with continuous first derivatives and that ¾ y ∼P s p (y ) 2 < ∞ with · 2 the Euclidean norm. We further assume that the following tail condition holds outside a bounded set : p (x ) k (x, x ) ≤ C x δ 2 for some constants C > 0 and δ > D − 1 (see the clarification by South et al., 2021, p.12, on the tail condition for the Stein's identity). With the vanishing expectation ¾ y ∼P A P f (y ) = 0, the KSD reduces to KSD (P R ) = sup f F ≤1 |¾ x ∼R A P f (x ) | . The use of an RKHS as the function class yields a closed form expression of the discrepancy by the kernel trick (Chwialkowski et al., 2016; Mackey, 2017, Proposition 2), if ¾ x ∼R [h p (x, x ) 1/2 ] < ∞. Here, the symbol R ⊗ R denotes the product measure of two copies of R (so x and x are independent random variables identically distributed with the law R ). The function h p (called a Stein kernel) is expressed in terms of the RKHS kernel k and the score function s p , where we have defined For a given i.i.d. sample {x i } n i =1 ∼ R , the discrepancy has a simple closed-form finite sample estimate, which is a U-statistic (Hoeffding, 1948). When the kernel is integrally strictly positive definite (ISPD) (Sriperumbudur et al., 2011, Section 2), and R admits a density r that satisfies ¾ Discrete-valued observations. We next recall the kernel Stein discrepancy in the discrete setting where distributions are defined on X = {0, . . . , L − 1} D with L > 1, as introduced by Yang et al. (2018). In place of derivatives, we specify ∆ k as the cyclic forward differ- the corresponding vector-valued operator ∆ = (∆ 1 , . . . , ∆ D ). The inverse operator ∆ −1 k is given by the backward differ- . The score is then s p (x ) := p (x ) −1 ∆p (x ), where it is assumed that the pmf is strictly positive (i.e., it is never zero). (Yang et al., 2018, Theorem 2) (note that we include a trace for consistency with the continuous case-this does not affect the test statistic (Yang et al., 2018, Eq. 10)). We have defined the Stein operator and the score function slightly differently from Yang et al. (2018); the change is only in their signs, but this results in the same discrepancy. The difference Stein operator is not the only allowable Stein operator on discrete spaces: other alternatives are given by Yang et al. (2018, Theorem 3), Hodgkinson et al. (2020), andShi et al. (2022). Although we focus on the Stein operator above, in practice, one might want to consider different Stein operators depending on the application. For instance, the score function s p can be numerically unstable, as it contains the reciprocal 1/p (x ); this can occur when the support of the model is severely mismatched to that of the data. In this particular case, one might choose the Barker-Stein operator proposed by Shi et al. (2022), an instance of the Zanella-Stein operator of Hodgkinson et al. (2020, Example 2). See Appendix B.2 for details. We compare this operator to the difference operator in an experiment where this mismatch occurs (Section 4.3.3).

The difference Stein operator is then defined as
As in the continuous case, the KSD can be defined as an IPM, given a suitable choice of reproducing kernel Hilbert space for the discrete domain. An example of kernel is the exponentiated Hamming kernel, The population KSD is again given by the expectation of the Stein kernel, and the kernel gradient is replaced by the inverse difference operator, e.g., x indicates that the operator ∆ −1 is applied with respect to the argument x . From Yang et al. (2018, Lemma 8), we have that KSD (P R ) = 0 iff P = R , under the conditions that the probability mass functions for P and R are positive and that the Gram matrix defined over all the configurations in X is strictly positive definite (i.e., the kernel is integrally strictly positive definite). One can define a kernel satisfying the required condition, for example, by embedding X into L×D with one-hot encoding and using a Taylor-type kernel such as the exponentiated quadratic kernel (Christmann and Steinwart, 2010, Theorem 2.2).

| Kernel Stein discrepancies of latent variable models
Our objective is to use the KSD to evaluate latent variable models, and here we formally specify our target model class. Let L X|Z = {p ( · |z ) : z ∈ Z } be a family of probability density functions on X (we call these likelihood functions), which are indexed by elements of a set Z. A latent variable model P is specified by such a family L X|Z and a (prior) probability measure P Z over Z. The combination of these defines the marginal density function p (x ) = ∫ p (x |z )dP Z (z ) and the posterior distribution P Z (dz |x ) = {p (x |z )/p (x ) }P Z (dz ); The distribution P induced by the former acts as a model of the distribution R underlying the observation, and the latter enables us to draw an inference over the unobserved variable.
Remark In our notation, the variable z can represent multiple latent variables. The likelihood p (x |z ) often contains parameters, but the dependency on these is suppressed here. If a prior is defined on a parameter, we may treat it as a latent variable; this consideration is relevant to predictive distributions. The likelihood and the prior in a model may be conditioned on some fixed data (e.g., they can be posterior predictive distributions), which we require to be independent of the data used for testing -in such a case, we omit the dependency on the held-out data. For examples, we refer the reader to Section 4.
The definition of the KSD remains the same for latent variable models, but an additional difficulty arises in its estimation. Unfortunately the U-statistic estimator given in (2) requires the score function of the marginal p, which is challenging to obtain due to the intractability of marginalizing out the latent variable. We will address this challenge by rewriting the score function in terms of the posterior distribution of the latent. In the following, we focus on continuous variable models, but the same conclusion holds for discrete counterparts by replacing gradient operation with cyclic differences.
Under a regularity condition, the score function can be expressed as where s p (x |z ) is the score function of the conditional p (x |z ); i.e., s p (x |z ) = p (x |z ) −1 x p (x |z ) for continuous-valued x . The reasoning is as follows: where we have assumed the exchangeability of differentiation and integration: (3) is an analogue of Fisher's identity (Fisher, 1925;Dempster et al., 1977), which pertinently formed the basis for Stein control variate methodology in (Friel et al., 2016), parameter inference for doubly-intractable models via score matching (Vértes and Sahani, 2016), and Bayesian model selection with Hyvärinen score (Dawid and Musio, 2015;Shao et al., 2019). Note that the conditional score s p (x |z ) is typically possible to evaluate. For example, consider the following simple form of an exponential family density p (x |z ) ∝ exp(T (x )η (z )) defined on D with T (x ) : D → and η : Z → ; for this density, s p (x |z ) = η (z ) x T (x ). As can be seen in this example, the formula (3) does not require the likelihood p (x |z ) to be normalized. This feature eliminates the need for estimating the normalizing constant of p (x |z ) for each z , which is required to compute goodness-of-fit measures based on the marginal density p (x ) (Friel and Wyse, 2012); Section C.2 in the supplementary presents a use case with a truncated model.
With this identity, the KSD is rewritten as follows.

Lemma 1 Let
Proof Substituting the formula (3) in the definition of KSD gives the required equation by the Tonelli-Fubini theorem.

Remark
The integrability assumption holds trivially if the input space X is finite, while care needs to be taken otherwise. The condition can be checked by examining the absolute integrability of each term in (4). The integrability assumption on the fourth term is mild, and is satisfied by common kernels, e.g., the exponentiated quadratic or the inverse multi-quadratic kernels. The condition on the other terms needs to be checked on a model-by-model basis. It can be shown that the example models in Section 4 satisfy the assumption (please see Section B.3 in the supplementary material for details).
The new KSD expression is an expectation of a computable symmetric kernel, and constructing an unbiased estimate is straightforward once we obtain a sample. In practice, when the model is complex, sampling from the posterior distribution generally requires simulation, as the posterior is not available in closed form. Therefore, we propose to approximate the expectation by Markov Chain Monte Carlo (MCMC) methods and construct an approximate U-statistic estimator as follows. Let z i ,m ∈ Z m be a latent sample of size m drawn by an MCMC method having P Z ( · |x i ) as its invariant measure after t burn-in iterations. Lets p (x i |z (t ) . Given a joint sample whereH and the sum is taken over all distinct sample pairs. If P (t ) m ), then this estimator is indeed a U-statistic, but its expectation is that of kernelH p with respect to P (t ) Thus, the estimator is biased against the target estimand, the model's KSD, for a finite burn-in period t , and can therefore be seen an approximation to the true U-statistic U (∞) n . Designing a statistical test requires understanding the behavior of the statistic (5), and we will provide its analysis in the next section. Although we focus on MCMC for its approximate unbiasedness in our proposed test, different posterior approximations may be considered in other applications; for example, with a more computationally efficient approach (e.g., variational approximation), the new KSD expression in Lemma 1 might allow us to consider parameter estimation for unnormalized statistical models with latent variables .

| A RELATIVE GOODNESS-OF-FIT TEST
We now address the setting of statistical testing for model comparison. We begin this section with our problem settings and notation, and then define a test by showing the asymptotic normality of approximate U-statistics.

| Problem setup
We consider the case where we have two latent variable models P and Q , and we wish to determine which is a closer approximation of the distribution R generating our data {x i } n i =1 . The respective density functions of the models are given by the integrals p (x ) = ∫ p (x |z )dP Z (z ) and q (x ) = ∫ q (x |w )dQ W (w ). As with P , the latent variable w is assumed to take values in a set W with prior Q W . We assume that p (x ) and q (x ) cannot be tractably evaluated, even up to their normalizing constants. Our goal is to determine the relative goodness-of-fit of the models by comparing each model's discrepancy from the data distribution. Our problem is formulated as the following hypothesis test: In other words, the null hypothesis is that the fit of P to R (in terms of KSD) is as good as Q , or better. Note that the KSD in (6) is defined by a particular reproducing kernel, and thus different kernels yield distinct hypotheses. For kernel selection, we refer the reader to Section 3.4.
We next provide an overview of the formal assumptions made throughout the paper. Let (Ω, S, Π) be a probability space, where Ω is a sample space, S is a σ-algebra, and Π is a probability measure. All random variables (for example, data points x i and draws z (t ) i from a Markov chain sampler) are understood as measurable functions from the sample space Ω. The input space X is equipped with the Borel σ-algebra generated by its standard topology. We assume that Z, W are Polish spaces with the Borel σ-algebras defined by their respective topologies, on which the priors P Z , Q W are defined. Finally, we require that the two models are distinct; i.e., their marginal densities disagree on a set of positive R -measure.

| Estimating kernel Stein discrepancies of latent variable models
The hypotheses in (6) can be equally stated in terms of the difference of the (squared) KSDs, KSD 2 (P R )−KSD 2 (Q R ) , which motivates us to design a test statistic by estimating each term. Let U (t ) n (P , Q ) is an approximate U-statistic (in the sense of the final paragraph in Section 2.2) defined by the difference kernel The statistic takes as input random variables with evolving laws, and defining a test require us to understand the behavior of such statistics. This section delivers an analysis in a general setting.
We first characterize the asymptotic distribution of an approximate U-statistic. The following theorem shows that such a statistic is asymptotically normal around the expectation of the true U-statistic provided its bias vanishes fast.

Theorem 1 (Asymptotic normality) Let {γ t } ∞
t =1 be a sequence of Borel probability measures on a Polish space Y and γ be another Borel probability measure. Let Y ∼ γ t , and for a symmetric function h : Y × Y → , define a U-statistic and its mean by converges to a constant σ 2 . Assume that we have θ t → θ as t → ∞. Then, in the limit of large n and of t growing as a function of n such that √ n (θ t − θ) → 0, the following two statements hold: if The proof is in Section A in the supplement. Note that in the preceding and following results, the limit of n and t is taken simultaneously rather than sequentially, such that the condition √ n (θ t −θ) → 0 holds: see discussion below and in Section 3.3. By letting Y and h =H p,q in the foregoing theorem, we obtain the same conclusion for the difference estimate U (t ) n (P , Q ). The asymptotic normality allows us to define a test procedure. Theorem 2 involves unknown variance σ 2 , however; in order to construct a test, we need to be able to estimate it consistently. For our test, we propose to use the following jackknife variance estimator v n,t := (n − 1) where U (t ) n is defined as in Theorem 2, and U (t ) n,−i the U-statistic computed on the sample with the i -th data point removed. We defer the discussion on this choice until we introduce our test procedure in Section 3.3. Here, we present the required consistency, the proof of which can be found in Appendix A.2 (see Lemma 3).
Lemma 2 (Jackknife consitency) Define symbols as in Theorem 2 and the jackknife variance estimator as in (7). Assume . Then, we have the double limit ¾ v n,t − σ 2 2 → 0 as n, t → ∞. In particular, the limit holds regardless of the growth rate of t as a function of n.
We have shown that the jackknife estimator allows consistent estimation of the asymptotic variance of U (t ) n (P , Q ). Using the results obtained in this section, we present our test procedure in the next section.

| Test procedure
We are finally ready to define the test procedure. Recall that our objective is to compare model discrepancies, which can be accomplished by estimating the difference KSD 2 (P R ) − KSD 2 (Q R ). The previous section has established the asymptotic normality of the difference estimate U (t ) n (P , Q ) and provides a consistent estimator of its asymptotic variance. Therefore, we define our test statistic to be with v n,t the jackknife variance estimator in (7) computed using the joint sample The following property follows from Theorem 2 and Slutsky's lemma (see e.g., van der Vaart, 2000, p. 13) along with the consistency of v n,t from Lemma 2.
1 represents a copy of random variables (x, z (t ) , w (t ) ); the variables z (t ) , w (t ) are draws from the respective Markov chains of P and Q conditioned on   Remark Corollary 1 holds for any choice of the Markov chain sample size m ≥ 1. However, in practice, a small value of m leads to large variance of the score estimatess p ,s q , and hence the test statistic T n,t , which results in a conservative test. To improve the test's sensitivity, we therefore recommend using as large an m as possible.
Corollary 1 leads to the following simple model comparison test (summarized in Algorithm 1): for a given significance level α ∈ (0, 1), we compare the test statistic T n,t against the (1 − α)-quantile τ 1−α of the standard normal, and reject the null if T n,t exceeds τ 1−α . By this design, under the null hypothesis H 0 : µ P ,Q ≤ 0, we have lim n,t →∞ Π(T n,t > τ 1−α |H 0 ) ≤ α, and the test is therefore asymptotically level α for each fixed R satisfying H 0 . On the other hand, under any fixed alternative H 1 : µ P ,Q > 0, it follows from √ nµ P ,Q → ∞ (n → ∞) that we have lim n,t →∞ Π(T n,t > τ 1−α |H 1 ) = 1, indicating that the test is consistent in power.

Algorithm 1: Test procedure
Input: Data {x i } n i =1 , models P , Q , and significance level α Result: i ,m ) after t burn-in steps with an MCMC algorithm to simulate i ,m ) after t burn-in steps with an MCMC algorithm to simulate Q W (dw |x i ); We remark that the above analysis will not apply in particular extreme cases, where both models are identical, or both perfectly match the data distribution. When these occur, then σ P ,Q = 0 and µ P ,Q = 0 (note that if µ p,Q 0, the test statistic diverges as the sample size increases). Applying our procedure as above to this setting, the normal approximation might fail to correctly capture the variability of the test statistic, and the type-I error could exceed the significance level. To detect this failure mode, we would need to independently check that the two models are not identical, either by inspection or via two-sample testing. A more systematic treatment could be performed, e.g., by preventing degeneracy using a sample splitting technique as proposed by Schennach and Wilhelm (2017), and we leave this refinement for future work.
We empirically found that our choice of the variance estimator acted as a safeguard against the failure mode mentioned above. The jackknife estimator is nonnegative, while individually estimating the variances and covariance of the two U-statistics might yield a negative estimate. The jackknife is also known to overestimate the variance (Efron and Stein, 1981), and its use may result in a more conservative test. This estimator is not the only allowable choice, as the variance estimation in U-statistics has been extensively studied (for other concrete estimators, see, e.g., Maesono, 1998, and references therein). In our preliminary analysis, we considered two other estimators, but the jackknife estimator controlled type-I errors better than these alternatives in the near degenerate case. For details, we refer the reader to experiments in Sections C.4 and C.5 in the supplement.
The limiting behaviors of the test are only guaranteed when an appropriate double limit is taken with respect to the burn-in size t and the sample size n. Theorem 2 suggests that the bias of the statistic U (t ) n (P , Q ) should decay faster than 1/ √ n in the limit of t . Our practical recommendation is to take a burn-in period as long as the computational budget allows; this heuristic is justified if the bias vanishes as t → ∞. For KSD 2 (P R ) and its estimate, the bias is due to that of the score estimate s (t ) z|x denotes the expectation with respect to P (t ) Z (dz |x ). If the score's bias is confirmed to converge to zero, we can check the bias of the KSD estimate by examining the convergence of ¾ (x,x )∼R ⊗R [h p,t (x, x ) ], with h p,t (x, x ) a Stein kernel defined by the approximate score s (t ) p . The convergence of s (t ) p can be established by assuming appropriate conditions on s p (x |z ) and the sampler; for instance, for the exponential family likelihood p (x |z ) ∝ exp(T (x )η (z )), if the natural parameter η is a continuous bounded function, the weak convergence of the sampler implies the desired convergence (the score s p (x |z ) is factorized as η (z ) T (x )). The quantification of the required growth rate of t relative to n needs more stringent conditions on the employed MCMC sampler, which we discuss in the supplement, Section B.4. Admittedly, it is often not straightforward to theoretically establish an explicit relation between the growth rates of t and n. We therefore experimentally evaluate the finite-sample performance of our test in Section 4.
The overall computational cost of the proposed test is O {n 2 + n (t + m) }, assuming that the cost of sampling a latent is constant. The test statistic in (8) requires evaluating the U-statistic kernelH p,q on all distinct sample pairs. Note that we need to perform this computation only once if we memoize the evaluated values; in particular, the cost of the variance estimate (2) can be made O (n 2 ) with memoization. Thus, assuming that we have evaluated and stored the score values s p x i |z , the cost of evaluating the U-statistic kernel is O (n 2 ), but this operation can be easily parallelized over sample pairs. The additional O {n (t + m) } cost comes from evaluating the approximate score functions, as it requires running Markov chains for each data point (see the loop between Lines 1-4 in Algorithm 1). We can improve the sample-size n dependency in score evaluation by parallelization, since MCMC can be performed independently over sample points x i .

| Kernel choice
A discrepancy measure such as KSD encodes a particular sense of how two distributions differ. In the case of KSD, the magnitude of this discrepancy is affected not only by evaluated models but also the choice of a reproducing kernel.
Ideally, we should choose a kernel that makes the KSD reflect the discrepancy of features relevant to the problem at hand. We provide general guidance on kernel selection as follows: Continuous observations: As mentioned in Section 2, ISPD kernels enable the KSD to distinguish any two distributions satisfying certain regularity conditions. Of ISPD kernels, in the light of practical performances reported in prior work in goodness-of-fit testing (Gorham and Mackey, 2017) and distribution approximation Riabiz et al., 2022), we advocate for the use of the preconditioned IMQ kernel  k with Λ a strictly positive definite matrix, and scalars c > 0 and 0 < β < 1; as a default choice, we recommend to take β = 1/2 and c = 1. Following the kernel method literature, we recommend to choose the pre-conditioner Λ in a data-dependent way so that the KSD can capture relevant features of the data. We suggest two default options: the median heuristic, where Λ = λ 2 I with λ = median x i − x j 2 : 1 ≤ i < j ≤ n } and I the identity matrix; the x i /n, which should be suitably regularized. Each of these choices has its own merits, as we illustrate in a simple example with Gaussian distributions in the supplement (Section B.7). Moreover, in general, the KSD is not invariant to a change of coordinates representing the data. The above choices partially address this issue, as they ensure that the KSD is invariant to rotation and displacement (see Section B.9.2 in the supplement; Matsubara et al., 2022, Section 5.1) . For continuous observations, we additionally need to examine the integrability of the Stein kernel to use the KSD expression of Lemma 1. To this end, one might want to make an assumption about the tail decay of the data distribution.
The integrability condition can be alternatively enforced by reweighting the reproducing kernel so that the Stein kernel is uniformly bounded; i.e., for a kernel k , define a new kernel k w by k w : X → (0, ∞) is some decreasing function dominating the growth of the score function. We discuss how to choose such w in the supplement, Section B.3. This reweighting might reduce the sensitivity of the KSD and break the aforementioned property of coordinate-choice independence, however.

Discrete observations:
We have given a condition for a kernel to be ISPD at the end of Section 2; e.g., the exponentiated quadratic kernel on one-hot encoding, which can be efficiently implemented in a sparse tensor format. In general, however, it is challenging to compute such an ISPD kernel for discrete objects in high-dimensions. Note that ISPD-ness is only required to distinguish any two distributions. In practice, we only require the KSD to capture aspects relevant to model evaluation, and might therefore choose a kernel insensitive to some differences, as long as they represent computationally affordable alternatives suited to the given problem. An instructive example is testing on distributions over graphs. Graphs of V nodes can be represented as adjacency matrices that are elements of {0, 1} V ×V . The Dirac delta kernel that examines if two graphs are identical is an ISPD kernel but computationally intractable (no polynomial algorithm is known). This notion of graph identification is in practice too restrictive, and therefore one typically uses kernels that convey other relevant graph properties (see e.g., Borgwardt et al., 2020, for more details). We also demonstrate this trade-off in our experiment with latent Dirichlet allocation models in Section 4.3, where we can ignore the sequential structure of the data.
Use of multiple kernel functions: As we have seen, there are often multiple choices of the kernel function, and they might represent distinct features. Our recommendation is to test the hypotheses corresponding to the kernel choices simultaneously, as it makes evaluation more rigorous. However, one has to correct for multiple comparisons such as controlling the family-wise error rate. It should be noted that a correction typically makes the test more conservative as the number of kernels grows. The user thus needs to control the number of kernels to be used (e.g., using a handful of values of scale parameter λ for the above IMQ kernel with Λ = diag(λ, . . . , λ)).
Finally, we note that specific kernels can be employed that encode domain-specific expertise in particular problem settings: for instance, kernels have been defined on groups (Fukumizu et al., 2008) and graphs (Borgwardt et al., 2020).
KSDs and associated statistical tests can likewise be defined for certain of these cases (e.g., Xu and Matsuda, 2020).
That being said, it may sometimes be preferable to favor an MMD with goal-specific features over an omnibus KSD test.

| Challenges with likelihood-ratio tests based on marginal density estimation
As noted in the introduction, a likelihood-ratio test is an alternative choice for comparing latent variable models.
This section details challenges associated with likelihood-ratio tests. There are two paths to designing such tests, depending on how we estimate the (log-)density ratio. One approach is to estimate the marginal density and take the ratio, and the other is to directly estimate the (log-)density ratio. We refer the reader to (Friel and Wyse, 2012) for a review of estimation techniques.
Methods such as annealed importance sampling (Neal, 2001) belong to the first category. The problem with this approach is that an estimator ratio is a (typically heavily) biased estimator (let alone the difference of log density estimates); thus, deriving a calibrated test threshold from such an estimator is often challenging. In contrast, in our test, we can characterize the bias due to MCMC relatively straightforwardly, by evaluating the bias of the approximate Stein kernel.
The second category includes techniques such as reversible jump MCMC (Green, 1995) and thermodynamical integration. It is anecdotally known that reversible jump MCMC can be challenging to implement, especially for complex models. Thermodynamic integration uses formula where ¾ z |x,t is the expectation with respect to the posterior over z defined by the tempered likelihood p (x |z ) t . This technique can in principle be used to construct a likelihood-ratio test, as in our approach. For example, we may uniformly sample t from the unit interval to approximate the outer integral, while conducting MCMC to estimate the inner expectation. We are not aware of a frequentist test based on this construction, however. That said, the computational cost of such an approach would be significant: given data of size n, we would generate T samples for the temperature t , then m samples for the inner MCMC, giving a (naive) computational complexity of O (nmT ) (here we compute the sum of n log density evaluations rather than the evidence conditioned on the data batch).

| EXPERIMENTS
We evaluate the proposed test (LKSD, hereafter) through simulations. Our goal is to show the utility of the KSD in model comparison. To this end, we compare our test with the relative MMD test (Bounliphone et al., 2016), a kernelbased frequentist test that supports a great variety of latent variable models. Note that this test and ours address different hypotheses, as the MMD and LKSD tests use different discrepancy measures; it is indeed possible that they reach conflicting conclusions (e.g., a model is better in terms of KSD but worse in MMD). To align the judgement of both tests, we construct problems using models with controllable parameters; for a given class, a reference distribution, from which a sample is drawn, is chosen by fixing the model parameter; two candidates models are then formed by perturbing the reference's parameter such that a larger perturbation yields a worse model for both tests. We show that there are cases where the MMD fails to detect model differences whereas the KSD succeeds. For completeness, we provide the detail of our implementation of the MMD test in the supplement (Section B.5), since it requires modification to yield satisfactory performance in our setting.
Following are details shared by the experiments below. All results below are based on 300 trials, except for the experiment in Section 4.2 (see the section for details). In the light of the discussion in Section 3, unless specified, the MMD test draws n model = m + t samples for each model so that its cost matches the additional computation afforded

| Probabilistic Principal Component Analysis
We first consider a simple model in which the score of its marginal is tractable. This allows us to separately assess the impact of employing a score function approximation. Probabilistic Principal Component Analysis (PPCA) models serve this purpose since the marginals are given by Gaussian distributions. Let X = D and where A ∈ D ×Dz , I x ∈ D ×D , I z ∈ Dz ×Dz are the identity matrices, ψ is a positive scalar, and 0 is a vector of zeros. The conditional score function is s p (x |z ) = −(x − Az )/ψ 2 . In particular, the marginal density is given by While the posterior in this model is tractable, it is instructive to see how KSD estimation is performed by MCMC.
By using an MCMC method, such as the Metropolis Adjusted Langevin Algorithm (MALA) (Besag, 1994;Roberts and Tweedie, 1996) or Hamiltonian Monte Carlo (HMC) (Duane et al., 1987;Neal, 2011), we obtain latent samples ; samples z i are used to compute a score estimate at each point and these approximate score values are used to compute the U-statistic estimate in (5). By choosing suitably decaying kernels (Section B.3), we can guarantee the integrability condition in Lemma 1. The vanishing bias assumption in Theorem 2 corresponds to the convergence in mean, which can be measured by the Kantorovich-Rubinstein distance (Kantorovich, 2006) (also known as the L 1 -Wasserstein distance (see, e.g., Villani, 2009, Chapter 6)). Note that the negative logarithm of the unnormalized posterior density is strongly convex, and its gradient is Lipschitz; the strong convexity-and Lipschitz constants are independent of x . Therefore, using HMC for example, by appropriately choosing a duration parameter and a discretization step size, we can show that the bias of the above score estimate diminishes uniformly over x (Bou-Rabee et al., 2020).

| Type-I error and test power
We investigate the finite-sample performance of the proposed test in terms of type-I error and power rates. We generate data from a PPCA model R = PPCA(A, ψ). The dimensions of the observable and the latent are set to D = 100, D z = 10, respectively. Each element of the weight matrix A is drawn from a uniform distribution U [0, 1] and fixed. The variance parameter ψ is set to 1. As PPCA models have tractable marginals, we also compare our test with the KSD test using exact score functions (i.e. no MCMC simulation), which serves as the performance upper-bound.
The MCMC sampler we use is HMC; more precisely, we use the NumPyro ( We use two kernel functions: (a) the exponentiated quadratic (EQ) kernel k (x, x ) = exp − x − x 2 2 /(2λ 2 ) , and (b) the IMQ kernel (9) with β = 0.5, c = 1, and Λ = λ 2 I . All three tests use the same kernel function, which allows us to investigate the effect of using the Stein-modified kernel. The length scale parameter λ is set to the median of the pairwise (Euclidean) distances of holdout samples from R so that the parameter (and thus the hypothesis) is fixed across trials. We include the EQ kernel in our comparison, as the population MMD is possible to compute, allowing us to verify the hypothesis in advance.
We simulate null and alternative cases by perturbing the weight parameter A; we add a positive value δ > 0 to the (1, 1)-entry of A. Let us denote a perturbed weight by A δ . Note that the data PPCA model has a Gaussian TA B L E 1 Type-I errors the MMD test of Bounliphone et al. (2016), the proposed LKSD test, and the KSD test in PPCA Problem 1. Rejection rates are computed on 300 trials with significance level α = 0.05. The columns EQ-med and IMQ-med denote EQ and IMQ kernels with the median bandwidth, respectively. . Therefore, this perturbation gives a model N (0, A δ A δ + ψ 2 I x ), where the first row and column of A δ A δ deviate from those of AA . The perturbation is additive and increasing in δ, as each element of A is positive. We create a problem by specifying perturbation parameters (δ P , δ Q ) for (P , Q ). For the EQ-kernel MMD, we numerically confirmed that the perturbation gives a worse model for a larger perturbation. While the population KSD is not analytically tractable, this perturbation affects the score function through the covariance matrix, and the same behavior is expected for KSD; see Section B.6 in the supplement for details.

Problem 1 (null)
We create a null scenario by choosing (δ P , δ Q ) = (1, 1 + 10 −5 ) (P has a smaller covariance perturbation and is closer to R than Q ). For different null settings, we refer the reader to Section C.4 in the supplement. We run the tests with significance levels α = 0.01, 0.05. Table 1 reports the finite-sample size of the three tests for significance level α = 0.05. The result for α = 0.01 is omitted as none of the tests rejected the hypotheses. The size of the proposed LKSD test is indeed controlled. The extremely small type-I errors of the KSD tests are caused by the sensitivity of KSD to this perturbation; the population KSD value is negative and far from zero, and the test statistics easily fall in the acceptance region. The other two tests also have their error rates lower than the significance level. Note that their test thresholds are determined by treating the population discrepancy differences as zero, resulting in conservative tests.

Problem 2 (alternative)
We investigate the power of the proposed test. We set up an alternative scenario by fixing δ P = 2 for P and δ Q = 1 for Q . For comparison with different parameter settings, please see Appendix C.1. The significance level α is fixed at 0.05. All the other parameters are chosen as in Problem 1. Figure 1 shows the plot of the test power against the sample size in each problem. The KSD reaches a near 100 percent rejection rate relatively quickly, indicating that information from the score function is helpful for these problems. The effect of the score approximation is negligible in this experiment, as the power curve of the LKSD test overlaps with that of KSD. The power of the MMD test is substantially lower than the other tests, indicating that the MMD is insensitive to this perturbation to the covariance.  F I G U R E 2 Power curves of the proposed LKSD test and the MMD test in PPCA Problem 2. The perturbation parameters are set as (δ P , δ Q = 2, 1). each result is computed on 300 trials. The significance level α = 0.05. Markers: 3 (LKSD test with IMQ kernel); 0 (LKSD test with EQ kernel); ⃝ (MMD test with IMQ kernel); × (MMD test with EQ kernel).

| Effect of kernel parameter choice
Dependency on scaling parameter.
Using Problem 2 above, we examine how the test power is affected by the scaling parameter. We use the EQ and IMQ kernels as above, and choose their scaling parameter λ 2 from {10 −3 , 10 −2 , . . . , 10 3 }. For each n ∈ {100, 300} we run 300 trials and estimate the test power of the LKSD and MMD tests. Figure 2 plots the power curves of the tests.
We can see that the high-power region of the EQ kernel is localized while the IMQ kernel's power curves are flat, indicating that the IMQ kernel does not depend on the parameter as much as the EQ. Therefore, for this problem, the Rejection rate F I G U R E 3 Power curves of the MMD test, the proposed LKSD test, and the KSD test in PPCA Problem 2. All the test use the covariance-preconditioned IMQ kernel. The perturbation parameters are set as (δ P , δ Q = 2, 1). Each result is computed on 300 trials. The significance level α = 0.05. Markers: 3 (the LKSD test); (the KSD test); ⃝ (the relative MMD test).
IMQ kernel can be seen as more robust against misspecification of the scaling parameter. Nonetheless, with the right choice of the scaling parameter, the EQ kernel yields higher power for both MMD and KSD tests. It can be considered that the distinction arises because of the local nature of the difference between the two distributions; the EQ kernel is more sensitive in choosing features used to compute the KSD (see Section B.7).

Different parameterization.
We also consider a different parameter choice for the preconditioning matrix. Here, we compare the median-scaled IMQ kernel with the same kernel having a covariance preconditioning matrix, as suggested in Section 3.4. Figure 3 shows the power curves of the three tests. Here, the relation between the MMD and KSD tests is overturned, and the KSD test struggles to detect the perturbation to the covariance. This result demonstrates that certain kernel choices can make the testing problem more challenging than others. Using multiple kernels, rather than relying on a single choice, could therefore robustify the evaluation, at the expense of a loss of power due to multiple testing correction.

| Quality of Markov chain samplers
The asymptotic property of our test (Corollary 1) hinges on the quality of the Markov chain samplers. This section studies the effect of these Markov chains on the inference. We vary the burn-in size t and the score approximation sample size m, which is expected to affect the type-I error rate and the power of the test. In the experiments below, we set α = 0.05. We choose t from {50, 100, . . . , 600} and m from {1, 10, 100, 1000}.   F I G U R E 5 The effect of a poor MCMC sampler on the test. Type-I error rates against the burn-in size t with varying Markov chain sample size m. PPCA Problem 1 (the null H 0 is true). The dark dashed line indicates the significance level α = 0.05. The samplers for P and Q are respectively MALA and NUTS. Markers: 3 (m = 1); 2 (m = 10); 1 (m = 100); 4 (m = 1000).
In our first experiment, as in the previous sections, we use the NUTS with the same initialization strategy for both models. With n = 300, we run the test using Problems 1 and 2 above. Figure 4 shows rejection rates of the test for different settings of t and m. In both cases, the burn-in length t does not affect the test's performance, indicating the fast convergence of the sampler. The importance of a larger value of m can be seen when the alternative hypothesis holds, since the test power improves as m increases. The improved performance is likely due to reduced variance.
We next consider a slow-converging sampler for which the burn-in length t becomes crucial. We consider the null case (Problem 1) and replace the sampler for the first model P with MALA. We set the step size for the MALA sampler to make its convergence slow; we use the step size of 10 −4 D −1/3 z . We initialize the two samplers differently to make sure that the resulting distributions differ when the samplers have not converged: the MALA sampler for P is initialized with samples from a Gaussian N (1, . . . , 1), I z and the NUTS sampler for Q a uniform distribution Figure 5 demonstrates the relation between type-I error rates and choices of t and m. In contrast to the previous experiment, the burn-in has a clear effect on the type-I error: insufficient burn-in leads to uncontrolled error rates. The right panel (n = 300) shows that the test has substantially higher type-I error rates than in the left (n = 100). Comparison between these cases illustrates that a larger sample size n requires more intensive burn-in, as the test becomes more confident to reject. A large value of m improves the test as in the previous experiment. It can be understood that the contribution of burn-in samples is negligible in the score approximation. Although our analysis in Corollary 1 requires long burn-in, taking large m appears to be more important in practice, especially under a computational budget constraint. This experiment thus confirms the importance of the quality of the sampler.

| Dirichlet process mixtures
Our next experiment applies our test to a Dirichlet process mixtures (DPM) model. Let ψ (x |z ) be a probability density function on D . We consider a mixture density where ρ is a Borel probability measure on a Polish space Z. A DPM model (Ferguson, 1983) places a Dirichlet process prior DP(a) on the mixing distribution ρ. Thus, a DPM model DPM(a) assumes the following generative process: Here, a is a finite Borel measure on Z. Note that the marginal density (on a single observation) is given by Although the prior has an infinite-dimensional component, the required conditional score function is simply s ψ (x |z , φ); thus we only need to sample from a finite-dimensional posterior P Z (dz |x ). If a model is conditioned on held-out data , which may be used to estimate the density (10). The score function is given by the expectation of s ψ (x |z ) with respect to the posterior withF D the mean measure of P F (dF | D). Sampling from the posterior can be performed with a combination of the Metropolis-Hastings algorithm and Gibbs sampling (see e.g., Ghosal and van der Vaart, 2017, Chapter 5). For the score formula and the MCMC procedure, we refer the reader to Section B.8 in the supplement. By setting ψ to an isotropic normal density, for example, we can guarantee the integrability assumption in Lemma 1 (see Section B.3).
Our problem below considers comparing the predictive densities defined by two models with different Dirichlet process priors. Note that since candidate models P , Q here are point estimates derived from their respective posterior means of F , we discard some aspects of uncertainty in estimating the target (10); our setting only concerns evaluating the quality of those point estimates in approximating the data generating distribution R . The supposed null and alternative regimes are δ < 1 and δ > 1, respectively. Markers: 3 (the LKSD test); ⃝ (the relative MMD test). The dark dashed line indicates the significance level α = 0.05. The errorbars indicate the standard deviations of the estimated rejections rates.

Experiment details.
For the data distribution R , we use the mixture (10) defined by ψ (x |z ) = N (x ; z , 2I ) and ρ = N (0, I ); this choice yields R = N (0, 3I ). We consider the following simple Gaussian DPM model GDPM(µ), where µ ∈ D . Note that without conditioning on observations, the model's marginal density is simply a Gaussian distribution N (µ, 3I ), which does not require approximation.
We therefore compare predictive distributions, i.e., we compare two GDPM models conditioned on training data ∼ R . We consider two GDPM models with wrong priors, where their prior means are shifted. Specifically, we take and two models chosen as Q = GDPM(1) and P = GDPM(δ1) with1 = 1/ √ D . Unlike the preceding experiments, we condition the two models on the training data, and obtain the predictive distributions, denoted by P D tr and Q D tr , respectively; our problem is thus the comparison between P D tr and Q D tr . The distributions now require simulating their posterior, and we use a random-scan Gibbs sampler and the Metropolis algorithm with a burn-in period t = 1, 000 and the size of the latents m = 500. For sampling observables from the models, we use a random-scan Gibbs sampler with a burn-in period 2, 000. We expect that if the training sample size n tr is small, a larger perturbation would give a worse model as the effect of the prior is still present; we thus set n tr = 5. Due to the small sample size, the expected model relation might not hold, depending on the draw of D tr . Therefore, we examine the rejection rates of the LKSD and MMD tests, averaged over 50 draws; for each draw of D tr , we estimate the rejection rates based on 100 trials. Our problem is formed by varying the perturbation scale δ for P D tr , which is chosen from a regular grid {0.5, 0.6, · · · , 0.9, 1.1, · · · , 1.5}. This construction gives a null case when δ < 1, the alternative otherwise. We set the dimension D to 10 and the significance level α to 0.05. As in Section 4.1, we use the IMQ kernel with median scaling. Figure 6 reports the rejection rates of the two tests for each of n ∈ {50, 100, 200}. Note that the curves in the graph do not represent type-I errors nor power, as they are rejection rates averaged over draws D tr , each of which forms a different problem. It can be seen that on average, both tests have correct sizes (δ < 1). In the alternative regime (δ < 1), the LKSD test underperforms the MMD with a small sample size (n = 50); however, its improvement in power is faster and exceeds the MMD at n = 200. These results imply that the LKSD estimate has a large variance for a small sample size, whereas its estimand (the population difference) is also larger, and thus the mean of the test statistic diverges faster. Thus, it may be understood that the KSD is more sensitive to model differences in this setting.

| Latent Dirichlet Allocation
Our final experiment studies the behavior of the LKSD test on discrete data using Latent Dirichlet Allocation (LDA) models. LDA is a mixed-membership model (Airoldi et al., 2014) for grouped discrete data such as text corpora. We follow Blei et al. (2003) and use the terminology of text data for ease of exposition. Accordingly, the following terms are defined using our notation. A word is an element in a discrete set (a vocabulary) {0, . . . , L −1} of size L. A document x is a sequence of D words, i.e., x ∈ {0, . . . , L − 1} D is a D -dimensional discrete vector. A prominent feature of LDA is that it groups similar words assuming they come from a shared latent topic, which serves as a mixture component.
An LDA model assumes the following generative process on a corpus of documents {x i } n i =1 : where θ i is a probability vector of size K ≥ 1.

For the
where b k is the distribution over words for topic k , and the topic assign- Here, a = (a 1, . . . , a K ) is a vector of positive real numbers, and b = (b 1 , . . . , b K ) ∈ [0, 1] K ×L represents a collection of K distributions over L words. In summary, an LDA model P = LDA(a, b) assumes the factorization where z i and θ i act as latent variables.
Because of the independence structure over words, the conditional score function is simply given as Score approximation requires the posterior distribution p (z |x ; a, b) with respect to z . Marginalization of θ renders latent topics dependent on each other, and thus the posterior is intractable. A latent topic is conjugate to the corresponding topic distribution given all other topics. Therefore, an MCMC method such as collapsed Gibbs sampling allows us to sample from p (z |x ; a, b). As the observable and the latent are supported on finite sets, the use of Lemma 1 is justified; the finite moment assumptions in Corollary 1 are guaranteed; and the consistency of the population mean and variance of the test statistic follows from the convergence of ¾ for each x ∈ X.

| Synthetic data -prior sparsity perturbation
In the two problems below, we observe a sample {x i } n i =1 from an LDA model R = LDA(a, b). The number of topics is K = 3. The hyper-parameter a is chosen as a = (a 0 , a 0 , a 0 ); for model R , we set a 0 = 0.1. Each of three rows in b = words.
We design problems by perturbing the sparsity parameter a 0 . Recall that Dir(a) is a distribution on the (K − 1)probability simplex. A small a 0 < 1 makes the prior p θ (θ i |a) = Dir(a) concentrate its mass on the vertices of the simplex; the case a 0 = 1 corresponds to the uniform distribution on the simplex; choosing a 0 > 1 leads to the prior mass concentrated on the center of the simplex. The data distribution R (with a 0 = 0.1) is thus intended to draw sparse topic proportions θ i , and a document x i is likely to have words from a particular topic. By increasing a 0 , we can design a departure from this behavior. Therefore, as in the PPCA experiments, we additively perturb a 0 with parameters (δ P , δ Q ) for respective candidate models (P , Q ).
As LDA disregards word order, we need a kernel that respects this structure. We use the Bag-of-Words (BoW) 2 ) −1/2 ; it is simply the IMQ kernel computed in the BoW representation B (x ) ∈ {0, 1, 2, . . . , D } L whose -th entry (counting from 0 to L − 1) is the count of the occurrences of word ∈ {0, . . . , L − 1} in a document x . By Lemma 10 in Section B.9.1, this choice ensures that arbitrary reordering of text sequences does not change the KSD value; i.e., the KSD does not assess models by their ability to generate sequences.
We also tested differing input-scaling values and found that the bandwidth of the IMQ kernel did not have a significant effect on the test power (Section C.3.2).
For score estimation in the LKSD test, we use a random scan Gibbs sampler; we generate m = 1, 000 latent samples after t = 4, 000 burn-in iterations.

Problem 1 (null)
We create a null situation by having (δ P , δ Q ) = (0.5, 0.6). In this case, Q s prior on θ is less sparse than that of P . Table   2a shows the size of the different tests for significance levels α = 0.05; the result for α = 0.01 is omitted as both tests did not reject the hypothesis. It can be seen that the rejection rates of both tests are bounded by the nominal level.

Problem 2 (alternative)
We consider an alternative case in which the sparsity parameters are chosen as (δ P , δ Q ) = (1.0, 0.5) (we consider other parameter choices in Appendix C.3). Here, the model Q is expected to have less mixed topic proportions. Table   2b demonstrates the power of the MMD and LKSD tests. The power of the LKSD test improves as the sample size n increases, whereas the MMD has almost no power in this case. In this problem, the topics b are not sparse enough for each topic to have a sufficiently distinctive vocabulary. Thus, the problem is challenging for the MMD, as it is unable to find distinguishing words, in addition to the high-dimensionality. By contrast, the KSD is able to distinguish the models by taking advantage of their underlying structure.

| Synthetic data -topic perturbation
We provide a negative example to illustrate a failure mode of the LKSD test for discrete data. The data is generated as in the previous section, whereas we construct two models differently. We set up a model by perturbing the topics of the data model R . That is, a model is given by LDA(a, b δ ) with b δ = (1 − δ)b + δb ptb with 0 < δ < 1. We choose b ptb as we did for b; the value is drawn independently of b. We set the perturbation parameter for Q as δ = 0.01 and vary it for P , where the value is chosen from {0.06, 0.11, . . . , 0.51}. Thus, P is morphed from b to b ptb and therefore expected to underperform Q as perturbation δ increases. We run trials with n = 300. For score estimation, we take m = 10, 000 and t = 4, 000. Figure 7 shows the plot of rejection rates against perturbation parameters. We see that the power of the LKSD test degrades as the perturbation increases. As P 's topic becomes close to b ptb , some words in the target's topic b become rare and therefore fall in the low probability region of P . This situation leads to increasing variance of the test statistic as δ increases, because the score function contains the reciprocal 1/p (x ). The LKSD test can therefore fail when the support of the model is severely mismatched to that of the data, since the high variance of the statistic makes is difficult to detect significant departures from the null. Note that this observation does not apply to the continuous counterpart as the score can be written as the gradient of the logarithm of the density, which is typically numerically stable.

| Comparing topic models for arXiv articles
Our final experiment investigates the test's performance using the arXiv dataset (Cornell University, 2020). The dataset consists of meta information of scholarly articles on the e-print service arXiv. We treat the abstract of an article as a document, and use paper categories to set up a problem. Specifically, we construct a problem by choosing three paper categories for model P , Q and the data distribution R . Unlike the preceding experiments, for a model category, we fit an LDA model to the dataset of abstracts in the category. As the KSD requires the number of words to be fixed, then for a given data category, we extract abstracts of length no less than D = 100 and subsample excess words.
This process yields a dataset of articles of equal length D ; for each trial, we obtain the data {x i } n i =1 by subsampling from the larger set of articles. Thus, our problem is to compare two LDA models trained on different article sets, and assess their fit to the dataset.
In the following experiments, we examine the power of LKSD and MMD tests. We vary the sample size n from 100 to 500. We fix the dataset category to stat.TH (statistics theory) and inspect two combinations of model categories.
To train an LDA model LDA(a, b), we use the Gensim implementation (Rehurek and Sojka, 2011) of the variational algorithm of Hoffman et al. (2010). For sparsity parameters a, we use the parameter returned by this algorithm; we point-estimate topics b using the mean of the topics under the variational distribution. The number of topics is set to 100. The vocabulary set is comprised of words that appear in the abstracts of three chosen categories. As in the previous experiments, we use the IMQ-BoW kernel for both tests. We fix the significance level α at 0.05.
As we have seen the numerical instability issue in the previous section, we also consider an alternative KSD that is stable but computationally more expensive, as mentioned in 2.1 and the supplement (Section B.2). For this, we take a burn-in size t = 500 and a Markov chain size m = 1, 000. We denote this method by LKSD-stable.

Probability theory vs Statistical methodology.
We choose math.PR (mathematics probability theory) for P and stat.ME (statistics methodology) for Q . In addition to the taxonomic proximity to stat.TH, the category stat.ME has a larger proportion of articles shared with the target category: 3, 121 of 18, 973 (stat.ME) vs. 2, 884 of 46, 769 (math.PR). Thus, we expect Q to outperform P . This combination results in a vocabulary set of size L = 126, 190. For score estimation, we set the burn-in length t to 500 and the Markov chain sample size m to 5, 000. Additionally, we run the LKSD test with m = 15, 000 (labeled LKSD-extra) and the MMD test with the model sample size n model = 10, 000 (labeled MMD-extra). The sample size n model is thresholded at 10, 000 as the computational cost exceeds that of the LKSD test (in fact, sampling in this case makes the MMD by an order of magnitude slower due to the large vocabulary size).

Machine learning vs Statistical Methodology.
Our second experiment uses cs.
LG (computer science machine learning) for P , while Q uses the same category as the previous experiment. With this combination, the vocabulary size L is 208, 671. By the same reasoning as above, the second model Q is expected to be better than P . We run the same tests as above and compare their performances. when there is a severe mismatch in data and model support. The stable LKSD test approaches the same level as the MMD at n = 500, but still underperforms. While stable, the KSD used for this test can also suffer from the mismatch of the support, since it depends on the same density ratio as in the unstable counterpart.

| CONCLUSION
We have developed a test of relative goodness of fit for latent variable models based on the kernel Stein discrepancy.
The proposed test applies to a wide range of models, since the requirements of the test are mild: (a) models have MCMC samplers for inferring their latent variables, and (b) likelihoods have evaluable score functions. The proposed test complements existing model evaluation techniques by providing a different means of model comparison, which takes advantage of the known model structure. Our experimental results confirm this view -the relative MMD test was unable to detect subtle differences between models in several of our benchmark experiments.
Our asymptotic analysis of the test statistic indicates that the test could suffer from bias if the mixing of the deployed MCMC sampler is slow. Removing the assumptions on the bias and the moments in Theorem 2 is certainly desirable; we envision that the recent development of unbiased MCMC (Jacob et al., 2020) could be used to construct an alternative unbiased KSD estimator, and leave this possibility as future work. While we have focused on comparing two models, extensions to ranking multiple models are possible as in (Lim et al., 2019). Finally, the technique used in this article can be applied to other Stein discrepancies requiring the score function Xu and Matsuda, 2020); one interesting application would be the KSD for directional data (Xu and Matsuda, 2020), where densities with computable normalizing constants are scarce.

Data availability
Code to reproduce all the results is available at https://github.com/noukoudashisoup/lkgof.

A | PROOFS
This section provides proofs for the results concerning the asymptotic normality of our test statistic: (a) Theorem 2, and (b) an estimator of the variance of a U-statistic and its consistency.

A.1 | Asymptotic normality of approximate U-statistics
Theorem 2 (Asymptotic normality) Let {γ t } ∞ t =1 be a sequence of Borel probability measures on a Polish space Y and γ be another Borel probability measure. Let Y ∼ γ t , and for a symmetric function h : Y × Y → , define a U-statistic and its mean by converges to a constant σ 2 . Assume that we have θ t → θ as t → ∞. Then, in the limit of large n and of t growing as a function of n such that √ n (θ t − θ) → 0, the following two statements hold: if Proof Recall that (Ω, S, Π) is the underlying probability space, and U (t ) n is a random variable on it. We show that the cumulative distribution function (CDF) of √ n (U (t ) n − θ) converges to that of a normal distribution in the limit of n and t specified in the statement.
First we consider the case σ > 0. Note that we can express the CDF as .
We show that both terms on the RHS converge to zero simultaneously. Let ν = lim sup t →∞ ν t and δ be a fixed constant such that 0 < δ < σ. Note that by the convergence assumptions on ν t and σ t , there exists t ν,δ ≥ 1 such that ν t < ν + δ < ∞ for any t ≥ t ν,δ ; and there exists t σ,δ ≥ 1 such that σ t > σ − δ > 0 for any t ≥ t σ,δ . By the Berry-Esseen bound for U-statistics (Callaert and Janssen, 1978), for t ≥ max(t ν,δ , t σ,δ ) and any n ≥ 2, the term (i) is bounded as where C is a universal constant. For the term (ii), by the continuity of Φ and our assumptions on √ n (θ t − θ) and σ t , we can make the term (ii) arbitrarily small. Formally, by assumption, we can specify t = g (n) with an increasing function g such that √ n (θ g (n ) − θ) → 0 as n → ∞; therefore, for ε > 0, we can take N ε/2 ≥ 1 such that the term (ii) is bounded by ε/2 for any n ≥ N ε/2 . Thus, for any ε > 0, choosing n and t such that Next, for the case σ = 0, consider the squared error n¾(U (t ) n − θ) 2 , which is decomposed as The first term is the variance of the U-statistic U (t ) n , and so according to Hoeffding (1948, Eq. 5.18), we have, for any n ≥ 2, We have w = lim sup t →∞ Var Y ,Y ∼γt ⊗γt [h (Y ,Y ) ] < ∞ by the finiteness of (the limit supremum of) the third central moment, and σ 2 t → σ 2 = 0 by assumption. Therefore, for any ε > 0, we can take t ε,v ≥ 1 such that for any t ≥ t ε,v . Choosing n ≥ 4(w + 1)/ε + 1, we have n¾(U (t ) n − θ t ) 2 ≤ ε/2. For the second term, we can take N ε/2 ≥ 1 such that n (θ t − θ) 2 ≤ ε/2 for any n, t ≥ N ε/2 . Thus, having

A.2 | Variance of a U-statistic
We first recall known facts about U-statistics. For an i.i.d. sample {y i } n i =1 ∼ R , let us define a U-statistic where h is a symmetric measurable kernel. According to Hoeffding (1948, Eq. 5.18), the variance of U n is where ζ 1 = Var y ∼R ¾ y ∼R [h (y , y ) ] and ζ 2 = Var y ,y ∼R ⊗R [h (y , y ) ] . To obtain the asymptotic variance of the √ n (U n − ¾[U n ]), we only need the first term ζ 1 as nVar[U n ] → 4ζ 1 (n → ∞), assuming ¾ (y ,y ) ∼R ⊗R [h (y , y ) 2 ] < ∞.
Recall that our test statistic in Section 3.3 requires a consistent estimator of the asymptotic variance in Corollary 1 (see also Theorem 2). To accommodate the setting of our test, we consider the following situation: we are given samples {y  n ] is given by the corresponding parameters ζ 1,t and ζ 2,t (see Eq. (11)). In the following, we address the estimation of σ 2 = lim t →∞ σ 2 t with σ 2 t = 4ζ 1,t , assuming that the limit exists. As Theorem 2 suggests, the quantity σ 2 is the asymptotic variance of √ n (U (t ) n − ¾[U n ]). In the main body, we define our test using the jackknife estimator v n,t := (n − 1) (see also the test statistic T n,t in Section 3.3), where U (t ) n,−i the U-statistic computed on the sample with the i -th data point removed. In the following, we provide a consistency proof for this estimator.
Therefore, if ¾h (y , y ) 2 < ∞, the estimator v J n converges to the asymptotic variance 4ζ 1 of the U-statistic Next, we recover the dependency on t and define a jackknife estimator v J n,t = (n − 1) which is the estimator in (13) computed on the sample y (t ) i n i =1 . The following lemma provides the required consistency. Let σ 2 = lim t →∞ σ 2 t where σ 2 t = 4ζ 1,t = 4Var y ∼Rt ¾ y ∼Rt [h (y , y ) ] . Then, we have the double limit ¾ v J n,t − σ 2 2 → 0 as n, t → ∞.
Proof Note that we have the following relation The decomposition indicates that as long as the bias and the variance terms in (16) decay as n, t → ∞, the estimator v J n,t serves as a consistent estimator of σ 2 (note that we have σ 2 t − σ 2 → 0 by assumption). In the following, we show that the assertion holds.
Therefore, taking appropriate n, t diminishes Var[Û (t ) c ]. In consequence, as for the bias term, we can take Thus, the bias and variance terms can be made arbitrarily small by taking n, t ≥ max (N v ,ε , N b,ε ).

B | AUXILIARY RESULTS
In this section we provide auxiliary results used in the main body.

B.1 | Kernel Stein discrepancy for bounded domains
We detail the case where data take values in a bounded open set X ⊂ D . For a density p, we require that p ∈ C (X) ∩C 1 ( X) withX the closure of X. The kernel is assumed to satisfy the same conditions everywhere; i.e., k (x, ·) ∈ C (X) ∩C 1 ( X). Additionally, we require the following conditions : (a) the domain X is assumed to be sufficiently regular -we assume that X is convex or C 1 ; (b) for any boundary point x ∈ ∂ X, p (x )k (x, ·) = 0; and (c) ¾ x ∼P s p (x ) 2 < ∞, The first assumption allows us to use the divergence theorem, and the third one ensures the P -integrability of s p , f and , f (Steinwart and Christmann, 2008, Corollary 4.36). By the divergence theorem, the second condition on the kernel implies that any f = (f 1 , . . . f D ) of the corresponding RKHS F satisfies ¾ x ∼P [ A P f (x ) ] = 0 (Gorham and Mackey, 2015, Proposition 1). We can fulfill the required condition (b) by and ϕ (x ) = 0 for x ∈ D \ X. The KSD is then defined similarly as in Section 2. We demonstrate an example of this setting in Section C.2.

B.2 | Numerically stable Stein operator
As we note in Section 2, the Stein operator of Yang et al. (2018) can induce numerically unstable functions, as it contains the reciprocal 1/p (x ). This appendix provides a more stable alternative using the Zanella-Stein operator proposed by Hodgkinson et al. (2020). We focus on the Barker-Stein operator of Shi et al. (2022) below, but other operators can be considered depending on the application. For instance, when the size L of the discrete domain is small (e.g., binary domains such as adjacent matrices of graphs), we may use the Gibbs-Stein operator (Bresler and Nagaraj, 2019;Reinert and Ross, 2019;Shi et al., 2022), as it is manageable to perform the required conditional expectation.

B.2.1 | Univariate case
We first consider the univariate case X = {0, . . . , L − 1} with L > 1, as the multivariate case builds on the construction here. Let be a density p on X. The Zanella-Stein operator is defined by where the function a : [0, ∞) → [0, ∞) is referred to as a balancing function, which is assumed to satisfy a (0) = 0, and a (r ) = r a (1/r ) for r > 0. The symbol N x ⊂ X denotes the neighborhood of x . The operator is derived from the infinitesimal generator of a Markov jump process with the jump rate given by a as a function of p (y )/p (x ). In the following, we use the Barker balancing function a (r ) = r /(1 + r ); this results in This choice of the balancing function was proposed by Shi et al. (2022) and is referred to as the Barker-Stein operator.
The above ratio takes values between 0 and 1 (it saturates to one as the ratio p (y )/p (x ) gets large), and is thus numerically stable even when p (x ) is close to zero. The neighborhood N x is chosen such that the process can transit from any starting point to any other point, and admits p as the invariant distribution (see Example 2 of Hodgkinson et al., 2020, for details). In the following, N x is assumed to be the two adjacent points of x with respect to the cyclic difference. This construction is not limited to D = 1, but it can be challenging to compute the sum in a high-dimensional space when applied to the KSD, since the number of neighbors typically grows with the dimension.

B.2.2 | Multivariate case
We next consider the multivariate case X = {0, . . . , L − 1} D with D > 1. We follow the product space construction of a Stein operator (Hodgkinson et al., 2020, Proposition 2). We define an operator A p that acts on a D -dimensional where p d is the distribution given by conditioning on all but the d -th coordinate, . . x D ), and P x d is the projection defining a function on the d -th coordinate by freezing all the other coordinates; i.e., P x d f : . We can define the KSD using a vector-valued RKHS F = D d =1 F k with F k the RKHS of a scalar kernel k : X × X → . With this choice, we can define the KSD as in 2.1; in particular, for this operator, we have for any x, y ∈ X, with A Uni * ,p d (·|x −d ) acting on * . Specifically, the Stein kernel h p is given by Here, N x,d denotes the set of neighborhood points with respect to the d th coordinate; we identify this as a set of functions, each of which maps x to the corresponding neighboring point. The weight a ν is defined by From this expression, we can conclude that the KSD is possible to estimate as long as we can evaluate a ν . Note that a single evaluation of the Stein kernel h p requires O (D N ) with N being the maximum of the sizes of the neighborhood N x,d (in our case, N = 2).

B.2.3 | Application to latent variable models
The KSD defined above admits essentially the same treatment as in the main body. The Stein operator above requires evaluating the marginal density p (x ) = ∫ p (x |z )P Z (dz ). Following the approach in Section 2.2, we can circumvent this issue as The weight a ν (x ) can be estimated by sampling from the modified posterior Given a sample z = {z i } m i =1 for simulating P Z ,ν (dz ), we can estimate a ν (x ) by which is possible to evaluate as long as the likelihood p (x |z ) is tractable. In contrast to the difference operator KSD, this approach requires simulating O (D ) posterior distributions, as each dimension generates distinct modified posterior distributions. Thus, while numerically stable, this approach is computationally more expensive. The user might want to consider this limitation when they choose between the operator defined here and the difference Stein operator considered in the main body.

B.3 | Integrability condition in Lemma 1
We give sufficient conditions for the integrability condition By the triangle inequality, we have From this, the integrability condition is satisfied if Note that for a finite domain X = {0, · · · , L − 1} D , these conditions are trivial, as .
In what follows, we consider continuous-valued x . As mentioned in the main body, the third condition is mild, and fulfilled by e.g., the exponentiated quadratic kernel. Unfortunately we do not have a handy test for the other requirements, and therefore deal with specific scenarios below. To this end, we clarify the growth of ¾ z |x s p (x |z ) 2 as a function of x so that the user can check the required conditions above.
Exponential families with bounded natural parameters.
Let us first consider an exponential family likelihood p (x |z ) ∝ exp S s=1 T s (x )η s (z ) , T s : D → , η s : Z → for 1 ≤ s ≤ S . For this likelihood, we have The conditions concerning the score function are satisfied provided that we have
Let a (x ) := x T s (x ) 2 ¾ z |x |η s (z ) |. These conditions can be verified if both the kernel and its derivative decay faster than a (x ). This can be challenging in practice as we have a posterior expectation in a (x ) whose dependency on x may not be easily analyzed. If we restrict the likelihood to have bounded parameters (i.e., η s (z ) is bounded), then the posterior expectation is bounded, so we only need to choose a kernel such as for a given kernel κ. We summarize this observation in the following lemma: Lemma 4 Consider a latent variable model with likelihood p (x |z ) ∝ exp S s=1 T s (x )η s (z ) , T s : D → , η s : Z → for 1 ≤ s ≤ S and an arbitrary prior P Z . If sup z ∈Z n s (z ) < ∞, then Furthermore, for a given bounded kernel κ, reweighting it by with some δ ≥ 1/2 ensures that the condition (17) holds.
The boundedness assumption on the natural parameter η s (z ) may not be satisfied for some models. For instance, if we consider a normal mixture model with prior on the mean of the mixture component, the support of the prior could be unbounded.
Alternatively, we consider a location-scale family given by a radial-basis function ψ : [0, ∞) → (0, ∞), with prior P µ,σ placed on the parameters. Here, we make the following assumptions:

Assumption 2 The prior satisfies
Isotropic normal-and Student's t-densities satisfy Assumption 1. As ψ is monotonically decreasing, the third condition in Assumption 2 can be verified alternatively by These assumptions effectively prevent the density from being peaky and thus control the growth of the score function.
Under these assumptions, we can quantify the growth of the score function as follows.
Proof First, note that we have .
We therefore show that the growth of the function g is of the order of x 2 by examining the limit lim For the numerator of g , note that we have where the first inequality is given by the Cauchy-Schwarz inequality. Thus, The second line is obtained with the relation between the L 1 -and L 2 -norms: p x L 1 (Pµ,σ ) ≤ p x L 2 (Pµ,σ ) . The function inside the integral in (18) is monotonically decreasing at each (µ, σ) as and by Assumption 2, it is integrable when x 2 = 1. Thus, by the monotone convergence theorem, we have lim where the upper-bound is finite by Assumption 2, indicating that g (x ) = O ( x 2 ). Therefore, for the location-scale family, we have that the score part grows at the speed at most of x 2 . Therefore, modifying kernel κ by ensures that the decay of the kernel is as fast as the score part.

B.4 | Convergence assumption in Theorem 2
To apply Theorem 2 to the KSD estimate, we need to verify that the bias ¾ y ,y ∼Rt ⊗RtH p (y , y ) − ¾ y ,y ∼R ⊗R H p (y , y ) decays at some rate. Here,R t (d(x, z)) = P (t ) Z (dz|x )R (dx ) (see the paragraph before Theorem 2 for the notation).
In what follows, we assume that the MCMC sampler satisfies for some function r (t ) : → (0, 1] decreasing to 0 in t and a positive function M (x ), wheres p,d denotes the d -th element of the conditional scores p (the same rule applies to s p ). There is usually dependency in M (x ) on the initial state of the chain, but this is suppressed here for simplicity. This assumption can be understood as the convergence of the t -step transition law of the Markov chain P (t ) Z (dz |x ) to the target P Z (dz |x ) in terms of the test functions s p,d (x | ·). The convergence can then be checked with the assumption that the test functions belong to a certain class, and the upper-bound of (20) can be stated as a worst-case convergence rate in that class. For example, the convergence in total variation distance corresponds to the class of nonnegative measurable function bounded uniformly by 1.
A specification of the decay rate r and the function class relates the condition (20) to standard notions of ergodicity studied in the Markov chain literature (Roberts and Rosenthal, 2004;Meyn et al., 2009). For instance, suppose that we have the geometric rate r (t ) = ρ t for some 0 < ρ < 1, and that the function class is such that all members are measurable and bounded uniformly by a function V : Z → [1, ∞). Then, the corresponding convergence is known as V -uniform convergence (if V ≡ 1, this corresponds to the uniform ergodicity) (Meyn et al., 2009, Chapter 16).
We can reduce the convergence of the bias (19) to that of the score estimate (20). To see this assertion, we first note that The first term of ¾ y ,y ∼Rt ⊗RtH p (y , y ) − ¾ y ,y ∼R ⊗RH p (y , y ) the difference in the first terms can be bounded by r (t ). A similar argument can be applied to the second and third terms. The constant M (x ) in the bound in (20) often depends on certain properties of the target P Z (dz |x ). If those properties hold uniformly over x (i.e., M (x ) can be treated as a constant), then the validation of these conditions is straightforward. A concrete example of such properties is strong log-convexity and having a Lipschitz continuous gradient of the target (assuming the target is given by a density p (z |x )) (Dalalyan, 2017;Dwivedi et al., 2019;Bou-Rabee et al., 2020). In this situation, the bias (19) is determined by the rate r (t ) at which the score function converges. Hence, t has to grow as a function of n such that t (n) = O {r −1 (n −s ) } with s > 1/2 to apply Theorem 2.

B.5 | The maximum mean discrepancy relative goodness-of-fit test
We provide the detail of the MMD relative goodness-of-fit test proposed by Bounliphone et al. (2016) and describe our modification to correct for underestimation of the variances required in the test (see Appendix C.4). Recall that the MMD between two probability distributions P and R is defined as the following IPM: where F k is the RKHS of (scalar-valued) kernel k : X × X → . The MMD can be written in terms of the kernel as ∼ R, we can estimate the squared MMD with a two-sample U-statistic (Kowalski and Tu, 2007, p. 131) MMD 2 (P , R ) = 1 n P 2 1 n R 2 j 1 <j 2 i 1 <i 2 Note that this statistic is equal to the unbiased estimator of Gretton et al. (2012, Eq. 3).
∼ R and two competing models P , Q , the relative MMD test is defined as follows: H 0 : MMD(P , R ) ≤ MMD(Q , R ) (null hypothesis), Note that here we use a different symbol for the data in the main body (there, the data variable is x ). Their procedure does not consider the case where the sample size n R does not match the sizes n P , n Q of respective samples {x i } n P i =1 , {y i } n Q i =1 from P and Q . Therefore, we provide the test procedure accommodating this case. The test statistic is defined by the difference between estimates of the squared MMDs This statistic is a three-sample U-statistic; its asymptotic distribution is normal under the same assumptions on the relations between the models and data distribution as in the main body (the second paragraph of Section 3.1). Let n sum = n P + n Q + n R . Assume the following growth condition on the sample sizes n sum n P → ρ 2 P , n sum n Q → ρ 2 Q , and n sum n R → ρ 2 R with finite constants ρ P , ρ Q , and ρ R . Assume ¾[ diff (x, x ; y , y ; z , z ) 2 ] < ∞. Then, according to Kowalski and Tu (2007, Theorem 3, p.168), the limit of (n P , n Q , n R ) gives x ; y , y ; z , z ) |x ], and the same notation applies to the conditional expectations of diff of y and z .
With a consistent estimatorσ P ,Q ,R , Bounliphone et al. (2016) proposed an asymptotically level-α test that rejects the null hypothesis if MMD 2 (P , R ) − MMD 2 (Q , R ) ≥ (σ P ,Q ,R / √ n sum ) · τ 1−α with τ 1−α the (1 − α)-quantile of the standard normal distribution. We found that the estimator given by Bounliphone et al. (2016) tends to underestimate the target variance, and that their test exceeded the nominal level in some problems where two models are close to each other. We therefore consider another estimator for our experiments, as described below.

Variance estimators
Following are the variances required for σ 2 P ,Q ,R : The first two quantities are symmetric in terms of x and y , and therefore we only need to consider one of them. An estimator for the first variance is given by We similarly estimate the third variance using The estimators (21) and (22) are simple to compute and always nonnegative. The consistency of these estimators can be checked by expanding the expressions. The derivation is tedious, and therefore we only prove it for (21). Then,Eq. (21) estimates consistently in the limit of (n P , n R ) such that the ratio n P /n R converges to a finite constant. Then,Eq. (22) estimates consistently in the limit of (n P , n Q , n R ) such that the ratios n P /n R and n Q /n R converge to finite constants.
Proof Note that the estimator (21) is the (approximate) sample variance Showing the consistency is equivalent to proving the following limits (the symbol p → denotes convergence in probability): The first limit is immediate as For the second convergence claim, we expand the expressions as follows: Note that the terms n −1 P n P i =1f i (x i ) 2 corresponding to A and B above vanish in the limit, since by the law of large numbers for U-statistics, 1 n P (n P − 1) The other three terms are U-statistics (up to scaling negligible in the limit) of their counterparts in ¾ x [f (x ) 2 ]. Thus, by the same reasoning, the second limit holds.

B.6 | MMD and KSD for Gaussian distributions
We provide explicit forms of MMD and KSD measured for Gaussian distributions. These results are used in constructing the PPCA experiment in Section 4 in the main body. In the process, we also obtain an understanding of the role of the reproducing kernel in the KSD.

B.6.1 | MMD
This section provides an explicit formula for MMD between two normal distributions, defined by the exponentiated quadratic kernel The MMD expression in this setting has been shown (see e.g., Sriperumbudur et al., 2012, Example 3). We use this formula to compute the difference of MMDs so that we can construct a problem as in Section 4.1.
Note that we can numerically evaluate this expression given covariance matrices. For completeness, we provide a proof below.
Proof Recall that the MMD between two distributions P , R is given by Let p (x ) = N (x ; µ p , Σ p ), r (x ) = N (x ; µ r , Σ r ). Note that when properly scaled, the exponentiated quadratic kernel can be also seen as a Gaussian density function. Therefore, by convolution, we have Then, the first term is where | · |denotes the determinant, and The second term is The third term is similarly obtained, but its form is not necessary for comparing models. We then impose the condition µ p = µ r = 0. In this case, we have Thus, substituting three Gaussian distributions P = N (0, Σ p ), Q = N (0, Σ q ), and R = N (0, Σ r ), we obtain the desired equality.
For two Gaussian densities p (x ) = N (x ; 0, Σ p ), r (x ) = N (x ; 0, Σ r ), the difference between their score functions is Therefore, where ·, · denotes the matrix inner product, and M k ,R is a matrix depending on the kernel k and the data distribution R . Therefore, it can be informally understood that the KSD depends on the difference between the covariances Σ p and Σ r ; if Σ p is given by additive perturbation as Σ r + E , the difference depends on the perturbation matrix E . Note that in the PPCA experiments, the perturbation matrix is an increasing function of δ, element-wise.

B.7 | Kernel choice and KSD: Gaussian models and data
We illustrate how kernel choice affects the sensitivity of the KSD. We consider the following setting: P ∼ N (0, diag(1, , 1, . . . , 1)), R ∼ N (0, diag(σ 2 1 , . . . , 1)) for some positive σ 1 1. The model P misestimates the variance of the first coordinate of the data. Let us consider the effect of a parameter choice for the IMQ kernel. We specifically compare the following kernels: where x 1 denotes the first coordinate of x . The latter kernel can be considered as the (preconditioned) IMQ kernel with dimension-wise scaling Λ = diag(σ 2 1 , 1, . . . , 1) where the scale is determined by the dimension-wise variance of the data. The KSDs corresponding to these kernel choices are given as follows: where the expectations are taken with respect to independent standard Gaussian random variables X ,Y . For the KSD to be sensitive to this deviation in variance, the expectations on the RHS have to be large. In this regard, the key difference between these expressions is that the variance σ 2 1 appears in the coefficient of (X 1 − Y 1 ) 2 . When σ 1 1, the non-scaled IMQ kernel k IMQ pays more attention to the first coordinate than the scaled counterpart k scale IMQ , and we can therefore expect a higher expectation value for k IMQ . On the other hand, when σ 1 1, the relation flips as σ 1 reduces the contribution of the first coordinate. Note that this relation holds more starkly in high dimensions, as the rest of the coordinates have greater influence on the kernel output. These considerations show that the ability to choose an effective kernel depends on the problem (not surprisingly). In particular, it shows that dimension-wise scaling (or covariance preconditioning) could hurt the performance in some problems.
Finally, if we instead use the exponentiated quadratic (EQ) kernel k EQ (x, y ) = exp − x − y 2 2 , then we have .
When σ 1 1, the EQ has higher selectivity in the first coordinate than the IMQ because of its exponential decay; the KSD with the EQ kernel could be more useful in revealing this perturbation. However, in practice, we do not know the discrepancy of our models a priori. At least in terms of local sensitivity such as the example above, the IMQ kernel could be considered more robust against poor specification of input scaling than the EQ, as the effect of input scaling is less significant.

B.8 | The score formula for Dirichlet process models
We provide an explicit formula for the score formula (3) for Dirichlet process mixture models, mentioned in 4. We first note that the density p (x | D) is given by whereF D is the mean measure of the posterior distribution of F given D. Note thatF D is the mean of a mixture of Dirichlet processes with the mixing distribution given by the distribution of the latents of the training data {z i } n tr i =1 conditioned on D (see Ghosal and van der Vaart, 2017, Remark 5.4). We can interchange the inside integral and the expectation of F , which results in the second line. This expression immediately gives We discuss how to evaluate the expectation with MCMC. Our target distribution is Note that this distribution is a mixture of two distributions written as For the Gaussian DPM model in 4.2, we can sample from the posterior ψ (x |z, φ)/C a da (z ), and it can be used for initializing the Markov chain. The distribution in the second term is not given in closed form as the mean measure b is unknown, but we can sample from b (and so fromF D ) by Gibbs sampling (Ghosal and van der Vaart, 2017, Theorem 5.3). Assuming that we can generate samples fromF D , we can use the random walk Metropolis algorithm where the acceptance probability of the transition from z to z is given by with the proposal distributionF D . However, sampling fromF D cannot be performed exactly, and therefore we use Gibbs sampling after sufficient burn-in. Consequently, the use of Gibbs and Metropolis samplers allows us to sample from ψ (x |z )/p (x | D)F D approximately.
B.9 | Invariance properties of kernel Stein discrepancy

B.9.1 | Model invariance
In some applications, models are designed to be invariant to certain transformations. When comparing a class of models invariant under a transformation, model comparison should be made so that the ranks of the models remain unaffected under the transformation of the data. We show that essentially for rotational transformations, we can make the KSD invariant by choosing a rotationally invariant kernel. In the following, for a map T : X → X and a distribution P , we denote T -push-forward of P by T # P ; i.e., T # P is defined as the distribution of a random variable T x with x ∼ P .

Lemma 9
Assume that for an orthogonal matrix O , the following conditions hold: (a) P has a density p such that p Proof When we push forward the data distribution by O , the KSD becomes The assumption p (O x ) = p (x ) implies that where {e 1 , . . . , e D } denotes the standard basis of D . For the kernel derivatives k 1 and k 12 , we obtain Thus, An analogous result holds for the KSD for discrete observations. Proof The proof proceeds as in the previous lemma. Note that taking the cyclic forward difference with respect to the i -th coordinate gives Similarly, for the kernel derivatives k 1 and k 12 , we have Thus, and therefore KSD (P (O σ ) # R ) = KSD (P R ) .

B.9.2 | Coordinate-choice independence
In general, the KSD is not invariant to a change of coordinates, and the KSD may be affected if we transform both the model and the data distribution. Precisely, for some one-to-one map T : X → X, we might have KSD (T # P T # R ) The following result is essentially the same as the previous lemmas except that here we do not have the invariance assumptions for the model; it shows that the KSD can be made invariant at least under rotation and translation.

Lemma 11 Let T be an affine transform such that
x, x ∈ D . Then the KSD between P and R is invariant under T ; that is, KSD (T # P T # R ) = KSD (P R ) .
Proof Let us denote the density of T # P by p T (x ) = p (T −1 x ). Then, its score function satisfies Similarly, for the kernel derivatives k 1 and k 12 , we have These relations imply that the Stein kernel satisfies Thus, An example of the data-dependent kernel is a covariance-preconditioned kernel whereΣ R is the sample covariance matrix of R and φ is some positive-definite function. Another example is the median-scaled kernel where σ R ,med is the sample median: median{ x i − x j 2 |1 < i < j < n }. In fact, radial basis kernels with dataindependent scaling also satisfy the required condition since k T # R (x, y ) = k R (x, y ) = k R (T −1 x, T −1 y ).

B.10 | Arvesen (1969)'s formula for the jackknife variance estimator
The formula in (Arvesen, 1969, Eq. 25) has minor errors. For completeness, we provide a proof for the decomposition (14). Proof For a combination in C n,s , we fix a order of integers (say, sorted in ascending order) and evaluate f . We may assume that the statistic is centered so that ¾[U n ] = 0.

C.1 | PPCA: type-I errors and test power
We provide results supplementary to the results in Section 4.3.1.

Problem 2 (alternative)
We present the result of the same experiment with α = 0.01 (Figure 8) to show power decay due to the conservatism. F I G U R E 8 Power curves of the MMD test of Bounliphone et al. (2016), the proposed LKSD test, and the KSD test with the exact score function in PPCA Problem 2. The perturbation parameters are set as (δ P , δ Q = 2, 1). Each result is computed on 300 trials. The significance level α = 0.01. Markers: 3 (the LKSD test); (the KSD test); ⃝ (the relative MMD test).
Next, we provide the result for the same power experiment with a different choice of perturbation parameters (δ P , δ Q ) = (3, 1). Figure 9 shows the power curves of the tests. The MMD test with the covariance pre-conditioner achieves power 1 at n = 100. F I G U R E 9 Power curves of the MMD test of Bounliphone et al. (2016), the proposed LKSD test, and the KSD test with the exact score function in PPCA Problem 2. The perturbation parameters are set as (δ P , δ Q = 3, 1). Each result is computed on 300 trials. The significance level α = 0.05. Markers: 3 (the LKSD test); (the KSD test); ⃝ (the relative MMD test).

Differing perturbation values
To demonstrate the effect of perturbation parameters, we also compare the two tests for differing configurations of (δ P , δ Q ). We use the IMQ kernel with median scaling as in Section 4.1.1 in the main text. We choose a perturbation parameter from a grid of size 10 between 10 −2 and 10 1/2 in the log 10 scale. Figure 10 demonstrates the test power for different settings of (δ P , δ Q ) with varying sample size n.

C.2 | Truncated Probabilistic Principal Component Analysis
As mentioned in Section 3.1, our test enables the comparison of models having likelihoods with unknown normalizing constants. To demonstrate this feature, we consider a truncated PPCA model. Following the notation in Section 4.1, where ϕ ρ is a continuously differentiable bump function satisfying ϕ ρ (x ) = 0 if x 2 ≥ ρ, and N (x ; µ, Σ) denotes the density of a Gaussian distribution with mean µ and covariance Σ. We may take smooth ϕ ρ such that for some (Lee, 2012, Lemma 2.22). We refer the reader to Appendix (B.1) for the KSD on bounded domains and the regularity condition required for the marginal p (x ).
Due to the truncation, the likelihood p (x |z ) is only available up to an intractable constant ∫ ϕ ρ (x ) N (x ; Az, ψ 2 I x )dx (a function of z ), making the latent posterior distribution doubly intractable. Nonetheless, there are MCMC methods that circumvent the double intractability (see Park and Haran, 2018, for a review); these methods typically apply if we can (approximately) sample from p (x |z ) for a given value of z . For this truncated PPCA model, we may generate samples from the likelihood using rejection sampling, allowing us to use techniques such as the auxiliary variable algorithm (Møller et al., 2006) or the exchange algorithm (Murray et al., 2006).
As in the previous section, we evaluate our test in terms of type-I error rates and test power; we use the same notation. We set D = 100 and D z = 10, and use the same parameters for R . We construct a problem by perturbing the (1,1)-entry of A. We compare the LKSD and MMD tests with IMQ kernels with median-and covariance preconditioning; these choices produced different outcomes in the previous section. For the truncation function ϕ ρ , we use ρ 0 = 45 and ρ = 50. We perform rejection sampling with proposal N (Az , ψ 2 I x ). For some values of z , we observed that the sampler never accepts the proposal. We therefore decide to generate 300 proposals; if no sample has been accepted, we normalize a proposal x such that x 2 = 0.99ρ 0 and use it as a sample. Finally, we use the exchange algorithm of Murray et al. (2006) for the LKSD test. We employ the same proposal as the MALA sampler with a step size of 10 −3 , and use m = 2, 000 samples after t = 500 burn-in steps. Although the inexact rejection sampling induces bias to the Monte Carlo estimate, we will see that this bias is negligible in terms of the size control.

Problem 1 (null)
As in the previous section, we create a null scenario by setting (δ P , δ Q ) = (1, 1 + 10 −5 ). We consider significance levels α = 0.01, 0.05. Table 6 reports the finite-sample size of the two tests for significance level α = 0.05. Again, the result for α = 0.01 is omitted as no tests rejected the null hypotheses. It can be seen that the size of both tests are controlled.
TA B L E 6 Type-I errors the MMD test of Bounliphone et al. (2016), the proposed LKSD test in Problem 1 for truncated PPCA models. Rejection rates are computed on 300 trials with significance level α = 0.05. The columns IMQ-med and IMQ-cov denote IMQ kernels with median-and covariance preconditioning, respectively.

Problem 2 (alternative)
We next compare the power of the two tests. We construct an alternative scenario using perturbation parameters δ P = 2 for P and δ Q = 1 for Q . Figure 11 shows the estimated power curves. We observe a trend similar to the PPCA experiment: the LKSD test with median preconditioning forms the MMD counterpart, whereas the converse relation holds for covariance preconditioning. The experiment shows that the LKSD test can also complement the MMD test for models without intractable likelihood functions. F I G U R E 1 1 Power curves of the MMD test and the proposed LKSD test in Problem 2 for truncated PPCA models. Dotted lines indicate tests with the IMQ kernel with the median preconditioning; dashed lines the covariance preconditioning. The perturbation parameters are set as (δ P , δ Q ) = (2, 1). Each result is computed on 300 trials. The significance level α = 0.05. TheMarkers: 3 (the LKSD test); ⃝ (the relative MMD test).

Differing perturbation values
As in the previous section, we compare the LKSD and MMD tests using differing configurations of (δ P , δ Q ). All the model configurations follow Section 4.3.1 in the main text, and we use the IMQ-BoW kernel. We choose a perturbation parameter from a grid of size 10 between 10 −2 and 1 in the log 10 scale. Figure 12 demonstrates the test power for different settings of (δ P , δ Q ) with varying sample size n. F I G U R E 1 2 Comparison between the MMD test of Bounliphone et al. (2016), the proposed LKSD test using the LDA model. The heat map represents the estimated test power. Each result is computed on 100 trials. The significance level α = 0.05.

LDA models with more sparse topics
We run the same experiment as in Section 4.3 except that the topics b are made sparse by sampling from the Dirichlet distribution with all the concentration parameters 0.1. The results are summarized in Tables 7 and 8. The LKSD test underperforms the MMD test in this case. As the topics are more sparse, generated documents tend to have words from a particular topic; this trend escalates as we increase the concentration parameter of the topic proportion prior.
Models thus observe compositions of words that they would not generate, resulting in a high-variance test statistic for the same reason as in Section 4.3.2 (In fact, the topics have probabilities as low as 10 −40 , which comes close to violating the assumption on the density).

C.3.2 | Kernel parameter
As in the PPCA experiment, we investigate the performance dependence on the kernel choice. Using LDA problem 2 , we examine how the test power is affected by the scaling parameter. We use the EQ and IMQ BoW kernels as above, and choose their scaling parameter λ 2 from {10 −6 , 10 −5 , . . . , 10 3 }. For each n ∈ {100, 300} we run 300 trials and estimate the test power of the LKSD and MMD tests. Figure 2 plots the power curves of the tests. We can see that the MMD test fails for any choice of the kernel. For the LKSD test, the IMQ kernel has a flat curve, indicating its independence from the bandwidth (at least in this candidate range), whereas the EQ kernel benefits from a small bandwidth value.  F I G U R E 1 3 Power curves of the proposed LKSD test and the MMD test in LDA Problem 2. The perturbation parameters are set as (δ P , δ Q = 2, 1). Each result is computed on 300 trials. The significance level α = 0.05. Markers: 3 (LKSD test with IMQ kernel); 0 (LKSD test with EQ kernel); ⃝ (MMD test with IMQ kernel); × (MMD test with EQ kernel).

C.4 | Experiment: close models and type-I errors
We investigate the behavior of the LKSD test when two models are close to each other. In this case, the difference of the U-statistic kernels defined by the models is small, which could therefore lead to the degeneracy of the U-statistic; i.e., the normal approximation of the test statistic is not appropriate. In the following, we investigate the three variants of the LKSD test defined by different choices of the variance estimator. We compare the jackknife estimator (12) with the following two estimators: (i) a U-statistic variance estimator where ζ 1 and ζ 2 in (11) are estimated by Ustatistics, and (ii) a V-statistic variance estimator where ζ 1 is estimated by a V-statistic. The U-statistic estimation was considered by Bounliphone et al. (2016) and Jitkrittum et al. (2018). The issue with the U-statistic estimator is that it underestimates the actual variance. In fact, we observed that the variance estimator sometimes returns negative values. This can occur since the statistic is given as a difference between unbiased estimates of quantities close to zero (this issue applies to the V-statistic estimator). We made the (arbitrary) choice to accept the null hypothesis when the variance estimate was negative to avoid false rejections. For this reason, we recommend against using the U-statistic estimator.

C.4.1 | PPCA
Our first experiment concerns PPCA models. Specifically, we choose a PPCA model for the data distribution as in Section 4.1 with D = 50. The difference is that we fix the perturbation parameter δ P for P at 1, and vary the parameter δ Q by choosing it from {10 −i : i ∈ {2, 3, . . . , 9} } (this choice yields null H 0 scenarios). We set the significance level α = 0.05. For each n ∈ {100, 200, 300}, we run the tests for 300 trials and examine the behavior of the tests under the null. Figure 14 shows the tests' rejection rates. We first note that as the perturbation parameter decays (the models get closer to each other), the test with the U-statistic estimator rejects more and has higher type-I errors than the nominal level α = 0.05. These plots demonstrate that the jackknife and V-statistic versions of the test are more robust in this setting. F I G U R E 1 4 The behaviors of the two LKSD tests under the null. The nominal level α is set to 0.05. The test with the U-statistic variance estimator has higher type-I errors as the models get closer to each other. Markers: 3 (LKSD test with the jackknife variance estimator); ⃝ (LKSD test with the U-statistic variance estimator); 0 (LKSD test with the V-statistic variance estimator).

C.4.2 | LDA
We conduct a similar experiment with LDA models. The problem setup is the same as in Section 4.3, except that the vocabulary size L = 100. We perturb the sparsity parameter of the Dirichlet prior of an LDA model. We set δ P = 1 and δ Q = 1 + δ, where δ is chosen from {10 −2i : i ∈ {1, . . . , 5} }. For each n ∈ {100, 200, 300}, we run the tests for 300 trials with significance level α = 0.05. Figure 15 shows the rejection rate of each test. The test with the V-statistic estimator is more conservative than the other tests. The U-statistic variance appears to underestimate the actual variance. F I G U R E 1 5 The behaviors of the two LKSD tests under the null. The nominal level α is set to 0.05. The test with the U-statistic variance estimator has higher type-I errors as the models get closer to each other. Markers: 3 (LKSD test with the jackknife variance estimator); ⃝ (LKSD test with the U-statistic variance estimator); 0 (LKSD test with the V-statistic variance estimator).

C.5 | Experiment: identical models
We look into the behaviors of the LKSD test when the models are identical. Our test procedure provides no guarantee in this case, as the asymptotic distribution would deviate from the normal distribution. As in the previous section, we compare the performance of the tests with the three proposed variance estimators. In the following, we fix the significance level α at 0.05. As in Sections 4.1, 4.3, we choose perturbation parameters for the candidate models, and run 300 trials for a differing sample size n ∈ {100, 200, 300}. For both PPCA and LDA models, we choose δ P = δ Q = 1. Figure 16 shows the plot for each problem. We note that the U-statistic test has higher type-I error rates in this setting, although they are closed to the design level. Notwithstanding that our test assumptions are violated, the jackknife and V-statistic approaches reject the null at a rate well below 0.05, and remain conservative in this example. F I G U R E 1 6 Plots of type-I errors when two models are identical. Markers: 3 (LKSD test with the jackknife variance estimator); ⃝ (LKSD test with the U-statistic variance estimator); 0 (LKSD test with the V-statistic variance estimator). The LKSD test with the U-statistic variance estimator has higher errors than the nominal level α = 0.05.