Testing for the Markov property in time series via deep conditional generative learning

Abstract The Markov property is widely imposed in analysis of time series data. Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, are of paramount importance. In this article, we propose a nonparametric test for the Markov property in high-dimensional time series via deep conditional generative learning. We also apply the test sequentially to determine the order of the Markov model. We show that the test controls the type-I error asymptotically, and has the power approaching one. Our proposal makes novel contributions in several ways. We utilise and extend state-of-the-art deep generative learning to estimate the conditional density functions, and establish a sharp upper bound on the approximation error of the estimators. We derive a doubly robust test statistic, which employs a nonparametric estimation but achieves a parametric convergence rate. We further adopt sample splitting and cross-fitting to minimise the conditions required to ensure the consistency of the test. We demonstrate the efficacy of the test through both simulations and the three data applications.


Introduction
The Markov property is fundamental and is commonly imposed in time series analysis.For instance, in economics and reinforcement learning, the Markov property is the foundation of the Markov decision process that provides a general framework for modelling sequential decision making.In finance and marketing, the Markov property is widely assumed in most continuous time modelling.See Chen and Hong (2012) for a review.Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, are of paramount importance in a broad range of applications.
Such a testing problem, however, is highly nontrivial and poses many challenges, especially for high-dimensional time series.For the Markov property test, Aït-Sahalia (1997) proposed a nonparametric test based on the Chapman-Kolmogorov equation and smoothing kernels.Chen and Hong (2012) tackled the testing problem based on the conditional characteristic function (CCF) estimated by local polynomial regressions (LPRs).However, kernel smoothers, including LPRs, suffer from a poor estimation accuracy in moderate to high-dimensional settings, leading to an inflated type-I error or a low power for the tests.For the order determination in nonparametric autoregression, Cheng and Tong (1992), Yao and Tong (1994) and Vieu (1995) developed some cross-validation based methods, and Auestad and Tjøstheim (1990) and Tschernig and Yang (2000) proposed a final prediction error based criterion.But none of those order determination methods are based on hypothesis testing, and they all assume the dimension of the time series is fixed.More recently, Shi et al. (2020) developed a quantile random forest algorithm and a doubly robust procedure to test the Markov assumption in the context of reinforcement learning.But their method, as we show later in Section 5, would fail to control the type-I error in the time series setting.
In this article, we propose a nonparametric testing procedure for the Markov property in highdimensional time series via deep conditional generative learning.The proposed test can be sequentially applied for order selection of the Markov model as well.Our proposal makes unique and useful contributions in several ways.
Particularly, we utilise some state-of-the-art deep conditional generative learning methods to address a classical yet challenging statistical inference problem in time series analysis.Deep conditional generative models include mixture density networks (MDNs) (Bishop, 1994), conditional generative adversarial networks (Mirza & Osindero, 2014), conditional variational autoencoders (Sohn et al., 2015), and normalising flow models (Kobyzev et al., 2020).They provide a powerful set of tools to flexibly learn conditional probability distributions, and have been used in numerous applications, such as computer vision, imaging processing, and artificial intelligence (Jo et al., 2021;Shu et al., 2017;Wang et al., 2018;Yan et al., 2016).Nevertheless, these tools are much less used and studied in the statistics literature.We employ this family of models to learn highly complex conditional distributions in a nonparametric fashion, and demonstrate their advantages over the more traditional kernel smoothers including LPRs, especially in a high-dimensional setting.
Meanwhile, it is far from a simple application of some ready-to-use deep learning tools, but instead it requires both crucial modification of the methods and careful characterisation of their theoretical properties.We build our testing procedure based upon MDNs (Bishop, 1994), combined with several crucial new components.First, we propose a new MDN architecture to model the conditional distribution of a multivariate response.Based on such an architecture, we learn two distributional generators, a forward generator and a backward generator, then properly integrate the two generators to construct the test statistic.Second, we derive the convergence rate of the MDN estimator in Theorem 3 , which is crucial to establish the consistency of our proposed test, but is not currently available in the MDN literature.In particular, we provide a sharp upper bound on the approximation error of MDN in Lemma 1 when the underlying conditional density function follows an infinite conditional Gaussian mixture model.We remark that, although it is possible to obtain a bound by directly applying Lemma 1 of Barron (1993), it would only yield a very loose bound; see Section 4.1 for more details.To our knowledge, we are among the first to systematically study the error bound of MDN, and our results are useful for the general theory of deep (generative) learning methods (see e.g.Chen et al., 2020;Farrell et al., 2021;Liang, 2021;Zhou, Jiao et al., 2022;Zhou, Su et al., 2022).Third, we show the proposed test controls the type-I error in Theorem 5, and has the power approaching one in Theorem 6.We show that our test statistic achieves a parametric convergence rate and a parametric power guarantee while its components are estimated nonparametrically.This is made possible because the way in which we combine the two distribution generators yields a doubly robust estimator of the test statistic (Tsiatis, 2007).Thanks to this double robustness, the bias of our test statistic estimator decays to zero faster than the rate of the individual nonparametric distribution generator.Finally, to avoid the requirement of certain metric entropy conditions for the distribution generator estimators (Chernozhukov et al., 2018, Equation (1.6)), we further employ the sample splitting and cross-fitting strategy (Romano & DiCiccio, 2019) to ensure the size control of the test.
The rest of the article is organised as follows.We formulate the hypotheses and propose a doubly robust test statistic in Section 2. We develop the corresponding test, as well as a forward sequential procedure for order determination in Section 3. We establish the theoretical guarantees in Section 4. We carry out simulations in Section 5, and illustrate with three real datasets in Section 6.We relegate all technical proofs to the Online Supplementary Material, Appendix.

Hypotheses
We first formulate the hypotheses of interest.Consider a strictly stationary d-dimensional time series, X t = (X t,1 , X t,2 , . . ., X t,d ) ⊤ , t ≥ 1.We target the following pair of hypotheses: where I t denotes the data history {X t , X t−1 , . . .}.The Markov property holds under H 0 .
Intuitively, this property requires the past and future values to be independent, conditionally on the present.To test H 0 , it suffices to test a sequence of conditional independences for any time t > 0 and any lag q ≥ 2, where ⊥ denotes the conditional independence.
We next characterise the conditional independence using the CCF.A similar result is given in Chen and Hong (2012, Equation (2.6)).For any vector μ ∈ R d of the same dimension as X t , define the CCF of X t+1 given X t as Theorem 1 The conditional independence (2) holds if and only if almost surely, for any t > 0, q ≥ 2, and μ, ν ∈ R d .
Computing (4) requires a suitable estimator  φ for φ * .Chen and Hong (2012) proposed to use the LPR to estimate φ * .However, the LPR tends to perform poorly when the dimension d of X t increases (Taylor & Einbeck, 2013), and the corresponding test would fail to be consistent.More recently, deep conditional generative learning models have demonstrated an exceptional capacity of estimating complex conditional distributions (e.g.Kobyzev et al., 2020;Sohn et al., 2015).These tools can be potentially employed to estimate P Xt|X t−1 , and subsequently the CCF φ * .However, naively plugging in a deep conditional generative learning estimator for φ * would induce a heavy bias in (4), which would fail to guarantee a tractable limiting distribution for the test statistic.
To address this issue, we propose to construct a doubly robust test statistic.Specifically, for any vector ν ∈ R d of the same dimension as X t , define the CCF of X t given X t+1 as We introduce a doubly robust estimating equation in the next theorem.
Theorem 2 Under H 0 , for any t ≥ 0, q ≥ 2, μ, ν ∈ R d , we have In addition, ( 5) is doubly robust, in that, for any CCFs φ and ψ, as long as either φ = φ * , or ψ = ψ * , we have that E{exp (iμ Motivated by ( 5), we propose the following test statistic: where  φ and  ψ denote some estimators of φ * and ψ * , respectively.This statistic, as suggested by Theorem 2, is doubly robust.A key advantage is that the bias of this test statistic can decay to zero at a faster rate than the convergence rate of the individual estimator  φ and  ψ.By contrast, the bias of the test statistic in (4) has the same order of magnitude as that of  φ; see Theorem 4. This double robustness property thus enables us to employ some highly flexible nonparametric estimators for φ * and ψ * .In the next section, we extend MDNs (Bishop, 1994) to estimate the CCFs, and develop the corresponding testing procedure.
3 Testing procedure

Mixture density networks
The MDN is a classical deep generative model that combines the Gaussian mixture model with deep neural networks (DNNs) (Bishop, 1994), and has shown promising performance in conditional density estimation (Koohababni et al., 2018;Rothfuss et al., 2019).In effect it integrates the universal approximation property of the Gaussian mixture model to approximate any smooth density function (Nguyen & McLachlan, 2019), with the capacity of DNNs to approximate both smooth and nonsmooth conditional mean and variance functions in high dimension.See Assumption 2(iii) for the class of smooth functions, and Imaizumi and Fukumizu (2019) for the class of nonsmooth functions that can be well approximated by DNNs.Next, we first introduce the standard MDN model, then propose a new MDN architecture to model the conditional distribution of a multivariate response.We aim to estimate an unknown conditional probability density function of some univariate response Y given a predictor vector X ∈ R d 0 with d 0 being the input dimension.Suppose the conditional density of Y given X follows a MDN model, where G is the number of mixture components, and DNNs are used to estimate the mean vector μ(x) = (μ 1 (x), . . ., μ G (x)) ⊤ , the standard deviation vector σ = (σ 1 (x), . . ., σ G (x)) ⊤ , and the weight vector α = (α 1 (x), . . ., α G (x)) ⊤ .Figure 1 depicts the structure of the model.The input layer is the d 0 -dimension vector x.Then, there are H hidden layers, each with a number of hidden units.
A hidden layer is between the input and output layers, which takes in a set of weighted inputs and produces an output through an activation function.The last hidden layer outputs a G-dimensional vector h (H) (x), and is connected to three parallel layers whose outputs are given by respectively, where Θ j is a G × G coefficient matrix that is to be trained via back propagation, j = 1, 2, 3. Next, two of those functions pass through activation functions, yielding respectively, where α(x), h α (x), μ(x), h μ (x), σ(x), and h σ (x) are all G-dimensional vectors, and the activation functions are applied in an element-wise fashion.Finally, all these components are combined to parametrise f (y|x) according to (7) with a total of W parameters.
Next, we propose a new MDN architecture to model the conditional density of a multivariate response variable Y ∈ R dy .The main idea is to factorise the joint conditional density function f (y|x) as the product of d y conditional densities, each with a univariate response, It then suffices to model each f j (y j |x, y 1 , . . ., y j−1 ) separately.When the individual component of Y is a continuous variable, we use the MDN model ( 7) to estimate the conditional density, whereas when it is a categorical variable, we use a supervised learning method, such as a random forest (Breiman, 2001), or a DNN (LeCun et al., 2015) to estimate the probability mass function.We briefly note that, Bishop (1994) also considered a version of MDN for the multivariate response, by extending (7) to a mixture of multivariate normal densities.However, such an extension does not work well when the components of the response have mixed type of continuous and categorical variables.We also comment that, most of the existing MDN literature study i.i.d.data.In our setting, the observed data are time-dependent.We later show that MDN is equally applicable, as long as the time series satisfies some mixing conditions such as β-mixing (Wu & Shao, 2004).Zhou et al.

Testing Markov property
Next, we develop a testing procedure for the hypotheses in (1), where the key idea is to build upon the doubly robust test statistic (6) and estimate the CCFs using MDN.Moreover, to avoid requiring the estimators of the CCFs to satisfy some restrictive metric entropy conditions, we employ the sample splitting and cross-fitting strategy.We first summarise our testing procedure in Algorithm 1, then discuss the main steps in detail.
In Step 1 of the algorithm, we divide the time series into L nonoverlapping chunks of similar sizes.For simplicity, suppose the length T of the observed time series is a multiple of L, and let ℓ − 1)n + 1, (ℓ − 1)n + 2,   ℓ+1) , to construct the test statistic.We then aggregate the estimates over all chunks to improve the estimation efficiency.
In Step 2, we employ MDN to estimate the CCFs.Specifically, for each subset ℓ = 1, . . ., L − 1, we first apply MDN to the data ̅ I (ℓ) up to the ℓth chunk to obtain the estimates of two conditional probability density functions, a forward generator  f (ℓ) Xt|X t−1 , and a backward generator  f (ℓ) X t−1 |Xt .For the forward generator, the 'predictor' for the MDN model ( 8) is (X 1 , X 2 , . . ., X ℓn−1 ) ⊤ and the 'response' is (X 2 , X 3 , . . ., X ℓn ) ⊤ , whereas for the backward generator, the 'predictor' for ( 8) is (X 2 , X 3 , . . ., X ℓn ) ⊤ and the 'response' is (X 1 , X 2 , . . ., X ℓn−1 ) ⊤ .Given the two estimated density , respectively.Next, we consider different combinations of (μ, ν) for the test statistic S(q, μ, ν) in ( 6).Toward that end, we randomly sample b=1 from a multivariate normal distribution with zero mean and identity covariance matrix.Finally, by noting that φ * (μ|x Input: Data {X t } t=1,...,T , the number of data chunks L, the number of pairs B, the largest number of lags Q, and the number of samples from the generators M.
Step 1: Divide the time series data into L nonoverlapping chunks, where n = T/L, Step 2: Deep conditional forward-backward generative learning.
Step 4: Compute the critical value.
Step 5: Stat Soc Series B: Statistical Methodology, 2023, Vol. 85, No. 4 pair of (μ b , ν b ) as Due to the use of both forward and backward generators and DNNs, we refer to this step as deep conditional forward-backward generative learning.
for a given q = 2, . . ., Q, and Q denotes the largest number of lags to consider in the test.We note that, for any given ℓ = 1, . . ., L − 1, the set of random variables {X ℓn+t } 1≤t≤n that appear in (10) are from the (ℓ + 1)th chunk of the data, and are, under H 0 , independent of  φ (ℓ) and  ψ (ℓ) given X ℓn+1 .This allows us to avoid imposing certain entropy growth condition that limits the growth rate of the VC dimension of the MDN model with respect to the sample size (Chernozhukov et al., 2018).
A similar cross-fitting procedure has also been utilised by Luedtke and Van Der Laan (2016) and Shi et al. (2022) for evaluation of an optimal policy, as well as by Luedtke and Van Der Laan (2018) and Shi et al. (2021) for high-dimensional statistical inference.Next, since  S(q, μ b , ν b ) is complex-valued, we use  S R (q, μ b , ν b ) and  S I (q, μ b , ν b ) to denote its real and imaginary part, respectively.We construct our final test statistic as In ( 11), we take the maximum absolute value over multiple combinations of (q, μ b , ν b ) to construct the test statistic, while we generate μ b and ν b from a Gaussian or uniform distribution.This way, we do not have to impose a bounded support for (μ b , ν b ), and avoid grid search that can be computationally intensive in a high-dimensional setting.
In Step 4, we compute the critical value of the test statistic  S. A key observation is that, under H 0 , each  S R (q, μ b , ν b ) and  S I (q, μ b , ν b ) corresponds to a sum of martingale difference sequences.Since the sum of martingale difference is a martingale (Hamilton, 2020), it follows from the highdimensional martingale central limit theorem that  S converges in distribution to a maximum of some Gaussian random variables.This allows us to employ the high-dimensional multiplier bootstrap method of Belloni and Oliveira (2018) to estimate the critical value.Specifically, we stack  S R (q, μ b , ν b ) and  S I (q, μ b , ν b ) for a given q and all b = 1, . . ., B together to form a 2B-dimensional vector, and estimate the covariance matrix of this vector by where λ R,ℓ,q,t , λ I,ℓ,q,t , ℓ = 1, . . ., L − 1, t = 1, . . ., n − q + 1, are both B-dimensional vectors, whose bth element is, respectively, the real and imaginary part of We then compute the critical value  c α by simulating the upper (α/2)th critical value of max q∈{2,...,Q} using Monte Carlo, where Z 0 , . . ., Z Q are i.i.d.2B-dimensional standard normal vectors. In Step 5, we reject H 0 , if  S > c α , under a given significance level α > 0.
We make a few remarks.First, in terms of the computational cost, step 2(a) is the most intensive step in Algorithm 1, as it involves fitting multiple MDN models.Second, there are a number of hyper-parameters in our test, including the number of mixture components G, the number of data chunks L, the number of pairs B of (μ, ν), the number of samples M from the forward and backward generators, and the largest number of lags Q considered in the test.We proposed to choose G using cross-validation, and take the rest as the input parameters.We further discuss their theoretical choices in Section 4, and their empirical choices in Section 5.

Determining Markov order
The proposed test can be used to determine the order of the Markov model.Specifically, let X (k)  t = (X ⊤ t , . . ., X ⊤ t+k−1 ) ⊤ denote the multivariate time series that concatenates the most recent k observations at each time point.Suppose the data follows a Kth order Markov model.Then the null hypothesis H 0 holds for the concatenated time series X (k)  t for any k ≥ K, but does not hold for any k < K.This suggests we can sequentially test the Markov property on the concatenated time series X (k)  t for k = 1, 2, . ... We set the estimated order to be the first integer k by which we fail to reject H 0 .We also briefly remark that K is different from Q.The former denotes the largest possible order of the underlying Markov model, whereas the latter denotes the largest number of lags considered in our test for a series of conditional dependences.

Convergence rate of MDN
We first establish the error bound of the MDN estimator, then establish the consistency of the proposed test.We begin with some regularity conditions, and argue they are relatively mild and reasonable.
Let f * X t+1 |Xt ( • |x) and f * Xt|X t+1 ( • |x) denote the true conditional density function of X t+1 given X t = x, and that of X t given X t+1 = x, respectively.A key observation is that where f belongs to a Sobolev ball with the smoothness γ ∈ N + : {f : max ν,‖ν‖ 1 ≤γ sup x |D ν f (x)| < +∞}.Given the data ̅ I (ℓ) up to chunk ℓ, the estimated density functions are based on the maximum likelihood.In the following, we focus on establishing the statistical prop- Xt|X t+1 can be derived in similar manner.
Assumption 1 Suppose the following conditions hold for the time series X t .
(i) Let X t be stationary, and its β-mixing coefficient satisfy the that β(t) ≤ c 1 exp (−c 2 t) for some constants c 1 , c 2 > 0. (ii) Let X denote the support of X t , and X be a compact subset of R d .
Assumption 1(i) requires the β-mixing coefficient to decay exponentially with respect to t.Under the Markov property, it is equivalent to the geometric ergodicity condition (Bradley, 2005).Such a condition is commonly imposed in the time series literature (see, e.g.Cline & Pu, 1999;Liebscher, 2005;Wu & Shao, 2004).We also note that the β-mixing condition is not limited to a Markov process.For instance, Neumann (2011) considered a class of observation-driven Poisson count process, which is β-mixing but non-Markovian.
Assumption 2 Suppose the following conditions hold for the true density function f * X t+1 |Xt .
(i) Suppose f * X t+1 |Xt (y|x) can be well approximated by a conditional Gaussian mixture model with G components, in that, there exists J R Stat Soc Series B: Statistical Methodology, 2023, Vol. 85, No. 4 some constant ω 1 > 0, such that where the big-O term is uniform in x and y. (ii) Suppose {μ * g } G g=1 and {σ * g } G g=1 are uniformly bounded away from infinity, and there exist a constant C 0 > 0, ω 2 ≥ 0, such that σ * g (x) ≥ C 0 G −ω 2 for any g and x.
Assumption 2(i) requires the true conditional density function f * X t+1 |Xt can be well approximated by a conditional Gaussian mixture model, with a sufficiently large number of components G.This is reasonable, since the Gaussian mixture model can approximate any smooth density function, and the conditional Gaussian mixture model can approximate any smooth conditional density function (Dalal & Hall, 1983).Assumption 2(ii) to (iv) impose certain boundedness and smoothness conditions on the mean, variance, and weight functions used in the approximation of f * X t+1 |Xt , as well as on f * X t+1 |X t itself.All these conditions are reasonably mild and hold under numerous settings.We consider three examples to further illustrate.
Example 1 Suppose the true conditional density function f * X t+1 |Xt follows a finite conditional Gaussian mixture model with bounded and smooth mean, variance, and weight functions.Then Assumption 2 trivially holds.
where g denotes a certain conditional density function, and ϕ σ denotes the probability density function of a Gaussian random variable with mean zero and variance σ 2 .Then under some mild conditions on g, the next lemma show that Assumption 2 holds.
Lemma 1 Suppose ( 14) holds, with a conditional density function g bounded away from infinity.Suppose the support of g( and c 4 is a positive constant independent of x and y. According to Lemma 1, the mean {μ * g (x)} G g=1 and variance {σ * g (x)} G g=1 are constant functions of x, which are equal to 2C 1 (g − 1)/K − C 1 and σ.Then Assumption 2(i) holds with ω 1 = 1, and Assumption 2(ii) holds with ω 2 = 0.When g lies in the Sobolev ball with the smoothness parameter γ, so are the weight functions {α * g } G g=1 , and Assumption 2(iii) holds.Assumption 2(iv) holds as g is bounded away from zero.Besides, the approximation error rate obtained in Lemma 1 is O(G −1 ) in L ∞ norm, which is shaper than the O(G −1/2 ) rate in L 2 norm obtained in Barron (1993, Lemma 1), as we focus on the Gaussian mixture and one-dimensional case.
Example 3 Suppose f * X t+1 |Xt satisfies Assumption 2(iv), and is Lipschitz continuous, i.e. |f * (y 1 |x) − f * (y 2 |x)| = O(|y 1 − y 2 |) where the big-O-term is uniform in x.It follows from Nguyen and McLachlan (2019, Theorem 9) that f * can be well approximated by an infinite conditional Gaussian mixture model specified in ( 14) with g = f * , with the approximation error O(σ).In addition, similar to Lemma 1, we can show that this infinite conditional Gaussian mixture model can be approximated by the finite conditional Gaussian mixture model, with the approximation error O(σ −1 G −1 ).By setting σ = G −1/2 , Assumption 2(i) holds with ω 1 = 1/2.The mean and variance are both constant functions of x, and the variance is lower bounded by G −1/2 .Assumption 2(ii) thus holds with ω 2 = 1/2.When f * lies in the Sobolev ball with the smoothness parameter γ, so are the weight functions α * g , and Assumption 2(iii) holds.
Assumption 3 Suppose the following conditions hold for the MDN model.
(i) Suppose the MDN function class is given by, for some sufficiently large constant C 2 , where α g , μ g and σ g are parametrised via DNNs.(ii) The total number of parameters W in the MDN model is proportional to G (d+γ)/γ T d/(2γ+d) log (GT), where γ is the smoothness parameter specified in Assumption 2(iii).
Assumption 3(i) is mainly to simplify the technical proof, since the estimated functions are bounded when both the model parameters and the data support are bounded.It is easy to enforce Assumption 3(i) in practice, by imposing range constraints on the model parameters.Assumption 3(ii) specifies the total number of parameters W, which represents a trade-off.On one hand, since we model {α g } G g=1 , {μ g } G g=1 and {σ g } G g=1 via DNNs, their approximation errors decay as W increases.On the other hand, the estimation error of MDN increases with W. We require W to be proportional to G (d+γ)/γ T d/(2γ+d) log (GT) to balance the bias-variance trade-off, and optimise the convergence rate of the MDN estimator.See the proof of Theorem 3 in the Online Supplementary Material, Appendix, for more details.
Next, we establish the error bound of the MDN estimator  f (ℓ) X t+1 |Xt .The bound of  f (ℓ) Xt|X t+1 is the same and can be derived similarly.
We remark that the first term of the error bound in ( 15) is due to the approximation error of the conditional Gaussian mixture model, while the second term is due to the approximation error of the DNNs and the estimation error of the MDN estimator.In general, the error bound increases with d and ω 2 , and decreases with γ and ω 1 .We next revisit Examples 1 to 3, and discuss the corresponding rate of convergence.
Example 1 revisited.In this example, the finite conditional Gaussian mixture model holds.As a result, G is finite and ω 1 can be chosen arbitrarily large.The error bound is then of the same order of magnitude as dT −γ/{2γ+d} log 3 (T).If the mean, variance, and weight functions are infinitely differentiable, i.e. γ = +∞, then the MDN estimator achieves a convergence rate of dT −1/2 up to some logarithmic term.
Finally, we remark on the problem of determining the order of a Markov model.In this case, we are interested in estimating the conditional density function of X t+K given X (K)  t and X t−1 given X (K) t .Similar to Theorem 3, we can show that the corresponding error bound is of the same order of magnitude as We note that this upper bound depends on the order K only through the exponents of G and T.

Consistency of the proposed test
Given the error bound of the MDN estimator, we now establish the consistency, i.e. the size and power properties of our proposed test.We first show the bias of  S(q, μ, ν) converges at a faster rate than the forward and backward generators.
Theorem 4 Suppose Assumption 4 holds.Then under the null hypothesis H 0 , sup q,μ,ν where f max = sup x max 1≤t≤T f Xt (x), and f Xt denotes the marginal density function of X t .
We note that, when the marginal density functions are uniformly bounded, Theorem 4 formally verifies the faster convergence rate of the bias of  S(q, μ, ν).
Next, we establish the size property of the proposed test.
Assumption 5 Suppose the following conditions hold.
Assumption 5(i) requires the convergence rates of  f (ℓ) X t+1 |Xt and  f (ℓ) Xt|X t+1 to be o(T −1/4 ), which allows us to derive the size property of the test based upon Theorem 3.This condition is reasonable.For instance, when the time series dimension d is fixed, this corresponds to requiring that γ > d/2 for Example 1, and γ > 2.69d for Example 2. Meanwhile, we may also relax this condition, by using the theory of higher-order influence functions (Robins et al., 2017).Assumption 5(ii) is a technical condition to help simplify the theoretical analysis.Essentially, it is used to guarantee that the diagonal elements of the asymptotic covariance matrix are bounded away from zero.When the fitted MDN is consistent, it follows that the diagonal elements of the estimated covariance matrix are bounded away from zero as well, with probability tending to 1.This allows us to apply Theorem 1 of Chernozhukov et al. (2017) to establish the size property.This condition automatically holds when the conditional density functions f * X t+1 |X t , f * X t |X t+1 , ‖μ b ‖ 2 s and ‖ν b ‖ 2 s are uniformly bounded away from zero.Meanwhile, if we truncate the diagonal elements of the estimated covariance matrix from below by some small positive constant, then this condition is not needed, and the subsequent test remains valid to control the type-I error.Finally, Assumption 5(iii) and (iv) impose some requirements on the parameters M, Q and B. In particular, B is allowed to diverge with T. Therefore, the classical weak convergence theorem is not applicable to show the asymptotic equivalence between the distribution of the test statistic and that of the bootstrap samples given the data.To overcome this issue, we employ the high-dimensional martingale central limit theorem recently developed by Belloni and Oliveira (2018).
Theorem 5 Suppose Assumptions 1 and 5 hold.Then, as under the null hypothesis.
Next, we establish the power property of the proposed test.
Assumption 6 Suppose the following conditions hold.
(i) Suppose sup q,μ,ν S 0 (q, μ, ν) ≫ T −1/2 log 1/2 (T), where S 0 (q, μ, ν Assumption 6(i) measures the degree to which the alternative hypothesis deviates from the null.This is because, for q = 1, . . ., Q, the quantity sup measures the weak conditional dependence between X t+q and X t given X t+q−1 and X t+1 (Daudin, 1980).Here, the supremum is taken with respect to the class of all squared integrable functions of X, i.e.L 2 (X).According to the Weierstrass approximation theorem, the class of trigonometric polynomials are dense in L 2 (X).As such, ( 16) is equal to zero if and only if sup μ,ν S 0 (q, μ, ν) = 0. Therefore, sup q,μ,ν S 0 (q, μ, ν) measures the degree to which the alternative hypothesis deviates from the null, and we require it to be lower bounded.Assumption 6(ii) is mild as B is user-specified.
We remark that our proposed test is built on weak conditional independence, and thus is not consistent against all alternatives.There are cases when (16) equals zero but (2) does not hold, since weak conditional independence does not fully characterise conditional independence.In those cases, our test becomes powerless.A possible remedy is to consider an alternative doubly robust test statistics based on The above expectation equals zero for any q ≥ 2, μ, ν ∈ R d , and ω q ∈ R d(q−1) , and the resulting supremum type test is consistent against all alternative hypotheses.However, it is computationally more expensive, since a large number of Monte Carlo samples {(μ b , ν b , ω q,b )} b are needed to approximate the supremum over the space of R × R × R d(q−1) when q is large.In addition, our numerical analysis finds this test less powerful compared to our proposed test.This agrees with the observation in the literature that, even though the test based on weak conditional dependence is not consistent against all alternatives, it may benefit from a simple procedure, and thus a better power property (Li & Fan, 2020).
We also note that Theorems 5 and 6 have suggested some theoretical choices of the parameters L, B, M, Q.In practice, we recommend to set L fixed, and set M to be proportional to the sample size.Besides, we choose a large value for Q that is proportional to T, and also choose a large B. We discuss their empirical choices in the next section.

Simulations
We study the empirical performance of our proposed test through simulations.We consider three different Markov time series models, each with order K = 3, dimension d = 3, and varying length T = {500, 1000, 1500, 2000}.We apply the proposed sequential testing procedure for k = 1, 2, . . ., 5, and report the percentage of times out of 500 data replications when the null hypothesis is rejected.When k < K, this percentage reflects the empirical power of the test, and when k ≥ K, it shows the empirical size.
We consider a linear type VAR model, a nonlinear type threshold model, and a nonlinear type GARCH model, all of which are commonly used in the time series literature (e.g.Auestad & Tjøstheim, 1990;Cheng & Tong, 1992;Tschernig & Yang, 2000).
Model 2: Threshold model
We apply the proposed test.For the hyper-parameters, we propose to select the number of mixture components G using cross-validation, as its choice is important to the empirical performance.When G is small, the fitted MDN model may suffer from a large bias, leading to an inflated type-I errors, whereas when G is large, the model may be overfitted, yielding a more variable test statistic.For the number of pairs B, a larger value of B generally improves the power of the test, but also increases the computational cost.We thus fix it at B = 1000 to achieve a trade-off between the power and the computational cost.For the rest of parameters, including the number of data chunks L, the number of pseudo samples M, and the largest number of lags Q, we conduct a sensitivity analysis in Section B.1 of the Online Supplementary Material, Appendix.We find that the proposed test is not overly sensitive to the choice of these parameters, as long as they are in a reasonable range.We thus set L = 3, M = 100, and Q = 10 in our numerical studies.For MDN, we fix the number of layers H = 1, and vary the number of nodes U per hidden layer to vary the total number of parameters, and correspondingly the overall complexity of MDN.We carry out another sensitivity analysis for U in Online Supplementary Material, Section B.1, and again find a similar performance of the test in a range of values of U, so we fix U = 20 for the first two models, and U = 40 for the last model, as the last one is more complex.We estimate the parameters of MDN through maximum likelihood, where the derivative of the likelihood function with respect to each parameter is derived and the back-propagation is employed.In our implementation, we employ the Adam algorithm (Kingma & Ba, 2015), and use Python and Tensorflow (Dillon et al., 2017).We publish our code on GitHub. 1e compare our proposed test with two baseline tests for the Markov property, including the test by Chen and Hong (2012), which used LPFs to estimate the CCFs, and a version of the random forest-based test by Shi et al. (2020), which was designed for reinforcement learning, and is modified and adapted to our setting.In addition, Chen and Hong (2012) suggested two methods to compute the p-value for their test.The first method estimates the asymptotic variance of the test and uses a normal approximation.The second method employs bootstrap.In our settings, we find that the bootstrap procedure is extremely slow for a large T. As such, we calculate the p-value based on the normal approximation.
Table 1 reports the empirical rejection rate of each test under the significance level α = 0.05, aggregated over 500 data replications.It can be seen that the proposed test effectively controls the type-I error when k ≥ 3, and is very powerful when k < 3. To the contrary, both the two baseline tests suffer from inflated type-I errors for large T. For instance, when T ≥ 1000, the type-I error of the test of Chen and Hong (2012) exceeds 0.09 in all cases.This is probably due to that the LPR tends to suffer with a larger dimension in the multivariate setting (Taylor & Einbeck, 2013).The test of Shi et al. (2020) has considerably large type-I errors when applied to the multivariate ARCH model.This is likely due to the fact that their test was not designed for time series data.
Finally, we report the computation time of the proposed test.We ran all simulations on savio2 htc node of the UC Berkeley Computing Platform, with 12 CPUs and 128 GB RAM, and it took around 2 min on average for a single data replication.We also run an example on a regular laptop computer with a single CPU and 8 GB memory RAM, and it took around 20 min on average for one data replication.
The first dataset consists of the monthly temperature of seven cities in Eastern China from January 1954 to December 1998.To remove the seasonal trend, we subtract the average across the same month of the year.This ensures that the resulting time series is stationary.The resulting time series has dimension d = 7 and length T = 528.
The second dataset consists of the daily average PM2.5 concentration readings, in the logarithmic scale, at 74 monitoring stations in Beijing and nearby areas of China from January 1, 2015 to December 31, 2016.PM2.5 refers to the mix of solid and liquid particles whose diameters are smaller than 2.5 micrometers, and is a key measure of air quality and pollution.We again subtract the average across the same day of the year.The resulting times series has dimension d = 74 and length T = 731.
The third dataset consists of measurements, recorded every 5 min, involving blood glucose level, meal, exercise and insulin treatment from six patients with type-I diabetes over eight weeks.We divide each day into 1-h intervals, and compute the average blood glucose level, the carbohydrate estimate for the meal, the exercise intensity, and the amount of insulin received during the 1-h interval.For each patient, the resulting time series has dimension d = 4 and length T = 1100.R Stat Soc Series B: Statistical Methodology, 2023, Vol. 85, No. 4 We note that the third data example is different from the other two examples as well as the setting of our problem in several ways.First, for each d 0 -dimensional time series, there are N = 6 replications corresponding to six patients.Second, for the d 0 = 4 variables, it is of interest to test the Markov property for three of them, but not the insulin amount, because the amount of insulin is determined by the patients themselves.In addition, the insulin amount should be included in the conditioning set, because it directly affects the blood glucose level.Finally, for the carbohydrate estimate of the meal and the exercise intensity, a good portion of the measurements are zero, because no meal or exercise was taken in those time intervals.We modify the test in Algorithm 1 to accommodate these differences.Specifically, in Step 1, to tackle multiple replications, instead of splitting a single time series into multiple chunks, we now randomly split N replications into multiple chunks of similar sizes.In Step 2, to test the Markov property of a subset of variables of the multivariate time series, instead of estimating  f (ℓ)  Xt|X t−1 , we now estimate the forward generator  f (ℓ)   Xt|Xt−1 , where  X t only includes those variables to test about.Meanwhile, we still estimate the backward generator  f (ℓ) X t |X t−1 as before.Also in Step 2, to tackle the issue that some observed time series involve many zeros, we fit a logistic regression to estimate the conditional densities, while we still use MDN for other continuous time series.The rest of steps remain essentially the same as in Algorithm 1.
We apply the proposed test, as well as the two alternative tests of Chen and Hong (2012) and Shi et al. (2020), for k = 1, 2, . . ., 12 sequentially, to the three datasets.Table 2 reports the corresponding p-values.For both the temperature and PM2.5 datasets, our test suggests the Markov property holds.This result is consistent with the findings in the literature, as a simple vector autoregressive model of order 1 is sufficient to model these high-dimensional datasets (see, e.g.Chang et al., 2018).For the diabetes data, the test suggests the order of the Markov model is 4, which is consistent with the finding of Shi et al. (2020).By contrast, the test of Chen and Hong (2012) yields a large p-value when k = 2 then a very small p-value when k = 4 for the diabetes dataset.The test of Shi et al. (2020) tends to select a large value of k for both the temperature dataset and the PM2.5 dataset.

Figure 1 .
Figure 1.Structure of the MDN. 1208 . . ., ℓn} denote the indices of the ℓth chunk of the time series, and let ̅ I (ℓ) = ∪ ℓ j=1 I (j) denote the union of indices of the first ℓ chunks, ℓ = 1, . . ., L. Data splitting allows us to use part of the data, i.e. the data ̅ I (ℓ) up to chunk ℓ, to train the MDN model, and another part, i.e.I ( we obtain the Monte Carlo estimators of φ * (μ|x) and ψ * (ν|x) for each Algorithm 1 Testing procedure for the Markov property

1218
Zhou et al.Table2.The p-values of the sequential tests for k = 1, 2, . . ., 12 for the three datasets: the temperature data, the PM2.5 data, and the diabetes data, by the three methods: our proposed test (MDN),Shi et al. (2020)'s method (RF), and Chen and Hong (2012)'s method(LPF)

Table 1 .
Shi et al. (2020)es out of 500 data replications when the null hypothesis is rejected under the significance level α = 0.05 The true order of the Markov model is K = 3 in all examples.Three methods are compared: our proposed test (MDN),Shi et al. (2020)'s method (RF), and Chen and Hong (2012)'s method (LPF). Note: