Conditional independence testing in Hilbert spaces with applications to functional data analysis

We study the problem of testing the null hypothesis that X and Y are conditionally independent given Z, where each of X, Y and Z may be functional random variables. This generalises testing the significance of X in a regression model of scalar response Y on functional regressors X and Z. We show, however, that even in the idealised setting where additionally (X, Y, Z) has a Gaussian distribution, the power of any test cannot exceed its size. Further modelling assumptions are needed and we argue that a convenient way of specifying these assumptions is based on choosing methods for regressing each of X and Y on Z. We propose a test statistic involving inner products of the resulting residuals that is simple to compute and calibrate: type I error is controlled uniformly when the in‐sample prediction errors are sufficiently small. We show this requirement is met by ridge regression in functional linear model settings without requiring any eigen‐spacing conditions or lower bounds on the eigenvalues of the covariance of the functional regressor. We apply our test in constructing confidence intervals for truncation points in truncated functional linear models and testing for edges in a functional graphical model for EEG data.


K E Y W O R D S
functional graphical model, function-on-function regression, significance testing, truncated functional linear model, uniform type I error control

INTRODUCTION
In a variety of application areas, such as meteorology, neuroscience, linguistics and chemometrics, we observe samples containing random functions (Ramsay & Silverman, 2005; Ullah & Finch, 2013). The field of functional data analysis (FDA) has a rich toolbox of methods for the study of such data. For instance, there are a number of regression methods for different functional data types, including linear function-on-scalar (Reiss et al., 2010), scalar-on-function (Delaigle & Hall, 2012; Goldsmith et al., 2011; Hall & Horowitz, 2007; Reiss & Ogden, 2007; Shin, 2009; Yuan & Cai, 2010) and function-on-function (Ivanescu et al., 2015; Scheipl et al., 2015) regression; there are also nonlinear and nonparametric variants (Fan et al., 2015; Ferraty & Vieu, 2006; Ferraty et al., 2011; Yao & Müller, 2010) and versions able to handle potentially large numbers of functional predictors (Fan et al., 2015), to give a few examples; see Wang et al. (2016) and Morris (2015) for helpful reviews and a more extensive list of relevant references. The availability of software packages for functional regression methods, such as the R packages refund (Goldsmith et al., 2020) and FDboost (Brockhaus et al., 2020), allows practitioners to easily adopt the FDA framework for their particular data.
One area of FDA that has received less attention is that of conditional independence testing. Given random elements X, Y, Z, the conditional independence X ⫫ Y | Z formalises the idea that X contains no further information about Y beyond that already contained in Z. A precise definition is given in Section 1.2. Inferring conditional independence from observed data is of central importance in causal inference (Pearl, 2009; Peters et al., 2017; Spirtes et al., 2000), graphical modelling (Koller & Friedman, 2009; Lauritzen, 1996) and variable selection. For example, consider the linear scalar-on-function regression model

Y = ∫_0^1 β_X(t) X(t) dt + ∫_0^1 β_Z(t) Z(t) dt + ε,

where X, Z are random covariate functions taking values in L²([0, 1], ℝ), β_X, β_Z are unknown parameter functions, Y ∈ ℝ is a scalar response and ε ∈ ℝ satisfying ε ⫫ (X, Z) represents stochastic error. In this model, conditional independence X ⫫ Y | Z is equivalent to β_X = 0, that is, to whether the functional predictor X is significant.
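To make the discretisation concrete, the following minimal sketch simulates a discretised version of a scalar-on-function linear model on a regular grid, approximating the L² integrals by Riemann sums. The curve-generating mechanism, the choice β_X = 0 (so that X ⫫ Y | Z holds) and all constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 100                      # sample size and grid resolution (illustrative)
t = np.linspace(0, 1, d)             # evaluation grid on [0, 1]

def random_curves(n_curves):
    """Smooth random curves: a few Fourier modes with decaying random coefficients."""
    out = np.zeros((n_curves, d))
    for k in range(1, 6):
        out += rng.normal(size=(n_curves, 1)) / k * np.sin(k * np.pi * t)
    return out

X, Z = random_curves(n), random_curves(n)
beta_X = np.zeros(d)                 # beta_X = 0, so X is not significant here
beta_Z = np.exp(-t)                  # an arbitrary illustrative parameter function

# Riemann-sum approximations of the integrals defining the model; the noise
# is drawn independently of (X, Z), matching the assumption eps ⫫ (X, Z).
Y = (X @ beta_X + Z @ beta_Z) / d + 0.1 * rng.normal(size=n)
```

Under this data-generating mechanism, a valid test of X ⫫ Y | Z should accept at rate 1 − α.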
For nonlinear regression models, the conditional independence X ⫫ Y | Z still characterises whether X is useful for predicting Y given Z. Indeed, consider a more general setting where Y is a potentially infinite-dimensional response, and X_1, …, X_p are predictors, some or all of which may be functional. Then a set of predictors S ⊆ {1, …, p} that contains all useful information for predicting Y, that is, such that Y ⫫ {X_j}_{j∉S} | {X_j}_{j∈S}, is known as a Markov blanket of Y. One existing approach to assessing the significance of a functional predictor is based on the significance tests for generalised additive models developed in Wood (2013). These tests, while being computationally efficient, do not however come with formal uniform level control guarantees.

1.1 Our main contributions and organisation of the paper

1. It is impossible to test conditional independence with Gaussian functional data. In Section 2 we present our formal hardness result on conditional independence testing for Gaussian functional data. The proof rests on a new result on the maximum power attainable at any alternative when testing for conditional independence with multivariate Gaussian data. The full technical details are given in Section A of the supplementary material. As we cannot hope to have level control uniformly over the entirety of the null of conditional independence, it is important to establish, for any given test, subsets 𝒫̃_0 of null distributions 𝒫_0 over which we do have uniform level control.

2. We provide new tools allowing for the development of uniform results in FDA. Uniform results are scarce in functional data analysis; we develop the tools for deriving such results in Section B of the supplementary material, which studies uniform convergence of Hilbertian and Banachian random variables.
3. Given sufficiently good methods for regressing each of X and Y on Z, the GHCM can test conditional independence with certain uniform level guarantees. In Section 3 we describe our new GHCM testing framework for testing X ⫫ Y | Z, where each of X, Y and Z may be collections of functional and scalar variables. In Section 4 we show that for the GHCM, an effective null hypothesis 𝒫̃_0 may be characterised as one where, in addition to some tightness and moment conditions, the conditional expectations E(X | Z) and E(Y | Z) can be estimated at sufficiently fast rates, such that the product of the corresponding in-sample mean squared prediction errors (MSPEs) decays faster than 1/n uniformly, where n is the sample size. Note that this does not contradict the hardness result: it is well known that there do not exist regression methods with risk converging to zero uniformly over all distributions for the data (Györfi et al., 2002, Theorem 3.1). Thus, the regression methods must be chosen appropriately in order for the GHCM to perform well. In Section 4.3 we show that a version of the GHCM incorporating sample splitting has uniform power against alternatives where the expected conditional covariance operator E{Cov(X, Y | Z)} has Hilbert-Schmidt norm of order n^{-1/2}, and is thus rate optimal.
4. The regression methods are only required to perform well on the observed data. The fact that control of the type I error of the GHCM depends on an in-sample MSPE, rather than a more conventional out-of-sample MSPE, has important consequences. While in-sample and out-of-sample errors may be considered rather similar, in the context of functional regression they are substantially different. We demonstrate in Section 4.4 that bounds on the former are achievable under significantly weaker conditions than equivalent bounds on the latter, by considering ridge regression in the functional linear model. In particular, the required prediction error rates are satisfied over classes of functional linear models where the eigenvalues of the covariance operator of the functional regressor are dominated by a summable sequence; no additional eigen-spacing conditions or lower bounds on the decay of the eigenvalues are needed, in contrast to existing results on out-of-sample error rates (Cai & Hall, 2006; Crambes & Mas, 2013; Hall & Horowitz, 2007).
5. The GHCM has several uses. Section 5 presents the results of numerical experiments on the GHCM. We study the following use cases. (i) Testing for significance of functional predictors in functional regression models. We are not aware of other approaches that provide significance statements in functional regression models and come with statistical guarantees. For example, in comparison to the p-values from pfr, which are highly anti-conservative in challenging setups, the type I error of the GHCM test is well controlled (see Figure 1). (ii) Deriving confidence intervals for truncation points in truncated functional linear models. We demonstrate in Section 5.2 the use of the GHCM in the construction of a confidence interval for the truncation point in a truncated functional linear model, a problem which we show may be framed as one of testing certain conditional independencies. (iii) Testing for edge presence in functional graphical models. In Section 5.3, we use the GHCM to learn functional graphical models for EEG data from a study on alcoholism.
We conclude with a discussion in Section 6 outlining potential follow-on work and open problems.The supplementary material contains the proofs of all results presented in the main text and some additional numerical experiments, as well as the uniform convergence results mentioned above.An R-package ghcm (Lundborg et al., 2022) implementing the methodology is available on CRAN.

Preliminaries and notation
For three random elements X, Y and Z defined on the same probability space (Ω, ℱ, P) with values in measurable spaces 𝒳, 𝒴 and 𝒵 respectively, we say that X is conditionally independent of Y given Z, and write X ⫫ Y | Z, if

E{f(X) g(Y) | Z} = E{f(X) | Z} E{g(Y) | Z} almost surely

for all bounded and Borel measurable f : 𝒳 → ℝ and g : 𝒴 → ℝ. Several equivalent definitions are given in Constantinou and Dawid (2017, Proposition 2.3). As with Euclidean variables, the interpretation of X ⫫ Y | Z is that 'knowing Z renders X irrelevant for predicting Y' (Lauritzen, 1996).
Throughout the paper we consider families of probability distributions 𝒫 of the triplet (X, Y, Z), which we partition into the null hypothesis 𝒫_0 of those P ∈ 𝒫 satisfying X ⫫ Y | Z, and the set of alternatives 𝒬 := 𝒫 ∖ 𝒫_0 where the conditional independence relation is violated. We consider data (x_i, y_i, z_i), i = 1, …, n, consisting of i.i.d. copies of (X, Y, Z), and write X^(n) := (x_i)_{i=1}^n and similarly for Y^(n) and Z^(n). We apply to these data a test ψ_n : (𝒳 × 𝒴 × 𝒵)^n → {0, 1}, with a value of 1 indicating rejection. We will at times write E_P(⋅) for expectations of random elements whose distribution is determined by P, and similarly P_P(⋅) = E_P(1_{⋅}). Thus, the size of the test ψ_n may be written as sup_{P ∈ 𝒫_0} P_P(ψ_n = 1).
We always take 𝒳 = ℋ_X and 𝒴 = ℋ_Y for separable Hilbert spaces ℋ_X and ℋ_Y, and write d_X and d_Y for their dimensions, which may be ∞. When these are finite-dimensional, as will typically be the case in practice, X^(n) will be an n × d_X matrix and similarly for Y^(n). Similarly, we will take 𝒵 = ℝ^{d_Z} in the finite-dimensional case and then Z^(n) ∈ ℝ^{n × d_Z}. However, in order for our theoretical results to be relevant for settings where d_X and d_Y may be arbitrarily large compared to n, our theory must also accommodate infinite-dimensional settings, for which we introduce the following notation.
For g and h in a Hilbert space ℋ, we write ⟨g, h⟩ for the inner product of g and h and ‖g‖ for its norm; note we suppress the dependence of the norm and inner product on the Hilbert space. The bounded linear operator on ℋ given by x ↦ ⟨x, g⟩h is the outer product of g and h and is denoted by g ⊗ h. A bounded linear operator A on ℋ is compact if it has a singular value decomposition, that is, there exist two orthonormal bases (e_{1,k})_{k∈ℕ} and (e_{2,k})_{k∈ℕ} of ℋ and a non-increasing sequence (σ_k)_{k∈ℕ} of singular values such that

A h = Σ_{k=1}^∞ σ_k ⟨h, e_{1,k}⟩ e_{2,k}

for all h ∈ ℋ. For a compact linear operator A as above, we denote by ‖A‖_op, ‖A‖_HS and ‖A‖_TR the operator norm, Hilbert-Schmidt norm and trace norm, respectively, of A, which equal the ℓ_∞, ℓ_2 and ℓ_1 norms, respectively, of the sequence of singular values (σ_k)_{k∈ℕ}.
A random variable on a separable Banach space ℬ is a mapping X : Ω → ℬ defined on a probability space (Ω, ℱ, P) which is measurable with respect to the Borel σ-algebra B(ℬ) on ℬ. Integrals with values in Hilbert or Banach spaces, including expectations, are Bochner integrals throughout. For a random variable X on a Hilbert space ℋ with E‖X‖² < ∞, we define the covariance operator of X by

Cov(X) := E{(X − EX) ⊗ (X − EX)}.

For another random variable Y with E‖Y‖² < ∞, we define the cross-covariance operator of X and Y by

Cov(X, Y) := E{(X − EX) ⊗ (Y − EY)}.

We define conditional variants of the covariance operator and cross-covariance operator by replacing expectations with conditional expectations given a σ-algebra or random variable.
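These operators and norms are easy to experiment with numerically by working with finite-dimensional proxies for the Hilbert-space elements. The following sketch (with arbitrary dimensions and simulated data, purely for illustration) checks that for a rank-one operator g ⊗ h all three norms coincide with ‖g‖‖h‖, and forms empirical (cross-)covariance operators.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 500                           # truncation level and sample size (illustrative)

# The outer product g ⊗ h maps x to <x, g> h; as a matrix this is h g^T.
g, h = rng.normal(size=d), rng.normal(size=d)
outer = np.outer(h, g)

s = np.linalg.svd(outer, compute_uv=False)        # singular values of the operator
op_norm, hs_norm, tr_norm = s.max(), np.sqrt((s ** 2).sum()), s.sum()
# A rank-one operator has a single non-zero singular value ||g|| ||h||, so the
# operator, Hilbert-Schmidt and trace norms all coincide in this special case.
assert np.allclose([op_norm, hs_norm, tr_norm],
                   np.linalg.norm(g) * np.linalg.norm(h))

# Empirical covariance and cross-covariance operators from centred samples.
Xs = rng.normal(size=(n, d))
Ys = Xs @ rng.normal(size=(d, d)) / d + rng.normal(size=(n, d))
Xc, Yc = Xs - Xs.mean(axis=0), Ys - Ys.mean(axis=0)
cov_X = Xc.T @ Xc / n                    # estimates E[(X - EX) ⊗ (X - EX)]
cross_cov = Yc.T @ Xc / n                # estimates Cov(X, Y), here as a d x d matrix
```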

THE HARDNESS OF CONDITIONAL INDEPENDENCE TESTING WITH GAUSSIAN FUNCTIONAL DATA
In this section we present a negative result on the possibility of testing for conditional independence with functional data in the idealised setting where all variables are Gaussian. We take 𝒫 to consist of distributions of (X, Y, Z) that are jointly Gaussian with injective covariance operator, where X and Z take values in separable Hilbert spaces ℋ_X and ℋ_Z respectively with ℋ_Z infinite-dimensional, and Y ∈ ℝ^{d_Y}. We note that in the case where d_Y = 1 and ℋ_X = ℋ_Z = L²([0, 1], ℝ), each P ∈ 𝒫 admits a representation as a Gaussian scalar-on-function linear model

Y = ∫_0^1 β_X(t) X(t) dt + ∫_0^1 β_Z(t) Z(t) dt + ε,   (1)

where Y is the scalar response, and the functional covariates X, Z and error ε are all jointly Gaussian with ε ⫫ (X, Z) (see Proposition 7 in the supplementary material); the settings with d_Y > 1 may be thought of equivalently as multi-response versions of this.
For each Q in the set of alternatives 𝒬, we further define 𝒫_0^Q ⊂ 𝒫_0 by 𝒫_0^Q := {P ∈ 𝒫_0 : the marginal distribution of Z under P and Q is the same}.
Theorem 1 below shows that not only is it fundamentally hard to test the null hypothesis 𝒫_0 against 𝒬 for all dataset sizes n, but restricting to the null 𝒫_0^Q for Q ∈ 𝒬 presents an equally hard problem.
Theorem 1. Given an alternative Q ∈ 𝒬 and n ∈ ℕ, let ψ_n be a test for the null hypothesis 𝒫_0^Q against Q. Then we have that the power is at most the size:

P_Q(ψ_n = 1) ≤ sup_{P ∈ 𝒫_0^Q} P_P(ψ_n = 1).

An interpretation of this statement in the context of the functional linear model is that regardless of the number of observations n, there is no non-trivial test for the significance of the functional predictor X, even if the marginal distribution of the additional infinite-dimensional predictor Z is known exactly. It is clear that the size of a test over 𝒫_0 is at least as large as that over the null 𝒫_0^Q, so testing the larger null is of course at least as hard. It is known that testing conditional independence in simple multivariate (finite-dimensional) settings is hard in the sense of Theorem 1 when the conditioning variable is continuous. In such settings, restricting the null to include only distributions with Lipschitz densities, for example, allows for the existence of tests with power against large classes of alternatives. The functional setting is, however, very different: simply removing pathological distributions from the entire null of conditional independence does not make the problem testable. Even with the parametric restriction of Gaussianity, the null is still too large for the existence of non-trivial hypothesis tests. Indeed, the starting point of our proof is a result due to Kraft (1955) stating that the hardness in the statement of Theorem 1 is equivalent to the n-fold product Q^⊗n lying in the convex closure, in total variation distance, of the set of n-fold products of distributions in 𝒫_0^Q.

A consequence of Theorem 1 is that we need to make strong modelling assumptions in order to test for conditional independence in the functional data setting. Given the plethora of regression methods for functional data, we argue that it can be convenient to frame these modelling assumptions in terms of regression models for each of X and Y on Z, or more generally, in terms of the performances of methods for these regressions. The remainder of this paper is devoted to developing a family of conditional independence tests whose validity rests primarily on the prediction errors of these regressions.

GHCM METHODOLOGY
In this section we present the Generalised Hilbertian Covariance Measure (GHCM) for testing conditional independence with functional data. To motivate the approach we take, it will be helpful to first review the construction of the Generalised Covariance Measure (GCM) developed in Shah and Peters (2020) for univariate X and Y, which we do in the next section. In Section 3.2 we then define the GHCM.

Motivation
Consider first therefore the case where X and Y are real-valued random variables, and Z is a random variable with values in some space 𝒵. We can always write

X = f(Z) + ε,  Y = g(Z) + ξ,

where f(z) := E(X | Z = z), g(z) := E(Y | Z = z) and the errors ε := X − f(Z) and ξ := Y − g(Z) satisfy E(ε | Z) = E(ξ | Z) = 0. The product ρ := εξ then has the property that E(ρ | Z) = Cov(X, Y | Z) = 0, and hence E(ρ) = 0, whenever X ⫫ Y | Z. The GCM forms an empirical version of E(ρ) given data (x_i, y_i, z_i)_{i=1}^n by first regressing each of X^(n) and Y^(n) onto Z^(n) to give estimates f̂ and ĝ of f and g respectively. Using the corresponding residuals ε̂_i := x_i − f̂(z_i) and ξ̂_i := y_i − ĝ(z_i), the product R_i := ε̂_i ξ̂_i is computed for each i = 1, …, n and then averaged to give R̄ := Σ_{i=1}^n R_i / n, an estimate of E(ρ). The standard deviation of R̄ under the null X ⫫ Y | Z may also be estimated, and it can be shown (Shah & Peters, 2020, Theorem 8) that under some conditions, R̄ divided by its estimated standard deviation converges uniformly to a standard Gaussian distribution.
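A minimal sketch of this construction follows, using ordinary least squares as a stand-in for the two regression methods (any sufficiently accurate regressions could be substituted); the data-generating step and all constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from math import erf, sqrt

def gcm_test(x, y, Z):
    """Sketch of the univariate GCM: regress x and y on Z (here by OLS),
    average the products of the residuals, and studentise."""
    n = len(x)
    Zd = np.column_stack([np.ones(n), Z])                   # design with intercept
    eps = x - Zd @ np.linalg.lstsq(Zd, x, rcond=None)[0]    # residuals of X on Z
    xi = y - Zd @ np.linalg.lstsq(Zd, y, rcond=None)[0]     # residuals of Y on Z
    R = eps * xi                                            # products of residuals
    T = sqrt(n) * R.mean() / np.sqrt(R.var())               # studentised statistic
    p = 2 * (1 - 0.5 * (1 + erf(abs(T) / sqrt(2))))         # two-sided normal p-value
    return T, p

rng = np.random.default_rng(2)
n = 2000
Z = rng.normal(size=(n, 3))
x = Z @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
y = Z @ np.array([0.5, 0.5, -1.0]) + rng.normal(size=n)     # X ⫫ Y | Z holds here
T, p = gcm_test(x, y, Z)
```

Under the null, T is approximately standard Gaussian, so the p-value is approximately uniform.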
This basic approach can be extended to the case where X and Y take values in ℝ^{d_X} and ℝ^{d_Y} respectively, by considering a multivariate conditional covariance: with ε := X − E(X | Z) and ξ := Y − E(Y | Z) now vectors, the matrix E(εξ^⊤ | Z) = Cov(X, Y | Z) is a zero matrix when X ⫫ Y | Z, and hence E(εξ^⊤) = 0 under this null. Thus, R̄ defined as before, but where R_i := ε̂_i ξ̂_i^⊤, can form the basis of a test of conditional independence. There are several ways to construct a final test statistic using R̄ ∈ ℝ^{d_X × d_Y}. The approach taken in Shah and Peters (2020) involves taking the maximum absolute value of a version of R̄ with each entry divided by its estimated standard deviation. This, however, does not generalise easily to the functional data setting we are interested in here; we now outline an alternative that can be extended to handle functional data.
To motivate our approach, consider multiplying R̄ by √n and decomposing it as

√n R̄ = U_n/√n + a_n + b_n + c_n,

where

U_n := Σ_{i=1}^n ε_i ξ_i^⊤,
a_n := (1/√n) Σ_{i=1}^n {f(z_i) − f̂(z_i)}{g(z_i) − ĝ(z_i)}^⊤,
b_n := (1/√n) Σ_{i=1}^n ε_i {g(z_i) − ĝ(z_i)}^⊤,
c_n := (1/√n) Σ_{i=1}^n {f(z_i) − f̂(z_i)} ξ_i^⊤.

Observe that U_n is a sum of i.i.d. terms and so the multivariate central limit theorem dictates that U_n/√n converges in distribution to a mean-zero Gaussian. Applying the Frobenius norm ‖⋅‖_F to the a_n term, we get by submultiplicativity and the Cauchy-Schwarz inequality,

‖a_n‖_F ≤ √n {(1/n) Σ_{i=1}^n ‖f(z_i) − f̂(z_i)‖_2^2}^{1/2} {(1/n) Σ_{i=1}^n ‖g(z_i) − ĝ(z_i)‖_2^2}^{1/2},   (3)

where ‖⋅‖_2 denotes the Euclidean norm. The right-hand side here is √n times a product of in-sample root mean squared prediction errors for each of the regressions performed. Under the null of conditional independence, each term of b_n and c_n is mean zero conditional on (X^(n), Z^(n)) and (Y^(n), Z^(n)) respectively. Thus, so long as both of the regression functions are estimated at a sufficiently fast rate, we can expect a_n, b_n and c_n to be small, so the distribution of √n R̄ can be well-approximated by the Gaussian limiting distribution of U_n/√n. As in the univariate setting, it is crucially the product of the prediction errors in (3) that is required to be small, so each root mean squared prediction error term can decay at a relatively slow o(n^{−1/4}) rate.
Unlike the univariate setting, however, √n R̄ is now a matrix and hence we need to choose some sensible aggregator function t : ℝ^{d_X × d_Y} → ℝ such that we can threshold t(√n R̄) to yield a p-value. One option is as follows; we take a different approach as the basis of the GHCM for reasons which will become clear in the sequel. If we vectorise R̄, that is, view the matrix as a d_X d_Y-dimensional vector, then under the assumptions required for the above heuristic arguments to formally hold, √n Vec(R̄) converges to a Gaussian with mean zero and some covariance matrix C. The whitened statistic √n C^{−1/2} Vec(R̄) therefore converges to a Gaussian with identity covariance under the null, and hence

n ‖C^{−1/2} Vec(R̄)‖_2^2 converges in distribution to χ²_{d_X d_Y}.

Replacing C with an estimate Ĉ then yields a test statistic from which we may derive a p-value.
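In the finite-dimensional case this vectorised construction can be sketched as follows. The residual matrices are simulated directly under the null here for illustration, and the χ² calibration is only as reliable as the Gaussian approximation and the covariance estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
n, dx, dy = 500, 2, 3
# Stand-ins for the residual matrices from the regressions of X and Y on Z;
# they are generated independently of each other, so the null holds.
E = rng.normal(size=(n, dx))
Xi = rng.normal(size=(n, dy))

R = np.einsum('ni,nj->nij', E, Xi).reshape(n, dx * dy)   # vec(eps_i xi_i^T)
Rbar = R.mean(axis=0)
C = np.cov(R, rowvar=False)                              # estimated covariance matrix
stat = n * Rbar @ np.linalg.solve(C, Rbar)               # n ||C^{-1/2} vec(Rbar)||^2
# Under the null, stat is approximately chi-squared with dx * dy = 6 degrees of
# freedom, so a p-value follows from the chi-squared survival function.
```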

3.2 The GHCM

We now turn to the setting where X and Y take values in separable Hilbert spaces ℋ_X and ℋ_Y respectively. These could for example be L²([0, 1], ℝ), or ℝ^{d_X} and ℝ^{d_Y} respectively, but where X and Y are vectors of function evaluations. The latter case, which we will henceforth refer to as the finite-dimensional case, corresponds to how data would often be received in practice, with the observation vectors consisting of function evaluations on fixed grids (which are not necessarily equally spaced). However, it is important to recognise that the dimensions d_X and d_Y of the grids may be arbitrarily large, and it is necessary for the methodology to accommodate this; as we will see, the approach for the multivariate setting described in the previous section does not satisfy this requirement, whereas our proposed GHCM does. In some settings, our observed vectors of function evaluations will not be on fixed grids, and the numbers of function evaluations may vary from observation to observation. In Section 3.2.1 we set out a scheme to handle this case and bring it within our framework.
Similarly to the approach outlined in Section 3.1, we propose to first regress each of X^(n) and Y^(n) onto Z^(n) to give residuals ε̂_i ∈ ℋ_X, ξ̂_i ∈ ℋ_Y for i = 1, …, n. (In practice, these regressions could be performed by pfr or pffr in the refund package (Goldsmith et al., 2011; Ivanescu et al., 2015) or boosting (Brockhaus et al., 2020), for instance.) We centre the residuals, as these and other functional regression methods do not always produce mean-centred residuals. With these residuals we proceed as in the multivariate case outlined above, but replacing matrix outer products with outer products in the Hilbertian sense; that is, we define

T_n := (1/√n) Σ_{i=1}^n ε̂_i ⊗ ξ̂_i.

We can show (see Theorem 2) that under the null, provided the analogous prediction error terms in (3) decay sufficiently fast and additional regularity conditions hold, T_n above converges uniformly to a Gaussian distribution in the space of Hilbert-Schmidt operators. This comes as a consequence of new results we prove on uniform convergence of Banachian random variables. Moreover, the covariance operator of this limiting Gaussian distribution can be estimated by the empirical covariance operator

Ĉ := (1/n) Σ_{i=1}^n (ε̂_i ⊗ ξ̂_i) ⊗_HS (ε̂_i ⊗ ξ̂_i) − (T_n/√n) ⊗_HS (T_n/√n),

where ⊗_HS denotes the outer product in the space of Hilbert-Schmidt operators.
An analogous approach to that outlined above for the multivariate setting would involve attempting to whiten this limiting distribution using the square root of the inverse of Ĉ. However, here we hit a clear obstacle: even in the finite-dimensional setting, whenever d_X d_Y ≥ n, the inverse of Ĉ, or of the corresponding estimate from the previous section, cannot exist. Moreover, as indicated by Bai and Saranadasa (1996), who study the problem of testing whether a finite-dimensional Gaussian vector has mean zero, even when the inverses do exist, the estimated inverse covariance may not approximate its population-level counterpart sufficiently well. Instead, Bai and Saranadasa (1996) advocate using a test statistic based on the squared ℓ_2-norm of the Gaussian vector.
We take an analogous approach here, and use as our test statistic ‖T_n‖²_HS, where ‖⋅‖_HS denotes the Hilbert-Schmidt norm. A further advantage of this test statistic is that it admits the alternative representation

‖T_n‖²_HS = (1/n) Σ_{i=1}^n Σ_{j=1}^n ⟨ε̂_i, ε̂_j⟩⟨ξ̂_i, ξ̂_j⟩;

see Section C.1 for a derivation. Only inner products between residuals need to be computed, and so in the finite-dimensional case with the standard inner product, the computational burden is only O(n² max(d_X, d_Y)). As T_n has an asymptotic Gaussian distribution under the null with an estimable covariance operator, we can deduce the asymptotic null distribution of ‖T_n‖²_HS as a function of that of T_n. This leads to the α-level test function ψ_n given by

ψ_n := 1{‖T_n‖²_HS > q_α},   (8)

where q_α is the 1 − α quantile of a weighted sum of independent χ²_1 random variables, with weights given by the d non-zero eigenvalues of Ĉ. These eigenvalues may also be derived from inner products of the residuals: they are equal to the non-zero eigenvalues of the n × n matrix (1/n)(Γ − JΓ − ΓJ + JΓJ), where J ∈ ℝ^{n×n} is a matrix with all entries equal to 1/n, and Γ ∈ ℝ^{n×n} has ijth entry given by

Γ_ij := ⟨ε̂_i, ε̂_j⟩⟨ξ̂_i, ξ̂_j⟩;   (9)

see Section C.1 of the supplementary material for a derivation. Thus, in the finite-dimensional case, the computation of the eigenvalues requires O(n² max(d_X, d_Y, n)) operations. In typical usage therefore, the cost for computing the test statistic given the residuals is dominated by the cost of performing the initial regressions, particularly those corresponding to function-on-function regression. Note that there are several schemes for approximating q_α (Farebrother, 1984; Imhof, 1961; Liu et al., 2009); we use the approach of Imhof (1961) as implemented in the CompQuadForm package in R (Duchesne & de Micheaux, 2010) in all of our numerical experiments. We summarise the above construction of our test function for the finite-dimensional case with the standard inner product in Algorithm 1.
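The inner-product representation leads directly to a compact implementation. The sketch below (with hypothetical residual matrices as inputs) computes the statistic both ways, extracts the weights from the doubly centred matrix of residual inner products, and, as an assumed simplification, approximates the weighted-χ² tail probability by Monte Carlo rather than by Imhof's method.

```python
import numpy as np

def ghcm_pvalue(E, Xi, n_mc=100_000, seed=0):
    """Sketch of the GHCM test from residual matrices E (n x d_X) and
    Xi (n x d_Y); the null quantile is approximated by Monte Carlo."""
    n = E.shape[0]
    E = E - E.mean(axis=0)                    # centre the residuals
    Xi = Xi - Xi.mean(axis=0)
    # ||T_n||_HS^2 via the inner-product representation.
    Gamma = (E @ E.T) * (Xi @ Xi.T)           # Gamma_ij = <eps_i, eps_j><xi_i, xi_j>
    stat = Gamma.sum() / n
    # Direct computation of T_n as a matrix, as a sanity check.
    Tn = E.T @ Xi / np.sqrt(n)
    assert np.allclose(stat, (Tn ** 2).sum())
    # Eigenvalues of the doubly centred matrix give the weights of the
    # limiting weighted sum of chi-squared(1) variables.
    J = np.full((n, n), 1.0 / n)
    lam = np.linalg.eigvalsh((Gamma - J @ Gamma - Gamma @ J + J @ Gamma @ J) / n)
    lam = lam[lam > 1e-10]
    draws = np.random.default_rng(seed).chisquare(1, size=(n_mc, len(lam))) @ lam
    return stat, (draws >= stat).mean()
```

The internal assertion verifies the equality of the two representations of the statistic; under the null the returned p-value should be approximately uniform.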
In principle, different inner products may be chosen, to yield different test functions. However, the theoretical properties of the test function rely on the prediction errors of the regressions, measured in terms of the norm corresponding to the inner product used, being small. In the common case where the observed data are finite vectors of function evaluations, that is, for each i = 1, …, n, x_ik = W_{X,i}(k/d_X) for a function W_{X,i} ∈ L²([0, 1], ℝ), and similarly for y_i, our default recommendation is to use the standard inner product. The residuals ε̂_i ∈ ℝ^{d_X} and ξ̂_i ∈ ℝ^{d_Y} would then similarly correspond to underlying functional residuals via ε̂_ik = W_{ε,i}(k/d_X) for W_{ε,i} ∈ L²([0, 1], ℝ), and similarly for ξ̂_i. We may compare the test function computed based on the computed residuals ε̂_i and ξ̂_i with that which would be obtained when replacing these with the underlying functions W_{ε,i} and W_{ξ,i}. As the test function depends entirely on inner products between residuals, it suffices to compare

ε̂_i^⊤ ε̂_j = Σ_{k=1}^{d_X} W_{ε,i}(k/d_X) W_{ε,j}(k/d_X)  with  ∫_0^1 W_{ε,i}(t) W_{ε,j}(t) dt,   (10)

and similarly for the ξ̂_i. We see that the LHS is d_X times a Riemann sum approximation to the integral on the RHS. The p-value computed is invariant to multiplicative scaling of the test statistic, and so in the so-called densely observed case where d_X is large, the p-value from the finite-dimensional setting would be a close approximation to that which would be obtained with the true underlying functions.
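A quick numerical check of this Riemann-sum relationship, using two arbitrary smooth functions in place of residual curves (the closed-form value of the integral below is standard calculus):

```python
import numpy as np

d = 200
t = np.arange(1, d + 1) / d                  # the grid k/d_X, k = 1, ..., d_X
f = np.sin(2 * np.pi * t)                    # stand-ins for two residual curves
g = np.exp(-t)

lhs = f @ g                                  # grid inner product, as on the LHS of (10)
# Closed form of the L^2 inner product: int_0^1 sin(2*pi*t) e^{-t} dt.
exact = 2 * np.pi * (1 - np.exp(-1)) / (1 + 4 * np.pi ** 2)
# The grid inner product is d times a Riemann sum for the integral:
assert abs(lhs / d - exact) < 1e-3
```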
Other numerical integration schemes could be used to make the approximation even more precise. However, the theory we present in Section 4 that guarantees uniform asymptotic level control and power over certain classes of nulls and alternatives applies directly to the finite-dimensional or infinite-dimensional settings, and so there is no requirement that the approximation error above is small. In particular, there is no strict requirement that the residuals computed correspond to function evaluations on equally spaced grids. However, in that case ε̂_i^⊤ ε̂_j will not necessarily approximate a scaled version of the RHS of (10), and an inner product that maintains this approximation may be more desirable from a power perspective.
In the following section we explain how, when the residuals ε̂_i and ξ̂_i correspond to function evaluations on different grids for each i, we can preprocess these to obtain residuals corresponding to fixed grids, which may then be fed into our algorithm.

3.2.1 Data observed on irregularly spaced grids of varying lengths

We now consider the case where ε̂_i ∈ ℝ^{d_{X,i}} with its kth component given by ε̂_ik = W_{ε,i}(t_ik) for grid points t_ik ∈ [0, 1], and similarly for ξ̂_i. Such residuals would typically be output by regression methods when supplied with functional data x_i ∈ ℝ^{d_{X,i}} and y_i ∈ ℝ^{d_{Y,i}} corresponding to function evaluations on observation-specific grids. In order to apply our GHCM methodology, we need to represent these residual vectors by vectors of equal lengths corresponding to fixed grids. Our approach is to construct for each i natural cubic interpolating splines Ŵ_{ε,i} and Ŵ_{ξ,i} corresponding to ε̂_i and ξ̂_i respectively. We may compute the inner product between these functions in L²([0, 1], ℝ) exactly and efficiently, as it is the integral of a piecewise polynomial with the degree in each piece at most 6. This gives us the entries of the matrix Γ (9), which we may then use in lines 7 and following of Algorithm 1. Furthermore, Theorems 3 and 4 apply equally well to the setting considered here provided the residuals are understood as the interpolating splines described above, and the fitted regression functions are defined accordingly as the difference between the observed functional responses and these functional residuals.
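A sketch of this interpolate-then-integrate step, assuming both grids span [0, 1]: it uses scipy's CubicSpline with natural boundary conditions, and 4-point Gauss-Legendre quadrature on each piece between breakpoints, which is exact because the integrand there is a polynomial of degree at most 6.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def spline_inner_product(t1, v1, t2, v2):
    """L^2 inner product of the natural cubic interpolating splines through
    (t1, v1) and (t2, v2); assumes both grids span the same interval."""
    s1 = CubicSpline(t1, v1, bc_type='natural')
    s2 = CubicSpline(t2, v2, bc_type='natural')
    knots = np.union1d(t1, t2)                    # breakpoints of the product spline
    nodes, weights = np.polynomial.legendre.leggauss(4)   # exact for degree <= 7
    total = 0.0
    for a, b in zip(knots[:-1], knots[1:]):
        x = 0.5 * (b - a) * nodes + 0.5 * (a + b)          # map nodes to [a, b]
        total += 0.5 * (b - a) * (weights @ (s1(x) * s2(x)))
    return total
```

For residuals sampled from a straight line and a constant, the natural splines reproduce the underlying functions exactly (both have vanishing second derivative), so the routine returns ∫_0^1 t dt = 1/2 up to floating-point error.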

THEORETICAL PROPERTIES OF THE GHCM
In this section, we provide uniform level control guarantees for the GHCM, and uniform power guarantees for a version incorporating sample splitting; note that we do not recommend the use of the latter in practice but consider it a proxy for the GHCM that is more amenable to theoretical analysis in non-null settings. Before presenting these results, we explain the importance of uniform results in this context, and set out some notation relating to uniform convergence.

Background on uniform convergence
In Section 2 we saw that, even when 𝒫 consists of Gaussian distributions, we cannot ensure that our test has both the desired size α over 𝒫_0 and also non-trivial power properties against alternative distributions in 𝒬. We also have the following related result.
Proposition 1. Let ℋ_Z be a separable Hilbert space with orthonormal basis (e_k)_{k∈ℕ}. Let 𝒫 be the family of Gaussian distributions for (X, Y, Z) ∈ ℝ × ℝ × ℋ_Z with injective covariance operator and where, writing Z_k := ⟨Z, e_k⟩, the pair (X, Y) depends on Z only through finitely many components Z_1, …, Z_r for some r ∈ ℕ. Then, for any test ψ_n and any Q ∈ 𝒬,

P_Q(ψ_n = 1) ≤ sup_{P ∈ 𝒫_0^Q} P_P(ψ_n = 1).

In other words, even if we know a basis (e_k)_{k∈ℕ} such that in particular the conditional expectations E(X | Z) and E(Y | Z) are sparse, in that they depend only on finitely many components Z_1, …, Z_r (with r ∈ ℕ unknown), and the marginal distribution of Z is known exactly, there is still no non-trivial test of conditional independence.
In this specialised setting, it is however possible to give a test of conditional independence that will, for each fixed null hypothesis P ∈ 𝒫_0, yield exact size control and power against all alternatives in 𝒬 for n sufficiently large. These properties are for example satisfied by a suitable nominal α-level t-test, which obeys

sup_{P ∈ 𝒫_0} lim sup_{n→∞} P_P(ψ_n = 1) ≤ α;   (11)

see Section C.2 in the supplementary material for a derivation. This illustrates the difference between the pointwise asymptotic level control in the left-hand side of (11), and uniform asymptotic level control, given by interchanging the limit and the supremum.
Our analysis instead focuses on proving that the GHCM asymptotically maintains its level uniformly over a subset of the conditional independence null. In order to state our results we first introduce some definitions and notation to do with uniform stochastic convergence. Throughout the remainder of this section we tacitly assume the existence of a measurable space (Ω, ℱ) whereupon all random quantities are defined. The measurable space is equipped with a family of probability measures (P_P)_{P∈𝒫} such that the distribution of (X, Y, Z) under P_P is P. For a subset 𝒮 ⊆ 𝒫, we say that a sequence of random variables W_n converges uniformly in distribution to W over 𝒮, and write W_n ⇝_𝒮 W, if

lim_{n→∞} sup_{P∈𝒮} d_BL{L_P(W_n), L_P(W)} = 0,

where d_BL denotes the bounded Lipschitz metric and L_P(⋅) denotes the law under P_P. We say that W_n converges uniformly in probability to W over 𝒮, and write W_n ⇉_𝒮 W, if for every ε > 0, lim_{n→∞} sup_{P∈𝒮} P_P(‖W_n − W‖ > ε) = 0. We sometimes omit the subscript 𝒮 when it is clear from the context. A full treatment of uniform stochastic convergence in a general setting is given in Section B of the supplementary material. Throughout this section we emphasise the dependence of many of the quantities in Section 3.1 on the distribution of (X, Y, Z) with a subscript P, for example, f_P, ε_P etc.
In Sections 4.2 and 4.3 we present general results on the size and power of the GHCM. We take 𝒫 to be the set of all distributions over ℋ_X × ℋ_Y × 𝒵, and 𝒫_0 to be the corresponding conditional independence null. We, however, show properties of the GHCM under smaller sets of distributions 𝒫̃ ⊂ 𝒫 with corresponding null distributions 𝒫̃_0 ⊂ 𝒫_0, where in particular certain conditions on the quality of the regression procedures on which the test is based are met. In Section 4.4 we consider the special case where the regressions of each of X and Y on Z are given by functional linear models, and show that Tikhonov-regularised (ridge) regression can satisfy these conditions. We note that throughout, the dimensions d_X and d_Y may be finite or infinite.

Size of the test
In order to state our result on the size of the GHCM, we introduce the following quantities. Let u_P(z) := E_P(‖ε_P‖² | Z = z) and v_P(z) := E_P(‖ξ_P‖² | Z = z) denote the conditional variance functions of the residuals. We further define the in-sample unweighted and weighted mean squared prediction errors of the regressions: the unweighted errors in (12) are M^f_{n,P} := n⁻¹ ∑_{i=1}^n ‖f̂(z_i) − f_P(z_i)‖² and M^g_{n,P} := n⁻¹ ∑_{i=1}^n ‖ĝ(z_i) − g_P(z_i)‖², with weighted versions in (13) that additionally involve u_P and v_P. The result below shows that on a subset 𝒬₀ of the null, distinguished primarily by the product of the prediction errors in (12) being small, the operator-valued statistic T_n converges in distribution uniformly to a mean-zero Gaussian whose covariance can be estimated consistently. We remark that the prediction error quantities in (12) and (13) are 'in-sample' prediction errors, reflecting only the quality of the estimates of the conditional expectations f and g at the observed values z₁, …, z_n. Condition (i) of the result requires, in essence, that the product of the mean squared prediction errors vanishes faster than 1/n, and so allows for relatively slow o(n^{−1/2}) rates for the individual mean squared prediction errors. Moreover, if one regression yields a faster rate, the other can go to zero more slowly. These properties are shared with the regular generalised covariance measure and more generally doubly robust procedures popular in the literature on causal inference and semiparametric statistics (Chernozhukov et al., 2018; Robins & Rotnitzky, 1995; Scharfstein et al., 1999). Condition (ii) is much milder, and if the conditional variances u_P and v_P are bounded almost surely, it is satisfied when simply M^f_{n,P}, M^g_{n,P} ⇉ 0. We note that, importantly, the regression methods are not required to extrapolate well beyond the observed data. We show in Section 4.4 that when the regression models are functional linear models and ridge regression is used for the functional regressions, (i) and (ii) hold under much weaker conditions than are typically required for out-of-sample prediction error guarantees in the literature.
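The doubly robust behaviour behind condition (i) can be illustrated with a simple scalar sketch (a hypothetical toy example, not the authors' implementation): even when both regression estimates are individually inaccurate, with mean squared errors decaying slower than n^{−1/2}, the bias of the studentised residual covariance is driven by the product of the two errors and so remains asymptotically negligible.

```python
import numpy as np

rng = np.random.default_rng(0)

def studentised_cov(eps, xi):
    """sqrt(n) times the mean of products of residuals, studentised --
    a scalar analogue of the GHCM test statistic."""
    prods = eps * xi
    return np.sqrt(len(prods)) * prods.mean() / prods.std(ddof=1)

n, reps = 2000, 200
stats = []
for _ in range(reps):
    Z = rng.normal(size=n)
    X = 2.0 * Z + rng.normal(size=n)   # X depends on Z only
    Y = -1.0 * Z + rng.normal(size=n)  # Y depends on Z only, so X indep Y given Z
    # Deliberately biased regression estimates: each coefficient is off by
    # e = n^{-0.35}, so each MSPE ~ n^{-0.7} = o(n^{-1/2}), while the bias of
    # the statistic scales like sqrt(n) * e^2 = n^{-0.2} -> 0.
    e = n ** (-0.35)
    eps = X - (2.0 + e) * Z
    xi = Y - (-1.0 + e) * Z
    stats.append(studentised_cov(eps, xi))

bias = abs(np.mean(stats))
print(bias < 0.5)  # level is asymptotically maintained despite biased fits
```

The same calculation with only one accurate regression and one badly biased one inflates the statistic, which is the trade-off the product condition captures.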
Conditions (iii) and (iv) imply that the family {ε_P ⊗ ξ_P : P ∈ 𝒬₀} is uniformly tight. Similar tightness conditions are required in Chen and White (1998, lemma 3.1) in the context of functional central limit theorems. Note that if d_X and d_Y are both finite, this condition is always satisfied.
The result below shows that the GHCM test defined in (8) has type I error control uniformly over the class 𝒬₀ given in Theorem 2, provided an additional assumption of non-degeneracy of the covariance operators is satisfied.

Power of the test
We now study the power of the GHCM. It is not straightforward to analyse what happens to the test statistic T_n when the null hypothesis is false in the setup we have considered so far. However, if we modify the test such that the regression function estimates f̂ and ĝ are constructed using an auxiliary dataset independent of the main data (x_i, y_i, z_i)_{i=1}^n, the behaviour of T_n is more tractable. Given a single sample, this could be achieved through sample splitting, and cross-fitting (Chernozhukov et al., 2018) could be used to recover the loss in efficiency from the split into smaller datasets. However, we do not recommend such sample splitting in practice here, and view it more as a technical device that facilitates our theoretical analysis. As we require f̂ and ĝ to satisfy (i) and (ii) of Theorem 2, these estimators would need to perform well out of sample rather than just on the observed data, which is typically a harder task.
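The sample-splitting scheme described above can be sketched generically as follows (an illustration rather than the authors' code; the least-squares fit and the mean-of-products statistic are placeholder choices): two-fold cross-fitting estimates the regressions on one half of the data, evaluates the residual statistic on the other half, and averages the two versions.

```python
import numpy as np

def crossfit_statistic(X, Y, Z, fit, stat):
    """Two-fold cross-fitting: regressions are estimated on one half of the
    data, the residual statistic is evaluated on the other half, then the
    roles are swapped and the two values averaged."""
    n = len(Z)
    first, second = np.arange(n // 2), np.arange(n // 2, n)
    vals = []
    for train, test in ((first, second), (second, first)):
        fhat = fit(Z[train], X[train])
        ghat = fit(Z[train], Y[train])
        vals.append(stat(X[test] - fhat(Z[test]), Y[test] - ghat(Z[test])))
    return 0.5 * (vals[0] + vals[1])

def linear_fit(Z, W):
    # placeholder regression method: a least-squares line
    b = np.polyfit(Z, W, 1)
    return lambda z: np.polyval(b, z)

rng = np.random.default_rng(1)
n = 4000
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = -Z + rng.normal(size=n)           # X and Y independent given Z
val = crossfit_statistic(X, Y, Z, linear_fit, lambda e, x: np.mean(e * x))
print(abs(val) < 0.1)                 # mean residual product is near zero
```

Because each statistic is evaluated on data unseen by the corresponding fits, the analysis only needs out-of-sample prediction guarantees for the regression method, which is exactly the harder requirement noted above.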
Given that our test is based on an empirical version of E(Cov(X, Y | Z)) = E(ε ⊗ ξ), we can only hope to have power against alternatives where this is non-zero. For such alternatives, however, we have positive power whenever the Hilbert-Schmidt norm of the expected conditional covariance operator is at least c/√n for a constant c > 0, as the following result shows. Furthermore, an α-level GHCM test (constructed using independent estimates f̂ and ĝ) satisfies the following two statements.
1. Redefining 𝒬₀ = 𝒬 ∩ 𝒫₀, we have that (15) is satisfied, and so an α-level GHCM test has size converging to α uniformly over 𝒬₀.
2. For every 0 < α < β < 1 there exist c > 0 and N ∈ ℕ such that for any n ≥ N, the power of the test is at least β uniformly over 𝒬_{c,n}, that is, inf_{P∈𝒬_{c,n}} P_P(the test rejects) ≥ β.
In a setting where X, Y and Z are related by linear regression models, we can write down ‖E Cov(X, Y | Z)‖_HS more explicitly. Suppose Z, ε and ξ are independent random variables in L²([0, 1], ℝ), with X and Y determined by linear models in Z and X respectively. Then E Cov(X, Y | Z) is an integral operator whose kernel involves v(t, u), the covariance function of ε. The Hilbert-Schmidt norm ‖E Cov(X, Y | Z)‖_HS is then given by the L²([0, 1]², ℝ)-norm of this kernel. We investigate the empirical performance of the GHCM in such a setting in Section 5.1.2.
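To make the Hilbert-Schmidt norm concrete, the following sketch discretises an integral-operator kernel of the form κ(t, u) = ∫ b(s, u) v(t, s) ds on a grid and computes its L²([0, 1]², ℝ)-norm; the coefficient surface b and the Brownian-motion error covariance are assumed for illustration and do not come from the paper.

```python
import numpy as np

m = 200
t = np.linspace(0.0, 1.0, m)
v = np.minimum.outer(t, t)                   # covariance of Brownian motion
b = np.exp(-np.subtract.outer(t, t) ** 2)    # hypothetical coefficient surface
kappa = (v @ b) / m                          # Riemann sum over s
hs_norm = np.sqrt(np.sum(kappa ** 2) / m ** 2)  # L2([0,1]^2) norm of the kernel
print(0.0 < hs_norm < 1.0)
```

On a grid, the Hilbert-Schmidt norm of an integral operator is simply the (scaled) Frobenius norm of its discretised kernel, which is why this quantity is convenient to work with numerically.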

GHCM using linear function-on-function ridge regression
Here we consider a special case of the general setup used in Sections 4.2 and 4.3, where we assume that Z takes values in a Hilbert space ℋ_Z and that, under the null of conditional independence, the Hilbertian X and Y are related to the Hilbertian Z via the linear models (16) and (17). Here S^X_P is a Hilbert-Schmidt operator such that S^X_P Z = f(Z) := E(X | Z), with analogous properties holding for S^Y_P, and it is assumed that EZ = 0. If X, Y and Z are elements of L²([0, 1], ℝ), this is equivalent to the integral-kernel form (18), where β^X_P is a square-integrable function, and similarly for the relationship between Y and Z. Such functional response linear models have been discussed by Ramsay and Silverman (2005, chapter 16), and studied by Chiou et al. (2004), Yao et al. (2005) and Crambes and Mas (2013), for example. Benatia et al. (2017) propose a Tikhonov-regularised estimator analogous to ridge regression (Hoerl & Kennard, 2000); applied to the regression model (16), this estimator takes the form of the penalised least-squares problem (19), where γ > 0 is a tuning parameter. We now consider a specific instance of the general GHCM framework using regression estimates based on (19). Specifically, we form the estimate Ŝ^X of S^X by solving the optimisation in (19) with a data-driven regularisation parameter γ, where μ̂₁ ≥ μ̂₂ ≥ ⋯ ≥ μ̂_n ≥ 0 are the ordered eigenvalues of the n × n matrix K with K_ij = ⟨z_i, z_j⟩/n. We form the estimate Ŝ^Y of S^Y analogously, but with the x_i replaced by y_i in (19). Note that in the case where K = 0, so that γ is not defined, we simply take Ŝ^X and Ŝ^Y to be zero operators, that is, no regression is performed. The data-driven choice of γ above is motivated by an upper bound on the in-sample MSPE of the estimators Ŝ^X and Ŝ^Y (see Lemma 17 in the supplementary material), where we have omitted some distribution-dependent factors of ‖S^X_P‖²_HS or ‖S^Y_P‖²_HS and a variance factor; a similar strategy was used in an analysis of kernel ridge regression (Shah & Peters, 2020) which closely parallels ours here. This choice allows us to conduct the theoretical analysis that we present below. In practice, other choices of regularisation parameter, such as cross-validation-based approaches, may perform even better, as could alternative methods that are not based on Tikhonov regularisation.
In the following result, we take the α-level GHCM test (8) with estimated regression functions f̂ and ĝ yielding the fitted values given by (21). Note that in the finite-dimensional setting where X^(n) ∈ ℝ^{n×d_X} (which is also covered by the result below), the matrix of fitted values for the X^(n) regression is K(K + γI)⁻¹X^(n), and similarly for the Y^(n) regression.
Theorem 5. Let 𝒬₀ ⊂ 𝒫₀ be such that (16) and (17) are satisfied, and moreover (iii) and (iv) of Theorem 2 and (14) hold when f̂ and ĝ are as in (21). Suppose further that lim_{τ↘0} sup_{P∈𝒬₀} ∑_{k=1}^∞ min(λ_{k,P}, τ) = 0, where (λ_{k,P})_{k∈ℕ} denote the ordered eigenvalues of the covariance operator of Z under P.
Then the α-level GHCM test has size converging to α uniformly over 𝒬₀. Condition (iii) is generally satisfied, by the dominated convergence theorem, for any family 𝒬₀ for which the sequences of eigenvalues of the covariance operators are uniformly bounded above by a summable sequence. As a very simple example where all the remaining conditions of Theorem 5 are satisfied, we may consider the family of distributions 𝒬₀ where Z, ε_P in (22) and ξ_P in (23) are independent, and the latter two are Brownian motions with variances σ²_{ε,P} and σ²_{ξ,P} respectively. If the coefficient functions β^X_P corresponding to X in (18) are in L²([0, 1]², ℝ) with norms bounded above for all P ∈ 𝒬₀, an equivalent assumption holds for the coefficient functions relating to Y, and σ²_{ε,P} and σ²_{ξ,P} are bounded from above and below uniformly, then 𝒬₀ satisfies all the requirements of Theorem 5.
The proof of Theorem 5 relies on Lemma 17 in Section C.5 of the supplementary material, which gives a bound on the in-sample MSPE of ridge regression in terms of the decay of the eigenvalues λ_{k,P}, and which may be of independent interest. For example, if these are dominated by an exponentially decaying sequence, the in-sample MSPE is o(log n/n) as n → ∞ (see Corollary 2). This matches the out-of-sample MSPE bound obtained in Crambes and Mas (2013, corollary 5) in the same setting as that described here, but the out-of-sample result additionally requires convexity and lower bounds on the decay of the sequence of eigenvalues of the covariance operator, and stronger moment assumptions on the norm of the predictor. Similarly, other related results (e.g. Cai & Hall, 2006; Hall & Horowitz, 2007) require additional eigen-spacing conditions in place of convexity, and upper and lower bounds on the decay of the eigenvalues. Furthermore, while some of these bounds are uniform over values of the linear coefficient operator for fixed distributions of the predictors, our in-sample MSPE bound is uniform over both the coefficients and the distributions of the predictor. This illustrates how in-sample and out-of-sample prediction are very different in the functional data setting; reliance on the former being small, as with the GHCM, is desirable due to the weaker conditions needed to guarantee it.
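A convenient feature of the Tikhonov-regularised estimator is that its in-sample fitted values depend on the data only through the Gram matrix K, via the hat-matrix identity K(K + γI)⁻¹ applied to the response matrix (a standard kernel-ridge identity). The sketch below illustrates this on gridded data; the grid discretisation and the toy data-generating model are our own assumptions.

```python
import numpy as np

def ridge_fitted(Z, X, gamma):
    """In-sample fitted values for Tikhonov-regularised function-on-function
    regression on a common grid of m points, computed through the Gram
    matrix K_ij = <z_i, z_j>/n (a standard kernel-ridge identity)."""
    n, m = Z.shape
    K = (Z @ Z.T) / (n * m)          # inner products <z_i, z_j>/n by Riemann sum
    return K @ np.linalg.solve(K + gamma * np.eye(n), X)

rng = np.random.default_rng(2)
n, m = 50, 100
# toy data: Brownian-motion-like Z, X a noisy transformation of Z (assumed)
Z = np.cumsum(rng.normal(size=(n, m)), axis=1) / np.sqrt(m)
X = 0.5 * Z + 0.1 * rng.normal(size=(n, m))
Xhat = ridge_fitted(Z, X, gamma=1e-3)
in_sample_mspe = np.mean((Xhat - X) ** 2)
print(in_sample_mspe < np.mean(X ** 2))   # the fit beats the zero operator
```

Only the n × n Gram matrix is ever inverted, so the computation does not depend on the (possibly infinite) dimension of the function space beyond evaluating inner products.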

EXPERIMENTS
In this section we present the results of numerical experiments investigating the performance of our proposed GHCM methodology. We implement the GHCM as described in Algorithm 1, with scalar-on-function and function-on-function regressions performed using the pfr and pffr functions, respectively, from the refund package (Goldsmith et al., 2020). These are functional linear regression methods which rely on fitting smoothers implemented in the mgcv package (Wood, 2017); we choose the tuning parameters for these smoothers (the dimension of the basis expansions of the smooth terms) as per the standard guidance, such that a further increase does not decrease the deviance. In Section 5.3 in the supplement, we study high-dimensional EEG data using the GHCM with regressions performed using FDboost.
We note that, to the best of our knowledge, neither FDboost nor the regression methods in refund come with prediction error bounds (such as the ones derived in Section 4.4) that are required for obtaining formal guarantees for the GHCM; nevertheless, they are well-developed and widely used functional regression methods, and our aim here is to demonstrate empirically that they perform suitably well in terms of prediction such that, when used with the GHCM, type I error is maintained across a variety of settings. In Section D of the supplementary material, we include additional simulations that consider, among others, settings with heavy-tailed errors, test the GHCM with FDboost in further settings and examine the local power of the GHCM.

Size and power simulation
In this section we examine the size and power properties of the GHCM when testing the conditional independence X ⫫ Y | Z. We take X, Z ∈ L²([0, 1], ℝ), and first consider the setting where Y is scalar. In Section 5.1.2 we present experiments for the case where Y ∈ L²([0, 1], ℝ), so all variables are functional. All simulated functional random variables are sampled on an equidistant grid of [0, 1] with 100 grid points.
5.1.1 Scalar Y, functional X and Z
Here we consider the setup where Z is standard Brownian motion and X and Y are related to Z through functional linear models. The variables N_X, N_Y and Z are independent, with N_X a Brownian motion with variance σ²_X and N_Y ∼ 𝒩(0, 1), so X ⫫ Y | Z. We vary the parameters σ_X ∈ {0.1, 0.25, 0.5, 1} and a ∈ {2, 6, 12}, where a indexes the coefficient functions β_a and θ_a. We generate n i.i.d. observations from each of the 4 × 3 = 12 models given by (22) and (23), for sample sizes n ∈ {100, 250, 500, 1000}. Increasing a or decreasing σ_X increases the difficulty of the testing problem: for large a, β_a oscillates more, making it harder to remove the dependence of X on Z, while a smaller σ_X makes Y closer to the integral of X, and so increases the marginal dependence of X and Y. We apply the GHCM and compare the resulting tests to those corresponding to the significance test for X in a regression of Y on (X, Z) implemented in pfr. The rejection rates of the two tests at the 5% level, averaged over 100 simulation runs, can be seen in Figure 1. We see that the pfr test has size greatly exceeding its level in the more challenging large-a, small-σ_X settings, with large values of n exposing most clearly the miscalibration of the test statistic. In these settings, Y may be approximated reasonably well simply by the integral of X, and is also well-approximated by the true regression function that features only Z. Regularisation encourages pfr to fit a model where X determines the response, rather than Z, and the p-values reflect this. On the other hand, the GHCM tests maintain reasonable type I error control across the settings considered here.
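The claim that a smaller σ_X strengthens the marginal dependence between X and Y can be checked directly in a simplified version of the null model; the oscillating coefficient shape and the pointwise form of the model for X below are hypothetical stand-ins rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 2000
t = np.linspace(0.0, 1.0, m)

def simulate(sigma_x, a):
    """Null model: X and Y both driven by Brownian Z, so X indep Y given Z."""
    Z = np.cumsum(rng.normal(size=(n, m)), axis=1) / np.sqrt(m)
    beta = np.sin(a * np.pi * t)       # hypothetical oscillating coefficient
    NX = np.cumsum(rng.normal(size=(n, m)), axis=1) / np.sqrt(m)
    X = beta * Z + sigma_x * NX        # pointwise effect of Z (assumed form)
    Y = Z.mean(axis=1) + rng.normal(size=n)   # Y depends on Z only
    return X, Y

# smaller sigma_x -> stronger marginal dependence between int(X) and Y
cors = [abs(np.corrcoef(X.mean(axis=1), Y)[0, 1])
        for X, Y in (simulate(s, a=2) for s in (0.1, 0.5, 1.0))]
print(cors[0] > cors[1] > cors[2])
```

This marginal dependence is exactly what a miscalibrated significance test can mistake for conditional dependence, which is the failure mode observed for pfr above.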
To investigate the power properties of the test, we simulate Z as before, with X also generated according to (22). We replace the regression model (23) for Y with one in which Y depends on X, where N_Y ∼ 𝒩(0, 1) as before; the coefficient function for X oscillates more as a increases. The rejection rates at the 5% level can be seen in Figure 2. While the two approaches perform similarly when a = 2, the pfr test has higher power in the more complex cases. However, as the results from the size analysis in Figure 1 show, null cases are also rejected in the analogous settings.
To illustrate the full distribution of p-values from the two methods under the null and the alternative, we plot false positive rates and true positive rates in each setting as a function of the chosen significance level α of the test. The full set of results can be seen in Section D of the supplementary material, and a plot for a subset of the simulation settings where n = 500 and σ_X ∈ {0.1, 0.25, 0.5} is presented in Figure 3. We see that both tests distinguish null from alternative well in the cases with a small and σ_X large. The p-values of the GHCM are close to uniform in the settings considered, whereas the distribution of the pfr p-values is heavily dependent on the particular null setting, illustrating the difficulty of calibrating this test. In Section D of the supplementary material we also present the results of two additional sets of experiments. We repeat the experiments above using the FDboost package for the regressions in place of the refund package. We see that the performance of the GHCM with FDboost is broadly similar to that displayed in Figures 1 and 2, supporting our theoretical results, which indicate that provided the prediction errors of the regression methods used are sufficiently small, the test will perform similarly.
We also consider the case where the noise is heavy-tailed. Specifically, we present analogous plots for settings where N_Y is t-distributed with different degrees of freedom, n = 500 and σ_X = 0.25; the results are similar to Figure 3, with the GHCM maintaining type I error control and pfr tending to be anti-conservative in the more challenging settings.

5.1.2 Functional X, Y and Z
In this section we modify the setup and consider functional Y ∈ L²([0, 1], ℝ). We take X and Z as in Section 5.1.1, but in the null settings we let Y be given by a functional linear model in Z whose error N_Y is a standard Brownian motion. Note that this is a particularly challenging setting in which to maintain type I error control, as X and Y are then highly correlated; moreover, the biases from regressing each of X and Y on Z will tend to be in similar directions, making the equivalent of the term a_n in (2) potentially large.
In the alternative settings, we additionally let Y depend on X, with N_Y again being a standard Brownian motion.
The rejection rates at the 5% level, averaged over 100 simulation runs, can be seen in Figure 4. We see that, as in the case where Y ∈ ℝ, the GHCM maintains good type I error control in the settings considered, and has power increasing with n and σ_X as expected. We note that a comparison with the p-values from ff-terms in the pffr function of the refund package does not seem helpful here: in our experiments, the corresponding tests consistently reject in true null settings, even for simple models.
In Section D of the supplementary material we look at the subset of the settings considered above with n = 500 and σ_X = 0.25, but where X and Y are observed on irregular grids of varying lengths. We first preprocess the residuals output by the regression method as described in Section 3.2.1 and then apply the GHCM. We observe that the performance is similar to that in the fixed-grid setting, although the power is lower when the average grid length is smaller, and the type I error increases slightly above nominal levels in the most challenging a = 12 setting.
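When curves are observed on irregular grids, one simple preprocessing step is to bring every residual curve onto a common grid before computing inner products. The sketch below uses linear interpolation as a basic stand-in for the preprocessing of Section 3.2.1, which is not reproduced here.

```python
import numpy as np

def to_common_grid(times, values, grid):
    """Linearly interpolate a curve observed at irregular times onto a
    common grid (a simple stand-in for more refined preprocessing)."""
    return np.interp(grid, times, values)

rng = np.random.default_rng(4)
grid = np.linspace(0.0, 1.0, 100)
times = np.sort(rng.uniform(0.0, 1.0, size=17))   # irregular observation times
values = np.sin(2.0 * np.pi * times)              # observed curve values
curve = to_common_grid(times, values, grid)
print(curve.shape)
```

Once all residual curves share a grid, the inner products ⟨ε̂_i, ξ̂_i⟩ can be approximated by Riemann sums exactly as in the fixed-grid setting.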

Confidence intervals for truncated linear models
In this section we consider an application of the GHCM in constructing a confidence interval for the truncation point, a parameter in [0, 1], in a truncated functional linear model (Hall & Hooker, 2016).

Testing for edges in a functional graphical model
Whereas existing methodology typically assumes a Gaussian functional graphical model and outputs a point estimate of the conditional independence graph, here we are able to test for the presence of each edge, with type I error control guaranteed for data-generating processes where our regression methods perform suitably well, as indicated by Theorem 3. We illustrate this on an EEG dataset from a study on alcoholism (Zhang et al., 1995; Ingber, 1997, 1998). The study participants were shown one of three visual stimuli repeatedly, and simultaneous EEG activity was measured across 64 channels over the course of 1 second at 256 measurements per second. While the study included both a control group and an alcoholic group, we restrict our analysis to the alcoholic group consisting of 77 subjects, and further restrict ourselves to a single type of visual stimulus. We preprocess the data as in Qiao et al. (2019), averaging across the repetitions of the experiment for each subject and using an order-96 FIR filter implemented in the eegkit R-package (Helwig, 2018) to filter the averaged curves at the α frequency band (between 8 and 12.5 Hz). We thus obtain 64 α-filtered frequency curves for each of the 77 subjects.
Given the low number of observations compared with the 64 functional variables, there is not enough data to reject the null of edge absence even if a true edge were present. We therefore aim for a coarser analysis, grouping the variables by brain region and then further according to whether the variable corresponds to the right or left hemisphere of the brain. This yields disjoint groups G₁, …, G₂₄ comprising 52 variables in total, after omitting reference channels and midline channels that could not easily be classified as being in either hemisphere; that is, G₁ ∪ ⋯ ∪ G₂₄ = {1, …, 52}. We suppose the observed data are i.i.d. copies of the functional variables (X₁, …, X₅₂), and then test the null hypothesis (29) for each j, k ∈ {1, …, 24} with j ≠ k; that is, we test for edge presence in the conditional independence graph of the grouped variables. Here, the conditional independence graph over the grouped variables is defined as an undirected graph over G₁, …, G₂₄, in which the edge between G_j and G_k, j ≠ k, is missing if and only if (29) holds; that is, rejection of the null in (29) for j and k indicates that the conditional independence graph has an edge between G_j and G_k.
To construct p-values for the null in (29) using the GHCM, we must regress, for each l ∈ G_j and r ∈ G_k, each of the functional variables X_l and X_r on the set of variables in the conditioning set. Since the regressions involve large numbers of functional predictors, the refund package is not suitable for performing them. Instead, we use the FDboost package in R, which is well-suited to high-dimensional functional regressions (Brockhaus et al., 2020). We fit a concurrent functional model (Ramsay & Silverman, 2005); the inclusion of additional functional linear terms did not improve the fit. We assessed the appropriateness of this regression method for data of the sort studied here through simulations described in Section D of the supplement. Figure 6 summarises the results of the GHCM applied to test the presence of each edge in the conditional independence graph. We see that some of the brain regions located close to each other appear to be connected, as one might expect. Note that the network presented includes all edges that had a p-value less than 5%. The edge PO-R-O-R has a Bonferroni-corrected p-value of 0.0027, and is the only edge yielding a corrected p-value less than 5%. Applying the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) to control the false discovery rate at the 5% level selects this edge and also PO-L-O-L. We may compare these results with those of Qiao et al. (2019) and Qiao et al. (2020), who study the same dataset but consider the different problem of estimating the conditional independence graph rather than testing edge presence as we do here. We see that our results are broadly in line with their estimates: for example, there are edges estimated between the groups represented by PO-R and O-R (the group pair yielding the lowest p-value) even in some of their sparsest estimated graphs.
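The multiple-testing corrections used above can be sketched as follows; the p-values in this example are invented for illustration (they are not the EEG results), and with 24 groups there are 276 unordered pairs tested. As in the data analysis, Benjamini-Hochberg can reject an edge whose Bonferroni-corrected p-value exceeds 5%.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return indices of hypotheses rejected by the Benjamini-Hochberg
    procedure controlling the false discovery rate at level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    if not passed.any():
        return np.array([], dtype=int)
    k = np.nonzero(passed)[0].max()      # largest index meeting its threshold
    return np.sort(order[: k + 1])

# hypothetical p-values for the 276 group pairs (not the EEG results)
pvals = np.array([1e-5, 3e-4, 0.03, 0.2, 0.5] + [0.9] * 271)
bh_rejections = benjamini_hochberg(pvals)
bonferroni_rejections = np.nonzero(pvals < 0.05 / len(pvals))[0]
print(len(bh_rejections), len(bonferroni_rejections))
```

Here BH rejects two hypotheses while Bonferroni rejects one, mirroring the pattern in the EEG analysis where BH additionally selects a second edge.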

CONCLUSION
Testing the conditional independence X ⫫ Y | Z has been shown to be a hard problem in the setting where X, Y, Z are all real-valued and Z is absolutely continuous with respect to Lebesgue measure (Shah & Peters, 2020). This hardness takes a more extreme form in the functional setting: even when (X, Y, Z) are jointly Gaussian with non-degenerate covariance and Z and at most one of X and Y are infinite-dimensional, there is no non-trivial test of conditional independence. This requires us to (i) understand the form of an 'effective null hypothesis' for a given hypothesis test, and (ii) develop tests whose effective nulls are somewhat interpretable, so that domain knowledge can more easily inform the choice of a conditional independence test to use on any given dataset.
In order to address these two needs, we have introduced a new family of tests for functional data and developed the necessary uniform convergence results to understand the forms of null hypotheses over which we can have type I error control. We see that for our proposed GHCM tests, error control is guaranteed under conditions largely determined by the in-sample prediction error rates of the regressions upon which the test is based. Whilst in-sample results share similarities with the more common out-of-sample results in some settings, the lack of a need to extrapolate beyond the data in the former leads to important differences when regressing on functional data. In particular, no eigen-spacing conditions or lower bounds on the eigenvalues of the covariance of the regressor are required for the in-sample error to be controlled when ridge regression is used. It would be interesting to investigate the in-sample MSPE properties of other regression methods and to understand whether such conditions can be avoided more generally.
One attractive feature of the GHCM is that it depends only on inner products between the residuals produced by the regression methods. An interesting question is whether different inner products can be constructed to have power against different sets of alternatives, by emphasising certain regions of the function domains, for example.
Another direction which may be fruitful to pursue is to adapt the GHCM so that it has power against alternatives where E Cov(X, Y | Z) = 0. It is likely that further conditions will be required of the regression methods beyond their in-sample prediction errors being small, and so some interpretability of the effective null hypothesis, and indeed its size compared with the full null of conditional independence, will need to be sacrificed. There are, however, settings where the relative severity of type I and type II errors may be balanced such that this is an attractive option.
It would also be interesting to investigate the hardness of conditional independence testing in the setting where all of X, Y and Z are infinite-dimensional. For our hardness result here, at least one of X and Y must be finite-dimensional. It may be the case that requiring two infinite-dimensional variables to be conditionally independent is such a strong condition that the null is not prohibitively large compared with the entire space of Gaussian measures, and so genuine control of the type I error while maintaining power is in fact possible. Such a result, or indeed a proof that hardness persists, would certainly be of interest.

Theorem 4. Consider a version of the GHCM test where f̂ and ĝ are constructed on independent auxiliary data. Let 𝒬 ⊂ 𝒫 be the set of distributions for (X, Y, Z) satisfying (i)-(iv) of Theorem 2 and (14), with 𝒬 in place of 𝒬₀. Then, writing K_P := E_P(ε_P ⊗ ξ_P) = E_P(Cov_P(X, Y | Z)), we have, uniformly over 𝒬, that √n(T_n − K_P) converges in distribution to 𝒩(0, C_P) and ‖Ĉ − C_P‖_TR ⇉ 0.
Figure 1. Rejection rates in the various null settings considered in Section 5.1.1 for the nominal 5%-level pfr test (top) and GHCM test (bottom).
Figure 2. Rejection rates in the various alternative settings considered in Section 5.1.1 (see (25)) for the nominal 5%-level pfr test (top) and GHCM test (bottom).
Figure 3. Rejection rates against significance level for the pfr (red) and GHCM (green) tests under null (light) and alternative (dark) settings when n = 500.
Figure 4. Rejection rates in the various null (top) and alternative (bottom) settings considered in Section 5.1.2 for the nominal 5%-level GHCM test.
Figure 6. Network summarising the output of conditional independence tests for each pair of groups. Only edges with p-values of less than 5% are shown, with thicker lines indicating smaller p-values.
3. inf_{P∈𝒬₀} E_P(‖ε_P‖² ‖ξ_P‖²) > 0 and sup_{P∈𝒬₀} E_P(‖ε_P‖^{2+η} ‖ξ_P‖^{2+η}) < ∞ for some η > 0, and
4. for some orthonormal bases (e_{X,i}) and (e_{Y,j}) of ℋ_X and ℋ_Y respectively, writing ε_{P,i} := ⟨e_{X,i}, ε_P⟩ and ξ_{P,j} := ⟨e_{Y,j}, ξ_P⟩, the tail sums of E_P(ε²_{P,i} ξ²_{P,j}) vanish uniformly over 𝒬₀.