Parameterizing and Simulating from Causal Models

Many statistical problems in causal inference involve a probability distribution other than the one from which data are actually observed; as an additional complication, the object of interest is often a marginal quantity of this other probability distribution. This creates many practical complications for statistical inference, even where the problem is non-parametrically identified. In particular, it is difficult to perform likelihood-based inference, or even to simulate from the model in a general way. We introduce the `frugal parameterization', which places the causal effect of interest at its centre, and then builds the rest of the model around it. We do this in a way that provides a recipe for constructing a regular, non-redundant parameterization using causal quantities of interest. In the case of discrete variables we can use odds ratios to complete the parameterization, while in the continuous case copulas are the natural choice; other possibilities are also discussed. Our methods allow us to construct and simulate from models with parametrically specified causal distributions, and fit them using likelihood-based methods, including fully Bayesian approaches. Our proposal includes parameterizations for the average causal effect and effect of treatment on the treated, as well as other causal quantities of interest.


Introduction
In many multivariate statistical problems, inferential interest lies in properties of specific functionals of the joint distribution, such as marginal or conditional distributions; this means it is generally desirable to specify a model for these functionals directly, with other parts of the distribution often being regarded as nuisance parameters.In causal inference problems, the target of inference may be a probability distribution other than the one that generates the observed data, but one which corresponds to some sort of experimental intervention on that system.
Example 1.1.Consider the causal system represented by the graph in Figure 1(a), and suppose we are interested in the causal effect of X on Y .For example, in a cohort of children X might be a measure of their diet, Y their BMI, and Z an indicator of the education level of their parents.Alternatively, Z could be an unobserved genetic factor.This can be formulated as a prediction problem: "what would happen if we performed an experiment in which we set X = x by external intervention?"Let the variables be distributed according to P with some density p.Under the causal DAG assumptions of Spirtes et al. (2000) and Pearl (2009), the conditional distribution of Y and Z after an experiment to fix X = x is Note that the idealized intervention on X removes any dependence of X on the confounder Z, but preserves the marginal distribution of Z, and the conditional distribution of Y given X, Z.This distribution is Markov with respect to the graph in Figure 1(b).Interest may then lie in the marginal effect on just Y , sometimes denoted P (Y = y | do(X = x)), or as the distribution of the potential outcome Y x .Models of this quantity are known as marginal structural models (or MSMs, Robins, 2000).
For the purposes of simulation and likelihood-based inference it is often necessary to work with the joint distribution P (X = x, Y = y, Z = z) directly, and it may be difficult to specify it so that it remains compatible with a particular marginal model on (1).Indeed, providing a model for the joint distribution parametrically may lead to a situation in which (1) cannot logically be independent of the value of x, unless we impose the much stronger condition that Y ⊥ ⊥ X | Z.More generally, specifying separate models for joint and marginal quantities-and ignoring the information that is shared between them-can lead to incompatible or incoherent models, nonregular estimators, and severe misspecification problems.

Contribution of this Paper
We will show that one can break down a joint distribution into three pieces: the distribution of 'the past', p ZX (z, x) := P (Z = z, X = x); the causal quantity of interest, p * Y |X (y | x) := P * (Y = y | X = x); and a conditional odds ratio, copula or other dependence measure φ * Y Z|X between Y and Z given X.Suppose that the respective parameterizations for these quantities are called θ ZX , θ * Y |X and (with some abuse of notation) φ * Y Z|X ; we call (θ ZX , θ * Y |X , φ * Y Z|X ) a frugal parameterization.The terminology is chosen because it is a direct parameterization of the causal quantity of interest, such that there is no redundancy and any distribution with a positive joint density can be decomposed in this manner.If we use smooth and regular ‡ parameterizations of the three pieces then the resulting parameterization of the joint model is also smooth and regular.We use a star (e.g.p * or φ * ) to denote that the distribution or parameter is from the causal or interventional distribution, and omit the star if the distribution or parameter is from the observational regime.Note that the causal quantity p * Y |X may be more general than just p Y |X (y | do(x)); see Section 2. Note that, in addition to providing a parameterization, the quantities θ ZX and θ * Y |X will always be variation independent; we can also always choose φ * Y Z|X to be variation independent of the other two parameters, unless we prefer to use (e.g.) a risk difference or risk ratio for interpretability.As an example of the benefits of this property, we add a dependence for Y on covariates C via a link function: for all c.Now we can be certain that-regardless of the values of P (X = x, Z = z, C = c) and φ * Y Z|XC (y, z | x, c)there is a coherent joint distribution which possesses the required functionals.This could allow us, for example, to model the causal effect of alcohol (X) on blood pressure (Y ) conditional on a person's genes (C), but marginally over factors such as socio-economic status (Z).
We start with a very simple example, to illustrate exactly what we propose to do.‡ That is, such that the model is differentiable in quadratic mean and has positive definite Fisher Information Matrix.See Appendix A for a formal statement.
Example 1.2.Suppose that (Z, X, Y ) T follow a multivariate Gaussian distribution with zero mean, and that we wish to specify that Y | do(X = x) is normal with mean βx and variance σ 2 .To complete the frugal parameterization we must specify 'the past' (i.e.p ZX ) and a dependence measure between Y and Z conditional upon X (φ * Y Z|X ).We therefore take Z and X to be normal with mean 0 and variances τ 2 , υ 2 respectively and correlation ρ, and assume the regression parameter for Y on Z (in the regression that includes X) is α; note that we could alternatively specify the covariance or partial correlation between Z and Y .Hence we have θ ZX = (τ 2 , υ 2 , ρ), θ Y |X = (β, σ 2 ) and φ * Y Z|X = α.Using this information, one can directly compute the distribution of (Z, Y ) T after the intervention and consequently the observational joint distribution of (Z, X, Y ) T is: We may do this for any value of ρ ∈ (−1, 1), α, β, and σ 2 , τ 2 , υ 2 > 0 provided that σ 2 > α 2 τ 2 , and indeed we can obtain any trivariate Gaussian distribution from these parameters.Note that, though the last inequality implies there is variation dependence in this case, we could easily choose φ * Y Z|X to be (for example) the partial correlation between Z and Y given X, and then there would be no such constraint.
Once we are able to construct the joint distribution, simulation is trivial.We take the Cholesky decomposition of the covariance matrix and apply the lower triangular part to independent standard normals.Likelihood-based inference is also straightforward once the covariance is known.
In this example we took our three pieces, p ZX (a bivariate normal), p * Y |X (a linear regression) and φ * Y Z|X (a regression parameter), and used them to obtain p ZXY .Note that our parameterization was chosen so that every quantity of interest is specified precisely once, and the overall model is saturated (i.e.any multivariate Gaussian distribution can be deconstructed in this manner, just by varying the parameters).This contrasts with the alternative of specifying Σ directly, as this does not give a simple explicit model for the causal effect.
The above example may seem somewhat trivial, but the main contribution of this paper is that we will do this in a much more general fashion, enabling simulation from a wide range of causal models.
Example 1.3.Now suppose that Z and Y are binary with X still continuous, and we continue to work with the model in Figure 1 (a).This time we specify that ), and that the log odds ratio between Y and Z conditional on X = x is φ (we could also allow φ to vary with x).
The joint distribution in this example is considerably more difficult to write in a closed form than the one in Example 1.2.However, in this paper we will show that we may: (i) specify this model using the parameters just given; (ii) simulate samples from the distribution described; and (iii) give a map for numerically evaluating the joint density and fit such a model to data using likelihood-based methods.Furthermore, we can do all this (almost) as easily as with the multivariate Gaussian distribution.Note that because logistic regression is not collapsible, this model illustrates why we should not just provide p Y |XZ to compute the joint distribution: doing so could lead to a very different marginal model for Y | do(X) than the one we chose.
As we show, the method is particularly applicable to survival models and dynamic treatment models where we marginalize over the time-varying confounders; both of these are widely used but are difficult to simulate from (Havercroft and Didelez, 2012;Young and Tchetgen Tchetgen, 2014).In addition, it allows Bayesian and other likelihood-based methods to be applied coherently to marginal causal models (Saarela et al., 2015).
Though Examples 1.1-1.3 are presented for univariate Gaussian or discrete variables, in fact the results are entirely general and can be adapted to vectors of arbitrary cardinality and general continuous or mixed variables; implementation does become more complicated in such situations, however.As noted by Robins (2000), calculation of the likelihood becomes a 'computational nightmare' for marginal structural models with continuous variables, but we show that copulas can be used to overcome this problem.In the sequel we denote the observational joint density by p with, for example, p Y |X (y | x) meaning the conditional density of Y given X.In the discrete case, this is just the probability mass function.

Existing Work
A commonly used alternative to likelihood-based approaches are generalized estimating equations (GEEs) or semiparametric methods, as these do not require full specification of the joint distribution (Diggle et al., 2002).However, neither method allows for simulation from the model, and they may be less powerful than likelihood-based methods.Robins (1992) provides an algorithm for simulating from a Structural Nested Accelerated Failure Time Model (SNAFTM), a survival model in which one models survival time as an exponential variable whose parameter varies with treatment.This is adapted by Young et al. (2008) to simulate from a Cox MSM model.Young et al. (2009) consider a special case of a Cox MSM that approximates a SNAFTM and also a SNCFTM (special cases of the structural nested model -see Section 7).Keogh et al. (2021) give a method for simulating from Cox MSMs using an additive hazard model.Havercroft and Didelez (2012) consider the problem of specifying (and thus characterizing) models such that, for simulation and educational purposes, bias due to selection effects and blocking mediation effects will be strong if a naïve approach is used.Richardson et al. (2017) give a variation independent parameterization for structural equation models by using the odds product; this also allows for fully-likelihood based methods.This is extended by Wang et al. (2022) to the Structural Nested Mean Model, which we will meet in Section 7. The main difference between this work and ours is that it is not obvious how to extend their approach to other models and to continuous variables.Indeed, much of the trend in causal inference is towards structural equations models (SEMs) in which each variable is modelled as a function of all previous variables and a stochastic noise term (Peters et al., 2017).In Example 1.3 this would have meant specifying p Y |ZX , which would not have allowed us to directly model p Y |X (y | do(x)).In particular, the work of Pearl generally assumes that causal distributions should be conditional on all previous variables, while allowing for some conditional independence constraints (see, for example, Peters et al., 2017, and large sections of Pearl, 2009).We certainly do not wish to single these authors out for criticism (indeed the authors of this paper have often considered such approaches), but they do seem to be less useful in epidemiological or other medical contexts, in which conditional independences are oftenthough not always-implausible assumptions.In such a context, one has to specify distributions conditional on the entire past, which may be very difficult if there are a large number of relevant variables.
We view our approach as complementary to the structural equation perspective, since each has advantages in terms of what assumptions can be expressed and the causal questions that can be easily answered within the framework.SEMs and the theory around them have received much attention; this work starts to fill in the gaps relating to marginal models.

Causal Models
Throughout the paper we will have a running example based on Figure 2; each of these examples is labelled with a prefix 'R'.
Example R1.The model in Figure 2 arises in dynamic treatment models and is studied in Havercroft and Didelez (2012).Robins (1986) as Havercroft and Didelez note that after specifying a model for p Y |AB (y | do(a, b)), it is difficult to parameterize and simulate from the full joint distribution, partly because of the complexity of the relationship (2).They are only able to simulate from the special case of Figure 2 in which L has no direct effect on Y , so any dependence is entirely due to the latent variable.We remark that we could replace instances of in (2) with (u, ) and obtain the same result, which means that the role of Z could be taken by either L alone or the pair (U, L).
For related reasons, the model in Figure 2 is also the subject of the so-called g-null paradox (Robins and Wasserman, 1997)  Note that the g-null paradox is not the same as the presence of singularities § or non-collapsibility, but rather it is a result of non-collapsibility over a marginal model that possibly leads to singularities.
Example R2.Considering Figure 2 again, suppose that we choose Y to depend linearly on A, L and B (including any interactions we wish), and that L is binary and we use a logistic parameterization for its dependence upon A. Then, if A takes four or more distinct values, it is essentially impossible for H 0 : Y ⊥ ⊥ A | do(B) to hold in such a distribution, even if Y doesn't depend directly upon A, L or B. This is because so the only way for this quantity to be independent of a variable A with at least four levels is for β 1 = 0 and either θ 1 = 0 or β 2 = 0.This 'union' model is singular (i.e.not regular) at θ 1 = β 1 = β 2 = 0, and being in it implies a much stronger null hypothesis (that either Y ⊥ ⊥ A, L | B or L ⊥ ⊥ A in addition to the causal independence) than the one we are interested in investigating.
Generally speaking, if we try to state a model for p Y |ALB as well as requiring that p Y |AB (y | do(a, b)) does not depend on A, we effectively try to specify the A-Y and B-Y relationships in two different margins; in the case above these margins are incompatible, leading to the singularity.This is avoided by constructing a smooth, regular and variation independent parameterization, without any redundancy.We show that, in fact, a frugal parameterization of the joint distribution exists that separates into variation independent parameterizations of the quantities Of course, some of these quantities will be unidentifiable, ¶ but we will want to be able to simulate how well the effects of A and B on Y are estimated in the presence of unobserved confounding of various strengths.
Remark 1.4.Statistical causality is represented using a number of different overlapping frameworks, including potential outcomes (Rubin, 1974), causal directed graphs (e.g.Spirtes et al., 2000), decision theory (Dawid and Didelez, 2010), non-parametric structural equation models (e.g.Pearl, 2009), Finest Fully Randomized Causally Interpretable Structured Tree Graphs (Robins, 1986) and their implementation as Single World Intervention Graphs (Richardson and Robins, 2013).The discussions in this paper are broadly applicable to any of these frameworks.Though slightly more verbose, the do(•) notation has the advantage that the quantity is more immediately seen to be a conditional distribution indexed by both a and b, which is critical to our method.We will exploit the fact that a do(X = x)-intervention can be obtained by conditioning on X = x after randomizing X, i.e. randomly generating it from an arbitrary (but not trivial) distribution p * X (x).Note also that it is ambiguous from notation alone whether p Y |X (y | do(x)) is identifiable or not, since it depends upon both the causal model being postulated and the available data; this problem also arises with the other frameworks.
Remark 1.5.In applications, when causal model are to be fitted on actual data, conditions for identifiability must be met.These are well-known for all models we consider: they essentially consist of the appropriate (possibly sequential) versions of causal consistency, positivity and conditional exchangeability (or no unmeasured confounding) given the measured covariates (Hernán and Robins, 2020).As we are here interested in properties of causal models and how to simulate from them, we will take identification as given.
The remainder of the paper is structured as follows: in Section 2 we provide our main assumptions and discuss issues such as how we might choose a dependence measure.Section 3 contains the main result outlined in the introduction.In Section 4 we describe how to simulate from our models and give a series of examples, and in Section 5 we show how to fit these models using maximum likelihood estimation.Section 5.2 contains an analysis of real data on the relationship between fibre intake, a polygenic risk score for obesity and children's BMI.Section 6 discusses an application of the frugal parameterization to survival models, and Section 7 contains an extension to models in which the causal parameter is different for distinct levels of the treatment.We note that Sections 6 and 7 are more technical, and not necessary for the reader to gain insight into the main ideas of the paper.We conclude with a discussion in Section 8.

The Frugal Parameterization
Here we present a formalization of the ideas in the introduction.Suppose we have three random vectors (Z, X, Y ) ∈ Z × X × Y, where Y is an outcome (or set of outcomes) of interest, and X, Z consist of relevant variables that are considered to be causally prior to Y ; this may be because they are temporally prior to Y , but that is not strictly necessary.There is no restriction on the state-spaces of these variables provided that they admit a joint density p := p ZXY with respect to a product measure µ Z • µ X • µ Y , and satisfy standard statistical regularity conditions.In particular, each of X, Y and Z may be finite-dimensional vector valued, and either continuous, discrete or a mixture of the two.The fact that each of these variables may be vector valued, and that there is no fixed ordering on variables in X and Z means the method is considerably more flexible than it might at first appear.
Throughout this paper we use the notation p X to denote the marginal density of the random variable X, and θ X to denote the parameter in a model for this distribution; similarly p Y |X and θ Y |X relate to the distribution of Y conditional upon X.We will need to consider marginal and conditional distributions that are not obtained by the usual operations; for example, a marginal distribution taken by averaging over a population with a different distribution of covariates.We will typically denote such non-standard distributions by indexing with a star: e.g.p * Y |X or θ * Y |X .We use φ Y Z to denote parameters that describe the dependence structure of a joint distribution; specifically, such that when combined with the relevant marginal distributions they allow us to recover an entire joint distribution.Examples include odds ratios or the parameters of a particular copula.We also consider quantities that provide such a dependence structure conditional on a third variable, and denote this as φ Y Z|X .Again, if the dependence is in p * ZXY (defined in the next subsection) then we will write this quantity as φ * Y Z|X .We will assume that we have three separate, smooth and regular parametric models for p ZX , p Y |X and φ Y Z|X , with corresponding parameters θ ZX , θ Y |X and φ Y Z|X .In this sense our model can be equated with θ := (θ ZX , θ Y |X , φ Y Z|X ), and for this reason we will often refer to θ as 'the model'.For convenience, we will refer to p ZXY as the observational distribution, and p * ZXY as the causal distribution; we do this even though in other possible contexts p * ZXY might not correspond to a standard causal intervention on p ZXY .

Cognate Distributions and the Frugal Parameterization
A parameterization is said to be frugal if it consists of at least three parts: the distribution of 'the past'; a (possibly) reweighted quantity relating to the distribution of the outcome; and then a conditional association measure that, combined with the first two pieces, smoothly parameterizes the joint distribution.
To be explicit, we require that the frugal parameterization includes a parameter θ * Y |X that models a conditional distribution of the form for some kernel (i.e.conditional density) w(z | x).We call a conditional distribution that can be written in the form (3) a cognate distribution (to p Y |X ).Note that cognate distributions include the ordinary conditional as a special case, since setting w = p Z|X we obtain Common causal quantities obtained by re-weighting also satisfy the definition; for example, given the causal model implied by Figure 1(a) we have In other words, this formulation allows for adjustment by a subset of the previous variables.Terms to derive the effect of treatment on the treated (ETT) also satisfy the definition by using the kernel and these can be written as The effect of treatment on the control individuals (ETC) is analogously defined using p Z|X (z | 0).
as being a conditional distribution taken from the larger distribution p * ZXY , where Note that p ZXY and p * ZXY share a conditional distribution for Y given X, Z-only the marginal distribution of Z and X has been altered.As noted in Remark 1.4 the marginal distribution p * X is essentially arbitrary, though later we may need it to satisfy some of Assumptions A2-A5 in order to apply our main results.
Definition 2.1.A smooth, regular parameterization of random variables (Z, X, Y ) is said to be frugal with respect to some kernel p * Y |X of the form (3), if it consists of separate parameterizations of: (i) the marginal distribution of Z, X; (ii) the kernel p * Y |X ; and (iii) a conditional association measure φ * Y Z|X for Y and Z given X.
Recall that the formal definitions of 'smooth' and 'regular' parameterizations are given in Appendix A.

Variation Independence
Take a set Θ and two functions defined on it φ, ψ.We say that φ and ψ are variation independent if (φ × ψ)(Θ) = φ(Θ) × ψ(Θ); i.e. the range of the pair of functions together is equal to the Cartesian product of the range of them individually.A variation independent parameterization helps to ensure that the parameters characterize separate, non-overlapping aspects of the the joint distribution.Note that we may sometimes refer to sets of distributions being variation independent, and in this case we are really referring to their respective parameterizations.
Note that the parameterizations of p ZX and p * Y |X are guaranteed to be variation independent, since there is always a parameter cut between marginal and conditional pieces of this form (we discuss this in Section 5).The following assumption will not actually be required for any of our results, but we note that, if satisfied, it makes interpretation and prediction somewhat easier.

A1. Given a frugal parameterization
We will see that this assumption is satisfied by both conditional odds ratios and copulas.

Choices of the Association Parameter
Now that we have formally defined the parameterization, let us return to the original problem.We want to be able to (i) construct, (ii) simulate from, and (iii) fit a model using the frugal parameterization.In order to do this we have to make some choices.We take the form of w and a model for p * Y |X as given, because they are chosen by the analyst using subject matter considerations; this leaves us to select a parametric family p ZX for (Z, X), and a conditional association parameter within the causal model, φ * Y Z|X .This raises the question of how one should choose the association parameter.In general there are many possibilities: a risk difference or ratio, an odds ratio, or something else.However, some of these objects have nicer properties than others.In the case of binary Y and Z the natural choice for such an object is the conditional odds ratio which is known to be variation independent of the margins p * Y |X and p * Z|X , and also has the property that if p * ZXY is multiplied by any function of (x, z) or (x, y) it does not change.More specifically, note that p * ZXY = p ZXY • p * ZX /p ZX , and hence In other words, the conditional odds ratio for the causal and observational distributions are the same, and this does not hold for other conditional association parameters (Edwards, 1963).
This definition and the invariance result (4) extends to distributions over any statespace under mild conditions (Osius, 2009), and-in theory-the joint distribution can be recovered from the odds ratio and marginal distributions using the iterative proportional fitting (IPF) algorithm (Bishop, 1967;Darroch and Ratcliff, 1972;Csiszár, 1975;Rüschendorf, 1995).Other fitting approaches are discussed by Tchetgen Tchetgen et al. (2010).Note that, for general continuous distributions, it is not possible to implement the algorithm in practice in most cases, because the intermediate distributions will not have a closed form; an obvious exception to this is the multivariate Gaussian distribution.
Alternative possibilities include the risk difference and risk ratio, though these lack the variation independence in A1 possessed by the odds ratio, unless combined with the odds product as in Richardson et al. (2017).We will use these difference and ratio contrasts in Section 7, to parameterize the 'blip' functions in a structural nested mean model.Proposition 2.2.If X, Y and Z are finite categorical variables and have strictly positive conditional distribution p Y Z|X > 0, then using smooth parameterizations of the marginal distributions p ZX and p Y |X , together with the conditional odds ratio φ Y Z|X is a frugal parameterization that satisfies assumption A1.Indeed, X can also be a continuous or mixed variable (c.f.Example 1.3).
Proof.This follows from the results of Bergsma and Rudas (2002).
Example 2.3.For multivariate Gaussian random variables, or other distributions that are defined by their first two moments, the partial correlation ρ Y Z|X ≡ Cor(Y, Z | X) satisfies the conditions for being a conditional association parameter φ Y Z|X , in the sense that when combined with the marginal distributions for each of Y and Z given X, one can recover the joint conditional distribution p Y Z|X .
Example 2.4.An alternative to the odds ratio for general continuous variables is to use a copula, which separates out the dependence structure from the margins by rescaling the variables via their univariate cumulative distribution functions.A multivariate copula is a cumulative distribution function with uniform marginals; i.e. a function C : [0, 1] d → [0, 1] which is increasing and rightcontinuous in each argument, and such that C(1, . . ., 1, u i , 1, . . ., 1) = u i for all u i ∈ [0, 1] and i ∈ {1, . . ., d}.
Recall that, for a continuous real-valued random variable Y with CDF F Y , the random variable U ≡ F Y (Y ) is uniform on (0, 1).The bivariate copula model for Y and Z ∈ R is then There is a one-to-one correspondence between copulas and multivariate continuous CDFs with uniform marginals.By Sklar's Theorem (Sklar, 1959, see also Sklar, 1973), any copula can be combined with any collection of continuous margins to give a joint distribution, via (in our bivariate example) We will assume that the copula is parametric, and then φ Y Z|X represents the parameters of the particular family of copulas.
Proposition 2.5.If Y and Z are continuous with a positive conditional distribution for each x ∈ X , then any smooth and regular parameterization of their marginals p ZX and p Y |X together with a smooth and regular conditional copula C Y Z|X is a frugal parameterization that satisfies assumption A1.
Proof.This follows from the results of Sklar (1973).
Note that the copula is only used to model the interaction, thus allowing us to retain the simple interpretation of the marginal model p * Y |X in terms of an interventional distribution.In contrast to the odds ratio note that conditional copulas do not satisfy (4), because the copula also depends upon the cumulative distribution function of the corresponding margins; this is a slight disadvantage in comparison to the odds ratio.We will return to these examples in Section 4.
Example 2.6.We can also use copulas to model variables in a more flexible way by including categorical variables.Suppose that we have a mixture of continuous and binary variables among the elements of Z and Y .Then we might choose to model them using an approach analogous to that of Fan et al. (2017), who propose a Gaussian copula model that is dichotomized for the binary components.Their estimation methods show that the resulting joint distribution is a smooth function of the parameters.This model, combined with smooth marginal models will also be frugal and satisfy A1.We use this approach in our data analysis example in Section 5.2.
More general versions of the frugal parameterization are given in Sections 6 and 7, though again we note that the rest of the paper can be read without reference to those sections.

Main Result
We now give the main result outlined in the introduction: given a weight function w, a pa- The requirement that the image of η ZX is contained within the set of possible distributions p ZX is a very mild condition.In addition, if we use a copula or odds ratio as the conditional association measure the implication always holds, regardless of this assumption.
Example R3.Picking up Example R1 again and, for now, consider only the observed variables (though see Example R7 in Appendix C for details on how to simulate from all the variables).Take Z = L and X = (A, B), then Theorem 3.1 says that we can parameterize the model using parametric models of the three pieces For convenience, we choose to factorize p ALB according to the ordering A, L, B. Set A ∼ Bernoulli(θ a ), L is conditionally exponentially distributed with mean E[L | A = a] = exp(−(α 0 + α a a)), and Let us suppose that Y is normally distributed under the intervention on A, B, with mean and variance σ 2 .Let φ * LY |AB be a conditionally bivariate Gaussian copula, with correlation parameter given by some function ρ ab of a and b.This parameterization is frugal and satisfies A1.
In addition, note that this approach entirely circumvents the g-null paradox discussed in Example R2, because the marginal dependence of Y on A (after intervention on A and B) is uniquely and explicitly encoded by the parameters β a , β ab .

Sampling from a marginal causal model
In this section we will consider how to sample from p ZXY using a frugal parameterization θ * , sometimes analytically, but more commonly via the method of rejection sampling.Note that, now we have constructed a valid parameterization, we will no longer need to refer to the model on p ZXY defined by θ.From this point on, we only discuss the model on p ZXY parameterized by θ * , and the corresponding model on p * ZXY that replaces θ ZX with η ZX (θ ZX ).We first review how one should go about choosing such a parameterization.Y |X and φ * Y Z|X will make up the frugal parameterization.To make point 3 more concrete, in Example 1.3 we can set θ ZX to be the combination (q, γ, σ 2 ), and then take p * X ∼ N (0, 2σ 2 ); this ensures it will have heavier tails than p X|Z ∼ N (γz, σ 2 ) which, as we will see in Section 4.2, is crucial for sampling.
For point 4, the question of suitability of the dependence measure, we would wish to consider: (i) whether the relevant variables can be modelled with the particular dependence measure selected (e.g.odds ratios are suitable for discrete variables, but not so useful in practice for continuous ones); (ii) the computational cost of constructing the joint distribution; (iii) whether we want the dependence measure to be variation independent of its baseline measure; if so that would rule out risk ratios and differences.For a larger model with a vector valued X, we might wish to fit different dependence measures for each treatment variable; see Section 7 for an example of this with a Structural Nested Mean Model.

Direct Sampling
For fully discrete or multivariate Gaussian models, it is possible to compute p * ZXY and then 'reweight' by p ZX /p * ZX to obtain the distribution p ZXY in closed form.As noted in Proposition 2.2, in the discrete case this is straightforward using (conditional) log odds ratios to obtain a frugal parameterization of the distributions.For example, if Y and Z are both binary, taking values in {0, 1}, we can use For further details, including what happens if there are more than two levels to Y or Z, see Bergsma and Rudas (2002).As noted in (4), a nice property of the odds ratios as the association parameter is that their values in the observational and causal distributions are always the same. Example Note that the 'observational' conditional parameters are quite different from their causal counterparts.

Sampling By Rejection
In most realistic situations the data cannot be modelled as entirely discrete or multivariate Gaussian.In such cases we suggest simulating from a distribution constructed analogously to the causal model, and then using rejection sampling to modify the marginal distribution of X and Z and obtain data from the corresponding observational distribution.The idea of rejection sampling is very simple.Suppose we have two distributions: a target p that is difficult to sample from, and a proposal q that is both easy to sample and dominates p, in the sense that there is some M such that p/q ≤ M in a p-almost sure sense; then we can obtain independent samples from q and reject only those samples X for which p(X)/q(X) > M • U , where U is an independent uniform random variable on (0, 1).The samples that are not rejected are then distributed independently from p (see, for example, Robert and Casella, 2004, Chapter 2).
We might hope that, since p * ZXY is relatively easy to sample from, then we would find that p * ZX = w • p * X dominates p ZX ; unfortunately this is generally not the case and is extremely implausible unless Z is discrete.However, a weaker assumption is sufficient.
A4.The set Z can be partitioned into a countable number of bins B = {B i } such that, for each i, there p ZX -almost surely exists The significance of this assumption is that given n i.i.d.realizations from p Z we can then partition them into B, and target obtaining the same number of observations via a local rejection sampling scheme in each bin.Note that this original sample of Zs is never used after determining the number of observations within each bin.
Of course, to use this assumption we must be able to sample from p * ZXY , and the feasibility of this depends upon the particular model; however it is generally a much easier condition to satisfy than being able to sample from p ZXY directly given that p * Y |X is already specified.With a copula, it is essentially trivial: we can just sample directly from the copula, and then use inversion to ensure the margins are correct (Clifford, 1994).
We note that if the weighting is sometimes particularly heavy or the model is high-dimensional, then some of bounding constants M i will be large and/or some of the bins for Z have very low probability of being proposed, so the rejection method becomes very inefficient.However, since we can evaluate the joint distribution exactly if we use a copula, other more advanced simulation methods can be used instead of rejection sampling.A disadvantage is that the samples would generally only be approximately distributed correctly, but the level of error could easily be chosen to be statistically undetectable.We leave this to future work.

Copulas
As previously discussed, copulas may provide an approach to a frugal parameterization of models with continuous Y and Z.In this section we describe how copulas may be used to simulate from and fit causal models with particular marginal specifications.
In the simplest case, we can start by simulating values for X using some p * X , and then use the copula to simulate from the causal distribution on the scale of quantiles.We then apply the inverse CDF of p * Y |X (y | X = x i ) and p Z (z) to the uniform margins to obtain the actual observations.The parameters for the copula itself (i.e.φ * Y Z|X ) may or may not depend upon X.To obtain samples from the observational distribution, we can use rejection sampling, provided that A4 is satisfied.
Example R5.Continuing our running example from Havercroft and Didelez (2012), suppose we now wish to simulate some data from the model specified in Example R3 by rejection sampling.We first select some values for the parameters: Code to replicate this analysis can be found in the vignette Comparison of the R package causl (Evans, 2021).
Copulas lack many of the attractive properties of odds ratios, such as the invariance in (4), and their interpretation is different because it is in terms of the quantiles of the margins rather than their actual value.However, they can be extremely flexible if one has a multivariate outcome, because one can make use of vine copulas to model them.See Appendix C for more details.

Fitting Methods
We start this subsection with a result telling us how to fit marginal structural models using maximum likelihood (ML) estimation.In fact, it turns out that if we have a marginal structural model and our full model parameterized by θ * is correctly specified for the observational data from p ZXY , then the MLE for p * Y |X is obtained by maximizing the likelihood for the causal model (i.e. with X and Z assumed to be independent) with respect to the observational data from p ZXY (so X and Z are in fact not independent).This is the content of Theorem 5.1 below.
Note also that, although this result will not generally hold if part of the model is misspecified, if the propensity score model p X|Z is incorrect then this will not affect inference about the remainder of the model when w(z) = p Z (z).This is because there is a parameter cut between p Z • p Y |ZX and p X|Z (see, e.g.Barndorff Nielsen, 1978), and the parameters θ * Y |X and φ * Y Z|X are (for MSMs) functions of this first quantity.
For results connected with fitting, we will assume that all our parameters are identifiable from the available data (cf.Remark 1.5).In particular, we will also make use of A3 again, since we cannot hope to recover a distribution that does not satisfy a positivity assumption.Since the result concerns maximum likelihood estimation, we will make the very slightly stronger assumption that the Kullback-Leibler divergence between p and p * is finite.(Note that this is a strictly weaker assumption than A4.) We refer to the parameters of the causal parameterization of the observational distribution as Proof.van der Vaart (1998, Lemma 5.35) shows that if the target distribution is identifiable, then maximum likelihood estimation converges to the KL-minimizing distribution.Consider the density for the causal model: where we suppress dependence upon parameters.For a comparison with the density of the data, note that , ) and θZX (or ηZX ).Then the asymptotic variance is just a standard result for MLEs (see, e.g.Ferguson, 1996, Chapter 18).
Note that we must apply the Fisher information under p ZXY in order to obtain the correct variance, since this is the distribution of the data being used to approximate the expectation.While the proof above is stated only for the single time-point exposure model, it extends to a longitudinal case with multiple treatments, similar to the obvious extension of the model in our running example.
When computing standard errors in practice we use the observed information (i.e. an empirical approximation to the Fisher Information), rather than its theoretical mean.In principle we could also use a 'sandwich estimate' to obtain more robust standard errors; because we know that our models are correct we do not do this, but for other users of this method on real data we would always recommend using sandwich errors.In our case these would be the square-roots of the diagonal entries of Note that although this result shows that we can fit models via maximum likelihood estimation, if the model is misspecified there is no guarantee that the estimator will be consistent or even close to the true value.Other less sensitive estimators, such as doubly robust approaches (see Remark 5.4 below), may therefore be more useful in practice than the MLE.
Remark 5.2.Note that the same result (i.e.convergence of the estimator to the KL closest distribution to p Z ) will hold for the ETT estimator with kernel w(z) = p Z|X (z | 1), since this is also independent of the value of X.In order to estimate the parameters for this kernel, we would have to consider the subset of data for which X takes the particular value 1.We could then obtain an MLE for the whole model by combining the complete data estimator with the separate estimate for w obtained from the treated patients.
Remark 5.3.Given maximum likelihood estimates for the parameterization of p ZXY , we can of course use the invariance properties of MLEs together with the delta method for the standard errors, to obtain an estimate for any (differentiable) function of the parameters that we choose.
Remark 5.4.Taking a doubly robust approach to estimating the causal parameters, we see that can fairly easily be computed numerically; indeed, if Y and the copula are both Gaussian, we obtain it in closed form.We then fit a model, say π(x | z), for the propensity score p X|Z (x | z).
A doubly robust estimator uses Q and π to construct an estimating equation, and will give a consistent estimate for the causal parameter if either model is correctly specified.If they are both correct, then this estimator is also semiparametric efficient (Scharfstein et al., 1999).Using a doubly robust approach to compare with the MLE will help to protect us against possible misspecification of φ * Y Z|X ; this is useful given that choosing the association parameter is not particularly intuitive.

Simulation
We now run a simulation to compare four methods: outcome regression, inverse probability weighting, our maximum likelihood estimation, and standard doubly robust estimation (i.e.just using an ordinary regression model, not as described in Remark 5.4).
We use the setup described in Examples R3 and R5 (Sections 3 and 4.3 respectively) to generate our data, so again Y (after intervening to set {A = a, B = b}) is normally distributed with mean −0.5 + 0.2a + 0.3b and variance 1.We then performed N = 1 000 runs of the analysis above with sample size n = 250.The results are shown in Table 1, with boxplots of the biases in Figure 3.The table contains the average bias, the empirical coverage of a 90% interval, and the standard error calibration, which we define as: If this value is less than one it suggests that the standard errors are conservative, if larger than one it suggest they are too small.
Outcome regression performs poorly, although this is to be expected as the model is misspecified.We see that the other three methods all have very comparable performance and efficiencies, and are mostly well calibrated: the MLE and DR methods give slight under coverage for the first two parameters, though the double robust method gives conservative standard errors for the interaction parameter.An example on a larger simulated dataset is given in Appendix D.

Data Analysis
To illustrate our method, we apply the maximum likelihood fitting procedure to data from the IDEFICS study (Ahrens et al., 2011).The subset of data we use consists of measurements of 531 German children aged between 2 and 9, including their sex, physical activity, screen time, parental education, a 'vegetable score', fibre intake, and a polygenic risk score (PRS) for BMI.
The study also records the child's BMI and their parents' BMIs.Preliminary analyses suggest that increased fibre intake can reduce BMI, especially for those children who have a strong genetic predisposition for obesity (Hüls et al., 2021).Our aim is to study the effect modification of PRS  on the relationship between fibre intake and actual BMI, whilst adjusting for confounding due to other covariates.
We replicate the setting in Nöhren ( 2021), which considers how the causal effect of a dichotomized indicator of fibre intake (X) on age and sex standardized BMI (Y , a z-score) interacts with the dichotomized polygenic risk score (C); like Nöhren we also use a marginal structural model: We assume that all other variables are causally prior to X, so that E[Y | C = c; do(X = x)] is our causal distribution of interest, where Z consists of other confounders; these include sex, age, physical activity, screen time, vegetable score and a dichotomized version of parental education level.
We choose an ordinary Gaussian linear model for the MSM, and also the other models used for variables in Z.The copula was also Gaussian.Note that in order to accommodate sex and parental education as binary variables it was necessary to integrate over the copula, effectively making it a probit model (see also Example 2.6).
The relevant coefficients from the model fit are shown in Table 2.Under our modelling assumptions, these results do not suggest that increased fibre intake reduces BMI and thus we cannot confirm previous results obtained on a larger dataset of 2,688 children from seven countries (Nöhren, 2021); the estimates from that study are within our (rather wide) confidence intervals, though.Furthermore, the analysis of Nöhren for the marginal structural model using inverse probability weighting on only the German data yields slightly different parameter estimates, and larger standard errors (see Appendix E for details).This illustrates-in a practical analysis-the differences between, on the one hand, modelling the C-Y -Z association directly or, on the other hand, modelling the propensity score for the inverse probability weights.

Survival Models
Another application of the frugal parameterization is to causal longitudinal models, and in particular to survival models.Note that with sequences of treatment variables, the sequential versions of identifying assumptions must be met (cf.Remark 1.5) known as sequential conditional exchangeability; we continue to take these as given in Sections 6 and 7.
The following corollary of Theorem 3.1 allows us to 'build up' a frugal parameterization of the joint distribution using several different cognate quantities.Given a collection of variables Y 1 , . . ., Y d under some natural ordering (typically a temporal ordering), let [i − 1] = {1, . . ., i − 1} denote the predecessors of each i = 1, . . ., d. Corollary 6.1.Let Y 1 , . . ., Y d have joint density p, and let Then there is a smooth and regular parameterization of the joint distribution, which can be chosen to be variation independent, containing each p * Yi|Xi (y i | x i ).
Proof.We proceed by induction.For i = 1 we just have a smooth and regular parameterization of p Y1 (y 1 ).For a general i, assume we have a smooth parameterization of the joint density for Y 1 , . . ., Y i−1 and of p * Yi|Xi (y i | x i ).Then using some appropriate φ * YiZi|Xi to make up a frugal parameterization (and A1 if required), by Theorem 3.1 we obtain a smooth and regular parameterization of the joint density for Y 1 , . . ., Y i , and-if A1 holds-the quantities used are all variation independent of one another.
We refer to this approach as a recursive or nested frugal parameterization, because in each case 'the past' (i.e.p ZX ) is itself parameterized in a frugal manner.
Example 6.2.Young and Tchetgen Tchetgen (2014) consider survival models with time-varying covariates and treatments.Let Y t = 0 be an indicator of survival up to time t (with Y t = 1 indicating failure).Let L t , A t be respectively covariates and treatment at time t = 0, . . ., T .Young and Tchetgen Tchetgen model the quantities i.e. probability of survival to the next time point given treatment history and survival so far.Under their assumptions these quantities are identifiable via the g-formula as note that we omit some subscripts on densities for brevity.Corollary 6.1 tells us that, setting X t = A t and Z t = L t , a parameterization exists of the joint distribution that uses these quantities for each t = 1, . . ., T .Given the distribution of p(a t−1 , t−1 , y t−1 ), the quantities may be used to recover p(a t−1 , t−1 , y t ).
Young and Tchetgen Tchetgen (2014) note that simulation from this model is difficult for certain parametric choices, because some parameters from the joint model and the marginal model are tied together in complicated ways.They derive results that allow them to compute particular causal parameters as functions of the joint distribution, and hence to evaluate the performance of simulation methods exactly.Our approach overcomes this problem by allowing causal quantities of interest to be specified explicitly, and then have the rest of the distribution constructed around them.
The model is parameterized so that failure is a rare outcome, which allows approximation of the expit function by an exponential function.The parameters of interest are then those of the Cox Marginal Structural Model: which, as we see above, the authors assume to depend only upon the previous two treatments.These parameters ψ are estimated by fitting an inverse weighted GLM to the data.
The authors also state that: '[we] therefore, may be limited to simulation scenarios with the proposed algorithm to particularly unrealistic settings if we wish simultaneously to generate data under the null.'Our results demonstrate that if one uses our algorithms this is not the case.The null in this example corresponds to ψ 0 = ψ 1 = ψ 01 = 0; since the model is discrete we are free to choose arbitrary regression models for the treatment on the observed past, for the covariates on their past values and treatments (and even unobserved quantities), and any arbitrary dependence structure between survival and the covariates, conditional on all previous treatments and covariates.This will allow us to simulate from any distribution under which treatment has no (marginal) causal effect upon survival.In Appendix F we perform some simulations on this model.

Structural Nested Model Parameterizations
Not all causal parameterizations involve modelling the entire conditional distribution for every level of the conditioning variable; i.e. quantities of the form p * Y |X (y | x) for every value of x ∈ X .The structural nested models of Robins and Tsiatis (1991) are an example of this.These allow for interactions between time-varying covariates and time-varying treatments, but they are always marginal over future covariates; this makes them considerably more flexible than marginal structural models, because they allow for dependence in treatment decisions on all observed data.We again continue to make the necessary assumptions for identifiability; see Robins and Tsiatis (1991) for more detail.
Example 7.1 (Structural Nested Models).Suppose we have a sequence of binary treatments A 1 , . . ., A T and time-varying covariates L 1 , . . ., L T , together with an outcome Y .Let L t ≡ (L 1 , . . ., L t ) and L t ≡ (L t , . . ., L T ), and similarly for A t , A t .The structural nested model (Robins and Tsiatis, 1991) involves contrasts between a t = 0, 1 of the form: The parameterization divides the effect of the treatments into pieces corresponding to 'blips' of effect at each time point: that is, at each time t, we consider the effect of receiving treatment at that time but no further treatment, versus never receiving any treatment from time t onwards.The contrast may be in the form of a risk difference, risk ratio or other suitable quantity.
We represent such a generic contrast by introducing a tilde above the variable being contrasted; in the above example we would write: See the more formal Definition 7.2 below.
We define two additional kinds of parameter to generalize these ideas.
Definition 7.2.Let q Y |XZ (y | x, z) be a conditional distribution.We denote by q Y |XZ (y | x 0 , z) a baseline parameter, which can smoothly recover the relevant conditional distribution at a particular baseline value X = x 0 .
We will denote by q Y |XZ (y | x, z) a contrast parameter (over X).We define the pair of baseline and contrast parameters to be a full parameterization if, when we combine them, we can smoothly recover all of q Y |XZ (y | x, z).
In the appendix we give Lemma B.1, showing we can use risk differences, risk ratios or odds ratios as contrast parameters, if p > 0 and each X t is binary.Examples of a set of baseline parameters might be (β 0 , β z , σ 2 ) for some regression model y = β 0 + β x x + β z z + ε, where Var ε = σ 2 ; the natural contrast parameter would then be β x .Alternatively it might be the density p Y |XZ (y | x 0 , z), y ∈ Y, z ∈ Z, for some value x 0 ∈ X ; the contrast parameter could then be a risk ratio:

Iterated Frugal Parameterization
How can we use the frugal parameterization to obtain the structural nested model?We now introduce the iterated frugal parameterization to allow us to do just that.
Consider a sequence of random variables L 1 , A 1 , L 2 , . . ., L T , A T and an outcome of interest Y .Assume also that there is a natural 'baseline' treatment level A i = a 0 i .Then the iterated frugal parameterization consists of a parameterization of 'the past' (i.e.)).Note that if we consider the contrast parameter to be all the possible values other than the baseline value, then each state of a T will appear on the right-hand side of a quantity p * Y |LtA T exactly once.

The Structural Nested Model
How can we use a parameterization that incorporates all the quantities (8)?Based on the temporal ordering, and given p Y |LtA T (y | t , a t−1 , do(a t , a t+1 = 0)) and is cognate for the particular baseline a 0 t+1 .In particular, our parameterization can include 'blips' such as those in (8).If either the contrast parameter is the odds ratio, or the risk ratio and the outcome is positive and unbounded, then these pieces are also variation independent.
The proof for the special case of binary treatment variables is given in Appendix B. With this general formulation we do not require p * Y |LtA T to be of the same form for each t = 1, . . ., T ; this flexibility may be useful for many settings.However, we do need the baseline level a 0 t+1 to be consistent over all t, since otherwise the inductive argument we use will not work.Note also that φ * This is similar to the form of a structural nested mean model, but in this case we attempt to model all future treatment regimes simultaneously, not just at a baseline a t = 0.This effectively requires us to model the association between Y and each A t multiple times in different margins, and hence we will be using parameters that are redundant; it therefore does not fall within our frugal framework.This was pointed out by Robins et al. (2007), who showed that it is a non-congenial parameterization, and may lead to incompatible distributions.

Discussion and Conclusion
As we have demonstrated, the principle of a frugal parameterization is widely applicable and useful in many marginal modelling contexts, especially causal models.We begin this discussion by briefly considering three more key settings for causal models: sensitivity analysis, instrumental variable (IV) analysis and mediation analysis.
In sensitivity analysis, a key challenge is to construct an augmented model that is compatible with the original model in the sense that it shares a marginal distribution over the observed variables, but can be tweaked to introduce various levels of unobserved confounding.This is clearly possible within our framework; considering Example R7 in Appendix C, we can set the correlations involving U to zero, and then increase them to test the dependence of our conclusions to the presence of an unobserved confounder.
In an IV analysis, the instrument is used as an imperfect replacement for randomization when the actual treatment X is affected by unobserved confounding.To formulate a generative IV model we typically want to combine a desired parameterization for p * Y |X (y | do(x)) with a model that includes the IV and the confounder U .The difficulty, here, is due to the particular properties of an IV which require the joint model to satisfy certain conditional independence properties while being compatible with the marginal causal model.This is especially problematic for non-collapsible cases, for instance for logistic structural mean models (Robins and Rotnitzky, 2004;Vansteelandt et al., 2011;Clarke and Windmeijer, 2012) or structural Cox models (Martinussen et al., 2017).As outlined in Appendix G, we believe that our approach based on the frugal parameterization can also be helpful in these situations, but we leave details for future work.
In contrast, causal mediation analysis is an example where models contain singularities and therefore our approach cannot be applied.Decomposing the effect of a treatment A on outcome Y into the indirect effect via mediator M , and the remaining direct effect, is conceptually the same as splitting A into two separate nodes A, A , where observationally we always have A = A ; mediation questions may then be considered as asking what would happen if A = A (Robins and Richardson, 2010).The quantities of interest are therefore generally functions of p Y |AA (y | do(a, a )), but where at the same time Y ⊥ ⊥ A | A, M holds in the full model where the two treatments are potentially different (Didelez, 2019).Because this independence requires us to model the Y -A association within the joint distribution, not within the (Y, A, A )-margin, the only parameters that we are free to specify are then those of the distribution of Y given each level of A (i.e. the strength of the direct effect); this is explicitly possible in the discrete case using results in Evans (2015).In other cases, attempts to specify both p * Y |AA and p * Y |AA M separately may lead to models which are not compatible; for example, the equations ( 4) and (5) of Loeys et al. (2013) do not generally give a valid model because the logit function is not closed under marginalization.Lange et al. (2012) avoid the problem of explicitly modelling the joint distribution by using marginal structural models instead, though their approach does not allow for simulation from the resulting model.

Another example of nonsmoothness comes from quantities such as
or some other contrast between these two distributions.‖ This leads to a parameterization which is degenerate, in the sense that its derivative (or nonparametric equivalent) is zero in some directions when the two distributions are the same.
While such nonsmooth models still remain a challenge, we are certain that marginal models based on a frugal parameterization have many further useful applications and extensions worth exploring in future work.For instance, classes of distribution that are closed under marginalization and conditioning, such as MTP 2 distributions (Karlin and Rinott, 1980), will naturally combine with our approach.On the technical side, the proposed rejection sampling method can be inefficient, and it would be desirable to improve this by using more advanced methods, along the lines of those suggested by Jacob et al. (2020).
As we noted in Section 1.2, we can see two opposing or complementary trends in causal modelling: ‖ This is related to (though distinct from) the parameter used by Hubbard and Van der Laan (2008) to estimate the effect of giving an entire population a particular treatment, versus no intervention at all.many approaches are based on specifying structural causal models that implicitly or explicitly condition on the entire past, and do not consider marginal objects such as p Y |X (y | do(x)).In contrast, our approach is found in the books by Pearl (2009, Chapter 3), Imbens and Rubin (2015) and Hernán and Robins (2020), which all consider marginal causal quantities to be fundamental.Beyond frugal parameterizations, we believe that thinking about causal models as a form of marginal model, for which there is an older and richer literature, may lead to many more advances in the field.
each give a regular representation of p Y |ZX (y | z, x) when combined with For the odds ratio we refer to Chen (2007) for details.The variation independence of the odds ratio from its margins is well known (e.g.Rüschendorf, 1995).If Y is unbounded and p Y |ZX (y | z, x) is the risk-ratio, then it is clear that we can modify it in any way and still obtain a valid joint distribution.
Proof of Proposition 7.3.We consider the special case in which each A t is binary, and proceed by induction on T .Note that we can combine all the conditionals p LtAt|At−1Lt−1 to obtain the joint distribution p L T A T .Now, by a simple adaptation of Theorem 3.1, we start with  In order to complete the parameterization we also need p ALB and φ * YL|A ; the latter of these could be the conditional odds ratio in the discrete case, for example.The advantage of this representation of an SNM is that it makes absolutely clear which (groups of) parameters are free to be varied.Indeed, like the previous examples this 'model' is such that any distribution over A, L, B, Y (or more generally A T , L T , Y ) can be represented using this parameterization.
We demonstrate this by constructing a distribution for a structural nested mean model over this graph.We take all variables to be binary, and let the blips be in the form of risk differences: Suppose also that p A (1) = 0.3, and

C Vine Copulas
As described in Example 2.4, a copula is a multivariate CDF with uniform (0, 1) margins, and can be obtained from any continuous parametric multivariate model by transforming each margin using its univariate CDF.However, there is a relative dearth of multivariate families in dimensions greater than two, and this limits the flexibility of such an approach.One solution to this problem has been to use vine copulas, which chain together bivariate families in order to give more flexible representations of multivariate models.
We do not describe vine copulas in full generality here for the sake of brevity, see Bedford and Cooke (2002) for details.Consider a system of three variables, U , L and Y .In the case that L ⊥ ⊥ Y | U , we can model the joint distribution using two separate copulas, one each for the L, U margin and the U, Y margin.Due to the conditional independence, the conditional quantiles of L | U and Y | U are uniformly distributed and uncorrelated.It is then possible to relax the conditional independence constraint, by placing another copula model on these conditional quantiles.Crucially, the distributions of the original bivariate margins remain the same.
Vine copulas also have the nice property that for the second level and above, parameters are conditional on the values of the first level; in particular they are variation independent.As a comparison, the standard parameters of a jointly Gaussian copula have to yield a positive definite matrix, which is hard to enforce (other than by using the vine copula approach of considering partial correlations).This is particularly useful if we introduce the treatment or other covariates as modifying the parameters, since the link functions can be much simpler.
Example R7.We will again apply this to Example R1 from Havercroft and Didelez (2012), this time including the latent variable U .We use Gaussian copulas in a vine for the triple (U, L, Y ), with U -L and U -Y correlation parameters 2 expit(1) − 1 ≈ 0.462, and L-Y partial correlation parameter 2 expit(0.5)− 1 ≈ 0.245.We take L and Y to be exponentially distributed with means Robust standard errors are shown in brackets, and each estimate is indeed less than one standard error away from its respective nominal value.Code to replicate this analysis is contained in the vignette Hidden_Variables of the R package causl.

D Simulation Example
We now apply the approach given in Section 5.1 to a single large dataset of size n = 10 4 .Table 5 shows the results, which this time are the estimates, standard errors and bias.We see that our maximum likelihood method indeed has the jointly smallest standard errors, and that for each of the IPW, MLE, and doubly robust approaches the estimates are suggestive of consistency.Only the outcome regression model fails, and this is unsurprising since it is misspecified.Code relating to this example is also found in the vignette Comparison in the R package causl.

E Data Analysis
The analysis of Nöhren consisted of using IPW with a propensity score model based on the logistic regression model that relates dichotomized fibre intake to as well the intercept and all other subsets of the terms above.Here isced is the average parental education level; AVM is the average time spent with audiovisual media in hours per week; MVPA is the average moderate-to-vigorous physical exercise performed in minutes per day; vegscore is the vegetable score.When we run the same analysis (indeed, the same code) for only the German children, the results obtained are shown in Table 6.

F Young and Tchetgen Tchetgen Simulations
The full model of Young and Tchetgen Tchetgen involves parameterizing t,at) = exp (ψ 0 a t + ψ 1 a t−1 + ψ 01 a t a t−1 ) .
We are also free to specify models for the dependence of each treatment and the covariates upon previous treatments and covariates, as well as the association parameters between each Y t and earlier covariates.Again, these can all be different for every t, but we follow Young and Tchetgen Tchetgen who use logistic regressions for each variable.They have logit p At|At−1LtYt−1 (1 | a t−1 , t−1 , y t−1 = 0) = α * + α 0 t logit p Lt|At−1Lt−1Yt (1 | a t−1 , t−1 , y t = 0) = β 1 a t−1 .They also use a logistic regression for the distribution of survival given the treatments and covariates, but we want to parameterize directly in terms of the ψs.We therefore define logit p Yt|AtLtYt−1 (1 | a t , t , y t−1 = 0) = θ * + θ a0 a t + θ 0 t + θ a1 a t−1 , noting that the parameters θ a0 and θ a1 are not actually free, because they are a function of the other parameters after specifying ψ 0 , ψ 1 and ψ 01 .
Young and Tchetgen Tchetgen specify the vectors α = (0.5, 0.5), β 1 = −2 and θ = (−7, −0.5, −0.8, 0) and then use the g-formula (7) to compute the corresponding values of ψ 0 , ψ 1 , ψ 01 .We will specify the values of ψ as well as θ * and θ 0 , and then compute the new values of other elements of θ.Note that all of the values of ψ 0 used are very close to −0.8, which is a consequence of the rare outcome assumption made by the original authors.
Continuing the example from Section 6, we simulate datasets of size n = 10 5 and a variety of values for β 1 and θ a0 , with θ 0 = −0.8.
Table 7 shows the bias that results in maximum likelihood estimates of θ a0 and estimates of ψ 0 via inverse probability weighting (compare this with Table I of Young and Tchetgen Tchetgen, 2014).We can see that this is indeed still small, implying that our simulation method works as expected.

G Instrumental Variables
One common causal approach, when faced with unobserved confounding, is to use an instrumental variables (IV) model, as shown in Figure 4.In this case interest may be in the average causal effect which is a function of the quantity p Y |X (y | do(x)); other popular IV approaches consider causal estimands such as the 'complier causal effect' or the 'effect of treatment on the treated' which we do not further address, here.The average causal effect, if everything is linear, can be identified by the ratio Cov(Z, Y )/ Cov(Z, X).More challenging is the case where the effect of X on Y is non-linear.
We can use our framework to simulate from the general IV model, by explicitly including the hidden variable U .We first parameterize the distribution of the 'past', i.e. (U, Z, X) so that U ⊥ ⊥ Z; Now, combine these to obtain the resulting joint distribution.In particular note that even if Y is binary, we can simulate using a copula model and then dichotomize Y from the resulting continuous distribution.This works particularly well with a probit or logistic model, for example.This gives a basic outline of how to represent an instrumental variable model so that we can simulate exactly from (almost † † ) any model of this kind.To reiterate Section 4.2, we simulate by sampling from p * UZY |X , and then rejecting samples based on the value of p X|UZ /p * X|UZ .However, further work is needed to extend this to structural mean models for IV analyses.These build on a particular no-effect modification assumption within a marginal (over unobserved confounders) model that is conditional on the natural treatment value and the IV, a restriction which cannot always be represented in a structural equation type model (Robins and Rotnitzky, 2004;Clarke and Windmeijer, 2010).† † Since it must satisfy A4.

Figure 1 :
Figure 1: (a) A causal model with three variables; (b) the same model after intervening on X.
when testing the hypothesis of whether p Y |AB (y | do(a, b)) depends upon A. This arises because seemingly innocuous parameterizations of the conditional distributions p Y |ALB (y | a, , b) and p L|A ( | a) (e.g. a linear and a logistic regression) lead to situations where the null hypothesis can almost never hold: that is, it is impossible for p Y |AB (y | do(a, b)) not to depend upon A unless either L or Y is completely independent of A. The reason for the 'paradox' can be understood as a problem of attempting to specify the relationship between Y and A in two different and potentially incompatible ways.
For notational purposes we choose to use Pearl's 'do(•)' operator to indicate interventions.For example, P (Y = y | A = a, do(B = b)) refers to the conditional distribution of Y given A = a under an experiment where B is fixed by intervention to the value b.We generally abbreviate this to p Y |AB (y | a, do(b)).The same quantity in the potential outcomes framework would generally be denoted by P (Y b = y | A b = a).

1.
Choose the quantity p * Y |X which you wish to model, or of which you wish to model a function, and select a parameterization θ * Y |X (this should include the quantity of interest).2. Determine the kernel w over which we need to integrate p Y |ZX in order to obtain p * Y |X , and a dummy marginal distribution p * X over X.This should not be degenerate, and for efficient sampling should be similar in form to the observational margin p X .3. Introduce a parameterization θ ZX of p ZX , such that p * ZX = w • p * X is smoothly and regularly parameterized by a twice differentiable function η ZX of θ ZX .4. Choose a 'suitable' parameterization φ * Y Z|X of the dependence in Z-Y conditional upon X in the causal distribution p * .The three pieces θ ZX , θ * θ a = 0.5 (γ 0 , γ a , γ , γ a ) = (−0.3,0.4, 0.3, 0) (α 0 , α 1 ) = (0.3, −0.2) (β 0 , β a , β b , β ab ) = (−0.5,0.2, 0.3, 0) and ρ ab = 2 expit(1 + a/2) − 1. Taking a large sample size of 10 6 , we indeed find (empirically, using goodness-of-fit tests) that EA = 0.5, that L appears to be exponentially distributed with the specified mean, and that E[B | A = a, L = ] has the correct form.In addition, if we fit an inverse probability weighted (IPW) linear model for Y (using the fitted value we obtain from the regression for B, see Hernán and Robins 2020, Chapter 12) the parameters for the interventional distribution of Y under do(A = a, B = b) are also as expected: β0 = −0.4985(0.0022) βa = 0.1999 (0.0033) βb = 0.3002 (0.0030) βab = −0.0030(0.0042).
and of the causal distribution as η(θ * ) := (η ZX (θ ZX ), θ * Y |X , φ * Y Z|X ).Theorem 5.1.Suppose that θ * is a frugal parameterization with weight function w(z) = p Z (z), so the model we are interested in is the marginal structural model; suppose also that A5 holds.The maximum likelihood estimator η of η(θ * ) obtained with the observed data (i.e.data generated using the distribution p ZXY with parameters θ * = (θ ZX , θ * Y |X , φ * Y Z|X )) will be consistent for the distribution in the causal model with parameters η = (η ZX , θ * Y |X , φ * Y Z|X ).In addition, for the estimates obtained in this way, we have θ * ) is the Fisher information under p ZXY and I(θ * ) −1 θ * Y |X ,φ * YZ|X is the submatrix of its inverse relating to θ * Y |X and φ * Y Z|X .
Two numerical examples are given as R6 and B.2 in Appendix B. Remark 7.4.The History-Adjusted Marginal Structural Models (HAMSMs) introduced by van der Laan et al. (2005) model (the mean of) the distributions ).Hence, by induction, we can obtain p Y |L T A T , and consequently p L T A T Y .The results on variation independence follow directly from the implications in Lemma B.1.Example R6.Consider again the model in Figure 2; in this case, we have A 1 = A and A 2 = B, with L 2 = L and L 1 being null.A structural nested mean model would include p Y |AB (y | do(a = b = 0)) p Y |AB (y | do(a = 1, b = 0)) and p Y |ALB (y | a, , do( b)).

Figure 4 :
Figure 4: A representation of the instrumental variables model.
then we take the distributions p Y |X (y | do(x)) and the association parameter φ * Y,UZ|X = φ * Y,U |X so as not to depend upon Z at all.This will allow us to simulate from an IV model, provided that the pieces p UZX , p * Y |X and φ * Y U |X are chosen from a sufficiently rich family of distributions.Specifically, suppose that we want to simulate from a particular model from Figure4, with a particular model for p * Y |X (y | x) (presumably this is p Y |X (y | do(x))).Then we should use the following algorithm: 1. select a model θ * Y |X for p * Y |X (y | x). 2. choose a distribution for (U, Z, X) such that U and Z are independent; 3. choose a model for φ * Y,U |X = φ * Y,UZ|X (i.e.such that Y ⊥ ⊥ Z | X, U ).
This entirely avoids the g-null paradox when considering hypotheses about p Y |AB (y | do(a, b)), since variation independence means that it may be freely specified.In addition this parameterization is such that one can logically specify any model with a joint density in this manner.Note that the example above does not give a separate specification of the dependence of Y on L that is causal, and the spurious dependence due to the latent parent U : both kinds of dependence are tied up in the association parameter φ * LY |AB .An alternative is to explicitly include U in the model, leaving us with |AB has to model the dependence between Y and (U, L), after intervention on A, B.
also of p ZXY .In particular, we can choose any parametric model for any cognate distribution p * Y |X , and use it to construct a smooth parameterization of the joint density.In other words, in terms of parameterization there is no essential difference between choosing a model for p * Y |X or for the ordinary conditional distribution p Y |X .When we do this, the smoothness and regularity of the parameterization of the observational model (θ Y |X ) as well as its variation independence to θ ZX and-possibly-the association parameters, is preserved in the new parameterization of p ZXY .The Theorem 3.1 below formalizes this.We first need to introduce a couple of additional assumptions.Recall that the functionals θ Y |X and φ Y Z|X for p ZXY have an identical form to the functionals θ * Y |X and φ * Y Z|X for p * ZXY .We will assume that p * ZX = p * X • w is smoothly and regularly parameterized by a function of θ ZX , and a relative positivity of the observational distribution.Recall also that the analyst chooses p * Y |X and w based on subject matter considerations.A2.The product p * ZX = p * X •w has a smooth and regular parameterization η ZX := η ZX (θ ZX ), where η ZX is a twice differentiable function with a Jacobian of constant rank.A3.p ZX is absolutely continuous with respect to p * ZX at the true distribution p ZXY .To clarify, we have two separate parameterizations of p ZXY .The first, θ, corresponds to using the ordinary conditional distribution p Y |X in our frugal parameterization and 'default' weight function w 0 (z | x) = p Z|X (z | x), whereas the second θ * uses a cognate distribution p * Y |X for some other weight w.As a note of caution, the two models for p ZXY induced by θ and θ * are not generally the same, because they apply to different functionals of p ZXY ; if the models are both saturated then the sets of distributions themselves will be the same, but the parameters have different interpretations, and their values are therefore generally different.Theorem 3.1.Let p ZXY be a distribution parameterized by θ := (θ ZX , θ Y |X , φ Y Z|X ) with weight function p Z|X , and w a kernel satisfying A2; we also assume that A3 holds.Then θ is frugal w.r.t.p Y |X if and only if θ This proves that θ is a smooth and regular parameterization if and only if θ * is.For A1, note that if φ Y Z|X is variation independent of θ ZX and θ Y |X , then we also have that φ * Y Z|X is variation independent of η ZX (θ ZX ) and θ * Y |X , because this is just A1 applied to the (possibly) smaller set of distributions p * ZXY .Then notice that modifying the value of θ ZX in such a way that keeps the value of η ZX the same will have no effect on the possible values of φ * Y Z|X , and hence A1 holds for θ * .Remark 3.2.The previous result tells us that, given a suitable dependence measure φ, we can propose almost arbitrary (i.e.provided that they satisfy the assumptions indicated in the Theorem) separate parametric models for each of the three quantities p ZX (z, x), p * Y |X (y | x) and φ * Y Z|X (y, z | x), and be sure that there exists a (unique) joint distribution p ZXY (z, x, y) compatible with that collection of models.Of course, this leaves open the question of how we should compute that joint distribution.
* := (θ ZX , θ * Y |X , φ * Y Z|X ) is also frugal w.r.t.p * Y |X .In addition, if φ Y Z|X satisfies A1 and η ZX (Θ ZX ) ⊆ Θ ZX , then φ * Y Z|X also does.Proof.First, note that by definition, either parameterization can use θ ZX to obtain p ZX .Then combining with A2 we can obtain w • p * X as a smooth function of η ZX (θ ZX ).Then note that so given that the fraction here is a smooth function of θ ZX from either parameterization, it is clear that we can obtain p * ZXY smoothly from θ * if and only if we can obtain p ZXY smoothly from θ.
R4. Let us apply this to a discrete version of Example R3 from Havercroft and Didelez (2012); we know that the objects in (6) are sufficient to define the model of interest.If all the variables are binary, then we start with a parameterization of p ALB and p * Y |AB using (conditional) probabilities, and φ * LY |AB (= φ LY |AB ) using conditional odds ratios.Assume, for example, that ZX within the causal model.The result for marginal structural models is a consequence of the fact that the minimizing distribution in this case is p X • p Z .For the asymptotic distribution of the estimators θ * Y |X and φ * Y Z|X , notice that for a marginal structural model we have p Z = p * Z (and of course we always have p Y |XZ = p * Y |XZ ) so the parameter cut mentioned above applies to both models.Hence, there is no asymptotic correlation between ( θ * Y |X , φ * Y Z|X * ZX ).Now, in general the result of minimizing this expression will depend upon the precise parameterization of p * ZX , but the minimization will pick out the distribution that is 'closest' to p

Table 1 :
Table giving the average bias, coverage of a 90% confidence interval, and standard error calibration (the ratio of absolute bias to standard error) of four methods: outcome regression; inverse probability (IP) weighting; a doubly robust estimator; and maximum likelihood estimation (MLE).

Table 2 :
Table giving estimated coefficients in the marginal structural model for effect modification of the PRS on BMI by fibre intake.
p ZX ), of p * a t ), we need φ * YLt+1|LtAt (y, t+1 | t , a t ) to recover the joint p YLt+1|LtA T (y, t+1 | t , a t , do(a t+1 = 0)).Then noticep(y, T | a t , do(a t+1 )) = p(y, T | a t+1 , do(a t+2 )) • p(a t+1 | a t ) p(a t+1 | t , a t ), so we can 'change worlds' and obtain probabilities with the same settings from a reweighting that is identifiable from the previous variables.The following proposition gives the general result, proved and illustrated by examples in Appendix B.Proposition 7.3.We can parameterize p L T A T Y ( T , a T , y) using smooth and regular parameterizations for p * YL2|L1A T (y, 2 | 1 , a 1 , a 0 2 ).Now, assume for induction that we can recover p Y |LtA T (y | t , a t−1 , a 0 t ); we have shown this for t = 2.We can reweight with some function of p L T A T to obtain p * YLt+1|LtA T when A t+1 = a 0 t+1 , and hence p Y |Lt+1A T (y | t+1 , a t , a 0 t+1 t , a t , a 0 t+1 ).Then, we can again use φ * YLt+1|LtAt together with p * Y |LtA T and p * Lt+1|LtA T (for A t+1 = a 0 t+1 ) to obtain p * YLt+1|LtAt .Reweighting again yields an expression for p

Table 5 :
Table giving coefficients from the marginal structural model via outcome regression (i.e.naïve regression on A and B); inverse probability weighting (IPW); doubly robust method (DR); and our maximum likelihood approach (MLE).

Table 6 :
Table giving estimated coefficients in the marginal structural model fitted by Nöhren for effect modification of the PRS on BMI by fibre intake, when applied to the same subset of the data that we used.