Analyses of ‘change scores’ do not estimate causal effects in observational data

Abstract
Background: In longitudinal data, it is common to create 'change scores' by subtracting measurements taken at baseline from those taken at follow-up, and then to analyse the resulting 'change' as the outcome variable. In observational data, this approach can produce misleading causal-effect estimates. The present article uses directed acyclic graphs (DAGs) and simple simulations to provide an accessible explanation for why change scores do not estimate causal effects in observational data.
Methods: Data were simulated to match three general scenarios in which the outcome variable at baseline was a (i) 'competing exposure' (i.e. a cause of the outcome that is neither caused by nor causes the exposure), (ii) confounder or (iii) mediator for the total causal effect of the exposure variable at baseline on the outcome variable at follow-up. Regression coefficients were compared between change-score analyses and the appropriate estimator(s) for the total and/or direct causal effect(s).
Results: Change-score analyses do not provide meaningful causal-effect estimates unless the baseline outcome variable is a 'competing exposure' for the effect of the exposure on the outcome at follow-up. Where the baseline outcome is a confounder or mediator, change-score analyses evaluate obscure estimands, which may diverge substantially in magnitude and direction from the total and direct causal effects.
Conclusion: Future observational studies that seek causal-effect estimates should avoid analysing change scores and adopt alternative analytical strategies.


Introduction
Studies of change are a cornerstone of research in the health sciences. Understanding the natural history of disease, and in turn predicting prognoses, is of enormous interest to physicians and patients alike. Analyses of 'change' are, however, deceptively complex in observational data. One of the most common, yet poorly recognized, challenges stems from the use and interpretation of 'change scores'.
Change scores (e.g. $\Delta Y = Y_1 - Y_0$), also known as 'difference scores', 'gain scores' and 'change-from-baseline variables', are composite variables that have been constructed from repeated measures of a single parent variable ($Y$) by subtracting a prior measure of the parent ($Y_0$, 'baseline') from a subsequent measure ($Y_1$, 'follow-up'). The resulting composite variable retains information from both of its determining parents and hence will share a tautological association with either if analysed by regression or correlation. 1 This was first recognized by Oldham in 1962, who demonstrated that an association averaging $r = \pm 1/\sqrt{2}$ occurs between either of the parent variables (i.e. $Y_0$ or $Y_1$) and their difference (i.e. $Y_1 - Y_0$) if both have similar variances but are otherwise unrelated. 2 This phenomenon explains the 'law of initial value' as a consequence of the sign disagreement between the baseline parent variable ($Y_0$) and its transformation in the composite change score ($-Y_0$), and is distinct from regression-to-the-mean. 1
Relatively few analyses of change scores, however, involve straightforward tautological associations. More often, change scores are used as outcome variables in relation to a separate baseline treatment or exposure $X_0$ (e.g. 'How do beta-blockers affect change in blood pressure?'). One of the most widely recognized issues in this context is the discordance between change-score analyses (i.e. where the outcome-change score $\Delta Y$ is regressed on the baseline exposure $X_0$) and analyses of covariance (ANCOVA; i.e. where the follow-up outcome $Y_1$ is regressed on the baseline exposure $X_0$ and 'adjusted for' the baseline outcome $Y_0$). 3,4 For example, Senn (2006) and Van Breukelen (2006) found that change-score analyses and ANCOVA provide similar and unbiased estimates when the exposure is randomized but provide 'contradictory results' when the exposure is not randomized.
Frederick Lord's eponymous paradox centres on this same 'contradiction' and the lack of an obvious 'correct' answer. 5
Although studies of change are extremely common, the concept of change, and the use of change scores as a putative measure thereof, has received relatively limited formal consideration within a causal framework. Causal diagrams such as directed acyclic graphs (DAGs) provide a useful framework for understanding some challenges associated with observational data analysis, but they have not often been used to consider analyses of change scores specifically. Of the exceptions, Glymour et al. (2005) focused on the role of measurement error, arguing that analyses of outcome-change scores provide unbiased causal-effect estimates in some cases, but that error can be introduced by conditioning on the baseline outcome. 6 Conversely, Shahar and Shahar (2010) argue that change scores are 'not of causal interest' and that 'modelling the change between two time points is justified only in a few situations'. 7 The present article aims to provide an accessible explanation of why analyses of change scores do not estimate causal effects in observational (i.e. non-randomized) data and to illustrate the potentially misleading consequences of using them for that purpose.
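As a numerical aside on the tautological association attributed to Oldham earlier in this section, the following minimal sketch simulates two unrelated measures with equal variances and correlates each with their difference. The sample size, seed and the helper name `pearson` are illustrative assumptions, not part of the original analysis.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1962)  # arbitrary seed, for reproducibility only
n = 50_000

# Baseline and follow-up are drawn independently with equal variance,
# i.e. the two 'parents' are unrelated.
y0 = [rng.gauss(0, 1) for _ in range(n)]
y1 = [rng.gauss(0, 1) for _ in range(n)]
change = [b - a for a, b in zip(y0, y1)]  # change score Y1 - Y0

r_baseline = pearson(y0, change)  # theory: -1/sqrt(2)
r_followup = pearson(y1, change)  # theory: +1/sqrt(2)
print(f"corr(Y0, change) = {r_baseline:.3f}")
print(f"corr(Y1, change) = {r_followup:.3f}")
```

The theoretical values follow because $\mathrm{cov}(Y_0, Y_1 - Y_0) = -\mathrm{var}(Y_0)$ and $\mathrm{var}(Y_1 - Y_0) = 2\,\mathrm{var}(Y_0)$ when the two measures are independent with equal variance.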

Change scores do not represent exogenous change
We use 'exogenous change' to mean the part of the follow-up outcome ($Y_1$) that has not been determined at baseline ($Y_0$) and can therefore potentially still be modified after baseline.
DAGs are non-parametric graphical representations of hypothesized causal relationships between variables. 8 Variables or events (depicted as nodes) are connected by unidirectional arcs (depicted as arrows), representing the presence and direction, though neither the nature nor the magnitude, of each hypothesized causal relationship. A path is a collection of one or more arcs that connect two nodes and a causal path is one in which all constituent arcs flow in the same direction. No variable can cause itself. By convention, we depict deterministic variables as double-outlined nodes. 9
We first consider the simple example of repeated measures of an outcome variable ($Y$) that only fluctuate due to randomness ($R$) (see Figure 1, panel A). Values of the follow-up ($Y_1$) are entirely determined by the baseline ($Y_0$) plus the random features at follow-up ($R_1$). In this scenario, $Y_1$ cannot be modified except by modifying $Y_0$; no exogenous change exists. This is obvious in repeated measures of a fixed variable, such as height in healthy middle-aged adults. Although each individual's height values $Y_0$ and $Y_1$ would likely differ slightly due to the random features at baseline ($R_0$) and follow-up ($R_1$), this only dilutes the observed relationship between $Y_0$ and $Y_1$. In the population, there would be no overall change in the average values of height at baseline and follow-up, and this would be correctly reflected by a change score with a mean of zero (Figure 1, panel A+).
Figure 1. Directed acyclic graphs (DAGs) depicting the relationship between an outcome variable at baseline ($Y_0$) and follow-up ($Y_1$), where the follow-up measure is completely determined. In panel A, the values of $Y_1$ are fully determined by $Y_0$ (and random processes $R_1$), so there exists no exogenous change. In panel B, the values of $Y_1$ are partly determined by $Y_0$ (and random processes $R_1$) and partly determined by exogenous factors representing 'change' ($C_1$). $C_1$, $R_0$ and $R_1$ are depicted as dashed (latent) variables, as they cannot be directly measured and are encapsulated within their descendant variables $Y_1$, $Y_0$ and $Y_1$, respectively. Panels A+ and B+ depict the same causal scenarios as panels A and B, respectively, but also show the composite change-score variable ($Y_1 - Y_0$), which itself is completely determined by $Y_0$ and $Y_1$.
The same causal scenario (i.e. Figure 1, panel A) could also describe repeated measures of a dynamic variable, whereby follow-up values are fully determined by baseline values via an algebraic function. As an example, consider the total expected number of radioactive particles $Y$ in a sample of (non-depleted) uranium rods at some future point in time ($Y_1$), which may be estimated without bias from the current observed number of radioactive particles ($Y_0$) by the Universal Law of Radioactive Decay. 10 The total observed value of $Y$ would irrefutably change between $Y_0$ and $Y_1$, and each individual uranium rod would have a negative change score (the magnitude of which would increase with the size of $Y_0$). Nevertheless, no exogenous change exists; as previously, $Y_1$ cannot be modified except by modifying $Y_0$.
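The decay example can be made concrete with a few lines of arithmetic. The sketch below assumes simple exponential decay with a hypothetical half-life and hypothetical baseline counts; the function name `expected_remaining` and all numbers are illustrative only.

```python
def expected_remaining(y0, elapsed, half_life):
    """Expected count after `elapsed` time units under exponential decay."""
    return y0 * 0.5 ** (elapsed / half_life)

# Hypothetical baseline counts for three rods; units and half-life are arbitrary.
baseline = [1_000.0, 2_000.0, 4_000.0]
follow_up = [expected_remaining(y0, elapsed=1.0, half_life=2.0) for y0 in baseline]
change_scores = [y1 - y0 for y0, y1 in zip(baseline, follow_up)]

# Each change score is the same fixed fraction of its own baseline value,
# so the 'change' carries no information beyond the baseline itself.
fractions = [c / y0 for c, y0 in zip(change_scores, baseline)]

for y0, c in zip(baseline, change_scores):
    print(f"Y0 = {y0:.0f}, change score = {c:.1f}")
```

Every rod shows a negative change score whose magnitude grows with its baseline count, yet nothing other than the baseline value enters the calculation, which is the sense in which the change is determined rather than exogenous.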
Finally, we consider a more realistic dynamic variable ($Y$), whose future values ($Y_1$) are only partly determined by the past values ($Y_0$), with the remainder determined by random features ($R_1$) plus other exogenous change ($C_1$) (see Figure 1, panel B). Here, $C_1$ represents all non-random changes in $Y$ that are not pre-determined by $Y_0$, so the concept of exogenous change can be considered an average of all the processes in $C_1 \rightarrow Y_1$. In reality, $C_1$ is an unmeasurable, ongoing latent process whose value is only defined once the point of follow-up is fixed (in the same way as 'age' is undefined until the date of measurement is defined). Thus, the exogenous change between two time points is fundamentally encapsulated within, and can only be determined from, $Y_1$.
We do not specify the time window between $Y_0$ and $Y_1$, but it seems plausible that change could also be introduced after baseline by altering the effect of $Y_0$ on $Y_1$. This is equivalent to creating an intermediate node ($Y_{0.5}$) along the path between $Y_0$ and $Y_1$ that provides a later chance to modify $Y_1$ without invoking exogenous change. However, this only serves to delay the distinction between the determined and change components of $Y_1$, since, after $Y_{0.5}$, there is again no means to alter $Y_1$ other than through exogenous change. In theory, we could introduce another node and another, but eventually we would reach the node immediately prior to $Y_1$ in time ($Y_{1-\delta t}$), at which point there is no way to intervene in the effect of $Y_{1-\delta t}$ after $Y_{1-\delta t}$, and exogenous change is the only way to introduce change in $Y$.

Isolating exogenous change with respect to a baseline exposure
The causal effect of a baseline exposure $X_0$ on 'change' in $Y$ hence corresponds to the effect of $X_0$ on 'exogenous change' in $Y$, i.e. the structural part of $Y_1$ that has not already been determined by $Y_0$. This quantity can be expressed as the effect of $X_0$ on $Y_1 \mid Y_0$ or the estimand
$$\alpha_1 = E[Y_1 \mid X_0 = x_0, Y_0] - E[Y_1 \mid X_0 = x_0^{*}, Y_0],$$
where $x_0$ and $x_0^{*}$ are two contrasting levels of the exposure. This effect may be estimated by constructing, e.g., a regression model of the form
$$\hat{Y}_1 = \hat{\alpha}_0 + \hat{\alpha}_1 X_0 + \hat{\alpha}_2 Y_0,$$
which we refer to as the follow-up adjusted for baseline analysis, where $\hat{\alpha}_1$ represents the estimate for the estimand of interest ($\alpha_1$).
Construction and analysis of a change score likely represent an attempt to isolate this same effect from the apparent 'effect' of $X_0$ on $\Delta Y = Y_1 - Y_0$ or the estimand
$$\beta_1 = E[\Delta Y \mid X_0 = x_0] - E[\Delta Y \mid X_0 = x_0^{*}],$$
where $x_0$ and $x_0^{*}$ are again two contrasting levels of the exposure. This quantity may be estimated by constructing a regression model of the form
$$\widehat{\Delta Y} = \hat{\beta}_0 + \hat{\beta}_1 X_0,$$
which we refer to as the change-score analysis and where $\hat{\beta}_1$ represents the coefficient that is often (mis)interpreted as estimating the true effect of interest ($\alpha_1$). Instead of 'standardizing' $Y_1$ relative to $Y_0$, the change-score approach treats two separate events (i.e. $Y_0$ and $Y_1$) as one, thereby conflating the causal pathways involved. This can be seen by rewriting the estimand in full as
$$\beta_1 = \left(E[Y_1 \mid X_0 = x_0] - E[Y_1 \mid X_0 = x_0^{*}]\right) - \left(E[Y_0 \mid X_0 = x_0] - E[Y_0 \mid X_0 = x_0^{*}]\right),$$
which depends jointly on elements of the effects of $X_0$ on both $Y_0$ and $Y_1$, including the negative of the total causal effect of $X_0$ on $Y_0$.
The degree of discordance between these two estimands ($\alpha_1$ and $\beta_1$), and hence the coefficients in the follow-up adjusted for baseline analysis ($\hat{\alpha}_1$) and the change-score analysis ($\hat{\beta}_1$), will depend on the strength of the association between the baseline exposure $X_0$ and the baseline outcome $Y_0$. Where the association between $X_0$ and $Y_0$ is trivial, the association between $X_0$ and $\Delta Y$ will converge on the association between $X_0$ and $Y_1$ because, when $X_0 \perp Y_0$,
$$E[Y_0 \mid X_0 = x_0] - E[Y_0 \mid X_0 = x_0^{*}] = 0.$$
This would be expected in large, well-conducted randomized experimental studies, in which change-score analyses may be used without invoking inferential bias (see Figure 2, panel A).
However, as the association between $X_0$ and $Y_0$ strengthens, as in non-randomized, non-experimental (i.e. observational) settings, the association between $X_0$ and $\Delta Y$ will be increasingly dominated by the $-Y_0$ component of the estimand, thereby diverging from the association between $X_0$ and $Y_1$. Whilst $\hat{\beta}_1$ provides a statistically unbiased estimate of $\beta_1$, it may nevertheless invoke serious inferential bias if misinterpreted as estimating $\alpha_1$, since the divergence between $\alpha_1$ and $\beta_1$ can be substantial and even sign-discordant. For example, if $X_0$ and $Y_0$ share a strong positive correlation, the negative transformation of $Y_0$ in the change score may dominate a smaller positive correlation between $X_0$ and $Y_1$, resulting in an overall negative association between $X_0$ and $\Delta Y$.
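The sign-discordance described above can be illustrated with a small simulation. This is a sketch under assumed, hypothetical path coefficients (0.8, 0.1 and 0.5); it shows that the change-score slope is exactly the slope of the follow-up on the exposure minus the slope of the baseline on the exposure, and that it can be negative even when the exposure is positively associated with both measures.

```python
import random

def slope(x, y):
    """OLS slope of y on a single predictor x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(7)  # arbitrary seed
n = 20_000

# Hypothetical data-generating process: X0 is strongly related to Y0,
# and Y1 depends on both X0 and Y0.
x0 = [rng.gauss(0, 1) for _ in range(n)]
y0 = [0.8 * x + rng.gauss(0, 1) for x in x0]
y1 = [0.1 * x + 0.5 * b + rng.gauss(0, 1) for x, b in zip(x0, y0)]
change = [f - b for b, f in zip(y0, y1)]

slope_y1 = slope(x0, y1)          # population value 0.1 + 0.5 * 0.8 = 0.5
slope_y0 = slope(x0, y0)          # population value 0.8
slope_change = slope(x0, change)  # population value 0.5 - 0.8 = -0.3

print(f"Y1 on X0:       {slope_y1:.3f}")
print(f"Y0 on X0:       {slope_y0:.3f}")
print(f"(Y1-Y0) on X0:  {slope_change:.3f}")
```

Because ordinary least squares is linear in the outcome, the identity between the three slopes holds exactly in any sample, not only on average.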

Exogenous change vs total causal effects
It may be tempting to conclude that $\alpha_1$ is always the estimand of interest in analyses of change and a follow-up adjusted for baseline analysis will therefore always provide the best solution where an association between $X_0$ and $Y_0$ is expected. Consideration must, however, also be given to the direction of the causal relationship between $X_0$ and $Y_0$, and the implications for which estimand(s) delivers the most useful causal effect(s).
The randomized experimental setting is unique for ensuring that $X_0$ occurs at the same time or after $Y_0$ by design. This guarantees that all changes in $Y$ that are caused by $X_0$ will be fully realized by the effect of $X_0$ on $Y_1$. In other words, the experimental setting ensures that the total causal effect of $X_0$ on $Y_1$ is identical to the effect of $X_0$ on exogenous change in $Y$. However, this cannot be generalized to all observational settings.
Figure 2. Directed acyclic graphs (DAGs) depicting three causal scenarios for analyses of change in an outcome ($Y$) in relation to a baseline exposure ($X_0$). Panel A represents a scenario in which the baseline outcome ($Y_0$) is a 'competing exposure' for the total causal effect of $X_0$ on the follow-up outcome ($Y_1$), i.e. $X_0$ is unrelated to $Y_0$ as in a well-conducted randomized experimental study. In this scenario, the total causal effect of $X_0$ on $Y_1$ is identical to the total causal effect of $X_0$ on 'exogenous change' in the outcome ($C_1$). Panel B represents a scenario in which $Y_0$ is a confounder for the total causal effect of $X_0$ on $Y_1$. In this scenario, the total causal effect of $X_0$ on $Y_1$ is again identical to the total causal effect of $X_0$ on $C_1$. Panel C represents a scenario in which $Y_0$ is a mediator for the total causal effect of $X_0$ on $Y_1$. In this scenario, the direct causal effect of $X_0$ on $Y_1$ that is not mediated through $Y_0$ is identical to the total causal effect of $X_0$ on $C_1$. Panels A+, B+ and C+ depict the same causal scenarios as panels A, B and C, respectively, but also depict the composite change-score variables ($Y_1 - Y_0$), which are completely determined by $Y_0$ and $Y_1$.
In some non-randomized contexts, such as where the baseline exposure is fast-acting and/or weakly autocorrelated over time, it may be obvious that $X_0$ occurs after $Y_0$, and that the dominant direction of causality therefore flows from $Y_0$ to $X_0$ (see Figure 2, panel B). In this setting, the effect of $X_0$ on exogenous change in $Y$ again corresponds to the total causal effect of $X_0$ on $Y_1$, and a follow-up adjusted for baseline analysis, to estimate $\alpha_1$, is appropriate (and necessary), since $Y_0$ is a classical confounder for the effect of $X_0$ on $Y_1$.
However, in many other contexts, it is plausible that the baseline exposure causes both the baseline values of the outcome and the follow-up values of the outcome, due to delayed or prolonged causal effects. In such circumstances, the dominant direction of causality flows from $X_0$ to $Y_0$, and $X_0$ causes $Y$ due to its effects on both $Y_0$ and $Y_1$ (see Figure 2, panel C). In this context, the effect of $X_0$ on exogenous change in $Y$ ($\alpha_1$) is arguably less meaningful, since it only captures the direct effect of $X_0$ on $Y_1$. If this effect is sought, then a follow-up adjusted for baseline analysis may be appropriate, though such a strategy would involve conditioning on the mediator $Y_0$, which introduces additional methodological challenges. 11,12 However, if it is the total causal effect that is sought, then a follow-up unadjusted for baseline analysis should be conducted to estimate
$$\gamma_1 = E[Y_1 \mid X_0 = x_0] - E[Y_1 \mid X_0 = x_0^{*}],$$
where $x_0$ and $x_0^{*}$ are two contrasting levels of the exposure. This would involve constructing, e.g., a regression model of the form
$$\hat{Y}_1 = \hat{\gamma}_0 + \hat{\gamma}_1 X_0,$$
where $\hat{\gamma}_1$ represents the estimate for the estimand ($\gamma_1$) of interest.
The choice of whether to adjust for the baseline outcome (i.e. $Y_0$) is therefore context-dependent, as it depends upon the hypothesized causal relationship between the baseline exposure and outcome, in particular whether $Y_0$ is a confounder or a mediator and which causal effect ($\alpha_1$ or $\gamma_1$) is of most interest.
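To make the confounder case (Figure 2, panel B) concrete, the sketch below simulates data in which the baseline outcome causes the exposure and compares the three regression coefficients. The path coefficients (0.8, 0.2 and 0.3), the seed and the helper names are hypothetical choices for illustration, not values from the simulations reported later.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n

def slope(x, y):
    """OLS slope of y on x alone (intercept included)."""
    return cov(x, y) / cov(x, x)

def slope_adjusted(x, z, y):
    """OLS coefficient of x when y is regressed on x and z (intercept included)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (szz * cov(x, y) - sxz * cov(z, y)) / (sxx * szz - sxz ** 2)

rng = random.Random(2)  # arbitrary seed
n = 20_000

# Panel B structure: Y0 -> X0, and both X0 and Y0 -> Y1.
# The true effect of X0 on Y1 is set to 0.2 (hypothetical).
y0 = [rng.gauss(0, 1) for _ in range(n)]
x0 = [0.8 * b + rng.gauss(0, 1) for b in y0]
y1 = [0.2 * x + 0.3 * b + rng.gauss(0, 1) for x, b in zip(x0, y0)]
change = [f - b for b, f in zip(y0, y1)]

alpha_hat = slope_adjusted(x0, y0, y1)  # follow-up adjusted for baseline
gamma_hat = slope(x0, y1)               # follow-up unadjusted for baseline
beta_hat = slope(x0, change)            # change-score analysis

print(f"adjusted:     {alpha_hat:.3f}  (population value 0.200)")
print(f"unadjusted:   {gamma_hat:.3f}  (population value about 0.346)")
print(f"change score: {beta_hat:.3f}  (population value about -0.141)")
```

Under these assumptions only the adjusted analysis targets the simulated effect of 0.2; the unadjusted coefficient is inflated by confounding and the change-score coefficient has the opposite sign.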

Illustrative example
To illustrate the inferential bias that may be introduced from naïve analyses of change scores, we consider the causal effects of waist circumference (WC) on (log-transformed) serum insulin concentration (IC) at two time points in US adults aged 18-49 years from 2009 to 2014. 13

Methods
Data were simulated to match eight simplified causal scenarios (see Figure 3). For each scenario, we then conducted three analyses using the resulting data: a change-score analysis, a follow-up adjusted for baseline analysis and a follow-up unadjusted for baseline analysis. We consider the resulting regression coefficients for $WC_0$ (i.e. $\hat{\beta}_1$, $\hat{\alpha}_1$ or $\hat{\gamma}_1$) and how they relate to the causal effects of interest. To demonstrate the impact of unmeasured confounding by $U$ and $U_2$ in Scenarios 1B, 2B, 3A+, 3B and 3B+, we do not explicitly adjust for these variables. Coefficient units (i.e. Log[mmol/L]/dm) are omitted to aid readability.
Figure 3. Directed acyclic graphs (DAGs) of the eight simulated scenarios. For ease of illustration, the exogenous change variable ($C_1$) is not explicitly depicted, but is implicitly encapsulated within log insulin concentration at follow-up ($IC_1$). $IC_1$, waist circumference at baseline ($WC_0$), log insulin concentration at baseline ($IC_0$), one or more unobserved confounding variables ($U$) and one or more unobserved mediator-outcome confounding variables ($U_2$) were simulated with the specified path coefficients; for more details, see the Supplementary Materials, available as Supplementary data at IJE online. Composite change-score variables ($IC_1 - IC_0$) were derived and are therefore depicted as double-outlined nodes with dashed incoming arcs, to indicate that these were not simulated. The standardized total causal effect of $WC_0$ on $IC_1$ was fixed at 0.433, as this corresponded to a regression coefficient of 0.200 Log[mmol/L]/dm. When mediated through $IC_0$, the standardized direct effect of $WC_0$ on $IC_1$ was fixed at 0.108, as this corresponded to a regression coefficient of 0.050 Log[mmol/L]/dm.
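The three analyses can be sketched on a toy version of a mediation scenario. The code below is not the authors' simulation: it borrows only the total (0.200) and direct (0.050) coefficients quoted above, while the remaining path coefficients (0.25 and 0.6), the noise scale, the seed and the helper names are assumptions made for illustration.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n

def slope(x, y):
    """OLS slope of y on x alone (intercept included)."""
    return cov(x, y) / cov(x, x)

def slope_adjusted(x, z, y):
    """OLS coefficient of x when y is regressed on x and z (intercept included)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (szz * cov(x, y) - sxz * cov(z, y)) / (sxx * szz - sxz ** 2)

rng = random.Random(3)  # arbitrary seed
n = 20_000

# Mediation structure: WC0 -> IC0 -> IC1 and WC0 -> IC1.
# Direct effect 0.05; indirect effect 0.25 * 0.6 = 0.15; total effect 0.20.
wc0 = [rng.gauss(0, 1) for _ in range(n)]
ic0 = [0.25 * w + rng.gauss(0, 0.5) for w in wc0]
ic1 = [0.05 * w + 0.6 * b + rng.gauss(0, 0.5) for w, b in zip(wc0, ic0)]
delta_ic = [f - b for b, f in zip(ic0, ic1)]

change_hat = slope(wc0, delta_ic)             # change-score analysis
direct_hat = slope_adjusted(wc0, ic0, ic1)    # follow-up adjusted for baseline
total_hat = slope(wc0, ic1)                   # follow-up unadjusted for baseline

print(f"change score: {change_hat:.3f}")
print(f"adjusted:     {direct_hat:.3f}")
print(f"unadjusted:   {total_hat:.3f}")
```

In this toy set-up the change-score coefficient has a population value of $0.20 - 0.25 = -0.05$, i.e. the opposite sign to both the direct and the total effect, whereas the adjusted and unadjusted analyses recover the direct and total effects, respectively.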

Results
The resulting regression coefficients of $WC_0$ for each of the three methods of analysis for each of the eight scenarios are summarized in Table 1.
(i) Scenario 1: Baseline insulin is a 'competing exposure' (i.e. is neither caused by, nor the cause of, baseline waist circumference)
In Scenario 1B, the association between $WC_0$ and $\Delta IC$ again consists of the causal effect of $WC_0$ on $IC_1$ but this is now confounded by $U$. All three methods of analysis provide a biased estimate of the causal effect of $WC_0$ ($\hat{\beta}_1 = 0.191$, $\hat{\alpha}_1 = 0.203$, $\hat{\gamma}_1 = 0.228$). However, it is worth noting that the follow-up adjusted for baseline estimate (i.e. $\hat{\alpha}_1$) is less biased than the follow-up unadjusted for baseline estimate (i.e. $\hat{\gamma}_1$), since adjustment for $IC_0$ closes one of the two confounding paths between $WC_0$ and $IC_1$.
(ii) Scenario 2: Baseline insulin is a confounder
Table 1. Regression coefficients (and 95% simulation limits) returned from three analytical approaches to estimating the 'effect' of waist circumference on 'change' in (log) insulin concentration for the eight causal scenarios shown in Figure 3.
(iii) Scenario 3: Baseline insulin is a mediator
In Scenario 3A, the association between $WC_0$ and $\Delta IC$ consists of both the direct causal effect of $WC_0$ on $IC_1$ and the indirect causal effect that is mediated through $IC_0$. The change-score analysis ($\hat{\beta}_1 = -0.031$) provides a biased estimate of opposite sign to both the direct causal effect ($\alpha_1$) of $WC_0$ on $IC_1$ (equivalent to the effect of $WC_0$ on exogenous change in IC) and the total causal effect ($\gamma_1$) of $WC_0$ on $IC_1$. The follow-up adjusted for baseline analysis provides an unbiased estimate of the direct causal effect of $WC_0$ on $IC_1$ ($\hat{\alpha}_1 = 0.050$), though the estimate is biased ($\hat{\alpha}_1 = 0.025$) in the presence of mediator-outcome confounding (Scenario 3A+), since conditioning on $IC_0$ opens a confounding path through $U_2$. 12 The follow-up unadjusted for baseline analysis provides an unbiased estimate of the total causal effect of $WC_0$ on $IC_1$ ($\hat{\gamma}_1 = 0.200$), which remains robust in the presence of mediator-outcome confounding (Scenario 3A+).
In Scenario 3B, the association between $WC_0$ and $\Delta IC$ again consists of the direct causal effect of $WC_0$ on $IC_1$ and the indirect causal effect mediated through $IC_0$, but this is now confounded by $U$. The change-score analysis remains biased ($\hat{\beta}_1 = -0.031$) and with the opposite sign to both the direct and total causal effects. Both the follow-up adjusted for baseline analysis and follow-up unadjusted for baseline analysis provide biased estimates of the direct causal effect ($\hat{\alpha}_1 = 0.047$) and total causal effect ($\hat{\gamma}_1 = 0.228$) of $WC_0$, respectively. The bias of the follow-up adjusted for baseline analysis is exacerbated ($\hat{\alpha}_1 = 0.015$) in the presence of mediator-outcome confounding (Scenario 3B+) due to conditioning on the collider $IC_0$.
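The effect of mediator-outcome confounding on the adjusted analysis can be sketched as follows. This is an illustrative toy model, not the authors' simulation code: the total (0.20) and direct (0.05) effects echo the values used in the text, but the paths involving the unobserved variable (0.5 and 0.1), the other coefficients, the noise scale and the seed are hypothetical.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n

def slope(x, y):
    """OLS slope of y on x alone (intercept included)."""
    return cov(x, y) / cov(x, x)

def slope_adjusted(x, z, y):
    """OLS coefficient of x when y is regressed on x and z (intercept included)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (szz * cov(x, y) - sxz * cov(z, y)) / (sxx * szz - sxz ** 2)

rng = random.Random(4)  # arbitrary seed
n = 50_000

# WC0 -> IC0 -> IC1 and WC0 -> IC1, plus an unobserved U2 -> IC0 and U2 -> IC1.
# U2 is independent of WC0, so it does not confound the total effect (0.20).
wc0 = [rng.gauss(0, 1) for _ in range(n)]
u2 = [rng.gauss(0, 1) for _ in range(n)]
ic0 = [0.25 * w + 0.5 * u + rng.gauss(0, 0.5) for w, u in zip(wc0, u2)]
ic1 = [0.05 * w + 0.6 * b + 0.1 * u + rng.gauss(0, 0.5)
       for w, b, u in zip(wc0, ic0, u2)]

adjusted_hat = slope_adjusted(wc0, ic0, ic1)  # targets the direct effect (0.05)
total_hat = slope(wc0, ic1)                   # targets the total effect (0.20)

print(f"adjusted for IC0 (U2 ignored): {adjusted_hat:.3f}")
print(f"unadjusted:                    {total_hat:.3f}")
```

With these assumed coefficients the adjusted coefficient has a population value of 0.025 rather than the simulated direct effect of 0.05, because conditioning on $IC_0$ induces an association between $WC_0$ and $U_2$; the unadjusted coefficient is unaffected.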

Discussion
Our study explains why analyses of change scores do not estimate causal effects in observational data. To demonstrate, we explored the ostensibly simple context of analysis of change in an outcome (insulin concentration) with respect to a baseline exposure (waist circumference) for eight different causal scenarios. Misleading coefficients, sometimes of opposite sign to the true effects of interest, were observed in every scenario except where the baseline outcome was a 'competing exposure', i.e. was neither caused by, nor the cause of, the baseline exposure. Although such independence is plausible, and is indeed actively sought in randomized experimental studies, it is extremely unlikely when the exposure is not assigned randomly. Many analyses of change scores in observational studies are therefore likely to suffer inferential bias, the size of which will vary with the strength and nature of the association between the baseline exposure and baseline outcome.

Recommendations
Analyses of outcome-change scores to estimate causal effects in observational data should be avoided, including 'percentage'-change scores, where the change between baseline and follow-up is expressed as a percentage of the baseline value. If the follow-up outcome is not normally distributed, appropriate transformations and/or non-parametric methods should be preferred to calculating and analysing change scores. 14 Ideally, all causal effect(s) of interest should be formally identified using DAGs and estimated accordingly. We believe the total causal effect of the baseline exposure (i.e. $X_0$) on the follow-up outcome (i.e. $Y_1$) will generally offer the greatest interest and utility, as it provides the simplest summary of how changing the exposure would be expected to change future values of the outcome. Where the baseline outcome (i.e. $Y_0$) is a 'competing exposure' or confounder for the effect of the exposure on the follow-up outcome, the total causal effect of the exposure on the follow-up outcome is the same as its effect on exogenous change in the outcome. Where the baseline outcome is a mediator for the effect of the exposure on the follow-up outcome, the direct causal effect of the exposure on the follow-up outcome captures its effect on exogenous change in the outcome. If the direct causal effect is sought, estimating this will need to account for potential mediator-outcome confounding. 11,12

Caveats
Not all uses of outcome-change scores will necessarily produce incorrect or misleading estimates. Change scores may provide a robust summary of the average change in a variable between two time points for a group or individual; problems only arise when statistical comparisons are made either between groups or individuals, or in relation to one or more other variables.
Change scores may therefore still be qualitatively useful for tracking the progress of individuals, provided it is recognized that the magnitude of any expected change is functionally determined by the baseline value.
Where the exposure is unrelated to the outcome at baseline (such as in randomized experimental studies), analyses of change scores provide unbiased estimates. However, even under these circumstances, analyses of change scores are less efficient than follow-up adjusted for baseline analyses (e.g. ANCOVA), unless the change-score analysis is also adjusted for the baseline outcome. 15 In fact, analyses of change scores that adjust for the baseline outcome (i.e. change score adjusted for baseline analyses) can provide unbiased estimates even in non-randomized data, because they return exactly the same exposure coefficient as follow-up adjusted for baseline analyses. This is because adjusting for $Y_0$ eliminates the contribution of the '$-Y_0$' component in the outcome-change score. 17,18 However, extra care must be taken not to interpret the coefficient for the baseline outcome when it is included as a model covariate, as this will primarily reflect its tautological association with the change score.
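The equivalence between the two adjusted analyses can be verified numerically. In the sketch below, the data-generating coefficients, the seed and the helper `fit_two` are arbitrary illustrative choices; the identity itself does not depend on them, since regressing $Y_1 - Y_0$ on $X_0$ and $Y_0$ simply subtracts 1 from the coefficient on $Y_0$ and leaves the coefficient on $X_0$ unchanged.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n

def fit_two(x, z, y):
    """Return OLS coefficients (b_x, b_z) for y ~ intercept + x + z."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det

rng = random.Random(5)  # arbitrary seed
n = 5_000

# Arbitrary observational-style data in which X0 and Y0 are associated.
y0 = [rng.gauss(0, 1) for _ in range(n)]
x0 = [0.5 * b + rng.gauss(0, 1) for b in y0]
y1 = [0.2 * x + 0.7 * b + rng.gauss(0, 1) for x, b in zip(x0, y0)]
change = [f - b for b, f in zip(y0, y1)]

ancova_x, ancova_y0 = fit_two(x0, y0, y1)   # follow-up adjusted for baseline
cs_x, cs_y0 = fit_two(x0, y0, change)       # change score adjusted for baseline

print(f"exposure coefficient:  {ancova_x:.6f} vs {cs_x:.6f}")
print(f"baseline coefficient:  {ancova_y0:.6f} vs {cs_y0:.6f}")
```

The baseline coefficient in the change-score model is the ANCOVA baseline coefficient minus one, which is why it mainly reflects the tautological association with the change score rather than anything of substantive interest.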
In some situations, the coefficient of a change-score analysis ($\hat{\beta}_1$) may coincide with the desired estimand ($\alpha_1$) if the spurious elements of the change-score estimand happen to offset all other unobserved confounding, 19 or may otherwise provide less biased 'estimates' than the appropriate estimator. Unfortunately, since it is impossible to know when such situations occur, this is unlikely ever to offer practical utility.
Even when adopting a robust analytical strategy, analyses of change with only two measurements will almost always produce inaccurate effect estimates due to random variation (whether error or otherwise) in the baseline and/or follow-up measures. A diluted estimate can be expected because it is not possible to distinguish the (desired) effect on exogenous change from the association with the random determinants of change (which will average at zero). Some information about the random variation can, however, be gained from the baseline outcome and this explains why adjusting for the baseline outcome (e.g. using ANCOVA) offers improved precision over unconditional analyses of the follow-up outcome in randomized experimental data. In observational data, this benefit is secondary to considering the causal relationship between $X_0$ and $Y_0$. When $Y_0$ is a confounder for the effect of $X_0$ on $Y_1$ (and hence 'change' in $Y$), reducing this confounding through conditioning is theoretically appropriate and necessary. However, some residual confounding will remain because it is not possible to distinguish between the 'stable' or structural features of $Y_0$ (that may cause $Y_1$) and the random features (that cannot cause $Y_1$). Change scores cannot offer a solution to these consequences of limited measurement, since they contain no more information than their parent variables $Y_0$ and $Y_1$.
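The residual confounding left by random variation in the baseline measure can be sketched with a toy model in which only the structural part of the baseline confounds the exposure-outcome relationship. The variable names, the coefficients (0.8, 0.2 and 0.5), the unit-variance noise and the seed are all hypothetical.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n

def slope(x, y):
    """OLS slope of y on x alone (intercept included)."""
    return cov(x, y) / cov(x, x)

def slope_adjusted(x, z, y):
    """OLS coefficient of x when y is regressed on x and z (intercept included)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (szz * cov(x, y) - sxz * cov(z, y)) / (sxx * szz - sxz ** 2)

rng = random.Random(6)  # arbitrary seed
n = 20_000

# 'structural' is the stable part of the baseline outcome; the observed
# baseline adds a random component that has no effect on X0 or Y1.
structural = [rng.gauss(0, 1) for _ in range(n)]
y0_observed = [s + rng.gauss(0, 1) for s in structural]
x0 = [0.8 * s + rng.gauss(0, 1) for s in structural]
y1 = [0.2 * x + 0.5 * s + rng.gauss(0, 1) for x, s in zip(x0, structural)]

unadjusted = slope(x0, y1)                          # population value about 0.444
adjusted_observed = slope_adjusted(x0, y0_observed, y1)  # about 0.352
adjusted_structural = slope_adjusted(x0, structural, y1)  # 0.200, not available in practice

print(f"unadjusted:                    {unadjusted:.3f}")
print(f"adjusted for observed Y0:      {adjusted_observed:.3f}")
print(f"adjusted for structural part:  {adjusted_structural:.3f}")
```

Under these assumptions, adjusting for the observed baseline moves the estimate towards the simulated effect of 0.2 but does not reach it, which is the residual confounding described above.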
Additional measurements are necessary to reduce the issues with random variation. Latent variable methods provide an elegant means to summarize the pattern of growth over multiple time points, although care must be taken to avoid other inferential biases due to regression-to-the-mean. 20 When used appropriately, latent growth-curve models avoid the same problems as change-score analyses because they are centred across all datapoints, ensuring the intercept and slope do not share the same spurious negative correlation as in analyses of change scores. This is conceptually similar to Oldham's suggestion that change between baseline and follow-up be compared against the mean of the two values [i.e. $(Y_0 + Y_1)/2$], 2 the same approach as recommended by Bland and Altman for calculating limits of agreement. 20 Summary features, such as 'slope', nevertheless still possess some conceptual challenges, due to the conflation of causal information from multiple time points. 21

Ontology of change
Whether analyses of change are meaningful or misleading is ultimately a matter of ontology, since the problems that arise are inferential, not statistical. We conceptualize three reasons for a variable changing value over time. The first, 'determined change', is not really change, but the realization of a past event at a later point in time. This is analogous to the inevitable future consequences of a present event within space-time. 22 The second, 'random change', represents all the random reasons for a variable changing value beyond what has been determined. Strictly, this consists of all uncertainty arising from the quantum, although, pragmatically, it will also include all apparently random behaviour arising from intractable complexity. 23 Finally, 'exogenous change' represents all non-random reasons for a variable changing value beyond what has been determined. This is analogous to the influence of all events in the 'absolute elsewhere' within space-time. 22 Of these three reasons for a variable changing value, exogenous change offers the only route to external influence, making it the principal interest of causal enquiry.

Study limitations
Our simulations were deliberately simplified and made several distributional assumptions that may not be entirely realistic. Multiple variables are likely to confound the true causal effect of waist circumference on insulin concentration. Rather than simulating these individually, we simulated a single summary confounder U for illustrative purposes. The focus of this paper was not, however, on one specific context; rather, we sought to demonstrate the potential problems with analysing and interpreting change scores in observational studies and the utility of DAGs for exploring and identifying such issues. No inferences should be drawn from our simulations about the assumed causal effect of waist circumference on insulin concentration, which may not exist. We did not consider the additional complications that would result from non-linear relationships, where change scores and linear conditioning for the baseline outcome (e.g. using ANCOVA) would introduce further bias. Where confounding is present and conditioning is required, appropriate parameterization should be sought to reduce residual confounding.

Comparison with Lord (1967) and Glymour et al. (2005)
Scenario 3A, in which the baseline outcome mediates the effect of the exposure on the follow-up outcome, represents the same situation that originally puzzled Lord in 1967. 5 Lord's confusion arose because the change-score analysis and follow-up adjusted for baseline analysis produced very different results, neither of which seemed to resolve the 'pre-existing' differences in weight at baseline. Using a causal perspective, we can recognize that this 'paradox' occurred for two distinct reasons: (i) follow-up adjusted for baseline analyses do not provide total causal effects because the baseline outcome is a mediator and (ii) change-score analyses do not provide meaningful causal-effect estimates in observational data. Although these points have been individually recognized elsewhere, 7,24 they have not yet been explicitly recognized jointly.
Our conclusion that change scores do not estimate causal effects in non-randomized contexts, including any effect on 'exogenous' change, may explain the divergence between our conclusions and those of Glymour et al. 6 Glymour et al.'s study compares two change-score analyses: one with and one without adjustment for a mediating baseline outcome. However, as discussed above, change score adjusted for baseline analyses are equivalent to follow-up adjusted for baseline analyses, 17,18 meaning that the scenario in Glymour et al. mirrored Lord's paradox and gave similarly divergent results. Glymour et al. attributed this divergence to the introduction of measurement error when adjusting for the baseline outcome and concluded that 'change-score analyses without baseline adjustment provide unbiased causal effect estimates'. 6 We suspect that the difference instead reflects the differing estimands, with only the change score adjusted for baseline analyses returning a potentially meaningful estimand-the direct causal effect.

Conclusion
Judgements regarding clinical significance and the funding and delivery of treatment are dependent on obtaining meaningful causal-effect estimates, and analyses of outcome-change scores in non-randomized data do not provide this. Moreover, such analyses may even suggest an 'effect' that is of the opposite sign to the total causal effect. Observational studies that have analysed outcome-change scores should therefore be viewed with caution and their recommendations revisited.

Supplementary data
Supplementary data are available at IJE online.

Ethics approval
Ethics approval was not required for this research, as it did not involve human subjects.

Funding
This study received no specific funding. P.W.G.T., K.F.A. and M.S.G. are supported by the Alan Turing Institute [grant number EP/N510129/1].

Data availability
The simulation code is available on GitHub at https://github.com/pwgtennant/change-score.