Pathway-specific population attributable fractions

Abstract Introduction A population attributable fraction represents the relative change in disease prevalence that one might expect if a particular exposure was absent from the population. Often, one might be interested in what percentage of this effect acts through particular pathways. For instance, the effect of a sedentary lifestyle on stroke risk may be mediated by blood pressure, body mass index and several other intermediate risk factors. Methods We define a new metric, the pathway-specific population attributable fraction (PS-PAF), for mediating pathways of interest. PS-PAFs can be informally defined as the relative change in disease prevalence from an intervention that shifts the distribution of the mediator to its expected distribution if the risk factor were eliminated, and sometimes more simply as the relative change in disease prevalence if the mediating pathway were disabled. A potential outcomes framework is used for formal definitions and associated estimands are derived via relevant identifiability conditions. Computationally efficient estimators for PS-PAFs are derived based on these identifiability conditions. Results Calculations are demonstrated using INTERSTROKE—an international case–control study designed to quantify disease burden attributable to a number of known causal risk factors. The applied results suggest that mediating pathways from physical activity through blood pressure, blood lipids and body size explain comparable proportions of stroke disease burden, but a large proportion of the disease burden due to physical inactivity may be explained by alternative pathways. Conclusion PS-PAFs measure disease burden attributable to differing mediating pathways and can generate insights into the dominant mechanisms by which a risk factor affects disease at a population level.


Introduction
Population attributable fractions (PAFs) represent the relative change in disease prevalence that one might expect if a particular exposure was absent from the population. This metric was originally introduced in 1953 by Levin 1 to estimate the percentage of lung cancer that would not have occurred under a counterfactual scenario in which nobody smoked in the population. Since then, this family of metrics has become a standard way of measuring total disease burden attributable to a risk factor 2 and also of ranking differing risk factors for prioritization as intervention targets. 3 Partitioning this overall disease burden into contributions from the known pathways through which the risk factor affects disease is also useful, both in understanding pathogenic mechanisms and when comparing interventions that may reduce disease. For instance, we might estimate that in a hypothetical world, where dietary red meat was completely replaced by plant-based protein, the prevalence of heart disease might be reduced by 10%. How much of this reduction in disease burden is attributable to the pathway by which diet affects blood pressure? Here, we introduce the pathway-specific population attributable fraction (PS-PAF) to help answer this question. In the preceding example, the PS-PAF can be informally understood as the relative change in disease prevalence in a hypothetical world in which the distribution of blood pressure was altered to match the distribution expected under the aforementioned dietary substitution. Under certain assumptions (described later), the same quantity can be described more mechanistically as the proportion of disease that would be avoided from completely disabling the corresponding mediating pathway at an individual level (here the pathway is diet → blood pressure → heart disease). PS-PAFs can be calculated for multiple mediating pathways for the same exposure.
For example, the effect of the previous dietary substitution might be partially mediated by the effect on cholesterol as well as blood pressure; separate PS-PAFs could be compared for both pathways. In this case, the aim is not to provide an additive decomposition of the overall attributable fraction into PAFs for mediating pathways. Just as differing attributable fractions for a set of risk factors typically sum to more than the joint attributable fraction, 4,5 differing PS-PAFs corresponding to various mediating pathways typically sum to more than the overall PAF for the risk factor. Rather than decomposing the total PAF, the aim instead is to fairly compare disease burden attributable to differing pathways and as a result gain insights into the dominant mechanisms by which the risk factor affects disease at a population level. These insights may in turn be useful in comparing possible interventions to prevent disease.
Presently, the only metric we know of in the literature that measures the extent of population disease burden through a risk factor → mediator → outcome pathway is the indirect PAF, recently introduced by Sjölander. 6 This metric operates by subtracting a direct attributable fraction (which corresponds to the disease burden contributed by all pathways not operating through the mediator) from the total PAF for the risk factor. However, as we explain later, the direct component will differ over differing mediators and so indirect PAFs for different mediators will not be comparable. In this regard, the PS-PAF that we introduce here is more suited to measuring and comparing disease burden over differing known and unknown pathways. Our work is also related to the work of Vansteelandt and Daniel 7 who examine pathway-specific effects of a treatment in a multi-mediator setting. A major difference between our work and theirs is that our results pertain to attributable fractions rather than individual-level treatment effects. Attributable fractions (such as the regular attributable fraction and the PS-PAF defined here) depend innately on the prevalence of a risk factor as well as relative risk and correspond to the proportion of disease burden attributable to the risk factor (as opposed to comparisons of the situations in which everyone and no one had the risk factor). As such, the PS-PAF could be regarded as the proportion of disease burden in the population related to a particular risk factor → mediator → outcome pathway, or alternatively how prominent that particular pathway is in the pathogenesis of the disease.
To maximize both the applicability and the interpretability of the described methods, we will define PS-PAFs under causal identifiability conditions of various stringency. Separable conditions are the most stringent and presume that an intervention is possible (at least in theory) to exactly replicate the effect of eliminating the risk factor on a particular mediator, with the intervention having no residual effect on disease that does not operate through the mediator. In this setting, PS-PAFs are associated with the effect of applying this hypothesized intervention at a population level. More applicable but perhaps less interpretable are the mechanistic and interventional forms, which, unlike the separable PS-PAFs, require consideration of potential outcomes but can be applied when a separable intervention cannot be imagined. Fortunately, all three forms of the PS-PAF are identified in the same way, so one can use the same estimation procedure and interpret the result according to the assumptions one finds reasonable. Having defined and identified PS-PAFs, we will compare and contrast them with the related concept of indirect PAFs introduced by Sjölander. 6 We will illustrate results using data from INTERSTROKE, a case-control study designed to quantify disease burden attributable to a number of known causal risk factors for stroke. A discussion section concludes the manuscript.

Identification and estimation of PS-PAFs
Potential outcome notation used for mediation analyses

We borrow notation from VanderWeele 8 in defining potential outcomes. As is usual in the causal inference literature, observable random variables will be denoted using unscripted notation and random variables for potential outcomes will be denoted using subscripts. In all cases, we use uppercase letters to denote random quantities and lowercase to denote quantities that are fixed or intervened on.
In particular, let $C$ denote a random vector of known baseline covariates not affected by the exposure, the random variable $A \in \{0,1\}$ a binary exposure of interest and the random variables $M^1, \dots, M^K$ mediators on separate causal pathways from $A$ to $Y$; each $M^k$ could be binary, multi-category or continuous. Finally, the random variable $Y \in \{0,1\}$ represents a binary disease outcome. Figure 1 demonstrates a multi-mediator scenario with three mediators: $M^1$, $M^2$ and $M^3$.
The potential outcome setting exposure to $a$ and $M^k$ to $m^k$ is denoted by $Y_{a,m^k}$, where the index $k$ indicates which mediator is intervened on. On occasion, we will consider potential outcomes setting the exposure to $a$ and the $K$ mediators to $m^1, \dots, m^K$, denoted by $Y_{a,m^1,\dots,m^K}$. Since an intervention setting $A = 0$ may well affect the distribution of $M^k$, the potential outcome $Y_{a,m^1,\dots,m^K}$ for particular individuals may be unrealizable even under plausible interventions on the risk factor. However, under standard counterfactual models, such as Pearl's NPSEM-IE (Non-Parametric Structural Equations Model with Independent Errors), the quantity $Y_{a,m^1,\dots,m^K}$ is still well defined for these individuals and can be utilized when defining causal effects. One can also define potential outcomes for each $M^k$, $k \le K$, for an intervention setting the exposure to $a$ as $M^k_a$. In addition, we will use the following abbreviated notation: $Y_a = Y_{a,M^1_a,\dots,M^K_a}$ for an intervention that sets the exposure $A$ to $a$, $Y_{A,m^k}$ for an intervention setting the $k$th mediator to $m^k$ with the exposure taking its natural value, and $Y_{0,M^k}$ for an intervention 'eliminating' the exposure but where the $k$th mediator takes its natural value. Note that implicit in this notation is the possibility that an intervention on the $k$th mediator might affect causally downstream mediators when several mediators lie on the same causal pathway.
If on the other hand the mediators lie on separate causal pathways that are d-separated given $A$ and $C$ (briefly, this means that in the causal graph relating $C$, $A$, $M^1, \dots, M^K$, any path of adjoining arrows starting from $M^j$ and ending at $M^k$ that does not pass through $Y$ must intersect either $A$ or $C$), then $Y_{A,m^k} = Y_{A,m^1,\dots,m^K}$, where $m^j = M^j$ for $j \neq k$, $M^j$ being the observed value of the $j$th mediator under no intervention, and $Y_{0,M^k} = Y_{0,m^1,\dots,m^K}$, where $m^k = M^k$ and $m^j = M^j_0$ for $j \neq k$, $M^j_0$ being the value that the $j$th mediator would take under an intervention that sets the exposure to 0. A summary of all of this notation is given in Summary Box 1.
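As is usual with causal inference using the potential outcomes framework, we make the Stable Unit Treatment Value Assumption (SUTVA): 9 the relationship between potential and observed outcomes satisfies 'consistency', which implies that an individual's observed outcome and mediators coincide with the potential outcomes under the exposure actually received (for example, $Y = Y_a$ and $M^k = M^k_a$ when $A = a$). In addition, mediation analysis requires some technical conditional independence assumptions that we will list as needed in the following sections.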

The PAF
Despite being an intrinsically causal idea, attributable fractions were originally defined via conditional probabilities in a non-causal framework, which has contributed to confusion regarding what they really purport to measure. Here, we will instead use a definition based on potential outcomes that is now becoming more prominent: 10,11

$$\mathrm{PAF}_{total} = \frac{P(Y = 1) - P(Y_0 = 1)}{P(Y = 1)} \qquad (1)$$

Thinking of $Y_0$ as the potential outcome for a random individual in a population in which no one was exposed to the risk factor, Equation (1) can be directly interpreted as the relative change in disease prevalence if that risk factor was absent from the population.
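As a small numerical illustration of this potential-outcomes definition, the Python sketch below computes the relative change in prevalence from two inputs: the current prevalence P(Y = 1) and the prevalence P(Y_0 = 1) in the hypothetical population with the exposure eliminated. The function name and the numbers are hypothetical and chosen only for illustration; in practice P(Y_0 = 1) is not observed and has to be identified from data under causal assumptions.

```python
def paf_total(p_y: float, p_y0: float) -> float:
    """Relative change in disease prevalence if the exposure were eliminated.

    p_y  : current disease prevalence, P(Y = 1)
    p_y0 : prevalence with the exposure eliminated, P(Y_0 = 1)
    """
    if not 0.0 < p_y <= 1.0:
        raise ValueError("p_y must lie in (0, 1]")
    if not 0.0 <= p_y0 <= 1.0:
        raise ValueError("p_y0 must lie in [0, 1]")
    return (p_y - p_y0) / p_y


# Hypothetical values: prevalence falls from 10% to 9% without the exposure.
example_paf = paf_total(0.10, 0.09)
print(f"PAF = {example_paf:.2f}")
```

The same ratio form is reused for the pathway-specific quantities; only the hypothetical population in the numerator changes.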

The PS-PAF
Whereas $\mathrm{PAF}_{total}$ measures the total disease burden attributable to a risk factor, the PS-PAF measures disease burden attributable to a risk factor → mediator → outcome pathway. We will give three differing causal definitions for PS-PAFs with interventional, mechanistic and separable interpretations. We start by defining the interventional pathway-specific PAF, which is most generally applicable and, as shown later, reduces to the mechanistic and separable forms when additional assumptions are met.
For a binary risk factor $A \in \{0,1\}$, the interventional PS-PAF for the mediating pathway $A \to M^k \to Y$ is defined as:

$$\mathrm{PAF}^{I}_{A \to M^k \to Y} = \frac{P(Y = 1) - P(Y_{A,G^k_{0|C}} = 1)}{P(Y = 1)} \qquad (2)$$

where the random variable $G^k_{0|C}$ is a random draw from the conditional distribution of $M^k_0$ given an individual's covariates, $C$. Marginally, $G^k_0$ (simulated via first randomly sampling an individual from the population and using their covariates, $C$, to generate $G^k_{0|C}$) is a draw from the population distribution of $M^k_0$, i.e. the distribution that $M^k$ would have in a hypothetical world in which the risk factor was eliminated. Noting this, the PS-PAF can then be interpreted in terms of a population intervention on $M^k$, where its distribution is shifted to the distribution that would be observed in a population in which the risk factor $A$ was eliminated, with other quantities in the population (covariates $C$, exposure $A$ and other mediators $M^j$, $j \neq k$, which are not causally downstream of $M^k$) remaining unchanged. This idea of randomly assigned interventions has previously been introduced to estimate versions of natural direct and indirect effects (sometimes termed interventional direct and indirect effects) in the presence of exposure-induced confounding. 7,12

One might also want to compare the disease burden attributable to pathways that are unknown (or involve unobserved mediators). To do this, we will adapt the concept of the direct PAF proposed by Sjölander 6 as follows:

$$\mathrm{PAF}^{M^1,\dots,M^K}_{A \to Y} = \frac{P(Y = 1) - P(Y_{0,M^1,\dots,M^K} = 1)}{P(Y = 1)} \qquad (3)$$

We will refer to Equation (3) as the PS-PAF for the direct pathway $A \to Y$. Equation (3) measures the disease burden attributable to direct pathogenic pathways from $A$ to $Y$, i.e. any pathway from $A$ to $Y$ (that may or may not be known) excluding the set of mediating pathways under consideration. See Table 1 for a summary of the interpretations of Equations (1), (2) and (3).
It is instructive to compare the interpretations of direct and PS-PAFs in a more concrete setting. In the introduction we informally illustrated PS-PAFs in the context of the effect of a dietary substitution (red meat being completely replaced by plant-based proteins). In this case the PS-PAF through blood pressure could be informally defined as the relative change in disease prevalence in a world in which the distribution of blood pressure was somehow altered to match the distribution expected under such a population-level substitution but without changing the dietary habits of the population. In contrast, the direct PS-PAF would imagine the relative change in disease prevalence in a second world in which the dietary substitution was rigorously implemented at a population level but the effect of this intervention on blood pressure as well as on other known mediators (e.g. body mass index) was disabled.

The definition by Sjölander 6 is very similar except that it refers to the direct pathway with reference to a single mediating pathway, $k$, and will change dependent on that mediating pathway as follows:

$$\mathrm{PAF}^{M^k}_{A \to Y} = \frac{P(Y = 1) - P(Y_{0,M^k} = 1)}{P(Y = 1)} \qquad (4)$$

In the previous example, this direct definition (with reference to blood pressure) compares prevalence in a world in which the dietary substitution was implemented and, as before, the effect of this intervention on blood pressure is disabled, but with the values of other known mediators of the diet/CVD relationship (where CVD is cardiovascular disease) now affected by the intervention. In the case that there is only a single mediating pathway, Equations (3) and (4) will coincide.

Summary Box 1 Potential outcome notation used in the manuscript
$C$: Known baseline confounders
$A \in \{0,1\}$: Binary exposure or risk factor
$Y \in \{0,1\}$: Binary indicator for disease

Identifiability conditions
As is usual in mediation analysis, causal identifiability conditions need to hold to estimate Equation (2) in an unbiased way from data (the data comprising observed values of covariates, exposure, mediator and response sampled from some population). In particular, Conditions (i) and (ii) below are necessary to identify PS-PAFs for each mediator $M^k$ of interest:

i. $M^k_0 \perp\!\!\!\perp A \mid C$ (i.e. conditional on covariate strata, the potential outcome for the mediator, under elimination of the risk factor, and the natural value of the risk factor should be independent random variables. This condition can be informally understood as associations between the risk factor and mediator having causal interpretations within strata of covariates).

ii. $Y_{a,m^k} \perp\!\!\!\perp M^k \mid A = a, C$ (i.e. conditional on covariate and risk factor strata, potential outcomes for the response, under various assignments for risk factor and the $k$th mediator, and the natural value of $M^k$ should be independent. Informally, this can be understood as associations between the $k$th mediator and outcome having causal interpretations within joint strata of risk factor and covariates).

An additional condition, Condition (iii), is necessary to identify $\mathrm{PAF}_{A \to Y}$. Note that under identifiability Conditions (i), (ii) and (iii), we show in the Supplementary material (available as Supplementary data at IJE online) that:

$$\mathrm{PAF}^{I}_{A \to M^k \to Y} = \frac{P(Y = 1) - E_{A,C}\left(E_{M^k|A=0,C}\left(P(Y = 1 \mid A, C, M^k)\right)\right)}{P(Y = 1)} \qquad (5)$$

and

$$\mathrm{PAF}_{A \to Y} = \frac{P(Y = 1) - E_{C,M^1,\dots,M^K}\left(P(Y = 1 \mid A = 0, C, M^1, \dots, M^K)\right)}{P(Y = 1)} \qquad (6)$$

where for generic functions $g_1$, $g_2$ and $g_3$, $E_{A,C}(g_1(A,C)) = \int_{a,c} g_1(a,c)\, dF_{A,C}(a,c)$ and $E_{C,M^1,\dots,M^K}(g_2(C,M^1,\dots,M^K)) = \int_{c,m^1,\dots,m^K} g_2(c,m^1,\dots,m^K)\, dF_{C,M^1,\dots,M^K}(c,m^1,\dots,m^K)$ represent expectations of the random variables $g_1(A,C)$ and $g_2(C,M^1,\dots,M^K)$, integrated over the marginal distributions $F_{A,C}$ and $F_{C,M^1,\dots,M^K}$ of the subscripted variables, and $E_{M^k|A=0,c}(g_3(M^k)) = \int_{m^k} g_3(m^k)\, dF_{M^k|A=0,C=c}(m^k)$ is an expectation of $g_3(M^k)$ integrated according to the conditional distribution of $M^k$ given $A = 0$ and $C = c$, $F_{M^k|A=0,C=c}$.
[Note also the shorthand that is used in the notation for conditional distributions of $Y$ and $M^k$, e.g. a probability like $P(Y = 1 \mid A, C, M^k)$ stands for the probability that $Y = 1$ given the risk factor takes the value $A$, confounders take the value $C$ and the $k$th mediator takes the value $M^k$, where $A$, $C$ and $M^k$ are thought of as random. When evaluating this probability at fixed values $C = c$, $M^k = m^k$ and $A = a$, we instead write: $P(Y = 1 \mid A = a, C = c, M^k = m^k)$.]

Table 1 Interpretation of the direct, pathway-specific and total PAFs
$\mathrm{PAF}^{M^1,\dots,M^K}_{A \to Y}$: Relative decrease in disease prevalence in hypothetical population in which risk factor has been eliminated but without affecting the distribution of any of the known mediators
$\mathrm{PAF}^{I}_{A \to M^k \to Y}$: Relative decrease in disease prevalence in hypothetical population in which $M^k$ is distributed as if the risk factor $A$ was eliminated but the distribution of $A$ itself is unaffected
$\mathrm{PAF}_{total}$: Relative decrease in disease prevalence in hypothetical population in which risk factor $A$ has been removed
$\mathrm{PAF}^{M^1,\dots,M^K}_{A \to Y}$ represents the direct PAF, $\mathrm{PAF}^{I}_{A \to M^k \to Y}$ represents the interventional version of the pathway-specific attributable fraction for the pathway $A \to M^k \to Y$ and $\mathrm{PAF}_{total}$ represents the total PAF. In each case the disease prevalence in the current population is compared with prevalence in a hypothetical population in which a particular intervention is enforced.
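To make the nested expectations concrete, the following Python sketch evaluates them exactly in a fully discrete toy population with one binary covariate and one binary mediator. It assumes the identified pathway-specific quantity has the form suggested by the operators defined above: an outer average over the joint distribution of (A, C) of an inner average of P(Y = 1 | A, C, M) over the distribution of M given A = 0 and C. The function name, the argument layout and all probabilities are hypothetical.

```python
from itertools import product


def identified_pafs(p_c, p_a1_given_c, p_m1_given_ac, p_y1_given_acm):
    """Return (pathway-specific PAF, direct PAF) for binary A and M.

    p_c            : dict c -> P(C = c)
    p_a1_given_c   : dict c -> P(A = 1 | C = c)
    p_m1_given_ac  : dict (a, c) -> P(M = 1 | A = a, C = c)
    p_y1_given_acm : dict (a, c, m) -> P(Y = 1 | A = a, C = c, M = m)
    """
    def p_a(a, c):
        return p_a1_given_c[c] if a == 1 else 1.0 - p_a1_given_c[c]

    def p_m(m, a, c):
        q = p_m1_given_ac[(a, c)]
        return q if m == 1 else 1.0 - q

    p_y = 0.0          # observed prevalence P(Y = 1)
    shifted = 0.0      # mediator redrawn from its distribution given A = 0, C
    direct_off = 0.0   # outcome model evaluated at A = 0, mediator as observed
    for c, a in product(p_c, (0, 1)):
        w_ac = p_c[c] * p_a(a, c)
        for m in (0, 1):
            p_y += w_ac * p_m(m, a, c) * p_y1_given_acm[(a, c, m)]
            shifted += w_ac * p_m(m, 0, c) * p_y1_given_acm[(a, c, m)]
            direct_off += w_ac * p_m(m, a, c) * p_y1_given_acm[(0, c, m)]
    return (p_y - shifted) / p_y, (p_y - direct_off) / p_y


# Hypothetical population: risk rises with exposure, mediator and covariate.
p_c = {0: 0.6, 1: 0.4}
p_a1 = {0: 0.3, 1: 0.5}
p_m1 = {(0, 0): 0.2, (1, 0): 0.5, (0, 1): 0.3, (1, 1): 0.6}
p_y1 = {(a, c, m): 0.02 + 0.02 * a + 0.03 * m + 0.01 * c
        for a, c, m in product((0, 1), repeat=3)}

ps_paf, direct_paf = identified_pafs(p_c, p_a1, p_m1, p_y1)
print(f"pathway-specific PAF = {ps_paf:.3f}, direct PAF = {direct_paf:.3f}")
```

Two sanity checks follow from the formulas: if the mediator distribution does not depend on A, the pathway-specific quantity is zero, and if the outcome probability does not depend on A given C and M, the direct quantity is zero.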

Mechanistic PS-PAFs and disabling pathways
Under an extra identifiability condition, the pathway-specific PAF can also be expressed in a mechanistic form where the mediator assigned to an individual (within the hypothetical population where the distribution of the mediator is altered) is the mediator that would result for that individual under no exposure to the risk factor:

$$\text{mechanistic PS-PAF} = \frac{P(Y = 1) - P(Y_{A,M^k_0} = 1)}{P(Y = 1)}$$

This final condition relates the potential outcomes $Y_{a,m^k}$ and $M^k_0$. It is less intuitive than Conditions (i), (ii) and (iii) described in the preceding section and involves consideration of cross-world counterfactuals (i.e. if $a = 1$, then $Y_{a,m^k}$ and $M^k_0$ would never be observed on the same individual). However, as shown in the Supplementary material (available as Supplementary data at IJE online), this condition does hold in a non-parametric structural equations model provided there is no post-treatment confounding of the mediator-outcome relationship.

The mechanistic pathway-specific PAF can be thought of as the relative change in disease burden in a hypothetical population in which the mediated pathway $A \to M^k \to Y$ is disabled. For example, in a simple setting in which there is only a single known mediator, $M$, there are two potential pathways by which the risk factor affects disease, represented by the pathways $A \to M \to Y$ and $A \to Y$ in Figure 2a. The total PAF (which refers to a population in which a binary risk factor was eliminated) corresponds to the disabling of both pathways, i.e. the comparison is of disease risk in the populations with causal graphs shown in the left-hand and right-hand panes of Figure 2a. In contrast, the direct PAF involves a comparison of disease risk in the current population with the hypothetical population in which the direct pathway is disabled (Figure 2b), whereas the pathway-specific PAF represents a comparison of the current population with a hypothetical population in which the pathway $A \to M \to Y$ is disabled (Figure 2c).
Note that in each case, current disease risk is compared with disease risk in some hypothetical population in which a particular pathway (or pathways) has been disabled.

Alternative framework using separable paths
An alternative interpretation of pathway-specific PAFs may be obtained using the separable paths framework detailed in Section 5 of Robins and Richardson. 13 Under this framework for mediation, we imagine that the effect of a treatment $A$ can be split into two (or more) components, each component representing differing mechanisms by which $A$ might affect $Y$. For example, in the directed acyclic graph (DAG) represented by Figure 3, component $A_{M^k}$ only affects $Y$ through its direct relationship with the mediator $M^k$.

If it is plausible that the effect of increased physical activity on blood pressure could be replicated by taking an antihypertensive pill, with the pill having no direct effect on stroke that is not mediated through blood pressure, assignment of the antihypertensive pill may represent an intervention that sets $A_{M^k} = 0$ independently of the value of $A_{O^k}$ and the PS-PAF could be defined as in Equation (8) below. In contrast, there is no obvious intervention replicating the effect of physical activity on waist-hip ratio with no causal effect on stroke that is not mediated through the waist-hip ratio. As a result, defining the PS-PAF for paths from physical inactivity through the waist-hip ratio via Equation (8) may not be appropriate. More formally, the separable PS-PAF is defined as:

$$\text{separable PS-PAF} = \frac{P(Y = 1) - P(Y = 1 \mid do(A_{M^k} = 0))}{P(Y = 1)} \qquad (8)$$

where here we use the do notation, popularized by Pearl, 14 to represent the 'interventional' distribution where $A_{M^k}$ is set to 0 in the population. In analogy with how attributable fractions are usually defined, the above represents the relative change in the probability of disease in a population in which the 'component' risk factor $A_{M^k}$ was set to 0 but with the distribution of $A_{O^k}$ remaining unchanged (effectively disabling the pathway $A \to M^k \to Y$).
More specifically, suppose that risk factor $A \in \{0,1\}$ is divisible into separable components $A_{M^k}$ and $A_{O^k}$, with causal DAG as represented in Figure 3. Assuming that $0 < P(A = 0 \mid C = c)$ for all possible values of the covariate vector $c$, we prove in the Supplementary appendix (available as Supplementary data at IJE online) that the separable PS-PAF is identified by the same expression as the interventional PS-PAF, indicating that the same identification formula results under interventional, mechanistic and separable interpretations. These three interpretations are compared in Table 2.

PS-PAFs vs indirect PAFs
In the context of a single mediating pathway, Sjölander 6 introduced the concept of indirect PAFs:

$$\mathrm{PAF}_{Indirect,M^k} = \mathrm{PAF}_{total} - \mathrm{PAF}^{M^k}_{A \to Y} = \frac{P(Y_{0,M^k} = 1) - P(Y_0 = 1)}{P(Y = 1)}$$

defined so that it sums with $\mathrm{PAF}^{M^k}_{A \to Y}$, in Equation (4), to the total PAF. Rather than comparing disease risk in the current and hypothetical populations, this definition implicitly compares disease risk in two hypothetical populations: one in which the direct pathway $A \to Y$ has been disabled, with a second in which the direct and mediated pathways are both disabled (see Figure 4).

Table 2 Interventional, mechanistic and separable interpretations of the PS-PAF

Interventional PS-PAF
Interpretation: Relative decrease in disease prevalence in hypothetical population in which $M^k$ is distributed as if the risk factor $A$ was eliminated (given covariates $C$) but the distribution of $A$ itself is unaffected
+Ve: Most generally applicable (identifiable under the most general conditions). Note that mechanistic and separable PS-PAFs are also interventional PS-PAFs
-Ve: Perhaps least interpretable

Mechanistic PS-PAF
Interpretation: Relative decrease in disease prevalence in hypothetical population in which the mediating pathway $A \to M^k \to Y$ is disabled at an individual level
+Ve: Interpretation in terms of disabling mediating pathway at an individual level
-Ve: Identification requires cross-world assumptions

Separable PS-PAF
Assumption: $A$ can be represented as the composition of two independently manipulable variables, $A_{M^k}$ and $A_{O^k}$. The intervention setting $A_{M^k} = 0$ is equivalent to disabling the mediating pathway $A \to M^k \to Y$ (see Figure 3)
Interpretation: The PAF corresponding to an intervention that sets the component $A_{M^k}$ of $A$ to 0 at a population level. Treating $A_{M^k}$ as a risk factor in its own right, this is a standard attributable fraction
+Ve: Interpretation as a regular attributable fraction
-Ve: Limited applicability due to separability requirement for $A$
An indirect PAF will usually be smaller than the corresponding pathway-specific PAF in a one-mediator situation, as it is likely that some disease cases in the current population that are exposed to the risk factor might be equally well prevented by either eliminating the effect of the direct pathway (i.e. perhaps $Y_{1,M_1} = 1$, but $Y_{0,M_1} = 0$) or eliminating the mediated pathway (i.e. perhaps $Y_{1,M_1} = 1$, but $Y_{1,M_0} = 0$). The prevention of such disease cases would contribute to the pathway-specific PAF but not to the indirect PAF. See Table 3 for a comparison of the differences between pathway-specific and indirect population attributable fractions.
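The contrast between the two metrics can be checked in a toy structural model. The Python sketch below assumes a single binary mediator, no covariates, an exposure assigned independently of the error terms, and independent errors for the mediator and outcome equations, so that cross-world risks such as P(Y_{A,M_0} = 1) factorize into products of P(M_a = m) and P(Y_{a,m} = 1). The function name and all probabilities are hypothetical.

```python
def pathway_pafs(p_a1, q, r):
    """Total, direct, indirect and pathway-specific PAFs in a toy model.

    p_a1 : P(A = 1)
    q    : dict a -> P(M_a = 1), the potential mediator distribution
    r    : dict a -> dict m -> P(Y_{a,m} = 1), the potential outcome risks
    """
    p_a = {0: 1.0 - p_a1, 1: p_a1}

    def p_m(m, a):
        return q[a] if m == 1 else 1.0 - q[a]

    def risk(a_outcome, a_mediator):
        # P(Y_{a_outcome, M_{a_mediator}} = 1) under independent errors
        return sum(p_m(m, a_mediator) * r[a_outcome][m] for m in (0, 1))

    p_y = sum(p_a[a] * risk(a, a) for a in (0, 1))           # current population
    p_y0 = risk(0, 0)                                        # both pathways disabled
    p_med_off = sum(p_a[a] * risk(a, 0) for a in (0, 1))     # Y_{A, M_0}
    p_dir_off = sum(p_a[a] * risk(0, a) for a in (0, 1))     # Y_{0, M}
    return {
        "total": (p_y - p_y0) / p_y,
        "direct": (p_y - p_dir_off) / p_y,
        "indirect": (p_dir_off - p_y0) / p_y,
        "ps": (p_y - p_med_off) / p_y,
    }


# Hypothetical model in which both pathways raise disease risk.
res = pathway_pafs(
    p_a1=0.5,
    q={0: 0.2, 1: 0.6},
    r={0: {0: 0.05, 1: 0.10}, 1: {0: 0.10, 1: 0.30}},
)
for name, value in res.items():
    print(f"{name:>8}: {value:.3f}")
```

With these inputs the pathway-specific PAF (about 0.29) exceeds the indirect PAF (about 0.07), and the pathway-specific and direct PAFs together exceed the total PAF, which matches the non-additivity described in the introduction.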

Estimating PS-PAFs
We will denote values for the risk factor $A_i \in \{0,1\}$, covariate vector $C_i$ and mediators $M^1_i, \dots, M^K_i$ for each individual $i \le N$ in the data set. Estimation in cohort and cross-sectional studies is more straightforward than in case-control studies since the components of Equations (5) and (6) can in theory be consistently estimated using empirical conditional distributions for the observed variables in our data. In contrast, case-control studies cannot be regarded as random or representative samples from the overall population and the corresponding estimated empirical distributions will be biased for their population counterparts. However, if the prevalence of disease, $p$, is known and the cases and controls are randomly selected from their source populations, one can weight the contributions of each individual $i \le N$ so that the reweighted sample is representative of the source population and consistent estimation of distributional quantities from the population is feasible. If the case:control matching ratio is $1{:}r$ for some $r \ge 1$, this implies weights $w_i = 1$ for cases and $w_i = (1/p - 1)/r$ for controls. For cohort and cross-sectional studies, we define $w_i = 1$ for all $i \le N$. We first describe the estimator when $M^k$ is continuous. We suppose that the researcher specifies and estimates models for $P(Y = 1 \mid A = a, C = c, M^k = m)$ and $E(M^k \mid A = a, C = c)$ (perhaps using the reweighted data set in case-control scenarios). For each individual $i$,

Then our estimator for PS-PAF is:
If $M^k$ is discrete, having a finite number of levels $\mathcal{M}^k = \{m_1, \dots, m_{n_k}\}$, we need to estimate the probabilities $P(M^k = m \mid A_i = 0, C_i)$ for each $m \in \mathcal{M}^k$, which then enter the estimator for the PS-PAF. In the case in which there are multiple mediators, an inconvenience in applying the above estimators for differing mediators $M^1, \dots, M^K$ is the need for separate outcome models, one for each mediator. Here, we assume the mediators lie on separate causal pathways that are d-separated given $A$ and $C$ (or alternatively $M^1, \dots, M^K$ are mutually conditionally independent given $A$ and $C$). In that case, letting $M^{\neq k}$ be the set of mediators excluding $M^k$, a single outcome model $P(Y = 1 \mid A = a, C = c, M^1 = m^1, \dots, M^K = m^K)$ can be used, suggesting alternative estimators that involve $M^{\neq k}_i$, the set of mediator values for individual $i$ excluding mediator $k$. In the alternative case in which mediators lie on the same causal pathway (with some mediators acting as confounders for other mediator-outcome relationships), Condition (ii) will be violated for those $M^k$ subject to confounding by other mediators, and alternative estimation approaches that are described in the Supplementary material (available as Supplementary data at IJE online) in the section titled "Identification of interventional pathway-specific population attributable fractions (PS-PAF) with more general structural dependence between multiple mediators involving post treatment confounding" need to be used. Justifications of these estimators, including the modelling assumptions necessary for consistent estimation, are given in the Supplementary material (available as Supplementary data at IJE online). Since these estimators combine models for $M^k$ and $Y$, analytic formulae for standard errors are difficult to derive.

Note to Table 3: Given that the interventional PS-PAF is most generally applicable (i.e. the PS-PAF can always be interpreted in this way), we focus on the interventional PS-PAF in this comparison.

Figure 4 An indirect population attributable fraction compares disease risk $P(Y_{0,M} = 1)$ in a hypothetical population with the direct pathway disabled with the disease risk $P(Y_{0,M_0} = 1)$ in a hypothetical population in which both direct and mediated pathways have been disabled
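For a discrete mediator, one way to turn the nested-expectation idea into a computation is a weighted plug-in average. The Python sketch below is an illustration under strong simplifying assumptions and is not claimed to be the authors' exact estimator: it replaces the fitted regression models with saturated cell frequencies (so it only suits a few discrete covariate strata), takes the weights w_i as given, and uses hypothetical function names and data. For each individual it averages the estimated P(Y = 1 | A_i, C_i, M = m) over the estimated P(M = m | A = 0, C_i), and compares the weighted total with the weighted number of observed cases.

```python
from collections import defaultdict


def ps_paf_plugin(rows):
    """Plug-in PS-PAF from rows of (a, c, m, y, w) with discrete c and m."""
    rows = list(rows)
    w_acm = defaultdict(float)    # total weight in cell (a, c, m)
    wy_acm = defaultdict(float)   # weighted cases in cell (a, c, m)
    w_c0 = defaultdict(float)     # total weight with A = 0 in stratum c
    w_c0m = defaultdict(float)    # total weight with A = 0, C = c, M = m
    for a, c, m, y, w in rows:
        w_acm[(a, c, m)] += w
        wy_acm[(a, c, m)] += w * y
        if a == 0:
            w_c0[c] += w
            w_c0m[(c, m)] += w

    levels = sorted({m for _, _, m, _, _ in rows})
    observed = 0.0        # weighted number of observed cases
    counterfactual = 0.0  # weighted expected cases with the mediator shifted
    for a, c, _, y, w in rows:
        observed += w * y
        if w_c0.get(c, 0.0) == 0.0:
            raise ValueError(f"no unexposed individuals in stratum {c!r}")
        for m_star in levels:
            p_m = w_c0m.get((c, m_star), 0.0) / w_c0[c]
            if p_m == 0.0:
                continue
            cell = w_acm.get((a, c, m_star), 0.0)
            if cell == 0.0:
                raise ValueError(f"empty cell {(a, c, m_star)!r}")
            counterfactual += w * p_m * wy_acm[(a, c, m_star)] / cell
    return 1.0 - counterfactual / observed


def expand(cells, weight=1.0):
    """Build rows from {(a, m): (n, cases)} in a single covariate stratum."""
    rows = []
    for (a, m), (n, cases) in cells.items():
        rows.extend((a, 0, m, 1 if i < cases else 0, weight) for i in range(n))
    return rows


# Hypothetical cohort-style data (all weights equal to 1).
demo_rows = expand({(0, 0): (15, 3), (0, 1): (5, 2), (1, 0): (5, 1), (1, 1): (15, 6)})
estimate = ps_paf_plugin(demo_rows)
print(f"plug-in PS-PAF = {estimate:.3f}")
```

In a case-control analysis the last element of each row would carry the design weight rather than 1, and a bootstrap over rows could be used for interval estimates, in line with the approach taken in the data example.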
In the data example given below, we instead calculate approximate confidence intervals for pathway-specific PAFs using the bootstrap (see Table 4). In contrast, estimating $\mathrm{PAF}_{A \to Y}$ is more straightforward and boils down to estimating $E_{C,M^1,\dots,M^K}(P(Y = 1 \mid A = 0, C, M^1, \dots, M^K))$. This can be achieved by estimating $P(Y = 1 \mid A = 0, C_i, M^1_i, \dots, M^K_i)$ for each individual and then averaging this quantity over individuals in the data, taking care to incorporate weighting under a case-control design. As an alternative, a double robust estimator for $E(P(Y = 1 \mid A = 0, C, M^1, \dots, M^K))$ can be derived using the same approaches as Sjölander describes. 6

Data example

INTERSTROKE 15 is a large international case-control study designed to quantify the contribution of established risk factors to stroke prevalence at a global level. Here we investigate how the effect of physical inactivity (PHYS: a two-level variable with levels 'mainly inactive' coded as 1 and 'mainly active' coded as 0) on incidence of first stroke may be mediated through waist-hip ratio (WHR) ([waist measurement (cm)]/[hip measurement (cm)]), the ratio between measured apolipoprotein-B and apolipoprotein-A (APOB) (these are proteins in the blood responsible for lipid metabolism) and prior clinical diagnosis of high blood pressure (HBP). We treat WHR and APOB as continuous variables and HBP as binary. Covariates and assumed mediators are as assumed in the causal structure shown in Figure 5.

Notes to Table 4: $\mathrm{PAF}_{A \to M^k \to Y}$ represents the pathway-specific population attributable fraction (PS-PAF), $\mathrm{PAF}^{M^1,\dots,M^K}_{A \to Y}$ the direct PAF and $\mathrm{PAF}_{Indirect,M^k}$ the indirect PAF. PHYS codes for physical inactivity (yes or no), HBP for diagnosed hypertension (yes or no), WHR for waist-hip ratio (measured continuously) and APOB for the ratio apo-lipoprotein A/apo-lipoprotein B (again measured continuously). Sjölander's form of the direct PAF, $\mathrm{PAF}^{M^k}_{A \to Y}$, is also given for comparison purposes. Estimated 95% CIs (estimate ± 1.96 × SE) are shown in brackets. Standard errors (SEs) were estimated using 200 bootstrap iterations. PAF estimates are rounded to two significant digits.
To estimate Sjölander's direct and indirect attributable fractions, and the PS-PAFs described above, we fit a main-effects logistic regression predicting stroke status as a function of age, sex, region, education, healthy-eating score, self-reported stress levels, smoking status, alcohol use, PHYS, WHR, APOB and HBP, with the terms for WHR and APOB entering as natural cubic splines with 5 degrees of freedom to ensure sufficient flexibility to model the speculated non-linear relationships between these mediators and stroke. The covariates were chosen with the hope that relationships between mediators (APOB, WHR and HBP) and the outcome might be close to unconfounded conditional on physical activity and the other covariates (as required in Condition (ii)). Stroke controls were upweighted by a factor of 284 to reflect a yearly incidence of first stroke of 0.0035, or 3.5 strokes per 1000 individuals per year, estimated via data from the Global Burden of Disease. 16 Models for each mediator are conditioned on age, sex, region, education, healthy-eating score, stress levels, smoking status, alcohol use and PHYS. These covariates (excluding PHYS) are assumed to act as a sufficient adjustment set to deconfound the relationship between PHYS and each mediator, with the hope that this set of covariates satisfies Condition (i).
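The upweighting factor quoted above can be reproduced from the weight formula given in the estimation section, (1/p - 1)/r for controls, where p is the assumed disease prevalence (here the yearly incidence of 0.0035) and r is the number of controls per case. The short Python sketch below performs this arithmetic; the function name is hypothetical, and r = 1 is an assumption consistent with the reported factor.

```python
def control_weight(prevalence: float, controls_per_case: float = 1.0) -> float:
    """Weight for each control so the reweighted sample mimics the population.

    Cases keep weight 1; controls receive (1/p - 1)/r.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    if controls_per_case < 1.0:
        raise ValueError("controls_per_case must be at least 1")
    return (1.0 / prevalence - 1.0) / controls_per_case


# Yearly first-stroke incidence of 0.0035 with one control per case (assumed).
control_w = control_weight(0.0035)
print(f"control weight = {control_w:.1f}")
```

The result is a little under 285, which agrees with the factor of 284 reported in the text once the fractional part is dropped.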
In this example, the mediating pathways and direct pathway due to physical inactivity (here $A = 1$ represents individuals who are physically inactive and $A = 0$ represents individuals who are physically active) are likely to all increase the probability of stroke, and one would expect PS-PAFs to be somewhat larger than the corresponding indirect PAFs. This pattern is seen in Table 4, with the estimated PS-PAFs all being larger than the corresponding indirect PAFs. As an example, the PS-PAF for HBP, which is interpreted as the relative decrease in disease prevalence if one disabled the physical activity → HBP → stroke pathway, is estimated to be 4.2% (one might informally interpret this as the percentage of stroke burden that is attributable to this pathway). The indirect PAF, interpretable as the relative decrease in disease prevalence associated with disabling the same mediating pathway, but now subsequent to disabling all pathways from physical inactivity to stroke that do not involve HBP, is estimated to be 2.5%. Similarly, the estimated PS-PAFs through APOB (2.5%) and waist-hip ratio (2.8%) are larger than the corresponding indirect PAFs (1.6% and 1.7%, respectively). The direct PAF, $\mathrm{PAF}^{M^1, \ldots, M^K}_{A \to Y}$ (defined in Equation (3)), is estimated as 34% and represents the relative decrease in disease prevalence if all pathways not mediated by HBP, APOB or WHR were disabled. In Table 4, we also report estimates for Sjölander's version of the direct PAF, $\mathrm{PAF}^{M^k}_{A \to Y}$, given by Equation (4), which changes depending on the mediating pathway (essentially this estimates disease burden through all other pathways except through $M^k$). In summary, this analysis suggests that the population disease burden for stroke attributable to physical inactivity partially depends on the mediating pathways through blood pressure, WHR and APOB but depends mostly on other (unknown) mechanisms, as exemplified by the large direct PAF.
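The averaging estimator described earlier for the direct PAF can be sketched as follows. The sketch assumes the direct PAF takes the form 1 − E[P(Y = 1 | A = 0, C, M)]/P(Y = 1), which is our reading of the text since Equation (3) is not reproduced here, and it takes the per-individual predicted risks as given rather than fitting the logistic model. All numbers other than the control weight of 284 are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average, used to undo case-control sampling."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def direct_paf(outcomes, risk_if_unexposed, weights):
    """Assumed form of the direct PAF: 1 - E[P(Y=1 | A=0, C, M)] / P(Y=1).

    outcomes          -- observed 0/1 disease indicators
    risk_if_unexposed -- model predictions with A set to 0 and each
                         individual's own covariates and mediators retained
    weights           -- case-control weights (cases 1, controls upweighted)
    """
    prevalence = weighted_mean(outcomes, weights)
    counterfactual = weighted_mean(risk_if_unexposed, weights)
    return 1.0 - counterfactual / prevalence


# Hypothetical toy data: two cases followed by four controls.
outcomes = [1, 1, 0, 0, 0, 0]
risk_if_unexposed = [0.004, 0.002, 0.001, 0.001, 0.002, 0.001]
weights = [1, 1, 284, 284, 284, 284]

paf_direct = direct_paf(outcomes, risk_if_unexposed, weights)
print(round(paf_direct, 3))  # 0.287
```

In a real analysis the predicted risks would come from the fitted outcome model, and the prevalence could alternatively be estimated by averaging fitted values rather than observed outcomes.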
As with any causal analysis, these tentative conclusions depend jointly on correct modelling of conditional probability distributions and on the validity of the causal identifiability assumptions listed earlier.
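The interval construction reported for Table 4 (estimate ± 1.96 × SE, with SEs from 200 bootstrap iterations) can be sketched as below. This is a minimal illustration under stated assumptions: resampling is done separately within cases and controls, which is our choice and not necessarily the authors' scheme; the predicted risks are treated as fixed, whereas a full analysis would refit all models within each bootstrap replicate; and the data are simulated placeholders.

```python
import random
import statistics


def direct_paf(records):
    """Assumed direct PAF, 1 - E[P(Y=1 | A=0, C, M)] / P(Y=1), from
    (outcome, predicted risk with A=0, weight) triples."""
    weighted_cases = sum(w * y for y, _, w in records)
    weighted_risk = sum(w * r for _, r, w in records)
    return 1.0 - weighted_risk / weighted_cases


def stratified_resample(records, rng):
    """Resample cases and controls separately, keeping both group sizes fixed."""
    cases = [rec for rec in records if rec[0] == 1]
    controls = [rec for rec in records if rec[0] == 0]
    return ([rng.choice(cases) for _ in cases]
            + [rng.choice(controls) for _ in controls])


def bootstrap_ci(records, estimator, n_boot=200, seed=2024, z=1.96):
    """Return (estimate, SE, lower, upper) using a normal approximation."""
    rng = random.Random(seed)
    estimate = estimator(records)
    replicates = [estimator(stratified_resample(records, rng))
                  for _ in range(n_boot)]
    se = statistics.stdev(replicates)
    return estimate, se, estimate - z * se, estimate + z * se


# Simulated placeholder data: 50 cases (weight 1) and 50 controls (weight 284).
data_rng = random.Random(0)
records = ([(1, data_rng.uniform(0.001, 0.004), 1.0) for _ in range(50)]
           + [(0, data_rng.uniform(0.0005, 0.003), 284.0) for _ in range(50)])

estimate, se, lower, upper = bootstrap_ci(records, direct_paf)
print(f"{estimate:.2f} ({lower:.2f}, {upper:.2f})")
```

Fixing the seed makes the interval reproducible; the same wrapper could be applied to any PS-PAF estimator that maps a data set to a single number.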

Discussion
In this paper we have introduced PS-PAFs (in interventional, mechanistic and separable forms) as a metric to measure the disease burden attributable to particular exposure-mediator pathways. Whereas at first sight this triplicate of definitions seems to add complexity, it should be noted that all three metrics are identified the same way in observational data (but under differing assumptions) and in some way represent a sequence of generalizations of separable PS-PAFs. For instance, in contrast to the interventional and mechanistic versions, the separable PS-PAF is identifiable without a counterfactual model and counterfactual notation is unnecessary in its definition. However, if we associate a typical counterfactual model to the causal DAG in Figure 3 . In other words, if the separable PS-PAF is identifiable then so is the mechanistic PS-PAF and the two will correspond. Similarly, if assumptions (i), (ii) and (iv) are true (the assumptions necessary to identify the mechanistic PS-PAF), then certainly assumptions (i) and (ii) are both true and the interventional PS-PAF can be identified (and since the identification formula is the same, they must again correspond). Indeed, as we show in the Supplementary material (available as Supplementary data at IJE online), the interventional PS-PAF is identifiable under more general structural relationships between the mediators (compared with the mechanistic and separable PS-PAFs), where certain mediators of the risk factor-outcome relationship act as confounders for the relationship between other mediators and the outcome. In contrast, the mechanistic and separable pathway-specific PAFs are not identifiable in situations involving post-treatment confounding. Although the least identifiable, the separable PS-PAF has perhaps the cleanest interpretation, both in terms of a PAF for disabling the mediating pathway and also as a regular PAF for $A_{M^k}$ (i.e. treating the separable component $A_{M^k}$ of $A$ as a risk factor).
Likewise the interpretation of the mechanistic PS-PAF (in terms of disabling the mediating pathway) is more palatable than the interventional PS-PAF (in terms of an intervention in which each individual is assigned a random value of the mediator from its distribution under elimination of the risk factor, conditional on their covariates).
PS-PAFs are related to the ideas of direct and indirect PAF recently proposed by Sjölander. 6 Whereas PS-PAFs have an interpretation that compares disease prevalence in the actual and hypothetical populations, indirect PAFs do not and are defined as the leftover disease burden after subtracting the direct PAF from the total PAF. As such, the interpretation of indirect PAFs is intrinsically linked to direct PAFs and is perhaps difficult to motivate in a clinical setting. For instance, in a situation in which the direct PAF and PS-PAF are both 100%, disease could be eliminated by disabling either the direct pathway or alternatively disabling the mediating pathway. Despite the possibility of eradicating disease by disabling the mediating pathway, the indirect PAF is defined as 0%. In a one-mediator situation, one way to understand the distinctions between these concepts is in terms of modified sequential attributable fractions, 4 i.e. attributable fractions that are constructed from disabling pathways in a particular order (as demonstrated in Figures 2 and 4), with the order in which pathways are disabled differing for PS-PAFs and indirect PAFs. In more detail, a (mechanistic) PS-PAF as given by Equation (7) can be interpreted as the relative change in disease burden from disabling a particular mediating pathway. The corresponding indirect PAF, Equation (9), is also associated with disabling that same mediating pathway but this time subsequent to disabling the direct pathway (see Figure 4). Since the effect of disabling both the direct and mediating pathways is equivalent to the effect of eliminating the risk factor, this effectively forces the additivity property that the total PAF is the sum of the direct and indirect PAFs.
Note that in general, sequential attributable fractions 19 are constructed to sum to some well-defined overall PAF, but usually the sequence corresponds to the hypothetical elimination of each of a group of risk factors in some order rather than disabling mediating pathways related to the effect of a particular risk factor in some order. Whereas this additivity property at first seems appealing, it is perhaps unnatural in the context of attributable fractions, where it is well recognized that the PAFs for differing risk factors may sum to more than the joint PAF and sometimes to >1. 20 The sufficient/component cause framework 21 gives a simple but enlightening explanation for this phenomenon. For particular individuals, a certain collection of risk factors (perhaps diet, stress and tobacco usage) might collectively lead to disease at a particular point in time, but the disease may not have occurred at that time if any of the risk factors were not present. The same logic implies that pathway-specific PAFs will tend to be larger than indirect PAFs, as illustrated in this manuscript, if the direct and indirect pathways act as independent disease-causing mechanisms, both of which need to be operational in particular individuals for disease to occur. If additivity to the total PAF is required while at the same time preserving the comparability of measured disease burden for direct and mediating pathways, it is possible to average sequential analogues of PS-PAFs to produce a kind of average PS-PAF that, when summed over differing mediating and direct pathways, equals the total PAF. This, however, is beyond the scope of the paper and as a quantity may be difficult to interpret.
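The sufficient/component cause argument can be made concrete with a small toy calculation. The sketch below assumes a hypothetical population with two binary risk factors in which disease occurs only when both are present; it is not based on any data from this paper. Eliminating either factor alone removes all disease, so each single-factor PAF is 100% and their sum exceeds the joint PAF.

```python
from itertools import product


def disease(a, b):
    """Toy sufficient-cause model: disease needs both factors to be present."""
    return int(a and b)


# Hypothetical population: 25 individuals in each of the four exposure patterns.
population = [(a, b) for a, b in product([0, 1], repeat=2) for _ in range(25)]


def prevalence(pop, remove_a=False, remove_b=False):
    """Disease prevalence, optionally after eliminating one or both factors."""
    cases = sum(disease(0 if remove_a else a, 0 if remove_b else b)
                for a, b in pop)
    return cases / len(pop)


def paf(pop, **removed):
    """Relative reduction in prevalence after eliminating the named factors."""
    observed = prevalence(pop)
    return (observed - prevalence(pop, **removed)) / observed


paf_a = paf(population, remove_a=True)
paf_b = paf(population, remove_b=True)
paf_joint = paf(population, remove_a=True, remove_b=True)

print(paf_a, paf_b, paf_joint)  # 1.0 1.0 1.0
print(paf_a + paf_b)            # 2.0
```

The same mechanism, applied to a direct and a mediating pathway rather than to two risk factors, is what allows a PS-PAF and a direct PAF to both be large while the indirect PAF stays small.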
Although, throughout this manuscript, we imagine that a mediator is observed at a point in time, often longitudinal exposure to the mediator (e.g. sustained HBP over several years due to a sedentary lifestyle) is an important contributor to disease risk. Whereas it is relatively easy to extend definitions of PS-PAFs to reflect counterfactual mediator trajectories, identifiability conditions are more complicated and longitudinal data on mediators and associated time-varying confounders are then necessary for estimation. Nevertheless, even if longitudinal data on mediator progression are lacking, as they are here, estimating PS-PAFs for differing pathways as calculated here can act as a proxy for a longitudinal calculation and still help in determining the dominant pathways by which a risk factor affects an outcome. For instance, the analysis here suggests that the pathways from physical inactivity through the waist-hip ratio, blood pressure and cholesterol may have reasonably similar contributions to disease burden, and suggests that much of the disease burden due to inactivity is explained by alternative pathways.
Although attributable fractions can measure the total disease burden associated with a risk factor, they are less useful for measuring the real-world impact of a public health intervention on that risk factor, since even successful health interventions usually only partially eliminate a risk factor and, in addition, cannot alter prior exposure history when cumulative exposure might also impact disease. As an example, rather than considering a hypothetical population in which smoking is eliminated, a realistic population-level intervention (such as increasing the tax on cigarettes) may result in a 5% decrease in the number of cigarettes consumed rather than total elimination of smoking. Impact fractions are generalized versions of attributable fractions that measure the reduction in disease prevalence associated with such a population intervention. 22 The ideas described here can easily be adapted to define and estimate pathway-specific impact fractions for such real-world interventions, which may characterize the dominant mechanisms by which the intervention affects disease burden. For example, the pathway-specific impact fraction for the pathway $A \to M^k \to Y$ could be defined by letting $G^k_{\mid C}$ represent a random variable having the population distribution of the mediator $M^k$ under the proposed intervention (simulated again conditional on an individual's covariate vector $C$) and replacing $G^k_{0 \mid C}$ with $G^k_{\mid C}$ in Equation (2).
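The contrast between an attributable fraction and an impact fraction can be illustrated with a short sketch. It computes the relative reduction in mean disease risk under two exposure shifts: complete elimination and a 5% reduction. The logistic risk model, its coefficients and the exposure values are all hypothetical and chosen only for illustration; this is the ordinary (total) impact fraction and not the pathway-specific version, which would additionally require the mediator models and Equation (2).

```python
import math


def risk(cigarettes_per_day, intercept=-4.0, slope=0.05):
    """Hypothetical logistic risk model; the coefficients are illustrative only."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * cigarettes_per_day)))


def mean_risk(exposures, shift=lambda x: x):
    """Average model-based risk after applying an exposure shift to everyone."""
    return sum(risk(shift(x)) for x in exposures) / len(exposures)


def impact_fraction(exposures, shift):
    """Relative reduction in mean risk under the shifted exposure distribution."""
    observed = mean_risk(exposures)
    return (observed - mean_risk(exposures, shift)) / observed


# Hypothetical exposure distribution (cigarettes per day).
exposures = [0, 0, 0, 0, 0, 0, 5, 10, 20, 40]

paf_elimination = impact_fraction(exposures, lambda x: 0.0)      # exposure removed
pif_five_percent = impact_fraction(exposures, lambda x: 0.95 * x)  # 5% reduction

print(f"elimination: {paf_elimination:.3f}, 5% reduction: {pif_five_percent:.3f}")
```

Because the intervention is passed in as a function, other realistic shifts (for example, a reduction applied only to heavy smokers) can be compared without changing the rest of the code.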

Ethics approval
No human patients were directly involved in this study. The research involves secondary analysis of anonymized data and ethics approval was not required.

Data availability
The INTERSTROKE data set used in the real data example is not publicly available. Simulated case-control data for INTERSTROKE risk factors based on a fitted Bayesian network model applied to the real INTERSTROKE data, as well as examples of calculating PS-PAFs using this simulated data, are available in the R-package graphPAF: https://github.com/johnfergusonNUIG/graphPAF.

Supplementary data
Supplementary data are available at IJE online.

Funding
This work was supported by the grant EIA-2017-017 from the Health Research Board, Ireland.