Covariate association eliminating weights: a unified weighting framework for causal effect estimation

Summary
Weighting methods offer an approach to estimating causal treatment effects in observational studies. However, if weights are estimated by maximum likelihood, misspecification of the treatment assignment model can lead to weighted estimators with substantial bias and variance. In this paper, we propose a unified framework for constructing weights such that a set of measured pretreatment covariates is unassociated with treatment assignment after weighting. We derive conditions for weight estimation by eliminating the associations between these covariates and treatment assignment characterized in a chosen treatment assignment model after weighting. The moment conditions in covariate balancing weight methods for binary, categorical and continuous treatments in cross-sectional settings are special cases of the conditions in our framework, which extends to longitudinal settings. Simulations show that our method gives treatment effect estimates with smaller biases and variances than the maximum likelihood approach under treatment assignment model misspecification. We illustrate our method with an application to systemic lupus erythematosus data.


Introduction
Weighting methods are widely used to estimate causal treatment effects. The propensity function, the conditional probability of receiving treatment given a set of measured pretreatment covariates (Imai & van Dyk, 2004), features prominently in weighting methods. A natural choice of weights is the ratio of the marginal probability of treatment assignment to the propensity function, henceforth referred to as the stabilized inverse probability of treatment weights (Robins, 2000; Robins et al., 2000). Despite the appeal of weighting methods, problems arise when the propensity function is unknown. Hence weights are usually constructed using the stabilized inverse probability of treatment weight structure with an estimated propensity function, often obtained by maximum likelihood (Imbens, 2000; Robins et al., 2000), although other methods have been proposed (Lee et al., 2010). However, because these estimation procedures do not directly aim at the goal of weighting, which is to eliminate the association between a set of measured pretreatment covariates satisfying the conditions in § 2·1 and treatment assignment after weighting, a slightly misspecified propensity function model can result in badly biased treatment effect estimates (Kang & Schafer, 2007). Recently, this problem has motivated new weighting methods that optimize covariate balance, the covariate balancing weights (Graham et al., 2012; Hainmueller, 2012; McCaffrey et al., 2013; Imai & Ratkovic, 2014, 2015; Zhu et al., 2015; Zubizarreta, 2015; Chan et al., 2016; Fong et al., 2018). These weights can dramatically improve the performance of weighting methods, but there is no framework for generalizing them to complex treatment types, such as semicontinuous or multivariate treatments, or to longitudinal settings.
In this paper, we introduce covariate association eliminating weights, a unified framework for constructing weights so that a set of measured pretreatment covariates will be unassociated with treatment assignment after weighting. Our method can be used to estimate causal effects for semicontinuous, count, ordinal, or even multivariate treatments, and it extends to longitudinal settings. An example of estimating the direct effect of a time-varying treatment on a longitudinal outcome is provided in § 8. Utilizing the generality of the propensity function and its capacity to characterize covariate associations with treatment assignment, we derive conditions for weight estimation by eliminating the association between the set of measured pretreatment covariates and treatment assignment specified in a chosen propensity function model after weighting, i.e., by solving the weighted score equations of the propensity function model at parameter values which indicate that the covariates are unassociated with treatment assignment.
Our method has several attractive characteristics. First, it encompasses existing covariate balancing weight methods and provides a unified framework for weighting with treatments of any distribution; see § 4. By eliminating the associations between the covariates and treatment assignment after weighting, our method can provide some robustness against misspecification of the functional forms of the covariates in a propensity function model, particularly if they are predictive of the outcome; see § 6. Second, it is clear from our framework what type of covariate associations are eliminated after weighting. For example, the covariate balancing weight method proposed in Fong et al. (2018) will only eliminate the associations between the covariates and the mean of a continuous treatment; see § 4·2. Our method can also eliminate the associations between the covariates and the variance of the continuous treatment. Third, our method extends to longitudinal settings; see § 8. In particular, apart from handling treatments of any distribution, it can accommodate unbalanced observation schemes and can incorporate a variety of stabilized weight structures. In contrast, to the best of our knowledge, the only available covariate balancing weight method for longitudinal settings, proposed by Imai & Ratkovic (2015), focuses on binary treatments in a balanced observation scheme, and it is not clear how to incorporate arbitrary stabilized weight structures in their approach. Finally, our method can be implemented with standard statistical software by solving a convex optimization problem that identifies minimum-variance weights subject to our conditions (Zubizarreta, 2015). This is especially appealing for nonbinary treatments with outliers (Naimi et al., 2014), because it protects against extreme weights which often lead to unstable treatment effect estimates in practice.

The propensity function
2·1 Definition and assumptions
Let X i , T i and Y i be respectively a set of measured pretreatment covariates, the possibly multivariate treatment variable and the outcome for the ith unit (i = 1, … , n) in a simple random sample of size n. Following Imai & van Dyk (2004), we define the propensity function as the conditional probability of treatment given the set of measured pretreatment covariates, i.e., pr(T i | X i ; β true ), where β true parameterizes this distribution. The parameter β true is assumed to be unique and finite-dimensional, and is such that pr(T i | X i ) depends on X i only through a subset of β true ; this is the uniquely parameterized propensity function assumption of Imai & van Dyk (2004). For example, if pr(T i | X i ) follows a regression model, then β true would include regression coefficients that characterize the dependence of T i on X i and intercept terms that describe the baseline distribution of T i . In addition, we make the strong ignorability of treatment assignment assumption (Rosenbaum & Rubin, 1983), also known as the unconfoundedness assumption, pr{T i | Y i (t P ), X i } = pr(T i | X i ), where Y i (t P ) is a random variable that maps a potential treatment t P to a potential outcome, and the positivity assumption (Imai & van Dyk, 2004), pr(T i ∈ A | X i ) > 0 for all X i and any set A with positive measure. Finally, the distribution of potential outcomes for one unit is assumed to be independent of the potential treatment value of another unit given the set of pretreatment covariates; this is the stable unit treatment value assumption. Throughout the paper, we make the above assumptions; otherwise our method may result in severely biased causal effect estimates, even compared with an unadjusted analysis. For example, when unconfoundedness holds without conditioning on X i , adjusting for X i can induce M-bias (Ding & Miratrix, 2015).

2·2 Covariate selection
We briefly review some methods for covariate selection. When the causal structure is known and represented by a directed acyclic graph, Shpitser et al. (2010) gave a complete graphical criterion, the adjustment criterion, to determine whether adjusting for a set of covariates ensures unconfoundedness. The adjustment criterion generalizes the back-door criterion of Pearl (1995), which is sufficient but not necessary for unconfoundedness. In the absence of knowledge about how covariates are causally related to each other, VanderWeele & Shpitser (2011) proposed the disjunctive cause criterion. This says that if any subset of pretreatment covariates suffices to ensure unconfoundedness, then the subset of pretreatment covariates that are causes of the treatment assignment and/or the outcome will also suffice.
Given that an adjustment set that ensures unconfoundedness has been identified, many researchers have proposed dimension reduction procedures to increase efficiency while maintaining unconfoundedness (de Luna et al., 2011;VanderWeele & Shpitser, 2011), or to minimize mean squared error (Vansteelandt et al., 2012). Broadly, these methods tend to remove from the adjustment set covariates that are unassociated with the outcome.

2·3 Stabilized inverse probability of treatment weighting
A popular approach to causal effect estimation is to weight each unit's data by stabilized inverse probability of treatment weights W i = W i (T i , X i ) = pr(T i )/pr(T i | X i ) . The idea is that if the propensity function is known, the propensity function after weighting by W i , pr*(T i | X i ), will be equivalent to pr(T i ) and hence does not depend on X i , as shown in the Supplementary Material. Here * denotes the pseudo-population after weighting. Under the assumptions in § 2·1, weighting by W i also preserves the causal effect of t P on E{Y i (t P )} in the original population (Robins, 2000;Zhang et al., 2016), and so the causal effect can be consistently estimated without adjusting for X i in the weighted data. For example, E{Y i (t P )} can be consistently estimated by modelling E(Y i | T i ) in the weighted data (Robins, 2000).
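As a numerical illustration of this idea (a minimal sketch with hypothetical simulated data, not taken from the paper): when the true propensity score is known, weighting by the stabilized inverse probability of treatment weights removes the association between the covariate and a binary treatment, so the weighted covariate means coincide across treatment groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One measured pretreatment covariate and a known true propensity score.
X = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * X)))   # pr(T_i = 1 | X_i)
T = rng.binomial(1, p)

# Stabilized inverse probability of treatment weights: pr(T_i)/pr(T_i | X_i).
pT = T.mean()                                 # estimate of marginal pr(T_i = 1)
W = np.where(T == 1, pT / p, (1 - pT) / (1 - p))

# After weighting, X should be (approximately) unassociated with T: the
# weighted means of X in the treated and control groups nearly coincide,
# whereas the unweighted means differ substantially.
m1 = np.average(X[T == 1], weights=W[T == 1])
m0 = np.average(X[T == 0], weights=W[T == 0])
print(m1 - m0)
```

The weighted difference shrinks toward zero as n grows, while the unweighted difference in covariate means stays away from zero; this is exactly the "no association after weighting" property the estimated-weight methods below try to recover when the propensity function is unknown.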

2·4 Maximum likelihood estimation
Estimating the weights by maximum likelihood involves specifying parametric models pr(T i ; α) and pr(T i | X i *; β), where X i * are functionals of elements in X i , and then estimating the unknown parameters α and β by solving the score equations

∑_{i=1}^n ∂ log pr(T i ; α)/∂α = 0,  ∑_{i=1}^n ∂ log pr(T i | X i *; β)/∂β = 0.

If pr(T i | X i *; β) and pr(T i ; α) are correctly specified, then the weights

Ŵ i = pr(T i ; α̂)/pr(T i | X i *; β̂),

where α̂ and β̂ are the maximum likelihood estimates of α and β, are asymptotically equivalent to pr(T i )/pr(T i | X i ). Thus weighting by Ŵ i will result in pr*(T i | X i ) being asymptotically equivalent to pr(T i ), i.e., T i does not depend on X i after weighting. Here pr*(T i | X i ) depends on α̂ and β̂ through the estimated weights.
However, when pr(T i | X i *; β) is misspecified, this estimation procedure will not only result in pr*(T i | X i ) diverging from pr(T i ), but will also fail to guarantee that the association between X i and T i is even reduced in the weighted data relative to the observed data. Researchers are therefore encouraged to check for the absence of this association in the weighted data before proceeding to causal effect estimation. For nonbinary treatments, correctly specifying the propensity function model will generally also entail correct specification of the distribution and of the dependence structure on covariates for higher-order moments of the treatment variable. Model misspecification is therefore likely to be a more serious problem for nonbinary treatments.

3·1 General framework
Maximum likelihood estimation indirectly aims to achieve the asymptotic equivalence of pr*(T i | X i ) and pr(T i ) by fitting a model pr(T i | X i ; β) for the propensity function. We instead propose to use weighting to directly eliminate the association between X i and T i characterized by a chosen propensity function model pr(T i | X i ; β) in the weighted data. When pr(T i | X i ; β) is misspecified, our method, in contrast to maximum likelihood estimation, will eliminate the association between X i and T i as characterized by pr(T i | X i ; β) after weighting. This is necessary for T i to be independent of X i after weighting. In the unlikely scenario that pr(T i | X i ; β) is correctly specified, maximum likelihood estimation will asymptotically eliminate the association between X i and T i after weighting, while our method will eliminate their association in finite samples.
We now formalize our ideas. Given a set of known weights W = (W 1 , … , W n ), we can fit a parametric propensity function model pr{T i | X i *; β(W)} to the data weighted by W by solving the score equations

∑_{i=1}^n W i ∂ log pr{T i | X i *; β(W)}/∂β(W) = 0, (1)

where β(W) is a vector of parameters. Here we write β(W) as a function of W because the resulting maximum likelihood estimates, β̂(W), will depend on W. We use the uniquely parameterized propensity function assumption in § 2·1 to partition β(W) into {β d (W), β b (W)}, where β d (W) are the unique parameters that characterize the dependence of T i on X i *, e.g., regression coefficients, and β b (W) are parameters that characterize the baseline distribution, e.g., the intercept terms. Here the subscripts 'd' and 'b' stand for dependence and baseline, respectively. Without loss of generality, we assume that β d (W) = 0 corresponds to T i being unassociated with X i * in the weighted data. The conditions for covariate association eliminating weights are then derived by inverting (1) such that the weights W satisfy

∑_{i=1}^n W i ∂ log pr{T i | X i *; β(W)}/∂β(W) |_{β d (W) = 0, β b (W) = α̂} = 0, (2)

where α̂ is obtained by fitting pr(T i ; α) to the observed data. Our method therefore sets the goal of weighting as attaining (i) β d (W) = 0, so that the covariates are unassociated with treatment assignment after weighting, and (ii) β b (W) = α̂, so that the marginal distribution of T i is preserved from the observed data, as described by pr(T i ; α̂). Here (ii) is a statement concerning the projection function, the treatment assignment distribution in the weighted data. The choice of projection function, as long as it does not depend on X i , does not affect the consistency of weighted estimators for causal treatment effects, although it does affect their efficiency (Peterson et al., 2010). In our method, the projection function may be altered by fixing β b (W) in (2) at values other than α̂; an example is provided in § 4·1.

Our method is linked to the use of regression to assess covariate balance, for example in matched data (Lu, 2005). Specifically, if applying regression to matched data indicates small associations between the covariates and treatment assignment, it would be reasonable to assume that the covariate distributions are approximately balanced across treatment levels. Inspired by this, we propose to invert this covariate balance measure for weight estimation such that it implies no imbalances in covariate distributions across treatment levels, i.e., β̂ d (W) = 0, after weighting.

3·2 Convex optimization for weight estimation
We consider a convex optimization approach in the spirit of Zubizarreta (2015), intended to estimate minimum-variance weights subject to the conditions (2) being satisfied. We formulate the estimation problem as a quadratic programming task and use the lsei function in the R (R Development Core Team, 2018) package limSolve (Soetaert et al., 2009) to obtain the optimal weights. Specifically, we solve the following quadratic programming problem for W:

minimize over W:  ∑_{i=1}^n (W i − 1)², (3)

subject to  ∑_{i=1}^n W i ∂ log pr{T i | X i *; β(W)}/∂β(W) |_{β d (W) = 0, β b (W) = α̂} = 0 (4)

and  ∑_{i=1}^n W i = n,  W i ≥ 0 (i = 1, … , n). (5)

The constraints (4) are the equations (2), and so should eliminate the associations between covariates and treatment assignment after weighting. They also preserve the marginal distribution of the treatment variable in the observed data. The first constraint in (5) ensures equality of the numbers of units in the weighted and observed data. Since this also ensures that the mean of W is unity, (3) minimizes the variance of W. Interestingly, since the weights in the observed data are ones, (3) can also be interpreted as identifying the least extrapolated data as characterized by the L 2 -norm. The second constraint in (5) requires that each element of W be nonnegative (Hainmueller, 2012; Zubizarreta, 2015). Allowing some elements of W to be zero, rather than requiring strict positivity, lets the estimation problem remain a convex quadratic programming problem. The estimation procedure can then remove units which contribute greatly to the variability of the weights, while forcing the remaining units to allow unbiased estimation of causal treatment effects (Crump et al., 2009). Following Zubizarreta (2015), it is possible to relax the strict equality constraints (4) to inequalities; the R function lsei has this option. This allows for less extrapolation at the expense of possibly introducing bias. For simplicity, we consider only the case where the strict equality constraints (4) are enforced.
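The core of this quadratic programme can be sketched in a few lines. The paper solves it with lsei in R's limSolve; the illustrative Python translation below uses hypothetical data and omits the nonnegativity constraint in (5), so that the equality-constrained problem has a closed-form solution via the KKT equations.

```python
import numpy as np

def min_variance_weights(A, b):
    """Minimize sum_i (W_i - 1)^2 subject to A @ W = b.

    Closed-form KKT solution of the equality-constrained quadratic
    programme: W = 1 + A^T (A A^T)^{-1} (b - A 1). The nonnegativity
    constraint W_i >= 0 of the full method is omitted in this sketch.
    """
    ones = np.ones(A.shape[1])
    lam = np.linalg.solve(A @ A.T, b - A @ ones)
    return ones + A.T @ lam

# Toy illustration: constrain the weighted sum of one covariate to zero
# while keeping the weights summing to n, as in constraint (5).
rng = np.random.default_rng(1)
n = 50
x = rng.normal(0.3, 1.0, size=n)
A = np.vstack([np.ones(n), x])        # rows: sum W = n, sum W_i x_i = 0
b = np.array([float(n), 0.0])
W = min_variance_weights(A, b)
print(np.allclose(A @ W, b))          # constraints satisfied -> True
```

Because the objective penalizes departures of each W i from 1, the solution stays as close to the unweighted sample as the constraints allow, which is the "least extrapolated data" interpretation above.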

4·1 Binary treatment
The moment conditions specified in existing covariate balancing weight methods (Imai & Ratkovic, 2014;Fong et al., 2018) are special cases of the conditions (2) after slight modifications. In this section we focus on binary treatments. The details for more general categorical treatments can be found in the Supplementary Material.
Unless stated otherwise in our derivations, we implicitly take β d (W) and β b (W) in (2) to be, respectively, the vector of regression coefficients of X i and the vector of the remaining parameters, including intercept terms, in the chosen propensity function model.
Let X i * = (1, X i T ) T . When T i is a binary treatment variable, i.e., T i = 1 if the ith unit received treatment and T i = 0 otherwise, the following covariate balancing conditions for the estimation of the propensity score have been proposed (Imai & Ratkovic, 2014):

∑_{i=1}^n W i T i X i * = ∑_{i=1}^n W i (1 − T i ) X i *. (6)

Because X i * includes 1, these covariate balancing conditions constrain the number of units in the treated and control groups to be equal in the weighted data. The other conditions for X i constrain the weighted means of each element of X i in the treated and control groups to be equal.
These conditions can be derived using our framework. Suppose that we specify a logistic regression model for the propensity score, pr{T i = 1 | X i ; β(W)} = 1/[1 + exp{−β(W) T X i *}], in the weighted data induced by a set of known weights W. This model can be fitted to the weighted data by solving the score equations

∑_{i=1}^n W i [T i − 1/[1 + exp{−β(W) T X i *}]] X i * = 0.

Conditions can be derived by fixing 1/[1 + exp{−β(W) T X i *}] = π 0 , i.e., letting β̂ d (W) = 0, in these weighted score equations, where π 0 is the proportion of units that received treatment in the observed data, and then inverting these equations so that we are solving for W:

∑_{i=1}^n W i (T i − π 0 ) X i * = 0. (7)

The correspondence between (6) and (7) can then be established by changing the projection function from π 0 to 1/2.
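To make the binary-treatment conditions (7) concrete, the following sketch (hypothetical simulated data; minimum-variance weights obtained from the closed-form equality-constrained solution of § 3·2, with nonnegativity omitted) solves (7) together with the sample-size constraint and verifies that the weighted covariate means coincide across treatment groups.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 2))
logit = 0.5 * X[:, 0] - 0.5 * X[:, 1]
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
pi0 = T.mean()                             # proportion treated, observed data

Xstar = np.column_stack([np.ones(n), X])   # X* = (1, X^T)^T

# Conditions (7): sum_i W_i (T_i - pi0) X*_i = 0, plus sum_i W_i = n.
A = np.vstack([np.ones(n), ((T - pi0)[:, None] * Xstar).T])
b = np.concatenate([[float(n)], np.zeros(Xstar.shape[1])])

# Minimum-variance weights via the KKT equations (nonnegativity omitted).
ones = np.ones(n)
lam = np.linalg.solve(A @ A.T, b - A @ ones)
W = ones + A.T @ lam

# Weighted covariate means now coincide across treatment groups.
m1 = np.average(X[T == 1], axis=0, weights=W[T == 1])
m0 = np.average(X[T == 0], axis=0, weights=W[T == 0])
print(np.allclose(m1, m0))                 # -> True
```

Because X i * contains the intercept, the first row of (7) forces the weighted number of treated units to equal π 0 n, and the remaining rows then equalize the weighted covariate means, exactly the balance property discussed after (6).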

4·2 Continuous treatment
When T i is continuous on the real line, Fong et al. (2018) proposed the following covariate balancing conditions for weight estimation:

∑_{i=1}^n W i T i c X i *c = 0, (8)

where the superscript c denotes the centred version of the variable and X i * = (1, X i T ) T . We now derive these covariate balancing conditions using the proposed framework. First, we assume a simple normal linear model for the propensity function,

T i | X i ~ N{β(W) T X i *, σ²(W)}.

The score equations for this model in the weighted data are

∑_{i=1}^n W i {T i − β(W) T X i *} X i * = 0,  ∑_{i=1}^n W i [{T i − β(W) T X i *}² − σ²(W)] = 0.

By inverting these score equations, we find weights W that satisfy

∑_{i=1}^n W i (T i − μ 0 ) X i * = 0,  ∑_{i=1}^n W i {(T i − μ 0 )² − σ 0 ²} = 0, (9)

where μ 0 and σ 0 ² are the sample mean and variance of T i . The first set of conditions in (9) is equivalent to the conditions (8), except that the X i * are not necessarily centred. The usefulness of our framework can also be exemplified by the insight it gives into how conditions can be specified for the variance of T i . Specifically, suppose that we specify an alternative propensity function model,

T i | X i ~ N{β μ (W) T X i *, σ i ²},

that is, we allow the variance of T i , σ i ², to depend on X i with σ i = exp{β σ (W) T X i *}. For this model, the conditions for weight estimation are derived by setting the regression coefficient elements in β μ (W) and β σ (W) to zero in the score equations. This corresponds to solving the equations

∑_{i=1}^n W i (T i − μ 0 ) X i * = 0,  ∑_{i=1}^n W i [{(T i − μ 0 )²/σ 0 ²} − 1] X i * = 0. (10)

Thus, the additional conditions in (10) are designed to remove the association between X i and the variance of T i . More details can be found in § 6.
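A sketch of the variance-balancing conditions (10) on hypothetical data (again using the closed-form equality-constrained solution of § 3·2, with nonnegativity omitted): the treatment below has both its mean and its variance depending on the covariate, and the resulting weights remove both associations.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 600
X = rng.normal(size=n)
# Continuous treatment whose mean and variance both depend on X.
T = 0.7 * X + np.exp(0.3 * X) * rng.normal(size=n)

mu0, sig0 = T.mean(), T.std()              # sample mean and s.d. of T
Xstar = np.column_stack([np.ones(n), X])
r = (T - mu0) / sig0                       # standardized treatment

# Conditions in the spirit of (10): after weighting, X is unassociated
# with both the mean and the variance of T; plus sum_i W_i = n.
A = np.vstack([np.ones(n),
               (r[:, None] * Xstar).T,            # mean conditions
               ((r**2 - 1)[:, None] * Xstar).T])  # variance conditions
b = np.concatenate([[float(n)], np.zeros(2 * Xstar.shape[1])])

ones = np.ones(n)
lam = np.linalg.solve(A @ A.T, b - A @ ones)
W = ones + A.T @ lam                       # minimum-variance KKT solution
print(np.allclose(A @ W, b))               # -> True
```

The first block of constraints reproduces the mean-balance conditions of (9); the second block additionally drives the weighted covariance between X i and (T i − μ 0 )² to zero, which is what the conditions of Fong et al. (2018) alone do not enforce.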

Other treatment types
Having demonstrated that our framework encompasses previously proposed work, we now widen its applicability by considering semicontinuous treatments, motivated by our application in § 7. Details about count treatments can be found in the Supplementary Material.
Semicontinuous variables are characterized by a point mass at zero and a right-skewed continuous distribution with positive support (Olsen & Schafer, 2001). Semicontinuous treatments are common in clinical settings because only treated patients will be prescribed a continuous dose of treatment; otherwise their dose will be recorded as zero (Moodie & Stephens, 2010). A common approach to modelling semicontinuous data is to use a two-part model, such as that in Olsen & Schafer (2001):

pr(T i | X i ; β) = (1 − π i ) I(T i = 0) [π i σ i −1 ϕ[{g(T i ) − μ i }/σ i ]] I(T i > 0), (11)

with logit(π i ) = β π T X i *, μ i = β μ T X i * and σ i = exp(β σ T X i *), where X i * = (1, X i T ) T , ϕ(·) is the standard normal density function, and I(·) is an indicator function.
Here g(·) is a monotonic function included to make the normal assumption for the positive values of T i more tenable. The likelihoods for the binary and continuous components of the two-part model are separable, so the results in § 4 imply that the conditions based on (11) are

∑_{i=1}^n W i {I(T i > 0) − π 0 } X i * = 0, (12)

∑_{i=1}^n W i I(T i > 0) {g(T i ) − μ 0 } X i * = 0,  ∑_{i=1}^n W i I(T i > 0) [{g(T i ) − μ 0 }²/σ 0 ² − 1] X i * = 0, (13)

where π 0 , μ 0 and σ 0 are maximum likelihood estimates of π i , μ i and σ i obtained by fitting (11), but without covariates, to the observed T i . The conditions (12) are derived from the score equations for the binary component and are equivalent to (7). The conditions (13) are derived from the score equations of the continuous component and are similar to (10). In our framework, the weights W are estimated by solving (12) and (13) simultaneously, whereas maximum likelihood estimation obtains weights W bi and W ci separately from the binary and continuous components, respectively, and then uses their unit-specific product W i = W bi W ci as the final weight.
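As an illustration of solving (12) and (13) simultaneously (hypothetical semicontinuous data; closed-form equality-constrained solution with nonnegativity omitted), the binary-component and continuous-component constraints can simply be stacked into one linear system.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 800
X = rng.normal(size=n)
# Semicontinuous treatment: point mass at zero plus a positive dose.
treated = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.6 * X)))
dose = np.exp(0.5 * X + 0.4 * rng.normal(size=n))
T = treated * dose

pos = T > 0
z = pos.astype(float)
g = np.log1p(T)                            # g(x) = log(x + 1)
pi0 = z.mean()
mu0, sig0 = g[pos].mean(), g[pos].std()
Xstar = np.column_stack([np.ones(n), X])

# Binary-component conditions (12), over all units.
A_bin = ((z - pi0)[:, None] * Xstar).T
# Continuous-component conditions (13), over units with T > 0 only.
r = np.where(pos, (g - mu0) / sig0, 0.0)
A_mean = (r[:, None] * Xstar).T
A_var = (np.where(pos, r**2 - 1, 0.0)[:, None] * Xstar).T

A = np.vstack([np.ones(n), A_bin, A_mean, A_var])
b = np.concatenate([[float(n)], np.zeros(3 * Xstar.shape[1])])

# Solve (12) and (13) simultaneously for minimum-variance weights.
ones = np.ones(n)
lam = np.linalg.solve(A @ A.T, b - A @ ones)
W = ones + A.T @ lam
print(np.allclose(A @ W, b))               # -> True
```

Solving the two blocks jointly, rather than forming separate binary and continuous weights and multiplying them as in the maximum likelihood approach, is what distinguishes the proposed estimation of W here.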
Simulation
For each of the four sets of simulations, we use the proposed method for semicontinuous treatments in § 5, referred to as Approach 1, and maximum likelihood estimation, referred to as Approach 2, with the propensity function model (11) to obtain the weights. For both methods we consider two different model structures and two sets of covariates for the propensity function model. The correct model structure A allows the mean and variance of T i conditional on T i > 0 to depend on covariates, whereas the incorrect model structure B restricts σ i to be a constant σ. The first set of covariates are the correct covariates, X i = (X 1i , X 2i , X 3i ) T , and the second set are transformed versions of these covariates. The covariates X 1i t and X 3i t were chosen to be highly predictive of the outcome. In total, we fit eight models to each simulated dataset to estimate the weights, which are scaled by their averages so that they sum to n. Then, using the estimated weights, we fit a weighted negative binomial model (14) to the outcome with marginal mean λ i = exp(γ 0 + γ 1 T i ), where γ 0 , γ 1 and θ are parameters to be estimated. The true causal treatment effect of interest is γ 1 = 0·5.
The left panels of Fig. 1 show that the estimates from Approach 1 have smaller biases than those from Approach 2, particularly when the covariates are transformed. In this case, bias increases with sample size in Approach 2 but not in Approach 1. When the correct model structure A is used with the correct covariates, Approach 1 has smaller bias than Approach 2, perhaps because Approach 2 requires the law of large numbers to take effect before the associations between covariates and treatment assignment can be eliminated after weighting.
The middle panels of Fig. 1 present the empirical variances of the treatment effect estimates. When the correct covariates are used, both approaches have small variances that decrease with sample size, although Approach 1 is better. Within each approach, estimates from the incorrect model structure B are less variable than those from model structure A, perhaps because the weights are less variable under model structure B as it has fewer degrees of freedom. The variances under Approach 1, but not under Approach 2, decrease with sample size when the transformed covariates are used. This behaviour in Approach 2 with the transformed covariates is due to a few sets of estimated weights exacerbating the extremeness of the tails of the sampling distribution as the sample size increases; see Robins et al. (2007, pp. 553-4) for more details. Similar phenomena are observed with the mean squared error.
Because Approach 2 performs so poorly relative to Approach 1 when the transformed covariates are used, it is difficult to distinguish between the performances of Approach 1 under model structures A and B in Fig. 1, so we summarize the results here: within Approach 1, estimates from model structure A have smaller biases but larger variances than estimates from model structure B. Overall, estimates from model structure A have smaller mean squared errors for n ≥ 1000.
In summary, in all examined scenarios, estimates based on the proposed method have smaller biases and variances than the maximum likelihood estimates.

Application
Steroids are effective and low-cost treatments for relieving disease activity in patients with systemic lupus erythematosus, a chronic autoimmune disease that affects multiple organ systems. However, there is evidence that steroid exposure could be associated with permanent organ damage that might not be attributable to disease activity. Motivated by a recent study (Bruce et al., 2015), we aim to estimate the causal dose-response relationship between steroid exposure and damage accrual shortly after disease diagnosis using data from the Systemic Lupus International Collaborating Clinics inception cohort.
We focus on 1342 patients who were enrolled between 1999 and 2011 from 32 sites and had at least two yearly assessments in damage accrual after disease diagnosis. The outcome of interest Y i is the number of items accrued in the Systemic Lupus International Collaborating Clinics/American College of Rheumatology Damage Index during the period O i , defined as the time interval between the two initial assessments in years. The semicontinuous treatment variable T i is the average steroid dose per day in milligrams over the same time period. Based on clinical inputs, we consider the following pretreatment covariates X i : the numerical scoring version of the British Isles Lupus Assessment Group disease activity index (Hay et al., 1993;Ehrenstein et al., 1995), which was developed according to the principle of physicians' intentions to treat; age at diagnosis in years; disease duration in years; and the groups of race/ethnicity and geographic region.
We consider the model (11) for T i with g(x) = log(x + 1) to reduce right skewness in the positive steroid doses. For X i *, we include the main effects of X i . We use the same four models as in the simulations to estimate the weights. Further details can be found in the Supplementary Material.
We construct 95% bootstrap percentile confidence intervals for parameter estimates using 1000 nonparametric bootstrap samples.
The results of the outcome regression models based on four sets of estimated weights are reported in Table 1. Consistent with current clinical evidence (Bruce et al., 2015), estimates from all weighted outcome regression models indicate that steroid use and the average steroid dose are positively and significantly associated with damage accrual in the initial period following diagnosis, although the estimated effect sizes from Approach 1 are larger than those from Approach 2. Approach 1 also yields narrower confidence intervals than Approach 2. Within each approach, model structure B gives slightly smaller standard errors than model structure A.
We also fitted models with I (T i > 0) and I (T i > 0) log(T i + 1) included in the outcome regression. This provides the dose-response relationship after removing the patients unexposed to steroids (Greenland & Poole, 1995). The estimated dose-response functions were similar to those obtained from the models that include the semicontinuous treatment.
Overall, our results suggest that steroid dose is related to the damage accrual rate at the early disease stage of systemic lupus erythematosus. This suggests that clinicians might need to seek steroid-sparing therapies even at the early disease stage in order to reduce damage accrual.

Longitudinal setting
Our framework can be extended to the longitudinal setting. Here we give an example using a similar setting to that in Moodie & Stephens (2010). Suppose that for the ith unit (i = 1, … , n), in each time interval [s ij−1 , s ij ) (j = 1, … , m i ; s i0 = 0) of a longitudinal study, we observe in chronological order a vector of covariates X ij , a time-varying treatment T ij that can be of any distribution, and a longitudinal outcome Y ij . The units are not necessarily followed up at the same time-points. Let the random variable Y ij (t ij P ) be the potential outcome that would have arisen had treatment t ij P been administered in the time interval [s ij−1 , s ij ). We consider the setting where interest lies in estimating the direct causal effect of t ij P on E{Y ij (t ij P )}, which may be confounded by the histories of covariates, treatment assignment and response, X̄ ij , T̄ ij−1 and Ȳ ij−1 . Here an overbar represents the history of a process; for example, X̄ ij = (X i1 , … , X ij ). In order to identify the direct causal effect of t ij P on E{Y ij (t ij P )}, we make the sequential ignorability assumption, pr{T ij | Y ij (t ij P ), X̄ ij , T̄ ij−1 , Ȳ ij−1 } = pr(T ij | X̄ ij , T̄ ij−1 , Ȳ ij−1 ) for any time interval, and the positivity assumption, pr(T ij ∈ A | X̄ ij , T̄ ij−1 , Ȳ ij−1 ) > 0 for all X̄ ij , T̄ ij−1 and Ȳ ij−1 and for any set A with positive measure. Under these assumptions, the effect of t ij P on E{Y ij (t ij P )} can be consistently estimated by weighting the ith unit's data in the interval [s ij−1 , s ij ) with W ij = pr(T ij )/pr(T ij | X̄ ij , T̄ ij−1 , Ȳ ij−1 ) for all units and time intervals.
The weights are typically estimated by Ŵ ij = pr(T ij ; α̂)/pr(T ij | X ij *; β̂), where X ij * are functionals of X̄ ij , T̄ ij−1 and Ȳ ij−1 , and α̂ and β̂ are the maximum likelihood estimates of α and β.
An alternative approach to weight estimation is to generalize our proposed method. Following the same strategy as in § 3·1, we assume that given known weights W ij , the time-varying propensity function in the time interval [s ij−1 , s ij ) follows a model pr{T ij | X ij *; β(W)}.
As in § 3·1, we partition β(W) into {β b (W), β d (W)}, where β d (W) are regression coefficients that characterize the association between X ij * and T ij over time, and β b (W) are parameters that characterize the baseline distribution, e.g., the intercept terms. Similarly to before, conditions for weight estimation can be derived by inverting the weighted score equations:

∑_{i=1}^n ∑_{j=1}^{m i } W ij ∂ log pr{T ij | X ij *; β(W)}/∂β(W) |_{β d (W) = 0, β b (W) = α̂} = 0, (15)

where α̂ is obtained by fitting the model pr(T ij ; α) to the observed data for the time-varying treatment T ij . Thus these conditions are designed to eliminate the association between X ij * and T ij and to preserve the observed marginal distribution of T ij after weighting. Other projection functions that can help to further stabilize the weights, such as those that depend on some of the covariates, can also be considered in the proposed framework with only minor modifications. This would be useful when the interactions between these covariates and treatment are of interest in the outcome regression model. Finally, the convex optimization approach in § 3·2 can be used for weight estimation by replacing the equations (4) with (15). For large sample sizes, a parametric approach to solving the conditions (15) would be useful.
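A simplified sketch of the longitudinal conditions (15) on hypothetical data: a binary time-varying treatment is pooled over units and intervals, with X ij * = (1, X ij , T ij−1 ) and a constant marginal projection π 0 , and the minimum-variance weights are again obtained from the closed-form equality-constrained solution (nonnegativity omitted).

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 300, 3                      # units and time intervals

# Simulated time-varying data: the covariate X_ij depends on past treatment.
X = np.zeros((n, m))
T = np.zeros((n, m), dtype=int)
for j in range(m):
    prev = T[:, j - 1] if j > 0 else np.zeros(n, dtype=int)
    X[:, j] = rng.normal(size=n) + 0.5 * prev
    p = 1.0 / (1.0 + np.exp(-(0.4 * X[:, j] - 0.3 * prev)))
    T[:, j] = rng.binomial(1, p)

Tprev = np.column_stack([np.zeros(n, dtype=int), T[:, :-1]])

# Flatten to one row per (unit, interval); X*_ij = (1, X_ij, T_i,j-1).
x, t, tprev = X.ravel(), T.ravel(), Tprev.ravel()
Xstar = np.column_stack([np.ones(n * m), x, tprev])
pi0 = t.mean()                     # fitted marginal model pr(T_ij; alpha)

# Conditions in the spirit of (15), pooled over units and intervals,
# plus the sample-size constraint on the weights.
A = np.vstack([np.ones(n * m), ((t - pi0)[:, None] * Xstar).T])
b = np.concatenate([[float(n * m)], np.zeros(Xstar.shape[1])])
ones = np.ones(n * m)
lam = np.linalg.solve(A @ A.T, b - A @ ones)
W = ones + A.T @ lam
print(np.allclose(A @ W, b))       # -> True
```

In the full method the model for pr{T ij | X ij *; β(W)} could differ across intervals and accommodate unbalanced observation schemes; pooling with a single π 0 is purely for brevity here.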

Discussion
The proposed method has some limitations. First, both our method and existing covariate balancing weight methods treat covariates equally and balance them simultaneously. This can lead to poor performance in high-dimensional settings, so it would be of interest to incorporate different weights for the covariates when estimating the weights for the units. Second, it can be hard to detect near violations of the positivity assumption with our method, because it generally results in small variance of the causal effect estimates by exactly balancing the covariates. In such circumstances, e.g., when there is strong confounding, the results from our method can hide the fact that the observed data alone carry little information about the target causal effect and can have large bias under model misspecification because our method will almost entirely rely on extrapolation. It is therefore important to assess the positivity assumption when applying our framework. Third, our method does not necessarily estimate the causal effect for the population of interest, such as the target population of a health policy. This can be remedied by supplementing the conditions (4) with the additional conditions ∑ i = 1 n W i X i /n = c, where c is the sample mean of X i with respect to the target population (Stuart et al., 2011).

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Fig. 1. Plots of the bias (left panels), empirical variance (middle panels) and mean squared error (right panels) of the treatment effect estimates as a function of sample size, for the correct (top panels) and transformed (bottom panels) covariate sets; in each panel the different line types represent Approach 1 with model structure A (dotted), Approach 1 with model structure B (dot-dash), Approach 2 with model structure A (solid) and Approach 2 with model structure B (dashed).