Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, James Robins, Double/debiased machine learning for treatment and structural parameters, The Econometrics Journal, Volume 21, Issue 1, 1 February 2018, Pages C1–C68, https://doi.org/10.1111/ectj.12097
Summary
We revisit the classic semi‐parametric problem of inference on a low‐dimensional parameter θ0 in the presence of high‐dimensional nuisance parameters η0. We depart from the classical setting by allowing for η0 to be so high‐dimensional that the traditional assumptions (e.g. Donsker properties) that limit complexity of the parameter space for this object break down. To estimate η0, we consider the use of statistical or machine learning (ML) methods, which are particularly well suited to estimation in modern, very high‐dimensional cases. ML methods perform well by employing regularization to reduce variance and trading off regularization bias with overfitting in practice. However, both regularization bias and overfitting in estimating η0 cause a heavy bias in estimators of θ0 that are obtained by naively plugging ML estimators of η0 into estimating equations for θ0. This bias results in the naive estimator failing to be root‐N consistent, where N is the sample size. We show that the impact of regularization bias and overfitting on estimation of the parameter of interest θ0 can be removed by using two simple, yet critical, ingredients: (1) using Neyman‐orthogonal moments/scores that have reduced sensitivity with respect to nuisance parameters to estimate θ0; (2) making use of cross‐fitting, which provides an efficient form of data‐splitting. We call the resulting set of methods double or debiased ML (DML). We verify that DML delivers point estimators that concentrate in a 1/√N‐neighbourhood of the true parameter values and are approximately unbiased and normally distributed, which allows construction of valid confidence statements. The generic statistical theory of DML is elementary and simultaneously relies on only weak theoretical requirements, which will admit the use of a broad array of modern ML methods for estimating the nuisance parameters, such as random forests, lasso, ridge, deep neural nets, boosted trees, and various hybrids and ensembles of these methods. We illustrate the general theory by applying it to provide theoretical properties of the following: DML applied to learn the main regression parameter in a partially linear regression model; DML applied to learn the coefficient on an endogenous variable in a partially linear instrumental variables model; DML applied to learn the average treatment effect and the average treatment effect on the treated under unconfoundedness; DML applied to learn the local average treatment effect in an instrumental variables setting. In addition to these theoretical applications, we also illustrate the use of DML in three empirical examples.
1. Introduction and Motivation
Motivation
We develop a series of simple results for obtaining root‐N consistent estimation, where N is the sample size, and valid inferential statements about a low‐dimensional parameter of interest, θ0, in the presence of a high‐dimensional or ‘highly complex’ nuisance parameter, η0. The parameter of interest will typically be a causal parameter or treatment effect parameter, and we consider settings in which the nuisance parameter will be estimated using machine learning (ML) methods, such as random forests, lasso or post‐lasso, neural nets, boosted regression trees, and various hybrids and ensembles of these methods. These ML methods are able to handle many covariates and they provide natural estimators of nuisance parameters when these parameters are highly complex. Here, highly complex formally means that the entropy of the parameter space for the nuisance parameter is increasing with the sample size in a way that moves us outside the traditional framework considered in the classical semi‐parametric literature, where the complexity of the nuisance parameter space is taken to be sufficiently small. The main contribution of this paper is to offer a general and simple procedure for estimating and performing inference on θ0 that is formally valid in these highly complex settings.
Regularization Bias
Overcoming Regularization Biases Using Orthogonalization
Now consider a second construction that employs an orthogonalized formulation obtained by directly partialling out the effect of X from D to obtain the orthogonalized regressor V = D − m0(X). Specifically, we obtain V̂ = D − m̂0(X), where m̂0 is an ML estimator of m0(X) = E[D | X] obtained using the auxiliary sample of observations. We are now solving an auxiliary prediction problem to estimate the conditional mean of D given X, so we are doing ‘double prediction’ or ‘double machine learning’.
Figure 1 provides a numerical illustration of the negative impact of regularization bias and the benefit of orthogonalization. The left panel shows the behaviour of a conventional (non‐orthogonal) ML estimator in the partially linear model in a simple simulation experiment where we learn g0 using a random forest. The g0 in this experiment is a very smooth function of a small number of variables, so the experiment is seemingly favourable to the use of random forests a priori. The histogram shows the simulated distribution of the centred estimator. The estimator is badly biased, shifted much to the right relative to the true value θ0. The distribution of the estimator (approximated by the blue histogram) is substantively different from a normal approximation (shown by the red curve) derived under the assumption that the bias is negligible. The right panel shows the behaviour of the orthogonal, DML estimator in the partially linear model in a simple experiment where we learn the nuisance functions using random forests. Note that the simulated data are exactly the same as those underlying the results in the left panel. The simulated distribution of the centred estimator (given by the blue histogram) illustrates that the estimator is approximately unbiased, concentrates around θ0, and is well approximated by the normal approximation obtained in Section 3 (shown by the red curve).
Figure 1. Comparison of the conventional and double ML estimators.
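To make the ‘double prediction’ construction concrete, the following sketch simulates a simple partially linear model and computes the orthogonalized estimator on a main/auxiliary split, fitting both conditional means with random forests. The data-generating process, sample size, and tuning choices are invented for the illustration (this is not the simulation design behind Figure 1), and the estimator uses the partialling-out form of the orthogonal score, in which E[Y | X] rather than g0 is predicted.

```python
# Illustrative sketch (assumed DGP and tuning, not the paper's design): orthogonalized
# estimation of theta0 in Y = D*theta0 + g0(X) + U, D = m0(X) + V, on one sample split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
N, theta0 = 1000, 0.5
X = rng.normal(size=(N, 10))
m0 = np.sin(X[:, 0]) + 0.5 * X[:, 1]                 # E[D | X]
D = m0 + rng.normal(size=N)
g0 = np.cos(X[:, 0]) ** 2 + 0.5 * X[:, 2]            # nuisance function in the outcome equation
Y = D * theta0 + g0 + rng.normal(size=N)

aux = np.arange(N) < N // 2                          # auxiliary sample: fit the nuisance functions
main = ~aux                                          # main sample: estimate theta0

m_hat = RandomForestRegressor(n_estimators=500, random_state=1).fit(X[aux], D[aux])
ell_hat = RandomForestRegressor(n_estimators=500, random_state=2).fit(X[aux], Y[aux])

V_hat = D[main] - m_hat.predict(X[main])             # orthogonalized regressor
Y_tilde = Y[main] - ell_hat.predict(X[main])         # outcome with E[Y | X] partialled out

theta_hat = np.sum(V_hat * Y_tilde) / np.sum(V_hat ** 2)
print(f"orthogonalized estimate on one split: {theta_hat:.3f} (truth {theta0})")
```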
The Role of Sample Splitting in Removing Bias Induced By Overfitting
While sample splitting allows us to deal with remainder terms such as those discussed above, its direct application does have the drawback that the estimator of the parameter of interest only makes use of the main sample, which can result in a substantial loss of efficiency, as we are only making use of a subset of the available data. However, we can flip the role of the main and auxiliary samples to obtain a second version of the estimator of the parameter of interest. By averaging the two resulting estimators, we can regain full efficiency. Indeed, the two estimators will be approximately independent, so simply averaging them offers an efficient procedure. We call this sample‐splitting procedure – where we swap the roles of main and auxiliary samples to obtain multiple estimates and then average the results – ‘cross‐fitting’. We formally define this procedure and discuss a K‐fold version of cross‐fitting in Section 3.
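Continuing in the same spirit, here is a self-contained sketch of the twofold ‘swap and average’ procedure; again the data-generating process and learners are illustrative assumptions rather than the paper's design.

```python
# Twofold cross-fitting sketch for the partially linear model: estimate theta0 on each
# half using nuisance fits from the other half, then average the two estimates.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def theta_on_split(Y, D, X, main, aux):
    """Partialling-out estimate of theta0 on `main`, with nuisance functions fit on `aux`."""
    m_hat = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[aux], D[aux])
    ell_hat = RandomForestRegressor(n_estimators=300, random_state=1).fit(X[aux], Y[aux])
    V = D[main] - m_hat.predict(X[main])
    return np.sum(V * (Y[main] - ell_hat.predict(X[main]))) / np.sum(V ** 2)

rng = np.random.default_rng(1)
N, theta0 = 1000, 0.5
X = rng.normal(size=(N, 10))
D = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=N)
Y = D * theta0 + np.cos(X[:, 0]) ** 2 + 0.5 * X[:, 2] + rng.normal(size=N)

half1, half2 = np.arange(N) < N // 2, np.arange(N) >= N // 2
theta_12 = theta_on_split(Y, D, X, main=half1, aux=half2)
theta_21 = theta_on_split(Y, D, X, main=half2, aux=half1)
theta_cf = 0.5 * (theta_12 + theta_21)   # average the two (approximately independent) estimates
print(f"swap-and-average estimate: {theta_cf:.3f}")
```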
Without sample splitting, terms such as the one in 1.6 might not vanish and can lead to poor performance of estimators of θ0. The difficulty arises because the model errors and the estimation errors, such as ĝ(X_i) − g0(X_i), are generally related, because the data for observation i are used in forming the estimator ĝ. This association can then lead to poor performance of an estimator of θ0 that makes use of ĝ as a plug‐in estimator for g0, even when ĝ converges to g0 at a very favourable rate.
This bias due to overfitting is illustrated in the left panel of Figure 2. The histogram in the figure gives a simulated distribution for the studentized estimator resulting from using the full sample and the contrived nuisance estimator described above. We can see that the histogram is shifted markedly to the left, demonstrating substantial bias resulting from overfitting. The right panel of Figure 2 illustrates that this bias is completely removed by sample splitting. The results in the right panel of Figure 2 make use of the twofold cross‐fitting procedure discussed above with exactly the same contrived nuisance estimator as in the left panel. The difference is that the nuisance estimator is formed using one half of the sample and θ0 is then estimated using the other half of the sample. This procedure is then repeated, swapping the roles of the two samples, and the results are averaged. We can see that the substantial bias from the full‐sample estimator has been removed and that the spread of the histogram corresponding to the cross‐fitting estimator is roughly the same as that of the full‐sample estimator, clearly illustrating the bias‐reduction property and efficiency of the cross‐fitting procedure.
Figure 2. Comparison of full‐sample and cross‐fitting procedures.
Neyman Orthogonality and Moment Conditions
Literature Overview
Our paper builds upon two important bodies of research within the semi‐parametric literature. The first is the literature on obtaining root‐N consistent and asymptotically normal estimates of low‐dimensional objects in the presence of high‐dimensional or non‐parametric nuisance functions. The second is the literature on the use of sample splitting to relax entropy conditions. We provide links to each of these bodies of literature in turn.
There is also a related targeted maximum likelihood learning approach, introduced in Scharfstein et al. (1999) in the context of treatment effects analysis. This is substantially generalized by van der Laan and Rubin (2006), who use maximum likelihood in a least favourable direction and then perform one‐step or k‐step updates using the estimated scores, in an effort to better estimate the target parameter.6 This procedure is like the least favourable direction approach in semi‐parametrics; see, for example, Severini and Wong (1992). The introduction of the likelihood introduces major benefits, such as allowing simple and natural imposition of constraints inherent in the data (e.g. support restrictions when the outcome is binary or censored) and permitting the use of likelihood cross‐validation to choose the nuisance parameter estimator. This data adaptive choice of the nuisance parameter has been dubbed the ‘super learner’ by van der Laan et al. (2007). In subsequent work, van der Laan and Rose (2011) emphasize the use of ML methods to estimate the nuisance parameters for use with the super learner. Much of this work, including recent work such as Luedtke and van der Laan (2016), Toth and van der Laan (2016) and Zheng et al. (2016), focuses on formal results under a Donsker condition, though the use of sample splitting to relax these conditions has also been advocated in the targeted maximum likelihood setting, as discussed below.
The Donsker condition is a powerful classical condition that allows rich structures for fixed function classes, but unfortunately it is unsuitable for high‐dimensional settings. Examples of function classes where a Donsker condition holds include functions of a single variable that have total variation bounded by 1 and functions that have uniformly bounded derivatives. As a further example, functions composed from function classes with VC dimensions bounded by p through a fixed number of algebraic and monotonic transforms are Donsker. However, this property will no longer hold if we let p grow to infinity with the sample size, as this increase in dimension would require that the VC dimension also increases with N. More generally, Donsker conditions are easily violated once dimensions become large. A major point of departure of the present work from the classical literature on semi‐parametric estimation is its explicit focus on high‐complexity/entropy cases. One way to analyse the problem of estimation in high‐entropy cases is to see to what degree equicontinuity results continue to hold while allowing moderate growth of the complexity/entropy of the nuisance parameter space. Examples of papers taking this approach in approximately sparse settings are Belloni et al. (2016a,b, 2017), Chernozhukov et al. (2015b), Javanmard and Montanari (2014a), van de Geer et al. (2014) and Zhang and Zhang (2014). In all of these examples, entropy growth must be limited in what can be very restrictive ways. The entropy conditions rule out the contrived overfitting example mentioned above, which does approximate realistic examples, and might otherwise place severe restrictions on the model. For example, in Belloni et al. (2010, 2012), the optimal instrument needs to satisfy a restrictive sparsity condition.
A key device that we use to avoid strong entropy conditions is cross‐fitting via sample splitting. Cross‐fitting is a practical, efficient form of data splitting. Importantly, its use here is not simply as a device to make proofs elementary (which it does), but as a practical method to allow us to overcome the overfitting/high‐complexity phenomena that commonly arise in data analysis based on highly adaptive ML methods. Our treatment builds upon the sample‐splitting ideas employed in Belloni et al. (2010, 2012), who considered sample splitting in a high‐dimensional sparse optimal IV model to weaken the sparsity condition mentioned in the previous paragraph. This work, in turn, was inspired by Angrist and Krueger (1995). We also build on Ayyagari (2010) and Robins et al. (2013), where ML methods and sample splitting were used in the estimation of a partially linear model of the effects of pollution while controlling for several covariates. We use the term cross‐fitting to characterize our recommended procedure, partly borrowing the jargon from Fan et al. (2012), who employed a slightly different form of sample splitting to estimate the scale parameter in a high‐dimensional sparse regression. Of course, the use of sample splitting to relax entropy conditions has a long history in semi‐parametric estimation problems. For example, Bickel (1982) considered estimating nuisance functions using a vanishing fraction of the sample, and these results were extended to sample splitting into two equal halves and discretization of the parameter space by Schick (1986). Similarly, van der Vaart (1998) uses two‐way sample splitting and discretization of the parameter space to give weak conditions for k‐step estimators using the efficient scores where sample splitting is used to estimate the updates; see also Hubbard et al. (2016). Robins et al. (2008, 2017) use sample splitting in the construction of higher‐order influence function corrections in semi‐parametric estimation. Some recent work in the targeted maximum likelihood literature, e.g. Zheng and van der Laan (2011), also notes the utility of sample splitting in the context of k‐step updating, though this sample splitting approach is different from the cross‐fitting approach we pursue.
Plan of The Paper
We organize the rest of the paper as follows. In Section 2, we formally define Neyman orthogonality and provide a brief discussion that synthesizes various models and frameworks that can be used to produce estimating equations satisfying this key condition. In Section 3, we carefully define DML estimators and develop their general theory. We then illustrate this general theory by applying it to provide theoretical results for using DML to estimate and carry out inference for key parameters in the PLR model, and for using DML to estimate and carry out inference for coefficients on endogenous variables in a partially linear IV model in Section 4. In Section 5, we provide a further illustration of the general theory by applying it to develop theoretical results for DML estimation and inference for average treatment effects (ATEs) and average treatment effects on the treated (ATTEs) under unconfoundedness, and for DML estimation of local average treatment effects (LATEs) in an IV context within the potential outcomes framework; see Imbens and Rubin (2015). Finally, we apply DML in three empirical illustrations in Section 6. In the Appendix, we define additional notation and present proofs.
Notation
The symbols Pr and E denote probability and expectation operators with respect to a generic probability measure that describes the law of the data. If we need to signify the dependence on a probability measure P, we use P as a subscript, writing Pr_P and E_P. We use capital letters, such as W, to denote random elements and we use the corresponding lowercase letters, such as w, to denote fixed values that these random elements can take. We use ‖·‖_{P,q} to denote the L^q(P) norm; for example, ‖f‖_{P,q} = (∫ |f(w)|^q dP(w))^{1/q}. We use x′ to denote the transpose of a column vector x. For a differentiable map x ↦ f(x), mapping ℝ^d to ℝ^k, we use ∂_{x′} f to abbreviate the partial derivatives (∂/∂x′) f, and we correspondingly use the expression ∂_{x′} f(x_0) to mean ∂_{x′} f(x) evaluated at x = x_0, etc.
2. Construction of Neyman Orthogonal Score/Moment Functions
Here we formally introduce the model and we discuss several methods for generating orthogonal scores in a wide variety of settings, including the classical Neyman construction. We also use this as an opportunity to synthesize some recent developments in the literature.
2.1. Moment Condition/Estimating Equation Framework
We remark here that condition 2.3 holds when η is a finite‐dimensional vector as long as the vector of partial derivatives of the map η ↦ E_P[ψ(W; θ0, η)] vanishes at η = η0.
Sometimes it will also be helpful to use an approximate Neyman orthogonality condition as opposed to the exact one given in Definition 2.1.
2.2. Construction of Neyman Orthogonal Scores
If we start with a score φ that does not satisfy the orthogonality condition above, we first transform it into a score ψ that does. Here we outline several methods for doing so.
2.2.1. Neyman Orthogonal Scores for Likelihood and Other M‐Estimation Problems With Finite‐Dimensional Nuisance Parameters
First, we describe the construction used by Neyman (1959) to derive his celebrated orthogonal score and C(α)‐statistic in a maximum likelihood setting.7 Such a construction also underlies the concept of local unbiasedness in the construction of optimal tests in, e.g., Ferguson (1967), and it was extended to non‐likelihood settings by Wooldridge (1991). The discussion of Neyman's construction here draws on Chernozhukov et al. (2015a).
Note that the orthogonal score ψ in 2.7 has nuisance parameters consisting of the elements of μ in addition to the elements of β, and Lemma 2.1 shows that Neyman orthogonality holds both with respect to β and with respect to μ. We will find that Neyman orthogonal scores in other settings, including infinite‐dimensional ones, have a similar property.
Note that in this example, μ0 not only creates the necessary orthogonality but also creates the efficient score for inference on the target parameter θ when the quasi‐log‐likelihood function is the true (possibly conditional) log‐likelihood, as demonstrated by Neyman (1959).
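Since the displays 2.7 and 2.10 do not survive in this version of the text, the following is a sketch of the standard form of Neyman's construction; the symbols (the quasi‐log‐likelihood ℓ and the Jacobian blocks J) are reconstructed notation and should be checked against the original displays.

```latex
% Sketch of Neyman's orthogonalization for an M-estimation criterion \ell(W;\theta,\beta)
% with a finite-dimensional nuisance parameter \beta (reconstructed notation).
% Requires amsmath for pmatrix.
\[
  \psi(W;\theta,\eta) = \partial_\theta \ell(W;\theta,\beta) - \mu\,\partial_\beta \ell(W;\theta,\beta),
  \qquad \eta = \bigl(\beta',\operatorname{vec}(\mu)'\bigr)',
\]
\[
  \mu_0 = J_{\theta\beta} J_{\beta\beta}^{-1},
  \qquad
  J = \begin{pmatrix} J_{\theta\theta} & J_{\theta\beta}\\ J_{\beta\theta} & J_{\beta\beta} \end{pmatrix}
    = \partial_{(\theta',\beta')}\,\mathrm{E}_P\bigl[\partial_{(\theta',\beta')'}\,\ell(W;\theta,\beta)\bigr]
      \Big|_{\theta=\theta_0,\;\beta=\beta_0}.
\]
% Differentiating E_P[psi(W;theta_0,beta,mu_0)] in beta at beta_0 gives
% J_{theta beta} - mu_0 J_{beta beta} = 0, i.e. orthogonality with respect to beta.
```

The comment at the end of the display indicates why orthogonality with respect to β holds, which is the property recorded in Lemma 2.1.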
If 2.6 holds, J exists, the solution of the optimization problem 2.14 exists, and μ0 is taken to be this solution, then the score ψ defined in 2.7 is Neyman near‐orthogonal at with respect to the nuisance realization set , where the norm on is defined by with the supremum being taken over all matrices A such that .
Note that the regularized μ0 in 2.14 creates the necessary near‐orthogonality at the cost of giving up somewhat on the efficiency of the score ψ. At the same time, regularization may generate additional robustness gains as achieving full efficiency by estimating μ0 in 2.10 may require stronger conditions.
2.2.2. Neyman Orthogonal Scores in Generalized Method of Moments Problems
The construction in the previous section gives a Neyman orthogonal score whenever the moment conditions 2.6 hold and, as discussed in Remark 2.2, the resulting score is efficient as long as the criterion function is the log‐likelihood function. The question, however, remains how to construct an efficient score when the criterion is not necessarily a log‐likelihood function. In this section, we answer this question and describe a generalized method of moments (GMM)‐based method of constructing an efficient and Neyman orthogonal score in this more general case. The discussion here is related to Lee (2005), Bera et al. (2010) and Chernozhukov et al. (2015b).
2.2.3. Neyman Orthogonal Scores for Likelihood and Other M‐Estimation Problems With Infinite‐Dimensional Nuisance Parameters
Suppose that 2.5 holds, and let T be a convex set of functions mapping Θ into such that . Also, suppose that for each , the function is continuously differentiable almost surely. Then, under mild regularity conditions, the score ψ in 2.21 is Neyman orthogonal at with respect to the nuisance realization set .
It is important to note that the concentrating‐out approach described here gives a Neyman orthogonal score without requiring that the criterion function be the log‐likelihood function. Except for the technical conditions needed to ensure the existence of derivatives and their interchangeability, the only condition that is required is that θ0 and β0 solve the optimization problem 2.5. If the criterion function is the log‐likelihood function, however, it follows from Newey (1994, p. 1359) that the concentrating‐out approach actually yields the efficient score. An alternative, but closely related, approach to derive the efficient score in the likelihood setting would be to apply Neyman's construction described above for a one‐dimensional least favourable parametric submodel; see Severini and Wong (1992) and Chapter 25 of van der Vaart (1998).
2.2.4. Neyman Orthogonal Scores for Conditional Moment Restriction Problems With Infinite‐Dimensional Nuisance Parameters
Here the moment function is a known vector‐valued function. This framework is of interest because it covers a rich variety of models without having to explicitly rely on the likelihood formulation.
Suppose that (a) 2.22 holds, (b) the matrices , , and are finite, and (c) for all , there exists a constant such that . Then the score ψ in 2.26 is Neyman orthogonal at with respect to the nuisance realization set .
2.2.5. Neyman Orthogonal Scores and Influence Functions
Neyman orthogonality is a joint property of the score, the true parameter value η0, the parameter set T, and the distribution of W. It is not determined by any particular model for the parameter θ. Nevertheless, it is possible to use semi‐parametric efficiency calculations to construct the orthogonal score from the original score, as in Chernozhukov et al. (2016). Specifically, an orthogonal score can be constructed by adding to the original score the influence function adjustment for estimation of the nuisance functions that is analysed in Newey (1994). The resulting orthogonal score will be the influence function of the limit of the average of the original score.
The form of the adjustment term depends on the nuisance estimator and, of course, on the form of the original score. Such adjustment terms have been derived for a variety of cases by Newey (1994). Also, Ichimura and Newey (2015) show how the adjustment term can be computed from the limit of a certain derivative. Any of these results can be applied to a particular starting score and estimator to obtain an orthogonal score.
Influence functions have been used to estimate functionals of non‐parametric estimators by Hasminskii and Ibragimov (1979) and Bickel and Ritov (1988). Newey et al. (1998, 2004) showed that the estimator based on 2.29 will have a second‐order remainder in the nuisance estimation error, which is the key asymptotic property of orthogonal scores. Orthogonality of influence functions in semi‐parametric models follows from van der Vaart (1991), as shown for higher‐order counterparts in Robins et al. (2008, 2017). Chernozhukov et al. (2016) point out that, in general, an orthogonal score can be constructed from an original score and a non‐parametric estimator by adding to the original score the adjustment term for estimation of β0 as described above. This construction provides a way of obtaining an orthogonal score from any initial score and non‐parametric estimator.
3. DML: Post‐Regularized Inference Based on Neyman‐Orthogonal Estimating Equations
3.1. Definition of DML and Its Basic Properties
We assume that we have a sample (W_i), i = 1, …, N, modelled as independent and identically distributed (i.i.d.) copies of W, whose law is determined by the probability measure P. Estimation will be carried out using the finite‐sample analogue of the estimating equation 2.1.
We assume that the true value η0 of the nuisance parameter η can be estimated using a part of the data. Different structured assumptions on η0 allow us to use different machine‐learning tools for estimating η0, for example:
(1) approximate sparsity for η0 with respect to some dictionary calls for the use of forward selection, lasso, post‐lasso, ℓ2‐boosting, or some other sparsity‐based technique;
(2) well‐approximability of η0 by trees calls for the use of regression trees and random forests;
(3) well‐approximability of η0 by sparse neural and deep neural nets calls for the use of ℓ1‐penalized neural and deep neural networks;
(4) well‐approximability of η0 by at least one model mentioned in (1)–(3) above calls for the use of an ensemble/aggregated method over the estimation methods mentioned in (1)–(3).
There are performance guarantees for most of these ML methods that make it possible to satisfy the conditions stated below. Ensemble and aggregation methods ensure that the performance guarantee is approximately no worse than the performance of the best method.
We assume that N is divisible by K in order to simplify the notation. The following algorithm defines the simple cross‐fitted DML as outlined in the Introduction.
This approach generalizes the 50–50 cross‐fitting method mentioned in the Introduction. We now define a variation of this basic cross‐fitting approach that may behave better in small samples.
The choice of K has no asymptotic impact under our conditions but, of course, the choice of K may matter in small samples. Intuitively, larger values of K provide more observations in the auxiliary sample from which to estimate the high‐dimensional nuisance functions, which seems to be the more difficult part of the problem. We have found moderate values of K, such as 4 or 5, to work better than K = 2 in a variety of empirical examples and in simulations. Moreover, we generally recommend DML2 over DML1, though in some problems, such as estimation of ATE in the interactive model, which we discuss later, there is no difference between the two approaches. In most other problems, DML2 is better behaved because the pooled empirical Jacobian for the equation in 3.4 exhibits more stable behaviour than the separate empirical Jacobians for the equation in 3.1.
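The following sketch contrasts the two variants for the linear partialling‐out score in an illustrative partially linear setup: DML1 solves the estimating equation fold by fold and averages the solutions, while DML2 pools the score across folds and solves once. The choice K = 5, the learners, and the data-generating process are assumptions made for the example.

```python
# K-fold cross-fitting sketch: DML1 averages per-fold estimates; DML2 solves the pooled
# estimating equation. Illustrative DGP and tuning; linear partialling-out score.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
N, K, theta0 = 1000, 5, 0.5
X = rng.normal(size=(N, 10))
D = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=N)
Y = D * theta0 + np.cos(X[:, 0]) ** 2 + 0.5 * X[:, 2] + rng.normal(size=N)

per_fold_theta, num, den = [], 0.0, 0.0
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Nuisance functions are fit on the K-1 complementary folds ...
    m_hat = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[train_idx], D[train_idx])
    ell_hat = RandomForestRegressor(n_estimators=300, random_state=1).fit(X[train_idx], Y[train_idx])
    # ... and the score is evaluated on the held-out fold.
    V = D[test_idx] - m_hat.predict(X[test_idx])
    U = Y[test_idx] - ell_hat.predict(X[test_idx])
    per_fold_theta.append(np.sum(V * U) / np.sum(V ** 2))   # DML1: per-fold solution
    num, den = num + np.sum(V * U), den + np.sum(V ** 2)    # DML2: pooled numerator/denominator

theta_dml1 = float(np.mean(per_fold_theta))
theta_dml2 = num / den
print(f"DML1: {theta_dml1:.3f}, DML2: {theta_dml2:.3f} (truth {theta0})")
```

For this linear score the two variants differ only through the denominators (per‐fold versus pooled), which is exactly the Jacobian‐stability point made above.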
3.2. Moment Condition Models With Linear Scores
Let , , and be some finite constants such that , and let and be some sequences of positive constants converging to zero such that . Also, let be some fixed integer, and let be some sequence of sets of probability distributions P of W on .
Assumption 3.1 requires scores to be Neyman orthogonal or near‐orthogonal and imposes mild smoothness requirements as well as the canonical identification condition.
Assumptions 3.2(a)–(c) state that the estimator of the nuisance parameter belongs to a realization set that is a shrinking neighbourhood of η0, contracting around η0 at a rate determined by the ‘statistical’ rates appearing in the assumption. These rates are not given in terms of the norm on T, but rather are the intrinsic rates that are most connected to the statistical problem at hand. However, in smooth problems, as discussed below, this translates, in the worst cases, to the crude requirement that the nuisance parameters are estimated at the o(N^{-1/4}) rate.
(1) the optimal instrument problem (see Belloni et al., 2012);
(2) the PLR model when one of the nuisance functions is known (see Section 4);
(3) the treatment effect examples when the propensity score is known, which includes randomized control trials as an important special case (see Section 5).
The result establishes that the estimator based on the orthogonal scores achieves the root‐N rate of convergence and is approximately normally distributed. It is noteworthy that this convergence result, both the rate of concentration and the distributional approximation, holds uniformly with respect to P varying over an expanding class of probability measures. This means that the convergence holds under any sequence of probability distributions drawn from this class for each N, which in turn implies that the results are robust with respect to perturbations of a given P along such sequences. The same property can be shown to fail for methods not based on orthogonal scores.
Theorems 3.1 and 3.2 can be used for standard construction of confidence regions, which are uniformly valid over a large, interesting class of models:
Next we note that the estimators need not be semi‐parametrically efficient, but under some conditions they can be.
Under the conditions of Theorem 3.1, if the score ψ is efficient for estimating θ0 at a given P, in the semi‐parametric sense as defined in van der Vaart (1998), then the large‐sample variance of the DML estimator reaches the semi‐parametric efficiency bound at this P relative to the model.
3.3. Models With Non‐Linear Scores
Let , , , , and be some finite constants, and let , and be some sequences of positive constants converging to zero. To derive the properties of the DML estimator, we will use the following assumptions.
Assumption 3.3 is mild and rather standard in moment condition problems. Assumption 3.3(a) requires θ0 to be sufficiently separated from the boundary of Θ. Assumption 3.3(b) only requires differentiability of the map θ ↦ E_P[ψ(W; θ, η)] and does not require differentiability of θ ↦ ψ(W; θ, η) itself. Assumption 3.3(c) implies sufficient identifiability of θ0. Assumption 3.3(d) is the orthogonality condition that has already been extensively discussed.
are bounded from below by c0.
Assumption 3.4(a) states that the estimator of the nuisance parameter belongs to a realization set that is a shrinking neighbourhood of η0, contracting at the stated statistical rates. These rates are not given in terms of the norm on T, but rather are intrinsic rates that are connected to the statistical problem at hand. In smooth problems, these conditions translate to the crude requirement that nuisance parameters are estimated at the o(N^{-1/4}) rate, as discussed previously in the case with linear scores. However, these conditions can be refined, for example, when some cross‐derivatives vanish; see the linear case in the previous subsection for further discussion. Suitable measurability and pointwise entropy conditions, required in Assumption 3.4(b), are mild regularity conditions that are satisfied in all practical cases. The assumption of a bounded parameter space Θ in Assumption 3.4(b) is embedded in the entropy condition, but we state it separately for clarity. This assumption was not needed in the linear case, and it can be removed in the non‐linear case with the imposition of more complicated Huber‐like regularity conditions. Assumption 3.4(c) is a set of mild growth conditions.
Similar to the discussion in the linear case, the conditions in Assumption 3.4 are very flexible and embody refined requirements on the quality of the nuisance parameter estimators. The conditions essentially reduce to the previous conditions in the linear case, with the exception of compactness, which is imposed to make the conditions easy to verify in non‐linear cases.
Moreover, in the statement above, σ² can be replaced by a consistent estimator, consistent uniformly over the relevant class of probability measures, with the size of the remainder term updated accordingly. Furthermore, Corollaries 3.1 and 3.2 continue to hold under the conditions of this theorem.
3.4. Finite‐Sample Adjustments To Incorporate Uncertainty Induced By Sample Splitting
We recommend using medians, reporting the median point estimate across splits together with the median‐based standard error, as these quantities are more robust to outliers.
If S is fixed, as N → ∞ and maintaining either Assumptions 3.1 and 3.2 or Assumptions 3.3 and 3.4 as appropriate, the estimators aggregated across splits are first‐order equivalent to the estimator based on a single split and obey the conclusions of Theorems 3.1 and 3.2 or of Theorem 3.3. Moreover, the split‐adjusted variance estimators can replace the single‐split variance estimator in the statement of the appropriate theorems.
It would be interesting to investigate the behaviour under the regime where S → ∞ as N → ∞.
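To fix ideas, the following sketch repeats an illustrative DML2 estimator over S sample‐split seeds and aggregates with medians. The data-generating process and learners are invented for the example, and the split‐adjusted standard error reflects our reading of the Section 3.4 median adjustment (add the squared deviation of each split's estimate from the median to the squared standard error before taking the median); treat that exact formula as an assumption.

```python
# Median aggregation over S sample splits (illustrative sketch; assumed adjustment formula).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml2_plr(Y, D, X, K, seed):
    """Illustrative DML2 point estimate and standard error for the partialling-out score."""
    V_all, U_all = [], []
    for tr, te in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        m = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[tr], D[tr])
        g = RandomForestRegressor(n_estimators=200, random_state=1).fit(X[tr], Y[tr])
        V_all.append(D[te] - m.predict(X[te]))
        U_all.append(Y[te] - g.predict(X[te]))
    V, U = np.concatenate(V_all), np.concatenate(U_all)
    theta = np.sum(V * U) / np.sum(V ** 2)            # pooled (DML2) solution
    psi = (U - theta * V) * V                         # estimated orthogonal score
    se = np.sqrt(np.mean(psi ** 2) / np.mean(V ** 2) ** 2 / len(V))
    return theta, se

rng = np.random.default_rng(3)
N, theta0, S = 1000, 0.5, 11
X = rng.normal(size=(N, 10))
D = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=N)
Y = D * theta0 + np.cos(X[:, 0]) ** 2 + 0.5 * X[:, 2] + rng.normal(size=N)

results = [dml2_plr(Y, D, X, K=5, seed=s) for s in range(S)]
thetas = np.array([t for t, _ in results])
ses = np.array([se for _, se in results])
theta_med = np.median(thetas)
se_med = np.sqrt(np.median(ses ** 2 + (thetas - theta_med) ** 2))  # split-adjusted s.e. (assumed form)
print(f"median point estimate {theta_med:.3f}, split-adjusted s.e. {se_med:.3f}")
```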
4. Inference in Partially Linear Models
4.1. Inference in Partially Linear Regression Models
In the partially linear model, the DML approach complements Belloni et al. (2013, 2014a,b, 2015), Zhang and Zhang (2014), van de Geer et al. (2014) and Javanmard and Montanari (2014b), all of which consider estimation and inference for parameters within the partially linear model using lasso‐type methods without cross‐fitting. By relying upon cross‐fitting, the DML approach allows for the use of a much broader collection of ML methods for estimating the nuisance functions and also allows relaxation of sparsity conditions in the case where lasso or other sparsity‐based estimators are used. Both the DML approach and the approaches taken in the aforementioned papers can be seen as heuristically debiasing the score function (Y − Dθ − g(X))D, which does not possess the orthogonality property unless m0(X) = 0.
Let and be sequences of positive constants approaching 0 as before. Also, let c, C and q be fixed strictly positive constants such that , and let be a fixed integer. Moreover, for any , where ℓ1 is a function ℓ1 and ℓ2 are functions mapping the support of X to , denote . For simplicity, assume that is an integer.
Let be the collection of probability laws P for the triple such that (a) 4.1 and 4.2 hold; (b) ; (c) and ; (d) and ; and (e) given a random subset I of [N] of size , the nuisance parameter estimator obeys the following conditions for all . With P‐probability no less than , , and (i) for the score ψ in 4.3, where , , and (ii) for the score ψ in 4.4, where , .
The only non‐primitive condition here is the assumption on the rate of estimating the nuisance parameters. These rates of convergence are available for most often used ML methods and are case‐specific, so we do not restate conditions that are needed to reach these rates.
The following theorem follows as a corollary to the results in Section 3 by verifying Assumptions 3.1 and 3.2, and it will be proven as a special case of Theorem 4.2 below.
Under conditional homoscedasticity, i.e. E[U² | D, X] = σ²_U, the asymptotic variance σ² reduces to σ²_U/E[V²], which is the semi‐parametric efficiency bound for θ; here U and V denote the disturbances in equations 4.1 and 4.2.
4.2. Inference in Partially Linear IV Models
Note that the score in 4.8 has a minor advantage over the score in 4.7 because all of its nuisance parameters are conditional mean functions, which can be directly estimated by ML methods. If one prefers to use the score in 4.7, one has to construct an estimator of g0 first. To do so, one can first obtain a DML estimator of θ0 based on the score in 4.8, say θ̂0. Then, using the fact that g0(X) = E[Y − Dθ0 | X], one can construct an estimator of g0 by applying an ML method to regress Y − Dθ̂0 on X, as sketched below. Alternatively, one can use assumption‐specific methods to directly estimate g0, without using the score 4.8 first. For example, if g0 can be approximated by a sparse linear combination of a large set of transformations of X, one can use the methods of Gautier and Tsybakov (2014) to obtain an estimator of g0.
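A short sketch of this two‐step recovery of g0 under an invented data-generating process; the value of θ̂0 below is just a placeholder standing in for a DML estimate based on the score in 4.8, and the random‐forest learner is an illustrative choice.

```python
# Sketch: recover g0 in the partially linear IV model by regressing Y - D*theta_hat on X.
# DGP and learner are illustrative assumptions; theta_hat stands in for a DML estimate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
N, theta0 = 1000, 0.5
X = rng.normal(size=(N, 5))
Z = np.sin(X[:, 0]) + rng.normal(size=N)          # instrument
D = Z + 0.5 * X[:, 1] + rng.normal(size=N)        # regressor driven by the instrument and X
g0 = np.cos(X[:, 0]) ** 2
Y = D * theta0 + g0 + rng.normal(size=N)

theta_hat = 0.5   # placeholder for a DML estimate of theta0 (set to the truth for illustration)

# Since g0(X) = E[Y - D*theta0 | X], plug in theta_hat and regress on X.
g_hat = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, Y - D * theta_hat)
print("fitted g0 at the first five observations:", np.round(g_hat.predict(X[:5]), 2))
```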
Let and be sequences of positive constants approaching 0 as before. Also, let c, C and q be fixed strictly positive constants such that , and let be a fixed integer. Moreover, for any mapping the support of X to , denote . For simplicity, assume that is an integer.
For all probability laws for the quadruple the following conditions hold: (a) equations 4.5 and 4.6 hold; (b) ; (c) and ; (d) and ; and (e) given a random subset I of [N] of size , the nuisance parameter estimator obeys the following conditions. With P‐probability no less than , , and (i) for the score ψ in 4.7, where , , and (ii) for the score ψ in 4.8, where , .
The following theorem follows as a corollary to the results in Section 3 by verifying Assumptions 3.1 and 3.2.
5. Inference on Treatment Effects in The Interactive Model
5.1. Inference on ATE and ATTE
In this section, we specialize the results of Section 3 to estimate treatment effects under the unconfoundedness assumption of Rosenbaum and Rubin (1983). Within this setting, there is a large classical literature focused on low‐dimensional settings that provides methods for adjusting for confounding variables including regression methods, propensity score adjustment methods, matching methods, and ‘doubly‐robust’ combinations of these methods; see, e.g. Robins and Rotnitzky (1995), Hahn (1998), Hirano et al. (2003) and Abadie and Imbens (2006) as well as the textbook overview provided in Imbens and Rubin (2015). In this section, we present results that complement this important classic work as well as the rapidly expanding body of work on estimation under unconfoundedness using ML methods; see, among others, Athey et al. (2016), Belloni et al. (2014a, 2017), Farrell (2015) and Imai and Ratkovic (2013).
The confounding factors X affect the policy variable via the propensity score m0(X) = Pr(D = 1 | X) and the outcome variable via the function g0(D, X) = E[Y | D, X]. Both of these functions are unknown and potentially complicated, and we can employ ML methods to learn them.
where the nuisance parameter η = (g, m) consists of P‐square‐integrable functions g and m, mapping the support of (D, X) to ℝ and the support of X to (ε, 1 − ε), respectively, for some ε ∈ (0, 1/2). The true value of η is η0 = (g0, m0). This orthogonal moment condition is based on the influence function for the mean for missing data from Robins and Rotnitzky (1995).
where the nuisance parameter η = (g, m, p) consists of P‐square‐integrable functions g and m, mapping the support of X to ℝ and to (ε, 1 − ε), respectively, and a constant p ∈ (ε, 1 − ε), for some ε ∈ (0, 1/2). The true value of η is η0 = (g0(0, ·), m0, p0), where p0 = Pr(D = 1). Note that estimating ATTE does not require estimating g0(1, X). Note also that because p is a constant, it does not affect the DML estimators based on the score ψ in 5.4, but having p simplifies the formula for the variance of the estimator.
Using their respective scores, it can be easily seen that the true parameter values θ0 for ATE and ATTE obey the corresponding moment conditions and that the Neyman orthogonality condition holds.
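For illustration, the sketch below implements a cross‐fitted estimator based on the standard doubly robust (augmented inverse probability weighting) moment for the ATE, which is how we read the score in 5.3; the data-generating process, the learners, and the 0.01/0.99 trimming of the estimated propensity score (the trimming used later in Section 6.2) are choices made for the example.

```python
# Cross-fitted sketch of an ATE estimator in the interactive model using the standard
# doubly robust (AIPW) moment. DGP, learners, and trimming are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
N = 2000
X = rng.normal(size=(N, 5))
m0 = 1 / (1 + np.exp(-X[:, 0]))                    # propensity score
D = rng.binomial(1, m0)
tau = 1.0                                          # true ATE in this illustrative DGP
Y = tau * D + np.cos(X[:, 0]) + rng.normal(size=N)

psi = np.zeros(N)
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m_hat = RandomForestClassifier(n_estimators=300, random_state=0).fit(X[tr], D[tr])
    g1_hat = RandomForestRegressor(n_estimators=300, random_state=1).fit(X[tr][D[tr] == 1], Y[tr][D[tr] == 1])
    g0_hat = RandomForestRegressor(n_estimators=300, random_state=2).fit(X[tr][D[tr] == 0], Y[tr][D[tr] == 0])
    p = np.clip(m_hat.predict_proba(X[te])[:, 1], 0.01, 0.99)   # trim extreme propensities
    g1, g0 = g1_hat.predict(X[te]), g0_hat.predict(X[te])
    # Doubly robust score evaluated on the held-out fold.
    psi[te] = g1 - g0 + D[te] * (Y[te] - g1) / p - (1 - D[te]) * (Y[te] - g0) / (1 - p)

ate = psi.mean()
se = psi.std(ddof=1) / np.sqrt(N)
print(f"ATE estimate {ate:.3f} (s.e. {se:.3f}), truth {tau}")
```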
Let and be sequences of positive constants approaching 0. Also, let and q be fixed strictly positive constants such that , and let be a fixed integer. Moreover, for any , denote . For simplicity, assume that is an integer.
For all probability laws for the triple the following conditions hold: (a) equations 5.1 and 5.2 hold, with ; (b) ; (c) ; (d) ; (e) ; and (f) given a random subset I of [N] of size , the nuisance parameter estimator obeys the following conditions. With P‐probability no less than , , and (i) for the score ψ in 5.3, where and the target parameter is ATE, , and (ii) for the score ψ in 5.4, where and the target parameter is ATTE, .
The only non‐primitive condition here is the assumption on the rate of estimating the nuisance parameters. These rates of convergence are available for the most often used ML methods and are case‐specific, so we do not restate conditions that are needed to reach these rates. The conditions are not the tightest possible, but offer a set of simple conditions under which Theorem 5.1 follows as a special case of the general theorem provided in Section 3. We could obtain more refined conditions by doing customized proofs.
The following theorem follows as a corollary to the results in Section 3 by verifying Assumptions 3.1 and 3.2.
5.2. Inference on Local Average Treatment Effects
In this section, we consider estimation of LATEs with a binary treatment variable, D ∈ {0, 1}, and a binary instrument, Z ∈ {0, 1}.10 As before, Y denotes the outcome variable, and X is the vector of covariates.
Here the nuisance parameter η = (μ, m, p) consists of P‐square‐integrable functions μ, m and p, with μ mapping the support of (Z, X) to ℝ, and with m and p mapping the supports of (Z, X) and X, respectively, to (ε, 1 − ε) for some ε ∈ (0, 1/2). It is easy to verify that this score satisfies the moment condition and also the orthogonality condition at the true value η0.
Let and be sequences of positive constants approaching 0. Also, let c, C and q be fixed strictly positive constants such that , and let be a fixed integer. Moreover, for any , where ℓ1 is a function mapping the support of to and ℓ2 and ℓ3 are functions respectively mapping the support of and X to for some , we denote . For simplicity, assume that is an integer.
For all probability laws for the quadruple the following conditions hold: (a) equations 5.6–5.8 hold, with and ; (b) ; (c) ; (d) , (e) ; (f) ; and (g) given a random subset I of [N] of size , the nuisance parameter estimator obeys the following conditions. With P‐probability no less than , , , and .
The following theorem follows as a corollary to the results in Section 3 by verifying Assumptions 3.1 and 3.2.
6. Empirical Examples
To illustrate the methods developed in the preceding sections, we consider three empirical examples. The first example reexamines the Pennsylvania Reemployment Bonus experiment, which used a randomized control trial to investigate the incentive effect of unemployment insurance. In the second, we use the DML method to estimate the effect on accumulated assets of 401(k) eligibility, the treatment variable, and of 401(k) participation, a self‐selected decision to receive the treatment that we instrument with 401(k) eligibility. In this example, the treatment variable is not randomly assigned and we aim to eliminate the potential biases due to the lack of random assignment by flexibly controlling for a rich set of variables. In the third, we revisit the IV estimation by Acemoglu et al. (2001) of the effects of institutions on economic growth by estimating a partially linear IV model.
6.1. Effect of Unemployment Insurance Bonus on Unemployment Duration
In this example, we reanalyse the Pennsylvania Reemployment Bonus experiment, which was conducted by the US Department of Labor in the 1980s to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). This experiment has been previously studied by Bilias (2000) and Bilias and Koenker (2002). In these experiments, UI claimants were randomly assigned either to a control group or to one of five treatment groups.11 In the control group, the standard rules of the UI system applied. Individuals in the treatment groups were offered a cash bonus if they found a job within some pre‐specified period of time (qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus was declining over time in the qualification period; see Bilias and Koenker (2002) for further details.
In our empirical example, we focus only on the most generous compensation scheme, treatment 4, and we drop all individuals who received other treatments. In this treatment, the bonus amount is high and the qualification period is long compared to other treatments, and claimants are eligible to enroll in a workshop. Our treatment variable, D, is an indicator variable for being assigned treatment 4, and the outcome variable, Y, is the log of duration of unemployment for the UI claimants. The vector of covariates, X, consists of age group dummies, gender, race, the number of dependents, quarter of the experiment, location within the state, existence of recall expectations and type of occupation.
We report results based on five simple methods for estimating the nuisance functions used in forming the orthogonal estimating equations. We consider three tree‐based methods, labelled ‘Random forest’, ‘Reg. tree’ and ‘Boosting’, one ℓ1‐penalization‐based method, labelled ‘Lasso’, and a neural network method, labelled ‘Neural network’. For Reg. tree, we fit a single CART tree to estimate each nuisance function with penalty parameter chosen by tenfold cross‐validation. The results in the Random forest column are obtained by estimating each nuisance function with a random forest that averages over 1000 trees. The results in Boosting are obtained using boosted regression trees with regularization parameters chosen by tenfold cross‐validation. To estimate the nuisance functions using the neural networks, we use two neurons and a decay parameter of 0.02, and we set the activation function to logistic for classification problems and to linear for regression problems.12 Lasso estimates an ℓ1‐penalized linear regression model using the data‐driven penalty parameter selection rule developed in Belloni et al. (2012). For Lasso, we use a set of 96 potential control variables formed from the raw set of covariates and all second‐order terms (i.e. all squares and first‐order interactions). For the remaining methods, we use the raw set of covariates as features.
We also consider two hybrid methods labelled ‘Ensemble’ and ‘Best’. Ensemble optimally combines four of the ML methods listed above by estimating the nuisance functions as weighted averages of estimates from Lasso, Boosting, Random forest and Neural network. The weights are restricted to sum to one and are chosen so that the weighted average of these methods gives the lowest average mean squared out‐of‐sample prediction error estimated using fivefold cross‐validation. The final column in Table 1 (Best) reports results that combine the methods in a different way. After obtaining estimates from the five simple methods and Ensemble, we select the best methods for estimating each nuisance function based on the average out‐of‐sample prediction performance for the target variable associated with each nuisance function obtained from each of the previously described approaches. As a result, the reported estimate in the last column uses different ML methods to estimate different nuisance functions. Note that if a single method outperformed all the others in terms of prediction accuracy for all nuisance functions, the estimate in the Best column would be identical to the estimate reported under that method.
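As an illustration of how such ensemble weights can be computed, the sketch below stacks fivefold out‐of‐sample predictions from a few candidate learners and chooses weights summing to one by minimizing the cross‐validated mean squared error. The particular learners, the non‐negativity restriction on the weights, and the optimizer are assumptions made for the example rather than the exact implementation behind Table 1.

```python
# Sketch of choosing ensemble weights (summing to one) that minimize out-of-fold MSE.
# Learners, the non-negativity constraint, and the optimizer are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(6)
N = 1000
X = rng.normal(size=(N, 10))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(size=N)   # a nuisance-function target

learners = [
    LassoCV(cv=5),
    RandomForestRegressor(n_estimators=300, random_state=0),
    GradientBoostingRegressor(random_state=0),
]
# Out-of-fold predictions from each learner (fivefold cross-validation).
P = np.column_stack([cross_val_predict(est, X, y, cv=5) for est in learners])

def cv_mse(w):
    return np.mean((y - P @ w) ** 2)

w0 = np.full(P.shape[1], 1.0 / P.shape[1])
res = minimize(cv_mse, w0, method="SLSQP",
               bounds=[(0.0, 1.0)] * P.shape[1],
               constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}])
print("ensemble weights:", np.round(res.x, 3))
```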
Table 1. Estimated effect of cash bonus on unemployment duration
| | Lasso | Reg. tree | Random forest | Boosting | Neural network | Ensemble | Best |
|---|---|---|---|---|---|---|---|
| Panel A: interactive regression model | | | | | | | |
| ATE (twofold) | −0.081 | −0.084 | −0.074 | −0.079 | −0.073 | −0.079 | −0.078 |
| median SE | [0.036] | [0.036] | [0.036] | [0.036] | [0.036] | [0.036] | [0.036] |
| adjusted SE | (0.036) | (0.036) | (0.036) | (0.036) | (0.036) | (0.036) | (0.036) |
| ATE (fivefold) | −0.081 | −0.085 | −0.074 | −0.077 | −0.073 | −0.078 | −0.077 |
| median SE | [0.036] | [0.036] | [0.036] | [0.035] | [0.036] | [0.036] | [0.036] |
| adjusted SE | (0.036) | (0.037) | (0.036) | (0.036) | (0.036) | (0.036) | (0.036) |
| Panel B: partially linear regression model | | | | | | | |
| ATE (twofold) | −0.080 | −0.084 | −0.077 | −0.076 | −0.074 | −0.075 | −0.075 |
| median SE | [0.036] | [0.036] | [0.035] | [0.035] | [0.035] | [0.035] | [0.035] |
| adjusted SE | (0.036) | (0.036) | (0.037) | (0.036) | (0.036) | (0.036) | (0.036) |
| ATE (fivefold) | −0.080 | −0.084 | −0.077 | −0.074 | −0.073 | −0.075 | −0.074 |
| median SE | [0.036] | [0.036] | [0.035] | [0.035] | [0.035] | [0.035] | [0.035] |
| adjusted SE | (0.036) | (0.037) | (0.036) | (0.035) | (0.036) | (0.035) | (0.035) |
Note: Estimated ATE and standard errors from a linear model (Panel B) and heterogeneous effect model (Panel A) based on orthogonal estimating equations. Column labels denote the method used to estimate nuisance functions. Results are based on 100 splits with point estimates calculated using the median method. The median standard errors across the splits are reported in brackets, and standard errors calculated using the median method to adjust for variation across splits are provided in parentheses. Further details about the methods are provided in the main text.
Table 1 presents DML2 estimates of the ATE on unemployment duration using the median method described in Section 3.4. We report results for the heterogeneous effect model in Panel A and for the partially linear model in Panel B. Because the treatment is randomly assigned, we use the fraction of treated as the estimator of the propensity score in forming the orthogonal estimating equations.13 For both the partially linear model and the interactive model, we report estimates obtained using twofold cross‐fitting and fivefold cross‐fitting. All results are based on taking 100 different sample splits. We summarize results across the sample splits using the median method. For comparison, we report two different standard errors: in brackets, we report the median standard errors from across the 100 splits; in parentheses, we report standard errors adjusted for variability across the sample splits using the median method.
The estimation results are consistent with the findings of previous studies that have analysed the Pennsylvania Bonus Experiment. The ATE on unemployment duration is negative and significant across all estimation methods at the 5% level regardless of the standard error estimator used. Interestingly, we see that there is no practical difference across the two different standard errors in this example.
6.2. Effect of 401(K) Eligibility and Participation on Net Financial Assets
The key problem in determining the effect of 401(k) eligibility is that working for a firm that offers access to a 401(k) plan is not randomly assigned. To overcome the lack of random assignment, we follow the strategy developed in Poterba et al. (1994a,b). In these papers, the authors use data from the 1991 Survey of Income and Program Participation and they argue that eligibility for enrolling in a 401(k) plan in these data can be taken as exogenous after conditioning on a few observables – of which the most important for their argument is income. The basic idea of their argument is that, at least around the time 401(k) initially became available, people were unlikely to be basing their employment decisions on whether an employer offered a 401(k) but would instead focus on income and other aspects of the job. Following this argument, whether one is eligible for a 401(k) may then be taken as exogenous after appropriately conditioning on income and other control variables related to job choice.
A key component of the argument underlying the exogeneity of 401(k) eligibility is that eligibility can only be taken as exogenous after conditioning on income and other variables related to job choice that might correlate with whether a firm offers a 401(k). Poterba et al. (1994a,b) and many subsequent papers adopt this argument but control only linearly for a small number of terms. One might wonder whether such specifications are able to adequately control for income and other related confounds. At the same time, the power to learn about treatment effects decreases as one allows more flexible models. The principled use of flexible ML tools offers one resolution to this tension. The results presented below thus complement previous results that rely on the assumption that confounding effects can adequately be controlled for by a small number of variables chosen ex ante by the researcher.
In the example in this paper, we use the same data as in Chernozhukov and Hansen (2004). We use net financial assets – defined as the sum of IRA balances, 401(k) balances, checking accounts, US saving bonds, other interest‐earning accounts in banks and other financial institutions, other interest‐earning assets (such as bonds held personally), stocks, and mutual funds less non‐mortgage debt – as the outcome variable, Y, in our analysis. Our treatment variable, D, is an indicator for being eligible to enroll in a 401(k) plan. The vector of raw covariates, X, consists of age, income, family size, years of education, a married indicator, a two‐earner status indicator, a defined benefit pension status indicator, an IRA participation indicator, and a home‐ownership indicator.
In Table 2, we report DML2 estimates of ATE of 401(k) eligibility on net financial assets both in the partially linear model as in 1.1 and allowing for heterogeneous treatment effects using the interactive model outlined in Section 5.1. To reduce the disproportionate impact of extreme propensity score weights in the interactive model, we trim the propensity scores at 0.01 and 0.99. We present two sets of results based on sample splitting as discussed in Section 3 using twofold cross‐fitting and fivefold cross‐fitting. As in the previous section, we consider 100 different sample partitions and summarize the results across different sample splits using the median method. For comparison, we report two different standard errors: in brackets, we report the median standard errors from across the 100 splits; in parentheses, we report standard errors adjusted for variability across the sample splits using the median method. We consider the same methods with the same tuning choices for estimating the nuisance functions as in the previous example, with one exception, and so we do not repeat details for brevity. The one exception is that we implement neural networks with eight neurons and a decay parameter of 0.01 in this example.
Table 2. Estimated effect of 401(k) eligibility on net financial assets
| | Lasso | Reg. tree | Random forest | Boosting | Neural network | Ensemble | Best |
|---|---|---|---|---|---|---|---|
| Panel A: interactive regression model | | | | | | | |
| ATE | 6830 | 7713 | 7770 | 7806 | 7764 | 7702 | 7546 |
| (twofold) | [1282] | [1208] | [1276] | [1159] | [1328] | [1149] | [1360] |
| | (1530) | (1271) | (1363) | (1202) | (1468) | (1170) | (1533) |
| ATE | 7170 | 7993 | 8105 | 7713 | 7788 | 7839 | 7753 |
| (fivefold) | [1201] | [1198] | [1242] | [1155] | [1238] | [1134] | [1237] |
| | (1398) | (1236) | (1299) | (1177) | (1293) | (1148) | (1294) |
| Panel B: partially linear regression model | | | | | | | |
| ATE | 7717 | 8709 | 9116 | 8759 | 8950 | 9010 | 9125 |
| (twofold) | [1346] | [1363] | [1302] | [1339] | [1335] | [1309] | [1304] |
| | (1749) | (1427) | (1377) | (1382) | (1408) | (1344) | (1357) |
| ATE | 8187 | 8871 | 9247 | 9110 | 9038 | 9166 | 9215 |
| (fivefold) | [1298] | [1358] | [1295] | [1314] | [1322] | [1299] | [1294] |
| | (1558) | (1418) | (1328) | (1328) | (1355) | (1310) | (1312) |
Note: Estimated ATE and standard errors from the partially linear model (Panel B) and the heterogeneous effect (interactive) model (Panel A) based on orthogonal estimating equations. Column labels denote the method used to estimate the nuisance functions. Results are based on 100 splits, with point estimates calculated using the median method. The median standard errors across the splits are reported in brackets, and standard errors calculated using the median method to adjust for variation across splits are provided in parentheses. Further details about the methods are provided in the main text.
Turning to the results, it is first worth noting that the estimated ATE of 401(k) eligibility on net financial assets is $19,559 with an estimated standard error of 1413 when no control variables are used. Of course, this number is not a valid estimate of the causal effect of 401(k) eligibility on financial assets if there are neglected confounding variables as suggested by Poterba et al. (1994a,b). When we turn to the estimates that flexibly account for confounding reported in Table 2, we see that they are substantially attenuated relative to this baseline that does not account for confounding, suggesting much smaller causal effects of 401(k) eligibility on financial asset holdings. It is interesting and reassuring that the results obtained from the different flexible methods are broadly consistent with each other. This similarity is consistent with the theory that suggests that results obtained through the use of orthogonal estimating equations and any sensible method of estimating the necessary nuisance functions should be similar. Finally, it is interesting that these results are also broadly consistent with those reported in the original work of Poterba et al. (1994a,b), who used a simple intuitively motivated functional form, suggesting that this intuitive choice was sufficiently flexible to capture much of the confounding variation in this example.
As a further illustration, we also report the LATE in this example where we take the endogenous treatment variable to be participating in a 401(k) plan. Even after controlling for features related to job choice, it seems likely that the actual choice of whether to participate in an offered plan would be endogenous. Of course, we can use eligibility for a 401(k) plan as an instrument for participation in a 401(k) plan under the conditions that were used to justify the exogeneity of eligibility for a 401(k) plan provided above in the discussion of estimation of the ATE of 401(k) eligibility.
We report DML2 results for estimating the LATE of 401(k) participation, using 401(k) eligibility as an instrument, in Table 3. We employ the procedure outlined in Section 5.2, using the same ML estimators as in the ATE analysis of 401(k) eligibility to estimate the quantities entering the orthogonal estimating equation, and so we omit the details for brevity. Looking at the results, we see that the estimated causal effect of 401(k) participation on net financial assets is uniformly positive and statistically significant across all of the considered methods. As when looking at the ATE of 401(k) eligibility, it is reassuring that the results obtained from the different flexible methods are broadly consistent with each other. It is also interesting that the results based on flexible ML methods are broadly consistent with, though somewhat attenuated relative to, those obtained by applying the same specification for controls as used in Poterba et al. (1994a,b) in a linear IV model, which returns an estimated effect of participation of $13,102 with an estimated standard error of 1,922. The mild attenuation may suggest that the simple, intuitively motivated control specification used in the original baseline is somewhat too simplistic.
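The following sketch illustrates the corresponding DML2 LATE calculation with a binary instrument: cross-fitted estimates of the instrument propensity score and of the outcome and participation regressions at each instrument value enter an orthogonalized ratio of reduced-form and first-stage effects. The learners, function name, and trimming choice are illustrative assumptions, not the exact implementation behind Table 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def dml2_late(y, d, z, X, n_folds=5, trim=(0.01, 0.99), seed=0):
    """DML2 estimate of the LATE with a binary instrument z, using cross-fitted
    estimates of g(z, X) = E[Y | Z=z, X], r(z, X) = E[D | Z=z, X] and m(X) = P(Z=1 | X).
    Illustrative sketch under placeholder learner choices."""
    n = len(y)
    g0, g1, r0, r1, m = (np.zeros(n) for _ in range(5))
    for tr, te in KFold(n_splits=n_folds, shuffle=True, random_state=seed).split(X):
        for zv, g_hat, r_hat in [(0, g0, r0), (1, g1, r1)]:
            mask = z[tr] == zv
            g_hat[te] = RandomForestRegressor(random_state=seed).fit(
                X[tr][mask], y[tr][mask]).predict(X[te])
            r_hat[te] = RandomForestClassifier(random_state=seed).fit(
                X[tr][mask], d[tr][mask]).predict_proba(X[te])[:, 1]
        m[te] = RandomForestClassifier(random_state=seed).fit(
            X[tr], z[tr]).predict_proba(X[te])[:, 1]
    m = np.clip(m, trim[0], trim[1])  # trim extreme instrument propensity scores
    # orthogonalized reduced form (numerator) and first stage (denominator)
    num = g1 - g0 + z * (y - g1) / m - (1 - z) * (y - g0) / (1 - m)
    den = r1 - r0 + z * (d - r1) / m - (1 - z) * (d - r0) / (1 - m)
    theta = num.mean() / den.mean()
    psi = num - theta * den  # orthogonal score evaluated at theta
    se = np.sqrt(np.mean(psi ** 2) / n) / abs(den.mean())
    return theta, se
```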
Table 3. Estimated effect of 401(k) participation on net financial assets
| | Lasso | Reg. tree | Random forest | Boosting | Neural network | Ensemble | Best |
|---|---|---|---|---|---|---|---|
| LATE | 8978 | 11073 | 11384 | 11329 | 11094 | 11119 | 10952 |
| (twofold) | [2192] | [1749] | [1832] | [1666] | [1903] | [1653] | [1657] |
| | (3014) | (1849) | (1993) | (1718) | (2098) | (1689) | (1699) |
| LATE | 8944 | 11459 | 11764 | 11133 | 11186 | 11173 | 11113 |
| (fivefold) | [2259] | [1717] | [1788] | [1661] | [1795] | [1641] | [1645] |
| | (3307) | (1786) | (1893) | (1710) | (1890) | (1678) | (1675) |
Note: Estimated LATE based on orthogonal estimating equations. Column labels denote the method used to estimate the nuisance functions. Results are based on 100 splits, with point estimates calculated using the median method. The median standard errors across the splits are reported in brackets, and standard errors calculated using the median method to adjust for variation across splits are provided in parentheses. Further details about the methods are provided in the main text.
Looking at Tables 2 and 3, there are other interesting observations that can provide useful insights into the finite-sample properties of the DML estimation method. First, the standard errors of the estimates obtained using fivefold cross‐fitting are lower than those obtained from twofold cross‐fitting for all methods across all cases. This fact suggests that having more observations in the auxiliary sample may be desirable. Specifically, the fivefold cross‐fitting estimates use more observations to learn the nuisance functions than twofold cross‐fitting and are thus likely to learn them more precisely. This increase in precision in learning the nuisance functions may then translate into more precisely estimated parameters of interest. While intuitive, we note that this statement does not seem to be generalizable, in that there does not appear to be a general relationship between the number of folds in cross‐fitting and the precision of the estimate of the parameter of interest (see the next example). Secondly, we see that the standard errors of the Lasso estimates, after adjusting for variation due to sample splitting, are noticeably larger than the standard errors coming from the other ML methods. We believe that this is because the out‐of‐sample prediction errors from a linear model tend to be larger when there is a need to extrapolate. In our framework, if the main sample includes observations that are outside of the range of the observations in the auxiliary sample, the model has to extrapolate to those observations. The fact that the standard errors are lower in fivefold than in twofold cross‐fitting for the Lasso estimates also supports this hypothesis, because the larger number of observations in the auxiliary sample reduces the degree of extrapolation. We also see a noticeable increase in the standard errors that account for variability due to sample splitting relative to the simple unadjusted standard errors in this case, though these differences do not qualitatively change the results.
6.3. Effect of Institutions on Economic Growth
To demonstrate DML estimation of partially linear structural equation models with instrumental variables, we consider estimation of the effect of institutions on aggregate output following the work of Acemoglu et al. (2001, hereafter AJR). Estimating the effect of institutions on output is complicated by the clear potential for simultaneity between institutions and output. Specifically, better institutions may lead to higher incomes, but higher incomes may also lead to the development of better institutions. To help overcome this simultaneity, AJR use mortality rates for early European settlers as an instrument for institution quality. The validity of this instrument hinges on the arguments that settlers set up better institutions in places where they are more likely to establish long‐term settlements, that where they are likely to settle for the long term is related to settler mortality at the time of initial colonization, and that institutions are highly persistent. The exclusion restriction for the instrumental variable is then motivated by the argument that GDP, while persistent, is unlikely to be strongly influenced by mortality in the previous century, or earlier, except through institutions.
In their paper, AJR note that their IV strategy will be invalidated if other factors are also highly persistent and related to the development of institutions within a country and to the country's GDP. A leading candidate for such a factor, as they discuss, is geography. AJR address this by assuming that the confounding effect of geography is adequately captured by a linear term in distance from the equator and a set of continent dummy variables. Using DML allows us to relax this assumption and to replace it by a weaker assumption that geography can be sufficiently controlled by an unknown function of distance from the equator and continent dummies, which can be learned by ML methods.
We use the same set of 64 country‐level observations as AJR. The data set contains measurements of GDP, settler mortality, an index measuring protection against expropriation risk, and geographic information. The outcome variable, Y, is the logarithm of GDP per capita, and the endogenous explanatory variable, D, is a measure of the strength of individual property rights that serves as a proxy for the strength of institutions. To deal with endogeneity, we use an instrumental variable, Z, given by the mortality rate of early European settlers. Our raw set of control variables, X, includes distance from the equator and dummy variables for Africa, Asia, North America and South America.
In Table 4, we report results from applying DML2 following the procedure outlined in Section 4.2. The ML methods and tuning parameters considered are the same as in the previous examples, except that we exclude the neural network from the ensemble because the small sample size causes stability problems in training it. We use the raw set of covariates and all second‐order terms for the Lasso estimation, and we use only the raw set of covariates for the remaining methods. As in the previous examples, we consider 100 different sample splits and report the median estimates of the coefficient along with two different standard error estimates: in brackets, the median standard errors across the 100 splits; in parentheses, standard errors adjusted for variability across the sample splits using the median method. Finally, we report results from both twofold and fivefold cross‐fitting, as in the other examples.
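As a rough illustration of this procedure, the sketch below residualizes the outcome, the endogenous regressor, and the instrument on the controls with cross-fitted ML fits and then applies IV to the residuals. The random-forest learner and function name are placeholder assumptions; with only 64 observations, any such implementation would require careful tuning in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml2_pliv(y, d, z, X, n_folds=5, seed=0):
    """DML2 for the partially linear IV model: residualize Y, D and Z on X with
    cross-fitted ML fits, then apply IV using the residualized instrument.
    Illustrative sketch under placeholder learner choices."""
    n = len(y)
    res = {}
    for name, v in [('y', y), ('d', d), ('z', z)]:
        res[name] = np.zeros(n)
        for tr, te in KFold(n_splits=n_folds, shuffle=True, random_state=seed).split(X):
            fit = RandomForestRegressor(random_state=seed).fit(X[tr], v[tr])
            res[name][te] = v[te] - fit.predict(X[te])  # cross-fitted residual
    ry, rd, rz = res['y'], res['d'], res['z']
    theta = np.sum(rz * ry) / np.sum(rz * rd)  # just-identified IV on the residuals
    psi = rz * (ry - theta * rd)               # orthogonal score at theta
    se = np.sqrt(np.mean(psi ** 2) / n) / abs(np.mean(rz * rd))
    return theta, se
```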
Table 4. Estimated effect of institutions on output
| | Lasso | Reg. tree | Random forest | Boosting | Neural network | Ensemble | Best |
|---|---|---|---|---|---|---|---|
| Twofold | 0.85 | 0.81 | 0.84 | 0.77 | 0.94 | 0.80 | 0.83 |
| | [0.28] | [0.42] | [0.38] | [0.33] | [0.32] | [0.35] | [0.34] |
| | (0.22) | (0.29) | (0.30) | (0.27) | (0.28) | (0.30) | (0.29) |
| Fivefold | 0.77 | 0.95 | 0.90 | 0.73 | 1.00 | 0.83 | 0.88 |
| | [0.24] | [0.46] | [0.41] | [0.33] | [0.33] | [0.37] | [0.41] |
| | (0.17) | (0.45) | (0.40) | (0.27) | (0.30) | (0.34) | (0.39) |
Note: Estimated coefficient from the partially linear IV model based on orthogonal estimating equations. Column labels denote the method used to estimate the nuisance functions. Results are based on 100 splits, with point estimates calculated using the median method. The median standard errors across the splits are reported in brackets, and standard errors calculated using the median method to adjust for variation across splits are provided in parentheses. Further details about the methods are provided in the main text.
In this example, we see uniformly large and positive point estimates across all procedures considered, and estimated effects are statistically significant at the 5% level. As in the second example, we see that adjusting for variability across sample splits leads to noticeable increases in estimated standard errors but does not result in qualitatively different conclusions. Interestingly, we see that the estimated standard errors based on fivefold cross‐fitting are larger than those based on twofold cross‐fitting in all procedures except lasso, which differs from the finding in the 401(k) example. Further understanding these differences and the impact of the number of folds on inference for objects of interest seems to be an interesting question for future research. Finally, although the estimated coefficients are somewhat smaller than the baseline estimates reported in AJR – an estimated coefficient of 1.10 with estimated standard error of 0.46 (see AJR, Table 4, Panel A, column 7) – the results are qualitatively similar, indicating a strong and positive effect of institutions on output.
6.4. Comments on Empirical Results
Before closing this section, we want to emphasize some important conclusions that can be drawn from these empirical examples. First, the choice of ML method used to estimate the nuisance functions does not substantively change the conclusions in any of the examples; we obtain broadly consistent results regardless of which method we employ. This robustness across methods is exactly what the theory predicts, provided that all of the employed methods deliver sufficiently high-quality approximations to the underlying nuisance functions. Secondly, incorporating the uncertainty due to sample splitting via the median method increases the standard errors relative to a baseline that ignores this uncertainty, though the increases do not alter the main results in any of the examples. This limited sensitivity suggests that the parameter estimates are robust to the particular sample splits used in estimation in these examples.
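For reference, the sketch below shows one way to code the median aggregation across repeated sample splits as we understand it: the reported point estimate is the median of the split-specific estimates, and the reported variance is the median of the split-specific variances inflated by the squared deviation of each split's estimate from that median. Treat this as a sketch of the median method described in the text rather than the exact implementation.

```python
import numpy as np

def median_aggregate(theta_splits, se_splits):
    """Combine estimates from repeated sample splits: report the median point estimate
    and a standard error whose variance adds the squared deviation of each split's
    estimate from the median before taking the median across splits."""
    theta_splits, se_splits = np.asarray(theta_splits), np.asarray(se_splits)
    theta = np.median(theta_splits)
    var_adj = np.median(se_splits ** 2 + (theta_splits - theta) ** 2)
    return theta, np.sqrt(var_adj)

# usage over, e.g., 100 random partitions of the sample:
# theta_hat, se_hat = median_aggregate(theta_by_split, se_by_split)
```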
Acknowledgements
We would like to acknowledge research support from the National Science Foundation. We also thank participants of the MIT Stochastics and Statistics seminar, the Kansas Econometrics conference, the Royal Economic Society Annual Conference, the Hannan Lecture at the Australasian Econometric Society meeting, the Econometric Theory lecture at the EC2 meetings 2016 in Toulouse, the CORE 50th Anniversary Conference, the Becker–Friedman Institute Conference on Machine Learning and Economics, the INET conferences on Big Data at the University of Southern California in Los Angeles, the World Congress of Probability and Statistics 2016, the Joint Statistical Meetings 2016, the New England Day of Statistics Conference, CEMMAP's Masterclass on Causal Machine Learning, and St Gallen's summer school on Big Data, for many useful comments and questions. We would like to thank Susan Athey, Peter Aronow, Jin Hahn, Guido Imbens, Mark van der Laan and Matt Taddy for constructive comments. We thank Peter Aronow for pointing us to the literature on targeted learning on which we build, along with prior works of Neyman, Bickel, and the many other contributions to semi‐parametric learning theory.
Footnotes
We consider the case where D is a scalar for simplicity. Extension to the case where D is a vector of fixed, finite dimension is accomplished by introducing an equation such as 1.2 for each element of the vector.
For instance, we could use lasso if we believe g0 is well approximated by a sparse linear combination of pre‐specified functions of X. In other settings, we could, for example, use iterative methods that alternate between random forests, for estimating g0, and least squares, for estimating θ0.
In Section 4, we also consider another debiased estimator, based on the partialling‐out approach of Robinson (1988):
Each of these works differs in terms of detail but can be viewed through the lens of either debiasing or orthogonalization to alleviate the impact of regularization bias on subsequent estimation and inference.
See Section 2 for the definition of the Gateaux derivative operator.
Targeted minimum loss estimation, which shares similar properties, is also discussed in, e.g., van der Laan and Rose (2011) and van der Laan (2015).
The C(α)‐statistic, or the orthogonal score statistic, has been explicitly used for testing and estimation in high‐dimensional sparse models in Belloni et al. (2015).
It is interesting to note that the methods for constructing Neyman orthogonal scores described in Section 2 may give scores that are different from those in 4.7 and 4.8. For example, applying the method for conditional moment restriction problems in Section 2.2.4 with gives the score , where the true values of and are and , respectively. It may be interesting to compare properties of the DML estimators based on this score with those based on 4.7 and 4.8 in future work.
Without unconfoundedness/conditional exogeneity, these quantities measure association, and could be referred to as average predictive effect (APE) and average predictive effect for the exposed (APEX). Inferential results for these objects would follow immediately from Theorem 5.1.
Similar results can be provided for the local average treatment effect on the treated (LATT) by adapting the following arguments to use the orthogonal scores for the LATT; see, e.g. Belloni et al. (2017).
There are six treatment groups in the experiments. Following Bilias (2000), we merge groups 4 and 6.
We also experimented with Deep Learning methods from which we obtained similar results for some tuning parameters. However, we ran into stability and computational issues and we have chosen not to report these results in the empirical section.
We also estimated the effects using non‐parametric estimates of the conditional propensity score obtained from the ML procedures given in the column labels. As expected due to randomization, the results are similar to those provided in Table 1 and are not reported for brevity.
Appendix
Proofs of Results
In this appendix, we use $C$ to denote a strictly positive constant that is independent of $n$ and $P \in \mathcal{P}_n$. The value of $C$ may change at each appearance. Also, the notation $a_n \lesssim b_n$ means that $a_n \leq C b_n$ for all $n$ and some $C$. The notation $a_n \gtrsim b_n$ means that $b_n \lesssim a_n$. Moreover, the notation $a_n = o(1)$ means that there exists a sequence $(b_n)_{n \geq 1}$ of positive numbers such that (a) $|a_n| \leq b_n$ for all $n$, (b) $b_n$ is independent of $P \in \mathcal{P}_n$ for all $n$, and (c) $b_n \to 0$ as $n \to \infty$. Finally, the notation $a_n = O_P(b_n)$ means that, for all $\epsilon > 0$, there exists $C$ such that $\mathrm{P}_P(|a_n| > C b_n) \leq \epsilon$ for all $n$. Using this notation allows us to avoid repeating 'uniformly over $P \in \mathcal{P}_n$' many times in the proofs.
The following lemma is useful particularly in the sample‐splitting contexts.
Let and be sequences of random vectors. (a) If, for , , then . In particular, this occurs if for some , by Markov's inequality. (b) Let be a sequence of positive constants. If conditional on , namely, that for any , , then unconditionally, namely, that for any , .
(a) For any , as the sequence is uniformly integrable. To show the second part note that by Markov's inequality. (b) This follows from (a).
Let be a sequence of independent copies of a random element W taking values in a measurable space according to a probability law P. Let be a set of suitably measurable functions , equipped with a measurable envelope .
where and is a constant depending only on q and c.
This completes the proof of the lemma.
where is the identity matrix and ⊗ is the Kronecker product.
: To start with, note that 3.11 follows immediately from the assumptions. Hence, it suffices to show that 3.10 holds uniformly over .
- Step 1. DenoteIn Steps 2–5 respectively, we will show thatA.5A.6A.7Because and all singular values of J0 are bounded below from zero by Assumption 3.1, it follows from 189 that with ‐probability , all singular values of are bounded below from zero as well. Therefore, with the same ‐probability,A.8andIn addition, given thatA.9it follows from 189 thatMoreover, because , it follows from 190 and 191 thatA.10Combining 196 and 197 givesA.11Now, substituting the last bound into 194 yieldswhere in the second line we used 190 and the definition of . Combining this with 192 givesby the definition of given in the statement of the theorem. In turn, because , combining 200 with the Lindeberg–Feller central limit theorem (CLT) and the Cramer–Wold device yields A.4. To complete the proof of the theorem, it remains to establish the bounds 189–192. We do so in the following four steps.A.12
- Step 2. In this step, we establish 189. Because K is a fixed integer, which is independent of N, it suffices to show that for any ,To do so, fix any and observe that by the triangle inequality,A.13whereA.14To bound , note that on the event , which holds with ‐probability ,and so . To bound , note that conditional on , the estimator is non‐stochastic, and so on the event ,where the last inequality holds by Assumption 3.2. Hence, by Lemma 6.1. Combining the bounds and with 202 gives 201.
- Step 3. In this step, we establish 190. This is the step where we invoke the Neyman orthogonality (or near‐orthogonality) condition. Again, because K is a fixed integer, which is independent of N, it suffices to show that for any ,To do so, fix any and introduce the following additional empirical process notation,A.15where ϕ is any ‐integrable function on . Then observe that by the triangle inequality,whereA.16To bound , note that, as above, conditional on , the estimator is non‐stochastic, and so on the event ,by the definition of in Assumption 3.2. Hence, by Lemma 6.1. To bound , introduce the functionThen, by Taylor expansion,However, becauseIn addition, on the event , by the Neyman near‐orthogonality condition imposed in Assumption 3.1,Moreover, on the event ,by the definition in Assumption 3.2. Hence,Combining the bounds on and with 208 and using the fact that gives 206.
- Step 5. Here we establish 192. Note that all eigenvalues of the matrixare bounded from below by because all singular values of J0 are bounded from above by c1 by Assumption 3.1 and all eigenvalues of are bounded from below by c0 by Assumption 3.2. Hence, given that is the largest eigenvalue of , it follows that . This gives 192 and completes the proof of the theorem.
: As in the case of the DML2 version, note that 3.11 follows immediately from the assumptions, and so it suffices to show that 3.10 holds uniformly over .
In this proof, all bounds hold uniformly in for , and we do not repeat this qualification throughout. Also, the second asserted claim follows immediately from the first one and Theorem 3.1. Hence, it suffices to prove the first asserted claim.
We only consider the case of the DML1 estimator and note that the DML2 estimator can be treated similarly.
With the help of Lemma 6.3, which establishes approximate linearity of the subsample DML estimators and is presented below, the proof is the same as that in the linear case presented in the Proof of Theorem 3.1 (DML1 case).
uniformly over , where and where .
Fix any and any sequence such that for all . To prove the asserted claim, it suffices to show that the estimator satisfies A.28 with P replaced by . To do so, we split the proof into four steps. In the proof, we use , , I and instead of , , and , respectively.
- Step 1. (Preliminary rate result) We claim that with ‐probability ,To show this claim, note that the definition of implies thatA.29which in turn implies via the triangle inequality that, with ‐probability ,whereA.30Here because , and . Also, by Assumption 3.4(c). Moreover, applying Lemma 6.2 to the function class for and defined in Assumption 3.4, conditional on and , so that is fixed after conditioning, shows that with ‐probability ,Hence, it follows from 253 and Assumption 3.3 that with ‐probability ,Combining this bound with the fact that the singular values of J0 are bounded away from zero, which holds by Assumption 3.3, gives the claim of this step.A.31
- Step 2. (Linearization) Here we prove the claim of the lemma. First, by definition of , we haveAlso, it will be shown in Step 4 thatA.32Moreover, for any and , we haveA.33where we are using the fact that . Finally, by Taylor expansion of the function , which vanishes at ,A.34Therefore, because and with ‐probability , and because by Neyman ‐near orthogonality,A.35applying 259 with and , we have with ‐probability ,where by Assumption 3.4,and by Step 3 below, with ‐probability ,Therefore, because all singular values of J0 are bounded below from zero by Assumption 3.3(d), it follows thatA.36The asserted claim now follows by multiplying both parts of the display by (under the norm on the left‐hand side) and noting that singular values of Σ0 are bounded below from zero by Assumptions 3.3 and 3.4.
- Step 3. Here we derive a bound on in 263. We haveTo bound , we apply Lemma 6.2 conditional on and so that can be treated as fixed. Observe that with ‐probability , where we used Assumption 3.4. Thus, an application of Lemma 6.2 to the empirical process with an envelope and for sufficiently large constant C conditional on and yields that with ‐probability ,This follows because by Assumption 3.4(b) and the triangle inequality, andA.37because for defined in Assumption 3.4(b), andby the proof of Theorem 3 in Andrews (1994b). The claim of this step follows.
- Step 4. Here we derive a bound on in 258. Let . Then as is bounded and the singular values of J0 are bounded below from zero by Assumption 3.3(d). Therefore, with ‐probability by Assumption 3.3(a). Hence, with the same probability,and so it suffices to show that with ‐probability ,To prove it, substitute and into 259 and use the Taylor expansion in 260. This shows that with ‐probability ,Combining this with the bounds on and derived above gives the claim of this step and completes the proof of the lemma.
Because Theorem 4.1 is a special case of Theorem 4.2 (with ), it suffices to prove the latter. Also, we only consider the DML estimators based on the score 4.7 and note that the estimators based on the score 4.8 can be treated similarly.
- Step 1. We first verify Neyman orthogonality. We have that by definition of θ0 and η0. Also, for any , the Gateaux derivative in the direction is given by , which is 0 by the law of iterated expectations, as and obey and . This gives Assumption 3.1(d) with .
- Step 2. Note thatby Assumption 4.2(c). In addition,by the triangle inequality, the Hölder inequality, the Jensen inequality and Assumption 4.2(b). This gives Assumption 3.1(e). Hence, given that Assumptions 3.1(a)–(c) hold trivially, Steps 1 and 2 together show that all conditions of Assumption 3.1 hold.
- Step 4. Here we verify Assumption 3.2(b). For any , we haveby Assumption 4.2(b), which gives the bound on in Assumption 3.2(b). Also, becauseby Assumption 4.2(c), it follows that θ0 satisfiesHence,where we used the fact that because , by the Jensen inequality. This gives the bound on in Assumption 3.2(b). Hence, Assumption 3.2(b) holds.
- Step 5. Finally, we verify Assumption 3.2(c). For any , we havewhich gives the bound on in Assumption 3.2(c). Further,which gives the bound on in Assumption 3.2(c). Finally, letThen, for any ,and soHence,which gives the bound on in Assumption 3.2(c). Thus, all conditions of Assumptions 3.1 are verified. This completes the proof.
The proof of Theorem 5.2 is similar to that of Theorem 5.1 and is therefore omitted. In turn, regarding Theorem 5.1, we show the proof for the case of ATE and note that the proof for the case of ATTE is similar.
- Step 1. We first verify Neyman orthogonality. We have that by definition of θ0 and η0. Also, for any , the Gateaux derivative in the direction is given bywhich is 0 by the law of iterated expectations, asThis gives Assumption 3.1(d) with .
Step 2. Note that , and so Assumption 3.1(e) holds trivially. Hence, given that Assumptions 3.1(a)–(c) hold trivially as well, Steps 1 and 2 together show that all conditions of Assumption 3.1 hold.
- Step 4. Here we verify Assumption 3.2(b). We havewhere in the third line, we used the facts that and . Hence, given that by the Jensen inequality and Assumption 5.1(b), it follows thatSimilarly, for any ,as . In addition,Therefore, for any , we haveThis gives the bound on in Assumption 3.2(b). Also, we haveThis gives the bound on in Assumption 3.2(b). Hence, Assumption 3.2(b) holds.
- Step 5. Finally, we verify Assumption 3.2(c). For any , we havewhich gives the bound on in Assumption 3.2(c). Further, by the triangle inequality,whereTo bound , note that by the same argument as that used in Step 4,and so . To bound , we haveA.38Here, the first inequality follows from the bounds and , the second from the facts that and for , , the third from the triangle inequality, the fourth from the facts that and , and the fifth from 302. Similarly, . Combining these inequalities shows thatas long as in the definition of satisfies . This gives the bound on in Assumption 3.2(c). Finally, letThen for any ,and so, given thatit follows that for some constant that depends only on ε and C,as long as the constant in the definition of satisfies . This gives the bound on in Assumption 3.2(c). Thus, all conditions of Assumptions 3.1 are verified. This completes the proof.

Figure 1. Comparison of the conventional and double ML estimators.
Figure 2. Comparison of full‐sample and cross‐fitting procedures.