Victor Chernozhukov, Whitney K Newey, Rahul Singh, Debiased machine learning of global and local parameters using regularized Riesz representers, The Econometrics Journal, Volume 25, Issue 3, September 2022, Pages 576–601, https://doi.org/10.1093/ectj/utac002
Summary
We provide adaptive inference methods, based on |$\ell _1$| regularization, for regular (semiparametric) and nonregular (nonparametric) linear functionals of the conditional expectation function. Examples of regular functionals include average treatment effects, policy effects, and derivatives. Examples of nonregular functionals include average treatment effects, policy effects, and derivatives conditional on a covariate subvector fixed at a point. We construct a Neyman orthogonal equation for the target parameter that is approximately invariant to small perturbations of the nuisance parameters. To achieve this property, we include the Riesz representer for the functional as an additional nuisance parameter. Our analysis yields weak ‘double sparsity robustness’: either the approximation to the regression or the approximation to the representer can be ‘completely dense’ as long as the other is sufficiently ‘sparse’. Our main results are nonasymptotic and imply asymptotic uniform validity over large classes of models, translating into honest confidence bands for both global and local parameters.
1. INTRODUCTION
Many statistical objects of interest can be expressed as a linear functional of a regression function (or projection, more generally). Examples include global parameters: average treatment effects, policy effects from changing the distribution of or transporting regressors, and average directional derivatives, as well as their local versions defined by taking averages over regions of shrinking volume. This variety of important examples motivates the problem of learning linear functionals of regressions. Global parameters are typically regular (estimable at |$1/\sqrt{n}$| rate), and local parameters are nonregular (estimable at slower than |$1/\sqrt{n}$| rates). Global parameters can also be nonregular under weak identification (for example, in average treatment effects, when propensity scores accumulate mass near zero or one, along a given sequence of models).
Often the regression is high dimensional, depending on many variables such as covariates in a treatment effect model. Plugging a machine learner into a functional of interest can give a badly biased estimator. To avoid such bias, we use debiased/‘double’ machine learning (DML) based on Neyman orthogonal scores that have zero derivative with respect to each first step learner (e.g., Neyman, 1959; Belloni et al., 2014, 2015; Chernozhukov et al., 2016, 2018a; Foster and Syrgkanis, 2019). Note that the word ‘double’ emphasizes the connection to double robustness, a property that orthogonal scores have in this case. Such scores are constructed by adding a bias correction term: the average product of the regression residual with a learner of the functional’s Riesz representer (RR). This construction builds upon and is directly inspired by Newey (1994), where such scores arise in the computation of the semiparametric efficiency bound for regular functionals. We also remove overfitting bias (high entropy bias) by using cross-fitting, an efficient form of sample splitting, where we average over data observations different from those used by the nonparametric learners. See, for example, Schick (1986) for early use and Chernozhukov et al. (2018a) for more recent use in the context of debiased machine learning.
Using closed-form solutions for RRs in several examples, Chernozhukov et al. (2016, 2018a) defined DML estimators in high-dimensional settings and established their good properties. In comparison, the new approach proposed in this paper has the following advantages and some limitations:
We provide a novel algorithm based on |$\ell _1$| regularization to automatically estimate the RR from the empirical analog of equations that implicitly characterize it.
Even when a closed-form solution for the RR is available, the method avoids estimating each of its components. For example, the method avoids explicit density derivative estimation for the average derivative, and it avoids inverting estimated propensity scores for average treatment effects.
The adaptive inference theory covers both regular objects (estimable at the |$1/\sqrt{n}$| rate) and nonregular ones (with rates |$L/\sqrt{n}$|, where |$L \rightarrow \infty$| is the operator norm of the linear functional).
As far as we know, the adaptive inference theory given here is the first nonasymptotic Gaussian approximation analysis of debiased machine learning.
Our approach remains interpretable under misspecification, estimating a linear functional of the projection rather than regression. (This point is made explicit in Section 4).
We provide a nonasymptotic analysis when using the |$\ell _1$|-penalized method to learn the regression, and an asymptotic analysis when using other modern machine learning estimators to learn the regression.
The current analysis focuses on linear functionals. In follow-up work, Chernozhukov et al. (2018) extend the approach to nonlinear functionals through a linearization.
This paper is a revised version of Chernozhukov et al. (2018d) that gave an algorithm based on |$\ell _1$| regularization for automatically estimating the RR. This version is distinguished from Chernozhukov et al. (2018d), Chernozhukov et al. (2018a), Chernozhukov et al. (2016), and Chernozhukov et al. (2018c) in covering local objects that are estimated at a rate slower than |$1/\sqrt{n}$|. Providing debiased machine learning for such local objects is an important contribution of this paper.
Sections 2 and 3 present the main ideas for a general audience. In Section 2, we define global, local, and perfectly localized linear functionals of the regression, and provide orthogonal representations for these functionals. In Section 3, we present two empirical examples: local and global average treatment effects, and local and global average derivatives.
Sections 4 and 5 are theoretical. In Section 4, we provide estimation theory, demonstrating concentration and approximate Gaussianity of the DML estimator with the regression and RR estimated via regularized moment conditions. We provide rates of convergence for estimating the RR, giving fast rates under approximate sparsity. In Section 5, we demonstrate asymptotic consistency and Gaussianity of the DML estimator with the regression estimated via general machine learning.
The Online Appendix provides supporting material. In Section S1, we give a detailed account of how our work relates to previous and contemporary work. In Section S2, we review preliminaries of functional analysis. In Section S3, we analyse the structure of the leading examples, providing bounds on the operator norm, the variance of the score, and the kurtosis. Finally, we provide proofs for each section.
2. OVERVIEW OF TARGET FUNCTIONALS, ORTHOGONAL REPRESENTATION, ESTIMATION, AND INFERENCE
2.1. Target functionals
We consider a random element W with distribution P taking values w in its support |$\mathcal {W}$|. Denote the |$L^q(P)$| norm of a measurable function |$f: \mathcal {W} \rightarrow \mathbb {R}$| and also the |$L^q(P)$| norm of random variable |$f(W)$| by |$\Vert f \Vert _{P,q} = \Vert f(W)\Vert _{P,q}$|. For a differentiable map |$x \mapsto g(x)$|, from |$\mathbb {R}^d$| to |$\mathbb {R}^k$|, we use |$\partial _{x^{\prime }} g$| to abbreviate the partial derivatives |$(\partial /\partial x^{\prime }) g(x)$|, and we use |$\partial _{x^{\prime }} g(x_0)$| to mean |$\partial _{x^{\prime }} g (x) \mid _{x = x_0}$|, etc. We use |$x^{\prime }$| to denote the transpose of a column vector x.
Our goal is to construct high-quality inference methods for real-valued linear functionals of |$\gamma _0^\star$|. To present examples below we need to endow |$\gamma _0^\star$| with a causal interpretation, which requires us to assume that it is a structural function, invariant to the changes in the distribution of X under policies described below. This property is not guaranteed for an arbitrary regression problem. For the reader who is unfamiliar with these concepts, we note that a simple sufficient condition for invariance is as follows: given a stochastic process |$x \mapsto Y(x)$|, called potential outcomes or structural function, vector X is generated to follow distribution F independently of |$x \mapsto Y(x)$| and Y is generated as |$Y = Y(X)$|. In this case we have |$\gamma _0^\star (x) = {\mathrm{E}}Y(x)$| for any F. This condition is conventionally called exogeneity in econometrics and random assignment in statistics. The measurability requirement here is that |$(x, \omega )\mapsto Y(x,\omega )$| is a measurable map. We refer to Imbens and Rubin (2015), Hernan and Robins (2020), and Peters et al. (2017) for the relevant formalizations that enable causal interpretation.
Here and below, a weighting function is a measurable function |$x \mapsto \ell (x)$| such that |$\int \ell dF = 1$| and |$\int \ell ^2 dF \lt \infty$|. In this example, setting
|$\ell (x) = 1$| gives average treatment effect in the entire population,
|$\ell (x)= 1(d=1)/P(D=1)$| gives the average treatment effect for the treated population,
|$\ell (x) = 1( z \in N)/P(Z \in N)$| gives the average treatment effect conditional on the covariates Z being in the group or neighbourhood N,
and so on. We can model small neighbourhoods N as shrinking in volume with the sample size. The local weighting and kernel weighting discussed below are applicable to all key examples. Moreover, they can be combined with other weighting functions so that, for example, we can target inference on local average treatment effects for the treated.
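As a concrete illustration, the three weighting functions above can be computed on simulated data. This is a minimal sketch with a hypothetical data-generating process, purely to show the sample analogue of the normalization |$\int \ell \, dF = 1$|:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
d = rng.integers(0, 2, n)        # binary treatment D (illustrative)
z = rng.normal(size=n)           # covariate Z (illustrative)

# ATE over the entire population: constant weight
ell_ate = np.ones(n)

# ATT: ell(x) = 1(d=1)/P(D=1), with P(D=1) estimated by the treated share
ell_att = (d == 1) / d.mean()

# Localized effect: Z in a neighbourhood N = [-0.5, 0.5] (hypothetical N)
in_N = np.abs(z) <= 0.5
ell_loc = in_N / in_N.mean()
```

Each weight averages to one over the sample, the empirical counterpart of |$\int \ell \, dF = 1$|.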
In this example, consider the case when |$X=(D,Z)$| consists of continuous treatment variable D and covariates Z. Further suppose |$\ell (x)=\ell (d)$| and |$t(x)=1$|. Then the parameter of interest is |$\theta _{0}^{\ast }=E[\ell (D)T(D)],$| where |$T(d)=E[\partial _d\gamma _{0}^{\ast }(D,Z)|D=d].$| When |$Y=Y(D)$| for a potential outcome process |$Y(d)$| that is independent of treatment D conditional on the covariates Z and differentiable in |$d,$| it was shown by Altonji and Matzkin (2005) and Florens et al. (2008) that |$T(d)=E[\partial _d Y(D)|D=d],$| which is an average treatment effect on the treated. Thus |$\theta _{0}^{\ast }$| is a weighted average of the effect of treatment on the treated and would be equal to |$T(d)$| for the perfectly localized |$\ell (d)=1(D=d)/f_{D}(d),$| where |$f_{D}(d)$| is the pdf of |$D.$| Also for |$\ell (d)\equiv 1$|, Imbens and Newey (2009) showed that |$\theta _{0}^{\ast }=E[T(D)]=E[\partial _d Y(D)]$|, which is an average treatment effect. See also Rothenhäusler and Yu (2019).
In Example 2.4, we consider the case where the variable of differentiation is also the variable of localization. As explained above, this case corresponds to effects of continuous treatments, and it turns out to require extra care in Section S3. The other possible case is where the variable of differentiation is different from the variable of localization. Such a case turns out to be simpler and is handled by similar arguments as Examples 2.1, 2.2, and 2.3 in Section S3.
All of these statistical parameters play an important role in causal inference, counterfactual decompositions, and predictive analyses. Introduction of the weighting function |$\ell (X)$| allows us to study subgroup effects and local effects, and these will be covered by our nonasymptotic results and asymptotic results. All of the above examples can be viewed as real-valued linear functionals of the regression function.
The linear operator |$\gamma \mapsto \theta (\gamma )$| has the following generating function m in these examples:
2.1 |$m(w, \gamma ) = (\gamma (1,z) - \gamma (0,z)) \ell (x)$|;
2.2 |$m (w, \gamma ) = m(\gamma ) = \int \gamma (x) \ell (x) d G(x); \ \ G(x) = F_1(x) - F_0(x)$|;
2.3 |$m(w, \gamma ) = \ell (x)(\gamma (T(x)) - \gamma (x))$|;
2.4 |$m(w, \gamma ) = \ell (x) t(x)^{\prime }\partial _d \gamma (x)$|.
In these examples, we can recognize the dependency on the weighting function by writing |$m(w, \gamma ; \ell ).$| In Examples 2.1, 2.3, and 2.4 we can decompose |$m(w, \gamma ;\ell ) = m_0(w, \gamma )\ell (x).$|
Estimation of some parameters of the form in Definition 2.1 is very straightforward, such as |${\mathrm{E}}[w(X)\gamma _0(X)]$| for a known function |$w(x)$|. These can be estimated as the sample mean of |$w(X)Y$|. Such simple estimation is not possible for the causal, counterfactual parameters in Examples 2.1, 2.2, 2.3, and 2.4. The approach of this paper provides estimators for these counterfactual parameters and can be used for many others.
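The simple case just mentioned can be sketched in a few lines: by iterated expectations, |${\mathrm{E}}[w(X)\gamma _0(X)] = {\mathrm{E}}[w(X)Y]$|, so the sample mean of |$w(X)Y$| suffices. The weight function and data-generating process below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
y = x**2 + rng.normal(size=n)   # so gamma_0(x) = E[Y|X=x] = x^2

def w(x):
    return np.cos(x)            # a known weight function (illustrative)

# E[w(X) gamma_0(X)] = E[w(X) Y] by iterated expectations,
# so the sample mean of w(X)Y is an unbiased estimator
theta_hat = np.mean(w(x) * y)
# (for X ~ N(0,1), the true value E[cos(X) X^2] happens to be 0)
```

No such shortcut exists for the counterfactual parameters in Examples 2.1 through 2.4, which is what motivates the machinery of the next subsections.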
2.2. Building an orthogonal representation of the target functional
Equation (2.1) can be thought of as a direct formulation of the target parameter. Next we introduce a dual formulation and finally an orthogonal formulation. Towards this end, we define the RR |$\alpha _0$|.
To motivate the upcoming orthogonal representation, we note that either the direct (2.1) or the dual (2.5) identification strategy can be used for direct plug-in estimation, but this does not give good estimators, as explained in the following technical remark.
Therefore we proceed to construct another representation for |$\theta ^\star _0$| that has the required Neyman orthogonality structure.
The Neyman orthogonality property states that the representation of the target parameter |$\theta _0$| in terms of the nuisance parameters |$(\alpha , \gamma )$| is invariant to local perturbations of the values of the nuisance parameters. This property makes the orthogonal representation an excellent basis for constructing high-quality point and interval estimators of |$\theta ^\star _0$| in modern high-dimensional settings, where we plug in biased estimators in lieu of |$\gamma _0^\star$| and |$\alpha _0^\star$|, the bias arising from regularization (see, e.g., Chernozhukov et al., 2016, 2018a).
Both |$\gamma _0^{*}$| and |$\alpha _0^{*}$| are identified, |$\gamma _0^{*}$| as |$E[Y|X]$| and |$\alpha _0^{*}$| by virtue of the consistent estimator we give in Section 2.5. Identification allows us to use the orthogonal representation to estimate target parameters.
2.3. The case of finite-dimensional linear regression
2.1 |$M = {\mathrm{E}}(b(1,Z) - b(0,Z))\ell (X)$|;
2.2 |$M= \int b \ell (d F_1- dF_0)$|;
2.3 |$M= {\mathrm{E}}(b(T(X)) - b(X) )\ell (X)$|;
2.4 |$M = {\mathrm{E}}\partial _d b(D,Z) t(X) \ell (X)$|.
2.4. The case of infinite-dimensional regression
In the infinite-dimensional case, we can employ the Riesz–Fréchet representation theorem and the Hahn–Banach extension theorem to establish existence of the linear RR.
(i) If |$L\lt \infty$|, there exists a unique minimal representer |$\alpha _0^\star \in \bar{\Gamma }$| and |$L= \Vert \alpha ^{\star }_0\Vert _{P,2}$|. (ii) If there exists a linear representer |$\alpha _0$| on |$\Gamma$| with |$\Vert \alpha _0\Vert _{P,2} \lt \infty$|, then |$L = \Vert \alpha ^{\star }_0\Vert _{P,2} \le \Vert \alpha _0\Vert _{P,2} \lt \infty$|, where |$\alpha ^{\star }_0$|, obtained by projecting |$\alpha _0$| onto |$\bar{\Gamma }$|, is the unique minimal representer. In both cases |$\gamma \mapsto \theta (\gamma )$| can be extended to |$\bar{\Gamma }$| or to the entire |$L^2(F)$| with the modulus of continuity L.
The first part of the lemma shows (implicit) existence of a linear representer when |$L \lt \infty$|. Our estimation results will rely only on the existence of minimal representers. In some cases, however, we may utilize the closed-form solutions for linear representers (see, e.g., Section S3 for the key examples), to improve the basis functions for estimating the minimal representers. There is also an efficiency reason to work with minimal representers rather than any linear representer, as highlighted in Section 4 analysing semiparametric efficiency.
2.5. Informal preview of estimation and inference results
Our estimation and inference will exploit empirical analogs of both the orthogonal representation of the parameter (2.6) and the equation defining the RR property (2.4).
To approximate the regression function and the RR, we consider the p-vector of dictionary functions b, where the dimension p of the dictionary can be large, potentially much larger than n. We approximate |$\alpha ^\star _0$| by a linear form |$b^{\prime }\rho _0$|, and we approximate |$\gamma ^\star _0$| by a linear form |$b^{\prime }\beta _0$|, and estimate the parameters using the algorithms below.
(1) Let |$(W_i)_{i=1}^n= (Y_i, X_i)_{i=1}^n$| denote i.i.d. copies of data vector W. We use cross-fitting to avoid biases from overfitting that can arise in high-dimensional settings. To this end, let |$(I_{1},\ldots , I_K)$| be a partition of the observation index set |$\lbrace 1,\ldots ,n\rbrace$| into K distinct subsets of about equal size. Let |$\mathbb {E}_A f = \mathbb {E}_A f(W)$| denote the empirical average of |$f(W)$| over |$i \in A \subset \lbrace 1,\ldots , n\rbrace$|: |$\mathbb {E}_{A} f := \mathbb {E}_{A} f(W) = | A | ^{-1} \sum _{i \in A} f(W_i).$|
(2) For each block |$k=1,\ldots ,K$|, we obtain generalized Dantzig selector (GDS) estimates |$\hat{\alpha }_k= b^{\prime }\hat{\rho }_k$| and |$\hat{\gamma }_k= b^{\prime }\hat{\beta }_k$|, where
(2.10)$$\begin{equation*} \begin{array}{l}\hat{\rho }_k = \arg \min _{\rho \in \mathbb {R}^p} \Vert \rho \Vert _1 : \Vert \hat{D}^{-1}\left\lbrace \mathbb {E}_{I^c_k} m(W, b) - \mathbb {E}_{I^c_k} b(X) b(X)^{\prime }\rho \right\rbrace \Vert _\infty \le \lambda _\rho ,\\ \hat{\beta }_k = \arg \min _{\beta \in \mathbb {R}^p} \Vert \beta \Vert _1 : \Vert \hat{D}^{-1}\left\lbrace \mathbb {E}_{I^c_k} (Y- b(X)^{\prime }\beta ) b(X)\right\rbrace \Vert _\infty \le \lambda _\beta . \end{array} \end{equation*}$$
Here |$I^c_k = \lbrace 1,\ldots ,n\rbrace \setminus I_k$| is the set of observation indices leaving |$I_k$| out, the |$\lambda$|’s are tuning parameters, and |$\hat{D}$| is a scaling detailed in Section S5. Typically the |$\lambda$|’s scale like |$\sqrt{ \log (p \vee n)/n}$|; Section 3 provides concrete choices.
(3) The DML estimator is an average of estimated orthogonal representations over k:
(2.11)$$\begin{equation*} \hat{\theta }=\frac{1}{n}\sum _{k=1}^{K}\sum _{i\in I_{k}}\lbrace m(W_{i} ,\hat{\gamma }_{k})+\hat{\alpha }_{k}(X_{i})[Y_{i}-\hat{\gamma }_{k}(X_{i})]\rbrace . \end{equation*}$$
The estimator of its asymptotic variance is
(2.12)$$\begin{equation*} \hat{\sigma }^2=\frac{1}{n}\sum _{k=1}^{K}\sum _{i\in I_{k}}\lbrace m(W_{i} ,\hat{\gamma }_{k})+\hat{\alpha }_{k}(X_{i})[Y_{i}-\hat{\gamma }_{k}(X_{i})]-\hat{\theta }\rbrace ^2. \end{equation*}$$
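The cross-fitting of step 1 and the averaging of step 3 can be sketched compactly. This is a minimal illustration, not the paper's implementation: `fit_nuisances` stands in for whatever learners produce |$\hat{\gamma }_k$| and |$\hat{\alpha }_k$| on the complement fold, and its name is hypothetical.

```python
import numpy as np

def make_folds(n, K, seed=0):
    """Randomly partition {0,...,n-1} into K folds of roughly equal size."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, K)

def dml(Y, X, folds, fit_nuisances, m):
    """Cross-fitted DML point estimate (2.11) and variance estimate (2.12).
    fit_nuisances(train_idx) returns (gamma_hat, alpha_hat), functions of x
    fitted on the leave-fold-out sample; m(w, gamma) is the moment function."""
    n = len(Y)
    psi = np.empty(n)
    for I_k in folds:
        mask = np.ones(n, dtype=bool)
        mask[I_k] = False                         # I_k^c: leave fold k out
        gamma_k, alpha_k = fit_nuisances(np.where(mask)[0])
        for i in I_k:                             # evaluate on held-out fold
            psi[i] = m((Y[i], X[i]), gamma_k) \
                     + alpha_k(X[i]) * (Y[i] - gamma_k(X[i]))
    theta_hat = psi.mean()                        # (2.11)
    sigma2_hat = np.mean((psi - theta_hat) ** 2)  # (2.12)
    return theta_hat, sigma2_hat
```

With the identity moment |$m(w,\gamma )=\gamma (x)$| and |$\alpha \equiv 1$|, the score collapses to |$Y_i$| and |$\hat{\theta }$| is the sample mean, a quick sanity check of the wiring.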
We remark that the RR estimator in step 2 (2.10) is of Dantzig selector type, but is not exactly the Dantzig selector, requiring some new analysis. We use the GDS rather than series or spline estimation to accommodate high dimensional specifications for the regression and RR.
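Because the |$\ell _1$| objective and sup-norm constraint in (2.10) are both piecewise linear, the Dantzig-type program is a linear program. The sketch below takes the scaling |$\hat{D}$| to be the identity for simplicity and uses the standard split |$\rho = \rho ^{+} - \rho ^{-}$|; it is an illustration of the optimization, not the tuned procedure of Section S5.

```python
import numpy as np
from scipy.optimize import linprog

def gds(M_hat, G_hat, lam):
    """Generalized Dantzig selector as a linear program (sketch):
    minimize ||rho||_1 subject to ||M_hat - G_hat @ rho||_inf <= lam,
    where M_hat estimates E m(W, b) and G_hat estimates E b(X)b(X)'.
    (The scaling D-hat is taken to be the identity here.)"""
    p = len(M_hat)
    c = np.ones(2 * p)                            # minimize sum(rho+ + rho-)
    A = np.vstack([np.hstack([ G_hat, -G_hat]),   #  G rho <= M_hat + lam
                   np.hstack([-G_hat,  G_hat])])  # -G rho <= -M_hat + lam
    b = np.concatenate([M_hat + lam, -M_hat + lam])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]                  # rho = rho+ - rho-
```

For a diagonal |$\hat{G}$| the solution reduces to componentwise soft-thresholding of |$\hat{M}$| at level |$\lambda$|, a useful check of the formulation.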
The dictionary |$b(x)$| is very important for the GDS estimator. It should be chosen so that linear combinations of |$b(x)$| can approximate in mean square any element of |$\Gamma$|. For example, if |$\Gamma$| is the set of linear combinations of an infinite sequence of regressors, as in a high-dimensional regression, then |$b(x)$| could be chosen as the first p elements of that sequence. The dimension p can also be chosen flexibly, because p will be allowed to grow faster than the sample size, as specified in the asymptotic theory to follow. In practice, multiple choices of p could be tried.
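One simple concrete dictionary, given here only as an illustration and not as the specification used in the applications, collects an intercept, the raw covariates, their squares, and pairwise interactions:

```python
import numpy as np
from itertools import combinations

def dictionary(X):
    """An illustrative technical-regressor dictionary b(x): intercept, raw
    covariates, squares, and pairwise interactions. Richer dictionaries
    (higher polynomials, splines) fit the same interface, with p possibly
    much larger than n."""
    n, d = X.shape
    cols = [np.ones(n)]                                   # intercept
    cols += [X[:, j] for j in range(d)]                   # linear terms
    cols += [X[:, j] ** 2 for j in range(d)]              # squares
    cols += [X[:, j] * X[:, k]                            # interactions
             for j, k in combinations(range(d), 2)]
    return np.column_stack(cols)

B = dictionary(np.random.default_rng(2).normal(size=(100, 3)))
# with d = 3 covariates: p = 1 + 3 + 3 + 3 = 10 dictionary terms
```

The resulting matrix of |$b(X_i)^{\prime }$| rows is exactly what the empirical moments |$\mathbb {E}_{I^c_k} b(X) b(X)^{\prime }$| in (2.10) are built from.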
There are two cases to consider:
(1) Regular case: the parameters |$\sigma$|, |$\kappa /\sigma$|, and L are bounded, leading to |$1/\sqrt{n}$| concentration, adaptation, and Gaussian approximation.
(2) Nonregular case: the parameters |$\sigma$|, |$\kappa /\sigma$|, and L diverge, so that we need
$$\begin{equation*} \sigma /\sqrt{n} \rightarrow 0, \quad L/\sqrt{n} \rightarrow 0, \quad (\kappa /\sigma )/\sqrt{n} \rightarrow 0, \end{equation*}$$
for |$\sigma /\sqrt{n}$| concentration, adaptation, and Gaussian approximation.
We think it is remarkable that a single inference theory covers both regular and nonregular cases, and provides uniform validity over large classes of P.
3. APPLICATIONS
3.1. Global and local effects of 401(k) eligibility on net financial assets
First, we use our method to answer a question in household finance: what is the average treatment effect of 401(k) eligibility on net financial assets (over a horizon of about two years)? 401(k) is a retirement savings and investing plan that gives employees a tax break on money they contribute. We follow the identification strategy of Poterba et al. (1995), who assume selection on observables. The authors assume that when 401(k) was introduced, workers ignored whether a given job offered 401(k) and instead made employment decisions based on income and other observable job characteristics; after conditioning on income and job characteristics, 401(k) eligibility was exogenous at the time. This empirical question corresponds to Example 2.1.
We use data from the 1991 US Survey of Income and Program Participation (Chernozhukov et al., 2018b), using sample selection and variable construction as in Abadie (2003) and Chernozhukov and Hansen (2004). The outcome Y is net financial assets defined as the sum of individual retirement account (IRA) balances, 401(k) balances, checking accounts, US saving bonds, other interest-earning accounts, stocks, mutual funds, and other interest-earning assets minus nonmortgage debt. The treatment D is an indicator of eligibility to enroll in a 401(k) plan. The raw covariates X are age, income, years of education, family size, marital status, two-earner status, benefit pension status, IRA participation, and home-ownership. We impose common support of the propensity score for the treated and untreated groups based on these covariates, yielding |$n=9869$| observations. We consider the fully-interacted specification |$b(D,X)$| of Chernozhukov et al. (2018a) with |$p=277$| including polynomials of continuous covariates, interactions among all covariates, and interactions between covariates and treatment status.
Tables 1 and 2 summarize results for the entire population and for each quintile of the income distribution. We use |$K=5$| folds in cross-fitting. To estimate the RR, we use the GDS procedure introduced in the present work. To estimate the regression, we use GDS, Lasso, random forest, or neural network. GDS is implemented using the tuning procedure described in Section S5. Lasso is implemented using the tuning procedure described in Chernozhukov et al. (2018). Random forest and neural network are implemented with the same settings as Chernozhukov et al. (2018a), i.e., with 1000 trees or a single hidden layer of eight neurons, respectively. We find average treatment effect (ATE) of 7608 (1395) using GDS for both the RR and the regression. This ATE estimate is stable across different choices of regression estimator. We find that localized ATE is not statistically significant for the second quintile, and it is statistically significant, positive, and strongly heterogeneous for the other quintiles. Interpreting the relatively high effect of 401(k) eligibility for the first quintile is a question for future research.
Table 1. ATE of 401(k) eligibility on net financial assets, with the regression estimated by GDS or Lasso (standard errors in parentheses).

| Income quintile | N treated | N untreated | GDS | Lasso |
| --- | --- | --- | --- | --- |
| All | 3682 | 6187 | 7607.95 (1394.92) | 7733.31 (1416.46) |
| 1 | 272 | 1702 | 4500.33 (924.12) | 4477.43 (920.31) |
| 2 | 527 | 1447 | 1051.60 (1501.03) | 1119.06 (1500.78) |
| 3 | 755 | 1219 | 5204.93 (1199.87) | 4919.65 (1200.10) |
| 4 | 962 | 1012 | 9515.58 (2141.92) | 8837.39 (2150.58) |
| 5 | 1166 | 807 | 19354.00 (7934.70) | 14138.37 (8310.59) |
Table 2. ATE of 401(k) eligibility on net financial assets, with the regression estimated by random forest or neural network (standard errors in parentheses).

| Income quintile | N treated | N untreated | Random forest | Neural network |
| --- | --- | --- | --- | --- |
| All | 3682 | 6187 | 8638.15 (1621.78) | 7364.66 (1844.39) |
| 1 | 272 | 1702 | 4874.49 (937.86) | 4664.61 (1309.59) |
| 2 | 527 | 1447 | 1957.72 (1738.61) | 1635.69 (1603.19) |
| 3 | 755 | 1219 | 3973.11 (1474.72) | 5106.03 (1287.70) |
| 4 | 962 | 1012 | 10056.79 (2375.44) | 9529.03 (2205.61) |
| 5 | 1166 | 807 | 21168.13 (8015.79) | 20138.57 (7506.92) |
For comparison, Chernozhukov et al. (2018a) report ATE of 7170 (1398) by DML, which estimates the RR by estimating the propensity score and plugging it into the RR functional form. Though these two estimators are asymptotically equivalent under correct specification, our estimator avoids the estimated propensity score in the denominator which could cause numerical instability. The ATE results are broadly consistent with Poterba et al. (1995), who use a simpler specification motivated by economic reasoning. The localized ATE estimates by income quintile group appear to be new empirical results and are of interest in their own right. In Section S5 we report analogous estimates without debiasing. Without debiasing, the GDS and Lasso estimates of ATE are attenuated due to regularization. The bias is smaller for the random forest and neural network estimates of ATE.
3.2. Global and local price elasticity of petrol demand
Second, we use our method to estimate the average price elasticity of household petrol demand: the percentage change in demand due to a unit percentage change in price. This parameter is critical for assessing the welfare consequences of tax changes, and it has been studied in Hausman and Newey (1995), Schmalensee and Stoker (1999), Yatchew and No (2001), and Blundell et al. (2012). Formally, the parameter of interest is the average derivative of log demand with respect to log price holding income and demographic characteristics fixed. The exact version of this empirical question corresponds to Example 2.4. The approximate version of this empirical question corresponds to Example 2.3.
We use data from the 1994–1996 Canadian National Private Vehicle Use Survey (Semenova and Chernozhukov, 2021b), using sample selection and variable construction as in Yatchew and No (2001) and Belloni et al. (2019). The outcome Y is log petrol consumption. The variable D with respect to which we differentiate is log price per litre. The raw covariates X are log age, log income, and log distance as well as geographical, time, and household composition dummies. In total we have |$n=5001$| observations. We consider the specification |$b(D,X)$| previously considered by Semenova and Chernozhukov (2021a) augmented with additional interactions. The Semenova and Chernozhukov (2021a) specification includes polynomials of continuous covariates, and interactions of log price (and its square) with time and household composition dummies. We further include interactions of log price (and its square) with log age, log age squared, log income, and log income squared to allow for heterogeneity. Altogether, |$p=99$|.
Table 3 summarizes results for the entire population and for each quintile of the income distribution. We use |$K=5$| folds in cross-fitting. To estimate the RR, we use the GDS procedure introduced in the present work. To estimate the regression, we use GDS, Lasso, random forest, or neural network. Again, GDS is implemented using the tuning procedure described in Section S5, Lasso is implemented using the tuning procedure described in Chernozhukov et al. (2018), and random forest and neural network are implemented with the same settings as Chernozhukov et al. (2018a). We find average price elasticity of |$-0.28$| (0.06) using GDS for both the RR and the regression. Lasso gives similar results. Note that random forest is not differentiable, and the derivative of a neural network may be difficult to extract from a black-box implementation. When using these estimators, we implement a partial difference approximation of the derivative, detailed in Section S5. We conjecture that this approximation explains why the results using random forest appear attenuated and why the results using neural network appear positive or statistically insignificant. Using GDS, we find that localized average price elasticity is statistically significant and negative in each income quintile, with substantial heterogeneity.
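The partial-difference idea for non-differentiable learners can be sketched as a symmetric finite difference in the treatment, holding covariates fixed. This is an illustrative stand-in for the approximation detailed in Section S5; the step size `h` is a hypothetical choice.

```python
import numpy as np

def partial_difference(predict, D, Z, h=0.1):
    """Symmetric partial-difference approximation to the derivative of a
    fitted regression with respect to treatment D, holding covariates Z
    fixed: (gamma(d+h, z) - gamma(d-h, z)) / (2h). `predict` maps an
    (n, 1+dim(Z)) array of [d, z] rows to fitted values."""
    up = predict(np.column_stack([D + h, Z]))
    down = predict(np.column_stack([D - h, Z]))
    return (up - down) / (2 * h)
```

For a fitted function that is quadratic in d, the symmetric difference recovers the derivative exactly; for step-function learners such as random forests it only approximates it, consistent with the attenuation conjectured above.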
Table 3. Average price elasticity of petrol demand, with the regression estimated by GDS, Lasso, random forest, or neural network (standard errors in parentheses).

| Income quintile | N | GDS | Lasso | Random forest | Neural network |
| --- | --- | --- | --- | --- | --- |
| All | 5001 | -0.28 (0.06) | -0.16 (0.05) | -0.01 (0.06) | 0.15 (0.05) |
| 1 | 1001 | -0.84 (0.13) | -0.44 (0.12) | -0.37 (0.14) | 0.06 (0.12) |
| 2 | 1000 | -0.36 (0.12) | -0.27 (0.11) | -0.13 (0.13) | 0.42 (0.12) |
| 3 | 1000 | -1.40 (0.15) | -0.91 (0.13) | -0.60 (0.13) | -0.28 (0.13) |
| 4 | 1000 | -1.06 (0.14) | -0.79 (0.14) | -0.32 (0.15) | 0.13 (0.14) |
| 5 | 1000 | -0.11 (0.14) | -0.03 (0.11) | 0.16 (0.12) | 0.58 (0.10) |
For comparison, OLS regression of log consumption on log price, log age, log income, and log distance, as well as geographical, time, and household composition dummies, yields an estimate of 0.14 (0.06). The linear specification thus produces a positive elasticity estimate, contradicting economic intuition, since it implies that petrol consumption rises when prices are higher. Our localized average price elasticity results using GDS are broadly consistent with Semenova and Chernozhukov (2021a), who more explicitly consider the relationship between average price elasticity and income. In Section S5 we report analogous estimates without debiasing. Without debiasing, the GDS and Lasso estimates of quintile elasticities are attenuated due to regularization. The bias is smaller for the random forest and neural network estimates of quintile elasticities.
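The partial difference approximation used for non-differentiable learners can be sketched generically as a central difference. This is a minimal illustration, not the paper's implementation (which is detailed in Section S5); the function name and the step size `h` are ours:

```python
import numpy as np

def partial_difference(gamma_hat, X, d, h=0.1):
    """Central partial-difference approximation to the derivative of a
    black-box regression learner gamma_hat with respect to coordinate d.

    gamma_hat : callable mapping an (n, p) array to an (n,) array of fits
    X         : (n, p) array of covariates
    d         : index of the coordinate of differentiation (e.g., log price)
    h         : half-width of the difference step (an assumption here)
    """
    X_up, X_dn = X.copy(), X.copy()
    X_up[:, d] += h
    X_dn[:, d] -= h
    return (gamma_hat(X_up) - gamma_hat(X_dn)) / (2.0 * h)

# Sanity check on a known function: the derivative of x0**2 is 2*x0,
# and the central difference is exact for quadratics.
X = np.array([[1.0, 3.0], [2.0, 5.0]])
approx = partial_difference(lambda Z: Z[:, 0] ** 2, X, d=0, h=0.1)  # [2.0, 4.0]
```

For a smooth fit, the approximation error is |$O(h^2)$|; for a piecewise-constant fit such as a random forest, the difference can be zero or erratic, which is consistent with the attenuation conjectured above.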
4. ESTIMATION AND INFERENCE FOR HIGH DIMENSIONAL APPROXIMATELY LINEAR MODELS
4.1. Best linear approximations for the regression function and the Riesz representer
Here we define |$\gamma ^\star _0$| as a projection of Y onto |$\bar{\Gamma }$|, i.e., |$\gamma ^\star _0$| is the projection of Y on the infinite set of variables |$\lbrace \tilde{b}_j(X)\rbrace _{j=1}^{\infty }$|. This setup is slightly more general than in the introduction, where |$\gamma ^\star _0$| was the conditional expectation function. Of course, if the latter is an element of |$\bar{\Gamma }$|, it automatically coincides with |$\gamma ^\star _0$|.
The second claim of the lemma is immediate from the definition of |$(\beta _0, \rho _0)$| and the first follows from elementary calculations. The orthogonality property above says that the score function is invariant to small perturbations of the nuisance parameters |$\rho$| and |$\beta$| around their ‘true values’ |$\rho _0$| and |$\beta _0$|. This invariance property plays a crucial role in removing the impact of biased estimation of nuisance parameters |$\rho _0$| and |$\beta _0$| on the estimation of the main parameters |$\theta _0$|.
4.2. Estimators
Estimation is carried out using the following Dantzig selector-type estimators (Candes and Tao, 2007). In follow-up work, Chernozhukov et al. (2018) consider Lasso-type estimators.
We allow for convex restrictions on the parameter space by requiring t to lie in a convex set T. If the restrictions are correct, they can improve the theoretical guarantees by weakening the requirements on G and other primitives.
In this setting, our estimator specializes to the original Dantzig selector. In practice, we use |$T_\beta = \mathbb {R}^p$|, although when we are interested in average derivative functionals, it is theoretically helpful to impose convex restrictions of the form |$T = \lbrace t \in \mathbb {R}^p: \sup _{x \in \mathcal {X}} | \partial _d b(x)^{\prime }t | \le B\rbrace$|, where B is an a priori known upper bound on the derivative. Ideally, |$D_{\beta }$| is chosen such that |$\mathrm{diag}( Var( D^{-1}_{\beta } (\hat{G} \beta _0 - \hat{M}) ) ) = I$|. Our practical algorithm given in Section S5 estimates |$D_\beta$| from the data.
In this setting, our estimator is a generalization of the original Dantzig selector. In practice, we use |$T_\rho =\mathbb {R}^p$|, even though it is possible to exploit structured restrictions on the problem motivated by the nature of the universal RRs. Ideally, |$D_{\rho }$| is chosen such that |$\mathrm{diag}( Var( D^{-1}_{\rho } (\hat{G} \rho _0 - \hat{M}) ) ) = I$|. Our practical algorithm given in Section S5 estimates |$D_\rho$| from the data.
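A Dantzig-type program of the generic form |$\min \Vert t\Vert _1$| subject to |$\Vert D^{-1}(\hat{G} t - \hat{M})\Vert _\infty \le \lambda$| can be solved as a linear program by splitting t into positive and negative parts. The sketch below assumes this generic form; the function name and toy inputs are ours, and the paper's data-driven choices of |$\lambda$| and D are in Section S5:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(G_hat, M_hat, lam, D=None):
    """Solve  min ||t||_1  s.t.  ||D^{-1}(G_hat t - M_hat)||_inf <= lam
    as a linear program, writing t = t_plus - t_minus with both parts >= 0."""
    p = G_hat.shape[1]
    if D is None:
        D = np.eye(p)
    A = np.linalg.solve(D, G_hat)   # D^{-1} G_hat
    b = np.linalg.solve(D, M_hat)   # D^{-1} M_hat
    # Objective: sum(t_plus) + sum(t_minus) equals ||t||_1 at the optimum.
    c = np.ones(2 * p)
    # Constraints: -lam <= A (t_plus - t_minus) - b <= lam, stacked as A_ub x <= b_ub.
    A_ub = np.vstack([np.hstack([A, -A]), np.hstack([-A, A])])
    b_ub = np.concatenate([b + lam, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    assert res.success
    return res.x[:p] - res.x[p:]
```

With an orthonormal design (|$\hat{G} = I$|), the program reduces to coordinatewise soft-thresholding of |$\hat{M}$| at level |$\lambda$|, which is a useful check on any implementation.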
We now define the DML estimator with RRs, which makes use of cross-fitting.
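A generic sketch of such a cross-fitted estimator of |$\theta _0 = {\mathrm{E}}[m(W, \gamma _0)]$|, debiased by the RR, is given below. All function names are ours, and the learners are supplied as black boxes; the paper's GDS-based instantiation and tuning are in Section S5:

```python
import numpy as np

def dml_cross_fit(W, fit_gamma, fit_alpha, m, K=5, seed=0):
    """Cross-fitted DML point estimate and standard error for
    theta_0 = E[m(W, gamma_0)], debiased by the Riesz representer alpha_0.

    W         : dict of arrays with keys 'X' (n, p) and 'Y' (n,)
    fit_gamma : trains a regression learner on a subsample; returns a callable
    fit_alpha : trains a Riesz-representer learner; returns a callable
    m         : m(x, gamma) evaluates the linear functional at each row of x
    """
    X, Y = W["X"], W["Y"]
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    scores = np.empty(n)
    for idx in folds:
        train = np.setdiff1d(np.arange(n), idx)
        gamma = fit_gamma(X[train], Y[train])   # nuisances fit off-fold
        alpha = fit_alpha(X[train])
        # Neyman-orthogonal score: plug-in term plus RR correction term.
        scores[idx] = m(X[idx], gamma) + alpha(X[idx]) * (Y[idx] - gamma(X[idx]))
    theta = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(n)
    return theta, se
```

The orthogonality of the score is what makes the point estimate first-order insensitive to the regularization bias of `fit_gamma` and `fit_alpha`, and cross-fitting removes the own-observation bias of using the same data for the nuisances and the average.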
4.3. Properties of DML: Main result
We provide a single nonasymptotic result that allows us to cover both global and local functionals, implying uniformly valid rates of concentration and normal approximations over large sets of P.
Consider P that satisfies the following conditions.
- R|$(\delta )$| With probability |$1-\varepsilon$|, the estimation errors |$\lbrace (\hat{\beta }_k- \beta _0, \hat{\rho }_k- \rho _0)\rbrace _{k=1}^{K}$| take values in |$\mathsf {S}^K$|, with the quality of the guarantee obeying$$\begin{equation*} \sigma ^{-1} ( \sqrt{m}\sigma r_3 + \mu r_1 (1+ \sigma ) + \mu \sigma r_2 ) \le \delta . \end{equation*}$$
|$R(\delta )$| is a requirement on how the sequences |$(r_1,r_2,r_3)$| evolve relative to |$(\sigma ,\mu ,m)$|. We will formally verify |$R(\delta )$| for the approximately sparse setting, in Corollary 4.4 below. |$R(\delta )$| is the key condition for our main result, Theorem 4.1.
The conclusions of this result are distinguished from those of Chernozhukov et al. (2018a) and Chernozhukov et al. (2018) in applying to local, nonparametric objects, in providing finite sample bounds, and in being uniform over the parameter space. The conclusions are similar to this previous work in relying on a rate condition that is the product of rates of estimation for two distinct functions, here the regression and the RR.
The constants can be chosen to yield an asymptotic result.
Hence the DML estimator of linear functionals of the BLP function |$\gamma _0$| enjoys good properties under the stated regularity conditions. This result does not distinguish inference on global functionals from inference on local functionals, as long as the latter are not perfectly localized. We state a separate result for perfectly localized functionals below.
The approximation bias for the ultimate target can be plausibly small because many rich function classes admit regularized linear approximations with respect to conventional dictionaries b. For instance, Tsybakov (2012) and Belloni et al. (2014) show that Fourier dictionaries yield small approximation bias over Sobolev balls and rearranged Sobolev balls, respectively.
4.4. Semiparametric efficiency
Below we use concepts from semiparametric efficiency, as presented in Bickel et al. (1993) and Van der Vaart (2000); we do not recall them here for brevity.
The DML estimator |$\hat{\theta }$| will be asymptotically efficient for estimating |$\theta ^\star _0$|, defined as a functional of |$\gamma ^\star _{0}$|, the projection of Y on |$\bar{\Gamma }$|. The distribution of a data observation is unrestricted in this case, so that there will only be one influence function for each functional of interest, and the estimator is asymptotically linear with that influence function. The standard semiparametric efficiency results then imply that our estimator will have the smallest asymptotic concentration among estimators that are locally regular; see Bickel et al. (1993) and Van der Vaart (2000).
Our formal result stated below only implies efficiency for the regular case, where the operator norm of the function L is bounded, holding P fixed. We expect that a similar result continues to hold with |$L \rightarrow \infty$|, by developing an appropriate formalization that handles P changing with n and rules out super-efficiency phenomena. However, this formalization requires a separate major development, which we leave to future research. In what follows, the notation |$\gamma ^\star _{0,P}$| emphasizes the dependence of the projection |$\gamma ^\star _0$| on P.
4.5. Properties of GDS estimators
Our goal is to verify that the guarantee |$R(\delta )$| holds. In particular we have to analyse |$(r_1,r_2,r_3)$| by bounding the population prediction norm |$v \mapsto \sqrt{v^{\prime } G v}.$| This is a more nuanced problem than bounding the empirical prediction norm |$v \mapsto \sqrt{v^{\prime } \hat{G} v}$|, which has been accomplished in a variety of prior analyses done on Dantzig-type and Lasso-type estimators.
We begin with the following condition, which controls only the maximum of the error rates and the |$\ell _1$| norm of the true parameters:
- We have that |$t_0 \in T$| and |$\Vert t_0\Vert _1 \le B$|, where |$B \ge 1$|, and the empirical moments obey the following bounds with probability at least |$1 - \varepsilon$|, for |$\bar{\lambda }\ge \lambda$|$$\begin{equation*} \Vert \hat{G} - G\Vert _\infty \le \bar{\lambda }, \ \Vert \hat{G} t_0 - \hat{M}\Vert _{\infty }\le \lambda . \end{equation*}$$
The bounds on the |$\ell _1$| norm of coefficients are naturally motivated, for example, by working in Sobolev or rearranged Sobolev spaces (see Tsybakov, 2012, and Belloni et al., 2014, respectively). Rearranged Sobolev spaces allow the first p regression coefficients in the series expansion to be arbitrarily rearranged, permitting a much greater degree of oscillatory behaviour than the original Sobolev spaces. The complexity of these function classes is also different: Sobolev spaces are Donsker sets under sufficient smoothness, whereas rearranged Sobolev spaces have covering entropy bounded below by |$\log p$| and are not Donsker if |$p \rightarrow \infty$|.
The effective dimension is defined in terms of the population (rather than sample) covariance matrix G, which makes regularity conditions easy to verify. Note that if |$G=I$| and |$\Vert t_0\Vert _0 =s$|, then |$s(t_0) \le s.$| More generally, |$s(t_0)$| measures the effective difficulty of estimating |$t_0$| in the prediction norm, created by the design G and the structure of |$t_0$|. The definition imposes no restrictions on the restricted or sparse eigenvalues of G. For example, take |$G = 11^{\prime }$|, a rank-one matrix, and suppose |$\Vert t_0\Vert _0 = 1$|. Then |$s(t_0) \le 1$|, giving useful and intuitive performance bounds, while the standard restricted eigenvalues and cone invertibility factors are all zero in this case, yielding no bounds on the performance in the population prediction norm. This type of example illustrates that b can accommodate overcomplete (multiple or amalgamated) dictionaries, whose use in conjunction with |$\ell _1$|-penalization has been advocated by Donoho et al. (2005). Of course, bounds on the effective dimension follow from bounds on cone invertibility factors and restricted eigenvalues.
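The rank-one example is easy to check numerically. A minimal sketch (variable names ours): any sign-balanced direction over the support has zero population prediction norm, which is why restricted-eigenvalue-type constants vanish, while the 1-sparse target itself retains a nontrivial prediction norm.

```python
import numpy as np

p = 5
G = np.ones((p, p))             # G = 1 1', a rank-one Gram matrix
t0 = np.zeros(p); t0[0] = 1.0   # a 1-sparse target

# A direction with entries summing to zero lies in the usual cones around
# the support of t0 yet has zero population prediction norm: v' G v = (1'v)^2.
v = np.zeros(p); v[0], v[1] = 1.0, -1.0
pred_norm_v = np.sqrt(v @ G @ v)     # 0.0 -> restricted eigenvalue is zero

# The target itself still has a nontrivial prediction norm.
pred_norm_t0 = np.sqrt(t0 @ G @ t0)  # 1.0
```

Restricted eigenvalues minimize |$v^{\prime }Gv/\Vert v\Vert ^2$| over such cone directions and hence equal zero here, yet the effective dimension bound |$s(t_0) \le 1$| still delivers a rate in the population prediction norm.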
Given a vector |$\delta \in \mathbb {R}^p$|, let |$\delta _A$| denote a vector with the j-th component set to |$\delta _j$| if |$j \in A$| and 0 if |$j \not\in A$|.
The cone invertibility factor is a generalization of the restricted eigenvalue condition of Bickel et al. (2009), proposed by Ye and Zhang (2010).
Since approximate sparsity is a simple condition that implies a bound on effective dimension, we pause and interpret approximate sparsity in the context of a motivating example from causal inference. In particular, we revisit ATE (Example 2.1). For simplicity, consider the global parameter and assume that the function |${\mathrm{E}}[Y|D,Z]$| is an element of |$\bar{\Gamma }$|, so that |$\gamma _0^{\star }(D,Z)={\mathrm{E}}[Y|D,Z]$| and |$\alpha _0^{\star }(D,Z)=D/\pi _0^{\star }(Z)-(1-D)/(1-\pi _0^{\star }(Z))$| where |$\pi _0^{\star }(Z)={\mathrm{E}}[D|Z]$| is the propensity score. Consider the dictionary |$b(d,z)=(dq(z)^{\prime },(1-d)q(z)^{\prime })^{\prime }$| where |$\lbrace q_j(z)\rbrace _{j=1}^{p/2}$| are the initial |$p/2$| elements of a sequence of basis functions that approximates the functions |${\mathrm{E}}[Y|1,Z]$|, |${\mathrm{E}}[Y|0,Z]$|, |$1/\pi _0^{\star }(Z)$|, and |$1/(1-\pi _0^{\star }(Z))$|.
Suppose the minimal |$\ell _1$|-norm mean square projections of |${\mathrm{E}}[Y|1,Z]$| and |${\mathrm{E}}[Y|0,Z]$| onto |$\lbrace q_j(z)\rbrace _{j=1}^{p/2}$| are approximately sparse after rescaling appropriately by |$D^{-1}_{\beta }$|. (Note that if |${\mathrm{E}}[Y|1,Z]$| and |${\mathrm{E}}[Y|0,Z]$| are already approximately sparse then so are their projections.) It follows that the minimal |$\ell _1$|-norm mean square projection of |$\gamma _0^{\star }$| is approximately sparse and |$s_{\beta }:=s(D^{-1}_\beta \beta _0;\nu )$| is small.
Suppose instead that the minimal |$\ell _1$|-norm mean square projections of |$1/\pi _0^{\star }(Z)$| and |$1/(1-\pi _0^{\star }(Z))$| onto |$\lbrace q_j(z)\rbrace _{j=1}^{p/2}$| are approximately sparse after rescaling appropriately by |$D^{-1}_{\rho }$|. (Note that if |$1/\pi _0^{\star }(Z)$| and |$1/(1-\pi _0^{\star }(Z))$| are already approximately sparse then so are their projections.) It follows that the minimal |$\ell _1$|-norm mean square projection of |$\alpha _0^{\star }$| is approximately sparse and |$s_{\rho }:=s(D^{-1}_\rho \rho _0;\nu )$| is small.
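As an illustration of this ATE example, the interacted dictionary |$b(d,z)=(dq(z)^{\prime },(1-d)q(z)^{\prime })^{\prime }$| and the true RR |$\alpha _0^{\star }$| can be constructed directly. This is a sketch under the definitions above, not the paper's implementation; function names are ours:

```python
import numpy as np

def ate_dictionary(D, Q):
    """Interacted dictionary b(d, z) = (d q(z)', (1-d) q(z)')'.
    The first block of columns approximates E[Y|1,Z]-type objects and
    1/pi-type objects; the second block their D = 0 counterparts.

    D : (n,) binary treatment indicator
    Q : (n, p/2) matrix of basis functions q_j(Z)
    """
    D = D.reshape(-1, 1)
    return np.hstack([D * Q, (1.0 - D) * Q])

def ate_riesz_representer(D, pi):
    """True RR alpha_0(D, Z) = D/pi(Z) - (1-D)/(1-pi(Z)) for the ATE,
    given the propensity score pi(Z) evaluated at the data."""
    return D / pi - (1.0 - D) / (1.0 - pi)
```

Approximate sparsity of the projections of |${\mathrm{E}}[Y|d,Z]$| (for the regression) or of |$1/\pi _0^{\star }$| and |$1/(1-\pi _0^{\star })$| (for the RR) onto the columns of `Q` then translates, block by block, into approximate sparsity of |$\beta _0$| or |$\rho _0$| in this dictionary.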
The concept of the effective dimension does not split |$t_0$| into a sparse component and a small dense component, as is done in the now standard analysis of |$\ell _1$|-regularized estimators of approximately sparse |$t_0$|. The effective dimension is simply stated in terms of |$t_0$| alone.
The bound is the minimum of what are called the 'fast rate' bound and the 'slow rate' bound. This result tightens that of Chatterjee (2013), who established a 'slow rate' bound (in the context of Lasso) that applies under no assumptions on G. If the effective dimension is not too big, as in the examples above, the 'fast rate' |$s(t_0;\nu ) \nu ^2$| provides a tighter bound under weak assumptions on G. It is important to emphasize that the result is stated in terms of the population prediction norm rather than the empirical norm.
The following is a sufficient condition that will deliver the guarantee |$R(\delta )$| for |$\delta \rightarrow 0$|. Let |$\tilde{\ell }$| denote a positive constant (that increases to |$\infty$| as |$n \rightarrow \infty$| in the asymptotic results).
(a) The |$\ell _1$| norms of coefficients are bounded as |$\Vert D^{-1}_\rho \rho _0 \Vert _1 \le B$| and |$\Vert D^{-1}_\beta \beta _0\Vert _1 \le B$|, for |$B \ge 1$|, and the scaling matrices obey |$\Vert D_{\rho } v\Vert \le \mu _D \sigma \Vert v\Vert$| for |$D^{-1}_\rho v \in S ( D^{-1}_\rho \rho _0, \nu )$| and |$\Vert D_{\beta } u\Vert \le \mu _D \Vert u\Vert$| for |$D^{-1}_\beta u \in S (D^{-1}_\beta \beta _0, \nu )$|, where |$\nu = 4 B \tilde{\ell }/\sqrt{n}$|.
(b) Given a random subset A of |$\lbrace 1,\ldots ,n\rbrace$| of size |$m \ge n - \lfloor n/K \rfloor$|, the dictionary b obeys |$\Vert \mathbb {G}_{A} b b^{\prime }\Vert _\infty \le \tilde{\ell }$| with probability at least |$1-\epsilon$|.
(c) The penalty levels |$\lambda _{\beta }$| and |$\lambda _{\rho }$| are chosen such that with probability at least |$1- \epsilon$|, |$\Vert D_{\beta }^{-1} (\mathbb {G}_{A} b b^{\prime } \beta _0 - \mathbb {G}_{A} Y b(X))\Vert _\infty /\sqrt{m} \le \lambda _\beta$| and |$\Vert D_{\rho }^{-1} (\mathbb {G}_{A} b b^{\prime } \rho _0 - \mathbb {G}_{A} m(W, b))\Vert _\infty /\sqrt{m} \le \lambda _\rho$|, and are not overly large: |$\lambda _{\beta } \vee \lambda _{\rho } \le \tilde{\ell }/\sqrt{m}$|.
SC(a) records a restriction on the |$\ell _1$| norm of |$\beta _0$| and |$\rho _0$|. For instance, in Examples 2.1, 2.2, and 2.3, |$D_\rho \asymp \sigma I \asymp L I$|, which requires the |$\ell _1$|-norm of |$\rho _0$| to increase at most at the speed |$L\asymp \sigma$|.
In other words, we have instantiated |$(r_1,r_2,r_3)$| for approximately sparse models in the guarantee set |$\mathsf {S}$|. We have the following corollary, which verifies |$R(\delta )$| for approximately sparse models and hence provides sufficient conditions for Theorem 4.1.
5. ESTIMATION AND INFERENCE USING GENERAL REGRESSION LEARNERS
In this section we generalize the previous analysis to allow any regression learner |$\hat{\gamma }$| of |$E[Y|X]$| to be used in the construction of the estimator. As in preceding sections, we continue to include local functionals in our analysis, so that the results apply to nonregular objects as well as regular ones that can be estimated |$\sqrt{n}$|-consistently.
Compared to the global case, the local case may have smaller regularization and model selection biases relative to the variance. Nonetheless, bias correction is important for inference in theory and in practice. Theoretically, the local case begins to resemble the global case as the number of dimensions being integrated increases. Empirically, we provide local estimates without bias correction in Section S5. The differences can be substantial.
The only conditions we will impose on the regression learner are certain |$L^{2}$| convergence properties that we will specify in this section. These properties will allow for a wide variety of learners, including GDS, Lasso, neural nets, boosting, and others. Thus we provide estimators of local functions that can be constructed using many regression learners.
This result shows that asymptotic linearity of the estimator |$\hat{\theta }$| follows if |$r_{2}^{\star }\rightarrow 0$| fast enough relative to |$r_{1}^{\star }$|. Asymptotic linearity implies asymptotic Gaussian inference by standard central limit theorem arguments. As in regular doubly robust estimation problems, it allows a tradeoff between the speed of convergence |$r_{2}^{\star }$| of the RR and |$r_{1}^{\star }$| of the regression. It requires only a mean square convergence rate for the regression learner |$\hat{\gamma }_{k}$| and so allows for a wide variety of first step machine learning estimators.
We could also formulate a nonasymptotic analogue of this asymptotic result, which would depend on the availability of nonasymptotic results for the learner |$\hat{\gamma }_{k}$|. To the best of our knowledge, such results are not available for many learners, such as neural nets and random forests. To allow the results of this section to include as many first steps as possible, we focus here on the asymptotic result and reserve the nonasymptotic result for future work.
ACKNOWLEDGEMENTS
The National Science Foundation provided partial financial support via grants 1559172 and 1757140. Rahul Singh thanks the Jerry Hausman Dissertation Fellowship.
REFERENCES
Supporting Information
Additional Supporting Information may be found in the online version of this article at the publisher’s website:
Online Appendix
Replication Package
Notes
Managing editor Jaap Abbring handled this manuscript.